Python: Expression to remove URL links from a Twitter tweet
Original source: http://stackoverflow.com/questions/24399820/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Expression to remove URL links from Twitter tweet
Asked by hagope
I simply would like to find and replace all occurrences of a twitter url in a string (tweet):
Input:
This is a tweet with a url: http://t.co/0DlGChTBIx
Output:
This is a tweet with a url:
I've tried this:
p=re.compile(r'\<http.+?\>', re.DOTALL)
tweet_clean = re.sub(p, '', tweet)
Accepted answer by zx81
Do this:
result = re.sub(r"http\S+", "", subject)
- http matches the literal characters
- \S+ matches all non-whitespace characters (up to the end of the URL)
- we replace the match with the empty string
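As a minimal sketch of the substitution described above, the snippet below wraps the same pattern in a function and runs it on the tweet from the question. The function name strip_urls is hypothetical and used only for illustration.

```python
import re

def strip_urls(text):
    """Remove every run of non-whitespace characters that starts with 'http'."""
    # "http" matches literally; \S+ consumes the rest of the URL
    # up to the next whitespace character (or the end of the string).
    return re.sub(r"http\S+", "", text)

tweet = "This is a tweet with a url: http://t.co/0DlGChTBIx"
cleaned = strip_urls(tweet)
print(repr(cleaned))           # 'This is a tweet with a url: '
print(repr(cleaned.rstrip()))  # 'This is a tweet with a url:'
```

Note that the space before the URL is not part of the match, so it stays in the result; calling rstrip() afterwards is one way to drop it if the exact output from the question is wanted.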
Answer by alfasin
The following regex will capture two matched groups: the first includes everything in the tweet until the url and the second will catch everything that will come after the URL (empty in the example you posted above):
import re

tweet = 'This is a tweet with a url: http://t.co/0DlGChTBIx'
# group(1): text before the URL; group(2): text after it (greedy, so it captures the rest of the line)
clean_tweet = re.match(r'(.*?)http\S*\s?(.*)', tweet)
if clean_tweet:
    print(clean_tweet.group(1))
    print(clean_tweet.group(2))  # will print everything after the URL
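To illustrate the two-group idea on a tweet that has text after the URL, here is a small sketch. The helper name split_around_url, the sample tweet, and the use of \S* with a greedy second group are assumptions made for this example; they are not taken from the answer itself.

```python
import re

# Assumed variant of the pattern: \S* consumes the URL and the greedy (.*)
# captures whatever follows it on the same line.
URL_SPLIT = re.compile(r"(.*?)http\S*\s?(.*)")

def split_around_url(text):
    """Return (text before the first URL, text after it)."""
    match = URL_SPLIT.match(text)
    if match is None:
        # No URL found: everything counts as "before".
        return text, ""
    return match.group(1), match.group(2)

before, after = split_around_url("Check this http://t.co/0DlGChTBIx out now")
print(repr(before))  # 'Check this '
print(repr(after))   # 'out now'
```

Because re.match only handles the first URL, a tweet with several links would need the function applied repeatedly, or a re.sub call instead.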
Answer by Avinash Raj
You could try the re.sub call below to remove the URL link from your string:
>>> str = 'This is a tweet with a url: http://t.co/0DlGChTBIx'
>>> m = re.sub(r':.*$', ":", str)
>>> m
'This is a tweet with a url:'
It removes everything after the first : symbol, and the : in the replacement string puts the : back at the end.
This would print all the characters up to and including the first : symbol:
>>> m = re.search(r'^.*?:', str).group()
>>> m
'This is a tweet with a url:'
Answer by Garima Rawat
Try using this:
text = re.sub(r"http\S+", "", text)
Answer by nancy agarwal
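import re

# (?:\s|$) also lets the pattern match a URL that sits at the very end of the text
url_pattern = re.compile(r'(.*?)http(.*?)(?:\s|$)(.*)')
clean_tweet = url_pattern.match(content)
while clean_tweet:
    # drop one URL per iteration, keeping the text on either side of it
    content = clean_tweet.group(1) + " " + clean_tweet.group(3)
    clean_tweet = url_pattern.match(content)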
Answer by self.Fool
text = re.sub(r"https?://t\.co/[A-Za-z0-9]{10}", "", text)
This also matches the alphanumeric characters that follow t.co/
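A pattern restricted to t.co leaves other links in the tweet untouched, which may or may not be what you want. The sketch below shows that behavior. It assumes that the short code is exactly 10 alphanumeric characters, as in this answer, and it uses https? so that the http:// link from the question is covered as well. The name strip_tco_links and the example.com URL are hypothetical.

```python
import re

# Assumption: a t.co short code is exactly 10 alphanumeric characters.
TCO_LINK = re.compile(r"https?://t\.co/[A-Za-z0-9]{10}")

def strip_tco_links(text):
    """Remove only t.co short links; any other URL is kept."""
    return TCO_LINK.sub("", text)

sample = "Read https://example.com/post via http://t.co/0DlGChTBIx"
print(repr(strip_tco_links(sample)))  # 'Read https://example.com/post via '
```

If the length of the short code ever differs, the fixed {10} quantifier will miss those links, so the more general http\S+ pattern from the accepted answer is the safer choice when every URL should go.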