Python: expression to remove URL links from Twitter tweets

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/24399820/

Date: 2020-08-19 04:32:52  Source: igfitidea

Expression to remove URL links from Twitter tweet

python regex string

Asked by hagope

I simply would like to find and replace all occurrences of a twitter url in a string (tweet):

Input:

This is a tweet with a url: http://t.co/0DlGChTBIx

Output:

This is a tweet with a url:

I've tried this:

p=re.compile(r'\<http.+?\>', re.DOTALL)
tweet_clean = re.sub(p, '', tweet)
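
A small sketch of why this attempt leaves the tweet untouched, assuming Python's re module: there \< and \> are just escaped literal angle brackets, so the pattern only matches URLs wrapped in < and >. The sample strings below are illustrative.

```python
import re

# The pattern from the question: \< and \> match literal '<' and '>'.
p = re.compile(r'\<http.+?\>', re.DOTALL)

tweet = 'This is a tweet with a url: http://t.co/0DlGChTBIx'
unchanged = p.sub('', tweet)  # no '<' in the tweet, so nothing is replaced

# A URL wrapped in angle brackets (made-up sample) is matched and removed.
bracketed = p.sub('', 'see <http://t.co/0DlGChTBIx> now')

print(repr(unchanged))
print(repr(bracketed))
```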

Accepted answer by zx81

Do this:

result = re.sub(r"http\S+", "", subject)
  • http matches the literal characters "http"
  • \S+ matches one or more non-whitespace characters (the rest of the URL)
  • we replace the match with the empty string
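
As a minimal sketch of the points above, the substitution can be wrapped in a helper. The name remove_urls and the extra whitespace cleanup are assumptions added for illustration, not part of the original answer.

```python
import re

def remove_urls(tweet):
    """Strip every token that starts with 'http' and tidy leftover spaces."""
    without_urls = re.sub(r"http\S+", "", tweet)
    # Collapse the gaps left where URLs used to be and trim both ends.
    return re.sub(r"\s+", " ", without_urls).strip()

print(remove_urls("This is a tweet with a url: http://t.co/0DlGChTBIx"))
print(remove_urls("Two links https://t.co/abc and http://t.co/def here"))
```

Because the pattern starts at "http", it also covers https links. Without the cleanup step, the first example would keep the trailing space that preceded the URL.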

Answer by alfasin

The following regex will capture two matched groups: the first includes everything in the tweet until the url and the second will catch everything that will come after the URL (empty in the example you posted above):

import re

tweet = 'This is a tweet with a url: http://t.co/0DlGChTBIx'
# group(1): text before the URL; \S* consumes the URL itself; group(2): text after it
clean_tweet = re.match(r'(.*?)http\S*\s?(.*)', tweet)
if clean_tweet:
    print(clean_tweet.group(1))
    print(clean_tweet.group(2))  # everything after the URL (empty here)

Answer by Avinash Raj

You could try the re.sub call below to remove the URL from your string:

>>> str = 'This is a tweet with a url: http://t.co/0DlGChTBIx'
>>> m = re.sub(r':.*$', ":", str)
>>> m
'This is a tweet with a url:'

It removes everything after the first : symbol; the : in the replacement string puts the : back at the end.

This prints all the characters up to and including the first : symbol:

>>> m = re.search(r'^.*?:', str).group()
>>> m
'This is a tweet with a url:'
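
A short sketch of how this colon-based substitution behaves on other inputs. The helper name cut_after_first_colon and the extra sample strings are made up for illustration. Since it cuts at the first : rather than at the URL, any text after the URL, or after an earlier colon, is dropped too.

```python
import re

def cut_after_first_colon(text):
    # Same substitution as above: drop everything after the first ':'.
    return re.sub(r':.*$', ':', text)

print(cut_after_first_colon('This is a tweet with a url: http://t.co/0DlGChTBIx'))
print(cut_after_first_colon('A url: http://t.co/0DlGChTBIx and more text'))
print(cut_after_first_colon('Note: a url: http://t.co/0DlGChTBIx'))
```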

Answer by Garima Rawat

Try using this:

text = re.sub(r"http\S+", "", text)

Answer by nancy agarwal

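import re

# content is assumed to already hold the tweet text
clean_tweet = re.match(r'(.*?)http(.*?)\s(.*)', content)

# Each pass cuts out the first URL that is followed by whitespace
while clean_tweet:
    content = clean_tweet.group(1) + " " + clean_tweet.group(3)
    clean_tweet = re.match(r'(.*?)http(.*?)\s(.*)', content)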
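
To see what this loop does on concrete input, here is a hedged sketch that wraps it in a function. The name strip_urls_loop and the sample strings are illustrative assumptions. Because the pattern requires a whitespace character after the URL, a URL at the very end of the string is left in place.

```python
import re

URL_THEN_SPACE = re.compile(r'(.*?)http(.*?)\s(.*)')

def strip_urls_loop(content):
    # Same loop as above: cut out the first URL followed by whitespace,
    # then retry until the pattern no longer matches.
    match = URL_THEN_SPACE.match(content)
    while match:
        content = match.group(1) + " " + match.group(3)
        match = URL_THEN_SPACE.match(content)
    return content

print(repr(strip_urls_loop('a http://x.co/1 b')))  # URL in the middle is removed
print(repr(strip_urls_loop('a http://x.co/1')))    # trailing URL is not matched
```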

Answer by self.Fool

text = re.sub(r"https?://t\.co/[A-Za-z0-9]{10}", "", text)

This removes only t.co links: http or https, followed by exactly 10 alphanumeric characters after t.co/
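
A minimal sketch of a t.co-only pattern in use. It assumes, following the {10} quantifier in this answer, that the link ID is exactly 10 alphanumeric characters, and it accepts both http and https. The helper name and sample strings are hypothetical.

```python
import re

# Assumption: a t.co link is http or https plus exactly 10 alphanumerics,
# mirroring the {10} quantifier used in the answer.
TCO_LINK = re.compile(r"https?://t\.co/[A-Za-z0-9]{10}")

def remove_tco_links(text):
    return TCO_LINK.sub("", text)

print(repr(remove_tco_links("short link http://t.co/0DlGChTBIx gone")))
print(repr(remove_tco_links("other link http://example.com/page stays")))
```

Unlike http\S+, this leaves links to other domains untouched, which may or may not be what you want.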
