Python: expression to remove URL links from Twitter tweets

Question

Asked by hagope

I would simply like to find and replace all occurrences of a Twitter URL in a string (a tweet):

Input:

This is a tweet with a url: http://t.co/0DlGChTBIx

Output:

This is a tweet with a url:

I've tried this:

p=re.compile(r'\<http.+?\>', re.DOTALL)
tweet_clean = re.sub(p, '', tweet)

Answer 1

Accepted answer by zx81

Do this:

result = re.sub(r"http\S+", "", subject)

http matches the literal characters "http"
\S+ matches one or more non-whitespace characters (the rest of the URL)
The match is replaced with the empty string
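
As a minimal sketch of the three points above, the substitution can be wrapped in a small helper. The name remove_urls and the final strip() call are assumptions added for illustration; they are not part of the original answer.

```python
import re

def remove_urls(tweet):
    """Remove every run of non-whitespace text that starts with 'http'."""
    without_urls = re.sub(r"http\S+", "", tweet)
    # Removing a URL can leave a stray space at the end, so trim both ends.
    # (Assumption: trimming is wanted; the original answer does not do this.)
    return without_urls.strip()

print(remove_urls("This is a tweet with a url: http://t.co/0DlGChTBIx"))
```

Because \S+ stops at the first whitespace character, any text after the URL is left in place.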

Answer 2

Answer by alfasin

The following regex captures two groups: the first contains everything in the tweet up to the URL, and the second contains everything that comes after the URL (empty in the example you posted above):

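import re

tweet = 'This is a tweet with a url: http://t.co/0DlGChTBIx'
# Group 1 is the text before the URL, \S* consumes the URL itself,
# and the greedy group 2 is whatever follows it.
clean_tweet = re.match(r'(.*?)http\S*\s?(.*)', tweet)
if clean_tweet:
    print(clean_tweet.group(1))
    print(clean_tweet.group(2))  # prints everything after the URL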
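
To see both groups filled in, here is a sketch that runs a two-group pattern on a tweet with text after the URL. The helper name split_around_url, the sample tweet, and the exact pattern (\S* for the URL, a greedy final group, re.DOTALL) are assumptions about the intended behavior rather than the answer's own code.

```python
import re

def split_around_url(tweet):
    """Return (text before the first URL, text after it)."""
    # Hypothetical pattern: lazy prefix, the URL as non-whitespace, then the rest.
    match = re.match(r"(.*?)http\S*\s?(.*)", tweet, re.DOTALL)
    if match is None:
        # No URL found: treat the whole tweet as the "before" part.
        return tweet, ""
    return match.group(1), match.group(2)

before, after = split_around_url("Read this: http://t.co/0DlGChTBIx via @someone")
print(repr(before))
print(repr(after))
```

Joining the two parts back together gives the tweet without its first URL; a tweet with several URLs would need the split repeated.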

Answer 3

Answer by Avinash Raj

You could try the re.sub call below to remove the URL from your string:

>>> str = 'This is a tweet with a url: http://t.co/0DlGChTBIx'
>>> m = re.sub(r':.*$', ":", str)
>>> m
'This is a tweet with a url:'

It removes everything after the first : symbol, and the : in the replacement string puts the : back at the end. Note that this relies on the URL coming right after the first colon in the tweet.

This returns all the characters up to and including the first : symbol:

>>> m = re.search(r'^.*?:', str).group()
>>> m
'This is a tweet with a url:'
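
As a rough illustration of how the colon-based substitution behaves on different inputs, the sketch below applies it to the question's tweet and to a second, made-up tweet. The helper name cut_after_first_colon and the second sample string are assumptions for demonstration only.

```python
import re

def cut_after_first_colon(text):
    """Drop everything after the first ':' and keep the colon itself."""
    return re.sub(r":.*$", ":", text)

# The question's tweet: the URL follows the first colon, so only the URL goes.
print(cut_after_first_colon("This is a tweet with a url: http://t.co/0DlGChTBIx"))

# A made-up tweet with an earlier colon: the ordinary text after it goes too.
print(cut_after_first_colon("Note: read this http://t.co/0DlGChTBIx"))
```

A string without any colon is returned unchanged, URL included.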

Answer 4

Answer by Garima Rawat

Try using this:

text = re.sub(r"http\S+", "", text)

Answer 5

Answer by nancy agarwal

import re

# content holds the tweet text. Strip the first URL on each pass until none is left.
# (?:\s|$) also matches a URL at the very end of the text; \s alone required trailing whitespace.
pattern = r'(.*?)http(.*?)(?:\s|$)(.*)'
clean_tweet = re.match(pattern, content)
while clean_tweet:
    content = clean_tweet.group(1) + " " + clean_tweet.group(3)
    clean_tweet = re.match(pattern, content)

Answer 6

Answer by self.Fool

text = re.sub(r"https:(\/\/t\.co\/([A-Za-z0-9]|[A-Za-z]){10})", "", text)

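This also matches the 10 alphanumeric characters that follow t.co/. Note that it only matches https links, so the http URL in the question would not be removed.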
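
A looser variant that accepts both http and https might look like the sketch below. The names TCO_LINK and remove_tco_links are hypothetical, and the assumption that the path after t.co/ is a variable-length run of ASCII letters and digits is not confirmed by the source.

```python
import re

# Assumption: a t.co link is http or https, then one or more ASCII letters/digits.
TCO_LINK = re.compile(r"https?://t\.co/[A-Za-z0-9]+")

def remove_tco_links(text):
    """Remove t.co short links and leave any other URL untouched."""
    return TCO_LINK.sub("", text)

print(repr(remove_tco_links("This is a tweet with a url: http://t.co/0DlGChTBIx")))
```

Unlike a generic http\S+ pattern, this leaves links to other hosts in the text, which may or may not be what you want.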
