Removing URLs from text using Java
Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/12366496/
Asked by NLP JAVA
How can I remove the URLs present in a text, for example
String str="Fear psychosis after #AssamRiots - http://www.google.com/LdEbWTgD http://www.yahoo.com/mksVZKBz";
using a regular expression?
I want to remove all the URLs in the text, but my code is not working:
String pattern = "(http(.*?)\\s)";
Pattern pt = Pattern.compile(pattern);
Matcher namemacher = pt.matcher(input);
if (namemacher.find()) {
str=input.replace(namemacher.group(0), "");
}
Answered by NLP JAVA
Pass in the String that contains the URL:
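// Requires java.util.regex.Matcher and java.util.regex.Pattern.
private String removeUrl(String commentstr)
{
    String urlPattern = "((https?|ftp|gopher|telnet|file|Unsure|http):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
    Pattern p = Pattern.compile(urlPattern, Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(commentstr);
    int i = 0;
    while (m.find()) {
        // Caveat: group(i) selects capture group i of the current match, not the i-th URL,
        // and replaceAll interprets that text as a regular expression.
        commentstr = commentstr.replaceAll(m.group(i), "").trim();
        i++;
    }
    return commentstr;
}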
Answered by svz
Well, you haven't provided any information about your text, so assuming it looks like "Some text here http://www.example.com some text there", you can do this:
String yourText = "blah-blah";
String cleartext = yourText.replaceAll("http.*?\\s", " ");
This removes every sequence that starts with "http" and runs up to and including the first whitespace character, replacing it with a single space. A URL at the very end of the text, with no whitespace after it, is not matched.
You should read the Javadoc for the String class. It will make things clear for you.
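To make the behaviour of that one-liner concrete, here is a minimal sketch that runs it on the string from the question next to a variant that matches a run of non-whitespace characters instead. The class name, the method names, and the `https?://\S+` pattern are assumptions made for illustration; they are not part of the answer above.

```java
class TrailingUrlDemo {
    // Lazy match from "http" up to and including the next whitespace character.
    static String removeUpToWhitespace(String text) {
        return text.replaceAll("http.*?\\s", " ");
    }

    // Assumed variant: "http" or "https", then "://", then non-whitespace characters.
    // It does not need whitespace after the URL, so a URL at the end is also matched.
    static String removeNonWhitespaceRun(String text) {
        return text.replaceAll("https?://\\S+", "").trim();
    }

    public static void main(String[] args) {
        String str = "Fear psychosis after #AssamRiots - "
                + "http://www.google.com/LdEbWTgD http://www.yahoo.com/mksVZKBz";
        // The last URL has no whitespace after it, so the first method leaves it in place.
        System.out.println(removeUpToWhitespace(str));
        System.out.println(removeNonWhitespaceRun(str));
    }
}
```

Which variant fits depends on the text: the second one assumes every URL of interest starts with http:// or https://.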
Answered by Philipp
How do you define a URL? You might want to filter not just http:// but also https:// and other protocols such as ftp://, rss:// or custom protocols.
Maybe this regular expression would do the job:
[\S]+://[\S]+
Explanation:
- one or more non-whitespace characters
- followed by the string "://"
- followed by one or more non-whitespace characters
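The pattern described above can be tried from Java roughly as follows. This is only a sketch: the class and method names are hypothetical, and collapsing the leftover whitespace is an extra step added here, not something the answer prescribes.

```java
class SchemeAgnosticDemo {
    // Non-whitespace, then "://", then non-whitespace, written as a Java string literal.
    static String stripAnyScheme(String text) {
        return text.replaceAll("\\S+://\\S+", "")
                   .replaceAll("\\s+", " ")   // collapse the gaps left where URLs were removed
                   .trim();
    }

    public static void main(String[] args) {
        System.out.println(stripAnyScheme("see ftp://host/file and myapp://open/42 now"));
    }
}
```

Note that the leading `\S+` takes every non-whitespace character before "://", so punctuation attached to the URL, such as an opening parenthesis, is removed together with it.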
Answered by John81
Note that if your URL contains characters like & and \, the answers above will not work because replaceAll can't handle those characters (it treats its first argument as a regular expression). What worked for me was to remove those characters in a new string variable, then remove the same characters from the results of m.find(), and use replaceAll on my new string variable.
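// Requires java.util.regex.Matcher and java.util.regex.Pattern.
private String removeUrl(String commentstr)
{
    // get rid of ? and & in urls since replaceAll can't deal with them
    String commentstr1 = commentstr.replaceAll("\\?", "").replaceAll("\\&", "");
    String urlPattern = "((https?|ftp|gopher|telnet|file|Unsure|http):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
    Pattern p = Pattern.compile(urlPattern, Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(commentstr);
    while (m.find()) {
        // group(0) is the whole matched URL; strip the same characters from it, then
        // remove it from the working copy so that earlier removals are kept.
        // Other regex metacharacters in the URL, such as ( or ), can still cause problems.
        commentstr1 = commentstr1.replaceAll(m.group(0).replaceAll("\\?", "").replaceAll("\\&", ""), "").trim();
    }
    return commentstr1;
}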
Answered by Mir Saman
As @Ev0oD mentioned, the code works perfectly except for the following tweet I'm working on:
RT @_Val83_: The cast of #ThorRagnarok playing "Ragnarok Paper Scissors" #TomHiddleston #MarkRuffalo (https://t.co /k9nYBu3QHu)
At the line where the token is removed:
commentstr = commentstr.replaceAll(m.group(i),"").trim();
I got the following error:
java.util.regex.PatternSyntaxException: Unmatched closing ')' near index 22
where m.group(i) is https://t.co /k9nYBu3QHu)
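The exception comes from replaceAll compiling the matched text as a regular expression. One possible way around it, sketched below, is to quote the match with Pattern.quote so that it is treated as literal text (String.replace, which takes literal text, would serve the same purpose). The class name, the helper names, and the simplified `https?://\S+` pattern are assumptions for this sketch, and the sample URL is written without the space that appears in the quoted tweet.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class LiteralRemovalDemo {
    // Simplified URL pattern, assumed for this sketch only.
    private static final Pattern URL =
            Pattern.compile("https?://\\S+", Pattern.CASE_INSENSITIVE);

    // Returns true if using the token directly as a regex fails to compile.
    static boolean failsAsRegex(String token) {
        try {
            "".replaceAll(token, "");
            return false;
        } catch (PatternSyntaxException e) {
            return true;
        }
    }

    static String removeUrls(String text) {
        Matcher m = URL.matcher(text);
        String result = text;
        while (m.find()) {
            // Pattern.quote turns the matched text into a literal pattern,
            // so characters such as ')' or '?' lose their regex meaning.
            result = result.replaceAll(Pattern.quote(m.group()), "");
        }
        return result.trim();
    }

    public static void main(String[] args) {
        System.out.println(failsAsRegex("https://t.co/k9nYBu3QHu)"));
        System.out.println(removeUrls("#MarkRuffalo (https://t.co/k9nYBu3QHu)"));
    }
}
```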
Answered by tick_tack_techie
m.group(0) should be replaced with an empty string, rather than m.group(i) where i is incremented with every call to m.find(), as mentioned in one of the answers above.
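// Requires java.util.regex.Matcher and java.util.regex.Pattern.
private String removeUrl(String commentstr)
{
    String urlPattern = "((https?|ftp|gopher|telnet|file|Unsure|http):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
    Pattern p = Pattern.compile(urlPattern, Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(commentstr);
    StringBuffer sb = new StringBuffer(commentstr.length());
    while (m.find()) {
        // copy the text before this match and drop the match itself
        m.appendReplacement(sb, "");
    }
    // copy whatever follows the last match
    m.appendTail(sb);
    return sb.toString();
}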
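When the replacement is always the empty string, the same effect can probably be had with a single call to Matcher.replaceAll, which replaces every match in one pass. The sketch below assumes a simplified `https?://\S+` pattern and hypothetical class and method names rather than the longer pattern from the answer.

```java
import java.util.regex.Pattern;

class ReplaceAllDemo {
    // Simplified URL pattern, assumed for this sketch only; compiled once and reused.
    private static final Pattern URL =
            Pattern.compile("https?://\\S+", Pattern.CASE_INSENSITIVE);

    static String removeUrl(String commentstr) {
        // Replace every match with the empty string, then trim the ends.
        return URL.matcher(commentstr).replaceAll("").trim();
    }

    public static void main(String[] args) {
        System.out.println(removeUrl("Some text here http://www.example.com some text there"));
    }
}
```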
Answered by Shubham Sharma
If you can switch to Python, you can find a much better solution using this code:
import re
text = "<hello how are you ?> then ftp and mailto and gopher and file ftp://ideone.com/K3Cut rthen you "
result = re.sub(r"ftp\S+", "", text)
print(result)