Java: removing URLs from text using Java

Question

Asked by NLP JAVA

How can I remove the URLs present in a text, for example

String str="Fear psychosis after #AssamRiots - http://www.google.com/LdEbWTgD http://www.yahoo.com/mksVZKBz";

using a regular expression?

I want to remove all the URLs in the text, but it's not working. My code is:

String pattern = "(http(.*?)\\s)";
Pattern pt = Pattern.compile(pattern);
Matcher namemacher = pt.matcher(input);
if (namemacher.find()) {
  str=input.replace(namemacher.group(0), "");
}

Answer 1

Answered by NLP JAVA

Pass in the String that contains the URL:

private String removeUrl(String commentstr)
    {
        String urlPattern = "((https?|ftp|gopher|telnet|file|Unsure|http):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
        Pattern p = Pattern.compile(urlPattern,Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(commentstr);
        int i = 0;
        while (m.find()) {
            commentstr = commentstr.replaceAll(m.group(i),"").trim();
            i++;
        }
        return commentstr;
    }

Answer 2

Answered by svz

Well, you haven't provided any information about your text, so assuming it looks like this: "Some text here http://www.example.com some text there", you can do this:

String yourText = "blah-blah";
String cleartext = yourText.replaceAll("http.*?\\s", " ");

This removes every sequence that starts with "http" and runs up to and including the first whitespace character, replacing it with a single space.
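
One consequence worth checking: because the pattern requires a whitespace character after the URL, a URL at the very end of the text is left in place. The sketch below compares the pattern above with an assumed alternative, "https?://\\S+", which is not from this answer; the class and method names are made up for illustration.

```java
final class TrailingUrlDemo {
    // Pattern from the answer above: needs a whitespace character after the URL.
    static String stripWithTrailingSpace(String text) {
        return text.replaceAll("http.*?\\s", " ");
    }

    // Assumed alternative: "http://" or "https://" plus the run of
    // non-whitespace characters that follows, so no trailing space is needed.
    static String stripNonWhitespaceRun(String text) {
        return text.replaceAll("https?://\\S+", "").trim();
    }

    public static void main(String[] args) {
        String text = "Some text here http://www.example.com some text there http://www.example.org";
        System.out.println(stripWithTrailingSpace(text));
        System.out.println(stripNonWhitespaceRun(text));
    }
}
```

With this input the first method keeps the final URL, while the second removes both.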

You should read the Javadoc for the String class. It will make things clearer for you.

Answer 3

Answered by Philipp

How do you define a URL? You might not just want to filter http:// but also https:// and other protocols like ftp://, rss:// or custom protocols.

Maybe this regular expression would do the job:

[\S]+://[\S]+

Explanation:

one or more non-whitespaces
followed by the string "://"
followed by one or more non-whitespaces
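
Applied from Java, the expression above could be used roughly like this. It is only a sketch: the class and method names are invented here, and the backslashes are doubled because the regex sits inside a Java string literal.

```java
import java.util.regex.Pattern;

final class SchemeAgnosticStripper {
    // one or more non-whitespace characters, then "://", then one or more non-whitespace characters
    private static final Pattern URL = Pattern.compile("\\S+://\\S+");

    static String removeUrls(String text) {
        // Matcher.replaceAll removes every match in a single pass
        return URL.matcher(text).replaceAll("").trim();
    }

    public static void main(String[] args) {
        String str = "Fear psychosis after #AssamRiots - http://www.google.com/LdEbWTgD http://www.yahoo.com/mksVZKBz";
        System.out.println(removeUrls(str));
    }
}
```

Keep in mind that the leading non-whitespace run also swallows anything glued to the front of the URL, so in "(https://example.com)" both parentheses are removed along with the address.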

Answer 4

Answered by John81

Note that if your URL contains characters like & and \, the answers above will not work, because replaceAll treats its first argument as a regular expression and can't handle those characters literally. What worked for me was to strip those characters into a new string variable, strip them from each result of m.find() as well, and then use replaceAll on the new string variable.

private String removeUrl(String commentstr)
{
    // get rid of ? and & in URLs since replaceAll can't deal with them
    String commentstr1 = commentstr.replaceAll("\\?", "").replaceAll("\\&", "");

    String urlPattern = "((https?|ftp|gopher|telnet|file|Unsure|http):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
    Pattern p = Pattern.compile(urlPattern,Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(commentstr);
    while (m.find()) {
        // strip the same characters from each match, then remove it from the cleaned copy
        commentstr1 = commentstr1.replaceAll(m.group(0).replaceAll("\\?", "").replaceAll("\\&", ""),"").trim();
    }
    return commentstr1;
}
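
If stripping characters from the text is not acceptable, another option is to remove each match as literal text. The sketch below relies on String.replace(CharSequence, CharSequence), which does not interpret its arguments as a regular expression. The simplified pattern "https?://\\S+" and the class name are assumptions made for this example, not part of the answer above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LiteralUrlRemoval {
    // Simplified URL pattern, assumed for this sketch only.
    private static final Pattern URL = Pattern.compile("https?://\\S+", Pattern.CASE_INSENSITIVE);

    static String removeUrl(String text) {
        Matcher m = URL.matcher(text);
        String result = text;
        while (m.find()) {
            // String.replace treats the match as plain text, so ?, & and \ are safe
            result = result.replace(m.group(), "");
        }
        return result.trim();
    }

    public static void main(String[] args) {
        System.out.println(removeUrl("query http://example.com/search?q=a&b=c done"));
    }
}
```

This leaves any ? or & outside the URLs untouched.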

Answer 5

Answered by Mir Saman

As @Ev0oD mentioned, the code works perfectly except for the following tweet I'm working on: RT @_Val83_: The cast of #ThorRagnarok playing "Ragnarok Paper Scissors" #TomHiddleston #MarkRuffalo (https://t.co /k9nYBu3QHu)

On the line where the token is removed: commentstr = commentstr.replaceAll(m.group(i),"").trim();

I got the following error:

java.util.regex.PatternSyntaxException: Unmatched closing ')' near index 22

where m.group(i) is https://t.co /k9nYBu3QHu)
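
The error comes from replaceAll compiling its first argument as a regular expression, so a matched token that ends in an unpaired ")" is not a valid pattern. Here is a minimal sketch of the failure and one way around it, Pattern.quote. The token "https://example.com/abc)" and the class name are hypothetical stand-ins, not the exact value from the tweet.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class UnbalancedParenDemo {
    // Returns true when the token cannot be compiled as a regex by replaceAll.
    static boolean failsAsRegex(String token) {
        try {
            "text".replaceAll(token, "");
            return false;
        } catch (PatternSyntaxException e) {
            return true;
        }
    }

    // Pattern.quote turns the token into a literal pattern, so ")" is matched as plain text.
    static String removeLiteral(String text, String token) {
        return text.replaceAll(Pattern.quote(token), "");
    }

    public static void main(String[] args) {
        String token = "https://example.com/abc)"; // hypothetical matched token
        System.out.println(failsAsRegex(token));
        System.out.println(removeLiteral("(" + token, token));
    }
}
```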

Answer 6

Answered by tick_tack_techie

m.group(0) should be replaced with an empty string, rather than m.group(i) where i is incremented with every call to m.find(), as mentioned in one of the answers above.

private String removeUrl(String commentstr)
{
    String urlPattern = "((https?|ftp|gopher|telnet|file|Unsure|http):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
    Pattern p = Pattern.compile(urlPattern,Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(commentstr);
    StringBuffer sb = new StringBuffer(commentstr.length());
    while (m.find()) {
        // copy the text before each match and drop the matched URL
        m.appendReplacement(sb, "");
    }
    // copy whatever follows the last match
    m.appendTail(sb);
    return sb.toString();
}

Answer 7

Answered by Shubham Sharma

If you can switch to Python, you can find a much simpler solution using this code:

import re
text = "<hello how are you ?> then ftp and mailto and gopher and file ftp://ideone.com/K3Cut rthen you "
result = re.sub(r"ftp\S+", "", text)
print(result)
