Java: How to detect the presence of URL in a string
Note: this page is derived from a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep it under the same license, cite the original address, and attribute it to the original authors (not me): Stack Overflow
Original: http://stackoverflow.com/questions/285619/
Asked by Rakesh N
I have an input String, say: Please go to http://stackoverflow.com. The URL part of the String is detected and an anchor <a href=""></a> is automatically added by many browsers/IDEs/applications. So it becomes: Please go to <a href='http://stackoverflow.com'>http://stackoverflow.com</a>.
I need to do the same using Java.
Answered by Jason Coco
You could do something like this (adjust the regex to suit your needs):
String originalString = "Please go to http://www.stackoverflow.com";
String newString = originalString.replaceAll("http://.+?(com|net|org)/{0,1}", "<a href=\"$0\">$0</a>");
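In the replacement string, `$0` stands for the entire text matched by the regex, which is how the matched URL ends up both in the `href` and in the link text. A minimal sketch of that idea, assuming a hypothetical class `LinkifyDemo` and helper `linkify` (neither is part of the original answer); the pattern is the one shown above:

```java
import java.util.regex.Pattern;

final class LinkifyDemo {
    // Pattern from the answer: lazily match up to the first "com", "net" or "org",
    // followed by an optional trailing slash.
    private static final Pattern URL_PATTERN =
            Pattern.compile("http://.+?(com|net|org)/{0,1}");

    // Wraps every match in an anchor; $0 is the whole matched text.
    static String linkify(String text) {
        return URL_PATTERN.matcher(text).replaceAll("<a href=\"$0\">$0</a>");
    }

    public static void main(String[] args) {
        System.out.println(linkify("Please go to http://www.stackoverflow.com"));
        // The pattern stops after the optional slash, so a path is left outside the link.
        System.out.println(linkify("see http://example.com/a"));
    }
}
```

Note that this pattern only covers `http://` and three top-level domains, and anything after the first slash stays outside the anchor, which is why the answer suggests adjusting the regex.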
Answered by PhiLho
Primitive:
String msg = "Please go to http://stackoverflow.com";
String withURL = msg.replaceAll("(?:https?|ftps?)://[\\w/%.-]+", "<a href='$0'>$0</a>");
System.out.println(withURL);
This needs refinement to match proper URLs, and particularly GET parameters (?foo=bar&x=25).
Answered by ykaganovich
You are asking two separate questions.
- What is the best way to identify URLs in Strings? See this thread
- How to code the above solution in Java? Other responses illustrating String.replaceAll usage have addressed this
Answered by Michael Burr
While it's not Java specific, Jeff Atwood recently posted an article about the pitfalls you might run into when trying to locate and match URLs in arbitrary text.
It gives a good regex, along with the snippet of code you need to handle parens properly (more or less).
The regex:
\(?\bhttp://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]
The paren cleanup:
if (s.StartsWith("(") && s.EndsWith(")"))
{
    return s.Substring(1, s.Length - 2);
}
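Since the paren cleanup above is not Java, here is one possible Java port of the regex plus the cleanup step. It is a sketch only: the class `AtwoodUrlFinder` and method `findUrls` are hypothetical names, and the one real difference from the snippet above is that Java's `substring` takes begin and end indices rather than a start and a length.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AtwoodUrlFinder {
    // The regex from the answer, with backslashes doubled for a Java string literal.
    private static final Pattern URL_PATTERN = Pattern.compile(
            "\\(?\\bhttp://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]");

    static List<String> findUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL_PATTERN.matcher(text);
        while (m.find()) {
            String s = m.group();
            // Paren cleanup: drop one pair of parentheses wrapping the whole match.
            if (s.startsWith("(") && s.endsWith(")")) {
                s = s.substring(1, s.length() - 1);
            }
            urls.add(s);
        }
        return urls;
    }

    public static void main(String[] args) {
        System.out.println(findUrls("Visit (http://example.com) now."));
        System.out.println(findUrls("See http://en.wikipedia.org/wiki/Link_(The_Legend_of_Zelda)"));
    }
}
```

A URL wrapped in parentheses loses the wrapper, while a URL that merely ends in a right parenthesis keeps it, because the match does not start with a left parenthesis.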
Answered by OscarRyz
Use java.net.URL for that!!
Hey, why not use the core Java class "java.net.URL" for this and let it validate the URL?
While the following code violates the golden principle "Use exceptions for exceptional conditions only", it does not make sense to me to try to reinvent the wheel for something that is very mature on the Java platform.
Here's the code:
import java.net.URL;
import java.net.MalformedURLException;

// Replaces URLs with HTML anchor tags
public class URLInString {
    public static void main(String[] args) {
        String s = args[0];
        // Separate input by whitespace (URLs don't contain spaces)
        String[] parts = s.split("\\s+");
        // Attempt to convert each item into a URL.
        for (String item : parts) try {
            URL url = new URL(item);
            // If possible, replace it with an anchor...
            System.out.print("<a href=\"" + url + "\">" + url + "</a> ");
        } catch (MalformedURLException e) {
            // Not a URL: print the item unchanged
            System.out.print(item + " ");
        }
        System.out.println();
    }
}
Using the following input:
"Please go to http://stackoverflow.com and then mailto:[email protected] to download a file from ftp://user:pass@someserver/someFile.txt"
Produces the following output:
Please go to <a href="http://stackoverflow.com">http://stackoverflow.com</a> and then <a href="mailto:[email protected]">mailto:[email protected]</a> to download a file from <a href="ftp://user:pass@someserver/someFile.txt">ftp://user:pass@someserver/someFile.txt</a>
Of course, different protocols could be handled in different ways. You can get all the info with the getters of the URL class, for instance:
url.getProtocol();
Or the rest of the attributes: spec, port, file, query, ref, etc.
http://java.sun.com/javase/6/docs/api/java/net/URL.html
This handles all the protocols (at least all of those the Java platform is aware of) and, as an extra benefit, if a URL scheme that Java currently does not recognize eventually gets incorporated into the URL class (by a library update), you'll get it transparently!
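To illustrate the getters mentioned in this answer, here is a small sketch that splits a URL into its parts. The class `UrlPartsDemo`, the helper `describe`, and the sample URL are assumptions made for the example; the getters themselves are standard `java.net.URL` methods. (On recent JDKs the `URL(String)` constructor is deprecated and produces a compiler warning, but it still works.)

```java
import java.net.MalformedURLException;
import java.net.URL;

final class UrlPartsDemo {
    // Returns "protocol | host | port | path | query | ref" for the given URL text.
    static String describe(String spec) {
        try {
            URL url = new URL(spec);
            return url.getProtocol() + " | " + url.getHost() + " | " + url.getPort()
                    + " | " + url.getPath() + " | " + url.getQuery() + " | " + url.getRef();
        } catch (MalformedURLException e) {
            // Same signal the answer relies on: the text is not a URL Java understands.
            throw new IllegalArgumentException("Not a URL: " + spec, e);
        }
    }

    public static void main(String[] args) {
        // prints: http | example.com | 8080 | /docs/index.html | foo=bar&x=25 | top
        System.out.println(describe("http://example.com:8080/docs/index.html?foo=bar&x=25#top"));
    }
}
```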
Answered by Sérgio Nunes
A good refinement to PhiLho's answer would be:
msg.replaceAll("(?:https?|ftps?)://[\\w/%.-][/\\??\\w=?\\w?/%.-]?[/\\?&\\w=?\\w?/%.-]*", "$0");
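To see what the refinement changes, the sketch below runs PhiLho's primitive regex and the refined one over the same text. The class `RefinedRegexDemo`, the sample sentence, and the use of the `<a href='$0'>$0</a>` replacement from PhiLho's answer (instead of the bare `$0` shown here) are choices made for this illustration.

```java
final class RefinedRegexDemo {
    // PhiLho's primitive pattern: stops at the first '?'.
    static final String PRIMITIVE = "(?:https?|ftps?)://[\\w/%.-]+";
    // Refined pattern from this answer: also accepts '?', '&' and '=' so GET parameters are kept.
    static final String REFINED =
            "(?:https?|ftps?)://[\\w/%.-][/\\??\\w=?\\w?/%.-]?[/\\?&\\w=?\\w?/%.-]*";

    static String linkify(String msg, String regex) {
        return msg.replaceAll(regex, "<a href='$0'>$0</a>");
    }

    public static void main(String[] args) {
        String msg = "Go to http://example.com/search?foo=bar&x=25 now";
        System.out.println(linkify(msg, PRIMITIVE)); // query string is left outside the link
        System.out.println(linkify(msg, REFINED));   // query string is part of the link
    }
}
```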
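Answered by Jacob Zwiers
The following code makes these modifications to the "Atwood Approach":
- Detects https in addition to http (adding other schemes is trivial)
- The CASE_INSENSITIVE flag is used since HtTpS:// is valid.
- Matching sets of parentheses are peeled off (they can be nested to any level). Further, any remaining unmatched left parentheses are stripped, but trailing right parentheses are left intact (to respect Wikipedia-style URLs)
- The URL is HTML-encoded in the link text.
- The target attribute is passed in via a method parameter. Other attributes can be added as desired.
- It does not use \b to identify a word break before matching a URL. URLs can begin with a left parenthesis or http[s]:// with no other requirement.
Notes:
- Apache Commons Lang's StringUtils is used in the code below
- The call to HtmlUtil.encode() below is a util which ultimately calls some Tomahawk code to HTML-encode the link text, but any similar utility will do.
- See the method comment for usage in JSF or other environments where output is HTML-encoded by default.
This was written in response to our client's requirements and we feel it represents a reasonable compromise between the allowable characters from the RFC and common usage. It is offered here in the hope that it will be useful to others.
Further expansion could be made which would allow any Unicode characters to be entered (i.e. not escaped with %XX (two-digit hex)) and hyperlinked, but that would require accepting all Unicode letters plus limited punctuation and then splitting on the "acceptable" delimiters (e.g. ., %, |, #, etc.), URL-encoding each part and then gluing it back together. For example, http://en.wikipedia.org/wiki/Björn_Andrésen (which the Stack Overflow generator does not detect) would be "http://en.wikipedia.org/wiki/Bj%C3%B6rn_Andr%C3%A9sen" in the href, but would contain Björn_Andrésen in the linked text on the page.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
// StringUtils comes from Apache Commons Lang; HtmlUtil is the author's own HTML-encoding utility.

// NOTES: 1) \w includes 0-9, a-z, A-Z, _
//        2) The leading '-' is the '-' character. It must go first in character class expression
private static final String VALID_CHARS = "-\\w+&@#/%=~()|";
private static final String VALID_NON_TERMINAL = "?!:,.;";
// Notes on the expression:
// 1) Any number of leading '(' (left parenthesis) accepted. Will be dealt with.
// 2) s? ==> the s is optional so either [http, https] accepted as scheme
// 3) All valid chars accepted and then one or more
// 4) Case insensitive so that the scheme can be hTtPs (for example) if desired
private static final Pattern URI_FINDER_PATTERN = Pattern.compile("\\(*https?://[" + VALID_CHARS + VALID_NON_TERMINAL + "]*[" + VALID_CHARS + "]", Pattern.CASE_INSENSITIVE);

/**
 * <p>
 * Finds all "URL"s in the given _rawText, wraps them in
 * HTML link tags and returns the result (with the rest of the text
 * html encoded).
 * </p>
 * <p>
 * We employ the procedure described at:
 * http://www.codinghorror.com/blog/2008/10/the-problem-with-urls.html
 * which is a <b>must-read</b>.
 * </p>
 * <p>
 * Basically, we allow any number of left parentheses (which will get stripped away)
 * followed by http:// or https://. Then any number of permitted URL characters
 * (based on http://www.ietf.org/rfc/rfc1738.txt) followed by a single character
 * of that set (basically, those minus typical punctuation). We remove all sets of
 * matching left & right parentheses which surround the URL.
 * </p>
 * <p>
 * This method *must* be called from a tag/component which will NOT
 * end up escaping the output. For example:
 * <pre>
 * <h:outputText ... escape="false" value="#{core:hyperlinkText(textThatMayHaveURLs, '_blank')}"/>
 * </pre>
 * </p>
 * <p>
 * Reason: we are adding <code><a href="..."></code> tags to the output *and*
 * encoding the rest of the string. So, encoding the output will result in
 * double-encoding data which was already encoded - and encoding the <code>a href</code>
 * (which will render it useless).
 * </p>
 *
 * @param _rawText - if <code>null</code>, returns <code>""</code> (empty string).
 * @param _target - if not <code>null</code> or <code>""</code>, adds a target attribute to the generated link, using _target as the attribute value.
 */
public static final String hyperlinkText(final String _rawText, final String _target) {
    String returnValue = null;
    if (!StringUtils.isBlank(_rawText)) {
        final Matcher matcher = URI_FINDER_PATTERN.matcher(_rawText);
        if (matcher.find()) {
            final int originalLength = _rawText.length();
            final String targetText = (StringUtils.isBlank(_target)) ? "" : " target=\"" + _target.trim() + "\"";
            final int targetLength = targetText.length();
            // Counted 15 characters aside from the target + 2 of the URL (max if the whole string is URL)
            // Rough guess, but should keep us from expanding the Builder too many times.
            final StringBuilder returnBuffer = new StringBuilder(originalLength * 2 + targetLength + 15);
            int currentStart;
            int currentEnd;
            int lastEnd = 0;
            String currentURL;
            do {
                currentStart = matcher.start();
                currentEnd = matcher.end();
                currentURL = matcher.group();
                // Adjust for URLs wrapped in ()'s ... move start/end markers
                // and substring the _rawText for new URL value.
                while (currentURL.startsWith("(") && currentURL.endsWith(")")) {
                    currentStart = currentStart + 1;
                    currentEnd = currentEnd - 1;
                    currentURL = _rawText.substring(currentStart, currentEnd);
                }
                while (currentURL.startsWith("(")) {
                    currentStart = currentStart + 1;
                    currentURL = _rawText.substring(currentStart, currentEnd);
                }
                // Text since last match
                returnBuffer.append(HtmlUtil.encode(_rawText.substring(lastEnd, currentStart)));
                // Wrap matched URL
                returnBuffer.append("<a href=\"" + currentURL + "\"" + targetText + ">" + currentURL + "</a>");
                lastEnd = currentEnd;
            } while (matcher.find());
            if (lastEnd < originalLength) {
                returnBuffer.append(HtmlUtil.encode(_rawText.substring(lastEnd)));
            }
            returnValue = returnBuffer.toString();
        }
    }
    if (returnValue == null) {
        returnValue = HtmlUtil.encode(_rawText);
    }
    return returnValue;
}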
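The Unicode expansion described in this answer is not implemented in its code. As a rough illustration of the href side only, the sketch below percent-encodes every non-ASCII character as UTF-8 and leaves ASCII untouched. This is a simplification of the split-encode-glue procedure the answer outlines, and the class `HrefEncoder` and method `encodeNonAscii` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

final class HrefEncoder {
    // Percent-encodes each non-ASCII code point as its UTF-8 bytes (%XX per byte).
    // ASCII characters, including URL delimiters, are copied through unchanged.
    static String encodeNonAscii(String url) {
        StringBuilder out = new StringBuilder(url.length() + 16);
        for (int i = 0; i < url.length(); ) {
            int cp = url.codePointAt(i);
            int len = Character.charCount(cp);
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                byte[] bytes = url.substring(i, i + len).getBytes(StandardCharsets.UTF_8);
                for (byte b : bytes) {
                    out.append('%').append(String.format("%02X", b & 0xFF));
                }
            }
            i += len;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00F6 is o-umlaut, \u00E9 is e-acute (escaped to avoid source-encoding issues)
        String display = "http://en.wikipedia.org/wiki/Bj\u00F6rn_Andr\u00E9sen";
        // prints: http://en.wikipedia.org/wiki/Bj%C3%B6rn_Andr%C3%A9sen
        System.out.println(encodeNonAscii(display));
    }
}
```

The encoded form would go into the href, while the original text stays as the visible link text.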
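Answered by Tixa
To detect a URL you just need this:
if (yourtextview.getText().toString().contains("www") || yourtextview.getText().toString().contains("http://")) {
    // your code here if the text contains a URL
}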
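The same check can be tried outside a text view on a plain String. This is only a sketch, with a hypothetical class `UrlHint` and method `mightContainUrl`; it shows that a substring test is a hint rather than real detection, since it can both over-match and under-match.

```java
final class UrlHint {
    // Same test as in the answer, applied to a plain String.
    static boolean mightContainUrl(String text) {
        return text.contains("www") || text.contains("http://");
    }

    public static void main(String[] args) {
        System.out.println(mightContainUrl("Please go to http://stackoverflow.com")); // true
        System.out.println(mightContainUrl("awww, cute"));           // true, although there is no URL
        System.out.println(mightContainUrl("https://example.com"));  // false, https is not covered
    }
}
```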
Answered by Adam Gent
I wrote my own URI/URL extractor and figured someone might find it useful, considering that IMHO it is better than the other answers because:
- It's stream based and can be used on large documents
- It's extendable to handle all kinds of "Atwood Paren" problems through a strategy chain.
Since the code is somewhat long for a post (albeit only one Java file), I have put it in a gist on GitHub.
Here is the signature of one of the main methods, to show how it delivers on the above bullet points:
public static Iterator<ExtractedURI> extractURIs(
    final Reader reader,
    final Iterable<ToURIStrategy> strategies,
    String ... schemes);
There is a default strategy chain which handles most of the Atwood problems.
public static List<ToURIStrategy> DEFAULT_STRATEGY_CHAIN = ImmutableList.of(
    new RemoveSurroundsWithToURIStrategy("'"),
    new RemoveSurroundsWithToURIStrategy("\""),
    new RemoveSurroundsWithToURIStrategy("(", ")"),
    new RemoveEndsWithToURIStrategy("."),
    DEFAULT_STRATEGY,
    REMOVE_LAST_STRATEGY);
Enjoy!
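The gist itself is not reproduced here, so the following is only a guess at the general idea behind such a chain, not the author's `ToURIStrategy` API. Every name in it (`StrategyChainSketch`, `removeSurrounds`, `removeEndsWith`, `toUri`) is hypothetical. Each cleanup step is applied to a candidate token in order, and the result is accepted only if `java.net.URI` parses it as an absolute URI with a host.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.UnaryOperator;

final class StrategyChainSketch {
    // Strips a matching prefix/suffix pair, e.g. quotes or parentheses (hypothetical helper).
    static UnaryOperator<String> removeSurrounds(String open, String close) {
        return s -> s.length() >= open.length() + close.length()
                && s.startsWith(open) && s.endsWith(close)
                ? s.substring(open.length(), s.length() - close.length())
                : s;
    }

    // Strips a trailing suffix such as a sentence-ending dot (hypothetical helper).
    static UnaryOperator<String> removeEndsWith(String suffix) {
        return s -> s.endsWith(suffix) ? s.substring(0, s.length() - suffix.length()) : s;
    }

    // Cleanup steps, applied in this order.
    static final List<UnaryOperator<String>> CHAIN = Arrays.asList(
            removeSurrounds("'", "'"),
            removeSurrounds("\"", "\""),
            removeSurrounds("(", ")"),
            removeEndsWith("."));

    // Runs the token through the chain, then lets java.net.URI decide whether it is a URI.
    static Optional<URI> toUri(String token) {
        String s = token;
        for (UnaryOperator<String> step : CHAIN) {
            s = step.apply(s);
        }
        try {
            URI uri = new URI(s);
            return uri.isAbsolute() && uri.getHost() != null ? Optional.of(uri) : Optional.empty();
        } catch (URISyntaxException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(toUri("(http://example.com)")); // Optional[http://example.com]
        System.out.println(toUri("http://example.com."));  // Optional[http://example.com]
        System.out.println(toUri("hello"));                // Optional.empty
    }
}
```

Keeping each cleanup as a separate small function is what makes this kind of design extendable: a new "Atwood Paren" case becomes one more entry in the list.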
Answered by robinst
I made a small library which does exactly this:
https://github.com/robinst/autolink-java
Some tricky examples and the links that it detects:
- http://example.com. → http://example.com.
- http://example.com, → http://example.com,
- (http://example.com) → (http://example.com)
- (... (see http://example.com)) → (... (see http://example.com))
- https://en.wikipedia.org/wiki/Link_(The_Legend_of_Zelda) → https://en.wikipedia.org/wiki/Link_(The_Legend_of_Zelda)
- http://ü????eé.com/ → http://ü????eé.com/