How to normalize a URL in Java?
Original question: http://stackoverflow.com/questions/2993649/
This page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
Asked by dfrankow
URL normalization (or URL canonicalization) is the process by which URLs are modified and standardized in a consistent manner. The goal of the normalization process is to transform a URL into a normalized or canonical URL so it is possible to determine if two syntactically different URLs are equivalent.
Strategies include adding trailing slashes, https => http, etc. The Wikipedia page lists many.
Got a favorite method of doing this in Java? Perhaps a library (Nutch?), but I'm open. Smaller and fewer dependencies is better.
I'll hand-code something for now and keep an eye on this question.
EDIT: I want to aggressively normalize to count URLs as the same if they refer to the same content. For example, I ignore the parameters utm_source, utm_medium, utm_campaign. For example, I ignore subdomain if the title is the same.
Answered by Nitrodist
Have you taken a look at the URI class?
http://docs.oracle.com/javase/7/docs/api/java/net/URI.html#normalize()
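As a rough illustration of what URI.normalize() covers, here is a minimal sketch. The class name UriNormalizeDemo, the helper normalize and the example.com URL are made up for illustration. It shows that the method collapses "." and ".." path segments and leaves the other components as they were:

```java
import java.net.URI;

public class UriNormalizeDemo {
    // Hypothetical helper: parse the string and apply URI.normalize().
    static String normalize(String url) {
        return URI.create(url).normalize().toString();
    }

    public static void main(String[] args) {
        // "." is dropped and "b/.." cancels out; query and fragment are kept as-is.
        System.out.println(normalize("http://example.com/a/./b/../c?x=1#frag"));
        // prints http://example.com/a/c?x=1#frag
    }
}
```

Anything beyond path cleanup (query parameters, fragments, trailing slashes) would still have to be handled separately.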
Answered by Bruno
You can do this with the Restlet framework using Reference.normalize(). You should also be able to remove the elements you don't need quite conveniently with this class.
Answered by Amy B
I found this question last night, but there wasn't an answer I was looking for, so I made my own. Here it is in case somebody wants it in the future:
import java.io.UnsupportedEncodingException;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Locale;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/**
 * - Convert the scheme and host to lowercase.
 * - Normalize the path (done by java.net.URI).
 * - Add the port number, unless it is the default port for the scheme.
 * - Remove the fragment (the part after the #).
 * - Remove trailing slash.
 * - Sort the query string params.
 * - Remove some query string params like "utm_*" and "*session*".
 */
public class NormalizeURL
{
    public static String normalize(final String taintedURL) throws MalformedURLException
    {
        final URL url;
        try
        {
            url = new URI(taintedURL).normalize().toURL();
        }
        catch (URISyntaxException e)
        {
            throw new MalformedURLException(e.getMessage());
        }
        // String.replace() would treat "/$" as literal text, so use replaceAll() for the regex.
        final String path = url.getPath().replaceAll("/$", "");
        final SortedMap<String, String> params = createParameterMap(url.getQuery());
        final int port = url.getPort();
        final String queryString;
        if (params != null)
        {
            // Some params are only relevant for user tracking, so remove the most common ones.
            for (Iterator<String> i = params.keySet().iterator(); i.hasNext();)
            {
                final String key = i.next();
                if (key.startsWith("utm_") || key.contains("session"))
                {
                    i.remove();
                }
            }
            // Avoid a dangling "?" when every parameter was filtered out.
            queryString = params.isEmpty() ? "" : "?" + canonicalize(params);
        }
        else
        {
            queryString = "";
        }
        // getHost() keeps the original case, so lowercase it explicitly.
        return url.getProtocol().toLowerCase(Locale.ROOT) + "://" + url.getHost().toLowerCase(Locale.ROOT)
            + (port != -1 && port != url.getDefaultPort() ? ":" + port : "")
            + path + queryString;
    }

    /**
     * Takes a query string, separates the constituent name-value pairs, and
     * stores them in a SortedMap ordered by lexicographical order.
     * @return Null if there is no query string.
     */
    private static SortedMap<String, String> createParameterMap(final String queryString)
    {
        if (queryString == null || queryString.isEmpty())
        {
            return null;
        }
        final String[] pairs = queryString.split("&");
        final Map<String, String> params = new HashMap<String, String>(pairs.length);
        for (final String pair : pairs)
        {
            if (pair.length() < 1)
            {
                continue;
            }
            // Split into at most two tokens: name and value.
            String[] tokens = pair.split("=", 2);
            for (int j = 0; j < tokens.length; j++)
            {
                try
                {
                    tokens[j] = URLDecoder.decode(tokens[j], "UTF-8");
                }
                catch (UnsupportedEncodingException ex)
                {
                    ex.printStackTrace();
                }
            }
            switch (tokens.length)
            {
                case 1:
                {
                    if (pair.charAt(0) == '=')
                    {
                        params.put("", tokens[0]);
                    }
                    else
                    {
                        params.put(tokens[0], "");
                    }
                    break;
                }
                case 2:
                {
                    params.put(tokens[0], tokens[1]);
                    break;
                }
            }
        }
        return new TreeMap<String, String>(params);
    }

    /**
     * Canonicalize the query string.
     *
     * @param sortedParamMap Parameter name-value pairs in lexicographical order.
     * @return Canonical form of query string.
     */
    private static String canonicalize(final SortedMap<String, String> sortedParamMap)
    {
        if (sortedParamMap == null || sortedParamMap.isEmpty())
        {
            return "";
        }
        final StringBuilder sb = new StringBuilder(350);
        final Iterator<Map.Entry<String, String>> iter = sortedParamMap.entrySet().iterator();
        while (iter.hasNext())
        {
            final Map.Entry<String, String> pair = iter.next();
            sb.append(percentEncodeRfc3986(pair.getKey()));
            sb.append('=');
            sb.append(percentEncodeRfc3986(pair.getValue()));
            if (iter.hasNext())
            {
                sb.append('&');
            }
        }
        return sb.toString();
    }

    /**
     * Percent-encode values according to RFC 3986. The built-in Java URLEncoder does not encode
     * according to the RFC, so we make the extra replacements.
     *
     * @param string Decoded string.
     * @return Encoded string per RFC 3986.
     */
    private static String percentEncodeRfc3986(final String string)
    {
        try
        {
            return URLEncoder.encode(string, "UTF-8").replace("+", "%20").replace("*", "%2A").replace("%7E", "~");
        }
        catch (UnsupportedEncodingException e)
        {
            return string;
        }
    }
}
Answered by H6.
Because you also want to identify URLs which refer to the same content, I found this paper from WWW 2007 pretty interesting: Do Not Crawl in the DUST: Different URLs with Similar Text. It provides you with a nice theoretical approach.
Answered by Randy Hudson
No, there is nothing in the standard libraries to do this. Canonicalization includes things like decoding unnecessarily encoded characters, converting hostnames to lowercase, etc.
e.g. http://ACME.com/./foo%26bar
becomes:
http://acme.com/foo&bar
URI's normalize() does not do this.
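To make the gap concrete, the sketch below compares plain URI.normalize() with a small hand-written helper that also lowercases the scheme and host. The class HostCaseDemo and both helper names are hypothetical, and the helper assumes a hierarchical URL that has a host and no user-info part. Percent-encoded characters are deliberately left untouched, so this covers only part of the canonicalization described above:

```java
import java.net.URI;
import java.util.Locale;

public class HostCaseDemo {
    // What the JDK does on its own: only the path is normalized.
    static String jdkOnly(String url) {
        return URI.create(url).normalize().toString();
    }

    // Hypothetical helper: additionally lowercase scheme and host.
    // Assumes a hierarchical URL with a host and no user-info part.
    static String lowerCaseSchemeAndHost(String url) {
        URI u = URI.create(url).normalize();
        StringBuilder sb = new StringBuilder();
        sb.append(u.getScheme().toLowerCase(Locale.ROOT)).append("://");
        sb.append(u.getHost().toLowerCase(Locale.ROOT));
        if (u.getPort() != -1) {
            sb.append(':').append(u.getPort());
        }
        // Raw getters keep percent-encoding exactly as it was written.
        sb.append(u.getRawPath());
        if (u.getRawQuery() != null) {
            sb.append('?').append(u.getRawQuery());
        }
        if (u.getRawFragment() != null) {
            sb.append('#').append(u.getRawFragment());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String url = "http://ACME.com/./foo%26bar";
        System.out.println(jdkOnly(url));                // http://ACME.com/foo%26bar
        System.out.println(lowerCaseSchemeAndHost(url)); // http://acme.com/foo%26bar
    }
}
```

Decoding %26 to & is left out on purpose here: which escapes are safe to decode depends on the component, so that step needs its own rules.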
Answered by pdxleif
The RL library (https://github.com/backchatio/rl) goes quite a ways beyond java.net.URI.normalize(). It's in Scala, but I imagine it should be usable from Java.
Answered by Eric Leschinski
In Java, normalize parts of a URL
Example of a URL: https://i0.wp.com:55/lplresearch.com/wp-content/feb.png?ssl=1&myvar=2#myfragment
protocol: https
domain name: i0.wp.com
subdomain: i0
port: 55
path: /lplresearch.com/wp-content/feb.png
query: ?ssl=1&myvar=2
parameters: ssl=1 and myvar=2
fragment: #myfragment
Code to do the URL parsing:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class regex {
    // Returns capture group 1 if the whole URL matches the pattern, otherwise null.
    // matches() must succeed before group() may be called.
    private static String firstGroup(String pattern, String the_url){
        Matcher m = Pattern.compile(pattern).matcher(the_url);
        return m.matches() ? m.group(1) : null;
    }
    public static String getProtocol(String the_url){
        return firstGroup("^(http|https|smtp|ftp|file|pop)://.*", the_url);
    }
    public static String getParameters(String the_url){
        // "\\?" is a literal question mark; a single backslash would not compile in a Java string.
        return firstGroup(".*(\\?[-a-zA-Z0-9_.@!$&'()*+,;=]+)(#.*)*$", the_url);
    }
    public static String getFragment(String the_url){
        return firstGroup(".*(#.*)$", the_url);
    }
    public static void main(String[] args){
        String the_url =
            "https://i0.wp.com:55/lplresearch.com/" +
            "wp-content/feb.png?ssl=1&myvar=2#myfragment";
        System.out.println(getProtocol(the_url));
        System.out.println(getFragment(the_url));
        System.out.println(getParameters(the_url));
    }
}
Prints
https
#myfragment
?ssl=1&myvar=2
You can then push and pull on the parts of the URL until they pass muster.
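For comparison, the same parts can usually be read without regular expressions through the getters on java.net.URI. The following is only a sketch of that alternative, not part of the answer above; the class name UriPartsDemo and the helper describe are made up for illustration:

```java
import java.net.URI;

public class UriPartsDemo {
    // Hypothetical helper: list the components java.net.URI reports for a URL.
    static String describe(String url) {
        URI u = URI.create(url);
        return "scheme=" + u.getScheme()
            + "\nhost=" + u.getHost()
            + "\nport=" + u.getPort()
            + "\npath=" + u.getPath()
            + "\nquery=" + u.getQuery()
            + "\nfragment=" + u.getFragment();
    }

    public static void main(String[] args) {
        System.out.println(describe(
            "https://i0.wp.com:55/lplresearch.com/wp-content/feb.png?ssl=1&myvar=2#myfragment"));
    }
}
```

Note that getQuery() and getFragment() return the values without the leading "?" and "#", unlike the regex version.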
Answered by Thanh Duy Phan
I have a simple way to solve it. Here is my code:
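public static String normalizeURL(String oldLink)
{
    // Replace whatever scheme precedes "://" with "http".
    int pos = oldLink.indexOf("://");
    if (pos < 0)
    {
        // No scheme separator found; return the input unchanged.
        return oldLink;
    }
    return "http" + oldLink.substring(pos);
}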