HTTP URL Address Encoding in Java
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/724043/
Asked by suDocker
My Java standalone application gets a URL (which points to a file) from the user and I need to hit it and download it. The problem I am facing is that I am not able to encode the HTTP URL address properly...
Example:
URL: http://search.barnesandnoble.com/booksearch/first book.pdf
java.net.URLEncoder.encode(url.toString(), "ISO-8859-1");
returns:
http%3A%2F%2Fsearch.barnesandnoble.com%2Fbooksearch%2Ffirst+book.pdf
But, what I want is
http://search.barnesandnoble.com/booksearch/first%20book.pdf
(space replaced by %20)
I guess URLEncoder is not designed to encode HTTP URLs... The JavaDoc says "Utility class for HTML form encoding"... Is there any other way to do this?
Accepted answer by user85421
The java.net.URI class can help; in the documentation of URL you find:
Note, the URI class does perform escaping of its component fields in certain circumstances. The recommended way to manage the encoding and decoding of URLs is to use an URI
Use one of the constructors with more than one argument, like:
URI uri = new URI(
"http",
"search.barnesandnoble.com",
"/booksearch/first book.pdf",
null);
URL url = uri.toURL();
//or String request = uri.toString();
(the single-argument constructor of URI does NOT escape illegal characters)
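In the question the URL arrives from the user as a single string, so it has to be split into components before the multi-argument constructor can be used. The following is only a minimal sketch of one way to do that; the class name UserUrlEncoder and the helper encode are hypothetical, and it assumes the input is not already percent-encoded (an existing % would be escaped again as %25).

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

public class UserUrlEncoder {

    // Splits a user-supplied URL string into components and lets the
    // multi-argument URI constructor escape the illegal characters.
    public static String encode(String rawUrl) {
        try {
            // java.net.URL parses leniently and accepts a space in the path.
            // (This constructor is deprecated in recent JDKs but still works.)
            URL url = new URL(rawUrl);
            URI uri = new URI(url.getProtocol(), url.getUserInfo(), url.getHost(),
                    url.getPort(), url.getPath(), url.getQuery(), url.getRef());
            return uri.toASCIIString();
        } catch (MalformedURLException | URISyntaxException e) {
            throw new IllegalArgumentException("Cannot encode URL: " + rawUrl, e);
        }
    }

    public static void main(String[] args) {
        // prints http://search.barnesandnoble.com/booksearch/first%20book.pdf
        System.out.println(encode("http://search.barnesandnoble.com/booksearch/first book.pdf"));
    }
}
```

The checked exceptions are wrapped in an unchecked one only to keep the call site short; real code may prefer to report a malformed URL back to the user.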
Only illegal characters get escaped by the above code - it does NOT escape non-ASCII characters (see fatih's comment).
The toASCIIString method can be used to get a String only with US-ASCII characters:
URI uri = new URI(
"http",
"search.barnesandnoble.com",
"/booksearch/é",
null);
String request = uri.toASCIIString();
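To make the difference visible, here is a small sketch that prints both forms of the same URI. The class name AsciiStringDemo is hypothetical; the letter é is written as the escape \u00e9 so the result does not depend on the source file encoding.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class AsciiStringDemo {

    public static URI build() {
        try {
            // "\u00e9" is the letter é from the example above.
            return new URI("http", "search.barnesandnoble.com", "/booksearch/\u00e9", null);
        } catch (URISyntaxException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        URI uri = build();
        System.out.println(uri.toString());      // keeps the non-ASCII é as-is
        System.out.println(uri.toASCIIString()); // é becomes %C3%A9, its UTF-8 bytes
    }
}
```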
For a URL with a query like http://www.google.com/ig/api?weather=São Paulo, use the 5-parameter version of the constructor:
URI uri = new URI(
"http",
"www.google.com",
"/ig/api",
"weather=São Paulo",
null);
String request = uri.toASCIIString();
Answer by Nathan Feger
Yes, URL encoding encodes that string so that it can be passed properly inside a URL to a final destination. For example, you could not have http://stackoverflow.com?url=http://yyy.com. URL-encoding the parameter would fix that parameter value.
So I have two choices for you:
1. Do you have access to the path separate from the domain? If so, you may be able to simply UrlEncode the path. However, if this is not the case, then option 2 may be for you.
2. Get commons-httpclient-3.1. This has a class URIUtil:
System.out.println(URIUtil.encodePath("http://example.com/xy", "ISO-8859-1"));
This will output exactly what you are looking for, as it will only encode the path part of the URI.
FYI, you'll need commons-codec and commons-logging for this method to work at runtime.
Answer by Brandon Yarbrough
URLEncoding can encode HTTP URLs just fine, as you've unfortunately discovered. The string you passed in, "http://search.barnesandnoble.com/booksearch/first book.pdf", was correctly and completely encoded into a URL-encoded form. You could pass that entire long string of gobbledygook that you got back as a parameter in a URL, and it could be decoded back into exactly the string you passed in.
It sounds like you want to do something a little different than passing the entire URL as a parameter. From what I gather, you're trying to create a search URL that looks like "http://search.barnesandnoble.com/booksearch/whateverTheUserPassesIn". The only thing that you need to encode is the "whateverTheUserPassesIn" bit, so perhaps all you need to do is something like this:
String url = "http://search.barnesandnoble.com/booksearch/" +
URLEncoder.encode(userInput,"UTF-8");
That should produce something rather more valid for you.
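One caveat with this approach: URLEncoder produces form encoding, so a space comes out as "+" rather than the %20 the question asks for. The sketch below shows one possible workaround for a single path segment; the class name PathSegmentEncoder and the helper encodeSegment are hypothetical. It relies on URLEncoder writing a literal plus as %2B, so any "+" left in its output can only stand for a space.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class PathSegmentEncoder {

    // Encodes ONE path segment; do not pass a whole path, because "/" would
    // be encoded as %2F as well.
    public static String encodeSegment(String segment) {
        try {
            // Form-encode first, then turn the "+" that marks a space into %20.
            return URLEncoder.encode(segment, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        String userInput = "first book.pdf";
        // prints http://search.barnesandnoble.com/booksearch/first%20book.pdf
        System.out.println("http://search.barnesandnoble.com/booksearch/" + encodeSegment(userInput));
    }
}
```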
Answer by Julian Reschke
Nitpicking: a string containing a whitespace character by definition is not a URI. So what you're looking for is code that implements the URI escaping defined in Section 2.1 of RFC 3986.
Answer by Matt
Please be warned that most of the answers above are INCORRECT.
The URLEncoder class, despite its name, is NOT what needs to be here. It's unfortunate that Sun named this class so annoyingly. URLEncoder is meant for passing data as parameters, not for encoding the URL itself.
In other words, "http://search.barnesandnoble.com/booksearch/first book.pdf" is the URL. Parameters would be, for example, "http://search.barnesandnoble.com/booksearch/first book.pdf?parameter1=this&param2=that". The parameters are what you would use URLEncoder for.
The following two examples highlight the differences between the two.
The following produces the wrong parameters, according to the HTTP standard. Note the ampersand (&) and plus (+) are encoded incorrectly.
uri = new URI("http", null, "www.google.com", 80,
"/help/me/book name+me/", "MY CRZY QUERY! +&+ :)", null);
// URI: http://www.google.com:80/help/me/book%20name+me/?MY%20CRZY%20QUERY!%20+&+%20:)
The following will produce the correct parameters, with the query properly encoded. Note the spaces, ampersands, and plus marks.
uri = new URI("http", null, "www.google.com", 80, "/help/me/book name+me/", URLEncoder.encode("MY CRZY QUERY! +&+ :)", "UTF-8"), null);
// URI: http://www.google.com:80/help/me/book%20name+me/?MY+CRZY+QUERY%2521+%252B%2526%252B+%253A%2529
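The %25 sequences in that second URI appear because the multi-argument URI constructors always quote the % character, so the output of URLEncoder is escaped a second time. The following is a rough sketch, not part of the original answer, of how the original text can be recovered on the receiving side; the class name QueryRoundTrip and its helpers are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class QueryRoundTrip {

    // Builds the URI exactly as in the second example above.
    public static URI build(String query) {
        try {
            return new URI("http", null, "www.google.com", 80,
                    "/help/me/book name+me/", URLEncoder.encode(query, "UTF-8"), null);
        } catch (URISyntaxException | UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // getQuery() undoes the percent-escaping added by the URI constructor;
    // URLDecoder then undoes the form encoding added by URLEncoder.
    public static String recover(URI uri) {
        try {
            return URLDecoder.decode(uri.getQuery(), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        URI uri = build("MY CRZY QUERY! +&+ :)");
        System.out.println(uri.getRawQuery()); // the escaped form shown in the comment above
        System.out.println(recover(uri));      // the original query text
    }
}
```

A server that decodes the query only once would see the URLEncoder output rather than the original text, which is worth checking against the actual server before relying on this form.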
Answer by simonox
There is still a problem if you have got an encoded "/" (%2F) in your URL.
RFC 3986 - Section 2.2 says: "If data for a URI component would conflict with a reserved character's purpose as a delimiter, then the conflicting data must be percent-encoded before the URI is formed."
But there is an issue with Tomcat:
http://tomcat.apache.org/security-6.html - Fixed in Apache Tomcat 6.0.10
important: Directory traversal CVE-2007-0450
Tomcat permits '\', '%2F' and '%5C' [...] .
The following Java system properties have been added to Tomcat to provide additional control of the handling of path delimiters in URLs (both options default to false):
- org.apache.tomcat.util.buf.UDecoder.ALLOW_ENCODED_SLASH: true|false
- org.apache.catalina.connector.CoyoteAdapter.ALLOW_BACKSLASH: true|false
Due to the impossibility to guarantee that all URLs are handled by Tomcat as they are in proxy servers, Tomcat should always be secured as if no proxy restricting context access was used.
Affects: 6.0.0-6.0.9
So if you have a URL containing %2F, Tomcat returns: "400 Invalid URI: noSlash"
You can switch off the bugfix in the Tomcat startup script:
set JAVA_OPTS=%JAVA_OPTS% %LOGGING_CONFIG% -Dorg.apache.tomcat.util.buf.UDecoder.ALLOW_ENCODED_SLASH=true
Answer by fmucar
A solution I developed, which is much more stable than any other:
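import java.nio.charset.StandardCharsets;

public class URLParamEncoder {

    public static String encode(String input) {
        StringBuilder resultStr = new StringBuilder();
        // Work on the UTF-8 bytes so that characters above U+00FF are
        // also written as valid two-digit %XX escapes.
        for (byte b : input.getBytes(StandardCharsets.UTF_8)) {
            int value = b & 0xFF;
            if (isUnsafe(value)) {
                resultStr.append('%');
                resultStr.append(toHex(value / 16));
                resultStr.append(toHex(value % 16));
            } else {
                resultStr.append((char) value);
            }
        }
        return resultStr.toString();
    }

    private static char toHex(int ch) {
        return (char) (ch < 10 ? '0' + ch : 'A' + ch - 10);
    }

    private static boolean isUnsafe(int value) {
        // Every byte of a non-ASCII character is unsafe, as are the
        // reserved ASCII characters listed below.
        if (value >= 128)
            return true;
        return " %$&+,/:;=?@<>#".indexOf(value) >= 0;
    }
}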
Answer by Uriah Carpenter
I've created a new project to help construct HTTP URLs. The library will automatically URL encode path segments and query parameters.
You can view the source and download a binary at https://github.com/Widen/urlbuilder
The example URL in this question:
new UrlBuilder("search.barnesandnoble.com", "booksearch/first book.pdf").toString()
produces
http://search.barnesandnoble.com/booksearch/first%20book.pdf
Answer by negora
I agree with Matt. Indeed, I've never seen it well explained in tutorials, but one matter is how to encode the URL path, and a very different one is how to encode the parameters which are appended to the URL (the query part, behind the "?" symbol). They use similar encoding, but not the same.
This is especially true for the encoding of the whitespace character. The URL path needs it to be encoded as %20, whereas the query part allows %20 and also the "+" sign. The best idea is to test it ourselves against our Web server, using a Web browser.
For both cases, I would ALWAYS encode COMPONENT BY COMPONENT, never the whole string. Indeed, URLEncoder allows that for the query part. For the path part you can use the class URI, although in this case it asks for the entire string, not a single component.
Anyway, I believe that the best way to avoid these problems is to use a personal non-conflictive design. How? For example, I would never name directories or parameters using characters other than a-z, A-Z, 0-9 and _. That way, the only need is to encode the value of every parameter, since it may come from user input and the characters used are unknown.
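The component-by-component idea can be sketched roughly as follows: the path goes through a multi-argument URI constructor, and each query name and value goes through URLEncoder separately. This is only an illustration under assumptions; the class name ComponentWiseUrl, the helper build, and the parameter name "title" are hypothetical, and the scheme is fixed to http for brevity.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class ComponentWiseUrl {

    public static String build(String host, String path, Map<String, String> params) {
        try {
            // Path: the multi-argument URI constructor escapes a space as %20.
            String rawPath = new URI(null, null, path, null).toASCIIString();
            StringBuilder url = new StringBuilder("http://").append(host).append(rawPath);
            // Query: every name and value is form-encoded on its own, so the
            // "=" and "&" delimiters added here are never escaped.
            char separator = '?';
            for (Map.Entry<String, String> p : params.entrySet()) {
                url.append(separator)
                   .append(URLEncoder.encode(p.getKey(), "UTF-8"))
                   .append('=')
                   .append(URLEncoder.encode(p.getValue(), "UTF-8"));
                separator = '&';
            }
            return url.toString();
        } catch (URISyntaxException | UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("title", "first book & more");
        // prints http://search.barnesandnoble.com/booksearch/first%20book.pdf?title=first+book+%26+more
        System.out.println(build("search.barnesandnoble.com", "/booksearch/first book.pdf", params));
    }
}
```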
Answer by Jeff Tsay
Unfortunately, org.apache.commons.httpclient.util.URIUtil is deprecated, and the replacement org.apache.commons.codec.net.URLCodec does coding suitable for form posts, not for actual URLs. So I had to write my own function, which does a single component (not suitable for entire query strings that have ?'s and &'s):
public static String encodeURLComponent(final String s)
{
    if (s == null)
    {
        return "";
    }
    final StringBuilder sb = new StringBuilder();
    try
    {
        for (int i = 0; i < s.length(); i++)
        {
            final char c = s.charAt(i);
            if (((c >= 'A') && (c <= 'Z')) || ((c >= 'a') && (c <= 'z')) ||
                ((c >= '0') && (c <= '9')) ||
                (c == '-') || (c == '.') || (c == '_') || (c == '~'))
            {
                sb.append(c);
            }
            else
            {
                final byte[] bytes = ("" + c).getBytes("UTF-8");
                for (byte b : bytes)
                {
                    sb.append('%');
                    int upper = (((int) b) >> 4) & 0xf;
                    sb.append(Integer.toHexString(upper).toUpperCase(Locale.US));
                    int lower = ((int) b) & 0xf;
                    sb.append(Integer.toHexString(lower).toUpperCase(Locale.US));
                }
            }
        }
        return sb.toString();
    }
    catch (UnsupportedEncodingException uee)
    {
        throw new RuntimeException("UTF-8 unsupported!?", uee);
    }
}