Java: Correctly encoding characters in a URL when using HttpClient
Original question: http://stackoverflow.com/questions/6448759/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow.
Asked by smm100
I have a list of URLs that I need to verify are valid. I've written a program in Java that uses Apache's HttpClient to check each link. I had to implement my own redirect strategy because the redirect URLs can contain invalid characters (like {}) which the default strategy doesn't handle. It works fine in the majority of cases, except for two:
1. Escaped characters in the path or query params, which should not be encoded further. Example:
String url = "http://www.example.com/chapter1/%3Fref%3Dsomething%26term%3D{keyword}?ref=xyz"
If I use a URI object, it chokes on the "{" character.
URI myUri = new URI(url) ==> This will fail.
If I run:
URI myUri = new URI(UriUtils.encodeHttpUrl(url))
it encodes the %3F to %253F. However, when I follow the link using Chrome or Fiddler, I do not see %3F getting escaped again. How do I protect from over-encoding the path or query params?
2. The last query param in the URL is itself a valid URL. Example:
String url = "www.example.com/Chapter1/?param1=xyz&param2=http://www.google.com/?abc=1"
My current encoding strategy splits up the query params and then calls URLEncoder.encode on the query params. This however causes the last param to be encoded as well (which is not the case when I follow it in Fiddler or Chrome).
I've tried a number of things (using UriUtils, special-casing URLs as the last param, and other hacks) but nothing seems ideal. What's the best way to solve this?
Accepted answer by mgiuca
How do I protect from over-encoding the path or query params?
You cannot "protect from over-encoding". You either encode, or you do not. You should always know, for any given string, whether it is encoded or not. You should only encode strings which are not yet encoded, and you should never encode strings which are already encoded.
So is this string encoded or not?
%3Fref%3Dsomething%26term%3D{keyword}
It seems to me like this is bad input: clearly this is not encoded, because it contains invalid characters ('{' and '}'). Yet it also seems not to be an unencoded string, because it contains '%xx' sequences. So it's partly encoded. There is no programmatic "solution" once a string is in this form; you simply need to avoid getting a string into such a form in the first place.
You may be able to construct an algorithm which "fixes" this string by carefully looking for parts that look like a "%" followed by two hex digits, and leaving them alone. But this will break on subtle cases. Consider an unencoded string "42%23", which is supposed to be a literal representation of the mathematical expression "42 mod 23". When I put this into a URI, I expect it to encode as "42%2523" so that it decodes as "42%23", but the above algorithm will break and encode it as "42%23", which will then decode as "42#". So there is no way to fix the above string. Encoding "%3F" to "%253F" is exactly what a URI encoder should be doing.
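The "42%23" problem described above can be reproduced in a few lines. This is only a sketch: the class and method names are made up for illustration, and java.net.URLEncoder (which does form-style encoding, so a space becomes "+") stands in for whatever URI encoder you actually use.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class PartialEncodingDemo {

    /** Strict encoder: treats the input as raw text, so every '%' becomes "%25". */
    static String encodeStrict(String raw) {
        return URLEncoder.encode(raw, StandardCharsets.UTF_8);
    }

    /** Hypothetical "fixer": copies anything shaped like %XX through unchanged, encodes the rest. */
    static String encodeLenient(String input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            if (looksLikeEscape(input, i)) {
                out.append(input, i, i + 3);   // assume it is already an escape
                i += 3;
            } else {
                int end = input.offsetByCodePoints(i, 1);
                out.append(URLEncoder.encode(input.substring(i, end), StandardCharsets.UTF_8));
                i = end;
            }
        }
        return out.toString();
    }

    static boolean looksLikeEscape(String s, int i) {
        return i + 2 < s.length() && s.charAt(i) == '%'
                && isHex(s.charAt(i + 1)) && isHex(s.charAt(i + 2));
    }

    static boolean isHex(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    static String decode(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The partly-encoded input is "fixed" as hoped...
        System.out.println(encodeLenient("%3Fref%3Dsomething%26term%3D{keyword}"));
        // prints %3Fref%3Dsomething%26term%3D%7Bkeyword%7D

        // ...but raw text that merely looks like an escape is silently corrupted.
        String raw = "42%23";
        System.out.println(decode(encodeStrict(raw)));   // prints 42%23
        System.out.println(decode(encodeLenient(raw)));  // prints 42#
    }
}
```

The lenient version only works if you already know that every "%XX" in the input is an escape, which is exactly the knowledge a partly-encoded string has lost.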
Note: Having said this, browsers often let you get away with typing bad characters into URIs, and they automatically encode them. That's not very robust, so it shouldn't be relied on unless you are trying to be very forgiving of user input. In that case, you can make a "best effort" by first decoding the URI and then re-encoding it. With that approach, if I wanted to type "42%23" I would have to manually type in "42%2523".
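A minimal sketch of that "best effort" for a single path segment or parameter value might look like the following. The class and method names are hypothetical, and it assumes java.net.URLDecoder/URLEncoder semantics are acceptable: "+" is read as a space, and a malformed escape such as a trailing "%" makes decode throw IllegalArgumentException.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class BestEffortNormalizer {

    /** Decode first, then re-encode, so existing %XX escapes are not doubled. */
    static String normalize(String segment) {
        String decoded = URLDecoder.decode(segment, StandardCharsets.UTF_8);
        return URLEncoder.encode(decoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Partly-encoded input: the escapes survive and the braces get encoded.
        System.out.println(normalize("%3Fref%3Dsomething%26term%3D{keyword}"));
        // prints %3Fref%3Dsomething%26term%3D%7Bkeyword%7D

        // The cost: a literal "42%23" has to be supplied as "42%2523".
        System.out.println(normalize("42%23"));    // prints 42%23 (which means "42#")
        System.out.println(normalize("42%2523"));  // prints 42%2523 (which means "42%23")
    }
}
```

Apply this to one component at a time rather than to a whole URL, since the encoder also escapes delimiters such as "/" and "?".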
As for question 2:
This however causes the last param to be encoded as well
Similarly, this is exactly what you want. If a URI appears as a parameter inside another URI, it should be percent-encoded. Otherwise, how can you tell where one URI finishes and the other continues? I believe the above URI is actually valid (since ':', '/', '&' and '=' are reserved characters, not forbidden, and therefore they are allowed as long as they do not create ambiguity). But it is much safer to have a URI-inside-a-URI escaped.
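To illustrate the ambiguity, here is a small sketch with a naive query parser. The parser, the class name, and the extra "def=2" parameter on the nested URL are assumptions added for the example; they are not part of the original question.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class NestedUrlDemo {

    /** Naive parser: split on '&', then on the first '=', then decode each piece. */
    static Map<String, String> parseQuery(String rawQuery) {
        Map<String, String> params = new LinkedHashMap<>();
        for (String pair : rawQuery.split("&")) {
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(URLDecoder.decode(name, StandardCharsets.UTF_8),
                       URLDecoder.decode(value, StandardCharsets.UTF_8));
        }
        return params;
    }

    public static void main(String[] args) {
        String nested = "http://www.google.com/?abc=1&def=2";

        // Unescaped: the nested URL's second parameter leaks into the outer query.
        System.out.println(parseQuery("param1=xyz&param2=" + nested));
        // prints {param1=xyz, param2=http://www.google.com/?abc=1, def=2}

        // Escaped: the nested URL survives the round trip intact.
        String escaped = URLEncoder.encode(nested, StandardCharsets.UTF_8);
        System.out.println(parseQuery("param1=xyz&param2=" + escaped));
        // prints {param1=xyz, param2=http://www.google.com/?abc=1&def=2}
    }
}
```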
Answer by Martijn Courteaux
I really don't know, but you can try to decode it first, so that the %3F goes back to what it was, and then encode it again.
So:
String decoded = URLDecoder.decode(url, "UTF-8");
url = URLEncoder.encode(decoded, "UTF-8");
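One thing to watch for: URLEncoder is meant for individual form values, so running the two lines above over a complete URL also escapes the ':' , '/' and '?' delimiters. A small sketch (class name and sample URL are made up for illustration):

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class WholeUrlRoundTrip {

    /** Decode then re-encode an entire URL string, as in the snippet above. */
    static String roundTrip(String url) {
        String decoded = URLDecoder.decode(url, StandardCharsets.UTF_8);
        return URLEncoder.encode(decoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("http://www.example.com/a%20b?x=1"));
        // prints http%3A%2F%2Fwww.example.com%2Fa+b%3Fx%3D1
    }
}
```

The result is a valid parameter value but no longer a usable URL, so the round trip is probably best applied to each path segment or query value separately.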
Answer by user207421
The correct way to encode an unencoded URL string is via URI.toASCIIString().
Of course it is up to you to decide whether the URL is already encoded or not.
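As a rough sketch of how this can look in practice: the multi-argument java.net.URI constructors take unencoded components and quote illegal characters (always including '%'), while the single-argument constructor expects an already-encoded string. The class name, host, and sample paths below are illustrative assumptions.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriQuotingDemo {

    /** Build a URI from UNENCODED components and return its ASCII form. */
    static String encode(String rawPath, String rawQuery) {
        try {
            URI uri = new URI("http", "www.example.com", rawPath, rawQuery, null);
            return uri.toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** The single-argument constructor only accepts a string that is already encoded. */
    static boolean parsesAsEncoded(String url) {
        try {
            new URI(url);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(encode("/chapter 1/{keyword}", "ref=xyz"));
        // prints http://www.example.com/chapter%201/%7Bkeyword%7D?ref=xyz

        // '%' in an unencoded component is itself quoted, as the accepted answer expects.
        System.out.println(encode("/42%23", null));
        // prints http://www.example.com/42%2523

        System.out.println(parsesAsEncoded(
                "http://www.example.com/chapter1/%3Fref%3Dsomething%26term%3D{keyword}?ref=xyz"));
        // prints false
    }
}
```

Note that the query argument is quoted as a whole, so '&' and '=' inside a parameter value are left alone; values containing them still need to be encoded individually.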
Answer by spierce7
Have you tried using the URLEncoder?
URLEncoder.encode(URLString, "UTF-8")
Besides that, your only option is to encode each URL that is being used as a parameter separately, and then build the full URL manually. This is a pretty tricky case.
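A minimal sketch of that manual approach, assuming the parameter names and values are available unencoded (the class and method names are hypothetical):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class QueryBuilderDemo {

    /** Encode every name and value on its own, then join them onto the base URL. */
    static String buildUrl(String base, Map<String, String> rawParams) {
        StringBuilder url = new StringBuilder(base);
        char separator = '?';
        for (Map.Entry<String, String> p : rawParams.entrySet()) {
            url.append(separator)
               .append(URLEncoder.encode(p.getKey(), StandardCharsets.UTF_8))
               .append('=')
               .append(URLEncoder.encode(p.getValue(), StandardCharsets.UTF_8));
            separator = '&';
        }
        return url.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();  // keeps insertion order
        params.put("param1", "xyz");
        params.put("param2", "http://www.google.com/?abc=1");
        System.out.println(buildUrl("http://www.example.com/Chapter1/", params));
        // prints http://www.example.com/Chapter1/?param1=xyz&param2=http%3A%2F%2Fwww.google.com%2F%3Fabc%3D1
    }
}
```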