Source: http://stackoverflow.com/questions/2452914/ (Stack Overflow). This content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors.
Java: a question related to URLs
Asked by Kevin
Dear all, I have run into this problem in my Java program. I think it should be classified as a URL problem, but I am not 100% sure. If you think I am wrong, feel free to recategorize it. Thanks.
I will state my problem as simply as possible. I searched the famous Chinese search engine baidu.com for the Chinese keyword "奥巴马" ("Obama" in English). I do that by passing a URL (from a Java program) to the browser, like this:
http://news.baidu.com/ns?word=奥巴马
It works perfectly, just as if I had typed the keyword "奥巴马" into the search box on baidu.com.
However, my advisor now wants something else. He cannot read Chinese webpages, but he wants to make sure that the pages I get from Baidu are related to "Obama", so he asked me to run them through Google Translate, i.e., translate the Chinese webpage into English.
This sounds straightforward. However, this is where I ran into trouble.
If I simply pass the URL "http://news.baidu.com/ns?word=奥巴马" to Google Translate and select the "Chinese to English" option, the result looks awful. (I don't know why; it may be related to Chinese character encoding.)
Alternatively, if my browser opens the "http://news.baidu.com/ns?word=奥巴马" page and I then click the "百度一下" button (which simply means "search"), the URL changes. If I pass this new URL to Google Translate and do the same thing, the result is much better.
I hope I am not making this problem sound too complicated, and I apologize for the Chinese words involved, but I really need your help here. Because I do all of this in a Java program, I can't figure out how to reproduce the "百度一下" (pressing the search button) step and get the new URL. If I could get that new URL, the rest would be easy: I could call Google Translate from my Java code and pop up a new window to show my advisor.
Please share any ideas or thoughts here. Thanks a lot.
Robert
Accepted answer by lunohodov
You could use
URLEncoder.encode("http://news.baidu.com/ns?word=奥巴马", "utf-8")
then pass the resulting URL to Google Translate like:
http://translate.google.com/translate?js=y&prev=_t&hl=en&ie=UTF-8&layout=1&eotf=1&sl=zh-CN&tl=en&u=YOUR_URL
Cheers
Answer by Bozho
Try calling
URLEncoder.encode("http://news.baidu.com/ns?word=奥巴马", "utf-8")
(or utf-16; I'm not very familiar with how Chinese characters are represented)
Answer by Josh Lee
When you press the search button, the browser encodes the search term as %E5%A5%A5%E5%B7%B4%E9%A9%AC, which is the UTF-8 encoding of 奥巴马. It does this because UTF-8 is the default encoding for HTML forms.
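As a rough illustration of the percent-encoding described above, the following sketch encodes the keyword with URLEncoder and UTF-8, then decodes it again. The class and method names are made up for this example, and the keyword is written with Unicode escapes so the source file's own encoding does not matter.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class; not part of the original question or answers.
class Utf8FormEncodingDemo {

    // Percent-encode a search term using UTF-8 bytes.
    static String encodeUtf8(String term) {
        try {
            return URLEncoder.encode(term, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is one of the charsets every JVM must support.
            throw new IllegalStateException(e);
        }
    }

    // Reverse the step above: turn %XX sequences back into characters.
    static String decodeUtf8(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String keyword = "\u5965\u5df4\u9a6c"; // the keyword from the question
        String encoded = encodeUtf8(keyword);
        System.out.println(encoded);           // %E5%A5%A5%E5%B7%B4%E9%A9%AC
        System.out.println(decodeUtf8(encoded).equals(keyword)); // true
    }
}
```

Each Chinese character becomes three UTF-8 bytes here, which is why the encoded term contains nine %XX groups.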
Java uses a UTF-16 encoding internally, so it's possible that the URL library builds a request in that encoding if you do not specify anything.
However, I could not reproduce your problem with Google Translate; pasting that URL appeared to work correctly no matter how I did it.
Answer by irreputable
URLs can contain only ASCII characters. All other characters must be converted to bytes and then %-encoded in ASCII. However, nothing mandates which charset is used to convert characters to bytes. UTF-8 is recommended, but not required. If a server expresses a charset preference, the client should respect it and use the same charset for encoding.
You can see from the page info that Baidu uses the gb2312 encoding. The characters 奥巴马 in a form on its page will be converted to bytes in gb2312 (B0C2 B0CD C2ED), then %-encoded to %B0%C2%B0%CD%C2%ED. That is what is actually sent to the Baidu server: http://www.baidu.com/s?wd=%B0%C2%B0%CD%C2%ED
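To make the byte-level difference concrete, here is a small sketch that prints the raw bytes of the keyword in both charsets as hex. The class and helper names are invented for illustration, and it assumes the JVM ships the GB2312 charset (standard JDK builds do, but a stripped-down runtime might not).

```java
import java.nio.charset.Charset;

// Hypothetical demo class; not part of the original answer.
class CharsetBytesDemo {

    // Convert text to bytes in the named charset and render them as uppercase hex.
    static String hexBytes(String text, String charsetName) {
        StringBuilder hex = new StringBuilder();
        for (byte b : text.getBytes(Charset.forName(charsetName))) {
            hex.append(String.format("%02X", b & 0xFF)); // mask to treat the byte as unsigned
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        String keyword = "\u5965\u5df4\u9a6c"; // the keyword from the question
        System.out.println("GB2312: " + hexBytes(keyword, "GB2312")); // B0C2B0CDC2ED
        System.out.println("UTF-8:  " + hexBytes(keyword, "UTF-8"));  // E5A5A5E5B7B4E9A9AC
    }
}
```

The same three characters take six bytes in gb2312 and nine in UTF-8, so a server that decodes with the wrong charset cannot recover the original word.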
Your OS happens to be configured to use gb2312 by default, so when you paste http://news.baidu.com/ns?word=奥巴马 into the browser, the browser does the same thing and Baidu gets the correct characters. When I paste that URL into my browser, it breaks, because my OS uses UTF-8 and the browser encodes these Chinese characters in UTF-8, which is not what Baidu expects. (When you enter a URL directly in a browser, the browser may not have communicated with the server yet and does not know which charset the server prefers, so it uses the platform default charset.)
Now, Google uses UTF-8. That is why pasting the URL into the Google form breaks just as it does on my OS: the characters are encoded in UTF-8, Baidu tries to parse them as gb2312, and it ends up with completely wrong words.
The solution is easy. Just encode the parameter the way the server expects:
"http://news.baidu.com/ns?word=" + URLEncoder.encode("奥巴马", "gb2312")
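Putting the pieces of this thread together, one possible sketch looks like the following. It encodes the keyword in gb2312 for Baidu, then encodes the whole Baidu URL in UTF-8 so it can be passed as the u parameter of the Google Translate link quoted in the accepted answer. The class and method names are hypothetical, and the Translate URL format is taken from that answer as-is; it may not match what Google serves today.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper class combining the two answers; names are illustrative only.
class BaiduTranslateLink {

    // Wrap URLEncoder so callers do not have to handle the checked exception.
    static String encode(String text, String charsetName) {
        try {
            return URLEncoder.encode(text, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("Charset not available: " + charsetName, e);
        }
    }

    // Build the Baidu news search URL with the keyword encoded as Baidu expects.
    static String baiduNewsUrl(String keyword) {
        return "http://news.baidu.com/ns?word=" + encode(keyword, "gb2312");
    }

    // Build the Google Translate link; the page URL becomes a single encoded parameter.
    static String translateUrl(String pageUrl) {
        return "http://translate.google.com/translate?js=y&prev=_t&hl=en&ie=UTF-8"
                + "&layout=1&eotf=1&sl=zh-CN&tl=en&u=" + encode(pageUrl, "UTF-8");
    }

    public static void main(String[] args) {
        String keyword = "\u5965\u5df4\u9a6c"; // the keyword from the question
        String baidu = baiduNewsUrl(keyword);
        System.out.println(baidu);
        System.out.println(translateUrl(baidu));
    }
}
```

Because the Baidu URL is already pure ASCII after the first step, the second encoding only escapes characters such as ':', '/', '?', '=' and '%', so the gb2312 bytes should survive the trip through Google Translate.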