Java: How to correctly read URL content with UTF-8 characters?
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/4555128/
Asked by Infinity
public class URLReader {
    public static byte[] read(String from, String to, String string) {
        try {
            String text = "http://translate.google.com/translate_a/t?" +
                    "client=o&text=" + URLEncoder.encode(string, "UTF-8") +
                    "&hl=en&sl=" + from + "&tl=" + to + "";
            URL url = new URL(text);
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream(), "UTF-8"));
            String json = in.readLine();
            byte[] bytes = json.getBytes("UTF-8");
            in.close();
            return bytes;
            //return text.getBytes();
        }
        catch (Exception e) {
            return null;
        }
    }
}
and:
public class AbcServlet extends HttpServlet {
    public void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        resp.setContentType("text/plain;charset=UTF-8");
        resp.getWriter().println(new String(URLReader.read("pl", "en", "koń")));
    }
}
When I run this I get: {"sentences"[{"trans":"end","orig":"ko???","translit":"","src_translit":""}],"src":"pl","server_time":30}
So UTF-8 is not handled correctly. But if I return the encoded URL: http://translate.google.com/translate_a/t?client=o&text=ko%C5%84&hl=en&sl=pl&tl=en
and paste it into the browser's address bar, I get the correct result: {"sentences":[{"trans":"horse","orig":"koń","translit":"","src_translit":""}],"dict":[{"pos":"noun","terms":["horse"]}],"src":"pl","server_time":76}
Answered by gigadot
byte[] bytes = json.getBytes("UTF-8");
gives you a UTF-8 byte sequence, so URLReader.read also returns a UTF-8 byte sequence.
But you then decode those bytes without specifying a charset, i.e. new String(URLReader.read("pl", "en", "koń")),
so Java uses your system default encoding to decode them (which, in your case, is not UTF-8).
Try:
new String(URLReader.read("pl", "en", "koń"), "UTF-8")
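The effect of the missing charset argument can be reproduced without any network access. The following is only a minimal sketch: the class name CharsetDemo is hypothetical, and ISO-8859-1 is used as a stand-in for a non-UTF-8 default charset, since the actual default on the asker's machine is not stated.

```java
import java.nio.charset.StandardCharsets;

class CharsetDemo {
    // Decode bytes explicitly as UTF-8, independent of the platform default.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Decode the same bytes as ISO-8859-1 (an assumed stand-in for a
    // non-UTF-8 platform default).
    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Three characters, but four bytes in UTF-8: 6B 6F C5 84.
        byte[] utf8 = "ko\u0144".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeUtf8(utf8).equals("ko\u0144")); // true
        System.out.println(decodeLatin1(utf8).length());          // 4
    }
}
```

Decoding with the wrong charset turns the two bytes of the last character into two separate characters, which is why the string no longer matches the original.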
Update
Here is the fully working code on my machine:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLEncoder;

public class URLReader {
    public static byte[] read(String from, String to, String string) {
        try {
            String text = "http://translate.google.com/translate_a/t?"
                    + "client=o&text=" + URLEncoder.encode(string, "UTF-8")
                    + "&hl=en&sl=" + from + "&tl=" + to + "";
            URL url = new URL(text);
            URLConnection conn = url.openConnection();
            // Making the request look like it comes from a web browser seems to solve the 403 error
            conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)");
            BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
            String json = in.readLine();
            byte[] bytes = json.getBytes("UTF-8");
            in.close();
            return bytes;
            //return text.getBytes();
        } catch (Exception e) {
            System.out.println(e);
            // Be careful with returning null: a caller that uses the result
            // without checking it will throw a NullPointerException.
            return null;
        }
    }
}
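Note that in.readLine() in the code above returns only the first line of the response. If the body might span several lines, one option is to collect all the bytes first and decode them once as UTF-8. This is a sketch only, not part of the original answer; the class name BodyReader and the method readUtf8 are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class BodyReader {
    // Read every byte from the stream, then decode the whole body as UTF-8.
    // Decoding once at the end avoids splitting a multi-byte character
    // across two reads.
    static String readUtf8(InputStream in) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            // Rethrow unchecked so callers see the failure instead of a null.
            throw new UncheckedIOException(e);
        }
    }
}
```

With the URLConnection from the answer, this could be called as BodyReader.readUtf8(conn.getInputStream()); the caller remains responsible for closing the stream.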
Don't forget to escape ń as \u0144. The Java compiler may misread non-ASCII characters if it reads the source file with a different encoding than the one the file was saved in, so it is a good idea to write such characters as plain ASCII escapes.
public class AbcServlet extends HttpServlet {
    @Override
    public void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        resp.setContentType("text/plain;charset=UTF-8");
        byte[] read = URLReader.read("pl", "en", "ko\u0144");
        resp.getOutputStream().write(read);
    }
}
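As a quick check that the escaped literal behaves like the original word, the following sketch percent-encodes it the same way URLReader.read does. The class name EscapeDemo and the helper encode are hypothetical, introduced only for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class EscapeDemo {
    // Percent-encode a query value as UTF-8, mirroring the call in URLReader.read.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // ASCII-only spelling of the Polish word from the question.
        String word = "ko\u0144";
        System.out.println(word.length()); // 3
        System.out.println(encode(word));  // ko%C5%84
    }
}
```

The encoded value matches the text parameter in the URL quoted in the question, which suggests the request side was already correct and only the decoding of the response needed fixing.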