Java: How to correctly read URL content with UTF-8 characters?

Question

Asked by Infinity

    public class URLReader {
        public static byte[] read(String from, String to, String string) {
            try {
                String text = "http://translate.google.com/translate_a/t?"+
                              "client=o&text="+URLEncoder.encode(string, "UTF-8")+
                              "&hl=en&sl="+from+"&tl="+to+"";

                URL url = new URL(text);
                BufferedReader in = new BufferedReader(
                              new InputStreamReader(url.openStream(), "UTF-8"));
                String json = in.readLine();
                byte[] bytes = json.getBytes("UTF-8");
                in.close();
                return bytes;
                //return text.getBytes();
            }
            catch (Exception e) {
                return null;
            }
        }
    }

and:

public class AbcServlet extends HttpServlet {
 public void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
  resp.setContentType("text/plain;charset=UTF-8");
  resp.getWriter().println(new String(URLReader.read("pl", "en", "koń")));
 }
}

When I run this I get:

{"sentences"[{"trans":"end","orig":"ko???","translit":"","src_translit":""}],"src":"pl","server_time":30}

so UTF-8 doesn't work correctly. But if I return the encoded URL:

http://translate.google.com/translate_a/t?client=o&text=ko%C5%84&hl=en&sl=pl&tl=en

and paste it into the browser's URL bar, I get the correct result:

{"sentences":[{"trans":"horse","orig":"koń","translit":"","src_translit":""}],"dict":[{"pos":"noun","terms":["horse"]}],"src":"pl","server_time":76}

Answer 1

Answered by gigadot

byte[] bytes = json.getBytes("UTF-8");

gives you a UTF-8 byte sequence, so URLReader.read also returns a UTF-8 byte sequence.

But you then decode those bytes without specifying a charset, i.e. new String(URLReader.read("pl", "en", "koń")), so Java uses your system default encoding to decode them (which is not UTF-8).

Try:

new String(URLReader.read("pl", "en", "koń"), "UTF-8")
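
To illustrate the point, here is a minimal sketch of what happens when UTF-8 bytes are decoded with a different charset. The class and method names are made up for this example, and ISO-8859-1 only stands in for "some default encoding that is not UTF-8"; the actual default depends on the machine and JVM.

```java
import java.nio.charset.StandardCharsets;

public class DecodeDemo {

    // Decodes with an explicitly named charset, independent of the platform default.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Stand-in for a non-UTF-8 platform default (ISO-8859-1 is an assumption for this demo).
    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // In UTF-8, the letter 'ń' (U+0144) is encoded as the two bytes C5 84.
        byte[] utf8 = "ko\u0144".getBytes(StandardCharsets.UTF_8);

        System.out.println(utf8.length);                          // 4 bytes for 3 characters
        System.out.println(decodeUtf8(utf8).equals("ko\u0144"));  // true: round trip is lossless
        System.out.println(decodeLatin1(utf8).length());          // 4: each byte became its own char
    }
}
```

The general rule this suggests: whenever bytes are turned into a String (or back), name the charset explicitly on both sides.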

Update

Here is the complete code that works on my machine:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLEncoder;

public class URLReader {

    public static byte[] read(String from, String to, String string) {
        try {
            String text = "http://translate.google.com/translate_a/t?"
                    + "client=o&text=" + URLEncoder.encode(string, "UTF-8")
                    + "&hl=en&sl=" + from + "&tl=" + to;
            URL url = new URL(text);
            URLConnection conn = url.openConnection();
            // Sending a browser-like User-Agent header appears to avoid the 403 error.
            conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)");
            BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
            String json = in.readLine();
            byte[] bytes = json.getBytes("UTF-8");
            in.close();
            return bytes;
            //return text.getBytes();
        } catch (Exception e) {
            System.out.println(e);
            // Be careful with returning null: a caller that uses the result
            // without checking it will get a NullPointerException.
            return null;
        }
    }
}

Don't forget to escape ń as \u0144. The Java compiler may not read non-ASCII characters in a source file correctly (for example, when the file was saved in a different encoding than the one the compiler assumes), so it is a good idea to write such characters as plain-ASCII Unicode escapes.
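
As a quick check of the escape, the sketch below shows that "ko\u0144" is the three-character word "koń" and that percent-encoding it as UTF-8 gives the same text=ko%C5%84 parameter that appears in the URL from the question. The class and helper names are illustrative only.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class EscapeDemo {

    // Percent-encodes a query parameter value as UTF-8, the same call URLReader uses.
    static String encodeParam(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be supported by every JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String word = "ko\u0144"; // pure-ASCII source text, independent of the file encoding

        System.out.println(word.length());         // 3
        System.out.println((int) word.charAt(2));  // 324, i.e. 0x144
        System.out.println(encodeParam(word));     // ko%C5%84
    }
}
```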

public class AbcServlet extends HttpServlet {

    @Override
    public void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        resp.setContentType("text/plain;charset=UTF-8");
        byte[] read = URLReader.read("pl", "en", "ko\u0144");
        resp.getOutputStream().write(read);
    }
}
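
One thing to be aware of: URLReader reads only the first line of the response with readLine(). If the response could span several lines, a helper along the following lines might be used instead. This is only a sketch under that assumption; StreamText and readAllUtf8 are hypothetical names, not part of the code above, and the input is simulated with an in-memory stream rather than a real connection.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class StreamText {

    // Reads the whole stream as UTF-8 text. try-with-resources closes the
    // reader (and the underlying stream) even if reading fails.
    static String readAllUtf8(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Simulated two-line response body, encoded as UTF-8.
        byte[] body = "{\"orig\":\"ko\u0144\"}\nsecond line".getBytes(StandardCharsets.UTF_8);
        String text = readAllUtf8(new ByteArrayInputStream(body));
        System.out.println(text.contains("second line")); // true: nothing after the first line is lost
    }
}
```

With a real connection, the stream passed in would be conn.getInputStream(), as in URLReader.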
