Notice: this page is derived from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/1743935/

Time: 2020-08-12 22:09:29  Source: igfitidea

Cannot parse and display non-utf8 characters read from an http request

Tags: java, json, parsing, encoding

Question

I'm using Java to parse the response to this request:

http://ajax.googleapis.com/ajax/services/search/web?start=0&rsz=large&v=1.0&q=rz+img+news+recordid+border

which returns this JSON (truncated for the sake of brevity):

{"responseData":{"results":
<...>
"visibleUrl":"www.coolcook.net",
"cacheUrl":"http://www.google.com/search?q\u003dcache:p4Ke5q6zpnUJ:www.coolcook.net",
"title":"???? ????? - ???? ?????? ??????? ????? ?????",
"titleNoFormatting":"???? ????? - ???? ?????? ??????? ????? ?????","\u003drz+img+news+recordid+border"}}, 
<...>
"responseDetails": null, "responseStatus": 200}

My problem lies in the Arabic characters returned (which could be any non-ASCII characters, for that matter). I tried to convert them back to Unicode using something like:

JSONArray ja = json.getJSONObject("responseData").getJSONArray("results");
JSONObject j = ja.getJSONObject(i);
str = j.getString("titleNoFormatting");
logger.log("before: " + str); // this is just my version of println
enc_str = new String (str.getBytes(), "UTF8");
logger.log("after: " + enc_str);

However, both the 'before' and 'after' results are the same: a string of ????s, regardless of whether I output them to the server log file or to an HTML page. Is there another way to get back the Arabic characters and output them in a webpage?

Does JSON perhaps have any supporting functionality for this sort of problem, so that I can read the non-ASCII characters straight from the JSONObject?

Answer by BalusC

First try this:

str = j.getString("titleNoFormatting");
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("c:/test.txt"), "UTF-8"));
writer.write(str);
writer.close();

Then open the file in Notepad. If it looks fine, then the problem lies in your logger or console, which is not configured to use UTF-8. Otherwise, the problem most likely lies in the JSON API you used, which is not configured to use UTF-8.

Edit: if the problem is actually in the JSON API used and you don't know which one to choose, then I'd recommend Gson. It really eases converting a JSON string to an easy-to-use JavaBean. Here's a basic example:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.List;

import com.google.gson.Gson;

public class Test {

    public static void main(String[] args) throws Exception {
        URL url = new URL("http://ajax.googleapis.com/ajax/services/search/web"
            + "?start=0&rsz=large&v=1.0&q=rz+img+news+recordid+border");
        BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
        GoogleResults results = new Gson().fromJson(reader, GoogleResults.class);

        // Show all results.
        System.out.println(results);

        // Show title of 1st result (is arabic).
        System.out.println(results.getResponseData().getResults().get(0).getTitle());
    }

}

class GoogleResults {

    ResponseData responseData;
    public ResponseData getResponseData() { return responseData; }
    public void setResponseData(ResponseData responseData) { this.responseData = responseData; }
    public String toString() { return "ResponseData[" + responseData + "]"; }

    static class ResponseData {
        List<Result> results;
        public List<Result> getResults() { return results; }
        public void setResults(List<Result> results) { this.results = results; }
        public String toString() { return "Results[" + results + "]"; }
    }

    static class Result {
        private String url;
        private String title;
        public String getUrl() { return url; }
        public String getTitle() { return title; }
        public void setUrl(String url) { this.url = url; }
        public void setTitle(String title) { this.title = title; }
        public String toString() { return "Result[url:" + url +",title:" + title + "]"; }
    }

}

It outputs the results nicely. Hope this helps.

Answer by erickson

The important part of the problem is how you are handling the content of the HTTP response. That is, how are you creating the JSONObject? By the time you get to the code in your original post, the content has already been corrupted.

The request results in UTF-8 encoded data. How are you parsing it into JSON objects? Is the correct encoding specified to the decoder? Or is your platform's default character encoding being used?
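
To make the point about the decoder concrete, here is a minimal sketch. It assumes the response body is available as a byte array; the class name DecodeDemo, the helper decode, and the sample Arabic word are made up for illustration and are not part of the original code. It decodes the same bytes once with UTF-8 and once with ISO-8859-1 to show that a wrong charset corrupts the text before any JSON parsing happens.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeDemo {

    // Hypothetical helper: decodes raw response bytes with an explicitly named charset.
    static String decode(byte[] body, Charset charset) {
        return new String(body, charset);
    }

    public static void main(String[] args) {
        // A sample Arabic word, written with escapes so this source file stays pure ASCII.
        String original = "\u0645\u0631\u062d\u0628\u0627";

        // Stand-in for the bytes of an HTTP response body encoded as UTF-8.
        byte[] body = original.getBytes(StandardCharsets.UTF_8);

        String right = decode(body, StandardCharsets.UTF_8);
        String wrong = decode(body, StandardCharsets.ISO_8859_1);

        System.out.println(right.equals(original)); // true
        System.out.println(wrong.equals(original)); // false
        System.out.println(wrong.length());         // 10: each UTF-8 byte became one char
    }
}
```

If the reader or parser is created without naming a charset, the platform default is used instead, which behaves like the second call whenever that default is not UTF-8.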

Answer by ZZ Coder

The Google API correctly sends UTF-8. I think the problem is that your default encoding is not capable of outputting Arabic. Check your file.encoding property, or get the encoding like this:

public static String getDefaultCharSet() throws IOException {
    OutputStreamWriter writer = new OutputStreamWriter(new ByteArrayOutputStream());
    return writer.getEncoding();
}

If the default encoding is ASCII or Latin-1, you will get "?"s. You need to change it to UTF-8.
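
The effect can be reproduced without touching the platform settings. The sketch below is only an illustration: the class name QuestionMarkDemo and the helper roundTrip are hypothetical. It encodes an Arabic string with a named charset and decodes it again, which is roughly what happens when text passes through a writer that uses that charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class QuestionMarkDemo {

    // Hypothetical helper: encodes text to bytes and decodes it back with the same charset.
    // String.getBytes(Charset) replaces unmappable characters with the charset's
    // replacement byte, which is '?' for US-ASCII and ISO-8859-1.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String arabic = "\u0645\u0631\u062d\u0628\u0627"; // sample Arabic word

        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println(roundTrip(arabic, StandardCharsets.US_ASCII));             // ?????
        System.out.println(roundTrip(arabic, StandardCharsets.ISO_8859_1));           // ?????
        System.out.println(roundTrip(arabic, StandardCharsets.UTF_8).equals(arabic)); // true
    }
}
```

Charset.defaultCharset() reports the same information as the getDefaultCharSet() method above, without creating a writer.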

Answer by Gareth Davis

The issue you have is most likely caused by an incorrect character encoding setting at the point where you read the HTTP response from Google. Can you post the code that actually fetches the URL and parses it into the JSON object?

As an example, run the following:

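import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.StringWriter;
import java.net.HttpURLConnection;
import java.net.URL;

public class Test1 {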
  public static void main(String [] args) throws Exception {

    // just testing that the console can output the correct chars
    System.out.println("\"title\":\"???? ????? - ???? ?????? ??????? ????? ?????");

    URL url = new URL("http://ajax.googleapis.com/ajax/services/search/web?start=0&rsz=large&v=1.0&q=rz+img+news+recordid+border");
    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
    InputStream is  = connection.getInputStream();

    // the important bit is here..........................\/\/\/
    InputStreamReader reader = new InputStreamReader(is, "utf-8");


    StringWriter sw = new StringWriter();

    char [] buffer = new char[1024 * 8];
    int count ;

    while( (count = reader.read(buffer)) != -1){
      sw.write(buffer, 0, count);
    }

    System.out.println(sw.toString());
  }
}

This uses the rather ugly standard URL.openConnection() that's been around since the dawn of time. If you are using something like Apache HttpClient, then you can do this really easily.

For a bit of background reading on encoding, and maybe an explanation of why new String (str.getBytes(), "UTF8"); will never work, read Joel's article on Unicode.
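
As a rough illustration of why that expression cannot help, the sketch below applies the same encode-then-decode step to a string that has already been reduced to question marks. The class name NoOpFixDemo and the helper attemptRepair are hypothetical, and the charset is named explicitly here (the original call uses the platform default for getBytes()), so the result does not depend on the machine it runs on.

```java
import java.nio.charset.StandardCharsets;

class NoOpFixDemo {

    // The attempted repair from the question, with both charsets made explicit.
    // Encoding a String and decoding it with the same charset returns the same String.
    static String attemptRepair(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String intact = "\u0645\u0631\u062d\u0628\u0627"; // sample Arabic word

        // Simulate earlier damage: the text went through an ASCII-only step.
        String damaged = new String(intact.getBytes(StandardCharsets.US_ASCII),
                                    StandardCharsets.US_ASCII);

        System.out.println(damaged);                                // ?????
        System.out.println(attemptRepair(damaged).equals(damaged)); // true: nothing recovered
        System.out.println(attemptRepair(intact).equals(intact));   // true: nothing changed
    }
}
```

Once the characters have been replaced by '?', the original code points are gone from the String, so the fix has to happen earlier, where the bytes are first decoded.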

Answer by Marc Hacker

I think the JSON.org Java JSON package cannot handle UTF-8, whether the text is passed in as an actual UTF-8 character or as a \uXXXX escape code. I tried both as follows:

import junit.framework.TestCase;
import org.json.JSONException;
import org.json.JSONObject;

public class JsonTest extends TestCase {
    public void testParseText() {
        try {
            JSONObject json1 = new JSONObject("{\"a\":\"\u05dd\"}");  // Java-level escape: the parser receives the Hebrew character itself
            JSONObject json2 = new JSONObject("{\"a\":\"\\u05dd\"}"); // JSON-level escape: the parser receives the six characters \u05dd
            System.out.println(json1.toString());
            System.out.println(json2.toString());
        } catch (JSONException e) {
            e.printStackTrace();
        }
    }
}

I get:

{"a":"?"}
{"a":"?"}

Any ideas?

Answer by lisak

There is a library which preserves the encoding of an HTTP response (Czech text, in my case) that carries a JSON message, like this:

private static String inputStreamToString(final InputStream inputStream) throws Exception {
    final StringBuilder outputBuilder = new StringBuilder();

    try {
        String string;
        if (inputStream != null) {
            BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream, "UTF-8"));
            while (null != (string = reader.readLine())) {
                outputBuilder.append(string).append('\n');
            }
        }
    } catch (Exception ex) {
        throw new Exception("[google-api-translate-java] Error reading translation stream.", ex);
    }

    return outputBuilder.toString();
}

The answer is tricky and there are a few points one must pay attention to, mainly to platform encoding:

As far as I know, the platform encoding affects printing to the console, creating files from an InputStream, and even communication between a DB client and server, even when both are set to use the UTF-8 charset. No matter whether I explicitly create a UTF-8 string or InputStreamReader, or set the JDBC driver to UTF-8, I still have to set the $LANG variable to xx_XX.UTF-8 on Linux systems and add append=" vt.default_utf8=1" to the LILO boot loader configuration (on systems that use it). This must be done at least on systems that run a database and Java apps working with UTF-8 encoded files.

Even with the JVM parameter -Dfile.encoding=UTF-8, I could not get properly encoded streams until the platform encoding was set as well. If you are going to persist the strings to a database, the JDBC connector must also be set up properly, for example "jdbc:mysql://localhost/DBname?useUnicode=true&characterEncoding=UTF8", and the database should be in this state:

mysql> SHOW VARIABLES LIKE 'character\_set\_%';
+--------------------------+--------+
| Variable_name            | Value  |
+--------------------------+--------+
| character_set_client     | utf8   |
| character_set_connection | utf8   |
| character_set_database   | utf8   |
| character_set_filesystem | binary |
| character_set_results    | utf8   |
| character_set_server     | utf8   |
| character_set_system     | utf8   |
+--------------------------+--------+