UTF-8 character encoding in Java

Question

Asked by cambo

I am having some problems getting some French text to convert to UTF8 so that it can be displayed properly, either in a console, text file or in a GUI element.

The original string is

HANDICAP╔ES

which is supposed to be

HANDICAPÉES

Here is a code snippet that shows how I am using the Jackcess Database driver to read in the Access MDB file in an Eclipse/Linux environment.

Database database = Database.open(new File(filepath));
Table table = database.getTable(tableName, true);
Iterator<Map<String, Object>> rowIter = table.iterator();
while (rowIter.hasNext()) {
    Map<String, Object> row = rowIter.next();
    // convert fields to UTF
    Map<String, Object> rowUTF = new HashMap<String, Object>();
    try {
        for (String key : row.keySet()) {
            Object o = row.get(key);
            if (o != null) {
                String valueCP850 = o.toString();
                // String nameUTF8 = new String(valueCP850.getBytes("CP850"), "UTF8"); // does not work!
                String valueISO = new String(valueCP850.getBytes("CP850"), "ISO-8859-1");
                String valueUTF8 = new String(valueISO.getBytes(), "UTF-8"); // works!
                rowUTF.put(key, valueUTF8);
            }
        }
    } catch (UnsupportedEncodingException e) {
        System.err.println("Encoding exception: " + e);
    }
}

In the code you'll see where I want to convert directly to UTF8, which doesn't seem to work, so I have to do a double conversion. Also note that there doesn't seem to be a way to specify the encoding type when using the Jackcess driver.

Thanks, Cam

Answer 1

Accepted answer by Alan Moore

New analysis, based on new information.
It looks like your problem is with the encoding of the text before it was stored in the Access DB. It seems it had been encoded as ISO-8859-1 or windows-1252, but decoded as cp850, resulting in the string HANDICAP╔ES being stored in the DB.

Having correctly retrieved that string from the DB, you're now trying to reverse the original encoding error and recover the string as it should have been stored: HANDICAPÉES. And you're accomplishing that with this line:

String valueISO = new String(valueCP850.getBytes("CP850"), "ISO-8859-1");

getBytes("CP850") converts the character ╔ to the byte value 0xC9, and the String constructor decodes that according to ISO-8859-1, resulting in the character É. The next line:

String valueUTF8 = new String(valueISO.getBytes(), "UTF-8");

...does nothing. getBytes() encodes the string in the platform default encoding, which is UTF-8 on your Linux system. Then the String constructor decodes it with the same encoding. Delete that line and you should still get the same result.
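
The two steps discussed above can be checked in isolation. The following is a minimal sketch, not the asker's actual code: the class name MojibakeRepair and its helper methods are made up for illustration, and it assumes the JDK's extended charsets (which provide IBM850, the canonical name for cp850) are installed. Unicode escapes are used for the non-ASCII characters so the result does not depend on the source file's own encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeRepair {
    // Assumption: the runtime ships the IBM850 (cp850) charset.
    static final Charset CP850 = Charset.forName("IBM850");

    /** Re-encodes with cp850 to get the original bytes back, then decodes them as ISO-8859-1. */
    static String repair(String garbled) {
        return new String(garbled.getBytes(CP850), StandardCharsets.ISO_8859_1);
    }

    /** Encodes and decodes with the same charset, like the second line of the question's code. */
    static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String stored = "HANDICAP\u2554ES";            // \u2554 is the box-drawing character
        String repaired = repair(stored);

        // The box-drawing character maps to the single byte 0xC9 in cp850.
        int b = "\u2554".getBytes(CP850)[0] & 0xFF;
        System.out.println(Integer.toHexString(b));                  // c9

        // 0xC9 decoded as ISO-8859-1 is \u00C9 (E with acute accent).
        System.out.println(repaired.equals("HANDICAP\u00C9ES"));     // true

        // Encoding and decoding with the same charset changes nothing.
        System.out.println(roundTrip(repaired, StandardCharsets.UTF_8).equals(repaired)); // true
    }
}
```

The program prints hex values and booleans rather than the strings themselves, so the console's own encoding cannot distort what you see.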

More to the point, your attempt to create a "UTF-8 string" was misguided. You don't need to concern yourself with the encoding of Java's strings--they're always UTF-16. When bringing text into a Java app, you just need to make sure you decode it with the correct encoding.

And if my analysis is correct, your Access driver is decoding it correctly; the problem is at the other end, possibly before the DB even comes into the picture. That's what you need to fix, because that new String(getBytes()) hack can't be counted on to work in all cases.
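
To illustrate why the hack is fragile, here is a small sketch of what happens when it is applied to text that was stored correctly in the first place, or that contains characters cp850 cannot represent. The class name RepairHackPitfall is hypothetical, and the sketch again assumes the IBM850 charset is available in the runtime.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RepairHackPitfall {
    // Assumption: the runtime ships the IBM850 (cp850) charset.
    static final Charset CP850 = Charset.forName("IBM850");

    /** The same "repair" as in the question: encode as cp850, decode as ISO-8859-1. */
    static String repair(String s) {
        return new String(s.getBytes(CP850), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // A value that was stored correctly, so it needs no repair.
        String correct = "HANDICAP\u00C9ES";
        String damaged = repair(correct);
        System.out.println(damaged.equals(correct));                 // false

        // cp850 encodes \u00C9 as byte 0x90, which ISO-8859-1 decodes to the control character U+0090.
        System.out.println(Integer.toHexString(damaged.charAt(8)));  // 90

        // The euro sign has no cp850 mapping at all, so it cannot survive the conversion.
        String euro = "\u20AC";
        System.out.println(repair(euro).equals(euro));               // false
    }
}
```

So the conversion only helps for rows that suffered exactly this one mis-decoding; any row that is already correct gets damaged by it.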

Original analysis, based on no information. :-/
If you're seeing HANDICAP╔ES on the console, there's probably no problem. Given this code:

System.out.println("HANDICAPÉES");

The JVM converts the (Unicode) string to the platform default encoding, windows-1252, before sending it to the console. Then the console decodes that using its own default encoding, which happens to be cp850. So the console displays it wrong, but that's normal. If you want it to display correctly, you can change the console's encoding with this command:

CHCP 1252
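
The mismatch between the JVM's output encoding and the console's encoding can be simulated without a Windows console. This is only a sketch: the class ConsoleMismatch and the method asSeenOnConsole are invented names, and it assumes the runtime provides both the windows-1252 and IBM850 charsets.

```java
import java.nio.charset.Charset;

class ConsoleMismatch {
    /**
     * Simulates the path text takes to the screen: the JVM encodes it with one charset,
     * and the console decodes the resulting bytes with another.
     */
    static String asSeenOnConsole(String text, Charset jvmEncoding, Charset consoleEncoding) {
        byte[] sent = text.getBytes(jvmEncoding);
        return new String(sent, consoleEncoding);
    }

    public static void main(String[] args) {
        Charset windows1252 = Charset.forName("windows-1252");
        Charset cp850 = Charset.forName("IBM850");
        String text = "HANDICAP\u00C9ES";

        // Mismatched encodings: the accented letter turns into the box-drawing character \u2554.
        String shown = asSeenOnConsole(text, windows1252, cp850);
        System.out.println(shown.equals("HANDICAP\u2554ES"));   // true

        // Matching encodings (the effect of CHCP 1252): the text arrives intact.
        String fixed = asSeenOnConsole(text, windows1252, windows1252);
        System.out.println(fixed.equals(text));                  // true
    }
}
```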

To display the string in a GUI element, such as a JLabel, you don't have to do anything special. Just make sure you use a font that can display all the characters, but that shouldn't be a problem for French.

As for writing to a file, just specify the desired encoding when you create the Writer:

OutputStreamWriter osw = new OutputStreamWriter(
    new FileOutputStream("myFile.txt"), "UTF-8");
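
A slightly fuller version of the same idea might look like the sketch below. The class name Utf8FileWrite and the temp-file name are placeholders; it uses StandardCharsets.UTF_8 instead of the string "UTF-8" so no UnsupportedEncodingException needs handling, and try-with-resources so the writer is always closed.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8FileWrite {
    /** Writes the text to the file as UTF-8, regardless of the platform default encoding. */
    static void writeUtf8(Path file, String text) throws IOException {
        try (Writer w = new OutputStreamWriter(
                new FileOutputStream(file.toFile()), StandardCharsets.UTF_8)) {
            w.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("encoding-demo", ".txt");
        writeUtf8(file, "HANDICAP\u00C9ES");

        // 11 characters, but the accented letter takes two bytes in UTF-8 (0xC3 0x89).
        byte[] bytes = Files.readAllBytes(file);
        System.out.println(bytes.length);                                   // 12
        System.out.println(Integer.toHexString(bytes[8] & 0xFF)
                + " " + Integer.toHexString(bytes[9] & 0xFF));              // c3 89
        Files.delete(file);
    }
}
```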

Answer 2

Answer by BalusC

String s = "HANDICAP╔ES";
System.out.println(new String(s.getBytes("CP850"), "ISO-8859-1")); // HANDICAPÉES

This shows the correct string value. This means that it was originally encoded/decoded with ISO-8859-1 and then incorrectly encoded with CP850 (originally CP1252, a.k.a. Windows ANSI, as pointed out in a comment, is indeed also possible, since É has the same code point there as in ISO-8859-1).

Align your environment and binary pipelines to all use one and the same character encoding. You can't and shouldn't convert between them. You would risk losing information in the non-ASCII range that way.

Note: do NOT use the above code snippet to "fix" the problem! That would not be the right solution.

Update: you are apparently still struggling with the problem. I'll repeat the important parts of the answer:

Align your environment and binary pipelines to all use one and the same character encoding.
You can not and should not convert between them. You would risk losing information in the non-ASCII range that way.
Do NOT use the above code snippet to "fix" the problem! That would not be the right solution.

To fix the problem you need to choose a character encoding X which you'd like to use throughout the entire application. I suggest UTF-8. Update MS Access to use encoding X. Update your development environment to use encoding X. Update the java.io readers and writers in your code to use encoding X. Update your editor to read/write files with encoding X. Update the application's user interface to use encoding X. Do not use Y or Z or whatever at some step. If the characters are already corrupted in some datastore (MS Access, files, etc), then you need to fix it by manually replacing the characters right there in the datastore. Do not use Java for this.
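
For the java.io part of that advice, one way to keep readers and writers aligned is to route them all through a single charset constant. The sketch below is an illustration under that assumption; the class SingleEncodingIo and its ENCODING constant are hypothetical names, with UTF-8 standing in for "encoding X".

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class SingleEncodingIo {
    // The one encoding used everywhere; never rely on the platform default.
    static final Charset ENCODING = StandardCharsets.UTF_8;

    static void write(Path file, String text) throws IOException {
        try (BufferedWriter w = Files.newBufferedWriter(file, ENCODING)) {
            w.write(text);
        }
    }

    static String read(Path file) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(Files.newInputStream(file), ENCODING)) {
            char[] buf = new char[1024];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("single-encoding", ".txt");
        String text = "HANDICAP\u00C9ES";
        write(file, text);
        System.out.println(read(file).equals(text));   // true
        Files.delete(file);
    }
}
```

Because both directions use the same constant, a later change of encoding is a one-line edit instead of a hunt through every reader and writer.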

If you're actually using the "command prompt" as user interface, then you're actually lost. It doesn't support UTF-8. As suggested in the comments and in the article linked in the comments, you need to create a Swing application instead of relying on the restricted command prompt environment.

Answer 3

Answer by leylek

Using "ISO-8859-1" helped me deal with the French characters.

Answer 4

Answer by Xupypr MV

You can specify the encoding when establishing the connection. This worked perfectly and solved my encoding problem:

    DatabaseImpl open = DatabaseImpl.open(new File("main.mdb"), true, null, Database.DEFAULT_AUTO_SYNC, java.nio.charset.Charset.availableCharsets().get("windows-1251"), null, null);
    Table table = open.getTable("FolderInfo");
