Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow. Original URL: http://stackoverflow.com/questions/4016671/

Time: 2020-10-30 04:24:07  Source: igfitidea

How to parse a string that is in a different encoding from java

Tags: java, character-encoding

Asked by Derek

I have a string that I have read in from a Word document. I think it is in "Cp1252" encoding. Java uses UTF8.

How do I search that string for those special characters in Cp1252 and replace them with an appropriate UTF8 character?

Specifically, I want to replace the "En Dash" character with a plain "-".

The following code block takes the projDateString, which comes from the Word document, and tries to do that:

    byte[] test = projDateString.getBytes("Cp1252");
    for (int i = 0; i < test.length; i++) {
        System.out.println("test[" + i + "] = " + Integer.toHexString((byte) test[i]));
    }
    String projDateString2 = new String(test);
    projDateString2.replaceAll("\0x96", "\u2013");
    System.out.println("projDateString2: " + projDateString);

I am not sure I am setting up projDateString2 correctly. As you can see, the hex value of that dash is ffffff96 when I call getBytes on the string using the Cp1252 encoding. If I call getBytes with UTF8, it comes in as 3 hex values instead of one.

This gives me the following output:

    test[0] = 30
    test[1] = 38
    test[2] = 2f
    test[3] = 32
    test[4] = 30
    test[5] = 31
    test[6] = 30
    test[7] = 20
    test[8] = ffffff96
    test[9] = 20
    test[10] = 50
    test[11] = 72
    test[12] = 65
    test[13] = 73
    test[14] = 65
    test[15] = 6e
    test[16] = 74
    projDateString2: 08/2010 Γ?? Present

As you can see, the replace did nothing, and the println still gives me garbage chars instead of a plaintext "-".

Answered by Jon Skeet

Java strings are always in UTF-16, at least as far as the API is concerned... but you can generally just think of them as being "Unicode". The fact that they're UTF-16 is only really relevant when it comes to characters outside the Basic Multilingual Plane, i.e. with Unicode values above U+FFFF. They have to be represented as surrogate pairs in Java. But I don't think you need to worry about this in your case. So just think of the values in Strings as "Unicode text" without a specific encoding... in particular, definitely not in UTF-8 or CP1252. Those are the encodings used to convert binary data (e.g. a byte array) into text data (e.g. a string).
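To make the UTF-16 point concrete, here is a small sketch. The class and method names are made up for illustration. It compares the number of UTF-16 code units with the number of code points for the en dash (inside the BMP) and for U+1D11E (outside it).

```java
// Illustrative sketch only: Utf16Demo and its methods are hypothetical names.
final class Utf16Demo {
    // Number of UTF-16 code units (Java chars) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String enDash = "\u2013";                              // U+2013, inside the BMP
        String clef = new String(Character.toChars(0x1D11E)); // U+1D11E, outside the BMP

        System.out.println(codeUnits(enDash) + " / " + codePoints(enDash)); // 1 / 1
        System.out.println(codeUnits(clef) + " / " + codePoints(clef));     // 2 / 1
        // The first char of the pair is a high surrogate.
        System.out.println(Character.isHighSurrogate(clef.charAt(0)));      // true
    }
}
```

The en dash from the question is a single char, so surrogate pairs should not matter for this particular problem.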

You shouldn't be using String.getBytes() or new String(byte[]) without specifying the encoding - that's the problem. Those always use the platform default encoding - which is almost always the wrong choice.
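As a rough illustration of naming the encoding explicitly, the sketch below encodes the en dash in both Cp1252 and UTF-8 and prints the resulting bytes. It assumes the runtime supports the "windows-1252" charset (the standard name for Cp1252). The class name and the hex helper are invented for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: ExplicitCharsetDemo and hex() are hypothetical names.
final class ExplicitCharsetDemo {
    // Assumption: the JVM provides windows-1252 (Java's alias for it is Cp1252).
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Formats bytes as lowercase two-digit hex values separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            // Mask with 0xff so negative bytes do not print as ffffff96.
            sb.append(String.format("%02x ", b & 0xff));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String enDash = "\u2013";
        System.out.println(hex(enDash.getBytes(CP1252)));                 // 96
        System.out.println(hex(enDash.getBytes(StandardCharsets.UTF_8))); // e2 80 93
    }
}
```

This matches what the question observed: one byte in Cp1252 and three bytes in UTF-8 for the same character.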

You say you "have a string that I have read in from a Word document" - how did you read it in? How did it start off life?

If you have the bytes and you know the relevant encoding, you should use:

    String text = new String(bytes, encoding);
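For a worked instance, the sketch below decodes the byte values from the question's hex dump as Cp1252. It assumes those bytes really are Cp1252 and that the runtime supports the "windows-1252" charset. DecodeDemo and decodeCp1252 are hypothetical names.

```java
import java.nio.charset.Charset;

// Illustrative sketch only: DecodeDemo and decodeCp1252 are hypothetical names.
final class DecodeDemo {
    // Decodes raw bytes as windows-1252 (Cp1252) into a Java String.
    static String decodeCp1252(byte[] bytes) {
        return new String(bytes, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // Byte values taken from the hex dump in the question.
        byte[] bytes = {
            0x30, 0x38, 0x2f, 0x32, 0x30, 0x31, 0x30, 0x20,
            (byte) 0x96,
            0x20, 0x50, 0x72, 0x65, 0x73, 0x65, 0x6e, 0x74
        };
        String text = decodeCp1252(bytes);
        System.out.println(text.equals("08/2010 \u2013 Present")); // true
        System.out.println(Integer.toHexString(text.charAt(8)));   // 2013
    }
}
```

Once decoded this way, byte 0x96 has become the char U+2013, so any later search should look for that char rather than for 0x96.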

You should never have to deal with a string which has been created using the wrong encoding - if you get to that stage, you're almost bound to be risking information loss. Tackle the problem as early as you possibly can, rather than trying to fix the data up later on.

The next thing to understand is that the String class in Java is immutable. Calling replaceAll on a string won't change the existing string. It will instead return a new string with the replacements made.

So this statement:

    projDateString2.replaceAll("\0x96", "\u2013");

will never do what you want. Even if everything else is correct, you should be using:

    projDateString2 = projDateString2.replaceAll("\0x96", "\u2013");

(or something similar). I don't think that actually will do what you want anyway, but you need to be aware of it for when everything else is sorted out.
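The immutability point can be checked with a short sketch. It assumes the string already holds a real en dash (U+2013). ImmutableDemo and withPlainDash are hypothetical names.

```java
// Illustrative sketch only: ImmutableDemo and withPlainDash are hypothetical names.
final class ImmutableDemo {
    // Returns a new String with en dashes replaced; the argument is left untouched.
    static String withPlainDash(String s) {
        return s.replaceAll("\u2013", "-");
    }

    public static void main(String[] args) {
        String original = "08/2010 \u2013 Present";

        // The result is discarded here, so nothing observable changes.
        original.replaceAll("\u2013", "-");
        System.out.println(original.contains("\u2013")); // true

        // Keeping the returned value is what makes the replacement visible.
        String fixed = withPlainDash(original);
        System.out.println(fixed); // 08/2010 - Present
    }
}
```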

Answered by Bozho

Conversion is generally done by something like this:

    String properlyEncoded =
        new String(original.getBytes(originalEncoding), newEncoding);

Note that it is not unlikely that some information is lost during the conversion.
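To show how that loss can happen, here is a hedged sketch of two cases: bytes encoded in one charset and decoded in another, and a character that the target charset cannot represent at all. The reinterpret helper and the class name are invented for illustration, and the runtime is assumed to support "windows-1252".

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: LossDemo and reinterpret are hypothetical names.
final class LossDemo {
    // Encodes s with one charset, then decodes the resulting bytes with another.
    static String reinterpret(String s, Charset from, Charset to) {
        return new String(s.getBytes(from), to);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");

        // The en dash is the single byte 0x96 in Cp1252, which is not valid UTF-8
        // on its own, so the decoder substitutes the replacement character U+FFFD.
        String garbled = reinterpret("\u2013", cp1252, StandardCharsets.UTF_8);
        System.out.println(Integer.toHexString(garbled.charAt(0))); // fffd

        // U+4E2D has no ISO-8859-1 encoding, so getBytes substitutes '?'.
        String lost = reinterpret("\u4e2d", StandardCharsets.ISO_8859_1,
                                  StandardCharsets.ISO_8859_1);
        System.out.println(lost); // ?
    }
}
```

In both cases the original character cannot be recovered from the result.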

Answered by adietrich

First you need to make sure that you properly convert from CP1252 bytes to Java's character representation (which is UTF-16). Since you're using a library for parsing .docx files, this has probably happened.

Now all you need to do is call projDateString.replace('\u2013', '-') and do something with the return value. No need for replaceAll(), since you're not working with regular expressions.
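A minimal sketch of that suggestion follows, assuming the string was already decoded correctly so that the dash is U+2013. DashDemo and normalizeDash are hypothetical names. The last two lines show why replaceAll() is a different tool: its first argument is a regular expression.

```java
// Illustrative sketch only: DashDemo and normalizeDash are hypothetical names.
final class DashDemo {
    // Replaces every en dash (U+2013) with a plain hyphen and returns the new String.
    static String normalizeDash(String s) {
        return s.replace('\u2013', '-');
    }

    public static void main(String[] args) {
        System.out.println(normalizeDash("08/2010 \u2013 Present")); // 08/2010 - Present

        // replace() treats its arguments literally; replaceAll() treats the first as a regex.
        System.out.println("a.b".replace(".", "-"));    // a-b
        System.out.println("a.b".replaceAll(".", "-")); // ---
    }
}
```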