Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/2792778/

Time: 2020-10-29 22:51:13  Source: igfitidea

Comparing utf-8 strings in java

Tags: java, unicode

Asked by cppdev

In my Java program, I am retrieving some data from XML. This XML has a few international characters and is encoded in UTF-8. I read this XML using an XML parser. Once I retrieve a particular international string from the XML parser, I need to compare it with a set of predefined strings. The problem is that when I use String.equals on an international string, the comparison fails.

How do I compare strings containing international characters in Java? I am using SAXParser and XMLReader to read strings from the XML.

Here's the line that compares the strings:

    String country;
    country = getXMLNodeString();

    if (country.equals("Côte d'Ivoire"))
    {
        // handle the match
    }

    String getXMLNodeString() throws Exception
    {
        /* Get a SAXParser from the SAXParserFactory. */
        SAXParserFactory spf = SAXParserFactory.newInstance();
        SAXParser sp = spf.newSAXParser();

        /* Get the XMLReader of the SAXParser we created. */
        XMLReader xr = sp.getXMLReader();
        /* Create a new ContentHandler and apply it to the XMLReader. */
        XmlParser xmlParser = new XmlParser();  // my class to parse xml
        xr.setContentHandler(xmlParser);

        /* Parse the XML data from our URL. */
        xr.parse(new InputSource(url.openStream()));
        /* Parsing has finished. */

        // return string here
    }

Answered by cletus

Java stores Strings internally as an array of chars, which are 16-bit unsigned values. This was based on an earlier Unicode standard that supported only 64K characters.

Your String constant "Côte d'Ivoire" is in this format. If the character encoding of your XML document is correct, then the String read from it will also be in the correct format. So the possible errors are:

  1. The XML document doesn't declare a character encoding;

  2. The declared character encoding does not match the actual character encoding used.
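To make the effect of such a mismatch concrete, here is a minimal sketch that decodes the same UTF-8 bytes with the right and the wrong charset. The class and method names are made up for illustration, and the literal is written with a \u00f4 escape so the example does not depend on how the source file is saved.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {
    // Decode raw bytes using the given charset (hypothetical helper).
    public static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String expected = "C\u00f4te d'Ivoire"; // \u00f4 is the character 'ô'
        byte[] utf8 = expected.getBytes(StandardCharsets.UTF_8);

        // Correct decoding round-trips to an equal String.
        String right = decode(utf8, StandardCharsets.UTF_8);
        // Wrong decodings: the two bytes of 'ô' become two unrelated chars.
        String asLatin1 = decode(utf8, StandardCharsets.ISO_8859_1);
        String asAscii = decode(utf8, StandardCharsets.US_ASCII);

        System.out.println(expected.equals(right));    // true
        System.out.println(expected.equals(asLatin1)); // false
        System.out.println(expected.equals(asAscii));  // false
    }
}
```

If the parser picks the wrong encoding, the resulting String is one char longer than the constant, so equals() cannot succeed.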

Perhaps the XML string is being treated as US-ASCII instead of UTF-8. I would output both and eyeball them. If they look the same, compare them character by character to see where the comparison fails. You may also want to compare the UTF-8 encoding of your constant String to what's in the XML document:

byte[] bytes = "Côte d'Ivoire".getBytes("UTF-8");
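A rough sketch of the character-by-character check suggested above might look like this. The helper names (firstMismatch, dumpChars, dumpUtf8) are hypothetical, and the fromXml value is only a stand-in for whatever your parser actually returns.

```java
import java.nio.charset.StandardCharsets;

public class StringDiff {
    // Index of the first differing char, or -1 if the strings are equal.
    public static int firstMismatch(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            if (a.charAt(i) != b.charAt(i)) {
                return i;
            }
        }
        return a.length() == b.length() ? -1 : n;
    }

    // Render every char as U+XXXX so invisible differences show up.
    public static String dumpChars(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    // Render the UTF-8 encoding of the string as hex bytes.
    public static String dumpUtf8(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String constant = "C\u00f4te d'Ivoire";
        // Assumed example of a mis-decoded value; replace with the parser result.
        String fromXml = "C\u00c3\u00b4te d'Ivoire";

        System.out.println(firstMismatch(constant, fromXml)); // 1
        System.out.println(dumpChars(constant));
        System.out.println(dumpChars(fromXml));
        System.out.println(dumpUtf8(constant));
    }
}
```

Printing code points rather than the strings themselves avoids being misled by the console's own encoding.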

It gets more complicated when you start getting into "supplementary characters". These are characters beyond the originally intended 64K ("code points" in Unicode parlance). See Supplementary Characters in the Java Platform. This isn't an issue with any of the characters you're using but it's worth noting for completeness.
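As a small illustration of the supplementary-character point, the sketch below builds a String from a code point above U+FFFF and compares its char count with its code point count. U+1F600 is just an arbitrary example, and the class and method names are invented for this sketch.

```java
public class SupplementaryDemo {
    // Number of Unicode code points, as opposed to 16-bit chars.
    public static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the original 64K range.
        String s = new String(Character.toChars(0x1F600));

        System.out.println(s.length());     // 2 (a surrogate pair)
        System.out.println(codePoints(s));  // 1
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true

        // For the string in the question, both counts agree.
        String country = "C\u00f4te d'Ivoire";
        System.out.println(country.length() == codePoints(country)); // true
    }
}
```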

Answered by John Flatness

Since you're comparing with a string literal, you need to make sure that you're saving your source file in the same encoding that javac is expecting. You can also specify what encoding your source files are in with the -encoding argument to javac.

That seems like the most likely "gotcha" in this scenario.

Note that I'm talking about the encoding of your Java source code, not the XML document.
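One way to sidestep the source-encoding question, offered here as an alternative rather than as part of the answer above, is to write the literal with a Unicode escape so the source file stays pure ASCII. A minimal sketch, with an invented class name:

```java
public class LiteralEncodingDemo {
    // ASCII-only source: \u00f4 is 'ô', so the result does not depend
    // on the encoding javac assumes for this file.
    public static final String COUNTRY = "C\u00f4te d'Ivoire";

    public static void main(String[] args) {
        System.out.println(COUNTRY.length());              // 13
        System.out.println(COUNTRY.codePointAt(1) == 0xF4); // true
    }
}
```

If the literal in your own file does not behave like this one, the file's saved encoding and the encoding javac assumes probably disagree.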

Answered by Wyzard

Java strings are always UTF-16. Your XML parser should be converting the file's UTF-8 characters into UTF-16 while reading, and your own strings are already UTF-16 in memory, so you can compare them with an ordinary equals() call. If they aren't comparing equal when you think they should, the problem is likely something else.
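To check this claim in isolation, the following sketch parses a small in-memory UTF-8 document with SAX and compares the text with a constant. The element name, the class name, and the textOf helper are assumptions for illustration, not the asker's actual XmlParser class.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxUtf8Demo {
    // Parse the XML bytes and return all character data concatenated.
    public static String textOf(byte[] xmlBytes) throws Exception {
        final StringBuilder text = new StringBuilder();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new ByteArrayInputStream(xmlBytes)),
            new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    // SAX may deliver one text node in several chunks,
                    // so append instead of overwriting.
                    text.append(ch, start, length);
                }
            });
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                   + "<country>C\u00f4te d'Ivoire</country>";
        String country = textOf(xml.getBytes(StandardCharsets.UTF_8));

        System.out.println(country.equals("C\u00f4te d'Ivoire")); // true
    }
}
```

If this works on your machine but your real code does not, the difference is likely in the input file, the source file encoding, or how your handler collects text.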

Answered by anchorite

If your XML file is tagged as UTF-8 in its XML declaration and the file is saved as an actual UTF-8 file, you can use contentEquals(literal or string) like so:

if (strMyvalue.contentEquals("Côte d'Ivoire")) {
    // execute
}
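For two String values, contentEquals gives the same result as equals. Where it differs is when the parsed text is held in another CharSequence such as a StringBuilder, which is a common way to accumulate SAX character data. A small sketch, with a hypothetical matches helper:

```java
public class ContentEqualsDemo {
    // Compare any CharSequence (String, StringBuilder, ...) with a String.
    public static boolean matches(CharSequence parsed, String expected) {
        return expected.contentEquals(parsed);
    }

    public static void main(String[] args) {
        String expected = "C\u00f4te d'Ivoire";
        StringBuilder buffer = new StringBuilder("C\u00f4te d'Ivoire");

        System.out.println(expected.equals(buffer));       // false: not a String
        System.out.println(matches(buffer, expected));     // true
        System.out.println(matches(buffer.toString(), expected)); // true
    }
}
```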