Reading Unicode characters in Java

Question

Asked by Rakesh

I'm a bit new to Java. I assign a Unicode string to a variable and print it, and then read the same text from a file and print that:

  String str = "\u0142o\u017Cy\u0142";
  System.out.println(str);

  final StringBuilder stringBuilder = new StringBuilder();
  InputStream inStream = new FileInputStream("C:/a.txt");
  final InputStreamReader streamReader = new InputStreamReader(inStream, "UTF-8");
  final BufferedReader bufferedReader = new BufferedReader(streamReader);
  String line = "";
  while ((line = bufferedReader.readLine()) != null) {
      System.out.println(line);
      stringBuilder.append(line);
  }

Why are the results different in the two cases? The file a.txt contains the same string, but when I print the contents of the file, it prints z\u0142o\u017Cy\u0142 instead of the actual Unicode characters. How can I get the file content printed the same way the string is printed?

Answer 1

Accepted answer by AndiDog

Your code should be correct, but I guess that the file "a.txt" does not contain the Unicode characters encoded with UTF-8, but the escaped string "\u0142o\u017Cy\u0142".

Please check whether the text file is correct, using a UTF-8-aware editor such as recent versions of Notepad or Notepad++ on Windows. Or inspect it with your favorite hex editor: it should not contain backslashes.

I tried it with "" as UTF-8-encoded content of the file and it gets printed correctly. Note that not all Unicode characters can be printed, depending on your terminal encoding (really a hassle on Windows) and font.

Answer 2

Answer by InsertNickHere

I think it's just "UTF8", not "UTF-8".

Here I saw it: Source

Answer 3

Answer by Richard Fearn

It sounds as though your file literally contains the text z\u0142o\u017Cy\u0142, i.e. it has Unicode escape sequences in it.

There's probably a library for decoding these, but you could do it yourself. According to the Java Language Specification, an escape sequence is always of the form \uxxxx, so you could take the 4-digit hex value xxxx for the character, convert it to an integer with Integer.parseInt, convert that to a character, and finally replace the whole \uxxxx sequence with the character.
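The manual approach described here could look roughly like the sketch below. The class and method names (UnicodeUnescape, unescape, isHex) are made up for illustration, and the sketch assumes the simple case only: a backslash, a single u, and exactly four hex digits. It does not treat doubled backslashes specially.

```java
class UnicodeUnescape {
    // Replaces each backslash-u escape (backslash, 'u', four hex digits)
    // with the character it names; everything else is copied unchanged.
    static String unescape(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '\\' && i + 6 <= input.length()
                    && input.charAt(i + 1) == 'u' && isHex(input, i + 2)) {
                // Parse the four hex digits and append the resulting char.
                out.append((char) Integer.parseInt(input.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    // True if the four characters starting at 'start' are all hex digits.
    static boolean isHex(String s, int start) {
        for (int k = start; k < start + 4; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The doubled backslashes make this the literal escaped text, as it would appear in the file.
        String fromFile = "z\\u0142o\\u017Cy\\u0142";
        System.out.println(unescape(fromFile).equals("z\u0142o\u017Cy\u0142")); // prints true
    }
}
```

Each line read from the file could be passed through unescape before printing.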

Answer 4

Answer by Stephen P

Java interprets Unicode escapes in the source code, such as your \u0142, as if you had actually typed that character (Latin small letter L with stroke) into the source. Java does not interpret Unicode escapes that it reads from a file.

If you take your String str = "\u0142o\u017Cy\u0142"; and write it to a file a.txt from your Java program, then open the file in an editor, you'll see the characters themselves in the file, not the \uNNNN sequence.

If you then take your original posted program and read that a.txt file, you should see what you expected.
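A minimal sketch of that experiment follows, assuming UTF-8 for both writing and reading. The class name Utf8RoundTrip and the temporary file are stand-ins for illustration, not part of the original program.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8RoundTrip {
    // Writes the text to the file as UTF-8 bytes, then decodes the bytes back into a String.
    static String roundTrip(Path file, String text) throws IOException {
        Files.write(file, text.getBytes(StandardCharsets.UTF_8));
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical stand-in for C:/a.txt
        Path file = Files.createTempFile("a", ".txt");
        String str = "\u0142o\u017Cy\u0142";
        String readBack = roundTrip(file, str);
        System.out.println(readBack.equals(str)); // prints true
        // The file holds the encoded characters (2 + 1 + 2 + 1 + 2 bytes),
        // not the 20 ASCII characters of the escaped form.
        System.out.println(Files.size(file)); // prints 8
        Files.delete(file);
    }
}
```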

Answer 5

Answer by BalusC

So, you want to unescape Unicode code points? There is no public API available for this. java.util.Properties has a loadConvert() method which does exactly this, but it's private. Check the Java source in case you'd like to reuse it. It does the conversion by simple parsing. I wouldn't use regex for this, since it is too error-prone in very specific circumstances.

Or perhaps you should, after all, be using java.util.Properties or its i18n counterpart java.util.ResourceBundle with a .properties file instead of a plain .txt file.
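As a rough illustration of the Properties route: Properties.load decodes backslash-u escapes in keys and values while loading. The key name "word", the class name, and the in-memory content below are assumptions made for the example; a real program would load from a .properties file instead.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

class PropertiesUnescape {
    // Loads .properties-style text and returns the value stored under the given key.
    // Properties.load decodes backslash-u escapes as part of loading.
    static String loadValue(String propertiesText, String key) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(propertiesText));
        return props.getProperty(key);
    }

    public static void main(String[] args) throws IOException {
        // The doubled backslashes make this the literal escaped text, as it would appear in the file.
        String content = "word=z\\u0142o\\u017Cy\\u0142";
        System.out.println(loadValue(content, "word").equals("z\u0142o\u017Cy\u0142")); // prints true
    }
}
```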

Answer 6

Answer by Alex

You have used FileInputStream, which is a byte stream, not a character reader. Try using FileReader instead,

something like:

BufferedReader inputStream = new BufferedReader(new FileReader("C:/a.txt"));

Then you can use the line-oriented BufferedReader to read each line. FileInputStream is low-level I/O that you should avoid here. You're writing characters to your file, not bytes, so the best approach is to use character streams for writing and reading, unless you need to write bytes/binary data.
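One caveat with the snippet above: FileReader only accepts an explicit charset from Java 11 onward, and otherwise decodes with the default charset. The sketch below swaps in Files.newBufferedReader so the UTF-8 choice is explicit. The class name and the temporary file are assumptions for illustration, and lines are concatenated without separators, as in the question's loop.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ReadLinesUtf8 {
    // Reads the file line by line through a character stream that decodes UTF-8.
    static String readAll(Path file) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical stand-in for C:/a.txt, filled with real UTF-8 characters.
        Path file = Files.createTempFile("a", ".txt");
        String text = "z\u0142o\u017Cy\u0142";
        Files.write(file, text.getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(file).equals(text)); // prints true
        Files.delete(file);
    }
}
```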

Answer 7

Answer by tchrist

I posted Java code to unescape (“descape”?) such things and many others in this answer.
