Reading Unicode characters in Java
Original source: http://stackoverflow.com/questions/3630609/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Asked by Rakesh
I'm a bit new to Java. When I assign a Unicode string to a variable:
String str = "\u0142o\u017Cy\u0142";
System.out.println(str);
final StringBuilder stringBuilder = new StringBuilder();
InputStream inStream = new FileInputStream("C:/a.txt");
final InputStreamReader streamReader = new InputStreamReader(inStream, "UTF-8");
final BufferedReader bufferedReader = new BufferedReader(streamReader);
String line = "";
while ((line = bufferedReader.readLine()) != null) {
    System.out.println(line);
    stringBuilder.append(line);
}
Why are the results different in the two cases? The file a.txt contains the same string, but when I print the contents of the file it prints z\u0142o\u017Cy\u0142 instead of the actual Unicode characters. How can I get the file content printed the same way the string literal is printed?
Accepted answer by AndiDog
Your code should be correct, but I guess that the file "a.txt" does not contain the Unicode characters encoded with UTF-8, but the escaped string "\u0142o\u017Cy\u0142".
Please check whether the text file is correct, using a UTF-8-aware editor such as recent versions of Notepad or Notepad++ on Windows. Or inspect it with your favorite hex editor: it should not contain backslashes.
I tried it with the actual characters as the UTF-8-encoded content of the file, and they are printed correctly. Note that not all Unicode characters can be printed, depending on your terminal encoding (really a hassle on Windows) and font.
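The hex-editor check suggested above can also be approximated from Java itself. The following is only a sketch: the class name HexDump and the helper toHex are made up for illustration and are not part of the original answer. It shows how the bytes of a real UTF-8 character differ from the bytes of the escaped text.

```java
import java.nio.charset.StandardCharsets;

class HexDump {
    // Formats bytes as space-separated upper-case hex pairs, like one row of a hex editor.
    // (Hypothetical helper, named for this illustration only.)
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // One real character encoded as UTF-8: two bytes, no backslash.
        System.out.println(toHex("\u0142".getBytes(StandardCharsets.UTF_8)));
        // The six-character escaped text: the first byte, 5C, is the backslash.
        System.out.println(toHex("\\u0142".getBytes(StandardCharsets.UTF_8)));
    }
}
```

This should print C5 82 for the real character and 5C 75 30 31 34 32 for the escaped text. To inspect an actual file, the same helper could be fed the result of Files.readAllBytes; a 5C byte followed by 75 would indicate a literal backslash-u in the file.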
Answer by InsertNickHere
Answer by Richard Fearn
It sounds as though your file literally contains the text z\u0142o\u017Cy\u014, i.e. it has Unicode escape sequences in it.
There's probably a library for decoding these, but you could do it yourself. According to the Java Language Specification, an escape sequence is always of the form \uxxxx, so you could take the 4-digit hex value xxxx for the character, convert it to an integer with Integer.parseInt (radix 16), convert that to a character, and finally replace the whole \uxxxx sequence with the character.
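The do-it-yourself steps described above could be sketched roughly as follows. This is a minimal illustration, not a complete decoder: the class and method names (UnicodeUnescaper, unescape) are hypothetical, it only handles a single u followed by exactly four hex digits, and it does not treat a doubled backslash specially, whereas the Java compiler's own rules are somewhat richer.

```java
class UnicodeUnescaper {
    // Replaces each backslash-u escape that is followed by exactly four hex digits
    // with the character it denotes. Everything else, including incomplete
    // escapes, is copied through unchanged.
    static String unescape(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '\\'
                    && i + 6 <= input.length()
                    && input.charAt(i + 1) == 'u'
                    && isFourHexDigits(input, i + 2)) {
                // Parse the four hex digits and append the resulting UTF-16 code unit.
                int codeUnit = Integer.parseInt(input.substring(i + 2, i + 6), 16);
                out.append((char) codeUnit);
                i += 6;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    // True if the four characters starting at 'start' are all hex digits.
    // The caller guarantees that start + 4 does not exceed the string length.
    private static boolean isFourHexDigits(String s, int start) {
        for (int k = start; k < start + 4; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The doubled backslashes make this the literal text a reader would return.
        String lineFromFile = "z\\u0142o\\u017Cy\\u0142";
        System.out.println(unescape(lineFromFile));
    }
}
```

In the question's loop, each line returned by readLine() could be passed through unescape before printing. Whether the result displays correctly still depends on the terminal encoding and font.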
Answer by Stephen P
Java interprets Unicode escapes such as your \u0142 that are in the source code as if you had actually typed that character (Latin small letter L with stroke) into the source. Java does not interpret Unicode escapes that it reads from a file.
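The difference between the two cases can be seen by comparing string lengths. In this small sketch (the class and constant names are invented for illustration), the doubled backslash stands in for text that was read from a file and therefore never passed through the compiler's escape processing.

```java
class EscapeDemo {
    // The compiler replaces this escape with the single character U+0142.
    static final String FROM_SOURCE = "\u0142";

    // Doubling the backslash keeps the escape as plain text: six characters,
    // which is what a reader returns when the file itself contains the escape.
    static final String AS_READ_FROM_FILE = "\\u0142";

    public static void main(String[] args) {
        System.out.println(FROM_SOURCE.length());
        System.out.println(AS_READ_FROM_FILE.length());
        System.out.println(FROM_SOURCE.equals(AS_READ_FROM_FILE));
    }
}
```

Running this should print 1, then 6, then false: the two strings look alike in source form but hold different characters at run time.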
If you take your String str = "\u0142o\u017Cy\u0142"; and write it to a file a.txt from your Java program, then open the file in an editor, you'll see the characters themselves in the file, not the \uNNNN sequences.
If you then take your originally posted program and read that a.txt file, you should see what you expected.
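The write-then-read experiment described above might look like the following sketch. It assumes a temporary file as a stand-in for a.txt, and the class and method names (RoundTrip, writeThenRead) are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class RoundTrip {
    // Writes the text to a temporary file as UTF-8, reads it back as UTF-8,
    // and returns what was read. The temporary file stands in for a.txt.
    static String writeThenRead(String text) {
        try {
            Path file = Files.createTempFile("unicode-demo", ".txt");
            try {
                Files.write(file, text.getBytes(StandardCharsets.UTF_8));
                return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The compiler has already turned the escapes into real characters here.
        String str = "\u0142o\u017Cy\u0142";
        System.out.println(str.equals(writeThenRead(str)));
    }
}
```

This should print true, because the file then holds the encoded characters themselves rather than backslash sequences.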
Answer by BalusC
So, you want to unescape Unicode code points? There is no public API available for this. java.util.Properties has a loadConvert() method which does exactly this, but it's private. Check the Java source in case you'd like to reuse it. It does the conversion by simple parsing. I wouldn't use a regex for this, since that is too error-prone in very specific circumstances.
Or perhaps you should be using java.util.Properties, or its i18n counterpart java.util.ResourceBundle, with a .properties file instead of a plain .txt file after all.
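As a rough illustration of the .properties route, java.util.Properties decodes these escapes itself while loading. The sketch below feeds it an in-memory string instead of a real file; the key name word and the class and method names are assumptions made for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class PropertiesUnescape {
    // Loads properties-format text and returns the value stored under the given key.
    // Properties.load decodes backslash-u escapes as part of parsing.
    static String loadValue(String propertiesText, String key) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        // Doubled backslashes: this is the literal text a .properties file would contain.
        // The key "word" is a made-up name for this sketch.
        String fileContent = "word=z\\u0142o\\u017Cy\\u0142";
        String value = loadValue(fileContent, "word");
        System.out.println(value.length());
    }
}
```

This should print 6, since the escaped text is decoded to six characters. With a real file, the StringReader would be replaced by a reader or input stream opened on the .properties file.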
Answer by Alex
You have used FileInputStream, which is a byte stream reader, not a character reader. Try using FileReader instead, something like:
BufferedReader inputStream = new BufferedReader(new FileReader("C:/a.txt"));
Then you can use the line-oriented BufferedReader to read each line. FileInputStream is low-level I/O that you should avoid. You're writing characters to your file, not bytes, so the best approach is to use character streams for writing and reading unless you need to write bytes/binary data.
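One caveat with the suggestion above: the FileReader(String) constructor decodes with the platform default charset, which is not necessarily UTF-8 on older Java versions. A character-stream variant with an explicit charset might look like the sketch below (the class and method names are hypothetical). Like the question's own InputStreamReader code, it only decodes bytes into characters and does not interpret escape sequences found in the file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class Utf8LineReader {
    // Reads all lines of a file through a character stream that decodes UTF-8 explicitly.
    static List<String> readLines(Path file) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }
}
```

It could be called as Utf8LineReader.readLines(Paths.get("C:/a.txt")), using the path from the question.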
Answer by tchrist
I posted Java code to unescape (“descape”?) such things and many others in this answer.