How to read a UTF-8 encoded file using RandomAccessFile?

Notice: this page is a Chinese/English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/9964892/

Time: 2020-10-30 23:05:42  Source: igfitidea

Tags: java, unicode, utf-8, io, textfield

Asked by kenny

I have a text file encoded in UTF-8 (it contains language-specific characters). I need to use RandomAccessFile to seek to a specific position and read from there.

I want to read it line by line.

String str = myreader.readLine(); // returns wrong text, not decoded
String str2 = myreader.readUTF(); // throws java.io.EOFException

Answered by picoworm

You can convert the string returned by readLine() to UTF-8 using the following code:

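import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public static void main(String[] args) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile("MyFile.txt", "r")) {
        // readLine() widens each byte to one char, so the text looks garbled
        String line = raf.readLine();
        // Recover the raw bytes, then decode them as UTF-8
        String utf8 = new String(line.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
        System.out.println("Line: " + line);
        System.out.println("UTF8: " + utf8);
    }
}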

Content of MyFile.txt (UTF-8 encoding):

Привет из Украины

Console output:

Line: D?D?D2Dμ? D?D· D£Do?D°D?D??
UTF8: Привет из Украины

Answered by Edwin Dalorzo

The API docs say the following about readUTF():

Reads in a string from this file. The string has been encoded using a modified UTF-8 format.

The first two bytes are read, starting from the current file pointer, as if by readUnsignedShort. This value gives the number of following bytes that are in the encoded string, not the length of the resulting string. The following bytes are then interpreted as bytes encoding characters in the modified UTF-8 format and are converted into characters.

This method blocks until all the bytes are read, the end of the stream is detected, or an exception is thrown.

Is your file formatted this way?

If it is not, that would explain your EOFException.

Your file is a text file, so your actual problem is the decoding.

The simplest answer I know is:

try(BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream("jedis.txt"),"UTF-8"))){

    String line = null;
    while( (line = reader.readLine()) != null){
        if(line.equals("Obi-wan")){
            System.out.println("Yay, I found " + line +"!");
        }
    }
}catch(IOException e){
    e.printStackTrace();
}

Or you can set the default encoding to UTF-8 with the system property file.encoding:

java -Dfile.encoding=UTF-8 com.jediacademy.Runner arg1 arg2 ...

You may also set it as a system property at runtime with System.setProperty(...) if you only need it for this specific file. Be aware, though, that the JVM determines its default charset during startup, so a change made at runtime may not take effect. In a case like this I think I would prefer passing the charset explicitly (an InputStreamReader for reading, an OutputStreamWriter for writing).

By setting the system property you can use FileReader and expect it to use UTF-8 as the default encoding for your files. This then applies to all the files that you read and write.
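To confirm which encoding classes such as FileReader will fall back to, you can print the effective default at startup. This is only a small check; the class name DefaultCharsetCheck is made up for illustration. As far as I know, JDK 18 and later already default to UTF-8, so the property matters mostly on older runtimes.

```java
import java.nio.charset.Charset;

class DefaultCharsetCheck {
    // Returns the charset used when no charset is passed explicitly
    // (for example by FileReader or String.getBytes()).
    static Charset effectiveDefault() {
        return Charset.defaultCharset();
    }

    public static void main(String[] args) {
        System.out.println("file.encoding property: " + System.getProperty("file.encoding"));
        System.out.println("default charset:        " + effectiveDefault());
    }
}
```

Running it once with and once without -Dfile.encoding=UTF-8 shows whether the flag changes anything on your JVM.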

If you intend to detect decoding errors in your file, you would be forced to use the InputStreamReader approach and use the constructor that receives a decoder.

Something like:

CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
decoder.onMalformedInput(CodingErrorAction.REPORT);
decoder.onUnmappableCharacter(CodingErrorAction.REPORT);
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream("jedis.txt"), decoder));

You may choose between the actions IGNORE | REPLACE | REPORT.
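As a rough sketch of what REPORT buys you: with that action, a malformed byte sequence surfaces as a CharacterCodingException while reading, instead of being silently replaced. The class and method names below are invented for illustration, and the input is an in-memory stream rather than jedis.txt so the example stays self-contained.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8Check {
    // Returns true if the whole stream decodes as UTF-8 without errors.
    static boolean isValidUtf8(InputStream in) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, decoder))) {
            while (reader.readLine() != null) {
                // Just drain the stream; decoding errors are thrown as exceptions.
            }
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] good = "Obi-wan\n".getBytes(StandardCharsets.UTF_8);
        byte[] bad = {'a', (byte) 0xFF, 'b'}; // 0xFF never occurs in valid UTF-8
        System.out.println(isValidUtf8(new ByteArrayInputStream(good)));
        System.out.println(isValidUtf8(new ByteArrayInputStream(bad)));
    }
}
```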

EDIT

If you insist on using RandomAccessFile, you would need to know the exact byte offset of the line that you intend to read. Not only that: in order to read with the readUTF() method, you should have written the file with the writeUTF() method, because readUTF(), as the JavaDocs quoted above state, expects a specific format in which the first 2 bytes hold an unsigned value giving the length in bytes of the encoded string.

As such, if you do:

try(RandomAccessFile raf = new RandomAccessFile("jedis.bin", "rw")){

    raf.writeUTF("Luke\n"); //2 bytes for length + 5 bytes
    raf.writeUTF("Obiwan\n"); //2 bytes for length + 7 bytes
    raf.writeUTF("Yoda\n"); //2 bytes for length + 5 bytes

}catch(IOException e){
    e.printStackTrace();
}

You should not have any problems reading back from this file using the method readUTF(), as long as you can determine the offset of the line that you want to read back.

If you open the file jedis.bin you will notice that it is a binary file, not a text file.

Now, I know that "Luke\n" is 5 bytes in UTF-8 and "Obiwan\n" is 7 bytes in UTF-8, and that the writeUTF() method inserts 2 bytes in front of each of these strings. Therefore, before "Yoda\n" there are (5+2) + (7+2) = 16 bytes.

So, I could do something like this to reach the last line:

try (RandomAccessFile raf = new RandomAccessFile("jedis.bin", "r")) {

    raf.seek(16);
    String val = raf.readUTF();
    System.out.println(val); //prints Yoda

} catch (IOException e) {
    e.printStackTrace();
}
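Rather than adding up the sizes by hand, the offsets can also be captured while writing. The following is a minimal sketch under that assumption; the class WriteUtfOffsets and its methods are hypothetical helpers, not part of the original answer.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Arrays;

class WriteUtfOffsets {
    // Writes each value with writeUTF() and returns the byte offset
    // at which each record (2-byte length prefix + data) starts.
    static long[] writeRecords(File file, String... values) throws IOException {
        long[] offsets = new long[values.length];
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            raf.setLength(0); // start from an empty file
            for (int i = 0; i < values.length; i++) {
                offsets[i] = raf.getFilePointer();
                raf.writeUTF(values[i]);
            }
        }
        return offsets;
    }

    // Seeks to a recorded offset and reads one record back.
    static String readRecord(File file, long offset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            raf.seek(offset);
            return raf.readUTF();
        }
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("jedis", ".bin");
        long[] offsets = writeRecords(file, "Luke\n", "Obiwan\n", "Yoda\n");
        System.out.println(Arrays.toString(offsets)); // [0, 7, 16]
        System.out.print(readRecord(file, offsets[2]));
        file.delete();
    }
}
```

The third offset matches the 16 bytes computed by hand above.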

But this will not work if you wrote the file with a Writer class, because writers do not follow the formatting rules of the writeUTF() method.

In a case like this, the best option would be to format your binary file so that every string occupies the same amount of space (a number of bytes, not a number of characters, because in UTF-8 the number of bytes varies with the characters in your String). If a string does not need all the space, you pad it.

That way you can easily calculate the offset of a given line, because all lines occupy the same amount of space.
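The padding scheme described above could be sketched roughly as follows. The record size of 32 bytes, the zero-byte padding, and the class name FixedWidthRecords are all assumptions made for this illustration; the sketch also assumes the text never contains the character U+0000, since that byte is used as padding.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

class FixedWidthRecords {
    static final int RECORD_SIZE = 32; // hypothetical size in bytes, not characters

    // Writes the text into slot 'index', padded with zero bytes to RECORD_SIZE.
    static void writeRecord(RandomAccessFile raf, long index, String text) throws IOException {
        byte[] encoded = text.getBytes(StandardCharsets.UTF_8);
        if (encoded.length > RECORD_SIZE) {
            throw new IllegalArgumentException("text needs " + encoded.length + " bytes");
        }
        byte[] record = new byte[RECORD_SIZE]; // zero-filled, so the tail is the padding
        System.arraycopy(encoded, 0, record, 0, encoded.length);
        raf.seek(index * RECORD_SIZE);
        raf.write(record);
    }

    // Reads slot 'index' and strips the zero padding before decoding.
    static String readRecord(RandomAccessFile raf, long index) throws IOException {
        byte[] record = new byte[RECORD_SIZE];
        raf.seek(index * RECORD_SIZE);
        raf.readFully(record);
        int length = 0;
        while (length < RECORD_SIZE && record[length] != 0) {
            length++;
        }
        return new String(record, 0, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("jedis-fixed", ".bin");
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            writeRecord(raf, 0, "Luke");
            writeRecord(raf, 1, "Obiwan");
            writeRecord(raf, 2, "Yoda");
            System.out.println(readRecord(raf, 2)); // offset is simply 2 * RECORD_SIZE
        }
        file.delete();
    }
}
```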

Answered by tchrist

You aren't going to be able to go at it this way. The seek function positions you at a given number of bytes. There is no guarantee that you are aligned to a UTF-8 character boundary.
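To make the alignment problem concrete: in UTF-8, every byte after the first byte of a multi-byte character has the bit pattern 10xxxxxx, so it is possible to detect that a seek landed inside a character. The sketch below is one possible way to step forward to the next boundary; it is not something the answer proposes, and the class Utf8Boundary is a made-up name.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

class Utf8Boundary {
    // A UTF-8 continuation byte has the bit pattern 10xxxxxx.
    static boolean isContinuationByte(int b) {
        return (b & 0xC0) == 0x80;
    }

    // Returns the first offset >= 'offset' that does not point at a
    // continuation byte (or the end of the file if none is found).
    static long alignForward(RandomAccessFile raf, long offset) throws IOException {
        raf.seek(offset);
        int b;
        while ((b = raf.read()) != -1 && isContinuationByte(b)) {
            offset++;
        }
        return offset;
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("boundary", ".txt");
        // Two Cyrillic letters, each encoded as two bytes in UTF-8.
        Files.write(file.toPath(), "\u041f\u0440".getBytes(StandardCharsets.UTF_8));
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            System.out.println(alignForward(raf, 1)); // offset 1 is inside the first letter
        }
        file.delete();
    }
}
```

This only fixes character alignment; it does not tell you where a line starts.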

Answered by martinjs

I realise that this is an old question, but it still seems to attract some interest, and it has no accepted answer.

What you are describing is essentially a data structures problem. The discussion of UTF-8 here is a red herring: you would face the same problem with a fixed-length encoding such as ASCII, because your lines have variable length. What you need is some kind of index.

If you absolutely can't change the file itself (the "string file"), as seems to be the case, you could always construct an external index. The first time (and only the first time) the string file is accessed, you read it all the way through (sequentially), recording the byte position of the start of every line, and finishing by recording the end-of-file position (to make life simpler). This can be achieved by the following code:

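List<Long> myList = new ArrayList<>();
myList.add(0L); // assuming the first string starts at the beginning of the file
while (myRandomAccessFile.readLine() != null) {
    // the file pointer now sits at the start of the next line (or at end of file)
    myList.add(myRandomAccessFile.getFilePointer());
}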

You then write these integers into a separate file (the "index file"), which you read back every subsequent time you start your program and intend to access the string file. To access the nth string, pick the nth and (n+1)th entries from the index file (call these A and B). You then seek to position A in the string file and read B-A bytes, which you then decode as UTF-8. For instance, to get line i:

myRandomAccessFile.seek(myList.get(i));
// narrow the difference between the two offsets to an int array size
byte[] bytes = new byte[(int) (myList.get(i + 1) - myList.get(i))];
myRandomAccessFile.readFully(bytes);
String result = new String(bytes, "UTF-8");
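Put together, the two fragments above might look like the following self-contained sketch. It keeps the index in memory instead of writing an index file, strips the trailing line terminator from each slice, and uses a made-up class name (LineIndex); treat these as illustrative choices rather than part of the answer.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

class LineIndex {
    // One sequential pass: start offset of every line, plus the end-of-file offset.
    static List<Long> buildIndex(RandomAccessFile raf) throws IOException {
        List<Long> offsets = new ArrayList<>();
        raf.seek(0);
        offsets.add(0L);
        while (raf.readLine() != null) {
            offsets.add(raf.getFilePointer());
        }
        return offsets;
    }

    // Reads the raw bytes of line i and decodes them as UTF-8.
    static String readLine(RandomAccessFile raf, List<Long> offsets, int i) throws IOException {
        long start = offsets.get(i);
        byte[] bytes = new byte[(int) (offsets.get(i + 1) - start)];
        raf.seek(start);
        raf.readFully(bytes);
        int end = bytes.length;
        while (end > 0 && (bytes[end - 1] == '\n' || bytes[end - 1] == '\r')) {
            end--; // drop the line terminator
        }
        return new String(bytes, 0, end, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("strings", ".txt");
        String content = "Luke\n\u041f\u0440\u0438\u0432\u0435\u0442\nYoda\n";
        Files.write(file.toPath(), content.getBytes(StandardCharsets.UTF_8));
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            List<Long> index = buildIndex(raf);
            System.out.println(readLine(raf, index, 2)); // jumps straight to the third line
        }
        file.delete();
    }
}
```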

In many cases, however, it would be better to use a database such as SQLite, which creates and maintains the index for you. That way, you can add and modify extra "lines" without having to recreate the entire index. See https://www.sqlite.org/cvstrac/wiki?p=SqliteWrappers for Java implementations.

Answered by soulsurfer

Reading the file via readLine() worked for me:

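RandomAccessFile raf = new RandomAccessFile( ... );
String line;
while ((line = raf.readLine()) != null) {
    // new String(byte[]) decodes with the platform default charset,
    // so this only yields UTF-8 text when that default is UTF-8
    String utf = new String(line.getBytes("ISO-8859-1"));
    ...
}

// my file content has been created with (again using the platform default charset):
raf.write(myStringContent.getBytes());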

Answered by Arnav Rao

The readUTF() method of RandomAccessFile treats the first two bytes at the current file pointer as a length: the number of bytes following those two bytes that are to be read and returned as a string.

For this method to work, the content should have been written with the writeUTF() method, which uses the first two bytes at the current position to store the content size and then writes the content. Otherwise, most of the time you will get an EOFException.

See http://www.zoftino.com/java-random-access-files for details.
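A small experiment shows why plain text usually triggers the EOFException: the first two characters of the file get interpreted as a length. This is only a sketch, and the class ReadUtfOnPlainText and its method names are invented for the example.

```java
import java.io.EOFException;
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

class ReadUtfOnPlainText {
    // Interprets the first two bytes the way readUTF() does:
    // as a big-endian unsigned 16-bit length.
    static int impliedLength(File file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            return raf.readUnsignedShort();
        }
    }

    // Returns true if readUTF() runs off the end of the file.
    static boolean readUtfHitsEof(File file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            raf.readUTF();
            return false;
        } catch (EOFException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("plain", ".txt");
        Files.write(file.toPath(), "Hello".getBytes(StandardCharsets.UTF_8));
        // 'H' is 0x48 and 'e' is 0x65, so the implied length is 0x4865 = 18533 bytes,
        // far more than the 3 bytes that actually follow.
        System.out.println(impliedLength(file));
        System.out.println(readUtfHitsEof(file));
        file.delete();
    }
}
```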

Answered by Ludovic Kuty

Once you are positioned on a given line (this means you have solved the first part of your problem, see @martinjs's answer), you can read the whole line and make a String out of it using the statement given in the answer by @Matthieu. But to check whether that statement is correct, we have to ask ourselves 4 questions. It is not self-evident.

Note that getting to the start of a line may require analyzing the text to build an index if you need to access many lines randomly and quickly.

The statement to read a line and turn it into a String is:

String utf8 = new String(raf.readLine().getBytes("ISO-8859-1"), "UTF-8");

  1. What is a byte in UTF-8? That is, which values are allowed? We will see that this question is in fact irrelevant once we answer question 2.
  2. readLine(). UTF-8 bytes → UTF-16 code units: OK? Yes. readLine() stores each byte in the low eight bits of a char and sets the high eight bits to zero, and UTF-16 gives a meaning to every integer from 0 to 255 coded on 2 bytes when the most significant byte (MSB) is 0.
  3. getBytes("ISO-8859-1"). Characters encoded in UTF-16 (a Java String with 1 or 2 char code units per character) → ISO-8859-1 bytes: OK? Yes. The code points of the characters in this Java string are all ≤ 255, and ISO-8859-1 is a "raw" encoding, which means it can encode each of these characters as a single byte.
  4. new String(..., "UTF-8"). ISO-8859-1 bytes → UTF-8 bytes: OK? Yes. Since the original bytes come from UTF-8 encoded text and have been extracted as is, they still represent text encoded in UTF-8.
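Steps 2 to 4 can be checked with a short experiment. The sketch below mimics what readLine() does to each byte (zero-extending it to a char) and then verifies that getBytes with ISO-8859-1 returns exactly the original bytes. The class Latin1RoundTrip and its methods are hypothetical names used only for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Latin1RoundTrip {
    // Mimics readLine(): each byte becomes one char whose high eight bits are zero.
    static String widen(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length);
        for (byte b : bytes) {
            sb.append((char) (b & 0xFF));
        }
        return sb.toString();
    }

    // True if encoding the widened string as ISO-8859-1 gives the original bytes back.
    static boolean roundTrips(byte[] original) {
        byte[] recovered = widen(original).getBytes(StandardCharsets.ISO_8859_1);
        return Arrays.equals(original, recovered);
    }

    // The full chain from the statement above: widen, recover bytes, decode as UTF-8.
    static String decodeViaLatin1(byte[] utf8Bytes) {
        byte[] recovered = widen(utf8Bytes).getBytes(StandardCharsets.ISO_8859_1);
        return new String(recovered, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        System.out.println(roundTrips(all)); // every byte value 0..255 survives

        String text = "\u041f\u0440\u0438\u0432\u0435\u0442"; // Cyrillic sample text
        System.out.println(text.equals(decodeViaLatin1(text.getBytes(StandardCharsets.UTF_8))));
    }
}
```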

Concerning the raw nature of ISO-8859-1, in which every byte (value 0 to 255) is mapped onto a character, I copy below the comment I made on the answer by @Matthieu.

See this question concerning the notion of "raw" encoding with ISO-8859-1. Note the difference between ISO/IEC 8859-1 (191 bytes defined) and ISO-8859-1 (256 bytes defined). You can find the definition of ISO-8859-1 in RFC1345 and see that the control codes C0 and C1 are mapped onto the 65 unused bytes of ISO/IEC 8859-1.

Answered by kevinarpe

I find the API for RandomAccessFile challenging.

If your text is actually limited to the ASCII range, code points 0-127 (which UTF-8 encodes as single bytes with the high bit clear), then it is safe to use readLine(), but read its Javadoc carefully: that is one strange method. To quote:

This method successively reads bytes from the file, starting at the current file pointer, until it reaches a line terminator or the end of the file. Each byte is converted into a character by taking the byte's value for the lower eight bits of the character and setting the high eight bits of the character to zero. This method does not, therefore, support the full Unicode character set.

To read UTF-8 safely, I suggest you read (some or all of) the raw bytes with a combination of length() and read(byte[]). Then convert your UTF-8 bytes to a Java String with this constructor: new String(byte[], "UTF-8").

To write UTF-8 safely, first convert your Java String to the correct bytes with someText.getBytes("UTF-8"). Finally, write the bytes using write(byte[]).
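As a minimal sketch of this read/write pattern, assuming the whole file fits in memory: the class RawUtf8Access and its method names are made up, and readFully(byte[]) is used in place of read(byte[]) because a single read call is allowed to return fewer bytes than requested.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

class RawUtf8Access {
    // Replaces the file content with the UTF-8 encoding of 'text'.
    static void writeText(File file, String text) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            raf.setLength(0);
            raf.write(text.getBytes(StandardCharsets.UTF_8));
        }
    }

    // Reads all raw bytes and decodes them as UTF-8 in one step.
    static String readText(File file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            byte[] bytes = new byte[(int) raf.length()];
            raf.readFully(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("raw-utf8", ".txt");
        String text = "\u041f\u0440\u0438\u0432\u0435\u0442\nYoda\n"; // Cyrillic line, then ASCII line
        writeText(file, text);
        System.out.println(text.equals(readText(file)));
        file.delete();
    }
}
```

Decoding the bytes in one step avoids splitting a multi-byte character, which can happen if you decode arbitrary chunks separately.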