Java: Reading strings from a random access file with buffered input

Question

Asked by GreyCat

I've never worked closely with the Java IO API before, and I'm really frustrated now. I find it hard to believe how strange and complex it is, and how hard a simple task can be.

My task: I have two positions (starting byte, ending byte), pos1 and pos2. I need to read the lines between these two bytes (including the starting one, not including the ending one) and use them as UTF-8 String objects.

For example, in most scripting languages it would be a very simple one-to-three-liner like this (in Ruby, but it would be essentially the same in Python, Perl, etc.):

f = File.open("file.txt")
f.seek(pos1)  # seek returns 0, not the file, so it cannot be chained
while f.pos < pos2
  s = f.readline
  # do something with "s" here
end

It quickly turns into hell with the Java IO APIs ;) In fact, I see two ways to read lines (ending with \n) from regular local files:

RandomAccessFile has getFilePointer() and seek(long pos), but its readLine() returns neither UTF-8 strings nor byte arrays, only very strange strings with broken encoding, and it has no buffering (which probably means that every read*() call is translated into a single underlying OS read(), which is fairly slow).
BufferedReader has a great readLine() method, and it can even do some seeking with skip(long n), but it has no way to determine even the number of bytes that have already been read, let alone the current position in the file.

I've tried to use something like:

    FileInputStream fis = new FileInputStream(fileName);
    FileChannel fc = fis.getChannel();
    BufferedReader br = new BufferedReader(
            new InputStreamReader(
                    fis,
                    CHARSET_UTF8
            )
    );

... and then using fc.position() to get the current file reading position and fc.position(newPosition) to set one, but it doesn't seem to work in my case: it looks like it returns the position of the buffer pre-filling done by BufferedReader, or something like that. These counters seem to be rounded up in 16K increments.

Do I really have to implement it all by myself, i.e. a file reading interface which would:

allow me to get/set position in a file
buffer file reading operations
allow reading UTF8 strings (or at least allow operations like "read everything till the next \n")

Is there a quicker way than implementing it all myself? Am I overlooking something?

Answer 1

Accepted answer by Ken Bloom

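import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.input.BoundedInputStream;

FileInputStream file = new FileInputStream(filename);
file.skip(pos1); // move to the first byte of interest
// BoundedInputStream reports end-of-stream after pos2-pos1 bytes
BufferedReader br = new BufferedReader(
   new InputStreamReader(new BoundedInputStream(file,pos2-pos1), StandardCharsets.UTF_8)
);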

If you didn't care about pos2, then you wouldn't need Apache Commons IO.
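
For readers who would rather avoid the dependency, the byte limit can be approximated with a small FilterInputStream. This is only a sketch: LimitedInputStream, BoundedRead and readLines are names made up for this illustration, not part of the JDK or of Commons IO, and a line (or multi-byte character) that straddles pos2 is simply cut off there.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class BoundedRead {
    /** Hypothetical stand-in for BoundedInputStream: passes through at most `limit` bytes. */
    static class LimitedInputStream extends FilterInputStream {
        private long remaining;

        LimitedInputStream(InputStream in, long limit) {
            super(in);
            this.remaining = limit;
        }

        @Override
        public int read() throws IOException {
            if (remaining <= 0) return -1;      // pretend the stream ended
            int b = super.read();
            if (b != -1) remaining--;
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            if (remaining <= 0) return -1;
            int n = super.read(buf, off, (int) Math.min(len, remaining));
            if (n > 0) remaining -= n;
            return n;
        }
    }

    /** Reads the lines stored in bytes [pos1, pos2) of the file, decoded as UTF-8. */
    static List<String> readLines(String fileName, long pos1, long pos2) throws IOException {
        List<String> lines = new ArrayList<>();
        try (FileInputStream file = new FileInputStream(fileName)) {
            file.getChannel().position(pos1);   // absolute seek on the underlying file
            BufferedReader br = new BufferedReader(new InputStreamReader(
                    new LimitedInputStream(file, pos2 - pos1), StandardCharsets.UTF_8));
            String line;
            while ((line = br.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```

Because the limit sits below the BufferedReader, the reader can buffer as much as it likes without ever seeing bytes at or beyond pos2.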

Answer 2

Answer by AlexR

The Java IO API is very flexible. Unfortunately, that flexibility sometimes makes it verbose. The main idea here is that there are many streams, writers and readers that implement the wrapper pattern. For example, BufferedInputStream wraps any other InputStream. The same is true of output streams.

The difference between streams and readers/writers is that streams work with bytes while readers/writers work with characters.
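
A small sketch of that difference, assuming UTF-8 input: the same six-character text is seven bytes long, so the same read() loop counts differently on a stream and on a reader. The class and method names are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class BytesVersusChars {
    /** On a stream, every read() call yields one byte. */
    static int countBytes(InputStream in) throws IOException {
        int n = 0;
        while (in.read() != -1) n++;
        return n;
    }

    /** On a reader, every read() call yields one decoded char (a UTF-16 code unit). */
    static int countChars(Reader in) throws IOException {
        int n = 0;
        while (in.read() != -1) n++;
        return n;
    }

    public static void main(String[] args) throws IOException {
        // "naïve\n": the ï (U+00EF) takes two bytes in UTF-8
        byte[] data = "na\u00EFve\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(countBytes(new ByteArrayInputStream(data)));    // prints 7
        System.out.println(countChars(new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8))); // prints 6
    }
}
```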

Fortunately, some streams, writers and readers have convenient constructors that simplify coding. If you want to read a file, you just have to write

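    InputStream in = new FileInputStream("/usr/home/me/myfile.txt");
    // skip() does not need mark support; FileInputStream.markSupported()
    // returns false, so guarding on it would skip nothing at all
    in.skip(1024);
    int b = in.read();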

It is not as complicated as you fear.

Channels are something different. They are part of the so-called "new IO", or NIO. New IO is non-blocking; that is its main advantage. You can search the internet for any "nio java tutorial" and read about it. But it is more complicated than regular IO and is not needed for most applications.

Answer 3

Answer by Ken Bloom

Start with a RandomAccessFile and use read or readFully to get a byte array between pos1 and pos2. Let's say that we've stored the data read in a variable named rawBytes.

Then create your BufferedReader using

new BufferedReader(new InputStreamReader(new ByteArrayInputStream(rawBytes), StandardCharsets.UTF_8))

Then you can call readLine on the BufferedReader.

Caveat: this probably uses more memory than if you could make the BufferedReader seek to the right location itself, because it preloads everything into memory.
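
Put together, the steps above might look like the following sketch. RangeLines and readLines are illustrative names, and the code assumes that the range fits into a single byte array and that the file is at least pos2 bytes long.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class RangeLines {
    /** Loads bytes [pos1, pos2) into memory and splits them into UTF-8 lines. */
    static List<String> readLines(String fileName, long pos1, long pos2) throws IOException {
        byte[] rawBytes = new byte[(int) (pos2 - pos1)]; // assumes the range fits in an int
        try (RandomAccessFile raf = new RandomAccessFile(fileName, "r")) {
            raf.seek(pos1);
            raf.readFully(rawBytes); // throws EOFException if the file ends before pos2
        }
        List<String> lines = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(rawBytes), StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```

The file is touched by one seek and one bulk read, so the missing buffering in RandomAccessFile does not matter here.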

Answer 4

Answer by Jonathan B

I think the confusion is caused by the UTF-8 encoding and the possibility of multi-byte characters.

UTF-8 doesn't use a fixed number of bytes per character; a character takes between one and four bytes. I'm assuming from your post that you are using single-byte characters. For example, 412 bytes would then mean 412 characters. But if the string consisted of double-byte characters, you would get 206 characters.
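
To make the variable width concrete, here is a tiny sketch that prints the encoded size of a few characters. Utf8Lengths and utf8Length are names invented for the example.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    /** Number of bytes the given text occupies when encoded as UTF-8. */
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));            // 1: ASCII letter
        System.out.println(utf8Length("\u00E9"));       // 2: e with acute accent
        System.out.println(utf8Length("\u20AC"));       // 3: euro sign
        System.out.println(utf8Length("\uD83D\uDE00")); // 4: U+1F600, a surrogate pair in Java
    }
}
```

This is why a byte offset such as pos1 cannot be converted into a character count without decoding everything before it.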

The original java.io package didn't deal well with this multi-byte confusion. So, they added more classes to deal specifically with strings. The package mixes two different types of file handlers (and they can be confusing until the nomenclature is sorted out). The stream classes provide direct data I/O without any conversion. The reader classes convert files to strings with full support for multi-byte characters. That might help clarify part of the problem.

Since you state you are using UTF-8 characters, you want the reader classes. In this case, I suggest FileReader. The skip() method in FileReader allows you to skip over X characters and then start reading text. Alternatively, I prefer the overloaded read() method, since it allows you to grab all the text at once.

If you assume your "bytes" are individual characters, try something like this:

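FileReader fr = new FileReader( new File("x.txt") );
fr.skip( pos1 );                      // skip() counts characters, not bytes
char[] buffer = new char[ (int) (pos2 - pos1) ];
fr.read( buffer, 0, buffer.length );  // the second argument is an offset into buffer, not into the file
...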

Answer 5

Answer by Martijn Verburg

For @Ken Bloom: a very quick go at a Java 7 version. Note: I don't think this is the most efficient way; I'm still getting my head around NIO.2. Oracle has started a tutorial on it.

Also note that this isn't using Java 7's new ARM syntax (which takes care of the exception handling for file-based resources), because it wasn't working in the latest OpenJDK build that I have. But if people want to see the syntax, let me know.

/*
 * Paths uses the default file system, note no exception thrown at this stage if
 * file is missing
 */
Path file = Paths.get("C:/Projects/timesheet.txt");
ByteBuffer readBuffer = ByteBuffer.allocate(readBufferSize);
FileChannel fc = null;
try
{
    /*
     * FileChannel implements SeekableByteChannel, so its position can be set.
     * (Asynchronous file I/O lives in a separate class, AsynchronousFileChannel.)
     */
    fc = FileChannel.open(file, StandardOpenOption.READ);
    fc.position(startPosition);
    while (fc.read(readBuffer) != -1)
    {
        readBuffer.flip();   // switch the buffer from filling to draining
        // Caution: a multi-byte character split across two reads is decoded incorrectly here
        System.out.println(Charset.forName(encoding).decode(readBuffer));
        readBuffer.clear();  // make the whole buffer available for the next read
    }
}
finally
{
    if (fc != null)
    {
        fc.close();
    }
}
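
For anyone curious about the ARM (try-with-resources) form, here is a rough sketch under a few assumptions: it uses the released Java 7 API (FileChannel.open), it stops at an end offset, and it uses a CharsetDecoder so that a multi-byte character split between two reads is carried over to the next round instead of being decoded in pieces. ChannelDecode and decodeRange are invented names, and bufferSize is assumed to be at least 4.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ChannelDecode {
    /** Decodes bytes [start, end) of the file as UTF-8, reading bufferSize (>= 4) bytes at a time. */
    static String decodeRange(Path file, long start, long end, int bufferSize) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer bytes = ByteBuffer.allocate(bufferSize);
        CharBuffer chars = CharBuffer.allocate(bufferSize);
        StringBuilder out = new StringBuilder();
        // try-with-resources closes the channel even if an exception is thrown
        try (FileChannel fc = FileChannel.open(file, StandardOpenOption.READ)) {
            fc.position(start);
            long remaining = end - start;
            while (remaining > 0) {
                // cap the buffer so that we never read past `end`
                bytes.limit((int) Math.min(bytes.capacity(), bytes.position() + remaining));
                int n = fc.read(bytes);
                if (n == -1) break;                  // file is shorter than `end`
                remaining -= n;
                bytes.flip();
                decoder.decode(bytes, chars, false); // an incomplete trailing sequence stays in `bytes`
                chars.flip();
                out.append(chars);
                chars.clear();
                bytes.compact();                     // keep leftover bytes for the next round
            }
            bytes.flip();
            decoder.decode(bytes, chars, true);      // end of input: flush whatever is left
            decoder.flush(chars);
            chars.flip();
            out.append(chars);
        }
        return out.toString();
    }
}
```

Splitting the returned text on '\n' then gives the lines; a range that ends in the middle of a character yields a replacement character at the end.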

Answer 6

Answer by scube

I wrote this code to read UTF-8 using RandomAccessFile:

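//File: CyclicBuffer.java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class CyclicBuffer {
private static final int size = 3;
private FileChannel channel;
private ByteBuffer buffer = ByteBuffer.allocate(size);

public CyclicBuffer(FileChannel channel) {
    this.channel = channel;
    // A freshly allocated buffer reports `size` remaining (zero) bytes;
    // flip it so it starts empty and the first get() reads from the channel.
    buffer.flip();
}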

private int read() throws IOException {
    return channel.read(buffer);
}

/**
 * Returns the byte read
 *
 * @return byte read -1 - end of file reached
 * @throws IOException
 */
public byte get() throws IOException {
    if (buffer.hasRemaining()) {
        return buffer.get();
    } else {
        buffer.clear();
        int eof = read();
        if (eof == -1) {
            return (byte) eof;
        }
        buffer.flip();
        return buffer.get();
    }
}
}
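//File: UTFRandomFileLineReader.java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;

public class UTFRandomFileLineReader {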
private final Charset charset = Charset.forName("utf-8");
private CyclicBuffer buffer;
private ByteBuffer temp = ByteBuffer.allocate(4096);
private boolean eof = false;

public UTFRandomFileLineReader(FileChannel channel) {
    this.buffer = new CyclicBuffer(channel);
}

public String readLine() throws IOException {
    if (eof) {
        return null;
    }
    byte x = 0;
    temp.clear();

    while ((byte) -1 != (x = (buffer.get())) && x != '\n') {
        if (temp.position() == temp.capacity()) {
            temp = addCapacity(temp);
        }
        temp.put(x);
    }
    if (x == -1) {
        eof = true;
    }
    temp.flip();
    if (eof && !temp.hasRemaining()) {
        return null; // nothing left after the final newline
    }
    return charset.decode(temp).toString(); // an empty line yields "" instead of ending the loop
}

private ByteBuffer addCapacity(ByteBuffer temp) {
    ByteBuffer t = ByteBuffer.allocate(temp.capacity() + 1024);
    temp.flip();
    t.put(temp);
    return t;
}

public static void main(String[] args) throws IOException {
    RandomAccessFile file = new RandomAccessFile("/Users/sachins/utf8.txt",
            "r");
    UTFRandomFileLineReader reader = new UTFRandomFileLineReader(file
            .getChannel());
    int i = 1;
    while (true) {
        String s = reader.readLine();
        if (s == null)
            break;
        System.out.println("\n line  " + i++);
        s = s + "\n";
        for (byte b : s.getBytes(Charset.forName("utf-8"))) {
            System.out.printf("%x", b);
        }
        System.out.printf("\n");

    }
}
}
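
The same idea can be sketched more compactly with standard classes, and extended to track the byte position that the question asks for. This is an illustration rather than a tuned implementation: PositionedLineReader is a made-up class, it only recognises '\n' as a line terminator (a preceding '\r' stays in the returned string), and it reads through a BufferedInputStream so that most read() calls never reach the operating system.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.channels.Channels;
import java.nio.charset.StandardCharsets;

public final class PositionedLineReader implements Closeable {
    private final RandomAccessFile file;
    private InputStream in;
    private long position; // byte offset of the next unread byte

    public PositionedLineReader(String fileName) throws IOException {
        file = new RandomAccessFile(fileName, "r");
        seek(0);
    }

    /** Moves to an absolute byte offset and discards any buffered data. */
    public void seek(long pos) throws IOException {
        file.seek(pos);
        in = new BufferedInputStream(Channels.newInputStream(file.getChannel()));
        position = pos;
    }

    /** Logical position: counts bytes handed out, not bytes pre-read into the buffer. */
    public long position() {
        return position;
    }

    /** Reads one '\n'-terminated line as UTF-8, or returns null at end of file. */
    public String readLine() throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            position++;
            if (b == '\n') {
                return decode(line);
            }
            line.write(b);
        }
        return line.size() > 0 ? decode(line) : null;
    }

    private static String decode(ByteArrayOutputStream bytes) {
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```

With this, the loop from the question becomes: call seek(pos1), then keep calling readLine() while position() is less than pos2 and the result is not null.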

Answer 7

Answer by Jeff Terrell Ph.D.

I'm late to the party here, but I ran across this problem in my own project.

After much traversal of Javadocs and Stack Overflow, I think I found a simple solution.

After seeking to the appropriate place in your RandomAccessFile, which I am here calling raFile, do the following:

FileDescriptor fd = raFile.getFD();
FileReader     fr = new FileReader(fd);
BufferedReader br = new BufferedReader(fr);

Then you should be able to call br.readLine() to your heart's content, which will be much faster than calling raFile.readLine().

The one thing I'm not sure about is whether UTF8 strings are handled correctly.
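
On that point: FileReader(FileDescriptor) decodes with the JVM's default charset, so whether UTF-8 is handled correctly depends on the platform and Java version. A variant that names the charset explicitly could look like this sketch, where FdReader and utf8Reader are invented names:

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class FdReader {
    /** Wraps an already-positioned RandomAccessFile in a buffered UTF-8 reader. */
    static BufferedReader utf8Reader(RandomAccessFile raFile) throws IOException {
        // The stream shares the file descriptor, and therefore the file position, with raFile
        FileInputStream fis = new FileInputStream(raFile.getFD());
        return new BufferedReader(new InputStreamReader(fis, StandardCharsets.UTF_8));
    }
}
```

Keep in mind that once the reader has filled its buffer, raFile.getFilePointer() reflects how far the reader has read ahead, not the end of the last line returned.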
