Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/36917209/

Time: 2020-11-03 02:00:51  Source: igfitidea

How to quickly search a large file for a String in Java?

Tags: java, io, java.util.scanner

Asked by Chief DMG

I am trying to search a large text file (400MB) for a particular String using the following:

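import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

File file = new File("fileName.txt");
// try-with-resources closes the Scanner (and the underlying file) when done
try (Scanner scanner = new Scanner(file)) {
    int count = 0;
    while (scanner.hasNextLine()) {
        if (scanner.nextLine().contains("particularString")) {
            count++;
            // prints the running count each time a matching line is found
            System.out.println("Number of instances of String: " + count);
        }
    }
} catch (FileNotFoundException e) {
    System.out.println(e);
}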

This works fine for small files. However, for this particular file and other large ones it takes far too long (more than 10 minutes).

What would be the quickest, most efficient way of doing this?

Update: I have now changed to the following, and it completes within seconds:

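import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// `file` is the same File object as in the first snippet.
// try-with-resources closes the reader even if an exception is thrown.
try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
    int count = 0;
    String line;
    // readLine() returns null at end of file
    while ((line = reader.readLine()) != null) {
        if (line.contains("particularString")) {
            count++;
            System.out.println("Number of instances of String " + count);
        }
    }
} catch (IOException e) {
    System.out.println(e);
}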

Accepted answer by radai

First, figure out how long it takes to actually read the entire file's contents versus how long it takes to scan them for your pattern.
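
One way to take that measurement is sketched below. It is only an illustration, not part of the original answer: the class name ReadVsScanTiming and its helper methods are made up for this example, the file is assumed to be UTF-8 text, and holding every line of a 400MB file in memory may require a larger heap (for example via -Xmx).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for this example only.
class ReadVsScanTiming {

    // Phase 1: read every line into memory without searching.
    // Assumes the file is UTF-8 text (an assumption, not stated in the question).
    static List<String> readAllLines(Path path) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    // Phase 2: search lines that are already in memory.
    static long countMatchingLines(List<String> lines, String needle) {
        long count = 0;
        for (String line : lines) {
            if (line.contains(needle)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        Path path = Paths.get(args.length > 0 ? args[0] : "fileName.txt");
        String needle = args.length > 1 ? args[1] : "particularString";

        long t0 = System.nanoTime();
        List<String> lines = readAllLines(path);
        long t1 = System.nanoTime();
        long count = countMatchingLines(lines, needle);
        long t2 = System.nanoTime();

        System.out.println("Read time: " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("Scan time: " + (t2 - t1) / 1_000_000 + " ms");
        System.out.println("Matching lines: " + count);
    }
}
```

Run it a few times before trusting the numbers, since the operating system's file cache and JIT warm-up can make the first run unrepresentative.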

If your results are dominated by the read time (and assuming you read the file properly, so channels or at the very least buffered readers), there's not much to do.

If it's the scanning time that dominates, you could read all lines and then ship small batches of lines to be searched into a work queue, where you could have multiple threads picking up line batches and searching in them.
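
A minimal sketch of that batching idea follows, using an ExecutorService as the work queue. The class name BatchedSearch, its method names, and the batch size are assumptions chosen for illustration, and the file is assumed to be UTF-8 text; this is one possible arrangement rather than the answerer's own code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical class for this example only.
class BatchedSearch {

    // Counts the lines in one batch that contain the needle.
    static long countInBatch(List<String> batch, String needle) {
        long count = 0;
        for (String line : batch) {
            if (line.contains(needle)) {
                count++;
            }
        }
        return count;
    }

    // Hands one full batch to the pool; the pool's internal queue is the work queue.
    static Future<Long> submit(ExecutorService pool, List<String> batch, String needle) {
        Callable<Long> job = () -> countInBatch(batch, needle);
        return pool.submit(job);
    }

    // One thread (the caller) reads lines; worker threads search batches as they arrive.
    static long countMatchingLines(Path path, String needle, int threads, int batchSize)
            throws IOException, InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            List<Future<Long>> results = new ArrayList<>();
            List<String> batch = new ArrayList<>(batchSize);
            String line;
            while ((line = reader.readLine()) != null) {
                batch.add(line);
                if (batch.size() == batchSize) {
                    results.add(submit(pool, batch, needle));
                    batch = new ArrayList<>(batchSize); // start a fresh batch
                }
            }
            if (!batch.isEmpty()) {
                results.add(submit(pool, batch, needle)); // last, partial batch
            }
            long total = 0;
            for (Future<Long> result : results) {
                total += result.get(); // blocks until that batch has been searched
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        Path path = Paths.get(args.length > 0 ? args[0] : "fileName.txt");
        String needle = args.length > 1 ? args[1] : "particularString";
        int threads = Runtime.getRuntime().availableProcessors();
        // 10,000 lines per batch is an arbitrary starting point, not a tuned value.
        System.out.println(countMatchingLines(path, needle, threads, 10_000));
    }
}
```

Note that the pool's queue here is unbounded, so if searching is slower than reading, unsearched batches pile up in memory. A bounded queue would be one way to limit that.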

Ballpark figures:

  • Assuming 50 MB/sec as the hard drive read speed (and that's slow by modern standards), you should be able to read the entire 400 MB file into memory in under 10 seconds (400 MB ÷ 50 MB/sec = 8 seconds).
  • MD5 hashing speed benchmarks (example here) show that the hashing rate can be at least as fast as (and often faster than) disk read speed. Also, string searching is faster, simpler, and parallelizes better than hashing.

Given those two estimates, I think a proper implementation can easily land you a run time on the order of 10 seconds (if you start kicking off search jobs as you read line batches), largely dominated by your disk read time.

Answer by mtj

Scanner is simply not useful in this case. Under the hood, it does all kinds of input parsing, checking, caching and whatnot. If your case is simply "iterate over all lines of a file", use something that is based on a simple BufferedReader.

In your particular case, I recommend using Files.lines.

Example:

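  import java.nio.file.Files;
  import java.nio.file.Paths;
  import java.util.stream.Stream;

  // Files.lines keeps the file open, so close the stream with try-with-resources.
  // Files.lines throws IOException, which the enclosing method must handle or declare.
  try (Stream<String> lines = Files.lines(Paths.get("testfile.txt"))) {
      long count = lines.filter(s -> s.contains("particularString")).count();
      System.out.println(count);
  }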

(Note that this particular case of the streaming api probably does not cover what you are actually trying to achieve - unfortunately your question does not indicate what the result of the method should be.)

On my system, Files.lines() or a buffered reader takes about 15% of the Scanner runtime.

Answer by Mindaugas Nakro?is

Use a method of the Scanner object: findWithinHorizon. Scanner will internally make a FileChannel to read the file, and for pattern matching it will end up using a Boyer-Moore algorithm for efficient string searching.
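
A minimal sketch of how findWithinHorizon could be used for this follows. The class name HorizonSearch and the method countOccurrences are made up for this example. Unlike the line-based snippets in the question, this counts every occurrence of the string rather than the number of lines containing it.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;
import java.util.regex.Pattern;

// Hypothetical class for this example only.
class HorizonSearch {

    // Counts non-overlapping occurrences of a literal string in the scanner's input.
    static long countOccurrences(Scanner scanner, String needle) {
        // Pattern.quote makes the needle match literally rather than as a regex.
        Pattern pattern = Pattern.compile(Pattern.quote(needle));
        long count = 0;
        // A horizon of 0 means no bound: search up to the end of the input.
        // Each call resumes after the previous match and returns null when none is left.
        while (scanner.findWithinHorizon(pattern, 0) != null) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws FileNotFoundException {
        File file = new File(args.length > 0 ? args[0] : "fileName.txt");
        String needle = args.length > 1 ? args[1] : "particularString";
        try (Scanner scanner = new Scanner(file)) {
            System.out.println(countOccurrences(scanner, needle));
        }
    }
}
```

With an unbounded horizon the Scanner may need to buffer a lot of input between matches, so it is worth measuring memory use and run time on the real file before settling on this approach.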
