
Disclaimer: this page is a mirror of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/736556/

Date: 2020-10-29 13:35:13 · Source: igfitidea

Binary search in a sorted (memory-mapped ?) file in Java

Tags: java, nio, large-files, binary-search, memory-mapping

Asked by sds

I am struggling to port a Perl program to Java, and learning Java as I go. A central component of the original program is a Perl module that does string prefix lookups in a +500 GB sorted text file using binary search (essentially, "seek" to a byte offset in the middle of the file, backtrack to the nearest newline, compare the line prefix with the search string, "seek" to half/double that byte offset, repeat until found...)


I have experimented with several database solutions but found that nothing beats this in sheer lookup speed with data sets of this size. Do you know of any existing Java library that implements such functionality? Failing that, could you point me to some idiomatic example code that does random access reads in text files?


Alternatively, I am not familiar with the new (?) Java I/O libraries but would it be an option to memory-map the 500 GB text file (I'm on a 64-bit machine with memory to spare) and do binary search on the memory-mapped byte array? I would be very interested to hear any experiences you have to share about this and similar problems.

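For reference, the seek-and-backtrack loop described in the question can be sketched in plain Java with RandomAccessFile. This is a minimal sketch, not the Perl module's actual code: the class and method names are made up, and it assumes single-byte (ASCII) lines terminated by '\n'. It returns the first line with the given prefix, or null.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class PrefixSearchSketch {

    // Binary search over byte offsets: every probe is mapped to the start of
    // the line containing it, giving a lower-bound search over whole lines.
    static String search(RandomAccessFile raf, String prefix) throws IOException {
        long low = 0, high = raf.length();
        while (low < high) {
            long mid = (low + high) / 2;
            raf.seek(backtrackToLineStart(raf, mid));
            String line = raf.readLine(); // strips the trailing newline
            if (line == null || line.compareTo(prefix) >= 0) {
                high = mid;       // this line is at or after the prefix
            } else {
                low = mid + 1;    // this line sorts before the prefix
            }
        }
        raf.seek(backtrackToLineStart(raf, low));
        String line = raf.readLine();
        return (line != null && line.startsWith(prefix)) ? line : null;
    }

    // Walk backwards from pos to the byte just after the previous '\n' (or 0).
    static long backtrackToLineStart(RandomAccessFile raf, long pos) throws IOException {
        while (pos > 0) {
            raf.seek(pos - 1);
            if (raf.readByte() == '\n') break;
            pos--;
        }
        return pos;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("sorted", ".txt");
        Files.write(tmp, "apple\nbanana\ncherry\ndate\n".getBytes(StandardCharsets.US_ASCII));
        try (RandomAccessFile raf = new RandomAccessFile(tmp.toFile(), "r")) {
            System.out.println(search(raf, "cher")); // first line starting with "cher"
            System.out.println(search(raf, "zzz"));  // no match
        }
        Files.delete(tmp);
    }
}
```

The byte-by-byte backtracking is O(line length) per probe; the answers below show faster variants using buffered reads and memory mapping.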

Accepted answer by Stu Thompson

I am a big fan of Java's MappedByteBuffers for situations like this. It is blazing fast. Below is a snippet I put together for you that maps a buffer to the file, seeks to the middle, and then searches backwards to a newline character. This should be enough to get you going?


I have similar code (seek, read, repeat until done) in my own application, benchmarked java.io streams against MappedByteBuffer in a production environment, and posted the results on my blog (Geekomatic posts tagged 'java.nio') with raw data, graphs and all.


Two second summary? My MappedByteBuffer-based implementation was about 275% faster. YMMV.


To work with files larger than ~2GB, which is a problem because of the cast and .position(int pos), I've crafted a paging algorithm backed by an array of MappedByteBuffers. You'll need to be working on a 64-bit system for this to work with files larger than 2-4GB, because MBBs use the OS's virtual memory system to work their magic.


import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;

import static java.nio.channels.FileChannel.MapMode.READ_ONLY;

public class StusMagicLargeFileReader  {
    private static final long PAGE_SIZE = Integer.MAX_VALUE;
    private final List<MappedByteBuffer> buffers = new ArrayList<MappedByteBuffer>();
    private final byte raw[] = new byte[1];

    public static void main(String[] args) throws IOException {
        File file = new File("/Users/stu/test.txt");
        FileChannel fc = (new FileInputStream(file)).getChannel(); 
        StusMagicLargeFileReader buffer = new StusMagicLargeFileReader(fc);
        long position = file.length() / 2;
        String candidate = buffer.getString(position--);
        // compare against the String "\n", not the char '\n':
        // String.equals(Character) is always false
        while (position >= 0 && !candidate.equals("\n")) 
            candidate = buffer.getString(position--);
        //have newline position or start of file...do other stuff    
    }
    StusMagicLargeFileReader(FileChannel channel) throws IOException {
        long start = 0, length = 0;
        for (long index = 0; start + length < channel.size(); index++) {
            if ((channel.size() / PAGE_SIZE) == index)
                length = (channel.size() - index * PAGE_SIZE);
            else
                length = PAGE_SIZE;
            start = index * PAGE_SIZE;
            buffers.add((int) index, channel.map(READ_ONLY, start, length));
        }    
    }
    public String getString(long bytePosition) {
        int page  = (int) (bytePosition / PAGE_SIZE);
        int index = (int) (bytePosition % PAGE_SIZE);
        raw[0] = buffers.get(page).get(index);
        return new String(raw);
    }
}

Answer by Stu Thompson

I have the same problem. I am trying to find all lines that start with some prefix in a sorted file.


Here is a method I cooked up which is largely a port of Python code found here: http://www.logarithmic.net/pfh/blog/01186620415


I have tested it but not thoroughly just yet. It does not use memory mapping, though.


public static List<String> binarySearch(String filename, String string) {
    List<String> result = new ArrayList<String>();
    try {
        File file = new File(filename);
        RandomAccessFile raf = new RandomAccessFile(file, "r");

        long low = 0;
        long high = file.length();

        long p = -1;
        while (low < high) {
            long mid = (low + high) / 2;
            p = mid;
            while (p >= 0) {
                raf.seek(p);

                char c = (char) raf.readByte();
                //System.out.println(p + "\t" + c);
                if (c == '\n')
                    break;
                p--;
            }
            if (p < 0)
                raf.seek(0);
            String line = raf.readLine();
            //System.out.println("-- " + mid + " " + line);
            if (line.compareTo(string) < 0)
                low = mid + 1;
            else
                high = mid;
        }

        p = low;
        while (p >= 0) {
            raf.seek(p);
            if (((char) raf.readByte()) == '\n')
                break;
            p--;
        }

        if (p < 0)
            raf.seek(0);

        while (true) {
            String line = raf.readLine();
            if (line == null || !line.startsWith(string))
                break;
            result.add(line);
        }

        raf.close();
    } catch (IOException e) {
        System.out.println("IOException:");
        e.printStackTrace();
    }
    return result;
}

Answer by dmeister

I am not aware of any library that has that functionality. However, correct code for an external binary search in Java should look similar to this:


import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Comparator;

class ExternalBinarySearch {
    final RandomAccessFile file;
    final Comparator<String> test; // compares the search parameter with a line; insert a PrefixComparator here

    public ExternalBinarySearch(File f, Comparator<String> test) throws FileNotFoundException {
        this.file = new RandomAccessFile(f, "r");
        this.test = test;
    }

    public String search(String element) throws IOException {
        long l = file.length();
        return search(element, -1, l - 1);
    }

    /**
     * Searches for the given element in the range [low, high]. The low value of -1 is a
     * special case denoting the beginning of the file: in contrast to every other line,
     * a line at the beginning of the file is not directly preceded by a \n.
     */
    private String search(String element, long low, long high) throws IOException {
        if (high - low < 1024) {
            // search directly
            long p = low;
            while (p < high) {
                String line = nextLine(p);
                int r = test.compare(line, element);
                if (r > 0) {
                    return null;
                } else if (r < 0) {
                    p += line.length();
                } else {
                    return line;
                }
            }
            return null;
        } else {
            long m = low + ((high - low) / 2);
            String line = nextLine(m);
            int r = test.compare(line, element);
            if (r > 0) {
                return search(element, low, m);
            } else if (r < 0) {
                return search(element, m, high);
            } else {
                return line;
            }
        }
    }

    private String nextLine(long low) throws IOException {
        if (low == -1) { // beginning of file
            file.seek(0);
        } else {
            file.seek(low);
        }
        int bufferLength = 65 * 1024;
        byte[] buffer = new byte[bufferLength];
        int r = file.read(buffer);
        int lineBeginIndex = -1;

        // search for the beginning of the line
        if (low == -1) { // beginning of file
            lineBeginIndex = 0;
        } else {
            // normal mode: a line must begin within the next 1024 bytes
            for (int i = 0; i < 1024; i++) {
                if (buffer[i] == '\n') {
                    lineBeginIndex = i + 1;
                    break;
                }
            }
        }
        if (lineBeginIndex == -1) {
            // no line begins within the next 1024 bytes
            return null;
        }
        for (int i = lineBeginIndex; i < r; i++) {
            if (buffer[i] == '\n') {
                // found the end of the line
                return new String(buffer, lineBeginIndex, i - lineBeginIndex + 1);
            }
        }
        throw new IllegalArgumentException("Line too long");
    }
}

Please note: I made up this code ad hoc: corner cases are not tested nearly well enough, the code assumes that no single line is larger than 64K, etc.


I also think that building an index of the offsets where lines start might be a good idea. For a 500 GB file, that index should be stored in an index file. You should gain a not-so-small constant factor with that index, because then there is no need to search for the next line in each step.

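A minimal sketch of that index idea (made-up names; the index is kept in memory here for brevity, whereas for a 500 GB file the long[] would be built once and persisted to an index file). With line-start offsets at hand, the binary search runs over line numbers and no backtracking to the previous newline is needed:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class LineIndexSketch {

    // One pass over the file: record the byte offset where each line starts.
    static long[] buildIndex(RandomAccessFile raf) throws IOException {
        List<Long> offsets = new ArrayList<Long>();
        raf.seek(0);
        long pos = 0;
        while (raf.readLine() != null) {
            offsets.add(pos);
            pos = raf.getFilePointer();
        }
        long[] index = new long[offsets.size()];
        for (int i = 0; i < index.length; i++) index[i] = offsets.get(i);
        return index;
    }

    // Lower-bound binary search over line numbers; returns the first line
    // with the given prefix, or null.
    static String search(RandomAccessFile raf, long[] index, String prefix) throws IOException {
        int low = 0, high = index.length;
        while (low < high) {
            int mid = (low + high) >>> 1;
            raf.seek(index[mid]);
            String line = raf.readLine();
            if (line.compareTo(prefix) < 0) low = mid + 1;
            else high = mid;
        }
        if (low == index.length) return null;
        raf.seek(index[low]);
        String line = raf.readLine();
        return line.startsWith(prefix) ? line : null;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("sorted", ".txt");
        Files.write(tmp, "alpha\nbeta\ngamma\n".getBytes(StandardCharsets.US_ASCII));
        try (RandomAccessFile raf = new RandomAccessFile(tmp.toFile(), "r")) {
            long[] index = buildIndex(raf);
            System.out.println(search(raf, index, "ga")); // first line starting with "ga"
        }
        Files.delete(tmp);
    }
}
```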

I know that was not the question, but building a prefix-tree data structure like (Patricia) tries (on disk/SSD) might be a good idea for doing the prefix search.


Answer by Rick C. Petty

This is a simple example of what you want to achieve. I would probably first index the file, keeping track of the file position for each string. I'm assuming the strings are separated by newlines (or carriage returns):


    RandomAccessFile file = new RandomAccessFile("filename.txt", "r");
    List<Long> indexList = new ArrayList<Long>();
    long pos = 0;
    while (file.readLine() != null)
    {
        Long linePos = new Long(pos);
        indexList.add(linePos);
        pos = file.getFilePointer();
    }
    int indexSize = indexList.size();
    Long[] indexArray = new Long[indexSize];
    indexList.toArray(indexArray);

The last step is to convert to an array for a slight speed improvement when doing lots of lookups. I would probably also convert the Long[] to a long[], but I did not show that above. Finally, the code to read the string from a given indexed position:


    int i; // Initialize this appropriately for your algorithm.
    file.seek(indexArray[i]);
    String line = file.readLine();
    // At this point, line contains the string #i.

Answer by Larry Watanabe

If you are dealing with a 500GB file, then you might want to use a faster lookup method than binary search - namely a radix sort which is essentially a variant of hashing. The best method for doing this really depends on your data distributions and types of lookup, but if you are looking for string prefixes there should be a good way to do this.


I posted an example of a radix sort solution for integers, but you can use the same idea - basically to cut down the sort time by dividing the data into buckets, then using O(1) lookup to retrieve the bucket of data that is relevant.

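One way to apply that bucketing idea to string prefixes, sketched in Java rather than VB (hypothetical class, in-memory for brevity): a 256-entry table keyed on the first byte gives an O(1) jump to the relevant slice of the sorted data, and binary search then runs only inside that slice. It assumes sorted, non-empty, single-byte-encoded lines.

```java
public class BucketedPrefixLookup {
    private final String[] lines;             // sorted lines; in memory only for this sketch
    private final int[] start = new int[257]; // start[b]..start[b+1] is the slice of lines whose first byte is b

    BucketedPrefixLookup(String[] sortedLines) {
        this.lines = sortedLines;
        int[] count = new int[256];
        for (String s : sortedLines) count[s.charAt(0) & 0xFF]++;
        // prefix sums: turn per-byte counts into slice boundaries
        for (int b = 0; b < 256; b++) start[b + 1] = start[b] + count[b];
    }

    // O(1) bucket lookup, then lower-bound binary search inside that bucket only.
    String lookup(String prefix) {
        int b = prefix.charAt(0) & 0xFF;
        int lo = start[b], hi = start[b + 1];
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (lines[mid].compareTo(prefix) < 0) lo = mid + 1;
            else hi = mid;
        }
        return (lo < start[b + 1] && lines[lo].startsWith(prefix)) ? lines[lo] : null;
    }

    public static void main(String[] args) {
        BucketedPrefixLookup t =
            new BucketedPrefixLookup(new String[] {"ant", "bear", "bee", "cat"});
        System.out.println(t.lookup("be")); // first line in the 'b' bucket with prefix "be"
        System.out.println(t.lookup("z"));  // empty bucket, no match
    }
}
```

For the on-disk case, the same table would hold byte-offset ranges into the sorted file instead of array indices.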

Option Strict On
Option Explicit On

Module Module1

Private Const MAX_SIZE As Integer = 100000
Private m_input(MAX_SIZE) As Integer
Private m_table(MAX_SIZE) As List(Of Integer)
Private m_randomGen As New Random()
Private m_operations As Integer = 0

Private Sub generateData()
    ' fill with random numbers between 0 and MAX_SIZE - 1
    For i = 0 To MAX_SIZE - 1
        m_input(i) = m_randomGen.Next(0, MAX_SIZE - 1)
    Next

End Sub

Private Sub sortData()
    For i As Integer = 0 To MAX_SIZE - 1
        Dim x = m_input(i)
        If m_table(x) Is Nothing Then
            m_table(x) = New List(Of Integer)
        End If
        m_table(x).Add(x)
        ' clearly this is simply going to be MAX_SIZE -1
        m_operations = m_operations + 1
    Next
End Sub

 Private Sub printData(ByVal start As Integer, ByVal finish As Integer)
    If start < 0 Or start > MAX_SIZE - 1 Then
        Throw New Exception("printData - start out of range")
    End If
    If finish < 0 Or finish > MAX_SIZE - 1 Then
        Throw New Exception("printData - finish out of range")
    End If
    For i As Integer = start To finish
        If m_table(i) IsNot Nothing Then
            For Each x In m_table(i)
                Console.WriteLine(x)
            Next
        End If
    Next
End Sub

' run the entire sort, but just print out the first 100 for verification purposes
Private Sub test()
    m_operations = 0
    generateData()
    Console.WriteLine("Time started = " & Now.ToString())
    sortData()
    Console.WriteLine("Time finished = " & Now.ToString & " Number of operations = " & m_operations.ToString())
    ' print out a random 100 segment from the sorted array
    Dim start As Integer = m_randomGen.Next(0, MAX_SIZE - 101)
    printData(start, start + 100)
End Sub

Sub Main()
    test()
    Console.ReadLine()
End Sub

End Module

Answer by mikee805

I posted a gist: https://gist.github.com/mikee805/c6c2e6a35032a3ab74f643a1d0f8249c


It is a rather complete example based on what I found on Stack Overflow and some blogs; hopefully someone else can use it.


import static java.nio.file.Files.isWritable;
import static java.nio.file.StandardOpenOption.READ;
import static org.apache.commons.io.FileUtils.forceMkdir;
import static org.apache.commons.io.IOUtils.closeQuietly;
import static org.apache.commons.lang3.StringUtils.isBlank;
import static org.apache.commons.lang3.StringUtils.trimToNull;

import java.io.File;
import java.io.IOException;
import java.nio.Buffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;

public class FileUtils {

    private FileUtils() {
    }

    private static boolean found(final String candidate, final String prefix) {
        return isBlank(candidate) || candidate.startsWith(prefix);
    }

    private static boolean before(final String candidate, final String prefix) {
        return prefix.compareTo(candidate.substring(0, prefix.length())) < 0;
    }

    public static MappedByteBuffer getMappedByteBuffer(final Path path) {
        FileChannel fileChannel = null;
        try {
            fileChannel = FileChannel.open(path, READ);
            return fileChannel.map(FileChannel.MapMode.READ_ONLY, 0, fileChannel.size()).load();
        } 
        catch (Exception e) {
            throw new RuntimeException(e);
        }
        finally {
            closeQuietly(fileChannel);
        }
    }

    public static String binarySearch(final String prefix, final MappedByteBuffer buffer) {
        if (buffer == null) {
            return null;
        }
        try {
            long low = 0;
            long high = buffer.limit();
            while (low < high) {
                int mid = (int) ((low + high) / 2);
                final String candidate = getLine(mid, buffer);
                if (found(candidate, prefix)) {
                    return trimToNull(candidate);
                } 
                else if (before(candidate, prefix)) {
                    high = mid;
                } 
                else {
                    low = mid + 1;
                }
            }
        } 
        catch (Exception e) {
            throw new RuntimeException(e);
        } 
        return null;
    }

    private static String getLine(int position, final MappedByteBuffer buffer) {
        // search backwards to the find the proceeding new line
        // then search forwards again until the next new line
        // return the string in between
        final StringBuilder stringBuilder = new StringBuilder();
        // walk it back
        char candidate = (char)buffer.get(position);
        while (position > 0 && candidate != '\n') {
            candidate = (char)buffer.get(--position);
        }
        // we either are at the beginning of the file or a new line
        if (position == 0) {
            // we are at the beginning at the first char
            candidate = (char)buffer.get(position);
            stringBuilder.append(candidate);
        }
        // there is/are char(s) after new line / first char
        if (isInBuffer(buffer, position)) {
            //first char after new line
            candidate = (char)buffer.get(++position);
            stringBuilder.append(candidate);
            //walk it forward
            while (isInBuffer(buffer, position) && candidate != ('\n')) {
                candidate = (char)buffer.get(++position);
                stringBuilder.append(candidate);
            }
        }
        return stringBuilder.toString();
    }

    private static boolean isInBuffer(final Buffer buffer, int position) {
        return position + 1 < buffer.limit();
    }

    public static File getOrCreateDirectory(final String dirName) { 
        final File directory = new File(dirName);
        try {
            forceMkdir(directory);
            isWritable(directory.toPath());
        } 
        catch (IOException e) {
            throw new RuntimeException(e);
        }
        return directory;
    }
}

Answer by Karry

I had similar problem, so I created (Scala) library from solutions provided in this thread:


https://github.com/avast/BigMap


It contains utilities for sorting a huge file and for binary search in the sorted file...


Answer by Eddie

If you truly want to try memory-mapping the file, I found a tutorial on how to use memory mapping in Java NIO.

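The core of memory-mapping a file with Java NIO is only a few lines. A minimal self-contained sketch (made-up names, mapping the whole file read-only and decoding it as ASCII):

```java
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MapDemo {

    // Map an entire file read-only and return its contents as ASCII text.
    static String readAll(Path path) throws Exception {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            byte[] bytes = new byte[buf.limit()];
            buf.get(bytes); // reads come straight from the page cache, no explicit I/O calls
            return new String(bytes, StandardCharsets.US_ASCII);
        }
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("demo", ".txt");
        Files.write(tmp, "hello mapped world\n".getBytes(StandardCharsets.US_ASCII));
        System.out.print(readAll(tmp));
        Files.delete(tmp);
    }
}
```

Note that a single map() call is capped at Integer.MAX_VALUE bytes, which is why the accepted answer above pages a 500 GB file through an array of MappedByteBuffers.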