Java: how to perform a binary search of a text file

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/10010151/

Time: 2020-10-30 23:14:50  Source: igfitidea

How to perform a binary search of a text file

Tags: java, android, binary-search

Asked by Beno

I have a big text file (5 MB) that I use in my Android application. I create the file as a list of pre-sorted Strings, and the file doesn't change once it is created. How can I perform a binary search on the contents of this file without reading it line by line to find the matching String?

Accepted answer by unholysampler

Since the content of the file does not change, you can break the file into multiple pieces, say A-G, H-N, O-T and U-Z. This lets you check the first character and immediately cut the candidate set to roughly a fourth of its original size. A linear search over one piece then takes far less time, or reading that whole piece into memory becomes an option. If n/4 is still too large, the process can be extended, but the idea is the same: build the search breakdown into the file structure instead of trying to do it all in memory.
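
As a rough illustration of the splitting idea, here is a minimal sketch. The class name `BucketedFile`, the `bucketN.txt` file names and the exact range boundaries are assumptions made for this example, not part of the answer. It assumes one string per line, and it uses java.nio.file for brevity, which on Android requires a recent API level.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: splits a word list into four smaller files keyed by
// the first character, then searches only the file that can hold the target.
class BucketedFile {

    // Bucket index for a string: 0 = up to G, 1 = H-N, 2 = O-T, 3 = U and above.
    static int bucketOf(String s) {
        char c = s.isEmpty() ? 'A' : Character.toUpperCase(s.charAt(0));
        if (c <= 'G') return 0;
        if (c <= 'N') return 1;
        if (c <= 'T') return 2;
        return 3;
    }

    // Assumed naming scheme for the bucket files.
    static Path bucketPath(Path dir, int bucket) {
        return dir.resolve("bucket" + bucket + ".txt");
    }

    // One-time preprocessing: copy each line of the source into its bucket file.
    static void split(Path source, Path dir) throws IOException {
        BufferedWriter[] out = new BufferedWriter[4];
        try {
            for (int i = 0; i < out.length; i++) {
                out[i] = Files.newBufferedWriter(bucketPath(dir, i), StandardCharsets.UTF_8);
            }
            try (BufferedReader in = Files.newBufferedReader(source, StandardCharsets.UTF_8)) {
                String line;
                while ((line = in.readLine()) != null) {
                    BufferedWriter w = out[bucketOf(line)];
                    w.write(line);
                    w.newLine();
                }
            }
        } finally {
            for (BufferedWriter w : out) {
                if (w != null) w.close();
            }
        }
    }

    // Lookup: a linear scan, but only over the one bucket chosen by the first character.
    static boolean contains(Path dir, String target) throws IOException {
        Path bucket = bucketPath(dir, bucketOf(target));
        try (BufferedReader in = Files.newBufferedReader(bucket, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.equals(target)) return true;
            }
        }
        return false;
    }
}
```

The split runs once when the data file is built; each lookup afterwards opens a single bucket. If the buckets are still too large, `bucketOf` could key on the first two characters instead.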

Answer by wattostudios

A 5MB file isn't that big - you should be able to read each line into a String[] array, which you can then search with java.util.Arrays.binarySearch() to find the line you want. This is my recommended approach.
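
A minimal sketch of that in-memory approach follows. The class name `InMemorySearch` and its method names are made up for illustration. It assumes the file is sorted in `String.compareTo` order, which is the ordering `Arrays.binarySearch` relies on, and it uses java.nio.file for brevity.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: loads the sorted file once, then searches in memory.
class InMemorySearch {

    // Reads every line into an array. The lines must already be sorted in
    // String.compareTo order, otherwise binarySearch results are undefined.
    static String[] load(Path file) throws IOException {
        List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
        return lines.toArray(new String[0]);
    }

    // binarySearch returns a non-negative index only when the key is present.
    static boolean contains(String[] sortedLines, String target) {
        return Arrays.binarySearch(sortedLines, target) >= 0;
    }
}
```

Loading once and keeping the array around means each later lookup costs only a handful of string comparisons.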

If you don't want to read the whole file into your app, then it gets more complicated. If each line of the file is the same length, and the file is already sorted, then you can open the file with a RandomAccessFile and perform a binary search yourself by using seek(), like this...

// open the file for reading
RandomAccessFile raf = new RandomAccessFile("myfile.txt", "r");
String searchValue = "myline";
int lineSize = 50; // bytes per line, including any padding and the line terminator
long numberOfLines = raf.length() / lineSize; // raf.length() returns a long

// perform the binary search...
byte[] lineBuffer = new byte[lineSize];
long bottom = 0;
long top = numberOfLines - 1; // index of the last line
boolean found = false;
while (bottom <= top) {
  long middle = (bottom + top) / 2;
  raf.seek(middle * lineSize); // jump to this line in the file
  raf.readFully(lineBuffer); // read the whole line from the file
  // convert the line to a String; trim() assumes the lines are padded with
  // whitespace and drops that padding along with the line terminator
  String line = new String(lineBuffer).trim();

  int comparison = line.compareTo(searchValue);
  if (comparison == 0) {
    // found it
    found = true;
    break;
  } else if (comparison < 0) {
    // line comes before searchValue
    bottom = middle + 1;
  } else {
    // line comes after searchValue
    top = middle - 1;
  }
}

raf.close(); // close the file when you're finished

However, if the file doesn't have fixed-width lines, then you can't easily perform a binary search without loading it into memory first, as you can't quickly jump to a specific line in the file like you can with fixed-width lines.

Answer by M.J. Rayburn

In a text file where every character has the same length, you could seek to the middle of the interval in question character-wise, read characters until you hit your delimiter, then use the subsequent string as an approximation for the element-wise middle. The problem with doing this in Android, though, is that you apparently can't get random access to a resource (although I suppose you could just reopen it every time). Furthermore, this technique doesn't generalize to maps and sets of other types.
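
The seek-then-skip-to-delimiter idea can be sketched as below. This is one possible reading of the technique, not the author's code. The class name `SortedLineFile` is made up, and the sketch assumes a plain file on disk (not an Android resource), single-byte ASCII content, newline-terminated lines, and a file sorted in `String.compareTo` order. Instead of tracking element indices, it binary-searches over byte positions for the smallest position whose next full line is not less than the target.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical helper: binary search over byte offsets in a sorted text file
// whose lines have different lengths.
class SortedLineFile {

    // Returns the offset of the first line that starts at or after pos and
    // leaves the file pointer there.
    private static long lineStart(RandomAccessFile raf, long pos) throws IOException {
        if (pos == 0) {
            raf.seek(0);
            return 0;
        }
        raf.seek(pos - 1);
        raf.readLine(); // consume the rest of the line that covers byte pos - 1
        return raf.getFilePointer();
    }

    // True if the first line at or after pos is >= target, or if no line starts there.
    private static boolean atOrPast(RandomAccessFile raf, long pos, String target) throws IOException {
        long start = lineStart(raf, pos);
        if (start >= raf.length()) return true;
        return raf.readLine().compareTo(target) >= 0;
    }

    static boolean contains(String fileName, String target) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(fileName, "r")) {
            long lo = 0;
            long hi = raf.length();
            // Find the smallest byte position for which atOrPast is true.
            while (lo < hi) {
                long mid = (lo + hi) >>> 1;
                if (atOrPast(raf, mid, target)) {
                    hi = mid;
                } else {
                    lo = mid + 1;
                }
            }
            // The line starting at or after lo is the first one >= target.
            long start = lineStart(raf, lo);
            if (start >= raf.length()) return false;
            return target.equals(raf.readLine());
        }
    }
}
```

`RandomAccessFile.readLine()` reads one byte at a time and maps each byte to one char, which is why the sketch restricts itself to ASCII. The number of probes grows with the logarithm of the file size in bytes, so a 5 MB file needs a little over twenty of them.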

Another option would be to (using a RandomAccessFile) write an "array" of ints - one for each String - at the beginning of the file, then go back and update them with the locations of their corresponding Strings. Again, the search will require jumping around.
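
A minimal sketch of that offset-array layout is shown here. The class name `IndexedStringFile`, the leading count field and the use of `writeUTF` for the strings are assumptions chosen for the example. The input list is assumed to be sorted in `String.compareTo` order.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.List;

// Hypothetical layout: [int count][count x int offset][strings via writeUTF]
class IndexedStringFile {

    static void write(String fileName, List<String> sorted) throws IOException {
        try (RandomAccessFile out = new RandomAccessFile(fileName, "rw")) {
            out.setLength(0); // start from an empty file
            out.writeInt(sorted.size());
            for (int i = 0; i < sorted.size(); i++) {
                out.writeInt(-1); // placeholder slot for string i
            }
            for (int i = 0; i < sorted.size(); i++) {
                int offset = (int) out.getFilePointer();
                out.writeUTF(sorted.get(i));
                long end = out.getFilePointer();
                out.seek(4 + 4L * i); // go back and fill in this string's slot
                out.writeInt(offset);
                out.seek(end); // return to the end to append the next string
            }
        }
    }

    static boolean contains(String fileName, String target) throws IOException {
        try (RandomAccessFile in = new RandomAccessFile(fileName, "r")) {
            int lo = 0;
            int hi = in.readInt() - 1;
            while (lo <= hi) {
                int mid = (lo + hi) >>> 1;
                in.seek(4 + 4L * mid); // slot holding the offset of string mid
                in.seek(in.readInt()); // jump to the string itself
                int cmp = in.readUTF().compareTo(target);
                if (cmp == 0) return true;
                if (cmp < 0) {
                    lo = mid + 1;
                } else {
                    hi = mid - 1;
                }
            }
            return false;
        }
    }
}
```

Each probe costs two seeks, one into the offset array and one to the string. Because the offsets are stored as ints, this layout only suits files under 2 GB, and `writeUTF` limits each string to 65535 encoded bytes.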

What I would do (and did do in my own app) is implement a hash set in a file. This one does separate chaining with trees.

import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedList;
import java.util.Set;

class StringFileSet {

    private static final double loadFactor = 0.75;

    public static void makeFile(String fileName, String comment, Set<String> set) throws IOException {
        new File(fileName).delete();
        RandomAccessFile fout = new RandomAccessFile(fileName, "rw");

        //Write comment
        fout.writeUTF(comment);

        //Make bucket array
        int numBuckets = Math.max(1, (int)(set.size()/loadFactor)); //At least one bucket, so contains() never computes a hash modulo zero

        ArrayList<ArrayList<String>> bucketArray = new ArrayList<ArrayList<String>>(numBuckets);
        for (int ii = 0; ii < numBuckets; ii++){
            bucketArray.add(new ArrayList<String>());
        }

        for (String key : set){
            bucketArray.get(Math.abs(key.hashCode()%numBuckets)).add(key);
        }

        //Sort key lists in preparation for creating trees
        for (ArrayList<String> keyList : bucketArray){
            Collections.sort(keyList);
        }

        //Make queues in preparation for creating trees
        class NodeInfo{

            public final int lower;
            public final int upper;
            public final long callingOffset;

            public NodeInfo(int lower, int upper, long callingOffset){
                this.lower = lower;
                this.upper = upper;
                this.callingOffset = callingOffset;
            }

        }

        ArrayList<LinkedList<NodeInfo>> queueList = new ArrayList<LinkedList<NodeInfo>>(numBuckets);
        for (int ii = 0; ii < numBuckets; ii++){
            queueList.add(new LinkedList<NodeInfo>());
        }

        //Write bucket array
        fout.writeInt(numBuckets);
        for (int index = 0; index < numBuckets; index++){
            queueList.get(index).add(new NodeInfo(0, bucketArray.get(index).size()-1, fout.getFilePointer()));
            fout.writeInt(-1);
        }

        //Write trees
        for (int bucketIndex = 0; bucketIndex < numBuckets; bucketIndex++){
            while (queueList.get(bucketIndex).size() != 0){
                NodeInfo nodeInfo = queueList.get(bucketIndex).poll();
                if (nodeInfo.lower <= nodeInfo.upper){
                    //Set respective pointer in parent node
                    fout.seek(nodeInfo.callingOffset);
                    fout.writeInt((int)(fout.length() - (nodeInfo.callingOffset + 4))); //Distance instead of absolute position so that the get method can use a DataInputStream
                    fout.seek(fout.length());

                    int middle = (nodeInfo.lower + nodeInfo.upper)/2;

                    //Key
                    fout.writeUTF(bucketArray.get(bucketIndex).get(middle));

                    //Left child
                    queueList.get(bucketIndex).add(new NodeInfo(nodeInfo.lower, middle-1, fout.getFilePointer()));
                    fout.writeInt(-1);

                    //Right child
                    queueList.get(bucketIndex).add(new NodeInfo(middle+1, nodeInfo.upper, fout.getFilePointer()));
                    fout.writeInt(-1);
                }
            }
        }

        fout.close();
    }

    private final String fileName;
    private final int numBuckets;
    private final int bucketArrayOffset;

    public StringFileSet(String fileName) throws IOException {
        this.fileName = fileName;

        DataInputStream fin = new DataInputStream(new BufferedInputStream(new FileInputStream(fileName)));

        int numBytes = fin.readUnsignedShort(); //writeUTF stores the byte length as an unsigned 16-bit value
        fin.skipBytes(numBytes);
        this.numBuckets = fin.readInt();
        this.bucketArrayOffset = numBytes + 6;

        fin.close();
    }

    public boolean contains(String key) throws IOException {
        boolean containsKey = false;

        DataInputStream fin = new DataInputStream(new BufferedInputStream(new FileInputStream(this.fileName)));

        fin.skipBytes(4*(Math.abs(key.hashCode()%this.numBuckets)) + this.bucketArrayOffset);

        int distance = fin.readInt();
        while (distance != -1){
            fin.skipBytes(distance);

            String candidate = fin.readUTF();
            if (key.compareTo(candidate) < 0){
                distance = fin.readInt();
            }else if (key.compareTo(candidate) > 0){
                fin.skipBytes(4);
                distance = fin.readInt();
            }else{
                fin.skipBytes(8);
                containsKey = true;
                break;
            }
        }

        fin.close();

        return containsKey;
    }

}

A test program

import java.io.File;
import java.io.IOException;
import java.util.HashSet;

class Test {
    public static void main(String[] args) throws IOException {
        HashSet<String> stringMemorySet = new HashSet<String>();

        stringMemorySet.add("red");
        stringMemorySet.add("yellow");
        stringMemorySet.add("blue");

        StringFileSet.makeFile("stringSet", "Provided under ... included in all copies and derivatives ...", stringMemorySet);
        StringFileSet stringFileSet = new StringFileSet("stringSet");

        System.out.println("orange -> " + stringFileSet.contains("orange"));
        System.out.println("red -> " + stringFileSet.contains("red"));
        System.out.println("yellow -> " + stringFileSet.contains("yellow"));
        System.out.println("blue -> " + stringFileSet.contains("blue"));

        new File("stringSet").delete();

        System.out.println();
    }
}

You'll also need to pass a Context to it if and when you modify it for Android, so that it can access the getResources() method.

You're also probably going to want to stop the Android build tools from compressing the file, which, if you're working with the GUI, can apparently only be done by changing the file's extension to something such as jpg. This made the process about 100 to 300 times faster in my app.

You might also look into giving yourself more memory by using the NDK.

Answer by live-love

Here's something I quickly put together. It uses two files, one with the words, the other with the offsets. The format of the offset file is this: the first 10 bits contain the word size and the last 22 bits contain the offset (the word position; for example, aaah would be 0, abasementable would be 4, etc.). It's encoded in big endian (the Java standard). Hope it helps somebody.
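
To make the 10-bit/22-bit layout concrete, here is a small sketch of packing and unpacking one index entry. The class name `PackedEntry` and its method names are made up for illustration. Only the bit layout comes from the description above.

```java
// Hypothetical helper for the index entry format described above:
// the top 10 bits hold the word length, the low 22 bits hold the offset.
class PackedEntry {

    static final int OFFSET_BITS = 22;
    static final int OFFSET_MASK = (1 << OFFSET_BITS) - 1; // 0x3FFFFF
    static final int MAX_LENGTH = (1 << 10) - 1;           // 1023

    static int pack(int length, int offset) {
        if (length < 0 || length > MAX_LENGTH) {
            throw new IllegalArgumentException("length does not fit in 10 bits: " + length);
        }
        if (offset < 0 || offset > OFFSET_MASK) {
            throw new IllegalArgumentException("offset does not fit in 22 bits: " + offset);
        }
        return (length << OFFSET_BITS) | offset;
    }

    // Unsigned shift, so lengths of 512 or more (sign bit set) stay positive.
    static int length(int packed) {
        return packed >>> OFFSET_BITS;
    }

    static int offset(int packed) {
        return packed & OFFSET_MASK;
    }
}
```

With this layout a word can be at most 1023 bytes long and offsets must stay below 4 MiB (4,194,304), so the word file itself has to be smaller than that unless more bits are given to the offset.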

word.dat:

aaahabasementableabnormalabnormalityabortionistabortion-rightsabracadabra

wordx.dat:

00 80 00 00 01 20 00 04 00 80 00 0D 01 00 00 11   _____ __________
01 60 00 19 01 60 00 24 01 E0 00 2F 01 60 00 3E   _`___`_$___/_`_>

I created these files in C#; here's the code for it (it reads a txt file with words separated by CRLFs):

static void Main(string[] args)
{
    const string fIn = @"C:\projects\droid\WriteFiles\input\allwords.txt";
    const string fwordxOut = @"C:\projects\droid\WriteFiles\output\wordx.dat";
    const string fWordOut = @"C:\projects\droid\WriteFiles\output\word.dat";

    int i = 0;
    int offset = 0;
    int j = 0;
    var lines = File.ReadLines(fIn);

    FileStream stream = new FileStream(fwordxOut, FileMode.Create, FileAccess.ReadWrite);
    using (EndianBinaryWriter wwordxOut = new EndianBinaryWriter(EndianBitConverter.Big, stream))
    {
        using (StreamWriter wWordOut = new StreamWriter(File.Open(fWordOut, FileMode.Create)))
        {
            foreach (var line in lines)
            {
                wWordOut.Write(line);
                i = offset | ((int)line.Length << 22); //first 10 bits to the left is the word size
                offset = offset + (int)line.Length;
                wwordxOut.Write(i);
                //if (j == 7)
                  //  break;
                j++;
            }
        }
    }
}

And this is the Java code for the binary file search:

public static void binarySearch() {
    String TAG = "TEST";
    String wordFilePath = Environment.getExternalStorageDirectory().getAbsolutePath() + "/word.dat";
    String wordxFilePath = Environment.getExternalStorageDirectory().getAbsolutePath() + "/wordx.dat";

    String target = "abracadabra"; 
    boolean targetFound = false; 
    int searchCount = 0; 

    try {
        RandomAccessFile raf = new RandomAccessFile(wordxFilePath, "r");
        RandomAccessFile rafWord = new RandomAccessFile(wordFilePath, "r");
        long low = 0;
        long high = (raf.length() / 4) - 1;
        int cur = 0;
        long wordOffset = 0;
        int len = 0;

        while (high >= low) {
            long mid = (low + high) / 2;
            raf.seek(mid * 4);
            cur = raf.readInt();
            Log.v(TAG + "-cur", String.valueOf(cur));

            len = cur >>> 22; //word length (unsigned shift, so lengths of 512 or more stay positive)

            cur = cur & 0x3FFFFF;  //first 10 bits are 0

            rafWord.seek(cur);
            byte [] bytes = new byte[len];

            wordOffset = rafWord.read(bytes, 0, len);
            Log.v(TAG + "-wordOffset", String.valueOf(wordOffset));

            searchCount++;

            String str = new String(bytes);

            Log.v(TAG, str);

            if (target.compareTo(str) < 0) {
                high = mid - 1;
            } else if (target.compareTo(str) == 0) {
                targetFound = true;
                break;
            } else {
                low = mid + 1;
            }
        }

        raf.close();
        rafWord.close();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }

    if (targetFound == true) {
        Log.v(TAG + "-found " , String.valueOf(searchCount));
    } else {
        Log.v(TAG + "-not found " , String.valueOf(searchCount));
    }

}

Answer by Tatarize

Though it might sound like overkill, don't store data that you need to search this way in a flat file. Put it in a database and query it there. This should be both effective and fast.
