
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/5201228/

Date: 2020-10-30 09:55:32  Source: igfitidea

Faster data structure for searching a string

Tags: java, data-structures

提问by Mariska

I have this code that determines whether a word (ignoring case) is included in a wordList text file. However, the wordList text file may have 65,000+ lines, and searching for just one word with my implementation below takes nearly a minute. Can you think of any better implementation?


Thanks!


import java.io.*;
import java.util.*;

public class WordSearch
{
    LinkedList<String> lxx;
    FileReader fxx;
    BufferedReader bxx;

    public WordSearch(String wordlist)
        throws IOException
    {
        fxx = new FileReader(wordlist);
        bxx = new BufferedReader(fxx);
        lxx = new LinkedList<String>();
        String word;

        while ((word = bxx.readLine()) != null)
        {
            lxx.add(word);
        }

        bxx.close();
    }

    public boolean inTheList(String theWord)
    {
        for (int i = 0; i < lxx.size(); i++)
        {
            if (theWord.compareToIgnoreCase(lxx.get(i)) == 0)
            {
                return true;
            }
        }

        return false;
    }
}

回答by Aasmund Eldhuset

Use a HashSet into which you put a lowercase version of each word. Checking whether a HashSet contains a given string is, on average, a constant-time (read: extremely fast) operation.

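A minimal sketch of that approach (the class name and field names here are illustrative, not from the question):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class WordSet
{
    private final Set<String> words = new HashSet<String>();

    public WordSet(String wordlist) throws IOException
    {
        BufferedReader in = new BufferedReader(new FileReader(wordlist));
        String line;
        while ((line = in.readLine()) != null)
        {
            // Lowercase once at load time, so lookups need no per-element comparison
            words.add(line.toLowerCase());
        }
        in.close();
    }

    public boolean inTheList(String theWord)
    {
        // O(1) on average, versus O(n) for the linear scan in the question
        return words.contains(theWord.toLowerCase());
    }
}
```

The one-time lowercasing at load time is what makes the case-insensitive lookup possible with a plain contains call.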

回答by vanza

Since you're searching, you may want to consider sorting the list before searching; then you can do binary search, which is much faster than linear search. That helps if you'll perform multiple searches on the same list; otherwise the cost of sorting the list isn't worth it for a single search.

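This answer gives no code; a rough sketch of the idea, sorting once with a case-insensitive comparator and then probing with Collections.binarySearch using the same comparator (the class name is mine):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SortedWordList
{
    private final List<String> words;

    public SortedWordList(List<String> input)
    {
        words = new ArrayList<String>(input);
        // One-time O(n log n) cost; the search below MUST use the same comparator
        Collections.sort(words, String.CASE_INSENSITIVE_ORDER);
    }

    public boolean inTheList(String theWord)
    {
        // O(log n) per lookup instead of O(n); a negative result means "not found"
        return Collections.binarySearch(words, theWord, String.CASE_INSENSITIVE_ORDER) >= 0;
    }
}
```

Using String.CASE_INSENSITIVE_ORDER for both the sort and the search keeps the lookup case-insensitive without copying the strings to lowercase.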

Also, doing linear search on a linked list using "lxx.get(i)" is asking for trouble: LinkedList.get() is O(n), so the whole scan becomes O(n^2). You can either use an Iterator (easy way: for (String s : lxx)) or switch to a list type that has O(1) access time, such as ArrayList.


回答by Viktor Dahl

Each search through lxx is an O(n) operation, so this will get quite costly when you have thousands of words. Instead, use a HashSet:


Set<String> lxx;

...

lxx = new HashSet<String>();
while ((word = bxx.readLine()) != null) {
    lxx.add(word.toLowerCase());
}
bxx.close();

and then use lxx.contains(theWord.toLowerCase()) to check whether the word is in the file. Each lookup in the HashSet is an O(1) operation, so the time it takes is (nearly) independent of the size of your file.


回答by TofuBeer

First off, don't declare your variable to be a LinkedList; declare it to be a List (parts of the code not concerned with the List have been deleted):


public class WordSearch 
{
    List<String> lxx;

    public WordSearch(String wordlist) 
        throws IOException
    {
        lxx = new LinkedList<String>();
    }
}

Next, do not call get on the list; with a LinkedList, get will be VERY slow. Instead use an Iterator... better yet, use the new-style for loop, which uses an iterator for you:


    public boolean inTheList (String theWord)
    {
        for(String word : lxx)
        {
            if (theWord.compareToIgnoreCase(word) == 0)
            {
                return true;
            }
        }

        return false;
    }

Next, change the new LinkedList to a new ArrayList:

lxx = new ArrayList<String>();


This code should be faster, but you can still do better.


Since you do not care about duplicate words, use a Set instead of a List; specifically, use a HashSet instead of an ArrayList.


Doing that will speed the program up significantly.


Your original code, using a LinkedList with get, has to walk from the start of the list on every call when searching for the next word. Using the Iterator (via the new-style for-each loop) stops that from happening.


Using a LinkedList means that each time you go to the next word in the list there is a pointer lookup involved; the ArrayList doesn't have that overhead.


Using a HashSet winds up using a hash table, which has very fast lookups.


回答by OscarRyz

Here's my implementation; it searches in under 50 ms.


First you have to load the file and keep it sorted in memory.


You may load it however you want, but loading it in big chunks will be easier.


My input was the Dive Into Python book (downloaded the HTML single-file version) and the Java Language Specification (downloaded the HTML and created a single file out of all the HTML pages).


To combine the list into one big file I used this same program (see the commented-out code).


Once I had a big file with about 300k words, I ran the program with this output:


C:\Users\oreyes\langs\java\search>dir singlelineInput.txt
 El volumen de la unidad C no tiene etiqueta.
 El número de serie del volumen es: 22A8-203B

 Directorio de C:\Users\oreyes\langs\java\search

04/03/2011  09:37 p.m.         3,898,345 singlelineInput.txt
               1 archivos      3,898,345 bytes

C:\Users\oreyes\langs\java\search>javac WordSearch.java

C:\Users\oreyes\langs\java\search>java WordSearch singlelineInput.txt "great"
Loaded 377381 words in 2844 ms
true
in 31 ms

C:\Users\oreyes\langs\java\search>java WordSearch singlelineInput.txt "great"
Loaded 377381 words in 2812 ms
true
in 31 ms

C:\Users\oreyes\langs\java\search>java WordSearch singlelineInput.txt "awesome"
Loaded 377381 words in 2813 ms
false
in 47 ms

C:\Users\oreyes\langs\java\search>gvim singlelineInput.txt

C:\Users\oreyes\langs\java\search>java WordSearch singlelineInput.txt "during"
Loaded 377381 words in 2813 ms
true
in 15 ms

C:\Users\oreyes\langs\java\search>java WordSearch singlelineInput.txt "specification"
Loaded 377381 words in 2875 ms
true
in 47 ms

C:\Users\oreyes\langs\java\search>java WordSearch singlelineInput.txt "<href"
Loaded 377381 words in 2844 ms
false
in 47 ms

C:\Users\oreyes\langs\java\search>java WordSearch singlelineInput.txt "<br>"
Loaded 377381 words in 2829 ms
true
in 15 ms

Always under 50 ms.


Here's the code:


   import java.io.*;
   import java.util.*;

   class WordSearch {
       String inputFile;
       List<String> words;
       public WordSearch(String file ) { 
           inputFile = file;
       }
       public void initialize() throws IOException { 
           long start = System.currentTimeMillis();
           File file = new File( inputFile );
           ByteArrayOutputStream baos = new ByteArrayOutputStream(( int ) file.length());
           FileInputStream in = new FileInputStream( file );
           copyLarge( in, baos, (int)file.length() );

           Scanner scanner = new Scanner( new ByteArrayInputStream(  baos.toByteArray() ));
           words = new LinkedList<String>();
           while( scanner.hasNextLine() ) { 
              String l = scanner.nextLine().trim();
              //for( String s : l.split("\s+")){
                //System.out.println( s );
                words.add( l.toLowerCase() );
              //}
           }

           Collections.sort( words );
           for( String s : words ) { 
               //System.out.println( s );
           }
           System.out.println("Loaded " + words.size() + " words in "+  ( System.currentTimeMillis() - start ) + " ms"  );
       }

       public boolean contains( String aWord ) { 
           return words.contains( aWord.toLowerCase() );
       }
        // taken from:  http://stackoverflow.com/questions/326390/how-to-create-a-java-string-from-the-contents-of-a-file/326413#326413 
        public static long copyLarge(InputStream input, OutputStream output, int size )
               throws IOException {
           byte[] buffer = new byte[size];// something biggie 
           long count = 0;
           int n = 0;
           while (-1 != (n = input.read(buffer))) {
               output.write(buffer, 0, n);
               count += n;
           }
           return count;
       }
       public static void main( String ... args ) throws IOException  { 
           WordSearch ws = new WordSearch( args[0] );
           ws.initialize();
           long start = System.currentTimeMillis();
           System.out.println( ws.contains( args[1] ) );
           System.out.println("in "+  ( System.currentTimeMillis() - start ) +" ms ");

       }
    }

The hard part was to get a sample input :P


回答by OscarRyz

Guess what: using a HashSet, it returns in no time.


Here's the modified version; it always finishes in 0 ms.


   import java.io.*;
   import java.util.*;

   class WordSearch {
       String inputFile;
       //List<String> words;
       Set<String> words;
       public WordSearch(String file ) { 
           inputFile = file;
       }
       public void initialize() throws IOException { 
           long start = System.currentTimeMillis();
           File file = new File( inputFile );
           ByteArrayOutputStream baos = new ByteArrayOutputStream(( int ) file.length());
           FileInputStream in = new FileInputStream( file );
           copyLarge( in, baos, (int)file.length() );

           Scanner scanner = new Scanner( new ByteArrayInputStream(  baos.toByteArray() ));
           words = new HashSet<String>();
           while( scanner.hasNextLine() ) { 
              String l = scanner.nextLine().trim();
              //for( String s : l.split("\s+")){
                //System.out.println( s );
                words.add( l.toLowerCase() );
              //}
           }

           //Collections.sort( words );
           for( String s : words ) { 
               System.out.println( s );
           }
           System.out.println("Loaded " + words.size() + " words in "+  ( System.currentTimeMillis() - start ) + " ms"  );
       }

       public boolean contains( String aWord ) { 
           return words.contains( aWord.toLowerCase() );
       }

        public static long copyLarge(InputStream input, OutputStream output, int size )
               throws IOException {
           byte[] buffer = new byte[size];// something biggie 
           long count = 0;
           int n = 0;
           while (-1 != (n = input.read(buffer))) {
               output.write(buffer, 0, n);
               count += n;
           }
           return count;
       }
       public static void main( String ... args ) throws IOException  { 
           WordSearch ws = new WordSearch( args[0] );
           ws.initialize();
           long start = System.currentTimeMillis();
           System.out.println( ws.contains( args[1] ) );
           System.out.println("in "+  ( System.currentTimeMillis() - start ) +" ms ");

       }
    }

Now I know for sure :)


回答by Tal Fisharov

Two suggestions; both data structures give you better performance:


  1. Directed acyclic word graph (DAWG)
  2. Dictionary data structure: an n-ary tree (trie)
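Neither structure is shown in code in this answer; as a rough sketch, the dictionary-tree idea could be a simple character trie like the following (class name and layout are mine), with lowercasing applied on both insert and lookup to keep it case-insensitive:

```java
import java.util.HashMap;
import java.util.Map;

public class Trie
{
    private static class Node
    {
        Map<Character, Node> children = new HashMap<Character, Node>();
        boolean isWord; // true only if a whole word ends at this node
    }

    private final Node root = new Node();

    public void add(String word)
    {
        Node n = root;
        for (char c : word.toLowerCase().toCharArray())
        {
            Node next = n.children.get(c);
            if (next == null)
            {
                next = new Node();
                n.children.put(c, next);
            }
            n = next;
        }
        n.isWord = true;
    }

    public boolean contains(String word)
    {
        Node n = root;
        for (char c : word.toLowerCase().toCharArray())
        {
            n = n.children.get(c);
            if (n == null)
            {
                return false; // no path for this prefix, so the word is absent
            }
        }
        return n.isWord;
    }
}
```

Lookup cost is O(length of the word), independent of how many words are stored; a DAWG additionally merges shared suffixes to save memory.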