Disclaimer: this page is based on a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/872290/

Most Efficient Way to Find Whether a Large List Contains a Specific String (Python)

Tags: python, string

Asked by Roee Adler

I have a file containing roughly all the words in English (~60k words, ~500k characters). I want to test whether a certain word I receive as input is "in English" (i.e. if this exact word is in the list).

What would be the most efficient way to do this in Python?

The trivial solution is to load the file into a list and check whether the word is in that list. The list can be sorted, which I believe will shrink the complexity to O(log n). However, I'm not sure how Python implements searching through lists, and whether there's a performance penalty if such a large list is in memory. Can I "abuse" the fact that I can put a cap on the length of words? (e.g. say the longest one is 15 characters long).

Please note I run the application on a machine with lots of memory, so I care less about memory consumption than about speed and CPU utilization.

Thanks

Answered by gimel

The Python set type is what you should try.

A set object is an unordered collection of distinct hashable objects. Common uses include membership testing, removing duplicates from a sequence, and computing mathematical operations such as intersection, union, difference, and symmetric difference.

Answered by Roman Zeyde

Sample Python code:

L = ['foo', 'bar', 'baz']  # your list
s = set(L)                 # converted to a set for O(1) membership tests

print('foo' in s)   # True
print('blah' in s)  # False

Answered by Paul Dixon

A trie structure would suit your purposes. There are undoubtedly Python implementations to be found out there...

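By way of illustration, here is a minimal trie sketch; the Trie class and its methods are invented for this example and are not taken from any particular package:

class Trie:
    """A minimal prefix tree supporting insert and exact-word lookup."""
    def __init__(self):
        self.children = {}    # maps a character to a child Trie node
        self.is_word = False  # True if a stored word ends at this node

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def __contains__(self, word):
        node = self
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word

t = Trie()
for w in ('foo', 'bar', 'baz'):
    t.insert(w)
print('foo' in t)  # True
print('fo' in t)   # False (a prefix only, not a stored word)
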
Answered by behindthefall

Two things:

The Python mutable set type has an add method (s.add(item)), so you can go straight from reading a line of your big file into a set, without using a list as an intermediate data structure.

Python lets you pickle a data structure, so you can save your big set to a file and skip the cost of rebuilding the set on every run; a sketch of both points follows below.

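A minimal sketch of both points, assuming a one-word-per-line file named words.txt (the filenames here are placeholders):

import pickle

# Build the set straight from the file, one word per line.
with open('words.txt') as f:
    words = set(line.strip() for line in f)

# Save it so later runs can skip the parsing step.
with open('words.pickle', 'wb') as f:
    pickle.dump(words, f)

# Reload it in a later run.
with open('words.pickle', 'rb') as f:
    words = pickle.load(f)

print('hello' in words)
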
Second, I've been looking for a list of all the single-syllable words in English for my own amusement, but the ones I've found mentions of seem to be proprietary. If it isn't intrusive, could I ask whether your list of English words can be obtained by others?

Answered by SilentGhost

500k characters is not a large list. If the items in your list are unique and you need to do this search repeatedly, use a set, which lowers the lookup complexity to O(1) on average.

Answered by Brian

Others have given you the in-memory way using set(), and this is generally going to be the fastest way, and should not tax your memory for a 60k word dataset (a few MiBs at most). You should be able to construct your set with:

with open('words.txt') as f:
    s = set(word.strip() for word in f)

However, it does require some time to load the set into memory. If you are checking lots of words, this is no problem - the lookup time will more than make up for it. However, if you're only going to check one word per command execution (e.g. a command-line app like "checkenglish [word]"), the startup time will be longer than it would take just to search through the file line by line.

If this is your situation, or you have a much bigger dataset, using an on-disk format may be better. The simplest way would be using the dbm module. Create such a database from a wordlist with:

import dbm

# Build the database once; strip the newline so later lookups match exactly.
with open('wordlist.txt') as f, dbm.open('words.db', 'c') as db:
    for word in f:
        db[word.strip()] = '1'

Then your program can check membership with:

with dbm.open('words.db', 'r') as db:
    if word in db:
        print("%s is english" % word)
    else:
        print("%s is not english" % word)

This will be slower than a set lookup, since there is disk access involved, but faster than a linear search through the file, with low memory use and no significant initialisation time.

There are also other alternatives, such as using a SQL database (e.g. SQLite).

Answered by John Feminella

If memory consumption isn't an issue and the words won't change, the fastest way to do this is to put everything in a hash table and search that way. In Python, this is the set. You'll have constant-time lookup.

Answered by Swaroop C H

You're basically testing whether a member is in a set or not, right?

If so, and because you said you have lots of memory, why not just load all the words as keys in memcache, and then for every word, just check if it is present in memcache or not.

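A rough sketch of the memcache idea, assuming a memcached server on localhost and the third-party python-memcached package (both are assumptions, not part of the original answer):

import memcache

mc = memcache.Client(['127.0.0.1:11211'])

# Load every word once; the value is just a dummy flag.
with open('words.txt') as f:
    for word in f:
        mc.set(word.strip(), 1)

# Later, a membership test is a single cache lookup.
print(mc.get('hello') is not None)
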
Or use that data structure that is used by bash to autocomplete command names - this is fast and highly efficient in memory (can't remember the name).

Answered by Jason Baker

Converting the list to a set will only be helpful if you repeatedly run this kind of query against the data, as will sorting the list and doing a binary search. If you're only going to pull data out of the list once, a plain old linear search is your best bet:

if 'foo' in some_list:
    do_something()

Otherwise, your best bet is to use either a set as has been mentioned or a binary search. Which one you should choose depends largely on how big the data is and how much memory you can spare. I'm told that really large lists tend to benefit more from hashing, although the amount of memory that's taken up can be prohibitively expensive.

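For the binary-search route, the standard-library bisect module does the heavy lifting; a minimal sketch, assuming the word list is already sorted:

import bisect

words = sorted(['bar', 'baz', 'foo'])  # must be sorted for binary search

def contains(sorted_words, word):
    """Binary search: O(log n) lookups without the memory cost of a hash."""
    i = bisect.bisect_left(sorted_words, word)
    return i < len(sorted_words) and sorted_words[i] == word

print(contains(words, 'foo'))   # True
print(contains(words, 'blah'))  # False
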
Finally, a third option is to import the data into a SQLite database and read directly from it. SQLite is very fast and may save you the trouble of loading the whole list from the file. Python has a very good built-in sqlite3 library.

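A minimal sketch using the built-in sqlite3 module; the file and table names are placeholders:

import sqlite3

conn = sqlite3.connect('words.db')
conn.execute('CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY)')

# Load the wordlist once; INSERT OR IGNORE skips duplicates.
with open('words.txt') as f:
    conn.executemany('INSERT OR IGNORE INTO words VALUES (?)',
                     ((line.strip(),) for line in f))
conn.commit()

# Membership test: the PRIMARY KEY index makes this a fast indexed lookup.
cur = conn.execute('SELECT 1 FROM words WHERE word = ?', ('hello',))
print(cur.fetchone() is not None)
conn.close()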