Python: Efficient Dictionary Searching?

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me) and link the original: http://stackoverflow.com/questions/19103785/


Efficient Dictionary Searching?

Tags: python, search, optimization, dictionary

Asked by Brumdog22

I had a quick question about the efficiency of searching through large dictionaries in Python. I am reading a large comma-separated file and getting a key and value from each line. If my key is already in the dictionary, I add the value to the value listed in the dictionary; if the key doesn't exist in the dictionary, I simply add the value. Previously I was using this:


if key in data_dict.keys():
    data_dict[key] = data_dict[key] + value   # accumulate onto the existing value
else:
    data_dict[key] = value

This starts pretty fast, but as the dictionary grows it becomes slower and slower, to the point where I can't use it at all. I changed the way I search for the key in the dictionary to this:


try:
    # This will fail if key not present
    data_dict[keyStr] = input_data[keyStr] + load_val
except:
    data_dict[keyStr] = load_val

This is vastly faster, and can read and process over 350,000 lines in 3 seconds.


My question was: why does the if key in data_dict.keys(): check take so much longer than the call to try: data_dict[keyStr]? And why wouldn't Python utilize the try statement when searching for a key in a dictionary?


Accepted answer by Mark Ransom

The problem is that for every test you're generating a new list of keys with .keys(). As the list of keys gets longer, the time required goes up. Also as noted by dckrooney, the search for the key becomes linear instead of taking advantage of the hash-table structure of the dictionary.


Replace with:


if key in data_dict:
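
Applied to the loop from the question, the fix is only in the membership test (a minimal sketch; whether the values are summed or appended depends on the data, which the question leaves open):

if key in data_dict:
    data_dict[key] = data_dict[key] + value   # hash lookup on the dict itself, no key list built
else:
    data_dict[key] = value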

Answered by wflynny

This doesn't answer the question, but rather avoids it. Try using collections.defaultdict. You won't need the if/else or try/except.


from collections import defaultdict

data_dict = defaultdict(list)
for keyStr, load_val in data:
    data_dict[keyStr].append(load_val)
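
For the file-reading case in the question, the same idea might look like this (a sketch; the file name and the one-key-one-value column layout are assumptions):

from collections import defaultdict

data_dict = defaultdict(list)
with open('input.csv') as f:                 # hypothetical input file
    for line in f:
        keyStr, load_val = line.rstrip('\n').split(',', 1)
        data_dict[keyStr].append(load_val)   # missing keys get a fresh list automatically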

Answered by William Gaul

This is because data_dict.keys() returns a list containing the keys in the dictionary (at least in Python 2.x), and finding whether a key is in that list requires a linear search.


Whereas trying to access an element of the dict directly takes advantage of the awesome properties of dictionaries, so the access is almost instantaneous.

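A quick way to see the difference (a small sketch): on Python 2, keys() materializes a list, while Python 3 returns a lightweight view, and membership on the dict itself is a hash lookup either way.

d = {'a': 1, 'b': 2}
print(type(d.keys()))   # <type 'list'> on Python 2.x, <class 'dict_keys'> on Python 3.x
print('a' in d)         # True -- average O(1) hash lookup, no key list involved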

Answered by SQLesion

There is something akin to the try approach that should help you: dict.get(key, default)


data_dict[keyStr] = data_dict.get(keyStr, '') + load_val
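
One caveat with this snippet: with '' as the default, the first addition is string concatenation; if load_val is numeric (an assumption about the data), a 0 default reads more naturally:

data_dict[keyStr] = data_dict.get(keyStr, 0) + load_val   # assumes numeric load_val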

Answered by dckrooney

data_dict.keys() returns an unsorted list of keys in the dictionary. Thus each time you check to see if a given key is in the dictionary, you're doing a linear search across the list of keys (an O(n) operation). The longer your list, the longer it takes to search for a given key.


Contrast this to data_dict[keyStr]. This performs a hash lookup, which is an O(1) operation. It doesn't (directly) depend on the number of keys in the dictionary; even as you add more keys, the time to check if a given key is in the dictionary stays constant.


Answered by Corley Brigman

You can also simply use


if key in data_dict:

instead of


 if key in data_dict.keys():

As mentioned, the first is a direct hash lookup - the intended offset is computed directly, and then checked - it is roughly O(1), whereas the check of keys is a linear search, which is O(n).


In [258]: data_dict = dict([(x, x) for x in range(100000)])

In [259]: %timeit 999999 in data_dict.keys()
100 loops, best of 3: 3.47 ms per loop

In [260]: %timeit 999999 in data_dict
10000000 loops, best of 3: 49.3 ns per loop

Answered by Christian Heimes

Back in the old days we used setdefault:


data_dict.setdefault(keyStr, []).append(load_val)
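
In the file-reading loop it reads much the same way (a sketch, assuming the values are collected into lists and that rows yields parsed (key, value) pairs):

for keyStr, load_val in rows:                            # rows: parsed (key, value) pairs (assumed)
    data_dict.setdefault(keyStr, []).append(load_val)   # looks up the key once, inserts [] if missing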

Answered by martineau

As several others have noted, the problem lies in the fact that key in data_dict.keys() uses the unordered list returned from the keys() method (in Python 2.x), which takes linear time O(n) to search through. This means the running time increases linearly with the dictionary's size, and generating the list of keys itself takes longer and longer as the size goes up.


On the other hand, key in data_dict only requires constant time O(1), on average, to perform a search regardless of the size of the dictionary, because internally it does a hash-table look-up. In addition, this hash table already exists as part of the internal representation of dictionaries, and therefore doesn't have to be generated before use.


Python doesn't do this automatically because the in operator only knows the type of its two operands, not their sources, so it can't automatically optimize the first case, where all it sees is a key and a list.


However, in this case the issue of search speed can probably be avoided altogether by storing the data in a specialized version of a dictionary called a defaultdict, found in the built-in collections module. Here's how your code might look if you used one:


from collections import defaultdict

input_data = defaultdict(float)  # (guessing factory type)
...
data_dict[keyStr] = input_data[keyStr] + load_val

When there's no preexisting entry for input_data[keyStr], one will be generated automatically with a default value (0.0 for float in this example). As you can see, the code is shorter and very likely faster, all without the need for any if tests or exception handling.

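If the goal is simply to accumulate numeric totals per key, the defaultdict can also be the accumulator itself (a sketch under that assumption; the names are illustrative):

from collections import defaultdict

totals = defaultdict(float)              # missing keys start at 0.0
for keyStr, load_val in rows:            # rows: parsed (key, value) pairs (assumed)
    totals[keyStr] += load_val           # no if test or exception handling needed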

Answered by pegah

As an extra analysis, I did a simple performance test to see how the try/except method mentioned in the question compares with the proposed solution of using "if key in data_dict" instead of "if key in data_dict.keys()" (I am using Python 3.7):


    import timeit

    k = '84782005' # this key exists in the dictionary
    def t1():
        if k in data_dict:
            pass
    def t2():
        if k in data_dict.keys():
            pass
    def t3():
        try:
            a = data_dict[k]
        except:
            pass

    print(timeit.timeit(t1,number= 100000))
    print(timeit.timeit(t2,number= 100000))
    print(timeit.timeit(t3,number= 100000))

    >> 0.01741484600097465
    >> 0.025949209000827977
    >> 0.017266065000512754

For a key that already exists in the dictionary, the search time seems to be about the same for try/except and the proposed solution. However, if the key does not exist:


    k = '8' # this key does NOT exist in the dictionary
    def t1():
        if k in data_dict:
            pass
    def t2():
        if k in data_dict.keys():
            pass
    def t3():
        try:
            a = data_dict[k]
        except:
            pass

    print(timeit.timeit(t1,number= 100000))
    print(timeit.timeit(t2,number= 100000))
    print(timeit.timeit(t3,number= 100000))

    >> 0.014406295998924179
    >> 0.0236777299996902
    >> 0.035819852999338764

Raising the exception seems to take even more time than using '.keys()'! So, I second the solution proposed by Mark.
