Is Pandas DataFrame search linear time or constant time?

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/45240803/

Date: 2020-09-14 04:03:56  Source: igfitidea

Tags: python, pandas, search, dataframe, time-complexity

Asked by Sayan Sil

I have a DataFrame object df of over 15,000 rows, like:

anime_id          name              genre    rating
1234      Kimi no nawa    Romance, Comedy     9.31
5678       Stiens;Gate             Sci-fi     8.92

And I am trying to find the row with a particular anime_id.

a_id = "5678"
temp = (df.query("anime_id == "+a_id).genre)

I just want to know whether this search is done in constant time (like a dictionary lookup) or in linear time (like a list search).
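For reference, the lookup in the question can be written without string concatenation. The following is a minimal sketch, assuming anime_id is stored as an integer column; the two-row frame is a stand-in built from the sample rows above, not the asker's real data.

```python
import pandas as pd

# Tiny stand-in for the question's DataFrame (values copied from the sample rows).
df = pd.DataFrame({
    "anime_id": [1234, 5678],
    "name": ["Kimi no nawa", "Stiens;Gate"],
    "genre": ["Romance, Comedy", "Sci-fi"],
    "rating": [9.31, 8.92],
})

a_id = 5678  # assumption: an integer id rather than the string "5678"

# query() can reference a local Python variable with the @ prefix.
via_query = df.query("anime_id == @a_id").genre

# The equivalent boolean mask; both forms compare every value in the column.
via_mask = df.loc[df["anime_id"] == a_id, "genre"]

print(via_query.tolist())
print(via_mask.tolist())
```

Both forms select the same rows; neither uses the index, which matters for the answers below.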

Accepted answer by MaxU

This is a very interesting question!

I think it depends on the following aspects:

Accessing a single row by index (index is sorted and unique) should have runtime O(m), where m << n_rows.

Accessing a single row by index (index is NOT unique and is NOT sorted) should have runtime O(n_rows).

Accessing a single row by index (index is NOT unique and is sorted) should have runtime O(m), where m < n_rows.

Accessing row(s) (independently of the index) by boolean indexing should have runtime O(n_rows).
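Which of the cases above applies to a given DataFrame can be checked from the index itself. The sketch below builds the three index layouts and inspects `is_unique` and `is_monotonic_increasing`; the frame sizes, the seed, and the variable names are illustrative assumptions, and the code reports index properties only, not complexity.

```python
import numpy as np
import pandas as pd

n_rows = 10**5
rng = np.random.default_rng(0)
values = rng.random((n_rows, 2))

# Case 1: default RangeIndex, sorted and unique.
unique_sorted = pd.DataFrame(values, columns=["a", "b"])

# Case 2: repeated labels in random order (not unique, not sorted).
labels = rng.integers(0, 10**4, n_rows)
dup_unsorted = pd.DataFrame(values, columns=["a", "b"], index=labels)

# Case 3: the same repeated labels, sorted.
dup_sorted = dup_unsorted.sort_index()

for name, frame in [("unique, sorted", unique_sorted),
                    ("duplicated, unsorted", dup_unsorted),
                    ("duplicated, sorted", dup_sorted)]:
    idx = frame.index
    print(name, "| is_unique:", idx.is_unique,
          "| is_monotonic_increasing:", idx.is_monotonic_increasing)

# Label lookup goes through the index; a boolean mask compares every index entry.
label = int(labels[0])  # a label that is guaranteed to exist
by_label = dup_sorted.loc[[label]]
by_mask = dup_sorted[dup_sorted.index == label]
print(len(by_label), len(by_mask))
```

Both lookups return the same rows; only the way they are found differs.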



Demo:

index is sorted and unique:

In [49]: df = pd.DataFrame(np.random.rand(10**5,6), columns=list('abcdef'))

In [50]: %timeit df.loc[random.randint(0, 10**4)]
The slowest run took 27.65 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 331 μs per loop

In [51]: %timeit df.iloc[random.randint(0, 10**4)]
1000 loops, best of 3: 275 μs per loop

In [52]: %timeit df.query("a > 0.9")
100 loops, best of 3: 7.84 ms per loop

In [53]: %timeit df.loc[df.a > 0.9]
100 loops, best of 3: 2.96 ms per loop

index is NOT sorted and is NOT unique:

In [54]: df = pd.DataFrame(np.random.rand(10**5,6), columns=list('abcdef'), index=np.random.randint(0, 10000, 10**5))

In [55]: %timeit df.loc[random.randint(0, 10**4)]
100 loops, best of 3: 12.3 ms per loop

In [56]: %timeit df.iloc[random.randint(0, 10**4)]
1000 loops, best of 3: 262 μs per loop

In [57]: %timeit df.query("a > 0.9")
100 loops, best of 3: 7.78 ms per loop

In [58]: %timeit df.loc[df.a > 0.9]
100 loops, best of 3: 2.93 ms per loop

index is NOT unique and is sorted:

In [64]: df = pd.DataFrame(np.random.rand(10**5,6), columns=list('abcdef'), index=np.random.randint(0, 10000, 10**5)).sort_index()

In [65]: df.index.is_monotonic_increasing
Out[65]: True

In [66]: %timeit df.loc[random.randint(0, 10**4)]
The slowest run took 9.70 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 478 μs per loop

In [67]: %timeit df.iloc[random.randint(0, 10**4)]
1000 loops, best of 3: 262 μs per loop

In [68]: %timeit df.query("a > 0.9")
100 loops, best of 3: 7.81 ms per loop

In [69]: %timeit df.loc[df.a > 0.9]
100 loops, best of 3: 2.95 ms per loop

Answer by galaxyan

I can't tell you how it is implemented, but after running a small test, it seems that a DataFrame boolean mask behaves more like a linear search.

>>> timeit.timeit('dict_data[key]',setup=setup,number = 10000)
0.0005770014540757984
>>> timeit.timeit('df[df.val==key]',setup=setup,number = 10000)
17.583375428628642
>>> timeit.timeit('[i == key for i in dict_data ]',setup=setup,number = 10000)
16.613936403242406
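The answer does not show the `setup` string it used. One possible reconstruction is sketched below; the size `n`, the `key`, the contents of `dict_data`, and the `val` column are all assumptions, and the repetition count is reduced from 10000 to keep the run short.

```python
import timeit

# Hypothetical setup: the original answer's setup is not shown, so every name
# and size here is an assumption made for illustration.
setup = """
import pandas as pd
n = 100_000
key = n - 1
dict_data = {i: i for i in range(n)}
df = pd.DataFrame({'val': list(range(n))})
"""

number = 100  # the answer used 10000; fewer repetitions keep this quick

# Hash lookup in a dict.
t_dict = timeit.timeit("dict_data[key]", setup=setup, number=number)
# Boolean mask over a DataFrame column.
t_mask = timeit.timeit("df[df.val == key]", setup=setup, number=number)
# Explicit linear scan over the dict's keys.
t_scan = timeit.timeit("[i == key for i in dict_data]", setup=setup, number=number)

print(f"dict lookup : {t_dict:.6f} s")
print(f"boolean mask: {t_mask:.6f} s")
print(f"linear scan : {t_scan:.6f} s")
```

Absolute numbers will differ by machine and pandas version; only the relative ordering is of interest here.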

Answer by blacknight12321

You should note that even iloc is about 2 orders of magnitude slower than a hash map when your index is unique:

df = pd.DataFrame(np.random.randint(0, 10**7, 10**5), columns=['a'])
%timeit df.iloc[random.randint(0, 10**5 - 1)]
10000 loops, best of 3: 51.5 μs per loop

s = set(np.random.randint(0, 10**7, 10**5))
%timeit random.randint(0,10**7) in s
The slowest run took 9.70 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 615 ns per loop
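If hash-map speed is what is needed, one option in line with this comparison is to build a plain Python dict from the relevant columns once and look ids up in it afterwards. This is only a sketch: it assumes anime_id values are unique and reuses the sample rows from the question as stand-in data; `genre_by_id` is a made-up name.

```python
import pandas as pd

# Stand-in data copied from the question's sample rows.
df = pd.DataFrame({
    "anime_id": [1234, 5678],
    "genre": ["Romance, Comedy", "Sci-fi"],
})

# Building the dict walks every row once (assumes anime_id values are unique;
# with duplicates, later rows would overwrite earlier ones).
genre_by_id = dict(zip(df["anime_id"], df["genre"]))

# Each later lookup is an ordinary dict (hash) lookup.
print(genre_by_id[5678])
```

The trade-off is that the dict is a snapshot: it has to be rebuilt if the DataFrame changes.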