Is a Pandas DataFrame search linear time or constant time?

Question

Asked by Sayan Sil

I have a DataFrame object df with over 15,000 rows, like:

anime_id          name              genre    rating
1234      Kimi no nawa    Romance, Comedy     9.31
5678       Stiens;Gate             Sci-fi     8.92

I am trying to find the row with a particular anime_id:

a_id = "5678"
temp = (df.query("anime_id == "+a_id).genre)

I just wanted to know whether this search is done in constant time (like dictionaries) or linear time (like lists).

Answer 1

Accepted answer by MaxU

This is a very interesting question!

I think it depends on the following aspects:

accessing a single row by index (index is sorted and unique) should have runtime O(m), where m << n_rows

accessing a single row by index (index is NOT unique and is NOT sorted) should have runtime O(n_rows)

accessing a single row by index (index is NOT unique and is sorted) should have runtime O(m), where m < n_rows

accessing row(s) (independently of an index) by boolean indexing should have runtime O(n_rows)
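As a rough illustration of the cases above, the sketch below builds a tiny frame from the two rows shown in the question and looks up the same anime_id once with a boolean mask and once through the index. It assumes anime_id is stored as an integer column; the names by_mask, indexed and by_label are made up for this example.

```python
import pandas as pd

# The two sample rows from the question (anime_id assumed to be an integer).
df = pd.DataFrame({
    "anime_id": [1234, 5678],
    "name": ["Kimi no nawa", "Stiens;Gate"],
    "genre": ["Romance, Comedy", "Sci-fi"],
    "rating": [9.31, 8.92],
})

# Boolean indexing: the comparison is evaluated for every row,
# independently of whatever index the frame has.
by_mask = df.loc[df["anime_id"] == 5678, "genre"].iloc[0]

# Label-based lookup: make anime_id the index, then check the two
# properties the list above distinguishes between.
indexed = df.set_index("anime_id")
print("unique:", indexed.index.is_unique)
print("sorted:", indexed.index.is_monotonic_increasing)

# With a unique index, .loc with a row label and a column label
# returns a single scalar value.
by_label = indexed.loc[5678, "genre"]
print(by_mask, by_label)
```

If many lookups by anime_id are expected, setting it as the index once and reusing the indexed frame avoids repeating the full column comparison each time.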

Demo:

index is sorted and unique:

In [49]: df = pd.DataFrame(np.random.rand(10**5,6), columns=list('abcdef'))

In [50]: %timeit df.loc[random.randint(0, 10**4)]
The slowest run took 27.65 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 331 μs per loop

In [51]: %timeit df.iloc[random.randint(0, 10**4)]
1000 loops, best of 3: 275 μs per loop

In [52]: %timeit df.query("a > 0.9")
100 loops, best of 3: 7.84 ms per loop

In [53]: %timeit df.loc[df.a > 0.9]
100 loops, best of 3: 2.96 ms per loop

index is NOT sorted and is NOT unique:

In [54]: df = pd.DataFrame(np.random.rand(10**5,6), columns=list('abcdef'), index=np.random.randint(0, 10000, 10**5))

In [55]: %timeit df.loc[random.randint(0, 10**4)]
100 loops, best of 3: 12.3 ms per loop

In [56]: %timeit df.iloc[random.randint(0, 10**4)]
1000 loops, best of 3: 262 μs per loop

In [57]: %timeit df.query("a > 0.9")
100 loops, best of 3: 7.78 ms per loop

In [58]: %timeit df.loc[df.a > 0.9]
100 loops, best of 3: 2.93 ms per loop

index is NOT unique and is sorted:

In [64]: df = pd.DataFrame(np.random.rand(10**5,6), columns=list('abcdef'), index=np.random.randint(0, 10000, 10**5)).sort_index()

In [65]: df.index.is_monotonic_increasing
Out[65]: True

In [66]: %timeit df.loc[random.randint(0, 10**4)]
The slowest run took 9.70 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 478 μs per loop

In [67]: %timeit df.iloc[random.randint(0, 10**4)]
1000 loops, best of 3: 262 μs per loop

In [68]: %timeit df.query("a > 0.9")
100 loops, best of 3: 7.81 ms per loop

In [69]: %timeit df.loc[df.a > 0.9]
100 loops, best of 3: 2.95 ms per loop
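The IPython sessions above rely on the %timeit magic and on imports that are not shown. A minimal standalone sketch of the same comparison, using the standard timeit module, might look like the following. The frame sizes follow the demo; the names frames, timings and N_CALLS, the fixed seed, and the choice to draw keys only from labels that exist are assumptions of this sketch. Absolute timings will vary by machine and pandas version.

```python
import random
import timeit

import numpy as np
import pandas as pd

N_ROWS = 10**5
N_CALLS = 50  # kept small so the script finishes quickly

rng = np.random.default_rng(0)
values = rng.random((N_ROWS, 6))
labels = rng.integers(0, 10**4, N_ROWS)  # repeated labels, unsorted
columns = list("abcdef")

frames = {
    "sorted, unique": pd.DataFrame(values, columns=columns),
    "unsorted, not unique": pd.DataFrame(values, columns=columns, index=labels),
    "sorted, not unique": pd.DataFrame(values, columns=columns, index=labels).sort_index(),
}

timings = {}
for name, frame in frames.items():
    # Draw keys from labels that are actually present to avoid KeyError.
    keys = frame.index.unique().tolist()
    total = timeit.timeit(lambda: frame.loc[random.choice(keys)], number=N_CALLS)
    timings[name] = total / N_CALLS

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e6:.1f} us per .loc lookup")
```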

Answer 2

Answer by galaxyan

I can't tell you how it is implemented, but after running a little test, it seems that a DataFrame boolean mask behaves more like a linear search.

>>> timeit.timeit('dict_data[key]',setup=setup,number = 10000)
0.0005770014540757984
>>> timeit.timeit('df[df.val==key]',setup=setup,number = 10000)
17.583375428628642
>>> timeit.timeit('[i == key for i in dict_data ]',setup=setup,number = 10000)
16.613936403242406
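The setup string used in these timings is not shown in the answer. One possible setup, purely as an assumption for reproducing the comparison, is sketched below: a dict and a one-column DataFrame of the same size, with the key chosen arbitrarily. The size n and the reduced number of repetitions are also choices made here, so the absolute numbers will not match the ones above.

```python
import timeit

# Hypothetical setup: the original answer does not show its own.
setup = """
import pandas as pd
n = 10**5
dict_data = {i: i for i in range(n)}
df = pd.DataFrame({"val": list(range(n))})
key = n - 1
"""

# Hash lookup in a dict.
dict_time = timeit.timeit("dict_data[key]", setup=setup, number=100)
# Boolean mask over the whole column.
mask_time = timeit.timeit("df[df.val == key]", setup=setup, number=100)
# Explicit Python-level scan over all keys.
scan_time = timeit.timeit("[i == key for i in dict_data]", setup=setup, number=100)

print(f"dict lookup:  {dict_time:.6f} s")
print(f"boolean mask: {mask_time:.6f} s")
print(f"list scan:    {scan_time:.6f} s")
```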

Answer 3

Answer by blacknight12321

You should note that even iloc is about 2 orders of magnitude slower than a hashmap when your index is unique:

df = pd.DataFrame(np.random.randint(0, 10**7, 10**5), columns=['a'])
%timeit df.iloc[random.randint(0, 10**5 - 1)]
10000 loops, best of 3: 51.5 μs per loop

s = set(np.random.randint(0, 10**7, 10**5))
%timeit random.randint(0,10**7) in s
The slowest run took 9.70 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 615 ns per loop
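If hash-style lookups are what is needed, one option is to build a plain dict from the frame once and query that instead. The sketch below does this for the question's anime_id and genre columns; the two rows come from the question, the name genre_by_id is made up here, and it assumes anime_id values are unique integers.

```python
import pandas as pd

# The two sample rows from the question (anime_id assumed unique).
df = pd.DataFrame({
    "anime_id": [1234, 5678],
    "genre": ["Romance, Comedy", "Sci-fi"],
})

# One linear pass to build the mapping; later lookups are dict lookups.
genre_by_id = df.set_index("anime_id")["genre"].to_dict()

a_id = 5678
print(genre_by_id[a_id])
print(genre_by_id.get(9999, "not found"))  # missing keys handled with .get
```

The trade-off is that the dict is a snapshot: it has to be rebuilt if the DataFrame changes.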
