What is the performance impact of non-unique indexes in pandas?
Source: http://stackoverflow.com/questions/16626058/
License: this content comes from Stack Overflow and is provided under CC BY-SA 4.0. You are free to use and share it, but you must attribute it to the original authors.
Asked by ChrisB
From the pandas documentation, I've gathered that unique-valued indices make certain operations efficient, and that non-unique indices are occasionally tolerated.
From the outside, it doesn't look like non-unique indices are taken advantage of in any way. For example, the following ix query is slow enough that it seems to be scanning the entire dataframe:
In [23]: import numpy as np
In [24]: import pandas as pd
In [25]: x = np.random.randint(0, 10**7, 10**7)
In [26]: df1 = pd.DataFrame({'x':x})
In [27]: df2 = df1.set_index('x', drop=False)
In [28]: %timeit df2.ix[0]
1 loops, best of 3: 402 ms per loop
In [29]: %timeit df1.ix[0]
10000 loops, best of 3: 123 us per loop
(I realize the two ix queries don't return the same thing; it's just an example showing that calls to ix on a non-unique index appear much slower.)
Is there any way to coax pandas into using faster lookup methods like binary search on non-unique and/or sorted indices?
Answered by HYRY
When the index is unique, pandas uses a hash table to map keys to values, so a lookup is O(1). When the index is non-unique but sorted, pandas uses binary search, which is O(log N). When the index is non-unique and in random order, pandas has to check every key in the index, which is O(N).
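To make the three cases concrete, here is a minimal pure-Python sketch of the lookup strategies described above. It is an illustration only, not pandas' actual implementation, and the function names (lookup_hash, lookup_sorted, lookup_scan) are made up for this example.

```python
from bisect import bisect_left, bisect_right

def lookup_hash(table, key):
    """Unique keys: a single dictionary probe, O(1) on average."""
    return table[key]

def lookup_sorted(keys, key):
    """Sorted keys with duplicates: two binary searches, O(log N),
    locate the contiguous run of positions holding `key`."""
    lo = bisect_left(keys, key)
    hi = bisect_right(keys, key)
    return list(range(lo, hi))

def lookup_scan(keys, key):
    """Unsorted keys with duplicates: compare against every key, O(N)."""
    return [i for i, k in enumerate(keys) if k == key]

unsorted_keys = [3, 1, 3, 2, 1, 3]
sorted_keys = sorted(unsorted_keys)
unique_table = {k: i for i, k in enumerate([10, 20, 30])}

hash_hit = lookup_hash(unique_table, 20)
sorted_hits = lookup_sorted(sorted_keys, 3)
scan_hits = lookup_scan(unsorted_keys, 3)
print(hash_hit, sorted_hits, scan_hits)
```

The sorted case can return a slice of positions because equal keys sit next to each other, which is why sorting a non-unique index pays off.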
You can call the sort_index method:
import numpy as np
import pandas as pd
x = np.random.randint(0, 200, 10**6)
df1 = pd.DataFrame({'x':x})
df2 = df1.set_index('x', drop=False)
df3 = df2.sort_index()
%timeit df1.loc[100]
%timeit df2.loc[100]
%timeit df3.loc[100]
Result:
10000 loops, best of 3: 71.2 μs per loop
10 loops, best of 3: 38.9 ms per loop
10000 loops, best of 3: 134 μs per loop
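Which lookup path is available depends on two properties of the index, and both can be inspected directly. The following is a short sketch that rebuilds the three frames at a smaller size; the seed, the size, and the `flags` dictionary are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
x = rng.randint(0, 200, 10**4)

df1 = pd.DataFrame({'x': x})            # default integer index
df2 = df1.set_index('x', drop=False)    # non-unique, unsorted index
df3 = df2.sort_index()                  # non-unique, sorted index

# (is_unique, is_monotonic_increasing) for each frame's index
flags = {
    name: (df.index.is_unique, df.index.is_monotonic_increasing)
    for name, df in [('df1', df1), ('df2', df2), ('df3', df3)]
}
for name, (unique, monotonic) in flags.items():
    print(name, 'unique:', unique, 'sorted:', monotonic)
```

Checking these flags before a lookup-heavy step is a cheap way to decide whether a call to sort_index is worth it.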
Answered by cs95
@HYRY said it well, but nothing says it quite like a colourful graph with timings.
Plots were generated using perfplot. Code, for your reference:
import numpy as np
import pandas as pd
import perfplot

_rnd = np.random.RandomState(42)

def make_data(n):
    # Build the three frames being compared for a given size n.
    x = _rnd.randint(0, 200, n)
    df1 = pd.DataFrame({'x': x})           # default unique index
    df2 = df1.set_index('x', drop=False)   # non-unique, unsorted index
    df3 = df2.sort_index()                 # non-unique, sorted index
    return df1, df2, df3

perfplot.show(
    setup=lambda n: make_data(n),
    kernels=[
        lambda dfs: dfs[0].loc[100],
        lambda dfs: dfs[1].loc[100],
        lambda dfs: dfs[2].loc[100],
    ],
    labels=['Unique index', 'Non-unique, unsorted index', 'Non-unique, sorted index'],
    n_range=[2 ** k for k in range(8, 23)],
    xlabel='N',
    logx=True,
    logy=True,
    equality_check=False)


