What are the performance implications of non-unique indexes in pandas?

Question

Asked by ChrisB

From the pandas documentation, I've gathered that unique-valued indices make certain operations efficient, and that non-unique indices are occasionally tolerated.

From the outside, it doesn't look like non-unique indices are taken advantage of in any way. For example, the following ix query is slow enough that it seems to be scanning the entire dataframe:

In [23]: import numpy as np
In [24]: import pandas as pd
In [25]: x = np.random.randint(0, 10**7, 10**7)
In [26]: df1 = pd.DataFrame({'x':x})
In [27]: df2 = df1.set_index('x', drop=False)
In [28]: %timeit df2.ix[0]
1 loops, best of 3: 402 ms per loop
In [29]: %timeit df1.ix[0]
10000 loops, best of 3: 123 us per loop

(I realize the two ix queries don't return the same thing; it's just an example showing that calls to ix on a non-unique index appear much slower.)

Is there any way to coax pandas into using faster lookup methods like binary search on non-unique and/or sorted indices?

Answer 1

Answered by HYRY

When the index is unique, pandas uses a hash table to map keys to values, which is O(1). When the index is non-unique but sorted, pandas uses binary search, which is O(log N). When the index is non-unique and in random order, pandas has to check every key in the index, which is O(N).
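
As a rough illustration of why sorting helps, here is the same idea in plain NumPy. This is only a sketch of the two lookup strategies, not pandas' actual index engine; the array names and sizes are made up for the example.

```python
import numpy as np

rng = np.random.RandomState(0)
keys = rng.randint(0, 200, 10**5)   # non-unique keys in random order
sorted_keys = np.sort(keys)         # the same keys, sorted

target = 100

# Unsorted keys: every element has to be compared with the target, O(N).
scan_positions = np.flatnonzero(keys == target)

# Sorted keys: two binary searches bound the contiguous run of matches, O(log N).
lo = np.searchsorted(sorted_keys, target, side="left")
hi = np.searchsorted(sorted_keys, target, side="right")

# Both strategies find the same number of matching entries.
assert hi - lo == len(scan_positions)
```

On sorted data, all duplicates of a key sit next to each other, so finding the first and last match is enough to recover the whole block.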

You can call the sort_index method:

import numpy as np
import pandas as pd
x = np.random.randint(0, 200, 10**6)
df1 = pd.DataFrame({'x':x})
df2 = df1.set_index('x', drop=False)
df3 = df2.sort_index()
%timeit df1.loc[100]
%timeit df2.loc[100]
%timeit df3.loc[100]

Result:

10000 loops, best of 3: 71.2 μs per loop
10 loops, best of 3: 38.9 ms per loop
10000 loops, best of 3: 134 μs per loop
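
To see which of the three cases a given dataframe falls into, you can inspect the index flags. A minimal sketch, assuming the same df1/df2/df3 construction as above but with a smaller, seeded sample so it runs quickly; the helper name describe is made up for this example:

```python
import numpy as np
import pandas as pd

x = np.random.RandomState(0).randint(0, 200, 10**5)
df1 = pd.DataFrame({"x": x})            # default RangeIndex
df2 = df1.set_index("x", drop=False)    # index built from random, repeated values
df3 = df2.sort_index()                  # same index values, sorted

def describe(df):
    """Return (is_unique, is_monotonic_increasing) for the frame's index."""
    idx = df.index
    return bool(idx.is_unique), bool(idx.is_monotonic_increasing)

flags = {name: describe(df) for name, df in [("df1", df1), ("df2", df2), ("df3", df3)]}
# df1: (True, True)    unique
# df2: (False, False)  non-unique, unsorted
# df3: (False, True)   non-unique, sorted
print(flags)

# Sorting changes the lookup cost, not the rows that match.
assert len(df2.loc[100]) == len(df3.loc[100])
```

Note that is_monotonic_increasing allows repeated values, so a sorted index with duplicates still reports True.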

Answer 2

Answered by cs95

@HYRY said it well, but nothing says it quite like a colourful graph with timings.

Plots were generated using perfplot. Code, for your reference:

import numpy as np
import pandas as pd
import perfplot

_rnd = np.random.RandomState(42)

def make_data(n):    
    x = _rnd.randint(0, 200, n)
    df1 = pd.DataFrame({'x':x})
    df2 = df1.set_index('x', drop=False)
    df3 = df2.sort_index()

    return df1, df2, df3

perfplot.show(
    setup=lambda n: make_data(n),
    kernels=[
        lambda dfs: dfs[0].loc[100],
        lambda dfs: dfs[1].loc[100],        
        lambda dfs: dfs[2].loc[100],
    ],
    labels=['Unique index', 'Non-unique, unsorted index', 'Non-unique, sorted index'],
    n_range=[2 ** k for k in range(8, 23)],
    xlabel='N',
    logx=True,
    logy=True,
    equality_check=False)
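
If perfplot is not available, a rougher version of the same comparison can be put together with the standard-library timeit module. This is a sketch rather than a replacement for the plot: the helper time_lookups, the chosen sizes, and the repeat counts are assumptions for illustration, and the last key is forced to 100 so the label always exists.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.RandomState(42)

def make_data(n):
    x = rng.randint(0, 200, n)
    x[-1] = 100  # assumption: make sure the looked-up label is present
    df1 = pd.DataFrame({"x": x})
    df2 = df1.set_index("x", drop=False)
    df3 = df2.sort_index()
    return df1, df2, df3

def time_lookups(n, repeat=5, number=10):
    """Best-of-`repeat` seconds per .loc[100] call for each kind of index."""
    labels = ["unique", "non-unique unsorted", "non-unique sorted"]
    results = {}
    for label, df in zip(labels, make_data(n)):
        best = min(timeit.repeat(lambda: df.loc[100], number=number, repeat=repeat))
        results[label] = best / number
    return results

for n in (2**12, 2**15, 2**18):
    print(n, time_lookups(n))
```

The absolute numbers depend on the machine and the pandas version, so only the trend across N is meaningful.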
