Get N largest values from pandas array, with index and column headings intact
Original question: http://stackoverflow.com/questions/25511223/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by Sirrah
Let's say I have just calculated a correlation matrix. Using a pandas DataFrame, I would now like to obtain the highest correlations with their axis names in place.
E.g. from:
   a, b, c, d, e, f
a, 0, 1, 2, 3, 4, 5
b, 1, 0, 3, 4, 5, 6
c, 2, 3, 0, 5, 6, 7
d, 3, 4, 5, 0, 7, 8
e, 4, 5, 6, 7, 0, 9
f, 5, 6, 7, 8, 9, 0
get:
e f 9
f d 8
f c 7
e d 7
etc...
I have read through the pandas docs and see the groupby methods as well as functions like head, but I'm a bit lost on how one would be expected to perform this operation.
Answered by DSM
You can use stack here, which will produce a Series with the row and column information in the index, and then call nlargest on that:
>>> df.stack()
a  a    0
   b    1
   c    2
   d    3
   e    4
   f    5
b  a    1
   b    0
   c    3
[etc.]
>>> df.stack().nlargest(6)
e f 9
f e 9
d f 8
f d 8
c f 7
d e 7
dtype: int64
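Because a correlation matrix is symmetric, the output above lists every pair twice (e f and f e). The following is a minimal sketch, not part of the original answer, that rebuilds the example frame from the question and masks everything except the strict upper triangle before stacking, so each pair appears once. The variable names (labels, upper, pairs, top) are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Rebuild the example matrix from the question.
labels = list("abcdef")
data = [
    [0, 1, 2, 3, 4, 5],
    [1, 0, 3, 4, 5, 6],
    [2, 3, 0, 5, 6, 7],
    [3, 4, 5, 0, 7, 8],
    [4, 5, 6, 7, 0, 9],
    [5, 6, 7, 8, 9, 0],
]
df = pd.DataFrame(data, index=labels, columns=labels)

# Boolean mask that is True only strictly above the diagonal.
upper = np.triu(np.ones(df.shape, dtype=bool), k=1)

# Masked-out cells become NaN; dropna() removes them explicitly so the
# result does not depend on how stack() treats missing values.
pairs = df.where(upper).stack().dropna()

# Four largest values, each (row, column) pair listed once.
top = pairs.nlargest(4)
print(top)
```

Note that where() introduces NaN, so the values come back as floats rather than integers; casting with astype(int) afterwards is one way to restore the original dtype.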
Answered by jpp
You can use np.argpartition. Dropping down to NumPy here seems to give a 2-3x performance improvement.
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.abs(np.random.randn(500, 400)))

def jpp(df, n):
    # Flat positions of the n largest values (unordered); avoids a full sort
    flat_indices = np.argpartition(df.values.ravel(), -n)[-n:]
    # Convert flat positions back to (row, column) positions
    row_idx, col_idx = np.unravel_index(flat_indices, df.values.shape)
    indices = list(zip(row_idx, col_idx))
    values = df.values[(row_idx, col_idx)]
    res_idx = pd.MultiIndex.from_tuples(indices)
    return pd.Series(values, index=res_idx).sort_values(ascending=False)

def dsm(df, n):
    return df.stack().nlargest(n)

assert jpp(df, n=1000).equals(dsm(df, n=1000))

# Timings measured with the IPython %timeit magic
%timeit jpp(df, n=1000)  # 4.65 ms per loop
%timeit dsm(df, n=1000)  # 12.1 ms per loop
%timeit jpp(df, n=5)     # 3.33 ms per loop
%timeit dsm(df, n=5)     # 10.1 ms per loop
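The function above indexes the result by integer position, which matches the labels only because the benchmark frame uses the default integer index and columns. As a hedged sketch (not from the original answer), the same np.argpartition approach can look up the row and column labels so that the headings stay intact; top_n_labeled and demo are hypothetical names introduced here for illustration.

```python
import numpy as np
import pandas as pd

def top_n_labeled(df, n):
    """Return the n largest values of df as a Series indexed by (row label, column label)."""
    values = df.to_numpy()
    # Flat positions of the n largest values, in no particular order.
    flat = np.argpartition(values.ravel(), -n)[-n:]
    # Convert flat positions to (row, column) positions.
    rows, cols = np.unravel_index(flat, values.shape)
    # Translate positions into the frame's own labels.
    index = pd.MultiIndex.from_arrays([df.index[rows], df.columns[cols]])
    return pd.Series(values[rows, cols], index=index).sort_values(ascending=False)

# Small labeled frame with distinct random values.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((5, 4)), index=list("vwxyz"), columns=list("abcd"))

result = top_n_labeled(demo, 3)
expected = demo.stack().nlargest(3)
print(result)
```

With distinct values, result should carry the same values and labels as demo.stack().nlargest(3); when values tie, the two approaches may order or select the tied entries differently.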

