Get Column and Row Index for Highest Value in Dataframe Pandas
Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors on Stack Overflow, citing the original URL.
Original question: http://stackoverflow.com/questions/48016629/
Asked by christfan868
I'd like to know if there's a way to find the location (column and row index) of the highest value in a dataframe. So if for example my dataframe looks like this:
     A   B   C   D   E
0  100   9   1  12   6
1   80  10  67  15  91
2   20  67   1  56  23
3   12  51   5  10  58
4   73  28  72  25   1
How do I get a result that looks like [0, 'A'] using Pandas?
Answer by Mike Müller
Use np.argmax
NumPy's argmax can be helpful:
>>> df.stack().index[np.argmax(df.values)]
(0, 'A')
In steps
df.values is a two-dimensional NumPy array:
>>> df.values
array([[100,   9,   1,  12,   6],
       [ 80,  10,  67,  15,  91],
       [ 20,  67,   1,  56,  23],
       [ 12,  51,   5,  10,  58],
       [ 73,  28,  72,  25,   1]])
argmax gives you the index of the maximum value in the "flattened" array:
>>> np.argmax(df.values)
0
Now, you can use this index to find the row-column location on the stacked dataframe:
>>> df.stack().index[0]
(0, 'A')
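The steps above can be put together in one self-contained sketch. It rebuilds the question's dataframe and assumes there are no missing values; older pandas versions drop NaN entries in stack() by default, which would shift the stacked index away from the flattened array positions. The variable names flat_pos, stacked and location are illustrative only.

```python
import numpy as np
import pandas as pd

# The dataframe from the question
df = pd.DataFrame(
    {"A": [100, 80, 20, 12, 73],
     "B": [9, 10, 67, 51, 28],
     "C": [1, 67, 1, 5, 72],
     "D": [12, 15, 56, 10, 25],
     "E": [6, 91, 23, 58, 1]}
)

# Position of the maximum in the row-major flattened array
flat_pos = int(np.argmax(df.to_numpy()))

# stack() yields a Series whose MultiIndex lists (row, column) pairs
# in the same row-major order, as long as nothing has been dropped
stacked = df.stack()
location = stacked.index[flat_pos]

print(flat_pos, location)
```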
Fast Alternative
If you need it fast, do as few steps as possible. Working only on the NumPy array, using np.argmax to find the indices, seems best:
v = df.values
i, j = [x[0] for x in np.unravel_index([np.argmax(v)], v.shape)]
[df.index[i], df.columns[j]]
Result:
[0, 'A']
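As a possible simplification, np.unravel_index also accepts a scalar flat index, so the list wrapping and the x[0] unpacking can be skipped. The sketch below wraps this in a helper; the name max_location is hypothetical, and the small dataframe with string row labels is made up to show that labels, not positions, are returned.

```python
import numpy as np
import pandas as pd

def max_location(df):
    """Return [row label, column label] of the first maximum in df."""
    v = df.to_numpy()
    # argmax gives the flat position; unravel_index converts it to (row, col)
    i, j = np.unravel_index(np.argmax(v), v.shape)
    return [df.index[i], df.columns[j]]

# Illustrative data, not from the original question
df = pd.DataFrame([[1, 5], [9, 3]], index=["x", "y"], columns=["A", "B"])
result = max_location(df)
print(result)
```

If several cells share the maximum, np.argmax reports only the first one in row-major order.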
Timings
Timings are most meaningful for large data frames:
df = pd.DataFrame(data=np.arange(int(1e6)).reshape(-1,5), columns=list('ABCDE'))
Sorted slowest to fastest:
Mask:
%timeit df.mask(~(df==df.max().max())).stack().index.tolist()
33.4 ms ± 982 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Stack-idmax
%timeit list(df.stack().idxmax())
17.1 ms ± 139 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Stack-argmax
%timeit df.stack().index[np.argmax(df.values)]
14.8 ms ± 392 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Where
%%timeit
i,j = np.where(df.values == df.values.max())
list((df.index[i].values.tolist()[0],df.columns[j].values.tolist()[0]))
4.45 ms ± 84.7 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Argmax-unravel_index
%%timeit
v = df.values
i, j = [x[0] for x in np.unravel_index([np.argmax(v)], v.shape)]
[df.index[i], df.columns[j]]
499 μs ± 12 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Compare
d = {'name': ['Mask', 'Stack-idmax', 'Stack-argmax', 'Where', 'Argmax-unravel_index'],
'time': [33.4, 17.1, 14.8, 4.45, 499],
'unit': ['ms', 'ms', 'ms', 'ms', 'μs']}
timings = pd.DataFrame(d)
timings['seconds'] = timings.time * timings.unit.map({'ms': 1e-3, 'μs': 1e-6})
timings['factor slower'] = timings.seconds / timings.seconds.min()
timings.sort_values('factor slower')
Output:
                   name    time unit   seconds  factor slower
4  Argmax-unravel_index  499.00   μs  0.000499       1.000000
3                 Where    4.45   ms  0.004450       8.917836
2          Stack-argmax   14.80   ms  0.014800      29.659319
1           Stack-idmax   17.10   ms  0.017100      34.268537
0                  Mask   33.40   ms  0.033400      66.933868
So the "Argmax-unravel_index" version seems to be one to nearly two orders of magnitude faster for large data frames, which is where speed often matters most.
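Outside IPython, where %timeit is not available, a comparison like this can be approximated with the standard timeit module. The following is a rough sketch that times only two of the variants on a smaller frame; the function names and the repeat counts are arbitrary choices, and the absolute numbers will differ by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(int(1e5)).reshape(-1, 5), columns=list("ABCDE"))

def stack_idxmax(df):
    return list(df.stack().idxmax())

def argmax_unravel(df):
    v = df.to_numpy()
    i, j = np.unravel_index(np.argmax(v), v.shape)
    return [df.index[i], df.columns[j]]

candidates = {"Stack-idxmax": stack_idxmax, "Argmax-unravel_index": argmax_unravel}

results = {}
for name, func in candidates.items():
    # Best of 3 repeats with 10 calls each, converted to seconds per call
    results[name] = min(timeit.repeat(lambda: func(df), number=10, repeat=3)) / 10

for name, seconds in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name}: {seconds * 1e3:.3f} ms per call")
```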
Answer by jezrael
Use stack to get a Series with a MultiIndex, then idxmax for the index of the max value:
print (df.stack().idxmax())
(0, 'A')
print (list(df.stack().idxmax()))
[0, 'A']
Detail:
print (df.stack())
0  A    100
   B      9
   C      1
   D     12
   E      6
1  A     80
   B     10
   C     67
   D     15
   E     91
2  A     20
   B     67
   C      1
   D     56
   E     23
3  A     12
   B     51
   C      5
   D     10
   E     58
4  A     73
   B     28
   C     72
   D     25
   E      1
dtype: int64
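Because idxmax on the stacked Series returns a (row label, column label) tuple, it can be unpacked directly and reused to look up the value itself. A small sketch, using a made-up three-row frame with string row labels (not the question's data):

```python
import pandas as pd

# Illustrative data with non-default row labels
df = pd.DataFrame(
    {"A": [100, 80, 20], "B": [9, 10, 67], "C": [1, 67, 1]},
    index=["r0", "r1", "r2"],
)

stacked = df.stack()             # Series indexed by (row label, column label)
row, col = stacked.idxmax()      # label pair of the first maximum
value = stacked.loc[(row, col)]  # the maximum value itself

print(row, col, value)
```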
Answer by YOBEN_S
mask + max
df.mask(~(df==df.max().max())).stack().index.tolist()
Out[17]: [(0, 'A')]
Answer by Scott Boston
In my opinion, stack() becomes inefficient for larger datasets. Let's use np.where to return the index positions:
i,j = np.where(df.values == df.values.max())
list((df.index[i].values.tolist()[0],df.columns[j].values.tolist()[0]))
Output:
[0, 'A']
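One property of the np.where approach is that it finds every cell equal to the maximum, not just the first; the [0] in the snippet above discards the rest. A minimal sketch that keeps all of them, using a made-up frame with a tied maximum (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative data: the maximum 7 appears twice
df = pd.DataFrame({"A": [7, 2], "B": [3, 7]}, index=["r0", "r1"])

v = df.to_numpy()
# np.where returns one array of row positions and one of column positions
rows, cols = np.where(v == v.max())

# Translate positions into labels for every tied maximum
all_locations = [(df.index[i], df.columns[j]) for i, j in zip(rows, cols)]
first_location = list(all_locations[0])

print(all_locations)
print(first_location)
```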
Timings for larger dataframes:
df = pd.DataFrame(data=np.arange(10000).reshape(-1,5), columns=list('ABCDE'))
np.where method
%%timeit
i,j = np.where(df.values == df.values.max())
list((df.index[i].values.tolist()[0],df.columns[j].values.tolist()[0]))
1000 loops, best of 3: 364 μs per loop
Other stack methods
%timeit df.mask(~(df==df.max().max())).stack().index.tolist()
100 loops, best of 3: 7.68 ms per loop
%timeit df.stack().index[np.argmax(df.values)]
10 loops, best of 3: 50.5 ms per loop
%timeit list(df.stack().idxmax())
1000 loops, best of 3: 1.58 ms per loop
Even larger dataframe:
df = pd.DataFrame(data=np.arange(100000).reshape(-1,5), columns=list('ABCDE'))
Respectively:
1000 loops, best of 3: 1.62 ms per loop
10 loops, best of 3: 18.2 ms per loop
100 loops, best of 3: 5.69 ms per loop
100 loops, best of 3: 6.64 ms per loop
Answer by Alex Deineha
print('Max value:', df.stack().max())
print('Parameters :', df.stack().idxmax())
This is the best way, imho.
Answer by rassar
This should work:
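def max_df(df):
    m = None  # largest value seen so far
    p = None  # its (row label, column label)
    # df.idxmax() maps each column label to the row label of that column's maximum
    for c, idx in df.idxmax().items():
        val = df.loc[idx, c]
        if m is None or val > m:
            m = val
            p = idx, c
    return p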
This uses the idxmax function to find the row of each column's maximum, then compares those maxima.
Example usage:
>>> df
     A  B
0  100  9
1   90  8
>>> max_df(df)
(0, 'A')
Here's a one-liner (for fun):
def max_df2(df):
    # Build (value, row label, column label) tuples, take the largest, drop the value
    return max((df.loc[idx, c], idx, c) for c, idx in df.idxmax().items())[1:]