Pandas - Filter across all columns
Source: http://stackoverflow.com/questions/41128456/
This content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Asked by Thomas Murphy
I have a square correlation matrix in pandas, and am trying to divine the most efficient way to return all values where the value (always a float -1 <= x <= 1) is above a certain threshold.
The pandas.DataFrame.filter method asks for a list of columns or a regex, but I always want to pass all columns in. Is there a best practice for this?
Answered by juanpa.arrivillaga
There are two ways to go about this.
Suppose:
In [7]: import numpy as np; import pandas as pd
In [8]: a = np.array([1,2,3,4,6,7,8,9])
In [9]: b = np.array([2,4,6,8,10,12,13,15])
In [10]: c = np.array([-1,-2,-2,-3,-4,-6,-7,-8])
In [11]: corr = np.corrcoef([a,b,c])
In [12]: df = pd.DataFrame(corr)
In [13]: df
Out[13]:
0 1 2
0 1.000000 0.995350 -0.980521
1 0.995350 1.000000 -0.971724
2 -0.980521 -0.971724 1.000000
Then you can simply compare the whole DataFrame against the threshold and use the resulting boolean mask to index it:
In [14]: df > 0.5
Out[14]:
0 1 2
0 True True False
1 True True False
2 False False True
In [15]: df[df > 0.5]
Out[15]:
0 1 2
0 1.00000 0.99535 NaN
1 0.99535 1.00000 NaN
2 NaN NaN 1.0
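The masking step above can also be written with DataFrame.where, which makes the "keep or replace with NaN" behaviour explicit. This is only a sketch: the name threshold and the use of abs() to catch strong negative correlations are assumptions added for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Same sample data as in the transcript above.
a = np.array([1, 2, 3, 4, 6, 7, 8, 9])
b = np.array([2, 4, 6, 8, 10, 12, 13, 15])
c = np.array([-1, -2, -2, -3, -4, -6, -7, -8])
df = pd.DataFrame(np.corrcoef([a, b, c]))

threshold = 0.5  # illustrative cutoff, not from the original answer

# Keep values above the threshold; every other cell becomes NaN.
above = df.where(df > threshold)

# Assumption: strong negative correlations matter too, so compare |r|.
strong = df.where(df.abs() > threshold)
```

Here df.where(df > threshold) gives the same frame as df[df > threshold]; the abs() variant additionally keeps the large negative entries in the last row and column.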
If you want only the values, then the easiest way is to work with the underlying numpy data structures using the values attribute:
In [17]: df.values
Out[17]:
array([[ 1. , 0.99535001, -0.9805214 ],
[ 0.99535001, 1. , -0.97172394],
[-0.9805214 , -0.97172394, 1. ]])
In [18]: df.values[(df > 0.5).values]
Out[18]: array([ 1. , 0.99535001, 0.99535001, 1. , 1. ])
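A similar result can be obtained with DataFrame.to_numpy(), which the pandas documentation recommends over .values in newer versions. The sketch below assumes the same sample data; the variable names values, mask and selected are illustrative.

```python
import numpy as np
import pandas as pd

a = np.array([1, 2, 3, 4, 6, 7, 8, 9])
b = np.array([2, 4, 6, 8, 10, 12, 13, 15])
c = np.array([-1, -2, -2, -3, -4, -6, -7, -8])
df = pd.DataFrame(np.corrcoef([a, b, c]))

values = df.to_numpy()    # 2-D ndarray holding the correlations
mask = values > 0.5       # boolean ndarray of the same shape
selected = values[mask]   # 1-D array of matches, in row-major order
```

Note that boolean indexing flattens the result, so the row and column positions of each value are lost.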
Instead of .values, as pointed out by ayhan, you can use stack, which automatically drops NaN and also keeps the labels:
In [22]: df.index = ['a','b','c']
In [23]: df.columns=['a','b','c']
In [24]: df
Out[24]:
a b c
a 1.000000 0.995350 -0.980521
b 0.995350 1.000000 -0.971724
c -0.980521 -0.971724 1.000000
In [25]: df.stack() > 0.5
Out[25]:
a  a     True
   b     True
   c    False
b  a     True
   b     True
   c    False
c  a    False
   b    False
   c     True
dtype: bool
In [26]: df.stack()[df.stack() > 0.5]
Out[26]:
a  a    1.00000
   b    0.99535
b  a    0.99535
   b    1.00000
c  c    1.00000
dtype: float64
You can always go back to the square layout with unstack:
In [29]: (df.stack()[df.stack() > 0.5]).unstack()
Out[29]:
a b c
a 1.00000 0.99535 NaN
b 0.99535 1.00000 NaN
c NaN NaN 1.0
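The transcript calls df.stack() twice per filter. A small variation, sketched below on the same labelled data, stacks once and reuses the result; the names stacked, high and high_chained are illustrative only.

```python
import numpy as np
import pandas as pd

a = np.array([1, 2, 3, 4, 6, 7, 8, 9])
b = np.array([2, 4, 6, 8, 10, 12, 13, 15])
c = np.array([-1, -2, -2, -3, -4, -6, -7, -8])
df = pd.DataFrame(np.corrcoef([a, b, c]),
                  index=['a', 'b', 'c'], columns=['a', 'b', 'c'])

# Stack once into a Series indexed by (row label, column label).
stacked = df.stack()
high = stacked[stacked > 0.5]

# Equivalent single expression: .loc accepts a callable returning a mask.
high_chained = df.stack().loc[lambda s: s > 0.5]

# Back to the square layout; pairs that were filtered out become NaN.
square = high.unstack()
```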
Answered by Julien Marrec
Not sure what your desired output is since you didn't provide a sample, but I'll give you my two cents on what I would do:
In[1]:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(10,5))
corr = df.corr()
corr.shape
Out[1]: (5, 5)
Now, let's extract the upper triangle of the correlation matrix (it's symmetric), excluding the diagonal. For this we are going to use np.tril, cast this as a boolean, and get the opposite of it using the ~ operator.
In [2]: corr_triu = corr.where(~np.tril(np.ones(corr.shape)).astype(bool))
corr_triu
Out[2]:
0 1 2 3 4
0 NaN 0.228763 -0.276406 0.286771 -0.050825
1 NaN NaN -0.562459 -0.596057 0.540656
2 NaN NaN NaN 0.402752 0.042400
3 NaN NaN NaN NaN -0.642285
4 NaN NaN NaN NaN NaN
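The same mask can be built directly with np.triu and k=1, which selects the entries strictly above the diagonal and avoids the negation. This is a sketch of an alternative spelling, not the answer's own code; the seeded generator is an assumption added only so the example is reproducible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
df = pd.DataFrame(rng.random((10, 5)))
corr = df.corr()

# True strictly above the diagonal; k=1 skips the diagonal itself.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Keep the upper triangle, replace everything else with NaN.
corr_triu = corr.where(upper)

# For comparison: the tril-and-negate mask used in the step above.
via_tril = corr.where(~np.tril(np.ones(corr.shape)).astype(bool))
```

For a 5 x 5 matrix this leaves 10 values, one per unordered pair of columns.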
Now let's stack this and keep only the values that are above 0.3, for example:
In [3]: corr_triu = corr_triu.stack()
corr_triu[corr_triu > 0.3]
Out[3]:
1 4 0.540656
2 3 0.402752
dtype: float64
If you want to make it a bit prettier:
In [4]: corr_triu.name = 'Pearson Correlation Coefficient'
corr_triu.index.names = ['Col1', 'Col2']
In [5]: corr_triu[corr_triu > 0.3].to_frame()
Out[5]:
           Pearson Correlation Coefficient
Col1 Col2
1    4                            0.540656
2    3                            0.402752
Answered by msklc
To easily get a meaningful correlation result from a pandas DataFrame, take this data for example:
df = pd.DataFrame(np.random.randn(10, 5),
columns=['a', 'b', 'c', 'd', 'e'])
df
We get the correlation between the columns with df.corr().
To filter the result, ignore the 1.0 entries (each column's correlation with itself) and keep only the values beyond a chosen limit:
corr_result=df.corr()
corr_result = corr_result.stack()
corr_result[(corr_result != 1.0)&((corr_result > 0.9)|(corr_result < -0.9))]
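Comparing against 1.0 also discards any pair of different columns that happens to be perfectly correlated, and floating-point rounding can make a self-correlation differ slightly from 1.0. A hedged alternative is to drop the self-pairs by label instead. In the sketch below the seed, the derived column e and the names pairs, off_diagonal and strong are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
df = pd.DataFrame(rng.standard_normal((10, 5)),
                  columns=['a', 'b', 'c', 'd', 'e'])
df['e'] = 2 * df['a']  # hypothetical: make 'e' perfectly correlated with 'a'

limit = 0.9  # same cutoff as in the snippet above
pairs = df.corr().stack()

# Drop self-correlations by comparing the two index levels, not the values.
first = pairs.index.get_level_values(0)
second = pairs.index.get_level_values(1)
off_diagonal = pairs[first != second]

# Keep strong correlations in either direction.
strong = off_diagonal[off_diagonal.abs() > limit]
```

With this data the pair (a, e) is kept even though its correlation is essentially 1, while (a, a) and the other self-pairs are removed.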