pandas.DataFrame corrwith() method
Source: http://stackoverflow.com/questions/38422001/
Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the source address, and attribute it to the original authors on Stack Overflow.
Asked by Nikita Sivukhin
I recently started working with pandas. Can anyone explain the difference in behaviour of the function .corrwith() with a Series and with a DataFrame?
Suppose I have one DataFrame:
frame = pd.DataFrame(data={'a':[1,2,3], 'b':[-1,-2,-3], 'c':[10, -10, 10]})
And I want to calculate the correlation between feature 'a' and all other features. I can do it in the following way:
frame.drop(labels='a', axis=1).corrwith(frame['a'])
And the result will be:
b -1.0
c 0.0
But very similar code:
frame.drop(labels='a', axis=1).corrwith(frame[['a']])
generates a completely different and unacceptable table:
a NaN
b NaN
c NaN
So, my question is: why do we get such strange output when a DataFrame is passed as the second argument?
Answered by piRSquared
What I think you're looking for:
Let's say your frame is:
frame = pd.DataFrame(np.random.rand(10, 6), columns=['cost', 'amount', 'day', 'month', 'is_sale', 'hour'])
You want the 'cost' and 'amount' columns to be correlated with all other columns in every combination.
focus_cols = ['cost', 'amount']
frame.corr().filter(focus_cols).drop(focus_cols)
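The two-line recipe above can be unpacked step by step. The following is a minimal sketch, assuming seeded random data (the seed and the intermediate names `full_corr` and `focused` are illustrative, not from the original answer): `corr()` builds the full correlation matrix, `filter` keeps the focus columns, and `drop` removes the focus rows.

```python
import numpy as np
import pandas as pd

# Seeded random data so the sketch is repeatable; the values are illustrative only.
rng = np.random.default_rng(0)
columns = ['cost', 'amount', 'day', 'month', 'is_sale', 'hour']
frame = pd.DataFrame(rng.random((10, 6)), columns=columns)

focus_cols = ['cost', 'amount']
full_corr = frame.corr()                # 6 x 6 matrix of pairwise correlations
focused = full_corr.filter(focus_cols)  # keep only the 'cost' and 'amount' columns
focused = focused.drop(focus_cols)      # drop the 'cost' and 'amount' rows
print(focused)
```

The result has one row per remaining column ('day', 'month', 'is_sale', 'hour') and one column per focus column.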
Answering what you asked:
Compute pairwise correlation between rows or columns of two DataFrame objects.
Parameters:
other : DataFrame
axis : {0 or 'index', 1 or 'columns'}, default 0
    0 or 'index' to compute column-wise, 1 or 'columns' for row-wise
drop : boolean, default False
    Drop missing indices from result, default returns union of all
Returns:
correls : Series
corrwith behaves similarly to add, sub, mul, div in that it expects either a DataFrame or a Series to be passed as other, despite the documentation saying just DataFrame.
When other is a Series, it broadcasts that series and matches along the axis specified by axis (default 0). This is why the following worked:
frame.drop(labels='a', axis=1).corrwith(frame.a)
b -1.0
c 0.0
dtype: float64
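To make the Series case concrete, here is a minimal sketch using the frame from the question. It compares corrwith against a hand-written loop that correlates each remaining column with frame['a']; the names `rest`, `by_corrwith`, and `by_hand` are illustrative only.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(data={'a': [1, 2, 3], 'b': [-1, -2, -3], 'c': [10, -10, 10]})
rest = frame.drop(labels='a', axis=1)

# The Series is paired with every column of `rest` in turn.
by_corrwith = rest.corrwith(frame['a'])

# The same thing spelled out column by column.
by_hand = pd.Series({col: rest[col].corr(frame['a']) for col in rest.columns})

print(by_corrwith)
print(by_hand)
```

Both approaches should give one correlation per column of `rest`, regardless of the Series' own name.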
When other is a DataFrame, it will match the axis specified by axis and correlate each pair identified by the other axis. If we did:
frame.drop('a', axis=1).corrwith(frame.drop('b', axis=1))
a NaN
b NaN
c 1.0
dtype: float64
Only c was in common, and only c had its correlation calculated.
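The column matching can be checked directly. This is a small sketch, assuming the frame from the question; `left`, `right`, and `shared` are illustrative names. It also shows the drop parameter from the quoted documentation, which leaves out the labels that had no partner.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(data={'a': [1, 2, 3], 'b': [-1, -2, -3], 'c': [10, -10, 10]})
left = frame.drop('a', axis=1)    # columns: b, c
right = frame.drop('b', axis=1)   # columns: a, c

# Columns present in both frames are the only ones that can be paired up.
shared = left.columns.intersection(right.columns)
print(list(shared))

# Default drop=False: unmatched labels appear in the result as NaN.
print(left.corrwith(right))

# drop=True: unmatched labels are omitted from the result.
print(left.corrwith(right, drop=True))
```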
In the case you specified:
frame.drop(labels='a', axis=1).corrwith(frame[['a']])
frame[['a']] is a DataFrame because of the [['a']], so it now plays by the DataFrame rules, in which its columns must match up with what it is being correlated with. But you explicitly drop a from the first frame and then correlate with a DataFrame that has nothing but a. The result is NaN for every column.
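If you end up holding a single-column DataFrame, one possible way around this is to turn it back into a Series before calling corrwith. The sketch below assumes the frame from the question and uses .iloc[:, 0] for the conversion; this is a suggestion for illustration, not something the original answer proposes, and the variable names are hypothetical.

```python
import pandas as pd

frame = pd.DataFrame(data={'a': [1, 2, 3], 'b': [-1, -2, -3], 'c': [10, -10, 10]})
rest = frame.drop(labels='a', axis=1)

one_col_frame = frame[['a']]          # DataFrame with a single column 'a'
as_series = one_col_frame.iloc[:, 0]  # the same data as a Series

print(type(one_col_frame).__name__)
print(type(as_series).__name__)

# DataFrame rules: no shared column labels, so every entry is NaN.
print(rest.corrwith(one_col_frame))

# Series rules: the Series is paired with each column of `rest`.
print(rest.corrwith(as_series))
```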
Answered by MaxU
corrwith is defined as DataFrame.corrwith(other, axis=0, drop=False), so axis=0 by default, i.e. compute pairwise correlation between columns of two **DataFrame** objects.
So the column names / labels must be the same in both DFs:
In [134]: frame.drop(labels='a', axis=1).corrwith(frame[['a']].rename(columns={'a':'b'}))
Out[134]:
b -1.0
c NaN
dtype: float64
NaN means (in this case) that there is nothing to compare / correlate with, because there is NO column named c in the other DF.
If you pass a Series as other, it will be translated (per the link you posted in the comment) into:
In [142]: frame.drop(labels='a', axis=1).apply(frame.a.corr)
Out[142]:
b -1.0
c 0.0
dtype: float64
Answered by Zahoor Ahmad
Sorry, a bit late. There is no way to use corrwith with a Series, while pandas DataFrames can only be analyzed when they have the same columns, like:
x = np.array([2, 4, 6, 8.2]).reshape(-1, 1)
y = np.array([2.3, 3.11, .5, 7, 10, 11, 12]).reshape(-1, 1)
a = pd.DataFrame(x, columns=['aa'])
b = pd.DataFrame(y, columns=['aa'])
a.corrwith(b)
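Note that the two frames in this example share the column name 'aa' but differ in length (4 rows versus 7). As far as I understand, corrwith also aligns the frames on their row index, so only the rows with shared index labels (0 to 3 here) should contribute. The sketch below checks that assumption against np.corrcoef on the first four values; `result` and `reference` are illustrative names.

```python
import numpy as np
import pandas as pd

x = np.array([2, 4, 6, 8.2]).reshape(-1, 1)
y = np.array([2.3, 3.11, .5, 7, 10, 11, 12]).reshape(-1, 1)
a = pd.DataFrame(x, columns=['aa'])
b = pd.DataFrame(y, columns=['aa'])

# Columns match by name ('aa'); rows match by index label.
result = a.corrwith(b)

# Reference value computed only on the rows both frames share (labels 0..3).
reference = np.corrcoef(x[:, 0], y[:4, 0])[0, 1]

print(result['aa'], reference)
```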