How to calculate differences across n columns in pandas rather than rows
Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors on Stack Overflow.
Original question: http://stackoverflow.com/questions/29218398/
Asked by John Smizz
I am playing around with data and need to look at differences across columns (as well as rows) in a fairly large dataframe. The easiest way for rows is clearly the diff() method, but I cannot find the equivalent for columns?
My current solution is to obtain a dataframe with the columns differenced via
df.transpose().diff().transpose()
Is there a more efficient alternative? Or is this such an odd usage of pandas that it was just never requested or considered useful? :)
Thanks,
Answered by unutbu
Pandas DataFrames are excellent for manipulating table-like data whose columns have different dtypes.
If subtracting across both columns and rows makes sense, then all the values are the same kind of quantity. That might be an indication that you should be using a NumPy array instead of a Pandas DataFrame.
In any case, you can use arr = df.values to extract a NumPy array of the underlying data from the DataFrame. If all the columns share the same dtype, then the NumPy array will have that dtype. (When the columns have different dtypes, df.values falls back to a common dtype, which is often object.)
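As a small illustration of the dtype behaviour described above, here is a minimal sketch. The two frames and their column names are made up for the example and are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical frames, used only to illustrate the dtype of df.values.
same_dtype = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
mixed_dtype = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

print(same_dtype.values.dtype)   # an integer dtype shared by both columns
print(mixed_dtype.values.dtype)  # object, because the columns have no common numeric dtype
```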
Then you can compute the differences along rows or columns using np.diff(arr, axis=...):
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(12).reshape(3,4), columns=list('ABCD'))
#    A  B   C   D
# 0  0  1   2   3
# 1  4  5   6   7
# 2  8  9  10  11
np.diff(df.values, axis=0)  # difference of the rows
# array([[4, 4, 4, 4],
#        [4, 4, 4, 4]])
np.diff(df.values, axis=1)  # difference of the columns
# array([[1, 1, 1],
#        [1, 1, 1],
#        [1, 1, 1]])
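If a DataFrame is still wanted afterwards, the NumPy result can be wrapped again. The following is only a sketch of one way to do that: the name col_diff is an assumption for illustration, and reusing the labels of the columns after the first is a labelling choice, not something the original answer prescribes.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list("ABCD"))

# np.diff along axis=1 returns one column fewer than the input,
# so reuse the labels of every column except the first.
col_diff = pd.DataFrame(
    np.diff(df.values, axis=1),
    index=df.index,
    columns=df.columns[1:],
)
print(col_diff)
```

Unlike df.diff(axis=1), this result has no leading NaN column, since np.diff simply omits the position that has no predecessor.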
Answered by Alexander
Just difference the columns, e.g.
df['new_col'] = df['a'] - df['b']
For multiple columns, I believe unutbu's answer is the best. Although it returns a np.ndarray object instead of a dataframe, it is still faster even after converting the result back to a dataframe.
# Create a large dataframe.
df = pd.DataFrame(np.random.randn(1_000_000, 100))  # dimensions must be integers, not the float 1e6
%%timeit
np.diff(df.values, axis=1)
1 loops, best of 3: 450 ms per loop
%%timeit
df - df.shift(axis=1)
1 loops, best of 3: 727 ms per loop
%%timeit
df.T.diff().T
1 loops, best of 3: 1.52 s per loop
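The three approaches timed above should produce the same differences. As a rough sanity check (a sketch, not part of the original answer), the snippet below compares them on a much smaller frame; the variable names and the frame size are assumptions chosen so that it runs quickly.

```python
import numpy as np
import pandas as pd

# A much smaller frame than in the timings above, so the check runs quickly.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((1000, 10)))

via_numpy = np.diff(df.values, axis=1)
via_shift = df - df.shift(axis=1)
via_transpose = df.T.diff().T

# The pandas results keep a leading all-NaN column; drop it before comparing.
print(np.allclose(via_numpy, via_shift.iloc[:, 1:].values))
print(np.allclose(via_numpy, via_transpose.iloc[:, 1:].values))
```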
Answered by Adrian Martin
Use the axis parameter in diff:
df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list('ABCD'))
#    A  B   C   D
# 0  0  1   2   3
# 1  4  5   6   7
# 2  8  9  10  11
df.diff(axis=1)  # subtracting column wise
#     A  B  C  D
# 0 NaN  1  1  1
# 1 NaN  1  1  1
# 2 NaN  1  1  1
df.diff()  # subtracting row wise
#      A    B    C    D
# 0  NaN  NaN  NaN  NaN
# 1    4    4    4    4
# 2    4    4    4    4
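diff also accepts a periods argument, which may be what "n columns" in the question is after. The sketch below assumes the goal is to subtract the column two positions to the left; the name lag2 is made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list("ABCD"))

# Difference each column against the one two positions to its left.
# Columns A and B have no such neighbour, so they come out as NaN.
lag2 = df.diff(periods=2, axis=1)
print(lag2)
```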

