Original source: http://stackoverflow.com/questions/26081300/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
StackOverflow
Pandas: Subtract row mean from each element in row
Asked by jeremy radcliff
I have a dataframe with rows indexed by chemical element type and columns representing different samples. The values are floats representing the degree of presence of the row element in each sample.
I want to compute the mean of each row and subtract it from each value in that specific row to normalize the data, and make a new dataframe of that dataset.
I tried using mean(1), which gives me a Series object with the mean for each chemical element, which is good, but then I tried using subtract, which didn't work.
Accepted answer by Alex Riley
You could use DataFrame's sub method and specify that the Series of row means should be aligned with the row index (axis=0) rather than with the column labels, which is the default behaviour:
df.sub(df.mean(axis=1), axis=0)
Here's an example:
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1.5, 2.5], 'b': [0.25, 2.75], 'c': [1.25, 0.75]})
>>> df
     a     b     c
0  1.5  0.25  1.25
1  2.5  2.75  0.75
The mean of each row is straightforward to calculate:
>>> df.mean(axis=1)
0    1.0
1    2.0
dtype: float64
To de-mean the rows of the DataFrame, just subtract the row means from df like this:
>>> df.sub(df.mean(axis=1), axis=0)
     a     b     c
0  0.5 -0.75  0.25
1  0.5  0.75 -1.25
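For contrast, here is a short sketch of why the plain - operator (or sub without axis=0) probably gave the asker an unusable result. By default pandas matches the Series index against the DataFrame's column labels, so row means indexed 0 and 1 line up with none of the columns a, b, c. The variable names row_means, naive and demeaned are illustrative only and do not appear in the original answer.

```python
import pandas as pd

df = pd.DataFrame({'a': [1.5, 2.5], 'b': [0.25, 2.75], 'c': [1.25, 0.75]})
row_means = df.mean(axis=1)

# Default alignment: the Series index (0, 1) is matched against the column
# labels ('a', 'b', 'c'). Nothing lines up, so every cell becomes NaN.
naive = df - row_means

# axis=0 matches the Series index against the row index instead.
demeaned = df.sub(row_means, axis=0)

print(naive.isna().all().all())   # are all cells of the naive result NaN?
print(demeaned.mean(axis=1))      # row means after de-meaning
```

A quick sanity check for any de-meaning step is that the row means of the result are zero up to floating-point error.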
Answer by LondonRob
In addition to @ajcr's excellent answer, you might want to consider rearranging how you store your data.
The way you're doing it at the moment, with different samples in different columns, is the way it would be represented if you were using a spreadsheet, but this might not be the most helpful way to represent your data.
Normally, each column represents a unique piece of information about a single real-world entity. The typical example of this kind of data is a person:
id  name  hair_colour  Age
1   Bob   Brown        25
Really, your different samples are different real-world entities.
I would therefore suggest having a two-level index to describe each single piece of information. This makes manipulating your data in the way you want far more convenient.
Thus:
>>> df = pd.DataFrame([['Sn', 1, 2, 3], ['Pb', 2, 4, 6]],
...                   columns=['element', 'A', 'B', 'C']).set_index('element')
>>> df.columns.name = 'sample'
>>> df  # This is how your DataFrame looks at the moment
sample   A  B  C
element
Sn       1  2  3
Pb       2  4  6
>>> # Now make those columns into a second level of index
>>> df = df.stack()
>>> df
element  sample
Sn       A         1
         B         2
         C         3
Pb       A         2
         B         4
         C         6
dtype: int64
We now have all the delicious functionality of groupby at our disposal:
>>> demean = lambda x: x - x.mean()
>>> df.groupby(level='element').transform(demean)
element  sample
Sn       A        -1.0
         B         0.0
         C         1.0
Pb       A        -2.0
         B         0.0
         C         2.0
dtype: float64
When you view your data in this way, you'll find that many use cases which used to be multi-column DataFrames are in fact MultiIndexed Series, and you have much more power over how the data is represented and transformed.
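If the wide layout is still needed afterwards, the MultiIndexed Series can be pivoted back. The sketch below is one possible round trip, not part of the original answer: it assumes the same toy data, uses the built-in 'mean' aggregation with transform instead of the lambda, and the names wide, long, demeaned_long, demeaned_wide and expected are made up for illustration.

```python
import pandas as pd

wide = pd.DataFrame(
    [['Sn', 1, 2, 3], ['Pb', 2, 4, 6]],
    columns=['element', 'A', 'B', 'C'],
).set_index('element')
wide.columns.name = 'sample'

# Long format: a Series indexed by (element, sample).
long = wide.stack()

# Subtract each element's own mean from its values.
demeaned_long = long - long.groupby(level='element').transform('mean')

# Pivot the 'sample' level back into columns, then restore the original
# row and column order, since unstack may reorder the labels.
demeaned_wide = demeaned_long.unstack('sample').reindex(
    index=wide.index, columns=wide.columns
)

# The wide-format one-liner from the accepted answer, for comparison.
expected = wide.sub(wide.mean(axis=1), axis=0)
print((demeaned_wide.values == expected.values).all())
```

Both routes should produce the same numbers; the long format mainly pays off once further group-wise operations are needed.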

