Python 熊猫跨列求和并将每个单元格从该值中除以
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/26537878/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Pandas sum across columns and divide each cell from that value
提问by add-semi-colons
I have read a csv file and pivoted it to get to following structure:
我已经阅读了一个 csv 文件并将其旋转到以下结构:
pivoted = df.pivot('user_id', 'group', 'value')
lookup = df.drop_duplicates('user_id')[['user_id', 'group']]
lookup.set_index(['user_id'], inplace=True)
result = pivoted.join(lookup)
result = result.fillna(0)
Section of the result:
结果部分:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 group
user_id
2 33653 2325 916 720 867 187 31 0 6 3 42 56 92 15 l-1
4 18895 414 1116 570 1190 55 92 0 122 23 78 6 4 2 l-2
16 1383 70 27 17 17 1 0 0 0 0 1 0 0 0 l-2
50 396 72 34 5 18 0 0 0 0 0 0 0 0 0 l-3
51 3915 1170 402 832 2791 316 12 5 118 51 32 9 62 27 l-4
I want to sum across column 0 to column 13 by each row and divide each cell by the sum of that row. I am still getting used to pandas; if I understand correctly, we should try to avoid for loops when doing things like this? In other words, how can I do this in a 'pandas' way?
我想按每行从第 0 列到第 13 列求和,并将每个单元格除以该行的总和。我还是习惯了熊猫;如果我理解正确,我们应该在做这样的事情时尽量避免 for 循环?换句话说,我怎样才能以“熊猫”的方式做到这一点?
采纳答案by Jerome Montino
Try the following:
请尝试以下操作:
In [1]: import pandas as pd
In [2]: df = pd.read_csv("test.csv")
In [3]: df
Out[3]:
id value1 value2 value3
0 A 1 2 3
1 B 4 5 6
2 C 7 8 9
In [4]: df["sum"] = df.sum(axis=1)
In [5]: df
Out[5]:
id value1 value2 value3 sum
0 A 1 2 3 6
1 B 4 5 6 15
2 C 7 8 9 24
In [6]: df_new = df.loc[:,"value1":"value3"].div(df["sum"], axis=0)
In [7]: df_new
Out[7]:
value1 value2 value3
0 0.166667 0.333333 0.500
1 0.266667 0.333333 0.400
2 0.291667 0.333333 0.375
Or you can do the following:
或者您可以执行以下操作:
In [8]: df.loc[:,"value1":"value3"] = df.loc[:,"value1":"value3"].div(df["sum"], axis=0)
In [9]: df
Out[9]:
id value1 value2 value3 sum
0 A 0.166667 0.333333 0.500 6
1 B 0.266667 0.333333 0.400 15
2 C 0.291667 0.333333 0.375 24
Or just straight up from the beginning:
或者直接从头开始:
In [10]: df = pd.read_csv("test.csv")
In [11]: df
Out[11]:
id value1 value2 value3
0 A 1 2 3
1 B 4 5 6
2 C 7 8 9
In [12]: df.loc[:,"value1":"value3"] = df.loc[:,"value1":"value3"].div(df.sum(axis=1), axis=0)
In [13]: df
Out[13]:
id value1 value2 value3
0 A 0.166667 0.333333 0.500
1 B 0.266667 0.333333 0.400
2 C 0.291667 0.333333 0.375
Changing the column value1and the like to your headers should work similarly.
将列value1等更改为标题应该类似。
回答by EdChum
The following seemed to work fine for me:
以下对我来说似乎工作正常:
In [39]:
cols = ['0','1','2','3','4','5','6','7','8','9','10','11','12','13']
result[cols] = result[cols].apply(lambda row: row / row.sum(axis=1), axis=1)
result
Out[39]:
0 1 2 3 4 5 6 \
user_id
2 0.864827 0.059749 0.023540 0.018503 0.022280 0.004806 0.000797
4 0.837285 0.018345 0.049453 0.025258 0.052732 0.002437 0.004077
16 0.912269 0.046174 0.017810 0.011214 0.011214 0.000660 0.000000
50 0.754286 0.137143 0.064762 0.009524 0.034286 0.000000 0.000000
51 0.401868 0.120099 0.041265 0.085403 0.286491 0.032437 0.001232
7 8 9 10 11 12 13 \
user_id
2 0.000000 0.000154 0.000077 0.001079 0.001439 0.002364 0.000385
4 0.000000 0.005406 0.001019 0.003456 0.000266 0.000177 0.000089
16 0.000000 0.000000 0.000000 0.000660 0.000000 0.000000 0.000000
50 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
51 0.000513 0.012113 0.005235 0.003285 0.000924 0.006364 0.002772
group
user_id
2 l-1
4 l-2
16 l-2
50 l-3
51 l-4
OK scratch the above, the following will be much faster:
OK从头开始,下面的会快很多:
result[cols] = result[cols].div(result[cols].sum(axis=1), axis=0)
And just to prove the result is the same:
只是为了证明结果是一样的:
In [47]:
cols = ['0','1','2','3','4','5','6','7','8','9','10','11','12','13']
np.alltrue(result[cols].div(result[cols].sum(axis=1), axis=0) == result[cols].apply(lambda row: row / row.sum(axis=1), axis=1))
Out[47]:
True
And that it's faster:
而且它更快:
In [48]:
cols = ['0','1','2','3','4','5','6','7','8','9','10','11','12','13']
%timeit result[cols].div(result[cols].sum(axis=1), axis=0)
%timeit result[cols].apply(lambda row: row / row.sum(axis=1), axis=1)
100 loops, best of 3: 2.38 ms per loop
100 loops, best of 3: 4.47 ms per loop
回答by Souf Ee
More simply:
更简单:
result.div(result.sum(axis=1), axis=0)
result.div(result.sum(axis=1), axis=0)
(Edited to use code highlighting)
(编辑为使用代码突出显示)
回答by ihadanny
easier to work per column:
每列更容易工作:
df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
(df.T / df.T.sum()).T
result:
结果:
0 1 2
0 0.166667 0.333333 0.500
1 0.266667 0.333333 0.400
2 0.291667 0.333333 0.375

