Python pandas: sum across columns and divide each cell by that sum

Question

Asked by add-semi-colons

I have read a CSV file and pivoted it to get the following structure:

pivoted = df.pivot(index='user_id', columns='group', values='value')
lookup = df.drop_duplicates('user_id')[['user_id', 'group']]
lookup.set_index(['user_id'], inplace=True)
result = pivoted.join(lookup)
result = result.fillna(0)

Section of the result:

             0     1     2    3     4    5   6  7    8   9  10  11  12  13  group
user_id                                                                      
2        33653  2325   916  720   867  187  31  0    6   3  42  56  92  15    l-1
4        18895   414  1116  570  1190   55  92  0  122  23  78   6   4   2    l-2 
16        1383    70    27   17    17    1   0  0    0   0   1   0   0   0    l-2
50         396    72    34    5    18    0   0  0    0   0   0   0   0   0    l-3
51        3915  1170   402  832  2791  316  12  5  118  51  32   9  62  27    l-4

I want to sum columns 0 through 13 in each row and divide each cell by that row's sum. I am still getting used to pandas; if I understand correctly, we should avoid for loops for operations like this. In other words, how can I do this the 'pandas' way?

Answer 1

Accepted answer by Jerome Montino

Try the following:

In [1]: import pandas as pd

In [2]: df = pd.read_csv("test.csv")

In [3]: df
Out[3]: 
  id  value1  value2  value3
0  A       1       2       3
1  B       4       5       6
2  C       7       8       9

In [4]: df["sum"] = df.sum(axis=1, numeric_only=True)  # numeric_only skips the text column "id"

In [5]: df
Out[5]: 
  id  value1  value2  value3  sum
0  A       1       2       3    6
1  B       4       5       6   15
2  C       7       8       9   24

In [6]: df_new = df.loc[:,"value1":"value3"].div(df["sum"], axis=0)

In [7]: df_new
Out[7]: 
     value1    value2  value3
0  0.166667  0.333333   0.500
1  0.266667  0.333333   0.400
2  0.291667  0.333333   0.375

Or you can do the following:

In [8]: df.loc[:,"value1":"value3"] = df.loc[:,"value1":"value3"].div(df["sum"], axis=0)

In [9]: df
Out[9]: 
  id    value1    value2  value3  sum
0  A  0.166667  0.333333   0.500    6
1  B  0.266667  0.333333   0.400   15
2  C  0.291667  0.333333   0.375   24

Or just straight up from the beginning:

In [10]: df = pd.read_csv("test.csv")

In [11]: df
Out[11]: 
  id  value1  value2  value3
0  A       1       2       3
1  B       4       5       6
2  C       7       8       9

In [12]: df.loc[:,"value1":"value3"] = df.loc[:,"value1":"value3"].div(df.sum(axis=1, numeric_only=True), axis=0)

In [13]: df
Out[13]: 
  id    value1    value2  value3
0  A  0.166667  0.333333   0.500
1  B  0.266667  0.333333   0.400
2  C  0.291667  0.333333   0.375

Replacing value1 and the other column names with your own headers should work the same way.
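
For a frame shaped like the one in the question (numeric columns plus a text column), one way to adapt this is to pick out the numeric columns first. The sketch below is only an illustration: the three-row frame is a hypothetical stand-in for result, and the names numeric, shares and normalized are made up here.

```python
import pandas as pd

# Hypothetical stand-in for the asker's `result`: counts plus a text column.
result = pd.DataFrame(
    {0: [33653, 18895, 1383], 1: [2325, 414, 70], 2: [916, 1116, 27],
     "group": ["l-1", "l-2", "l-2"]},
    index=pd.Index([2, 4, 16], name="user_id"),
)

# Keep only the numeric columns so the text column stays out of the sum.
numeric = result.select_dtypes(include="number")

# Row totals, then divide each row by its own total
# (axis=0 aligns the totals with the row index).
shares = numeric.div(numeric.sum(axis=1), axis=0)

# Re-attach the label column next to the normalised values.
normalized = shares.join(result["group"])
print(normalized)
```

Building a new frame with join, rather than assigning back into result, leaves the original counts untouched in case they are needed later.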

Answer 2

Answered by EdChum

The following seemed to work fine for me:

In [39]:

cols = ['0','1','2','3','4','5','6','7','8','9','10','11','12','13']
result[cols] = result[cols].apply(lambda row: row / row.sum(), axis=1)  # each row is a Series, so sum() takes no axis
result

Out[39]:
                0         1         2         3         4         5         6  \
user_id                                                                         
2        0.864827  0.059749  0.023540  0.018503  0.022280  0.004806  0.000797   
4        0.837285  0.018345  0.049453  0.025258  0.052732  0.002437  0.004077   
16       0.912269  0.046174  0.017810  0.011214  0.011214  0.000660  0.000000   
50       0.754286  0.137143  0.064762  0.009524  0.034286  0.000000  0.000000   
51       0.401868  0.120099  0.041265  0.085403  0.286491  0.032437  0.001232   

                7         8         9        10        11        12        13  \
user_id                                                                         
2        0.000000  0.000154  0.000077  0.001079  0.001439  0.002364  0.000385   
4        0.000000  0.005406  0.001019  0.003456  0.000266  0.000177  0.000089   
16       0.000000  0.000000  0.000000  0.000660  0.000000  0.000000  0.000000   
50       0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
51       0.000513  0.012113  0.005235  0.003285  0.000924  0.006364  0.002772   

        group  
user_id        
2         l-1  
4         l-2  
16        l-2  
50        l-3  
51        l-4

OK, scratch the above; the following is much faster:

result[cols]  = result[cols].div(result[cols].sum(axis=1), axis=0)

And just to prove the result is the same:

In [47]:

import numpy as np
cols = ['0','1','2','3','4','5','6','7','8','9','10','11','12','13']
np.all(result[cols].div(result[cols].sum(axis=1), axis=0) == result[cols].apply(lambda row: row / row.sum(), axis=1))
Out[47]:
True
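
If you want to repeat this check on your own data, a self-contained version might look like the sketch below. The 3x3 frame is a hypothetical stand-in for result[cols], and comparing with np.allclose (a tolerance) instead of exact equality is my own choice here, not part of the original check.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame standing in for result[cols].
counts = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    columns=["0", "1", "2"],
)

# Vectorised: one division of the whole block by the row totals.
vectorised = counts.div(counts.sum(axis=1), axis=0)

# Row by row: each row arrives as a Series, so sum() takes no axis argument.
row_wise = counts.apply(lambda row: row / row.sum(), axis=1)

# Compare with a tolerance instead of exact floating-point equality.
same = np.allclose(vectorised.to_numpy(), row_wise.to_numpy())
print(same)
```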

And that it's faster:

In [48]:

cols = ['0','1','2','3','4','5','6','7','8','9','10','11','12','13']
%timeit result[cols].div(result[cols].sum(axis=1), axis=0) 
%timeit result[cols].apply(lambda row: row / row.sum(), axis=1)
100 loops, best of 3: 2.38 ms per loop
100 loops, best of 3: 4.47 ms per loop
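
One edge case worth keeping in mind with either version: a row whose counts are all zero has a total of 0, so the division is 0/0. Below is a minimal sketch of what pandas returns there and one possible way to deal with it. The two-row frame, the names counts, totals and shares, and the fillna(0.0) policy are all assumptions for illustration, not something taken from the question.

```python
import pandas as pd

# Hypothetical counts: user "u2" has no activity at all.
counts = pd.DataFrame({"a": [1, 0], "b": [3, 0]}, index=["u1", "u2"])

totals = counts.sum(axis=1)

# For "u2" every cell is 0/0, which pandas reports as NaN.
shares = counts.div(totals, axis=0)

# One possible policy: treat an all-zero row as all-zero shares.
shares_filled = shares.fillna(0.0)
print(shares_filled)
```

Whether zeros are the right replacement depends on what the shares are used for; leaving the NaN in place is just as defensible.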

Answer 3

Answered by Souf Ee

More simply, provided every column of result is numeric (a text column such as group has to be set aside first):

result.div(result.sum(axis=1), axis=0)

Answer 4

Answered by ihadanny

It is easier to work per column: transpose, divide by the column sums, and transpose back:

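import pandas as pd

df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
# after transposing, each original row is a column, so df.T.sum() gives the row totals
(df.T / df.T.sum()).T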

Result:

         0         1      2
0  0.166667  0.333333  0.500
1  0.266667  0.333333  0.400
2  0.291667  0.333333  0.375
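
The transpose is needed because dividing a DataFrame by a Series with the / operator lines the Series up with the column labels. As a quick check that this gives the same numbers as div with axis=0, here is a small sketch on the same 3x3 frame; the names via_transpose and via_div are made up for the comparison.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Transpose so each original row becomes a column, divide by the column
# sums (which are the original row sums), then transpose back.
via_transpose = (df.T / df.T.sum()).T

# Same result without transposing: axis=0 aligns the row sums with the row index.
via_div = df.div(df.sum(axis=1), axis=0)

# Raises an AssertionError if the two frames differ beyond a small tolerance.
pd.testing.assert_frame_equal(via_transpose, via_div)
print(via_div)
```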
