pandas: plyr or dplyr in Python

Notice: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/26878476/

Time: 2020-09-13 22:39:18  Source: igfitidea

plyr or dplyr in Python

Tags: python, r, pandas, plyr, dplyr

Asked by user1617979

This is more of a conceptual question; I do not have a specific problem.

I am learning Python for data analysis, but I am very familiar with R. One of the great things about R is plyr (and of course ggplot2), and even better, dplyr. Pandas of course has split-apply as well; however, in R I can do things like the following (in dplyr; it is a bit different in plyr, and I can see now how dplyr mimics the . notation from object-oriented programming):

   data %.% group_by(c(.....)) %.% summarise(new1 = ...., new2 = ...., ..... newn=....)

in which I create multiple summary calculations at the same time.

How do I do that in Python? After all,

df[...].groupby(.....).sum()

only sums columns, while in R I can have one mean, one sum, one special function, etc. in a single call.

I realize I can do all my operations separately and merge them, and that is fine if I am using Python. But when it comes down to choosing a tool, every line of code you do not have to type, check, and validate adds up in time saved.

In addition, in dplyr you can also add mutate statements, so it seems to me that it is way more powerful. So what am I missing about pandas or Python?

My goal is to learn. I have spent a lot of effort learning Python, and it is a worthy investment, but the question still remains.

Accepted answer by szxk

I think you're looking for the agg function, which is applied to groupby objects.

From the docs:

In [48]: grouped = df.groupby('A')

In [49]: grouped['C'].agg([np.sum, np.mean, np.std])
Out[49]: 
          sum      mean       std
A                                
bar  0.443469  0.147823  0.301765
foo  2.529056  0.505811  0.96
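
To connect this back to the question (one mean, one sum, one special function in a single call), here is a minimal sketch using pandas named aggregation, which is available in pandas 0.25 and later. The toy data and the column names A, C, D, as well as the output names c_mean, d_sum, and c_range, are assumptions made up for illustration; they are not from the docs excerpt above.

```python
import pandas as pd

# Toy data; the column names A, C and D are illustrative only.
df = pd.DataFrame({
    "A": ["foo", "foo", "bar", "bar"],
    "C": [1.0, 3.0, 2.0, 6.0],
    "D": [10, 20, 30, 40],
})

# One agg call computes several differently defined summaries,
# each written as output_name=(input_column, function).
summary = df.groupby("A").agg(
    c_mean=("C", "mean"),                          # a mean
    d_sum=("D", "sum"),                            # a sum
    c_range=("C", lambda s: s.max() - s.min()),    # a custom function
)
print(summary)
```

Each keyword plays roughly the role of a `new1 = ...` argument in dplyr's summarise, so the separate-then-merge workaround is not needed for this case.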

Answer by lgallen

I'm also a big fan of dplyr for R and am working to improve my knowledge of Pandas. Since you don't have a specific problem, I'd suggest checking out the post below that breaks down the entire introductory dplyr vignette and shows how all of it can be done with Pandas.

For example, the author demonstrates chaining with the pipe operator in R:

 flights %>%
   group_by(year, month, day) %>%
   select(arr_delay, dep_delay) %>%
   summarise(
      arr = mean(arr_delay, na.rm = TRUE),
      dep = mean(dep_delay, na.rm = TRUE)
       ) %>%
   filter(arr > 30 | dep > 30)

And here is the Pandas implementation:

# The outer parentheses let the method chain span several lines.
(flights.groupby(['year', 'month', 'day'])
   [['arr_delay', 'dep_delay']]
   .mean()
   .query('arr_delay > 30 | dep_delay > 30'))

There are many more comparisons of how to implement dplyr-like operations with Pandas in the original post: http://nbviewer.ipython.org/gist/TomAugspurger/6e052140eaa5fdb6e8c0
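
As a runnable illustration of the same chain, here is a small sketch on made-up data. The five-column `flights` frame below is an assumption standing in for the real nycflights13 table, and the named aggregation (pandas 0.25 or later) is one way, not the linked post's way, to get the renamed `arr` and `dep` columns that the R summarise call produces.

```python
import pandas as pd

# Hypothetical stand-in for the flights table; values are invented.
flights = pd.DataFrame({
    "year": [2013, 2013, 2013, 2013],
    "month": [1, 1, 1, 1],
    "day": [1, 1, 2, 2],
    "arr_delay": [40.0, 50.0, 5.0, None],
    "dep_delay": [10.0, 20.0, 0.0, 10.0],
})

result = (
    flights
    .groupby(["year", "month", "day"], as_index=False)
    # mean skips missing values by default, similar to na.rm = TRUE
    .agg(arr=("arr_delay", "mean"), dep=("dep_delay", "mean"))
    .query("arr > 30 | dep > 30")
)
print(result)
```

Only the group for day 1 survives the filter here, since its mean arrival delay exceeds 30.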

Answer by lgautier

One could simply use dplyr from Python.

There is an interface to dplyr in rpy2 (introduced with rpy2-2.7.0) that lets you write things like:

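# Assumes `mtcars` is an R data frame already loaded through rpy2.
from rpy2.robjects.lib.dplyr import DataFrame

dataf = (DataFrame(mtcars).
         filter('gear>3').
         mutate(powertoweight='hp*36/wt').
         group_by('gear').
         summarize(mean_ptw='mean(powertoweight)'))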

There is an example in the documentation. This part of the doc is (also) a jupyter notebook. Look for the links near the top of page.

Another answer to the question compares R's dplyr and pandas (see @lgallen). The same R one-liner chaining dplyr statements is written essentially the same way in rpy2's interface to dplyr.

R:

flights %>%
   group_by(year, month, day) %>%
   select(arr_delay, dep_delay) %>%
   summarise(
      arr = mean(arr_delay, na.rm = TRUE),
      dep = mean(dep_delay, na.rm = TRUE)
      ) %>%
   filter(arr > 30 | dep > 30)

Python+rpy2:

(DataFrame(flights).
 group_by('year', 'month', 'day').
 select('arr_delay', 'dep_delay').
 summarize(arr = 'mean(arr_delay, na.rm=TRUE)',
           dep = 'mean(dep_delay, na.rm=TRUE)').
 filter('arr > 30 | dep > 30'))

Answer by Rafael Díaz

The closest thing to dplyr in Python is the dfply package. Here is an example.

R dplyr

library(nycflights13)
library(dplyr)

flights %>%
  filter(hour > 10) %>% # step 1
  mutate(speed = distance / (air_time * 60)) %>% # step 2
  group_by(origin) %>% # step 3a
  summarize(mean_speed = sprintf("%0.6f",mean(speed, na.rm = T))) %>% # step 3b
  arrange(desc(mean_speed)) # step 4

# A tibble: 3 x 2
  origin mean_speed
  <chr>  <chr>     
1 EWR    0.109777  
2 JFK    0.109427  
3 LGA    0.107362 

Python dfply

from dfply import *
import pandas as pd

flight_data = pd.read_csv('nycflights13.csv')

(flight_data >>
  mask(X.hour > 10) >> # step 1
  mutate(speed = X.distance / (X.air_time * 60)) >> # step 2
  group_by(X.origin) >> # step 3a
  summarize(mean_speed = X.speed.mean()) >> # step 3b
  arrange(X.mean_speed, ascending=False) # step 4
)


Out[1]: 
  origin  mean_speed
0    EWR    0.109777
1    JFK    0.109427
2    LGA    0.107362

Python Pandas

flight_data.loc[flight_data['hour'] > 10, 'speed'] = flight_data['distance'] / (flight_data['air_time'] * 60)
result = flight_data.groupby('origin', as_index=False)['speed'].mean()
result.sort_values('speed', ascending=False)

Out[2]: 
  origin     speed
0    EWR  0.109777
1    JFK  0.109427
2    LGA  0.107362
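
For comparison, the same four steps can also be written as a single pandas method chain that lines up step by step with the dplyr and dfply versions. This is only a sketch: the small `flight_data` frame is invented so the block runs without the CSV file, and `query`, `assign`, and named aggregation are one possible mapping of filter, mutate, and summarize rather than the only one.

```python
import pandas as pd

# Hypothetical stand-in for nycflights13.csv; values are invented.
flight_data = pd.DataFrame({
    "origin": ["EWR", "EWR", "JFK", "JFK", "LGA"],
    "hour": [12, 9, 11, 15, 13],
    "distance": [1200, 500, 600, 1800, 300],
    "air_time": [100, 50, 100, 200, 100],
})

result = (
    flight_data
    .query("hour > 10")                                             # step 1: filter
    .assign(speed=lambda d: d["distance"] / (d["air_time"] * 60))   # step 2: mutate
    .groupby("origin", as_index=False)                              # step 3a: group_by
    .agg(mean_speed=("speed", "mean"))                              # step 3b: summarize
    .sort_values("mean_speed", ascending=False)                     # step 4: arrange
)
print(result)
```

Unlike the three-line version above, this chain does not add a `speed` column to `flight_data` itself, because `assign` returns a new frame.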

Note: For more information you can check the following link.
