Groupby and apply: pandas vs dask

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me) and link the original: http://stackoverflow.com/questions/45107528/


Groupby and apply pandas vs dask

python, pandas, group-by, apply, dask

Asked by rpanai

There is something I don't quite understand about dask.dataframe behavior. Say I want to replicate this from pandas:

import pandas as pd
import dask.dataframe as dd
import random

s = "abcd"
lst = 10*[0]+list(range(1,6))
n = 100
df = pd.DataFrame({"col1": [random.choice(s) for i in range(n)],
                   "col2": [random.choice(lst) for i in range(n)]})
# I will need a hash in dask
df["hash"] = 2*df.col1
df = df[["hash","col1","col2"]]

def fun(data):
    if data["col2"].mean()>1:
        data["col3"]=2
    else:
        data["col3"]=1
    return(data)

df1 = df.groupby("col1").apply(fun)
df1.head()

this returns

  hash col1  col2  col3
0   dd    d     0     1
1   aa    a     0     2
2   bb    b     0     1
3   bb    b     0     1
4   aa    a     0     2
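As an aside, in plain pandas the same per-group flag can also be computed without groupby.apply, via transform (a minimal sketch with made-up data, not part of the original question):

```python
import pandas as pd

df = pd.DataFrame({"col1": list("aabb"),
                   "col2": [0, 4, 0, 1]})

# Broadcast each group's mean of col2 back onto its rows
group_mean = df.groupby("col1")["col2"].transform("mean")

# Flag is 2 where the group's mean exceeds 1, else 1
df["col3"] = group_mean.gt(1).map({True: 2, False: 1})
```

Because transform is vectorized and returns a row-aligned Series, no merge back is needed at all.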

In Dask I tried

def fun2(data):
    if data["col2"].mean()>1:
        return 2
    else:
        return 1

ddf = df.copy()
ddf.set_index("hash",inplace=True)
ddf = dd.from_pandas(ddf, npartitions=2)

gpb = ddf.groupby("col1").apply(fun2, meta=pd.Series())

where the groupby leads to the same result as in pandas, but I'm having a hard time merging the result into a new column while preserving the hash index. I'd like to have the following result

      col1  col2  col3
hash           
aa      a     5     2
aa      a     0     2
aa      a     0     2
aa      a     0     2
aa      a     4     2

UPDATE

Playing with merge, I found this solution

ddf1 = dd.merge(ddf, gpb.to_frame(), 
                left_on="col1",
                left_index=False, right_index=True)
ddf1 = ddf1.rename(columns={0:"col3"})

I'm not sure how this is going to work if I have to do a groupby over several columns. Plus, it is not exactly elegant.
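For what it's worth, the same merge pattern does generalize to several grouping columns, because a multi-key groupby result carries a MultiIndex that right_index=True can match against. A pandas sketch with made-up data (the dask version should be analogous):

```python
import pandas as pd

df = pd.DataFrame({"col1": list("aabb"),
                   "colA": ["x", "x", "y", "y"],
                   "col2": [0, 4, 0, 1]})

def fun2(data):
    return 2 if data["col2"].mean() > 1 else 1

# Grouping on two keys yields a Series with a (col1, colA) MultiIndex
gpb = df.groupby(["col1", "colA"]).apply(fun2).rename("col3")

# Merge back on both key columns against that MultiIndex
out = df.merge(gpb.to_frame(), left_on=["col1", "colA"], right_index=True)
```

Passing a list to left_on with right_index=True lines each key column up with the corresponding index level, so no rename of an anonymous 0 column is needed when the Series is named up front.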

Answered by Bob Haffner

How about using join?

This is your dask code, except for naming the Series: pd.Series(name='col3')

def fun2(data):
    if data["col2"].mean()>1:
        return 2
    else:
        return 1

ddf = df.copy()
ddf.set_index("hash",inplace=True)
ddf = dd.from_pandas(ddf, npartitions=2)

gpb = ddf.groupby("col1").apply(fun2, meta=pd.Series(name='col3'))

then the join

ddf1 = ddf.join(gpb.to_frame(), on='col1')
print(ddf1.compute().head())
      col1  col2  col3
hash                 
cc      c     0     2
cc      c     0     2
cc      c     0     2
cc      c     2     2
cc      c     0     2
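The reason join works here: with on='col1' it matches the col1 values against the index of gpb while keeping the caller's own (hash) index intact. The same mechanics in plain pandas, as a small self-contained sketch:

```python
import pandas as pd

df = pd.DataFrame({"hash": ["aa", "aa", "bb", "bb"],
                   "col1": list("aabb"),
                   "col2": [0, 4, 0, 1]}).set_index("hash")

# Scalar-returning apply gives a Series indexed by the group key (col1)
gpb = df.groupby("col1").apply(lambda d: 2 if d["col2"].mean() > 1 else 1)
gpb.name = "col3"

# join looks up col1 values in gpb's index; the hash index survives
out = df.join(gpb.to_frame(), on="col1")
```

Unlike merge with left_index=False, join never discards the left frame's index, which is exactly the property the question asked for.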