Groupby and apply: pandas vs dask
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license, include the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/45107528/
Groupby and apply pandas vs dask
Asked by rpanai
There is something I don't quite understand about dask.dataframe behavior. Let's say I want to replicate this from pandas:
import pandas as pd
import dask.dataframe as dd
import random

# build a small random toy frame
s = "abcd"
lst = 10*[0] + list(range(1, 6))
n = 100
df = pd.DataFrame({"col1": [random.choice(s) for i in range(n)],
                   "col2": [random.choice(lst) for i in range(n)]})
# I will need a hash in dask
df["hash"] = 2*df.col1
df = df[["hash", "col1", "col2"]]

def fun(data):
    # set col3 to 2 or 1 depending on the group mean of col2
    if data["col2"].mean() > 1:
        data["col3"] = 2
    else:
        data["col3"] = 1
    return data

df1 = df.groupby("col1").apply(fun)
df1.head()
This returns:
  hash col1  col2  col3
0   dd    d     0     1
1   aa    a     0     2
2   bb    b     0     1
3   bb    b     0     1
4   aa    a     0     2
In Dask I tried:
def fun2(data):
    # return a single value per group instead of modifying the frame
    if data["col2"].mean() > 1:
        return 2
    else:
        return 1

ddf = df.copy()
ddf.set_index("hash", inplace=True)
ddf = dd.from_pandas(ddf, npartitions=2)
gpb = ddf.groupby("col1").apply(fun2, meta=pd.Series())
where the groupby leads to the same result as in pandas, but I'm having a hard time merging the result back as a new column while preserving the hash index. I'd like to get the following result:
     col1  col2  col3
hash
aa      a     5     2
aa      a     0     2
aa      a     0     2
aa      a     0     2
aa      a     4     2
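For comparison, here is a pandas-only sketch (not in the original post) of this "one value per group, broadcast back onto the rows" step, using groupby().transform, which keeps the hash index untouched:

# Pandas-only sketch (not from the original question): transform broadcasts
# the scalar returned for each group back onto that group's rows, so the
# hash index is preserved without any merge.
df2 = df.set_index("hash")
df2["col3"] = df2.groupby("col1")["col2"].transform(
    lambda s: 2 if s.mean() > 1 else 1)
print(df2.head())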
UPDATE
Playing with merge, I found this solution:
ddf1 = dd.merge(ddf, gpb.to_frame(),
                left_on="col1",
                left_index=False, right_index=True)
ddf1 = ddf1.rename(columns={0: "col3"})
I'm not sure how this is going to work if I have to group by several columns, and it's not exactly elegant.
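One possible workaround for several grouping columns (a hypothetical sketch, not from the original post, reusing the same merge pattern): combine the keys into a single helper column so the grouped result keeps a plain index. The "key" column name, the separator, and the int64 dtype below are my own assumptions.

# Hypothetical sketch: grouping by col1 and col2 via a combined helper key,
# then merging exactly as above; the helper column can be dropped afterwards.
ddf["key"] = ddf["col1"] + "_" + ddf["col2"].astype(str)
gpb2 = ddf.groupby("key").apply(fun2, meta=pd.Series(name="col3", dtype="int64"))
ddf2 = dd.merge(ddf, gpb2.to_frame(),
                left_on="key",
                left_index=False, right_index=True)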
Answered by Bob Haffner
How about using join?
This is your dask code, except that the meta Series is given a name: pd.Series(name='col3').
def fun2(data):
    if data["col2"].mean() > 1:
        return 2
    else:
        return 1

ddf = df.copy()
ddf.set_index("hash", inplace=True)
ddf = dd.from_pandas(ddf, npartitions=2)
gpb = ddf.groupby("col1").apply(fun2, meta=pd.Series(name='col3'))
then the join:
ddf = ddf.join(gpb.to_frame(), on='col1')
print(ddf.compute().head())
     col1  col2  col3
hash
cc      c     0     2
cc      c     0     2
cc      c     0     2
cc      c     2     2
cc      c     0     2
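A note on the design choice here (my reading of the two snippets, not something the answer states explicitly): giving the meta Series a name is what makes the joined column come back as col3. With the unnamed pd.Series() used in the question, the grouped result has no name, so its to_frame() column is 0, which is why the question's update needed the extra rename(columns={0: "col3"}) step.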