Groupby and apply: pandas vs dask
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license, include the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/45107528/
Groupby and apply pandas vs dask
Asked by rpanai
There is something I don't quite understand about dask.dataframe behavior. Let's say I want to replicate this from pandas:
import pandas as pd
import dask.dataframe as dd
import random

# build a small random toy frame
s = "abcd"
lst = 10*[0] + list(range(1, 6))
n = 100
df = pd.DataFrame({"col1": [random.choice(s) for i in range(n)],
                   "col2": [random.choice(lst) for i in range(n)]})
# I will need a hash in dask
df["hash"] = 2*df.col1
df = df[["hash", "col1", "col2"]]

def fun(data):
    # set col3 to 2 or 1 depending on the group mean of col2
    if data["col2"].mean() > 1:
        data["col3"] = 2
    else:
        data["col3"] = 1
    return data

df1 = df.groupby("col1").apply(fun)
df1.head()
This returns:
  hash col1  col2  col3
0   dd    d     0     1
1   aa    a     0     2
2   bb    b     0     1
3   bb    b     0     1
4   aa    a     0     2
In Dask I tried:
def fun2(data):
    # return a single value per group instead of modifying the frame
    if data["col2"].mean() > 1:
        return 2
    else:
        return 1

ddf = df.copy()
ddf.set_index("hash", inplace=True)
ddf = dd.from_pandas(ddf, npartitions=2)
gpb = ddf.groupby("col1").apply(fun2, meta=pd.Series())
where the groupby leads to the same result as in pandas, but I'm having a hard time merging the result back as a new column while preserving the hash index. I'd like to get the following result:
     col1  col2  col3
hash
aa      a     5     2
aa      a     0     2
aa      a     0     2
aa      a     0     2
aa      a     4     2
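For comparison, here is a pandas-only sketch (not in the original post) of this "one value per group, broadcast back onto the rows" step, using groupby().transform, which keeps the hash index untouched:

# Pandas-only sketch (not from the original question): transform broadcasts
# the scalar returned for each group back onto that group's rows, so the
# hash index is preserved without any merge.
df2 = df.set_index("hash")
df2["col3"] = df2.groupby("col1")["col2"].transform(
    lambda s: 2 if s.mean() > 1 else 1)
print(df2.head())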
UPDATE
Playing with merge, I found this solution:
ddf1 = dd.merge(ddf, gpb.to_frame(),
                left_on="col1",
                left_index=False, right_index=True)
ddf1 = ddf1.rename(columns={0: "col3"})
I'm not sure how this is going to work if I have to group by several columns, and it's not exactly elegant.
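One possible workaround for several grouping columns (a hypothetical sketch, not from the original post, reusing the same merge pattern): combine the keys into a single helper column so the grouped result keeps a plain index. The "key" column name, the separator, and the int64 dtype below are my own assumptions.

# Hypothetical sketch: grouping by col1 and col2 via a combined helper key,
# then merging exactly as above; the helper column can be dropped afterwards.
ddf["key"] = ddf["col1"] + "_" + ddf["col2"].astype(str)
gpb2 = ddf.groupby("key").apply(fun2, meta=pd.Series(name="col3", dtype="int64"))
ddf2 = dd.merge(ddf, gpb2.to_frame(),
                left_on="key",
                left_index=False, right_index=True)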
Answered by Bob Haffner
How about using join?
This is your dask code, except that the meta Series is given a name: pd.Series(name='col3').
def fun2(data):
    if data["col2"].mean() > 1:
        return 2
    else:
        return 1

ddf = df.copy()
ddf.set_index("hash", inplace=True)
ddf = dd.from_pandas(ddf, npartitions=2)
gpb = ddf.groupby("col1").apply(fun2, meta=pd.Series(name='col3'))
then the join:
ddf = ddf.join(gpb.to_frame(), on='col1')
print(ddf.compute().head())
     col1  col2  col3
hash
cc      c     0     2
cc      c     0     2
cc      c     0     2
cc      c     2     2
cc      c     0     2
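A note on the design choice here (my reading of the two snippets, not something the answer states explicitly): giving the meta Series a name is what makes the joined column come back as col3. With the unnamed pd.Series() used in the question, the grouped result has no name, so its to_frame() column is 0, which is why the question's update needed the extra rename(columns={0: "col3"}) step.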