Merge a large Dask dataframe with a small Pandas dataframe

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you reuse or share it, you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/39470332/

Time: 2020-09-14 02:00:14 | Source: igfitidea

Merge a large Dask dataframe with a small Pandas dataframe

Tags: python, pandas, dask

Asked by dleal

Following the example here: YouTube: Dask-Pandas Dataframe Join, I am attempting to merge a ~70GB Dask dataframe with a ~24MB dataframe that I loaded as a Pandas dataframe.

The merge is on two columns, A and B, and I did not set either of them as an index:

import pandas as pd
import dask.dataframe as dd
from dask.diagnostics import ProgressBar

small_df = pd.read_csv(dataframe1)  # ~24MB, loaded eagerly as a pandas dataframe
large_df = dd.read_csv(dataframe2)  # ~70GB, loaded lazily as a dask dataframe

# leftcolumns / rightcolumns hold the names of the join columns (A and B)
df2 = large_df.merge(small_df, how='left', left_on=leftcolumns, right_on=rightcolumns)

A = df2[df2['some column'] == 'somevalue']  # a reduction that fits in memory

pbar = ProgressBar()
pbar.register()

result = A.compute()

I'm using a Windows computer with 16GB of RAM and 4 cores. I use the progress bar to gauge how far along the merge is. I left it running all night last night, restarted it this morning, and after about half an hour it is still at 0% progress.

Thank you and I appreciate your help,

Update

I tried it on my Mac with 8GB of RAM and it worked pretty well. I believe I have the Dask distribution that comes with Anaconda. I don't think I did anything differently in either case.

I share my results and the timing of the above code (21 minutes):

In [26]: C = result1.compute()
[########################################] | 100% Completed | 21min 13.4s
[########################################] | 100% Completed | 21min 13.5s
[########################################] | 100% Completed | 21min 13.6s
[########################################] | 100% Completed | 21min 13.6s

Update 2

I updated to the latest version of Dask on my Windows computer and it worked well.

Answered by Barış Can Tayiz

You can iterate over the unique values shared by both dataframes and assign the other columns in a loop:

# values of the join column that appear in both dataframes
union_set = list(set(small_df['common_column']) & set(large_df['common_column']))
for el in union_set:
    for column in small_df.columns:
        if column not in large_df.columns:
            large_df.loc[large_df['common_column'] == el, column] = small_df.loc[small_df['common_column'] == el, column]


Answered by Kriti Pawar

When working with big data, partitioning the data is very important, and at the same time having enough cluster and memory capacity is mandatory.

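For instance, a minimal sketch of controlling how the large CSV is partitioned when it is read with Dask; the blocksize value and partition count below are illustrative assumptions, not part of the original answer:

import dask.dataframe as dd

# Smaller blocks give more, smaller partitions; each partition (plus the small
# pandas dataframe being merged in) has to fit comfortably in memory.
large_df = dd.read_csv(dataframe2, blocksize='64MB')
print(large_df.npartitions)  # how many partitions the file was split into

# An existing Dask dataframe can also be re-split explicitly:
large_df = large_df.repartition(npartitions=200)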

You can try using Spark.

Dask is a pure Python framework that does more of the same, i.e. it lets you run the same Pandas or NumPy code either locally or on a cluster. Apache Spark, by contrast, comes with a learning curve for a new API and execution model, although it does offer a Python wrapper.

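As an illustration only (not from the original answer), a left join in PySpark can broadcast the small table to every executor, which suits this large-with-small merge; the file paths, reader options, and column names below are assumptions carried over from the question:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName('large-small-merge').getOrCreate()

large_sdf = spark.read.csv(dataframe2, header=True, inferSchema=True)  # ~70GB table
small_sdf = spark.read.csv(dataframe1, header=True, inferSchema=True)  # ~24MB table

# broadcast() ships the small table to every executor, so the large table
# is never shuffled for the join.
joined = large_sdf.join(F.broadcast(small_sdf), on=['A', 'B'], how='left')

result = joined.filter(F.col('some column') == 'somevalue')
result.show()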

You can try partitioning the data and storing it in Parquet files.

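A minimal sketch of that idea with Dask; the output path and partition count are illustrative assumptions, and the other names are reused from the question:

import pandas as pd
import dask.dataframe as dd

# Re-split the large table into smaller partitions and persist it as Parquet;
# Parquet is columnar and much faster to re-read and filter than CSV.
large_df = dd.read_csv(dataframe2)
large_df = large_df.repartition(npartitions=200)
large_df.to_parquet('large_data_parquet')

# Later runs start from the partitioned Parquet copy instead of the raw CSV.
large_df = dd.read_parquet('large_data_parquet')
small_df = pd.read_csv(dataframe1)

df2 = large_df.merge(small_df, how='left', left_on=leftcolumns, right_on=rightcolumns)
result = df2[df2['some column'] == 'somevalue'].compute()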