Disclaimer: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/37756991/
Best way to join two large datasets in Pandas
Asked by Nickpick
I'm downloading two datasets from two different databases that need to be joined. Each of them is around 500MB when stored as CSV. Separately they fit into memory, but when I load both I sometimes get a memory error. I definitely get into trouble when I try to merge them with pandas.
What is the best way to do an outer join on them so that I don't get a memory error? I don't have any database servers at hand, but I can install any kind of open-source software on my computer if that helps. Ideally I would still like to solve it in pandas alone, but I'm not sure whether that is possible at all.
To clarify: by merging I mean an outer join. Each table has two columns, product and version. I want to check which products and versions are in the left table only, in the right table only, and in both tables. I do that with:
pd.merge(df1,df2,left_on=['product','version'],right_on=['product','version'], how='outer')
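As a side note, pd.merge also accepts an indicator=True flag that labels each output row as 'left_only', 'right_only', or 'both', which is exactly this classification. A minimal sketch, assuming df1 and df2 are the two frames in question:
import pandas as pd
# indicator=True adds a '_merge' column whose values are
# 'left_only', 'right_only' and 'both'.
merged = pd.merge(df1, df2, on=['product', 'version'],
                  how='outer', indicator=True)
left_only  = merged[merged['_merge'] == 'left_only']
right_only = merged[merged['_merge'] == 'right_only']
in_both    = merged[merged['_merge'] == 'both']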
Answered by root
This seems like a task that dask was designed for. Essentially, dask can do pandas operations out-of-core, so you can work with datasets that don't fit into memory. The dask.dataframe API is a subset of the pandas API, so there shouldn't be much of a learning curve. See the Dask DataFrame Overview page for some additional DataFrame-specific details.
import dask.dataframe as dd
# Read in the csv files.
df1 = dd.read_csv('file1.csv')
df2 = dd.read_csv('file2.csv')
# Merge the csv files.
df = dd.merge(df1, df2, how='outer', on=['product','version'])
# Write the output. Dask writes one file per partition; the '*' in the
# name is replaced by the partition number (or pass single_file=True).
df.to_csv('file3-*.csv', index=False)
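If individual partitions are still too large, dd.read_csv also takes a blocksize argument controlling how many bytes go into each partition; the 64MB below is only an illustrative value:
# Smaller partitions lower peak memory per task, at the cost of more tasks.
df1 = dd.read_csv('file1.csv', blocksize='64MB')  # illustrative value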
Assuming that 'product' and 'version' are the only columns, it may be more efficient to replace the merge with:
df = dd.concat([df1, df2]).drop_duplicates()
I'm not entirely sure if that will be better, but apparently merges that aren't done on the index are "slow-ish" in dask, so it could be worth a try.
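One way to try it (a sketch, not from the original answer, assuming 'product' and 'version' are string columns) is to build a single combined key, set it as the index, and merge on that index; note that set_index itself triggers a shuffle, so whether this wins depends on the data:
# Build one string key from the two join columns (assumes string dtypes).
df1['key'] = df1['product'] + '|' + df1['version']
df2['key'] = df2['product'] + '|' + df2['version']
# set_index shuffles the data, but the merge is then index-aligned.
df1 = df1.set_index('key')
df2 = df2.set_index('key')
df = dd.merge(df1, df2, left_index=True, right_index=True, how='outer')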
Answered by MaxU
I would recommend using an RDBMS like MySQL for that...
So you would need to load your CSV files into tables first.
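One way to do that without reading a whole file into memory (a sketch; the connection string, table names, and chunk size are placeholders) is to stream each CSV into MySQL in chunks with pandas and SQLAlchemy:
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string - adjust user, password, host and database.
engine = create_engine('mysql+pymysql://user:password@localhost/mydb')

# Stream each CSV into its table in chunks to keep memory usage flat.
for path, table in [('file1.csv', 'table_a'), ('file2.csv', 'table_b')]:
    for chunk in pd.read_csv(path, chunksize=100000):
        chunk.to_sql(table, engine, if_exists='append', index=False)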
After that you can perform your checks:
which products and versions are in the left table only
SELECT a.product, a.version
FROM table_a a
LEFT JOIN table_b b
ON a.product = b.product AND a.version = b.version
WHERE b.product IS NULL;
which products and versions are in the right table only
SELECT b.product, b.version
FROM table_a a
RIGHT JOIN table_b b
ON a.product = b.product AND a.version = b.version
WHERE a.product IS NULL;
in both
SELECT a.product, a.version
FROM table_a a
JOIN table_b b
ON a.product = b.product AND a.version = b.version;
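If the results of these checks should end up back in pandas, each one can be pulled in directly with pd.read_sql (a sketch reusing the placeholder engine from the loading step above; each result set contains only the mismatched or matched keys, so it should fit in memory):
# Run the left-only check and get the result as a pandas DataFrame.
left_only = pd.read_sql(
    """SELECT a.product, a.version
       FROM table_a a
       LEFT JOIN table_b b
         ON a.product = b.product AND a.version = b.version
       WHERE b.product IS NULL""",
    engine)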
Configure your MySQL Server, so that it uses at least 2GB of RAM
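For InnoDB tables the relevant setting is the buffer pool size (for MyISAM it is the key buffer); a minimal my.cnf sketch, with values that depend on your available memory:
[mysqld]
# Let InnoDB cache roughly 2GB of data and indexes in memory.
innodb_buffer_pool_size = 2G
# For MyISAM tables, size the index key cache instead.
key_buffer_size = 2G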
You may also want to use the MyISAM engine for your tables; in this case check this
It might be slower compared to Pandas, but you definitely won't have memory issues.
Other possible solutions:
- increase your RAM
- use Apache Spark SQL (distributed DataFrame) on multiple cluster nodes - though it will be much cheaper to just increase your RAM