Disclaimer: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/37756991/
Best way to join two large datasets in Pandas
Asked by Nickpick
I'm downloading two datasets from two different databases that need to be joined. Each of them is around 500MB when stored as CSV. Separately they fit into memory, but when I load both I sometimes get a memory error. I definitely get into trouble when I try to merge them with pandas.
What is the best way to do an outer join on them so that I don't get a memory error? I don't have any database servers at hand, but I can install any kind of open-source software on my computer if that helps. Ideally I would still like to solve it in pandas alone, but I'm not sure whether that is possible at all.
To clarify: by merging I mean an outer join. Each table has two columns, product and version. I want to check which products and versions are in the left table only, in the right table only, and in both tables. I do that with:
pd.merge(df1,df2,left_on=['product','version'],right_on=['product','version'], how='outer')
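As a side note, pd.merge also accepts an indicator=True flag that labels each output row as 'left_only', 'right_only', or 'both', which is exactly this classification. A minimal sketch, assuming df1 and df2 are the two frames in question:
import pandas as pd
# indicator=True adds a '_merge' column whose values are
# 'left_only', 'right_only' and 'both'.
merged = pd.merge(df1, df2, on=['product', 'version'],
                  how='outer', indicator=True)
left_only  = merged[merged['_merge'] == 'left_only']
right_only = merged[merged['_merge'] == 'right_only']
in_both    = merged[merged['_merge'] == 'both']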
Answered by root
This seems like a task that dask was designed for. Essentially, dask can do pandas operations out-of-core, so you can work with datasets that don't fit into memory. The dask.dataframe API is a subset of the pandas API, so there shouldn't be much of a learning curve. See the Dask DataFrame Overview page for some additional DataFrame-specific details.
import dask.dataframe as dd
# Read in the csv files.
df1 = dd.read_csv('file1.csv')
df2 = dd.read_csv('file2.csv')
# Merge the csv files.
df = dd.merge(df1, df2, how='outer', on=['product','version'])
# Write the output. Dask writes one file per partition; the '*' in the
# name is replaced by the partition number (or pass single_file=True).
df.to_csv('file3-*.csv', index=False)
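If individual partitions are still too large, dd.read_csv also takes a blocksize argument controlling how many bytes go into each partition; the 64MB below is only an illustrative value:
# Smaller partitions lower peak memory per task, at the cost of more tasks.
df1 = dd.read_csv('file1.csv', blocksize='64MB')  # illustrative value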
Assuming that 'product' and 'version' are the only columns, it may be more efficient to replace the merge with:
df = dd.concat([df1, df2]).drop_duplicates()
I'm not entirely sure if that will be better, but apparently merges that aren't done on the index are "slow-ish" in dask, so it could be worth a try.
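One way to try it (a sketch, not from the original answer, assuming 'product' and 'version' are string columns) is to build a single combined key, set it as the index, and merge on that index; note that set_index itself triggers a shuffle, so whether this wins depends on the data:
# Build one string key from the two join columns (assumes string dtypes).
df1['key'] = df1['product'] + '|' + df1['version']
df2['key'] = df2['product'] + '|' + df2['version']
# set_index shuffles the data, but the merge is then index-aligned.
df1 = df1.set_index('key')
df2 = df2.set_index('key')
df = dd.merge(df1, df2, left_index=True, right_index=True, how='outer')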
Answered by MaxU
I would recommend using an RDBMS like MySQL for that...
So you would need to load your CSV files into tables first.
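One way to do that without reading a whole file into memory (a sketch; the connection string, table names, and chunk size are placeholders) is to stream each CSV into MySQL in chunks with pandas and SQLAlchemy:
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string - adjust user, password, host and database.
engine = create_engine('mysql+pymysql://user:password@localhost/mydb')

# Stream each CSV into its table in chunks to keep memory usage flat.
for path, table in [('file1.csv', 'table_a'), ('file2.csv', 'table_b')]:
    for chunk in pd.read_csv(path, chunksize=100000):
        chunk.to_sql(table, engine, if_exists='append', index=False)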
After that you can perform your checks:
which products and versions are in the left table only
SELECT a.product, a.version
FROM table_a a
LEFT JOIN table_b b
ON a.product = b.product AND a.version = b.version
WHERE b.product IS NULL;
which products and versions are in the right table only
SELECT b.product, b.version
FROM table_a a
RIGHT JOIN table_b b
ON a.product = b.product AND a.version = b.version
WHERE a.product IS NULL;
in both
SELECT a.product, a.version
FROM table_a a
JOIN table_b b
ON a.product = b.product AND a.version = b.version;
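If the results of these checks should end up back in pandas, each one can be pulled in directly with pd.read_sql (a sketch reusing the placeholder engine from the loading step above; each result set contains only the mismatched or matched keys, so it should fit in memory):
# Run the left-only check and get the result as a pandas DataFrame.
left_only = pd.read_sql(
    """SELECT a.product, a.version
       FROM table_a a
       LEFT JOIN table_b b
         ON a.product = b.product AND a.version = b.version
       WHERE b.product IS NULL""",
    engine)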
Configure your MySQL Server, so that it uses at least 2GB of RAM
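For InnoDB tables the relevant setting is the buffer pool size (for MyISAM it is the key buffer); a minimal my.cnf sketch, with values that depend on your available memory:
[mysqld]
# Let InnoDB cache roughly 2GB of data and indexes in memory.
innodb_buffer_pool_size = 2G
# For MyISAM tables, size the index key cache instead.
key_buffer_size = 2G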
You may also want to use the MyISAM engine for your tables; in this case check this
It might be slower compared to Pandas, but you definitely won't have memory issues.
Other possible solutions:
- increase your RAM
- use Apache Spark SQL (distributed DataFrame) on multiple cluster nodes - though it will be much cheaper to just increase your RAM