Merging two tables with millions of rows in Python

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA terms, cite the original URL, and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/14614512/


Merging two tables with millions of rows in Python

Tags: python, join, merge, pandas, pytables

Asked by user2027051

I am using Python for some data analysis. I have two tables, the first (let's call it 'A') has 10 million rows and 10 columns and the second ('B') has 73 million rows and 2 columns. They have 1 column with common ids and I want to intersect the two tables based on that column. In particular I want the inner join of the tables.

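In pandas terms, the goal is the inner join sketched below; the shared column name 'id' and the toy data are assumptions made purely for illustration, and it is exactly this operation that does not fit in memory at the scale described:

import pandas as pd

# Toy stand-ins for A (10 million x 10) and B (73 million x 2).
table_a = pd.DataFrame({'id': [1, 2, 3], 'x': [10.0, 20.0, 30.0]})
table_b = pd.DataFrame({'id': [2, 3, 3, 4], 'value': [0.1, 0.2, 0.3, 0.4]})

# The desired result: an inner join on the shared id column.
result = table_a.merge(table_b, on='id', how='inner')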

I could not load table B into memory as a pandas DataFrame in order to use the normal pandas merge function. I tried reading the file of table B in chunks, intersecting each chunk with A and then concatenating these intersections (the output of the inner joins). This is OK on speed, but every now and then it gives me problems and spits out a segmentation fault ... not so great. This error is difficult to reproduce, but it happens on two different machines (Mac OS X v10.6 (Snow Leopard) and UNIX, Red Hat Linux).

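A minimal sketch of that chunked attempt (the file names, CSV format, chunk size, and the shared column name 'id' are all assumptions made for illustration):

import pandas as pd

# A fits in memory; B is streamed in chunks and joined piece by piece.
a = pd.read_csv('A.csv')

pieces = []
for chunk in pd.read_csv('B.csv', chunksize=1000000):
    # Inner join each chunk of B against A on the shared id column.
    pieces.append(a.merge(chunk, on='id', how='inner'))

result = pd.concat(pieces, ignore_index=True)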

I finally tried a combination of Pandas and PyTables, by writing table B to disk and then iterating over table A and selecting the matching rows from table B. This last option works, but it is slow. Table B in PyTables has already been indexed by default.

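Roughly, that third approach looks like the sketch below (the file names, numeric ids, the column name 'id', and the requirement that B was written with data_columns=['id'] are all assumptions):

import pandas as pd

pieces = []
with pd.HDFStore('B.h5') as store:
    a = pd.read_csv('A.csv')               # A (10M x 10) still fits in memory
    for row_id in a['id'].unique():
        # One disk query per id: correct, but slow, as described above.
        matches = store.select('df', where=f'id == {row_id}')
        if len(matches):
            pieces.append(matches)

b_matching = pd.concat(pieces, ignore_index=True)
result = a.merge(b_matching, on='id', how='inner')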

How do I tackle this problem?


Accepted answer by Jeff

This is a little pseudo-codish, but I think it should be quite fast.


Straightforward disk-based merge, with all tables on disk. The key is that you are not doing selection per se, just indexing into the table via start/stop, which is quite fast.


Selecting the rows that meet a criterion in B (using A's ids) won't be very fast, because I think it might be bringing the data into Python space rather than doing an in-kernel search (I am not sure, but you might want to investigate further on pytables.org, in the in-kernel optimization section; there is a way to tell whether a query is going to be in-kernel or not).

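PyTables itself can report whether a given condition would be evaluated in-kernel. A small sketch of that check (the file layout shown is the one pandas' HDFStore uses for 'table'-format frames, and the query only works this way if 'id' was stored as a data column; both are assumptions here):

import tables

h5 = tables.open_file('B.h5', mode='r')
tbl = h5.get_node('/df/table')

# Reports which column indexes (if any) the condition would use,
# i.e. whether the search can stay in-kernel instead of Python space.
print(tbl.will_query_use_indexing('id == 42'))

# read_where evaluates the condition in-kernel and returns the matching rows.
rows = tbl.read_where('id == 42')

h5.close()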

Also, if you are up to it, this is a very parallel problem (just don't write the results to the same file from multiple processes; PyTables is not write-safe for that).

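A rough sketch of that parallel split (the file names, chunk sizes, and the column name 'id' are assumptions; the important point is that every worker writes to its own output file):

from multiprocessing import Pool

import pandas as pd

A_CHUNK = 1000000        # rows of A handled by one worker
B_CHUNK = 1000000        # rows of B scanned at a time
NROWS_B = 73000000       # total rows in B, per the question

def process_slice(i):
    # Each worker takes one slice of A, scans B in chunks, and writes any
    # matches to its OWN file -- never to a file shared with other workers.
    a = pd.read_hdf('A.h5', 'df', start=i * A_CHUNK, stop=(i + 1) * A_CHUNK)
    pieces = []
    for start in range(0, NROWS_B, B_CHUNK):
        b = pd.read_hdf('B.h5', 'df', start=start, stop=start + B_CHUNK)
        m = a.merge(b, on='id', how='inner')
        if len(m):
            pieces.append(m)
    if pieces:
        pd.concat(pieces).to_hdf(f'result_{i}.h5', key='df', format='table')

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        pool.map(process_slice, range(10))   # 10 slices of 1M rows cover A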

See this answer for a comment on how doing a join operation will actually be an 'inner' join.


For your merge_a_b operation I think you can use a standard pandas join which is quite efficient (when in-memory).


One other option (depending on how 'big' A is) might be to separate A into two pieces that are indexed the same, keeping a smaller table (maybe just the single query column) as the first piece; instead of storing the merge results per se, store the row index, and later pull out the data you need (kind of like using an indexer and take). See http://pandas.pydata.org/pandas-docs/stable/io.html#multiple-table-queries

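A rough illustration of the row-index idea (the file layout and column names are assumptions; the link above covers the multiple-table-query machinery pandas provides for keeping the two pieces of A aligned):

import numpy as np
import pandas as pd

# Read just the small id column of each table (B's would be chunked in practice).
a_ids = pd.read_hdf('A.h5', 'df', columns=['id'])
b_ids = pd.read_hdf('B.h5', 'df', columns=['id'])

# Row positions of A that would survive the inner join.
keep = np.flatnonzero(a_ids['id'].isin(b_ids['id']).to_numpy())

# Later, pull out only those rows (and whichever columns you need) of A.
a_matched = pd.read_hdf('A.h5', 'df').take(keep)

The chunked start/stop merge itself: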

from pandas import HDFStore

A = HDFStore('A.h5')
B = HDFStore('B.h5')
store = HDFStore('result.h5')   # this is your result store -- keep it separate

nrows_a = A.get_storer('df').nrows
nrows_b = B.get_storer('df').nrows
a_chunk_size = 1000000
b_chunk_size = 1000000

def merge_a_b(a, b):
    # Function that returns an operation on the passed frames, a and b.
    # It could be a merge, join, concat, or other operation that
    # results in a single frame. As a placeholder, an inner join on an
    # assumed shared column named 'id':
    return a.merge(b, on='id', how='inner')

for a_i in range(nrows_a // a_chunk_size + 1):

    a_start_i = a_i * a_chunk_size
    a_stop_i = min((a_i + 1) * a_chunk_size, nrows_a)

    # Pull the next chunk of A straight off disk by row position.
    a = A.select('df', start=a_start_i, stop=a_stop_i)

    for b_i in range(nrows_b // b_chunk_size + 1):

        b_start_i = b_i * b_chunk_size
        b_stop_i = min((b_i + 1) * b_chunk_size, nrows_b)

        b = B.select('df', start=b_start_i, stop=b_stop_i)

        m = merge_a_b(a, b)

        # Append any matches to the result store.
        if len(m):
            store.append('df_result', m)

A.close()
B.close()
store.close()