Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me) and cite the original: http://stackoverflow.com/questions/38246166/

Date: 2020-09-14 01:32:21  Source: igfitidea

Efficient way to combine pandas data frames row-wise

python numpy pandas

Asked by sedeh

I have 14 data frames, each with 14 columns and more than 250,000 rows. The data frames have identical column headers, and I would like to merge them row-wise. I attempted to concatenate the data frames into a 'growing' DataFrame, and it's taking several hours.

Essentially, I was doing something like below 13 times:

DF = pd.DataFrame()
for i in range(13):   
    DF = pd.concat([DF, subDF])
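A sketch of why this pattern crawls (illustrative sizes, not part of the original question): every `pd.concat` allocates a brand-new frame and copies all rows accumulated so far, so the total volume copied grows quadratically with the number of appends:

```python
import numpy as np
import pandas as pd

subDF = pd.DataFrame(np.random.rand(1000, 14))

DF = subDF.copy()
rows_copied = 0
for i in range(12):
    # each concat copies everything gathered so far plus the new chunk
    DF = pd.concat([DF, subDF], ignore_index=True)
    rows_copied += len(DF)

# 2000 + 3000 + ... + 13000 rows touched across 12 appends
print(rows_copied)  # 90000, versus only 13000 rows in the final frame
```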

The StackOverflow answer here suggests appending all sub data frames to a list and then concatenating the list of sub data frames.

That sounds like doing something like this:

DF = pd.DataFrame()
lst = [subDF, subDF, subDF....subDF] #up to 13 times
for subDF in lst:
    DF = pd.concat([DF, subDF])

Aren't they the same thing? Perhaps I'm misunderstanding the suggested workflow. Here's what I tested.

import numpy
import pandas as pd
import timeit

def test1():
    "make all subDF and then concatenate them"
    numpy.random.seed(1)
    subDF = pd.DataFrame(numpy.random.rand(1))
    lst = [subDF, subDF, subDF]
    DF = pd.DataFrame()
    for subDF in lst:
        DF = pd.concat([DF, subDF], axis=0, ignore_index=True)

def test2():
    "add each subDF to the collecting DF as you're making the subDF"
    numpy.random.seed(1)
    DF = pd.DataFrame()
    for i in range(3):
        subDF = pd.DataFrame(numpy.random.rand(1))
        DF = pd.concat([DF, subDF], axis=0, ignore_index=True)

print('test1() takes {0} sec'.format(timeit.timeit(test1, number=1000)))
print('test2() takes {0} sec'.format(timeit.timeit(test2, number=1000)))

Output:

test1() takes 12.732409087137057 sec
test2() takes 15.097430311612698 sec

I would appreciate your suggestions on efficient ways to concatenate multiple large data frames row-wise. Thanks!

Answered by Alberto Garcia-Raboso

Create a list with all your data frames:

dfs = []
for i in range(13):
    df = ... # However it is that you create your dataframes   
    dfs.append(df)

Then concatenate them in one swoop:

merged = pd.concat(dfs) # add ignore_index=True if appropriate
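For illustration (toy frames made up for this sketch, not from the original answer), `ignore_index=True` matters when the source frames carry overlapping row labels:

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [3, 4]})

kept = pd.concat([a, b])                      # keeps each frame's labels: 0, 1, 0, 1
fresh = pd.concat([a, b], ignore_index=True)  # renumbers: 0, 1, 2, 3

print(list(kept.index))   # [0, 1, 0, 1]
print(list(fresh.index))  # [0, 1, 2, 3]
```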

This is a lot faster than your code because it creates exactly 14 dataframes (your original 13 plus merged), while your code creates 26 of them (your original 13 plus 13 intermediate merges).
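As a quick sanity check (a sketch with made-up sizes, not from the original answer), both routes produce identical data; the difference is purely in the throwaway intermediate frames:

```python
import numpy as np
import pandas as pd

np.random.seed(1)
dfs = [pd.DataFrame(np.random.rand(1000, 14)) for _ in range(13)]

# sequential: builds 12 throwaway intermediate frames along the way
seq = dfs[0]
for df in dfs[1:]:
    seq = pd.concat([seq, df], ignore_index=True)

# batch: one concat call, one result frame
batch = pd.concat(dfs, ignore_index=True)

pd.testing.assert_frame_equal(seq, batch)  # identical results
```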

EDIT:

Here's a variation on your testing code.

import numpy
import pandas as pd
import timeit

def test_gen_time():
    """Create three large dataframes, but don't concatenate them"""
    for i in range(3):
        df = pd.DataFrame(numpy.random.rand(10**6))

def test_sequential_concat():
    """Create three large dataframes, concatenate them one by one"""
    DF = pd.DataFrame()
    for i in range(3):
        df = pd.DataFrame(numpy.random.rand(10**6))
        DF = pd.concat([DF, df], ignore_index=True)

def test_batch_concat():
    """Create three large dataframes, concatenate them at the end"""
    dfs = []
    for i in range(3):
        df = pd.DataFrame(numpy.random.rand(10**6))
        dfs.append(df)
    DF = pd.concat(dfs, ignore_index=True)

print('test_gen_time() takes {0} sec'
          .format(timeit.timeit(test_gen_time, number=200)))
print('test_sequential_concat() takes {0} sec'
          .format(timeit.timeit(test_sequential_concat, number=200)))
print('test_batch_concat() takes {0} sec'
          .format(timeit.timeit(test_batch_concat, number=200)))

Output:

test_gen_time() takes 10.095820872998956 sec
test_sequential_concat() takes 17.144756617000894 sec
test_batch_concat() takes 12.99131180600125 sec

The lion's share of the time goes to generating the dataframes. Net of that baseline (about 10.1 sec), batch concatenation takes around 2.9 seconds; sequential concatenation takes more than 7 seconds.
