Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me) and cite the original: http://stackoverflow.com/questions/38246166/

Date: 2020-09-14 01:32:21  Source: igfitidea

Efficient way to combine pandas data frames row-wise

python numpy pandas

Asked by sedeh

I have 14 data frames, each with 14 columns and more than 250,000 rows. The data frames have identical column headers, and I would like to merge them row-wise. I attempted to concatenate the data frames into a 'growing' DataFrame, and it's taking several hours.

Essentially, I was doing something like below 13 times:

DF = pd.DataFrame()
for i in range(13):   
    DF = pd.concat([DF, subDF])
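A sketch of why this pattern crawls (illustrative sizes, not part of the original question): every `pd.concat` allocates a brand-new frame and copies all rows accumulated so far, so the total volume copied grows quadratically with the number of appends:

```python
import numpy as np
import pandas as pd

subDF = pd.DataFrame(np.random.rand(1000, 14))

DF = subDF.copy()
rows_copied = 0
for i in range(12):
    # each concat copies everything gathered so far plus the new chunk
    DF = pd.concat([DF, subDF], ignore_index=True)
    rows_copied += len(DF)

# 2000 + 3000 + ... + 13000 rows touched across 12 appends
print(rows_copied)  # 90000, versus only 13000 rows in the final frame
```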

The StackOverflow answer here suggests appending all sub data frames to a list and then concatenating the list of sub data frames.

That sounds like doing something like this:

DF = pd.DataFrame()
lst = [subDF, subDF, subDF....subDF] #up to 13 times
for subDF in lst:
    DF = pd.concat([DF, subDF])

Aren't they the same thing? Perhaps I'm misunderstanding the suggested workflow. Here's what I tested.

import numpy
import pandas as pd
import timeit

def test1():
    "make all subDF and then concatenate them"
    numpy.random.seed(1)
    subDF = pd.DataFrame(numpy.random.rand(1))
    lst = [subDF, subDF, subDF]
    DF = pd.DataFrame()
    for subDF in lst:
        DF = pd.concat([DF, subDF], axis=0, ignore_index=True)

def test2():
    "add each subDF to the collecting DF as you're making the subDF"
    numpy.random.seed(1)
    DF = pd.DataFrame()
    for i in range(3):
        subDF = pd.DataFrame(numpy.random.rand(1))
        DF = pd.concat([DF, subDF], axis=0, ignore_index=True)

print('test1() takes {0} sec'.format(timeit.timeit(test1, number=1000)))
print('test2() takes {0} sec'.format(timeit.timeit(test2, number=1000)))

Output:

test1() takes 12.732409087137057 sec
test2() takes 15.097430311612698 sec

I would appreciate your suggestions on efficient ways to concatenate multiple large data frames row-wise. Thanks!

Answered by Alberto Garcia-Raboso

Create a list with all your data frames:

dfs = []
for i in range(13):
    df = ... # However it is that you create your dataframes   
    dfs.append(df)

Then concatenate them in one swoop:

merged = pd.concat(dfs) # add ignore_index=True if appropriate
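For illustration (toy frames made up for this sketch, not from the original answer), `ignore_index=True` matters when the source frames carry overlapping row labels:

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [3, 4]})

kept = pd.concat([a, b])                      # keeps each frame's labels: 0, 1, 0, 1
fresh = pd.concat([a, b], ignore_index=True)  # renumbers: 0, 1, 2, 3

print(list(kept.index))   # [0, 1, 0, 1]
print(list(fresh.index))  # [0, 1, 2, 3]
```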

This is a lot faster than your code because it creates exactly 14 dataframes (your original 13 plus merged), while your code creates 26 of them (your original 13 plus 13 intermediate merges).
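As a quick sanity check (a sketch with made-up sizes, not from the original answer), both routes produce identical data; the difference is purely in the throwaway intermediate frames:

```python
import numpy as np
import pandas as pd

np.random.seed(1)
dfs = [pd.DataFrame(np.random.rand(1000, 14)) for _ in range(13)]

# sequential: builds 12 throwaway intermediate frames along the way
seq = dfs[0]
for df in dfs[1:]:
    seq = pd.concat([seq, df], ignore_index=True)

# batch: one concat call, one result frame
batch = pd.concat(dfs, ignore_index=True)

pd.testing.assert_frame_equal(seq, batch)  # identical results
```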

EDIT:

Here's a variation on your testing code.

import numpy
import pandas as pd
import timeit

def test_gen_time():
    """Create three large dataframes, but don't concatenate them"""
    for i in range(3):
        df = pd.DataFrame(numpy.random.rand(10**6))

def test_sequential_concat():
    """Create three large dataframes, concatenate them one by one"""
    DF = pd.DataFrame()
    for i in range(3):
        df = pd.DataFrame(numpy.random.rand(10**6))
        DF = pd.concat([DF, df], ignore_index=True)

def test_batch_concat():
    """Create three large dataframes, concatenate them at the end"""
    dfs = []
    for i in range(3):
        df = pd.DataFrame(numpy.random.rand(10**6))
        dfs.append(df)
    DF = pd.concat(dfs, ignore_index=True)

print('test_gen_time() takes {0} sec'
          .format(timeit.timeit(test_gen_time, number=200)))
print('test_sequential_concat() takes {0} sec'
          .format(timeit.timeit(test_sequential_concat, number=200)))
print('test_batch_concat() takes {0} sec'
          .format(timeit.timeit(test_batch_concat, number=200)))

Output:

test_gen_time() takes 10.095820872998956 sec
test_sequential_concat() takes 17.144756617000894 sec
test_batch_concat() takes 12.99131180600125 sec

The lion's share of the time goes to generating the dataframes. Net of that baseline (about 10.1 sec), batch concatenation takes around 2.9 seconds; sequential concatenation takes more than 7 seconds.
