Creating large Pandas DataFrames in Python: preallocation vs append vs concat

Note: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not this site). Original question: http://stackoverflow.com/questions/31690076/

Creating large Pandas DataFrames: preallocation vs append vs concat

python, pandas

Asked by andrew

I am confused by the performance in Pandas when building a large dataframe chunk by chunk. In Numpy, we (almost) always see better performance by preallocating a large empty array and then filling in the values. As I understand it, this is due to Numpy grabbing all the memory it needs at once instead of having to reallocate memory with every append operation.
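
For illustration, here is a minimal sketch of the two NumPy patterns being compared; it is not part of the timings below, and the array size and timeit harness are arbitrary choices:

import numpy as np
from timeit import timeit

n = 10000

def grow_by_append():
    # np.append returns a new array, so this reallocates and copies every iteration
    out = np.empty(0)
    for i in range(n):
        out = np.append(out, i)
    return out

def fill_preallocated():
    # grab all the memory once, then write each element in place
    out = np.empty(n)
    for i in range(n):
        out[i] = i
    return out

print(timeit(grow_by_append, number=10))
print(timeit(fill_preallocated, number=10))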

In Pandas, I seem to be getting better performance by using the df = df.append(temp) pattern.

Here is an example with timing. The definition of the Timer class follows. As you can see, I find that preallocating is roughly 10x slower than using append! Preallocating a dataframe with np.empty values of the appropriate dtype helps a great deal, but the append method is still the fastest.

import numpy as np
from numpy.random import rand
import pandas as pd

from timer import Timer

# Some constants
num_dfs = 10  # Number of random dataframes to generate
n_rows = 2500
n_cols = 40
n_reps = 100  # Number of repetitions for timing

# Generate a list of num_dfs dataframes of random values
df_list = [pd.DataFrame(rand(n_rows*n_cols).reshape((n_rows, n_cols)), columns=np.arange(n_cols)) for i in np.arange(num_dfs)]

##
# Define the methods of growing a large dataframe
##

# Method 1 - append dataframes
def method1():
    out_df1 = pd.DataFrame(columns=np.arange(4))
    for df in df_list:
        out_df1 = out_df1.append(df, ignore_index=True)
    return out_df1

# Method 2 - preallocate an empty dataframe, set the dtypes, then fill in the values
def method2():
    # Create an empty dataframe that is big enough to hold all the dataframes in df_list
    out_df2 = pd.DataFrame(columns=np.arange(n_cols), index=np.arange(num_dfs*n_rows))
    # EDIT_1: Set the dtypes of each column
    for ix, col in enumerate(out_df2.columns):
        out_df2[col] = out_df2[col].astype(df_list[0].dtypes[ix])
    # Fill in the values
    for ix, df in enumerate(df_list):
        out_df2.iloc[ix*n_rows:(ix+1)*n_rows, :] = df.values
    return out_df2

# EDIT_2: 
# Method 3 - preallocate dataframe with np.empty data of appropriate type
def method3():
    # Create fake data array
    data = np.transpose(np.array([np.empty(n_rows*num_dfs, dtype=dt) for dt in df_list[0].dtypes]))
    # Create placeholder dataframe
    out_df3 = pd.DataFrame(data)
    # Fill in the real values
    for ix, df in enumerate(df_list):
        out_df3.iloc[ix*n_rows:(ix+1)*n_rows, :] = df.values
    return out_df3

##
# Time the methods
##

# Time Method 1
times_1 = np.empty(n_reps)
for i in np.arange(n_reps):
    with Timer() as t:
       df1 = method1()
    times_1[i] = t.secs
print 'Total time for %d repetitions of Method 1: %f [sec]' % (n_reps, np.sum(times_1))
print 'Best time: %f' % (np.min(times_1))
print 'Mean time: %f' % (np.mean(times_1))

#>>  Total time for 100 repetitions of Method 1: 2.928296 [sec]
#>>  Best time: 0.028532
#>>  Mean time: 0.029283

# Time Method 2
times_2 = np.empty(n_reps)
for i in np.arange(n_reps):
    with Timer() as t:
        df2 = method2()
    times_2[i] = t.secs
print 'Total time for %d repetitions of Method 2: %f [sec]' % (n_reps, np.sum(times_2))
print 'Best time: %f' % (np.min(times_2))
print 'Mean time: %f' % (np.mean(times_2))

#>>  Total time for 100 repetitions of Method 2: 32.143247 [sec]
#>>  Best time: 0.315075
#>>  Mean time: 0.321432

# Time Method 3
times_3 = np.empty(n_reps)
for i in np.arange(n_reps):
    with Timer() as t:
        df3 = method3()
    times_3[i] = t.secs
print 'Total time for %d repetitions of Method 3: %f [sec]' % (n_reps, np.sum(times_3))
print 'Best time: %f' % (np.min(times_3))
print 'Mean time: %f' % (np.mean(times_3))

#>>  Total time for 100 repetitions of Method 3: 6.577038 [sec]
#>>  Best time: 0.063437
#>>  Mean time: 0.065770

I use a nice Timer, courtesy of Huy Nguyen:

# credit: http://www.huyng.com/posts/python-performance-analysis/

import time

class Timer(object):
    def __init__(self, verbose=False):
        self.verbose = verbose

    def __enter__(self):
        self.start = time.clock()
        return self

    def __exit__(self, *args):
        self.end = time.clock()
        self.secs = self.end - self.start
        self.msecs = self.secs * 1000  # millisecs
        if self.verbose:
            print 'elapsed time: %f ms' % self.msecs
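
If you are on Python 3.8 or later, note that time.clock() has been removed; a roughly equivalent sketch of the same context manager uses time.perf_counter() instead:

import time

class Timer(object):
    # Same context manager, sketched with time.perf_counter() for Python 3
    def __init__(self, verbose=False):
        self.verbose = verbose

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, *args):
        self.end = time.perf_counter()
        self.secs = self.end - self.start
        self.msecs = self.secs * 1000  # millisecs
        if self.verbose:
            print('elapsed time: %f ms' % self.msecs)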

If you are still following, I have two questions:

1) Why is the append method faster? (NOTE: for very small dataframes, i.e. n_rows = 40, it is actually slower).

2) What is the most efficient way to build a large dataframe out of chunks? (In my case, the chunks are all large csv files).

Thanks for your help!

EDIT_1: In my real world project, the columns have different dtypes. So I cannot use the pd.DataFrame(..., dtype=some_type) trick to improve the performance of preallocation, per BrenBarn's recommendation. The dtype parameter forces all the columns to be the same dtype [Ref. issue 4464].

I added some lines to method2() in my code to change the dtypes column by column to match the input dataframes. This operation is expensive and negates the benefit of having the appropriate dtypes when writing blocks of rows.
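
One way to avoid that extra pass (sketched here, not benchmarked) is to build the placeholder from a dict of np.empty arrays so that each column gets its dtype up front; this reuses df_list, num_dfs and n_rows from above:

# Sketch (not benchmarked): preallocate with per-column dtypes in one step,
# instead of creating an object-dtype frame and calling .astype() per column
dtypes = df_list[0].dtypes
out_df = pd.DataFrame(
    {col: np.empty(num_dfs * n_rows, dtype=dt) for col, dt in dtypes.items()}
)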

EDIT_2: Try preallocating a dataframe using a placeholder array created with np.empty(..., dtype=some_type), per @Joris's suggestion.

Accepted answer by Jeff

Your benchmark is actually too small to show the real difference. Appending copies EACH time, so you are really copying a size-N memory space N*(N-1) times. This is horribly inefficient as the size of your dataframe grows. It certainly might not matter for a very small frame, but if you have any real size it matters a lot. This is specifically noted in the docs here, though it is kind of a small warning.

In [97]: df = DataFrame(np.random.randn(100000,20))

In [98]: df['B'] = 'foo'

In [99]: df['C'] = pd.Timestamp('20130101')

In [103]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 22 columns):
0     100000 non-null float64
1     100000 non-null float64
2     100000 non-null float64
3     100000 non-null float64
4     100000 non-null float64
5     100000 non-null float64
6     100000 non-null float64
7     100000 non-null float64
8     100000 non-null float64
9     100000 non-null float64
10    100000 non-null float64
11    100000 non-null float64
12    100000 non-null float64
13    100000 non-null float64
14    100000 non-null float64
15    100000 non-null float64
16    100000 non-null float64
17    100000 non-null float64
18    100000 non-null float64
19    100000 non-null float64
B     100000 non-null object
C     100000 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(20), object(1)
memory usage: 17.5+ MB

Appending

In [85]: def f1():
   ....:     result = df
   ....:     for i in range(9):
   ....:         result = result.append(df)
   ....:     return result
   ....: 

Concat

In [86]: def f2():
   ....:     result = []
   ....:     for i in range(10):
   ....:         result.append(df)
   ....:     return pd.concat(result)
   ....: 

In [100]: f1().equals(f2())
Out[100]: True

In [101]: %timeit f1()
1 loops, best of 3: 1.66 s per loop

In [102]: %timeit f2()
1 loops, best of 3: 220 ms per loop

Note that I wouldn't even bother trying to pre-allocate. It's somewhat complicated, especially since you are dealing with multiple dtypes (e.g. you could make a giant frame and simply .loc into it and it would work). But pd.concat is just dead simple, works reliably, and is fast.
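
For concreteness, the "giant frame plus .loc" option would look roughly like this for the all-float case, reusing the names from the question (a sketch only; I still wouldn't bother):

# Rough sketch of preallocating and assigning row blocks with .loc
# (all-float case; reuses df_list / num_dfs / n_rows / n_cols from the question)
big = pd.DataFrame(index=np.arange(num_dfs * n_rows),
                   columns=np.arange(n_cols), dtype=np.float64)
for i, chunk in enumerate(df_list):
    # .loc is label-based and end-inclusive, hence the -1
    big.loc[i * n_rows:(i + 1) * n_rows - 1, :] = chunk.values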

And the timings using your sizes from above:

In [104]: df = DataFrame(np.random.randn(2500,40))

In [105]: %timeit f1()
10 loops, best of 3: 33.1 ms per loop

In [106]: %timeit f2()
100 loops, best of 3: 4.23 ms per loop

Answered by BrenBarn

You didn't specify any data or type for out_df2, so it has the "object" dtype. This makes assigning values to it very slow. Specify float64 dtype:

out_df2 = pd.DataFrame(columns=np.arange(n_cols), index=np.arange(num_dfs*n_rows), dtype=np.float64)

You will see a dramatic speedup. When I tried it, method2 with this change is about twice as fast as method1.

Answered by andrew

@Jeff, pd.concat wins by a mile! I benchmarked a fourth method using pd.concat with num_dfs = 500. The results are unequivocal:

The method4() definition:

# Method 4 - use pd.concat on df_list
def method4():
    return pd.concat(df_list, ignore_index=True)

Profiling results, using the same Timer as in my original question:

Total time for 100 repetitions of Method 1: 3679.334655 [sec]
Best time: 35.570036
Mean time: 36.793347
Total time for 100 repetitions of Method 2: 1569.917425 [sec]
Best time: 15.457102
Mean time: 15.699174
Total time for 100 repetitions of Method 3: 325.730455 [sec]
Best time: 3.192702
Mean time: 3.257305
Total time for 100 repetitions of Method 4: 25.448473 [sec]
Best time: 0.244309
Mean time: 0.254485

The pd.concat method is 13x faster than preallocating with an np.empty(... dtype) placeholder.
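
Since the chunks in my real project are large csv files, the same list-then-concat pattern applied there looks roughly like this (the glob pattern is just a placeholder):

# Sketch: read each csv chunk, then concatenate once at the end
import glob
import pandas as pd

csv_files = sorted(glob.glob('data/chunk_*.csv'))  # placeholder paths
big_df = pd.concat((pd.read_csv(f) for f in csv_files), ignore_index=True)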

pd.concat方法比使用np.empty(... dtype)占位符进行预分配快 13 倍。

Answered by Tom

Jeff's answer is correct, but I found for my data type another solution worked better.

def df_(): 
    return pd.DataFrame(['foo']*np.random.randint(100)).transpose()

k = 100
frames = [df_() for x in range(0, k)]

def f1():
    result = frames[0]
    for i in range(k-1):
        result = result.append(frames[i+1])
    return result

def f2():  
    result = []
    for i in range(k):
        result.append(frames[i])
    return pd.concat(result)

def f3():
    # concatenate pairwise, round by round (a tree of concats), instead of all at once
    result = []
    for i in range(k):
        result.append(frames[i])

    n = 2
    while len(result) > 1:
        _result = []
        for i in range(0, len(result), n):
            _result.append(pd.concat(result[i:i+n]))
        result = _result
    return result[0]

My dataframes are a single row and of varying length - the null entries must have something to do with why f3() succeeds.

In [33]: f1().equals(f2())
Out[33]: True

In [34]: f1().equals(f3())
Out[34]: True

In [35]: %timeit f1()
357 ms ± 192 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [36]: %timeit f2()
562 ms ± 68.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [37]: %timeit f3()
215 ms ± 58.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The above results are still for k=100, but for larger k the difference is even more significant.
