Python Pandas DataFrame concat vs append

Notice: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/15819050/

Time: 2020-08-18 21:05:04  Source: igfitidea

Tags: python, pandas

Asked by JPBelanger

I have a list of 4 pandas dataframes containing a day of tick data that I want to merge into a single data frame. I cannot understand the behavior of concat on my timestamps. See details below:

data

[<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 35228 entries, 2013-03-28 00:00:07.089000+02:00 to 2013-03-28 18:59:20.357000+02:00
Data columns:
Price       4040  non-null values
Volume      4040  non-null values
BidQty      35228  non-null values
BidPrice    35228  non-null values
AskPrice    35228  non-null values
AskQty      35228  non-null values
dtypes: float64(6),
<class 'pandas.core.frame.DataFrame'>

DatetimeIndex: 33088 entries, 2013-04-01 00:03:17.047000+02:00 to 2013-04-01 18:59:58.175000+02:00
Data columns:
Price       3969  non-null values
Volume      3969  non-null values
BidQty      33088  non-null values
BidPrice    33088  non-null values
AskPrice    33088  non-null values
AskQty      33088  non-null values
dtypes: float64(6),
<class 'pandas.core.frame.DataFrame'>

DatetimeIndex: 50740 entries, 2013-04-02 00:03:27.470000+02:00 to 2013-04-02 18:59:58.172000+02:00
Data columns:
Price       7326  non-null values
Volume      7326  non-null values
BidQty      50740  non-null values
BidPrice    50740  non-null values
AskPrice    50740  non-null values
AskQty      50740  non-null values
dtypes: float64(6),
<class 'pandas.core.frame.DataFrame'>

DatetimeIndex: 60799 entries, 2013-04-03 00:03:06.994000+02:00 to 2013-04-03 18:59:58.180000+02:00
Data columns:
Price       8258  non-null values
Volume      8258  non-null values
BidQty      60799  non-null values
BidPrice    60799  non-null values
AskPrice    60799  non-null values
AskQty      60799  non-null values
dtypes: float64(6)]

Using append I get:

pd.DataFrame().append(data)

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 179855 entries, 2013-03-28 00:00:07.089000+02:00 to 2013-04-03 18:59:58.180000+02:00
Data columns:
AskPrice    179855  non-null values
AskQty      179855  non-null values
BidPrice    179855  non-null values
BidQty      179855  non-null values
Price       23593  non-null values
Volume      23593  non-null values
dtypes: float64(6)

Using concat I get:

pd.concat(data)

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 179855 entries, 2013-03-27 22:00:07.089000+02:00 to 2013-04-03 16:59:58.180000+02:00
Data columns:
Price       23593  non-null values
Volume      23593  non-null values
BidQty      179855  non-null values
BidPrice    179855  non-null values
AskPrice    179855  non-null values
AskQty      179855  non-null values
dtypes: float64(6)

Notice how the index changes when using concat: the timestamps are shifted back by two hours. Why is that happening, and how would I use concat to reproduce the results obtained using append? (concat seems much faster: 24.6 ms per loop vs 3.02 s per loop.)
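A quick way to check whether concat alters timezone-aware timestamps on a given pandas installation is to compare the first index entry before and after combining. This is a minimal sketch, not the asker's data: the fixed +02:00 offset, the two-row frames, and the Price values are assumptions made for illustration. Recent pandas versions are expected to preserve both the instant and the offset.

```python
from datetime import timedelta, timezone

import pandas as pd

# Assumed fixed +02:00 offset; the question only shows the offset, not the zone name.
tz = timezone(timedelta(hours=2))

idx1 = pd.to_datetime(["2013-03-28 00:00:07", "2013-03-28 18:59:20"]).tz_localize(tz)
idx2 = pd.to_datetime(["2013-04-01 00:03:17", "2013-04-01 18:59:58"]).tz_localize(tz)

# Hypothetical stand-ins for two of the daily tick frames.
day1 = pd.DataFrame({"Price": [1.0, 2.0]}, index=idx1)
day2 = pd.DataFrame({"Price": [3.0, 4.0]}, index=idx2)

combined = pd.concat([day1, day2])

# The first timestamp should be the same instant and carry the same UTC offset.
same_instant = combined.index[0] == day1.index[0]
same_offset = combined.index[0].utcoffset() == day1.index[0].utcoffset()
print(combined.index[0], same_instant, same_offset)
```

If either check fails, the index was converted during concatenation, which matches the symptom described in the question.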

Accepted answer by Jeff

What you are doing with append and concat is almost equivalent. The difference is the empty DataFrame. For some reason this causes a big slowdown; I'm not sure exactly why and will have to look into it at some point. Below is a recreation of basically what you did.

I almost always use concat (though in this case they are equivalent, except for the empty frame); if you don't use the empty frame they will be the same speed.

In [17]: df1 = pd.DataFrame(dict(A = range(10000)),index=pd.date_range('20130101',periods=10000,freq='s'))

In [18]: df1
Out[18]: 
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 10000 entries, 2013-01-01 00:00:00 to 2013-01-01 02:46:39
Freq: S
Data columns (total 1 columns):
A    10000  non-null values
dtypes: int64(1)

In [19]: df4 = pd.DataFrame()

The concat

In [20]: %timeit pd.concat([df1,df2,df3])
1000 loops, best of 3: 270 us per loop

This is the equivalent of your append

In [21]: %timeit pd.concat([df4,df1,df2,df3])
10 loops, best of 3: 56.8 ms per loop
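The session above does not show how df2 and df3 were created. The following standalone sketch assumes they are plain copies of df1 and uses the standard timeit module instead of the IPython %timeit magic. Timings depend on the machine and pandas version, so none are quoted here, and newer pandas releases may emit a warning when an empty frame takes part in a concat.

```python
import timeit

import pandas as pd

index = pd.date_range("20130101", periods=10000, freq="s")
df1 = pd.DataFrame({"A": range(10000)}, index=index)
df2 = df1.copy()  # assumption: the answer does not show how df2 was built
df3 = df1.copy()  # assumption: same as above
df4 = pd.DataFrame()  # the empty frame that pd.DataFrame().append(data) starts from

without_empty = pd.concat([df1, df2, df3])
with_empty = pd.concat([df4, df1, df2, df3])

# Time each variant over a small number of runs.
t_plain = timeit.timeit(lambda: pd.concat([df1, df2, df3]), number=20)
t_empty = timeit.timeit(lambda: pd.concat([df4, df1, df2, df3]), number=20)
print(f"without empty frame: {t_plain:.4f} s for 20 runs")
print(f"with empty frame:    {t_empty:.4f} s for 20 runs")
```

Both results contain the same 30000 rows; only the presence of the empty frame differs between the two calls.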

Answer by Michael Dorner

I have implemented a tiny benchmark (please find the code on Gist) to evaluate pandas' concat and append. I updated the code snippet and the results after the comment by ssk08. Thanks a lot!

The benchmark ran on a Mac OS X 10.13 system with Python 3.6.2 and pandas 0.20.3.

+--------+---------------------------------+---------------------------------+
|        | ignore_index=False              | ignore_index=True               |
+--------+---------------------------------+---------------------------------+
| size   | append | concat | append/concat | append | concat | append/concat |
+--------+--------+--------+---------------+--------+--------+---------------+
| small  | 0.4635 | 0.4891 | 94.77 %       | 0.4056 | 0.3314 | 122.39 %      |
+--------+--------+--------+---------------+--------+--------+---------------+
| medium | 0.5532 | 0.6617 | 83.60 %       | 0.3605 | 0.3521 | 102.37 %      |
+--------+--------+--------+---------------+--------+--------+---------------+
| large  | 0.9558 | 0.9442 | 101.22 %      | 0.6670 | 0.6749 | 98.84 %       |
+--------+--------+--------+---------------+--------+--------+---------------+

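With ignore_index=False, append is slightly faster for the small and medium sizes; with ignore_index=True, concat is slightly faster for the small and medium sizes. For the large size the two are within about 1% of each other in both cases.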
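For readers unfamiliar with the ignore_index parameter compared in the table, here is a small sketch of what it changes in the result. The frames a and b and their index labels are made up for illustration.

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2]}, index=[10, 11])
b = pd.DataFrame({"x": [3, 4]}, index=[10, 11])

# Default (ignore_index=False): the original index labels are kept, duplicates included.
kept = pd.concat([a, b])

# ignore_index=True: the original labels are discarded and the rows are renumbered from 0.
renumbered = pd.concat([a, b], ignore_index=True)

print(list(kept.index))
print(list(renumbered.index))
```

The row data is identical in both results; only the index differs.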

tl;dr: No significant difference between concat and append.

Answer by Mohsin Mahmood

Pandas concat vs append vs join vs merge

  • Concat gives the flexibility to join along an axis (all rows or all columns).

  • Append is the specific case (axis=0, join='outer') of concat.

  • Join is based on the indexes (set by set_index); its how parameter takes one of ['left', 'right', 'inner', 'outer'].

  • Merge is based on particular columns of each of the two dataframes; these columns are passed through parameters such as 'left_on', 'right_on', and 'on'.
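The distinctions in the list above can be sketched with two toy frames. The names left and right and their columns key, l, and r are invented for this example. DataFrame.append is left out because it was removed in pandas 2.0; the first concat call shows the equivalent axis=0, join='outer' case.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "l": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "r": [20, 30, 40]})

# concat along rows (what append did): rows are stacked, columns are the union.
rows = pd.concat([left, right], axis=0, join="outer", ignore_index=True)

# concat along columns: frames are placed side by side, aligned on the index.
cols = pd.concat([left, right], axis=1)

# join: aligns on the index, so the key column is moved into the index first.
joined = left.set_index("key").join(right.set_index("key"), how="inner")

# merge: aligns on a named column.
merged = left.merge(right, on="key", how="outer")

print(rows.shape, cols.shape, joined.shape, merged.shape)
```

Note that the row-wise concat fills the columns missing from either frame with NaN, while join and merge match rows by key.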

Answer by nhanhoangle

One more thing to keep in mind: the append() method in pandas doesn't modify the original object. Instead, it creates a new one with the combined data. Because every call creates a new object and copies the data, its performance is poor when it is called repeatedly. It is better to use the concat() function when you need many append operations.
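The advice above can be sketched as follows: collect the pieces in a list and combine them with a single concat call, rather than growing a frame step by step. The chunks list is made up for illustration. Since DataFrame.append was removed in pandas 2.0, the step-by-step variant is written with pairwise concat calls, which copy the accumulated data in the same way.

```python
import pandas as pd

# Hypothetical pieces to combine: four frames of three rows each.
chunks = [pd.DataFrame({"A": range(i * 3, i * 3 + 3)}) for i in range(4)]

# Step-by-step growth: every iteration builds a new frame and copies all rows so far.
grown = chunks[0]
for chunk in chunks[1:]:
    grown = pd.concat([grown, chunk], ignore_index=True)

# Single call over the whole list: the pieces are combined in one pass.
combined = pd.concat(chunks, ignore_index=True)

print(len(grown), len(combined), len(chunks[0]))
```

Both approaches give the same result and leave the input frames unchanged; the single call avoids recopying the accumulated rows at each step.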
