Python Pandas DataFrame concat vs append
Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/15819050/
Pandas DataFrame concat vs append
Asked by JPBelanger
I have a list of 4 pandas dataframes containing a day of tick data that I want to merge into a single data frame. I cannot understand the behavior of concat on my timestamps. See details below:
data
[<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 35228 entries, 2013-03-28 00:00:07.089000+02:00 to 2013-03-28 18:59:20.357000+02:00
Data columns:
Price 4040 non-null values
Volume 4040 non-null values
BidQty 35228 non-null values
BidPrice 35228 non-null values
AskPrice 35228 non-null values
AskQty 35228 non-null values
dtypes: float64(6),
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 33088 entries, 2013-04-01 00:03:17.047000+02:00 to 2013-04-01 18:59:58.175000+02:00
Data columns:
Price 3969 non-null values
Volume 3969 non-null values
BidQty 33088 non-null values
BidPrice 33088 non-null values
AskPrice 33088 non-null values
AskQty 33088 non-null values
dtypes: float64(6),
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 50740 entries, 2013-04-02 00:03:27.470000+02:00 to 2013-04-02 18:59:58.172000+02:00
Data columns:
Price 7326 non-null values
Volume 7326 non-null values
BidQty 50740 non-null values
BidPrice 50740 non-null values
AskPrice 50740 non-null values
AskQty 50740 non-null values
dtypes: float64(6),
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 60799 entries, 2013-04-03 00:03:06.994000+02:00 to 2013-04-03 18:59:58.180000+02:00
Data columns:
Price 8258 non-null values
Volume 8258 non-null values
BidQty 60799 non-null values
BidPrice 60799 non-null values
AskPrice 60799 non-null values
AskQty 60799 non-null values
dtypes: float64(6)]
Using append I get:
pd.DataFrame().append(data)
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 179855 entries, 2013-03-28 00:00:07.089000+02:00 to 2013-04-03 18:59:58.180000+02:00
Data columns:
AskPrice 179855 non-null values
AskQty 179855 non-null values
BidPrice 179855 non-null values
BidQty 179855 non-null values
Price 23593 non-null values
Volume 23593 non-null values
dtypes: float64(6)
Using concat I get:
pd.concat(data)
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 179855 entries, 2013-03-27 22:00:07.089000+02:00 to 2013-04-03 16:59:58.180000+02:00
Data columns:
Price 23593 non-null values
Volume 23593 non-null values
BidQty 179855 non-null values
BidPrice 179855 non-null values
AskPrice 179855 non-null values
AskQty 179855 non-null values
dtypes: float64(6)
Notice how the index changes when using concat. Why is that happening, and how would I go about using concat to reproduce the results obtained using append? (I ask because concat seems so much faster: 24.6 ms per loop vs 3.02 s per loop.)
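To check whether an installed pandas version shifts timestamps like this, one could run a small test along the following lines. This is only a sketch: the frame names, the values, and the Europe/Helsinki time zone are assumptions chosen for illustration, not the asker's data.

```python
import pandas as pd

# Two small tz-aware frames standing in for two days of tick data.
# The time zone and the values are illustrative assumptions.
tz = "Europe/Helsinki"
day1 = pd.DataFrame(
    {"Price": [1.0, 2.0, 3.0]},
    index=pd.date_range("2013-03-28 00:00", periods=3, freq="s", tz=tz),
)
day2 = pd.DataFrame(
    {"Price": [4.0, 5.0, 6.0]},
    index=pd.date_range("2013-04-01 00:00", periods=3, freq="s", tz=tz),
)

combined = pd.concat([day1, day2])

# If concat preserves the index, the first and last timestamps are unchanged.
print(combined.index[0], day1.index[0])
print(combined.index[-1], day2.index[-1])
print(combined.index.tz)
```

If the printed pairs differ, the installed version reproduces the shift described in the question.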
Accepted answer by Jeff
What you are doing with append and concat is almost equivalent. The difference is the empty DataFrame. For some reason it causes a big slowdown; I am not sure exactly why and will have to look into it at some point. Below is a recreation of basically what you did.
I almost always use concat (though in this case the two are equivalent, except for the empty frame); if you don't use the empty frame, they run at the same speed.
In [17]: df1 = pd.DataFrame(dict(A = range(10000)),index=pd.date_range('20130101',periods=10000,freq='s'))
In [18]: df1
Out[18]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 10000 entries, 2013-01-01 00:00:00 to 2013-01-01 02:46:39
Freq: S
Data columns (total 1 columns):
A 10000 non-null values
dtypes: int64(1)
In [19]: df4 = pd.DataFrame()
The concat
In [20]: %timeit pd.concat([df1,df2,df3])
1000 loops, best of 3: 270 us per loop
This is the equivalent of your append
In [21]: %timeit pd.concat([df4,df1,df2,df3])
10 loops, best of 3: 56.8 ms per loop
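The session above can be re-run as a plain script roughly as follows. This is a sketch under assumptions: df2 and df3 are not defined in the session shown, so they are built here the same way as df1 with different start dates, and make_frame is a hypothetical helper. Timings will differ by machine and pandas version.

```python
import timeit
import warnings

import pandas as pd


def make_frame(start):
    # Same shape as df1 in the session: 10000 rows at one-second frequency.
    index = pd.date_range(start, periods=10000, freq="s")
    return pd.DataFrame({"A": range(10000)}, index=index)


df1 = make_frame("20130101")
df2 = make_frame("20130102")  # assumed; not defined in the session shown
df3 = make_frame("20130103")  # assumed; not defined in the session shown
df4 = pd.DataFrame()          # the empty frame from the session

with warnings.catch_warnings():
    # Some pandas versions warn about empty frames in concat; silence that here.
    warnings.simplefilter("ignore")
    without_empty = pd.concat([df1, df2, df3])
    with_empty = pd.concat([df4, df1, df2, df3])
    t_without = timeit.timeit(lambda: pd.concat([df1, df2, df3]), number=20)
    t_with = timeit.timeit(lambda: pd.concat([df4, df1, df2, df3]), number=20)

print(f"without empty frame: {t_without / 20:.6f} s per call")
print(f"with empty frame:    {t_with / 20:.6f} s per call")
```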
Answer by Michael Dorner
I have implemented a tiny benchmark (please find the code on Gist) to evaluate pandas' concat and append. I updated the code snippet and the results after the comment by ssk08. Thanks a lot!
The benchmark ran on a Mac OS X 10.13 system with Python 3.6.2 and pandas 0.20.3.
+--------+---------------------------------+---------------------------------+
|        | ignore_index=False              | ignore_index=True               |
+--------+--------+--------+---------------+--------+--------+---------------+
| size   | append | concat | append/concat | append | concat | append/concat |
+--------+--------+--------+---------------+--------+--------+---------------+
| small  | 0.4635 | 0.4891 | 94.77 %       | 0.4056 | 0.3314 | 122.39 %      |
+--------+--------+--------+---------------+--------+--------+---------------+
| medium | 0.5532 | 0.6617 | 83.60 %       | 0.3605 | 0.3521 | 102.37 %      |
+--------+--------+--------+---------------+--------+--------+---------------+
| large  | 0.9558 | 0.9442 | 101.22 %      | 0.6670 | 0.6749 | 98.84 %       |
+--------+--------+--------+---------------+--------+--------+---------------+
With ignore_index=False, append is slightly faster at the small and medium sizes; with ignore_index=True, concat is slightly faster there. At the large size the two are within roughly 1% of each other in both settings.
tl;dr: No significant difference between concat and append.
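The two benchmark settings differ only in the ignore_index flag. A minimal sketch of what that flag changes in pd.concat (the frames a and b are made up for illustration and are not part of the benchmark):

```python
import pandas as pd

a = pd.DataFrame({"A": [1, 2]})
b = pd.DataFrame({"A": [3, 4]})

# ignore_index=False keeps the original index labels of each input.
kept = pd.concat([a, b], ignore_index=False)

# ignore_index=True discards them and numbers the rows from zero.
renumbered = pd.concat([a, b], ignore_index=True)

print(list(kept.index))        # [0, 1, 0, 1]
print(list(renumbered.index))  # [0, 1, 2, 3]
```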
Answer by Mohsin Mahmood
Pandas concat vs append vs join vs merge
Concat gives the flexibility to join along either axis (all rows or all columns).
Append is the specific case (axis=0, join='outer') of concat.
Join is based on the indexes (set by set_index); its how parameter is one of ['left','right','inner','outer'].
Merge is based on particular columns of each of the two dataframes; these columns are specified through parameters such as 'left_on', 'right_on', and 'on'.
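The distinctions in this list can be sketched with two toy frames. The frame names, column names, and values below are illustrative assumptions, not taken from the answer.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

# concat along rows (axis=0): stacks the frames, missing columns become NaN.
stacked = pd.concat([left, right], axis=0, join="outer", ignore_index=True)

# concat along columns (axis=1): places the frames side by side, aligned on the index.
side_by_side = pd.concat([left, right], axis=1)

# join: aligns on the index, so the key column is moved into the index first.
joined = left.set_index("key").join(right.set_index("key"), how="inner")

# merge: aligns on a named column; left_on/right_on would allow differing names.
merged = left.merge(right, on="key", how="outer")

print(stacked.shape, side_by_side.shape)
print(sorted(joined.index))
print(sorted(merged["key"]))
```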
Answer by nhanhoangle
One more thing to keep in mind: the append() method in pandas doesn't modify the original object. Instead, it creates a new one with the combined data. Because each call creates a new object and copies the data, its performance is poor. You'd better use the concat() function when doing multiple append operations.
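A rough sketch of this advice follows. DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the step-by-step variant here uses pd.concat inside the loop as a stand-in for repeated append calls; the pieces list is a made-up example.

```python
import pandas as pd

# Hypothetical pieces to combine; each is a one-row frame.
pieces = [pd.DataFrame({"A": [i]}) for i in range(100)]

# Growing step by step: every iteration builds a new frame and copies all rows so far.
grown = pieces[0]
for piece in pieces[1:]:
    grown = pd.concat([grown, piece], ignore_index=True)

# Collecting the pieces first and concatenating once.
combined = pd.concat(pieces, ignore_index=True)

# The inputs are left unmodified either way.
print(len(pieces[0]), len(grown), len(combined))
```

Both variants produce the same frame; the single concat avoids the repeated copying.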

