Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/21295329/
Fastest way to copy columns from one DataFrame to another using pandas?
Asked by user1204369
I have a large DataFrame (1 million+ records) that I use to store the core of my data (like a database), and a smaller DataFrame (1 to 2000 records) from which I combine a few of the columns into it at each time step of my program, which can run for several thousand time steps. Both DataFrames are indexed the same way, by an id column.
The code I'm using is:
df_large.loc[new_ids, core_cols] = df_small.loc[new_ids, core_cols]
Where core_cols is a list of about 10 fields that I'm copying over and new_ids are the ids from the small DataFrame. This code works fine, but it is the slowest part of my code by a factor of three. I just wanted to know if there is a faster way to merge the data of the two DataFrames together.
I tried merging the data each time with the merge function, but the process took way too long; that is why I have switched to maintaining a larger DataFrame that I update, to improve the speed.
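For reference, a minimal, self-contained sketch of the update pattern described above (the sizes, column names, and ids here are hypothetical stand-ins, not from the original post):

import numpy as np
import pandas as pd

core_cols = [f'col{i}' for i in range(10)]          # ~10 fields being copied
df_large = pd.DataFrame(np.zeros((1_000_000, 10)),
                        columns=core_cols)          # the 'database' frame
df_small = pd.DataFrame(np.ones((2_000, 10)),
                        columns=core_cols)          # one time step's update
new_ids = df_small.index                            # ids shared by both frames

# the line the question asks about: aligned assignment via .loc
df_large.loc[new_ids, core_cols] = df_small.loc[new_ids, core_cols]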
Accepted answer by Jeff
There is nothing inherently slow about using .loc to set with an alignable frame, though it does go through a bit of code to cover a lot of cases, so it's probably not ideal to have in a tight loop. FYI, this example is slightly different from the 2nd example.
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: from pandas import DataFrame
In [4]: df = DataFrame(1.,index=list('abcdefghij'),columns=[0,1,2])
In [5]: df
Out[5]:
   0  1  2
a  1  1  1
b  1  1  1
c  1  1  1
d  1  1  1
e  1  1  1
f  1  1  1
g  1  1  1
h  1  1  1
i  1  1  1
j  1  1  1

[10 rows x 3 columns]
In [6]: df2 = DataFrame(0,index=list('afg'),columns=[1,2])
In [7]: df2
Out[7]:
   1  2
a  0  0
f  0  0
g  0  0

[3 rows x 2 columns]
In [8]: df.loc[df2.index,df2.columns] = df2
In [9]: df
Out[9]:
   0  1  2
a  1  0  0
b  1  1  1
c  1  1  1
d  1  1  1
e  1  1  1
f  1  0  0
g  1  0  0
h  1  1  1
i  1  1  1
j  1  1  1

[10 rows x 3 columns]
Here's an alternative. It may or may not fit your data pattern. If the updates (your small frame) are pretty much independent this will work (IOW, you are not updating the big frame, then picking out a new sub-frame, then updating, etc. - if that is your pattern, then using .loc is about right).
Instead of updating the big frame, update the small frame with the columns from the big frame, e.g.:
In [10]: df = DataFrame(1.,index=list('abcdefghij'),columns=[0,1,2])
In [11]: df2 = DataFrame(0,index=list('afg'),columns=[1,2])
In [12]: needed_columns = df.columns-df2.columns
In [13]: df2[needed_columns] = df.reindex(index=df2.index,columns=needed_columns)
In [14]: df2
Out[14]:
   1  2  0
a  0  0  1
f  0  0  1
g  0  0  1

[3 rows x 3 columns]
In [15]: df3 = DataFrame(0,index=list('cji'),columns=[1,2])
In [16]: needed_columns = df.columns-df3.columns
In [17]: df3[needed_columns] = df.reindex(index=df3.index,columns=needed_columns)
In [18]: df3
Out[18]:
   1  2  0
c  0  0  1
j  0  0  1
i  0  0  1

[3 rows x 3 columns]
And concat everything together when you want (they are kept in a list in the meantime; or, see my comments below, these sub-frames could be moved to external storage when created, then read back before this concatenating step).
In [19]: pd.concat([ df.reindex(index=df.index-df2.index-df3.index), df2, df3]).reindex_like(df)
Out[19]:
   0  1  2
a  1  0  0
b  1  1  1
c  1  0  0
d  1  1  1
e  1  1  1
f  1  0  0
g  1  0  0
h  1  1  1
i  1  0  0
j  1  0  0

[10 rows x 3 columns]
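A side note on pandas versions (mine, not part of the original answer): the `-` set operation on Index objects used in In [12], In [16], and In [19] was removed in later pandas releases; Index.difference is the modern equivalent. A sketch of the same steps, assuming the df/df2/df3 defined above:

# modern replacement for df.columns - df2.columns
needed_columns = df.columns.difference(df2.columns)
df2[needed_columns] = df.reindex(index=df2.index, columns=needed_columns)

# modern replacement for df.index - df2.index - df3.index
remaining = df.index.difference(df2.index).difference(df3.index)
result = pd.concat([df.reindex(index=remaining), df2, df3]).reindex_like(df)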
The beauty of this pattern is that it is easily extended to using an actual db (or, much better, an HDFStore) to actually store the 'database', then creating/updating sub-frames as needed, then writing out to a new store when finished.
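To make that concrete, here is a hedged sketch of the store-then-concat idea using an HDFStore (the file name, keys, and sub_frames list are hypothetical, not from the original answer; requires the tables package):

import pandas as pd

# sub_frames: a list of updated sub-frames built as in In [13] / In [17] above
sub_frames = [df2, df3]

with pd.HDFStore('subframes.h5') as store:
    for i, sub in enumerate(sub_frames):
        store.put(f'sub_{i}', sub)        # persist each sub-frame as it is created

with pd.HDFStore('subframes.h5') as store:
    parts = [store[key] for key in store.keys()]
result = pd.concat(parts)                 # the final, single concatenation step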
I use this pattern all of the time, though with Panels actually.
- perform a computation on a subset of the data and write each to a separate file
- then at the end read them all in and concat (in memory), and write out a gigantic new file. The concat step could be done all at once in memory, or, if it is truly a large task, it can be done iteratively.
I am able to use multi-processes to perform my computations AND write each individual Panel to a separate file, as they are all completely independent. The only dependent part is the concat.
This is essentially a map-reduce pattern.
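A hedged sketch of that map-reduce flow, using multiprocessing and pickle files (the file names and the per-chunk computation are hypothetical; plain DataFrames are used here since Panels were later removed from pandas):

import pandas as pd
from multiprocessing import Pool

def compute_chunk(args):
    # Map step: compute on one independent chunk and write it to its own file.
    i, chunk = args
    out = chunk * 2                       # stand-in for the real computation
    path = f'chunk_{i}.pkl'
    out.to_pickle(path)
    return path

if __name__ == '__main__':
    df = pd.DataFrame({'x': range(100)})
    chunks = [df.iloc[i:i + 25] for i in range(0, len(df), 25)]
    with Pool(2) as pool:
        paths = pool.map(compute_chunk, list(enumerate(chunks)))
    # Reduce step - the only dependent part: read everything back and concat.
    result = pd.concat(pd.read_pickle(p) for p in paths)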
Answered by fantabolous
I've had to copy between large dataframes a fair bit. I'm using dataframes with realtime market data, which may not be what pandas is designed for, but this is my experience.
On my pc, copying a single datapoint with .at takes 15μs, with the df size making a negligible difference. .loc takes a minimum of 550μs and increases as the df gets larger: 3100μs to copy a single point from one 100000x2 df to another. .ix seems to be just barely faster than .loc.
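A rough way to reproduce this comparison on your own machine (the sizes are arbitrary and the numbers will differ; this is a sketch, not the answerer's benchmark code):

import timeit

import numpy as np
import pandas as pd

big = pd.DataFrame(np.zeros((100_000, 2)), columns=['a', 'b'])

def set_with_at():
    big.at[50_000, 'a'] = 1.0             # single-datapoint write via .at

def set_with_loc():
    big.loc[50_000, 'a'] = 1.0            # single-datapoint write via .loc

t_at = timeit.timeit(set_with_at, number=1_000) / 1_000
t_loc = timeit.timeit(set_with_loc, number=1_000) / 1_000
print(f'.at: {t_at * 1e6:.1f}μs per write, .loc: {t_loc * 1e6:.1f}μs per write')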
For a single datapoint .at is very fast and is not impacted by the size of the dataframe, but it cannot handle ranges, so loops are required, and as such the time scaling is linear. .loc and .ix, on the other hand, are (relatively) very slow for single datapoints, but they can handle ranges and scale up better than linearly. However, unlike .at, they slow down significantly with dataframe size.
Therefore when I'm frequently copying small ranges between large dataframes, I tend to use .at with a for loop, and otherwise I use .ix with a range.
for new_id in new_ids:
    for core_col in core_cols:
        df_large.at[new_id, core_col] = df_small.at[new_id, core_col]
Of course, to do it properly I'd go with Jeff's solution above, but it's nice to have options.
Caveats of .at: it doesn't work with ranges, and it doesn't work if the dtype is datetime (and maybe others).
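A closing version note (mine, not the answerer's): .ix was deprecated in pandas 0.20 and removed in 1.0, so on current versions the ranged copy falls back to .loc, exactly as in the question. If the row and column order of the two sides already match (they do here, since both sides use the same indexers), one commonly suggested trick - verify it against your own data - is to assign the raw values and skip the second alignment step:

# assumes df_large/df_small/new_ids/core_cols as in the question, and that
# .to_numpy() is available (pandas >= 0.24)
df_large.loc[new_ids, core_cols] = df_small.loc[new_ids, core_cols].to_numpy()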

