How do I concatenate pandas DataFrames without copying the data?
Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow
Original question: http://stackoverflow.com/questions/18295630/
Asked by Soldalma
I want to concatenate two pandas DataFrames without copying the data. That is, I want the concatenated DataFrame to be a view on the data in the two original DataFrames. I tried using concat() and that did not work. This block of code shows that changing the underlying data affects the two DataFrames that are concatenated but not the concatenated DataFrame:
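import numpy as np
import pandas as pd
from pprint import pprint as pp  # assumption: pp in the original is a pretty-printer

dates = pd.date_range('2013-01-01', periods=6)  # matches the index shown in the output below
arr = np.random.randn(12).reshape(6, 2)
df = pd.DataFrame(arr, columns=('VALE5', 'PETR4'), index=dates)
arr2 = np.random.randn(12).reshape(6, 2)
# df2 is built from arr (not arr2), which is why A and B show identical values below
df2 = pd.DataFrame(arr, columns=('AMBV3', 'BBDC4'), index=dates)
df_concat = pd.concat(dict(A=df, B=df2), axis=1)
pp(df)
pp(df_concat)
arr[0, 0] = 9999999.99
pp(df)
pp(df_concat)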
This is the output of the last five lines. df changed after a new value was assigned to arr[0, 0]; df_concat was not affected.
In [56]: pp(df)
           VALE5     PETR4
2013-01-01 -0.557180  0.170073
2013-01-02 -0.975797  0.763136
2013-01-03 -0.913254  1.042521
2013-01-04 -1.973013 -2.069460
2013-01-05 -1.259005  1.448442
2013-01-06 -0.323640  0.024857
In [57]: pp(df_concat)
               A                   B          
           VALE5     PETR4     AMBV3     BBDC4
2013-01-01 -0.557180  0.170073 -0.557180  0.170073
2013-01-02 -0.975797  0.763136 -0.975797  0.763136
2013-01-03 -0.913254  1.042521 -0.913254  1.042521
2013-01-04 -1.973013 -2.069460 -1.973013 -2.069460
2013-01-05 -1.259005  1.448442 -1.259005  1.448442
2013-01-06 -0.323640  0.024857 -0.323640  0.024857
In [58]: arr[0, 0] = 9999999.99
In [59]: pp(df)
                 VALE5     PETR4
2013-01-01  9999999.990000  0.170073
2013-01-02       -0.975797  0.763136
2013-01-03       -0.913254  1.042521
2013-01-04       -1.973013 -2.069460
2013-01-05       -1.259005  1.448442
2013-01-06       -0.323640  0.024857
In [60]: pp(df_concat)
               A                   B          
           VALE5     PETR4     AMBV3     BBDC4
2013-01-01 -0.557180  0.170073 -0.557180  0.170073
2013-01-02 -0.975797  0.763136 -0.975797  0.763136
2013-01-03 -0.913254  1.042521 -0.913254  1.042521
2013-01-04 -1.973013 -2.069460 -1.973013 -2.069460
2013-01-05 -1.259005  1.448442 -1.259005  1.448442
2013-01-06 -0.323640  0.024857 -0.323640  0.024857
I guess this means concat() created a copy of the data. Is there a way to avoid a copy being made? (I want to minimize memory usage).
Also, is there a fast way to check if two DataFrames are linked to the same underlying data? (short of going through the trouble of changing the data and checking if each DataFrame has changed)
Thanks for the help.
FS
Answered by Phillip Cloud
You can't (at least not easily). When you call concat, np.concatenate ultimately gets called.
See this answer explaining why you can't concatenate arrays without copying. The short of it is that the arrays are not guaranteed to be contiguous in memory.
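As a rough illustration of the contiguity point (a sketch, not taken from the linked answer): two arrays that were allocated separately live in their own buffers, so np.concatenate has to allocate one new contiguous buffer and copy both inputs into it. The names x, y and z are only illustrative.

```python
import numpy as np

# Two independently allocated arrays; nothing guarantees that their
# buffers sit next to each other in memory.
x = np.zeros(4)
y = np.ones(4)

# concatenate allocates a fresh buffer large enough for both inputs
# and copies the data into it.
z = np.concatenate((x, y))

print(z.flags['C_CONTIGUOUS'])       # the result is one contiguous block
print(z.nbytes == x.nbytes + y.nbytes)
print(np.shares_memory(z, x), np.shares_memory(z, y))
```

Because z owns its own buffer, later writes to x or y are not reflected in z.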
Here's a simple example:
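import numpy as np

a = np.random.rand(2, 10)
x, y = a  # x and y are views onto the two rows of a
z = np.vstack((x, y))  # vstack copies the rows into a new array
print('x.base is a and y.base is a ==', x.base is a and y.base is a)
print('x.base is z or y.base is z ==', x.base is z or y.base is z)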
Output:
x.base is a and y.base is a == True
x.base is z or y.base is z == False
Even though x and y share the same base, namely a, concatenate (and thus vstack) cannot assume that they do, since one often wants to concatenate arbitrarily strided arrays.
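This also suggests a quick way to check whether two arrays are backed by the same memory without modifying them: np.shares_memory. The following is a minimal sketch that reuses the names from the example above. It is a NumPy-level check only; for a DataFrame you would first need to get at the underlying array, and whether that gives you a view or a copy depends on the pandas version and the dtypes involved.

```python
import numpy as np

a = np.random.rand(2, 10)   # values lie in [0, 1)
x, y = a                    # views onto the rows of a
z = np.vstack((x, y))       # new array holding copies of the rows

# Check for shared memory without touching the data.
x_shares = np.shares_memory(x, a)
z_shares = np.shares_memory(z, a)
print(x_shares, z_shares)

# Cross-check the slow way: a write through a shows up in x but not in z.
a[0, 0] = 42.0
print(x[0] == 42.0, z[0, 0] == 42.0)
```

The first check reports that x shares memory with a while z does not, and the write to a[0, 0] confirms it.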
You can easily generate two arrays with different strides that share the same memory, like so:
import numpy as np

a = np.arange(10)
b = a[::2]  # every second element of a, backed by the same buffer
print(a.strides)
print(b.strides)
Output:
(8,)
(16,)
This is why the following happens:
In [214]: a = arange(10)
In [215]: a[::2].view(int16)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-215-0366fadb1128> in <module>()
----> 1 a[::2].view(int16)
ValueError: new type not compatible with array.
In [216]: a[::2].copy().view(int16)
Out[216]: array([0, 0, 0, 0, 2, 0, 0, 0, 4, 0, 0, 0, 6, 0, 0, 0, 8, 0, 0, 0], dtype=int16)
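The session above can be reproduced in a script. This is a sketch that assumes an int64 array (the dtype is set explicitly so the sizes are the same on every platform). It checks the contiguity flag that decides whether the reinterpretation is allowed.

```python
import numpy as np

a = np.arange(10, dtype=np.int64)
b = a[::2]                      # strided view: elements are 16 bytes apart

print(b.flags['C_CONTIGUOUS'])  # the strided view is not contiguous

# Reinterpreting a non-contiguous 1-D array with a smaller item size fails.
try:
    b.view(np.int16)
    view_failed = False
except ValueError:
    view_failed = True
print(view_failed)

# A copy packs the five elements into a contiguous buffer, so the
# reinterpretation works: 5 int64 values become 20 int16 values.
c = b.copy()
as_int16 = c.view(np.int16)
print(c.flags['C_CONTIGUOUS'], as_int16.shape)
```

The copy is exactly the step a view-based concatenation would need to avoid, which is why it cannot be done in general.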
EDIT: Using pd.merge(df1, df2, copy=False) (or df1.merge(df2, copy=False)) when df1.dtype != df2.dtype will not make a copy. Otherwise, a copy is made.

