How to concatenate Pandas DataFrames without copying data?

Question

Asked by Soldalma

I want to concatenate two pandas DataFrames without copying the data. That is, I want the concatenated DataFrame to be a view on the data in the two original DataFrames. I tried using concat() and that did not work. This block of code shows that changing the underlying data affects the two DataFrames that are concatenated but not the concatenated DataFrame:

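import numpy as np
import pandas as pd
from pprint import pprint as pp  # assumption: pp is pprint; the original does not define it

# assumption: six daily dates, matching the index shown in the output
dates = pd.date_range('2013-01-01', periods=6)

arr = np.random.randn(12).reshape(6, 2)
df = pd.DataFrame(arr, columns=('VALE5', 'PETR4'), index=dates)
arr2 = np.random.randn(12).reshape(6, 2)
# note: df2 is built from arr, not arr2, which is why columns A and B match in the output
df2 = pd.DataFrame(arr, columns=('AMBV3', 'BBDC4'), index=dates)
df_concat = pd.concat(dict(A=df, B=df2), axis=1)
pp(df)
pp(df_concat)
arr[0, 0] = 9999999.99
pp(df)
pp(df_concat)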

This is the output of the last five lines. df changed after a new value was assigned to arr[0, 0]; df_concat was not affected.

In [56]: pp(df)
           VALE5     PETR4
2013-01-01 -0.557180  0.170073
2013-01-02 -0.975797  0.763136
2013-01-03 -0.913254  1.042521
2013-01-04 -1.973013 -2.069460
2013-01-05 -1.259005  1.448442
2013-01-06 -0.323640  0.024857

In [57]: pp(df_concat)
               A                   B          
           VALE5     PETR4     AMBV3     BBDC4
2013-01-01 -0.557180  0.170073 -0.557180  0.170073
2013-01-02 -0.975797  0.763136 -0.975797  0.763136
2013-01-03 -0.913254  1.042521 -0.913254  1.042521
2013-01-04 -1.973013 -2.069460 -1.973013 -2.069460
2013-01-05 -1.259005  1.448442 -1.259005  1.448442
2013-01-06 -0.323640  0.024857 -0.323640  0.024857

In [58]: arr[0, 0] = 9999999.99

In [59]: pp(df)
                 VALE5     PETR4
2013-01-01  9999999.990000  0.170073
2013-01-02       -0.975797  0.763136
2013-01-03       -0.913254  1.042521
2013-01-04       -1.973013 -2.069460
2013-01-05       -1.259005  1.448442
2013-01-06       -0.323640  0.024857

In [60]: pp(df_concat)
               A                   B          
           VALE5     PETR4     AMBV3     BBDC4
2013-01-01 -0.557180  0.170073 -0.557180  0.170073
2013-01-02 -0.975797  0.763136 -0.975797  0.763136
2013-01-03 -0.913254  1.042521 -0.913254  1.042521
2013-01-04 -1.973013 -2.069460 -1.973013 -2.069460
2013-01-05 -1.259005  1.448442 -1.259005  1.448442
2013-01-06 -0.323640  0.024857 -0.323640  0.024857

I guess this means concat() created a copy of the data. Is there a way to avoid a copy being made? (I want to minimize memory usage).

Also, is there a fast way to check if two DataFrames are linked to the same underlying data? (short of going through the trouble of changing the data and checking if each DataFrame has changed)

Thanks for the help.

FS

Answer 1

Answered by Phillip Cloud

You can't (at least not easily). When you call concat, np.concatenate ultimately gets called.

See this answer explaining why you can't concatenate arrays without copying. The short of it is that the arrays are not guaranteed to be contiguous in memory.

Here's a simple example:

import numpy as np

a = np.random.rand(2, 10)
x, y = a  # x and y are row views into a
z = np.vstack((x, y))  # vstack allocates a new array
print('x.base is a and y.base is a ==', x.base is a and y.base is a)
print('x.base is z or y.base is z ==', x.base is z or y.base is z)

Output:

x.base is a and y.base is a == True
x.base is z or y.base is z == False

Even though x and y share the same base, namely a, concatenate (and thus vstack) cannot assume that they do, since one often wants to concatenate arbitrarily strided arrays.
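On the question of a quick check for shared data: one option, not part of the original answer, is np.shares_memory, which tests whether two arrays overlap in memory without having to mutate anything. The sketch below reuses the names a, x, y and z from the example above. Applying it to DataFrames would mean first getting at their underlying arrays, and whether those are views or copies depends on the pandas version and dtypes, so treat that step as an assumption to verify locally.

```python
import numpy as np

a = np.random.rand(2, 10)
x, y = a                  # row views into a
z = np.vstack((x, y))     # freshly allocated result

# Direct overlap checks, no mutation needed
x_shares_a = np.shares_memory(x, a)
z_shares_a = np.shares_memory(z, a)
print('x shares memory with a:', x_shares_a)
print('z shares memory with a:', z_shares_a)

# Cross-check by writing through the base array
a[0, 0] = 42.0
print('x sees the write:', x[0] == 42.0)
print('z sees the write:', z[0, 0] == 42.0)
```

np.may_share_memory is a cheaper bounds-only variant that can report false positives, so np.shares_memory is the safer default for small arrays like these.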

You can easily generate two arrays with different strides that share the same memory, like so:

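import numpy as np

a = np.arange(10, dtype=np.int64)  # 8-byte items, matching the output below
b = a[::2]  # view that skips every other element
print(a.strides)
print(b.strides)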

Output:

(8,)
(16,)

This is why the following happens:

In [214]: a = arange(10)

In [215]: a[::2].view(int16)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-215-0366fadb1128> in <module>()
----> 1 a[::2].view(int16)

ValueError: new type not compatible with array.

In [216]: a[::2].copy().view(int16)
Out[216]: array([0, 0, 0, 0, 2, 0, 0, 0, 4, 0, 0, 0, 6, 0, 0, 0, 8, 0, 0, 0], dtype=int16)
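A small sketch of what is going on in that session, assuming 8-byte integers: the strided slice is not contiguous, so its bytes cannot simply be reinterpreted as a different item size, while the copy is contiguous but no longer shares memory with the original. The dtype argument and the variable names b and c are additions for illustration.

```python
import numpy as np

a = np.arange(10, dtype=np.int64)
b = a[::2]                       # strided view: every other element of a

print(b.flags['C_CONTIGUOUS'])   # the view skips bytes, so it is not contiguous
print(np.shares_memory(a, b))    # but it still points into a's buffer

try:
    b.view(np.int16)             # reinterpreting a non-contiguous axis
    view_failed = False
except ValueError as exc:
    view_failed = True
    print('view failed:', exc)

c = b.copy()                     # contiguous, independent buffer
print(c.flags['C_CONTIGUOUS'])
print(np.shares_memory(a, c))
print(c.view(np.int16).size)     # 5 int64 items reinterpreted as int16
```

The same trade-off applies to concatenation: the result has to be one contiguous block, and the only general way to get that from arbitrary inputs is to copy.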

EDIT: Using pd.merge(df1, df2, copy=False) (or df1.merge(df2, copy=False)) when df1.dtype != df2.dtype will not make a copy. Otherwise, a copy is made.
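If memory is the main concern, one possible workaround (not from the answer above, and only a NumPy-level sketch) is to invert the problem: allocate the combined buffer once and make the two pieces views into it, so there is never a second copy to concatenate. The names combined, left and right are hypothetical. Whether DataFrames built on top of such arrays keep sharing the buffer depends on the pandas version and its copy-on-write settings, so that part is left out here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Allocate the "concatenated" array up front: 6 rows, 2 + 2 columns
combined = np.empty((6, 4))

# The two halves are column-slice views, not copies
left = combined[:, :2]
right = combined[:, 2:]

# Fill the halves in place
left[:] = rng.standard_normal((6, 2))
right[:] = rng.standard_normal((6, 2))

# A write through one half is visible in the combined array
left[0, 0] = 9999999.99
print(combined[0, 0])
print(np.shares_memory(left, combined), np.shares_memory(right, combined))
```

This only helps when the pieces can be created after the final shape is known; for arrays that already exist separately, the copy made by concatenation is unavoidable.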
