在 Pandas 中同步两个大数据帧的最有效方法是什么？

Question

提问by Mannaggia

I would like to synchronize two very long data frames, performance is key in this use case. The two data frames are indexed in chronological order (this should be exploited to be as fast as possible) using datetimes or Timestamps.

我想同步两个很长的数据帧，在这个用例中性能是关键。这两个数据帧使用日期时间或时间戳按时间顺序编制索引（应该尽可能快地利用它）。

One way to synch is provided in this example:

本示例提供了一种同步方式：

import pandas as pd
df1=pd.DataFrame({'A':[1,2,3,4,5,6], 'B':[1,5,3,4,5,7]}, index=pd.date_range('20140101 101501', freq='u', periods=6))
df2=pd.DataFrame({'D':[10,2,30,4,5,10], 'F':[1,5,3,4,5,70]}, index=pd.date_range('20140101 101501.000003', freq='u', periods=6))

# synch data frames
df3=df1.merge(df2, how='outer', right_index=True, left_index=True).fillna(method='ffill')

My question is if this is the most efficient way to do it? I am ready to explore other solutions (e.g. using numpy or cython) if there are faster ways to solve this task.

我的问题是这是否是最有效的方法？如果有更快的方法来解决这个任务，我准备探索其他解决方案（例如使用 numpy 或 cython）。

Thanks

谢谢

Note: time-stamps are not in general equally spaced (as in the example above), the method should also work in this case

注意：时间戳通常不是等距的（如上例所示），该方法也适用于这种情况

Comment after reading the answers

阅读答案后发表评论

I think there are many use cases in which neither align nor merge or join help. The point is to not use DB related semantics for aligning (which for timeseries are not so relevant in my opinion). For me aligning means map series A into B and have a way to deal with missing values (typically sample and hold method), align and join cause a not wanted effects like several timestamps repeated as a result of joining. I still do not have a perfect solution, but it seems np.searchsorted can help (it is much faster than using several calls to join / align to do what I need). I could not find a pandas way to do this up to now.

我认为有很多用例既不对齐也不合并或加入帮助。关键是不要使用与 DB 相关的语义进行对齐（在我看来，这对于时间序列并不是那么相关）。对我来说，对齐意味着将系列 A 映射到 B 并有一种方法来处理缺失值（通常是采样和保持方法），对齐和连接会导致不想要的效果，例如由于连接而重复的几个时间戳。我仍然没有完美的解决方案，但似乎 np.searchsorted 可以提供帮助（它比使用多次调用 join / align 来完成我需要的要快得多）。到目前为止，我找不到Pandas的方法来做到这一点。

How can I map A into B so that B so that the result has all timestamps of A and B but no repetitions (except those which are already in A and B)?

如何将 A 映射到 B 以便 B 使结果具有 A 和 B 的所有时间戳但没有重复（除了那些已经在 A 和 B 中的时间戳）？

Another typical use case is sample and hold synch, which can be solved in an efficient way as follows (synch A with B, i.e. take for every timestamp in A the corresponding values in B:

另一个典型的用例是采样和保持同步，它可以通过如下有效的方式解决（同步 A 与 B，即为 A 中的每个时间戳取 B 中的相应值：

idx=np.searchsorted(B.index.values, A.index.values, side='right')-1
df=A.copy()
for i in B:
    df[i]=B[i].ix[idx].values

the result df contains the same index of A and the synchronized values in B.

结果 df 包含 A 的相同索引和 B 中的同步值。

Is there an effective way to do such things directly in pandas?

有没有一种有效的方法可以直接在Pandas中做这样的事情？

Answer 1

回答by Jeff

If you need to synchronize then, use align, docs are here. Otherwise merge is a good option.

如果您需要同步，请使用align，文档在此处。否则合并是一个不错的选择。

In [18]: N=100000

In [19]: df1=pd.DataFrame({'A':[1,2,3,4,5,6]*N, 'B':[1,5,3,4,5,7]*N}, index=pd.date_range('20140101 101501', freq='u', periods=6*N))

In [20]: df2=pd.DataFrame({'D':[10,2,30,4,5,10]*N, 'F':[1,5,3,4,5,70]*N}, index=pd.date_range('20140101 101501.000003', freq='u', periods=6*N))

In [21]: %timeit df1.merge(df2, how='outer', right_index=True, left_index=True).fillna(method='ffill')
10 loops, best of 3: 69.3 ms per loop

In [22]: %timeit df1.align(df2)
10 loops, best of 3: 36.5 ms per loop

In [24]: pd.set_option('max_rows',10)

In [25]: x, y = df1.align(df2)

In [26]: x
Out[26]: 
                             A   B   D   F
2014-01-01 10:15:01          1   1 NaN NaN
2014-01-01 10:15:01.000001   2   5 NaN NaN
2014-01-01 10:15:01.000002   3   3 NaN NaN
2014-01-01 10:15:01.000003   4   4 NaN NaN
2014-01-01 10:15:01.000004   5   5 NaN NaN
...                         ..  ..  ..  ..
2014-01-01 10:15:01.599998   5   5 NaN NaN
2014-01-01 10:15:01.599999   6   7 NaN NaN
2014-01-01 10:15:01.600000 NaN NaN NaN NaN
2014-01-01 10:15:01.600001 NaN NaN NaN NaN
2014-01-01 10:15:01.600002 NaN NaN NaN NaN

[600003 rows x 4 columns]

In [27]: y
Out[27]: 
                             A   B   D   F
2014-01-01 10:15:01        NaN NaN NaN NaN
2014-01-01 10:15:01.000001 NaN NaN NaN NaN
2014-01-01 10:15:01.000002 NaN NaN NaN NaN
2014-01-01 10:15:01.000003 NaN NaN  10   1
2014-01-01 10:15:01.000004 NaN NaN   2   5
...                         ..  ..  ..  ..
2014-01-01 10:15:01.599998 NaN NaN   2   5
2014-01-01 10:15:01.599999 NaN NaN  30   3
2014-01-01 10:15:01.600000 NaN NaN   4   4
2014-01-01 10:15:01.600001 NaN NaN   5   5
2014-01-01 10:15:01.600002 NaN NaN  10  70

[600003 rows x 4 columns]

Answer 2

回答by Xavi

If you wish to use the index of one of your DataFrames as pattern for synchronizing, maybe useful:

如果您希望使用您的 DataFrame 之一的索引作为同步模式，可能有用：

df3 = df1.iloc[df1.index.isin(df2.index),]

Note: I guess shape of df1 > shape of df2

注意：我猜 df1 的形状 > df2 的形状

In the previous code snippet, you get the elements in df1 and df2 but if you want to add new indexes maybe you prefer doing:

在前面的代码片段中，您获得了 df1 和 df2 中的元素，但如果您想添加新索引，您可能更喜欢这样做：

new_indexes = df1.index.diff(df2.index) # indexes of df1 and not in df2
default_values = np.zeros((new_indexes.shape[0],df2.shape[1])) 
df2 = df2.append(pd.DataFrame(default_values , index=new_indexes)).sort(axis=0)

You can see another way to synchronize in this post

你可以在这篇文章中看到另一种同步方式

Answer 3

回答by Nikolai Zaitsev

To my view sync of time series is a very simple procedure. Assume ts# (#=0,1,2)to be filled with

在我看来，时间序列的同步是一个非常简单的过程。假设ts# (#=0,1,2)充满

ts#[0,:]- time

ts#[0,:]- 时间

ts#[1,:]- ask

ts#[1,:]- 问

ts#[2,:]- bid

ts#[2,:]- 出价

ts#[3,:]- asksz

ts#[3,:]- 问

ts#[4,:]- bidsz

ts#[4,:]- 投标

output is

输出是

totts[0,:]- sync time

totts[0,:]- 同步时间

totts[1-4,:]- ask/bid/asksz/bidsz of ts0

totts[1-4,:]- 询问/出价/询问/出价 ts0

totts[5-8,:]- ask/bid/asksz/bidsz of ts1

totts[5-8,:]- 询问/出价/询问/出价 ts1

totts[9-12,:]- ask/bid/asksz/bidsz of ts2

totts[9-12,:]- 询问/出价/询问/出价 ts2

function:

功能：

def syncTS(ts0,ts1,ts2):

    ti0 = ts0[0,:]
    ti1 = ts1[0,:]
    ti2 = ts2[0,:]

    totti = np.union1d(ti0, ti1)
    totti = np.union1d(totti,ti2)

    totts = np.ndarray((13,len(totti)))

    it0=it1=it2=0
    nT0=len(ti0)-1
    nT1=len(ti1)-1
    nT2=len(ti2)-1

    for it,tim in enumerate(totti):
        if tim >= ti0[it0] and it0 < nT0:
            it0+=1

        if tim >= ti1[it1] and it1 < nT1:
            it1 += 1

        if tim >= ti2[it2] and it2 < nT2:
            it2 += 1

        totts[0, it] = tim
        for k in range(1,5):
            totts[k, it] = ts0[k, it0]
            totts[k + 4, it] = ts1[k, it1]
            totts[k + 8, it] = ts2[k, it2]

    return totts

在 Pandas 中同步两个大数据帧的最有效方法是什么？

提问by Mannaggia

回答by Jeff

回答by Xavi

回答by Nikolai Zaitsev

相关推荐

最近更新

标签

在 Pandas 中同步两个大数据帧的最有效方法是什么？

提问by Mannaggia

回答by Jeff

回答by Xavi

回答by Nikolai Zaitsev

相关推荐

Pandas groupby 如何计算范围内的计数

pandas 绘制数据框熊猫不工作

检查 Pandas 数据框列列表中的值

在 Pandas DataFrame 中有效地找到匹配的行（基于内容）

相关推荐

最近更新

标签