Merging time series data by timestamp using numpy/pandas

Question

Asked by vind

I have time series data from three completely different sensor sources as CSV files and want to combine them into one big CSV file. I've managed to read them into numpy using numpy's genfromtxt, but I'm not sure what to do from here.

Basically, what I have is something like this:

Table 1:

timestamp    val_a   val_b   val_c

Table 2:

timestamp    val_d   val_e   val_f   val_g

Table 3:

timestamp    val_h   val_i

All timestamps are UNIX millisecond timestamps as numpy.uint64.

And what I want is:

timestamp    val_a   val_b   val_c   val_d   val_e   val_f   val_g   val_h   val_i

...where all data is combined and ordered by timestamps. Each of the three tables is already ordered by timestamp. Since the data comes from different sources, there is no guarantee that a timestamp from table 1 will also be in table 2 or 3 and vice versa. In that case, the empty values should be marked as N/A.

So far, I have tried using pandas to convert the data like so:

df_sensor1 = pd.DataFrame(numpy_arr_sens1)
df_sensor2 = pd.DataFrame(numpy_arr_sens2)
df_sensor3 = pd.DataFrame(numpy_arr_sens3)

and then tried using pandas.DataFrame.merge, but I'm pretty sure that won't work for what I'm trying to do now. Can anyone point me in the right direction?

Answer 1

Answered by Romain

I think that you can simply:

Define the timestamp as the index of each DataFrame (using set_index)
Use a join with the 'outer' method to merge them
Optionally convert the timestamp to datetime

Here is what it looks like.

import pandas as pd

# Generating some test data (UNIX timestamps in seconds)
timestamp = [1440540000, 1450540000]
df1 = pd.DataFrame(
    {'timestamp': timestamp, 'a': ['val_a', 'val2_a'], 'b': ['val_b', 'val2_b'], 'c': ['val_c', 'val2_c']})
# Building a different index that shares no timestamp with df1
timestamp = [1440550000, 1450550000]
df2 = pd.DataFrame(
    {'timestamp': timestamp, 'd': ['val_d', 'val2_d'], 'e': ['val_e', 'val2_e'], 'f': ['val_f', 'val2_f'],
     'g': ['val_g', 'val2_g']})
# Keeping a value in common with the first index
timestamp = [1440540000, 1450560000]
df3 = pd.DataFrame({'timestamp': timestamp, 'h': ['val_h', 'val2_h'], 'i': ['val_i', 'val2_i']})

# Setting the timestamp as the index
df1.set_index('timestamp', inplace=True)
df2.set_index('timestamp', inplace=True)
df3.set_index('timestamp', inplace=True)

# You can convert timestamps to dates but it's not mandatory I think
# (this test data is in seconds; use unit='ms' for millisecond timestamps)
df1.index = pd.to_datetime(df1.index, unit='s')
df2.index = pd.to_datetime(df2.index, unit='s')
df3.index = pd.to_datetime(df3.index, unit='s')

# Just perform a join and that's it; missing values show up as NaN
result = df1.join(df2, how='outer').join(df3, how='outer')
print(result)
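
The question also asks for a single CSV file with the gaps marked as N/A. A minimal sketch of how that could look with millisecond uint64 timestamps is below. The `frame` helper, the sample values and the reduced set of columns are assumptions made for illustration; they are not part of the original data. The timestamps are kept as integers here; `pd.to_datetime(..., unit='ms')` would be the option if dates are preferred.

```python
import io

import numpy as np
import pandas as pd


def frame(timestamps_ms, **columns):
    """Hypothetical helper: build a DataFrame indexed by UNIX millisecond timestamps."""
    index = pd.Index(np.asarray(timestamps_ms, dtype=np.uint64), name='timestamp')
    return pd.DataFrame(columns, index=index)


# Made-up sample readings from three sensors with partly overlapping timestamps
df1 = frame([1440540000000, 1440540001000], val_a=[1.0, 2.0], val_b=[3.0, 4.0])
df2 = frame([1440540000500, 1440540001000], val_d=[5.0, 6.0])
df3 = frame([1440540000000, 1440540002000], val_h=[7.0, 8.0])

# Outer joins keep every timestamp that appears in any table;
# sort_index() makes the ordering by timestamp explicit
merged = df1.join(df2, how='outer').join(df3, how='outer').sort_index()

# na_rep controls how missing values are written; a file path could be
# passed to to_csv instead of the in-memory buffer used here
buffer = io.StringIO()
merged.to_csv(buffer, na_rep='N/A')
csv_text = buffer.getvalue()
print(csv_text)
```

Each row of the output corresponds to one timestamp seen in at least one table, with N/A wherever a sensor has no reading for that timestamp.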
