Merging time series data by timestamp using numpy/pandas

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/32215024/

Date: 2020-09-13 23:48:46  Source: igfitidea

Tags: python, numpy, pandas

Asked by vind

I have time series data from three completely different sensor sources as CSV files and want to combine them into one big CSV file. I've managed to read them into numpy using numpy's genfromtxt, but I'm not sure what to do from here.
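
As a rough illustration of this starting point, the sketch below reads a small in-memory CSV with numpy's genfromtxt and wraps the resulting structured array in a DataFrame. The sample values and the header names are made up for the example, and names=True assumes each file begins with a header line.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for one sensor file; real data would come from disk.
csv_text = "timestamp,val_a,val_b\n1440540000000,1.5,2.5\n1440540000100,3.5,4.5\n"

# names=True takes field names from the header row; dtype=None infers a type per column.
arr = np.genfromtxt(
    io.StringIO(csv_text), delimiter=",", names=True, dtype=None, encoding="utf-8"
)

# A structured array converts to a DataFrame with one column per named field.
df_sensor1 = pd.DataFrame(arr)
print(df_sensor1)
```

Keeping the field names at this stage means the timestamp column can later be addressed by name rather than by position.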

Basically, what I have is something like this:

Table 1:

timestamp    val_a   val_b   val_c

Table 2:

timestamp    val_d   val_e   val_f   val_g

Table 3:

timestamp    val_h   val_i

All timestamps are UNIX millisecond timestamps as numpy.uint64.
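
For timestamps of this kind, a minimal sketch of the conversion to datetimes might look like the following. The sample values are invented, and the cast to int64 is an assumption that the values fit in a signed 64-bit integer, which holds for realistic millisecond timestamps.

```python
import numpy as np
import pandas as pd

# Hypothetical millisecond timestamps stored as uint64, as described above.
ts_ms = np.array([1440540000000, 1450540000000], dtype=np.uint64)

# Cast to signed 64-bit integers first (assumption: values fit in int64),
# then interpret them as milliseconds since the UNIX epoch.
dt_index = pd.to_datetime(ts_ms.astype(np.int64), unit="ms")
print(dt_index)
```

Note the unit="ms" argument: passing unit="s" to millisecond values would produce dates far outside the intended range.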

And what I want is:

timestamp    val_a   val_b   val_c   val_d   val_e   val_f   val_g   val_h   val_i

...where all data is combined and ordered by timestamps. Each of the three tables is already ordered by timestamp. Since the data comes from different sources, there is no guarantee that a timestamp from table 1 will also be in table 2 or 3 and vice versa. In that case, the empty values should be marked as N/A.

So far, I have tried using pandas to convert the data like so:

df_sensor1 = pd.DataFrame(numpy_arr_sens1)
df_sensor2 = pd.DataFrame(numpy_arr_sens2)
df_sensor3 = pd.DataFrame(numpy_arr_sens3)

and then tried using pandas.DataFrame.merge, but I'm pretty sure that won't work for what I'm trying to do now. Can anyone point me in the right direction?
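
For reference, an outer merge on the timestamp column can in principle produce the layout described above. The following is only a sketch with made-up values; it assumes every frame has a column literally named timestamp, and the frame contents are hypothetical.

```python
from functools import reduce

import pandas as pd

# Made-up frames; only the 'timestamp' column name is taken from the question.
df_sensor1 = pd.DataFrame({"timestamp": [1, 3], "val_a": [10.0, 30.0]})
df_sensor2 = pd.DataFrame({"timestamp": [2, 3], "val_d": [0.2, 0.3]})
df_sensor3 = pd.DataFrame({"timestamp": [1, 4], "val_h": [100.0, 400.0]})

# Outer-merge the frames pairwise on 'timestamp', keeping every timestamp seen anywhere.
merged = reduce(
    lambda left, right: pd.merge(left, right, on="timestamp", how="outer"),
    [df_sensor1, df_sensor2, df_sensor3],
)

# Sort explicitly rather than rely on the merge's own key ordering.
merged = merged.sort_values("timestamp").reset_index(drop=True)
print(merged)
```

Rows whose timestamp is missing from a given frame get NaN in that frame's columns, which corresponds to the N/A marking asked for.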

Answered by Romain

I think that you can simply:

  • Define the timestamp as the index of each DataFrame (using set_index)
  • Use a join with how='outer' to merge them
  • Optionally convert the timestamp to datetime

Here is what it looks like.

import pandas as pd

# Generating some test data (these sample timestamps are in seconds)
timestamp = [1440540000, 1450540000]
df1 = pd.DataFrame(
    {'timestamp': timestamp, 'a': ['val_a', 'val2_a'], 'b': ['val_b', 'val2_b'], 'c': ['val_c', 'val2_c']})
# Building a different index: timestamps that do not appear in df1
timestamp = [1440550000, 1450550000]
df2 = pd.DataFrame(
    {'timestamp': timestamp, 'd': ['val_d', 'val2_d'], 'e': ['val_e', 'val2_e'], 'f': ['val_f', 'val2_f'],
     'g': ['val_g', 'val2_g']})
# Keeping a value in common with the first index
timestamp = [1440540000, 1450560000]
df3 = pd.DataFrame({'timestamp': timestamp, 'h': ['val_h', 'val2_h'], 'i': ['val_i', 'val2_i']})

# Setting the timestamp as the index
df1.set_index('timestamp', inplace=True)
df2.set_index('timestamp', inplace=True)
df3.set_index('timestamp', inplace=True)

# Converting timestamps to dates is optional (use unit='ms' for millisecond timestamps)
df1.index = pd.to_datetime(df1.index, unit='s')
df2.index = pd.to_datetime(df2.index, unit='s')
df3.index = pd.to_datetime(df3.index, unit='s')

# Just perform an outer join and that's it; missing values show up as NaN
result = df1.join(df2, how='outer').join(df3, how='outer')
print(result)
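
Since the question asks for one combined CSV with empty values marked as N/A, the joined frame could then be written out along these lines. This is a sketch under assumptions: the two small frames are made up, and the CSV is returned as text rather than written to a real file.

```python
import pandas as pd

# Made-up frames indexed by timestamp, mirroring the layout used in the answer.
df1 = pd.DataFrame(
    {"timestamp": [1440540000, 1450540000], "a": ["val_a", "val2_a"]}
).set_index("timestamp")
df2 = pd.DataFrame(
    {"timestamp": [1440540000, 1450560000], "h": ["val_h", "val2_h"]}
).set_index("timestamp")

result = df1.join(df2, how="outer")

# With no path given, to_csv returns the CSV text; na_rep fills the missing cells.
csv_text = result.to_csv(na_rep="N/A")
print(csv_text)
```

Passing a file path as the first argument of to_csv would write the same content to disk instead.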
