Merging time series data by timestamp using numpy/pandas
Original question: http://stackoverflow.com/questions/32215024/
Note: this page reproduces a popular Stack Overflow question and its answer, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, cite the original URL and authors, and attribute it to the original authors (not me): Stack Overflow
Asked by vind
I have time series data from three completely different sensor sources as CSV files and want to combine them into one big CSV file. I've managed to read them into numpy using numpy's genfromtxt, but I'm not sure what to do from here.
Basically, what I have is something like this:
Table 1:
timestamp val_a val_b val_c
Table 2:
timestamp val_d val_e val_f val_g
Table 3:
timestamp val_h val_i
All timestamps are UNIX millisecond timestamps stored as numpy.uint64.
And what I want is:
timestamp val_a val_b val_c val_d val_e val_f val_g val_h val_i
...where all data is combined and ordered by timestamps. Each of the three tables is already ordered by timestamp. Since the data comes from different sources, there is no guarantee that a timestamp from table 1 will also be in table 2 or 3 and vice versa. In that case, the empty values should be marked as N/A.
So far, I have tried using pandas to convert the data like so:
df_sensor1 = pd.DataFrame(numpy_arr_sens1)
df_sensor2 = pd.DataFrame(numpy_arr_sens2)
df_sensor3 = pd.DataFrame(numpy_arr_sens3)
and then tried using pandas.DataFrame.merge, but I'm pretty sure that won't work for what I'm trying to do now. Can anyone point me in the right direction?
Answered by Romain
I think that you can simply:
- Define the timestamp as the index of each DataFrame (using set_index)
- Use a join to merge them with the 'outer' method
- Optionally convert timestamp to datetime
Here is what it looks like.
import pandas as pd

# Generating some test data (timestamps here are in seconds)
timestamp = [1440540000, 1450540000]
df1 = pd.DataFrame(
    {'timestamp': timestamp, 'a': ['val_a', 'val2_a'], 'b': ['val_b', 'val2_b'], 'c': ['val_c', 'val2_c']})
# Building a different index (fixed values that do not occur in df1)
timestamp = [1440550000, 1450550000]
df2 = pd.DataFrame(
    {'timestamp': timestamp, 'd': ['val_d', 'val2_d'], 'e': ['val_e', 'val2_e'], 'f': ['val_f', 'val2_f'],
     'g': ['val_g', 'val2_g']})
# Keeping a value in common with the first index
timestamp = [1440540000, 1450560000]
df3 = pd.DataFrame({'timestamp': timestamp, 'h': ['val_h', 'val2_h'], 'i': ['val_i', 'val2_i']})
# Setting the timestamp as the index
df1.set_index('timestamp', inplace=True)
df2.set_index('timestamp', inplace=True)
df3.set_index('timestamp', inplace=True)
# You can convert timestamps to dates, but I think it is not mandatory
# (use unit='ms' instead if your timestamps are in milliseconds)
df1.index = pd.to_datetime(df1.index, unit='s')
df2.index = pd.to_datetime(df2.index, unit='s')
df3.index = pd.to_datetime(df3.index, unit='s')
# Just perform an outer join and that's it; missing values show up as NaN
result = df1.join(df2, how='outer').join(df3, how='outer')
print(result)


