Efficiently add single row to Pandas Series or DataFrame

Disclaimer: this page is a translated copy of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you reuse or share it, you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/13751926/



python, performance, pandas, time-series

Asked by user1883571

I want to use Pandas to work with series in real-time. Every second, I need to add the latest observation to an existing series. My series are grouped into a DataFrame and stored in an HDF5 file.


Here's how I do it at the moment:


>>> from pandas import Series
>>> existing_series = Series([7, 13, 97], index=[0, 1, 2])
>>> updated_series = existing_series.append(Series([111], index=[3]))

Is this the most efficient way? I've read countless posts but cannot find any that focuses on efficiency with high-frequency data.


Edit: I just read about the shelve and pickle modules. It seems like they would achieve what I'm trying to do, basically saving lists on disk. Because my lists are large, is there any way not to load the full list into memory but, rather, efficiently append values one at a time?

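For reference, here is a minimal sketch of the shelve idea mentioned in the edit. The file name and per-observation keying scheme are assumptions made up for illustration; the point is that shelve is a persistent key-value store, so writing one observation under its own key does not require reading the rest of the data back into memory:

import shelve

# Hypothetical file name and keying scheme, purely for illustration:
# each observation goes under its own key, so appending one value does
# not load the existing data back into memory.
with shelve.open('observations.db') as db:
    next_key = str(len(db))   # keys '0', '1', '2', ...
    db[next_key] = 111        # persist only the newest observation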

Answered by Jeff

Take a look at the new PyTables docs in 0.10 (coming soon), or you can get it from master: http://pandas.pydata.org/pandas-docs/dev/whatsnew.html


PyTables is actually pretty good at appending, and writing to an HDFStore every second will work. You want to store a DataFrame table. You can then select data in a query-like fashion, e.g.


# store is an open pandas.HDFStore; the_latest_df holds the newest row(s)
store.append('df', the_latest_df)
store.append('df', the_latest_df)
....
store.select('df', [ 'index>12:00:01' ])
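
For a more self-contained version of the above, here is a rough sketch using the pd.HDFStore API; the file name, column name and timestamps are assumptions for illustration, and PyTables must be installed:

import pandas as pd

store = pd.HDFStore('ticks.h5')

# Called once per second with the newest observation(s).
the_latest_df = pd.DataFrame(
    {'price': [101.5]},
    index=[pd.Timestamp('2012-12-06 12:00:02')],
)
store.append('df', the_latest_df)   # appends rows on disk, no full rewrite

# Read back only the rows you need, without loading the whole table.
recent = store.select('df', where="index > '2012-12-06 12:00:01'")
store.close()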

If this is all from the same process, then this will work great. If you have a writer process and then another process is reading, this is a little tricky (but will work correctly depending on what you are doing).

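As a rough sketch of the reader side in that two-process setup (file and key names are assumptions carried over from the snippet above), the reading process can re-open the file read-only each time it wants fresh data; whether this is safe while the writer holds the file open depends on your setup, as noted above:

import pandas as pd

# Open read-only so the writer process keeps ownership of the file,
# query just the recent rows, and close again.
with pd.HDFStore('ticks.h5', mode='r') as reader:
    recent = reader.select('df', where="index > '2012-12-06 12:00:01'")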

Another option is to use messaging to transmit from one process to another (and then append in memory); this avoids the serialization issue.

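Here is a rough sketch of that messaging idea, using a multiprocessing.Queue and an in-memory append on the receiving side; the producer values and the sentinel convention are made up for illustration:

import multiprocessing as mp
import pandas as pd

def producer(queue):
    # In reality this would read a live feed once per second.
    for i, value in enumerate([7, 13, 97, 111]):
        queue.put((i, value))
    queue.put(None)                      # sentinel: no more data

def consumer(queue):
    series = pd.Series(dtype='float64')
    while (msg := queue.get()) is not None:
        idx, value = msg
        series.loc[idx] = value          # in-memory append, nothing serialized to disk
    print(series)

if __name__ == '__main__':
    q = mp.Queue()
    p = mp.Process(target=producer, args=(q,))
    c = mp.Process(target=consumer, args=(q,))
    p.start()
    c.start()
    p.join()
    c.join()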