Query HDF5 in Pandas

Disclaimer: this page is based on a popular StackOverflow question (originally presented as a Chinese-English parallel translation) and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA terms, link to the original, and attribute it to the original authors (not me): StackOverflow

Original: http://stackoverflow.com/questions/23863553/
Asked by user3576212
I have following data (18,619,211 rows) stored as a pandas dataframe object in hdf5 file:
date id2 w
id
100010 1980-03-31 10401 0.000839
100010 1980-03-31 10604 0.020140
100010 1980-03-31 12490 0.026149
100010 1980-03-31 13047 0.033560
100010 1980-03-31 13303 0.001657
where id is the index and the others are columns; date is np.datetime64. I need to perform a query like this (the code doesn't work, of course):
db=pd.HDFStore('database.h5')
data=db.select('df', where='id==id_i & date>bgdt & date<endt')
Note that id_i, bgdt, endt are all variables, not actual values, and need to be passed within a loop. For example:
dates is a Pandas Period index or a timestamp index; either way, I can convert between the two.
dates=['1990-01','1990-04','1990-09',......]
id_list is a list of IDs
id_list=[100010, 100011,1000012,.......]
The loop looks like this (the reason I loop is that the data is huge, and I am querying other datasets at the same time and then performing some operations)
db=pd.HDFStore('database.h5')
for id_i in id_list:
    for date in dates:
        bgdt=date-1   # move to the previous month
        endt=date-60  # 60 months earlier
        data=db.select('df', where='index==id_i & date>bgdt & date<endt')
        ......
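For what it's worth, the month arithmetic intended by bgdt/endt maps naturally onto pandas.Period objects (a small sketch, assuming monthly frequency):

```python
import pandas as pd

# monthly Period arithmetic: subtracting an integer moves whole months
date = pd.Period('1990-09', freq='M')
bgdt = date - 1    # the previous month
endt = date - 60   # 60 months earlier
```

Here `bgdt` is Period('1990-08', 'M') and `endt` is Period('1985-09', 'M').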
This problem has two parts:
- I don't know how to query the index and the columns at the same time. The pandas docs show how to query based on index conditions OR column conditions, but there are no examples of querying on both at the SAME TIME.
- (BTW, this is very common in the pandas documentation. The docs usually show something very simple, like how to do 'A' OR how to do 'B', but not how to do BOTH 'A' and 'B'. A good example is using query on a MultiIndex pandas dataframe: the docs show queries based on either level=0 OR level=1, but give no example of doing BOTH at the SAME TIME.)
- I don't know how to pass the three variables id_i, bgdt, endt to the query. I know how to pass only one by using %s, but not all of them.
- I am also a little confused by the datetime datatypes. There seem to be quite a few: datetime.datetime, numpy.datetime64, pandas.Period. I mostly work with monthly data, so pandas.Period is the most useful one. But I can't easily convert a column (not an index) of timestamps (the default date type pandas produces when parsing raw data). Is there any datatype that is simply a 'date': not a timestamp, not a period, just a simple DATE with only year, month and day?
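On the last point, a quick sketch of moving between the common date representations (the sample values here are illustrative):

```python
import pandas as pd

# a column of timestamps, as pandas would parse from raw data
ts = pd.Series(pd.to_datetime(['1990-01-15', '1990-04-20']))

monthly = ts.dt.to_period('M')  # pandas.Period with monthly resolution
plain = ts.dt.date              # plain datetime.date: year, month and day only
```

`ts.dt.date` yields python datetime.date objects, which is the closest thing to a bare "DATE" type; note the column then holds objects, not a native pandas dtype.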
A lot of trouble, but I really LOVE Python and pandas (I am trying to move my workflow from SAS to Python). Any help will be appreciated!
Accepted answer by Jeff
Here are the docs for querying on non-index columns.
Create the test data. It is not clear how the original frame is constructed, e.g. whether its data and ranges are unique, so I have created a sample with 10M rows and a multi-level date range together with the id column.
In [60]: np.random.seed(1234)
In [62]: pd.set_option('display.max_rows',20)
In [63]: index = pd.MultiIndex.from_product([np.arange(10000,11000),pd.date_range('19800101',periods=10000)],names=['id','date'])
In [67]: df = DataFrame(dict(id2=np.random.randint(0,1000,size=len(index)),w=np.random.randn(len(index))),index=index).reset_index().set_index(['id','date'])
In [68]: df
Out[68]:
id2 w
id date
10000 1980-01-01 712 0.371372
1980-01-02 718 -1.255708
1980-01-03 581 -1.182727
1980-01-04 202 -0.947432
1980-01-05 493 -0.125346
1980-01-06 752 0.380210
1980-01-07 435 -0.444139
1980-01-08 128 -1.885230
1980-01-09 425 1.603619
1980-01-10 449 0.103737
... ... ...
10999 2007-05-09 8 0.624532
2007-05-10 669 0.268340
2007-05-11 918 0.134816
2007-05-12 979 -0.769406
2007-05-13 969 -0.242123
2007-05-14 950 -0.347884
2007-05-15 49 -1.284825
2007-05-16 922 -1.313928
2007-05-17 347 -0.521352
2007-05-18 353 0.189717
[10000000 rows x 2 columns]
Write the data to disk, showing how to create a data column (note that the indexes are automatically queryable; this allows id2 to be queryable as well). This takes care of opening and closing the store; it is de-facto equivalent to opening a store, appending, and closing it yourself.
In order to query a column, it MUST BE A DATA COLUMN or an index of the frame.
In [70]: df.to_hdf('test.h5','df',mode='w',data_columns=['id2'],format='table')
In [71]: !ls -ltr test.h5
-rw-rw-r-- 1 jreback users 430540284 May 26 17:16 test.h5
Queries
In [80]: ids=[10101,10898]
In [81]: start_date='20010101'
In [82]: end_date='20010301'
You can specify dates as strings (either in-line or as variables); you can also specify Timestamp-like objects.
In [83]: pd.read_hdf('test.h5','df',where='date>start_date & date<end_date')
Out[83]:
id2 w
id date
10000 2001-01-02 972 -0.146107
2001-01-03 954 1.420412
2001-01-04 567 1.077633
2001-01-05 87 -0.042838
2001-01-06 79 -1.791228
2001-01-07 744 1.110478
2001-01-08 237 -0.846086
2001-01-09 998 -0.696369
2001-01-10 266 -0.595555
2001-01-11 206 -0.294633
... ... ...
10999 2001-02-19 616 -0.745068
2001-02-20 577 -1.474748
2001-02-21 990 -1.276891
2001-02-22 939 -1.369558
2001-02-23 621 -0.214365
2001-02-24 396 -0.142100
2001-02-25 492 -0.204930
2001-02-26 478 1.839291
2001-02-27 688 0.291504
2001-02-28 356 -1.987554
[58000 rows x 2 columns]
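As an aside (not from the original answer), the same pass-variables idea works for plain in-memory DataFrame.query, using @ to reference local variables; this is handy for the in-memory sub-selection step discussed later. A small sketch with hypothetical data:

```python
import pandas as pd

# hypothetical small frame standing in for data already in memory
df = pd.DataFrame({'date': pd.to_datetime(['2000-12-31', '2001-01-15', '2001-03-05']),
                   'w': [1.0, 2.0, 3.0]})

start_date = '20010101'
end_date = '20010301'

# @ pulls in local variables, analogous to variable names in the HDF5 where= string
out = df.query('date > @start_date & date < @end_date')
```

Only the middle row (2001-01-15) falls inside the range, so `out` has a single row.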
You can use in-line lists
In [84]: pd.read_hdf('test.h5','df',where='date>start_date & date<end_date & id=ids')
Out[84]:
id2 w
id date
10101 2001-01-02 722 1.620553
2001-01-03 849 -0.603468
2001-01-04 635 -1.419072
2001-01-05 331 0.521634
2001-01-06 730 0.008830
2001-01-07 706 -1.006412
2001-01-08 59 1.380005
2001-01-09 689 0.017830
2001-01-10 788 -3.090800
2001-01-11 704 -1.491824
... ... ...
10898 2001-02-19 530 -1.031167
2001-02-20 652 -0.019266
2001-02-21 472 0.638266
2001-02-22 540 -1.827251
2001-02-23 654 -1.020140
2001-02-24 328 -0.477425
2001-02-25 871 -0.892684
2001-02-26 166 0.894118
2001-02-27 806 0.648240
2001-02-28 824 -1.051539
[116 rows x 2 columns]
You can also specify boolean expressions
In [85]: pd.read_hdf('test.h5','df',where='date>start_date & date<end_date & id=ids & id2>500 & id2<600')
Out[85]:
id2 w
id date
10101 2001-01-12 534 -0.220692
2001-01-14 596 -2.225393
2001-01-16 596 0.956239
2001-01-30 513 -2.528996
2001-02-01 572 -1.877398
2001-02-13 569 -0.940748
2001-02-14 541 1.035619
2001-02-21 571 -0.116547
10898 2001-01-16 591 0.082564
2001-02-06 586 0.470872
2001-02-10 531 -0.536194
2001-02-16 586 0.949947
2001-02-19 530 -1.031167
2001-02-22 540 -1.827251
To answer your actual question, I would do this (there is really not enough information, but I'll make some reasonable assumptions):
- Don't loop over queries, unless you have a very small number of absolute queries.
- Read the biggest chunk into memory that you can. Usually this is accomplished by selecting the biggest ranges of data that you need, even if you select MORE data than you actually need.
- Then sub-select using in-memory expressions, which will generally be orders of magnitude faster.
- List elements are limited to about 30 elements in total (this is currently an implementation limit on the PyTables side). If you specify more it will still work, but what happens is that you read in a lot of data and it is then reindexed down (in-memory), so the user needs to be aware of this.
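The in-memory sub-select step can be sketched like this (a tiny frame standing in for the big chunk read from HDF5):

```python
import numpy as np
import pandas as pd

# stand-in for the large chunk read from the store: 3 ids x 5 dates
index = pd.MultiIndex.from_product(
    [[10000, 10001, 10002], pd.date_range('2001-01-01', periods=5)],
    names=['id', 'date'])
df = pd.DataFrame({'w': np.arange(15.0)}, index=index)

# sub-select the ids of interest entirely in memory
wanted = [10000, 10002]
subset = df[df.index.get_level_values('id').isin(wanted)]
```

Because the filter runs on an in-memory boolean mask rather than a disk query, there is no limit on the length of `wanted`.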
So, for example, say that you have 1000 unique ids with 10000 dates each, as my example demonstrates. You want to select, say, 200 of these, with a date range of 1000.
So in this case I would simply select on the dates and then do the in-memory comparison, something like this:
df = pd.read_hdf('test.h5','df',where='date>=global_start_date & date<=global_end_date')
df = df[df.index.get_level_values('id').isin(list_of_ids)]
You might also have dates that change per id. So chunk them, this time using a list of ids.
Something like this:
output = []
for i in range(0, len(list_of_ids), 30):   # chunks of 30 ids (the PyTables list limit)
    ids = list_of_ids[i:(i+30)]
    start_date = get_start_date_for_these_ids(ids)   # global
    end_date = get_end_date_for_these_ids(ids)       # global
    where = 'id=ids & date>=start_date & date<=end_date'
    df = pd.read_hdf('test.h5','df',where=where)
    output.append(df)

final_result = pd.concat(output)
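Stripped of the HDF5 call, the chunking itself is just slicing the id list in steps of 30 (sizes here are illustrative):

```python
# hypothetical id list; real ids would come from your data
list_of_ids = list(range(100))

# slices of at most 30 elements, matching the PyTables list limit
chunks = [list_of_ids[i:i+30] for i in range(0, len(list_of_ids), 30)]
sizes = [len(c) for c in chunks]
```

With 100 ids this yields four chunks of sizes 30, 30, 30 and 10.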
The basic idea, then, is to select a superset of the data using the criteria you want, sub-selecting in memory so that it fits, while limiting the number of queries you do (e.g. imagine that your query ends up selecting a single row; if you have to run that query 18M times, that is bad).

