Query HDF5 in Pandas

Disclaimer: this page is based on a popular StackOverflow question (originally presented as a Chinese-English parallel translation) and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA terms, link to the original, and attribute it to the original authors (not me): StackOverflow

Original: http://stackoverflow.com/questions/23863553/
Asked by user3576212
I have following data (18,619,211 rows) stored as a pandas dataframe object in hdf5 file:
date id2 w
id
100010 1980-03-31 10401 0.000839
100010 1980-03-31 10604 0.020140
100010 1980-03-31 12490 0.026149
100010 1980-03-31 13047 0.033560
100010 1980-03-31 13303 0.001657
where id is the index and the others are columns; date is np.datetime64. I need to perform a query like this (the code doesn't work, of course):
db=pd.HDFStore('database.h5')
data=db.select('df', where='id==id_i & date>bgdt & date<endt')
Note that id_i, bgdt, endt are all variables, not actual values, and need to be passed within a loop. For example:
dates is a Pandas Period index or a timestamp index; either way, I can convert between the two.
dates=['1990-01','1990-04','1990-09',......]
id_list is a list of IDs
id_list=[100010, 100011,1000012,.......]
The loop looks like this (the reason I loop is that the data is huge, and I am querying other datasets at the same time and then performing some operations)
db=pd.HDFStore('database.h5')
for id_i in id_list:
    for date in dates:
        bgdt=date-1   # move to the previous month
        endt=date-60  # 60 months earlier
        data=db.select('df', where='index==id_i & date>bgdt & date<endt')
        ......
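For what it's worth, the month arithmetic intended by bgdt/endt maps naturally onto pandas.Period objects (a small sketch, assuming monthly frequency):

```python
import pandas as pd

# monthly Period arithmetic: subtracting an integer moves whole months
date = pd.Period('1990-09', freq='M')
bgdt = date - 1    # the previous month
endt = date - 60   # 60 months earlier
```

Here `bgdt` is Period('1990-08', 'M') and `endt` is Period('1985-09', 'M').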
This problem has two parts:
- I don't know how to query the index and the columns at the same time. The pandas docs show how to query based on index conditions OR column conditions, but there are no examples of querying on both at the SAME TIME.
- (BTW, this is very common in the pandas documentation. The docs usually show something very simple, like how to do 'A' OR how to do 'B', but not how to do BOTH 'A' and 'B'. A good example is using query on a MultiIndex pandas dataframe: the docs show queries based on either level=0 OR level=1, but give no example of doing BOTH at the SAME TIME.)
- I don't know how to pass the three variables id_i, bgdt, endt to the query. I know how to pass only one by using %s, but not all of them.
- I am also a little confused by the datetime datatypes. There seem to be quite a few: datetime.datetime, numpy.datetime64, pandas.Period. I mostly work with monthly data, so pandas.Period is the most useful one. But I can't easily convert a column (not an index) of timestamps (the default date type pandas produces when parsing raw data). Is there any datatype that is simply a 'date': not a timestamp, not a period, just a simple DATE with only year, month and day?
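On the last point, a quick sketch of moving between the common date representations (the sample values here are illustrative):

```python
import pandas as pd

# a column of timestamps, as pandas would parse from raw data
ts = pd.Series(pd.to_datetime(['1990-01-15', '1990-04-20']))

monthly = ts.dt.to_period('M')  # pandas.Period with monthly resolution
plain = ts.dt.date              # plain datetime.date: year, month and day only
```

`ts.dt.date` yields python datetime.date objects, which is the closest thing to a bare "DATE" type; note the column then holds objects, not a native pandas dtype.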
A lot of trouble, but I really LOVE Python and pandas (I am trying to move my workflow from SAS to Python). Any help will be appreciated!
Accepted answer by Jeff
Here are the docs for querying on non-index columns.
Create the test data. It is not clear how the original frame is constructed, e.g. whether its data and ranges are unique, so I have created a sample with 10M rows and a multi-level date range together with the id column.
In [60]: np.random.seed(1234)
In [62]: pd.set_option('display.max_rows',20)
In [63]: index = pd.MultiIndex.from_product([np.arange(10000,11000),pd.date_range('19800101',periods=10000)],names=['id','date'])
In [67]: df = DataFrame(dict(id2=np.random.randint(0,1000,size=len(index)),w=np.random.randn(len(index))),index=index).reset_index().set_index(['id','date'])
In [68]: df
Out[68]:
id2 w
id date
10000 1980-01-01 712 0.371372
1980-01-02 718 -1.255708
1980-01-03 581 -1.182727
1980-01-04 202 -0.947432
1980-01-05 493 -0.125346
1980-01-06 752 0.380210
1980-01-07 435 -0.444139
1980-01-08 128 -1.885230
1980-01-09 425 1.603619
1980-01-10 449 0.103737
... ... ...
10999 2007-05-09 8 0.624532
2007-05-10 669 0.268340
2007-05-11 918 0.134816
2007-05-12 979 -0.769406
2007-05-13 969 -0.242123
2007-05-14 950 -0.347884
2007-05-15 49 -1.284825
2007-05-16 922 -1.313928
2007-05-17 347 -0.521352
2007-05-18 353 0.189717
[10000000 rows x 2 columns]
Write the data to disk, showing how to create a data column (note that the indexes are automatically queryable; this allows id2 to be queryable as well). This takes care of opening and closing the store; it is de-facto equivalent to opening a store, appending, and closing it yourself.
In order to query a column, it MUST BE A DATA COLUMN or an index of the frame.
In [70]: df.to_hdf('test.h5','df',mode='w',data_columns=['id2'],format='table')
In [71]: !ls -ltr test.h5
-rw-rw-r-- 1 jreback users 430540284 May 26 17:16 test.h5
Queries
In [80]: ids=[10101,10898]
In [81]: start_date='20010101'
In [82]: end_date='20010301'
You can specify dates as strings (either in-line or as variables); you can also specify Timestamp-like objects.
In [83]: pd.read_hdf('test.h5','df',where='date>start_date & date<end_date')
Out[83]:
id2 w
id date
10000 2001-01-02 972 -0.146107
2001-01-03 954 1.420412
2001-01-04 567 1.077633
2001-01-05 87 -0.042838
2001-01-06 79 -1.791228
2001-01-07 744 1.110478
2001-01-08 237 -0.846086
2001-01-09 998 -0.696369
2001-01-10 266 -0.595555
2001-01-11 206 -0.294633
... ... ...
10999 2001-02-19 616 -0.745068
2001-02-20 577 -1.474748
2001-02-21 990 -1.276891
2001-02-22 939 -1.369558
2001-02-23 621 -0.214365
2001-02-24 396 -0.142100
2001-02-25 492 -0.204930
2001-02-26 478 1.839291
2001-02-27 688 0.291504
2001-02-28 356 -1.987554
[58000 rows x 2 columns]
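As an aside (not from the original answer), the same pass-variables idea works for plain in-memory DataFrame.query, using @ to reference local variables; this is handy for the in-memory sub-selection step discussed later. A small sketch with hypothetical data:

```python
import pandas as pd

# hypothetical small frame standing in for data already in memory
df = pd.DataFrame({'date': pd.to_datetime(['2000-12-31', '2001-01-15', '2001-03-05']),
                   'w': [1.0, 2.0, 3.0]})

start_date = '20010101'
end_date = '20010301'

# @ pulls in local variables, analogous to variable names in the HDF5 where= string
out = df.query('date > @start_date & date < @end_date')
```

Only the middle row (2001-01-15) falls inside the range, so `out` has a single row.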
You can use in-line lists
In [84]: pd.read_hdf('test.h5','df',where='date>start_date & date<end_date & id=ids')
Out[84]:
id2 w
id date
10101 2001-01-02 722 1.620553
2001-01-03 849 -0.603468
2001-01-04 635 -1.419072
2001-01-05 331 0.521634
2001-01-06 730 0.008830
2001-01-07 706 -1.006412
2001-01-08 59 1.380005
2001-01-09 689 0.017830
2001-01-10 788 -3.090800
2001-01-11 704 -1.491824
... ... ...
10898 2001-02-19 530 -1.031167
2001-02-20 652 -0.019266
2001-02-21 472 0.638266
2001-02-22 540 -1.827251
2001-02-23 654 -1.020140
2001-02-24 328 -0.477425
2001-02-25 871 -0.892684
2001-02-26 166 0.894118
2001-02-27 806 0.648240
2001-02-28 824 -1.051539
[116 rows x 2 columns]
You can also specify boolean expressions
In [85]: pd.read_hdf('test.h5','df',where='date>start_date & date<end_date & id=ids & id2>500 & id2<600')
Out[85]:
id2 w
id date
10101 2001-01-12 534 -0.220692
2001-01-14 596 -2.225393
2001-01-16 596 0.956239
2001-01-30 513 -2.528996
2001-02-01 572 -1.877398
2001-02-13 569 -0.940748
2001-02-14 541 1.035619
2001-02-21 571 -0.116547
10898 2001-01-16 591 0.082564
2001-02-06 586 0.470872
2001-02-10 531 -0.536194
2001-02-16 586 0.949947
2001-02-19 530 -1.031167
2001-02-22 540 -1.827251
To answer your actual question, I would do this (there is really not enough information, but I'll make some reasonable assumptions):
- Don't loop over queries, unless you have a very small number of absolute queries.
- Read the biggest chunk into memory that you can. Usually this is accomplished by selecting the biggest ranges of data that you need, even if you select MORE data than you actually need.
- Then sub-select using in-memory expressions, which will generally be orders of magnitude faster.
- List elements are limited to about 30 elements in total (this is currently an implementation limit on the PyTables side). If you specify more it will still work, but what happens is that you read in a lot of data and it is then reindexed down (in-memory), so the user needs to be aware of this.
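The in-memory sub-select step can be sketched like this (a tiny frame standing in for the big chunk read from HDF5):

```python
import numpy as np
import pandas as pd

# stand-in for the large chunk read from the store: 3 ids x 5 dates
index = pd.MultiIndex.from_product(
    [[10000, 10001, 10002], pd.date_range('2001-01-01', periods=5)],
    names=['id', 'date'])
df = pd.DataFrame({'w': np.arange(15.0)}, index=index)

# sub-select the ids of interest entirely in memory
wanted = [10000, 10002]
subset = df[df.index.get_level_values('id').isin(wanted)]
```

Because the filter runs on an in-memory boolean mask rather than a disk query, there is no limit on the length of `wanted`.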
So, for example, say that you have 1000 unique ids with 10000 dates each, as my example demonstrates. You want to select, say, 200 of these, with a date range of 1000.
So in this case I would simply select on the dates and then do the in-memory comparison, something like this:
df = pd.read_hdf('test.h5','df',where='date>=global_start_date & date<=global_end_date')
df = df[df.index.get_level_values('id').isin(list_of_ids)]
You might also have dates that change per id. So chunk them, this time using a list of ids.
Something like this:
output = []
for i in range(0, len(list_of_ids), 30):   # chunks of 30 ids (the PyTables list limit)
    ids = list_of_ids[i:(i+30)]
    start_date = get_start_date_for_these_ids(ids)   # global
    end_date = get_end_date_for_these_ids(ids)       # global
    where = 'id=ids & date>=start_date & date<=end_date'
    df = pd.read_hdf('test.h5','df',where=where)
    output.append(df)

final_result = pd.concat(output)
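Stripped of the HDF5 call, the chunking itself is just slicing the id list in steps of 30 (sizes here are illustrative):

```python
# hypothetical id list; real ids would come from your data
list_of_ids = list(range(100))

# slices of at most 30 elements, matching the PyTables list limit
chunks = [list_of_ids[i:i+30] for i in range(0, len(list_of_ids), 30)]
sizes = [len(c) for c in chunks]
```

With 100 ids this yields four chunks of sizes 30, 30, 30 and 10.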
The basic idea, then, is to select a superset of the data using the criteria you want, sub-selecting in memory so that it fits, while limiting the number of queries you do (e.g. imagine that your query ends up selecting a single row; if you have to run that query 18M times, that is bad).

