从 pandas.HDFStore 表中选择列

Question

提问by Zelazny7

How can I retrieve specific columns from a pandas HDFStore? I regularly work with very large data sets that are too big to manipulate in memory. I would like to read in a csv file iteratively, append each chunk into HDFStore object, and then work with subsets of the data. I have read in a simple csv file and loaded it into an HDFStore with the following code:

如何从Pandas HDFStore 中检索特定列？我经常使用非常大的数据集，这些数据集太大而无法在内存中操作。我想迭代地读入一个 csv 文件，将每个块附加到 HDFStore 对象中，然后处理数据的子集。我读入了一个简单的 csv 文件，并使用以下代码将其加载到 HDFStore 中：

tmp = pd.HDFStore('test.h5')
chunker = pd.read_csv('cars.csv', iterator=True, chunksize=10, names=['make','model','drop'])
tmp.append('df', pd.concat([chunk for chunk in chunker], ignore_index=True))

And the output:

和输出：

In [97]: tmp
Out[97]:
<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/df     frame_table (typ->appendable,nrows->1930,indexers->[index])

My Question is how do I access specific columns from tmp['df']? The documenation makes mention of a select()method and some Termobjects. The examples provided are applied to Panel data; however, and I'm too much of a novice to extend it to the simpler data frame case. My guess is that I have to create an index of the columns somehow. Thanks!

我的问题是如何访问特定列tmp['df']？该文档提到了一个select()方法和一些Term对象。提供的示例适用于面板数据；但是，我太新手了，无法将其扩展到更简单的数据框案例。我的猜测是我必须以某种方式创建列的索引。谢谢！

Answer 1

采纳答案by Jeff

The way HDFStore records tables, the columns are stored by type as single numpy arrays. You always get back all of the columns, you can filter on them, so you will be returned for what you ask. In 0.10.0 you can pass a Term that involves columns.

HDFStore 记录表的方式，列按类型存储为单个 numpy 数组。您总是会返回所有列，您可以对它们进行过滤，因此您将根据您的要求返回。在 0.10.0 中，您可以传递一个涉及列的 Term。

store.select('df', [ Term('index', '>', Timestamp('20010105')), 
                     Term('columns', '=', ['A','B']) ])

or you can reindex afterwards

或者你可以在之后重新索引

df = store.select('df', [ Term('index', '>', Timestamp('20010105') ])
df.reindex(columns = ['A','B'])

The axesis not really the solution here (what you actually created was in effect storing a transposed frame). This parameter allows you to re-order the storage of axes to enable data alignment in different ways. For a dataframe it really doesn't mean much; for 3d or 4d structures, on-disk data alignment is crucial for really fast queries.

这axes并不是真正的解决方案（您实际创建的实际上是存储转置帧）。此参数允许您重新排序轴的存储以不同方式启用数据对齐。对于数据框，它真的没有多大意义；对于 3d 或 4d 结构，磁盘数据对齐对于真正快速的查询至关重要。

0.10.1 will allow a more elegant solution, namely data columns, that is, you can elect certain columns to be represented as there own columns in the table store, so you really can select just them. Here is a taste what is coming.

0.10.1 将允许一个更优雅的解决方案，即数据列，也就是说，您可以选择某些列表示为表存储中自己的列，因此您真的可以只选择它们。这是即将到来的味道。

 store.append('df', columns = ['A','B','C'])
 store.select('df', [ 'A > 0', Term('index', '>', Timestamp(2000105)) ])

Another way to do go about this is to store separate tables in different nodes of the file, then you can select only what you need.

另一种方法是将单独的表存储在文件的不同节点中，然后您可以只选择您需要的内容。

In general, I recommend again really wide tables. hayden offers up the Panel solution, which might be a benefit for you now, as the actual data arangement should reflect how you want to query the data.

一般来说，我再次推荐非常宽的桌子。hayden 提供了 Panel 解决方案，这可能对您现在有好处，因为实际的数据排列应该反映您想要查询数据的方式。

Answer 2

回答by Andy Hayden

You can store the dataframe with an index of the columns as follows:

您可以使用列索引存储数据框，如下所示：

import pandas as pd
import numpy as np
from pandas.io.pytables import Term

index = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame( np.random.randn(8,3), index=index, columns=list('ABC'))  

store = pd.HDFStore('mydata.h5')
store.append('df_cols', df, axes='columns')

and then select as you might hope:

然后按照您的希望进行选择：

In [8]: store.select('df_cols', [Term('columns', '=', 'A')])
Out[8]: 
2000-01-01    0.347644
2000-01-02    0.477167
2000-01-03    1.419741
2000-01-04    0.641400
2000-01-05   -1.313405
2000-01-06   -0.137357
2000-01-07   -1.208429
2000-01-08   -0.539854

Where:

在哪里：

In [9]: df
Out[9]: 
                   A         B         C
2000-01-01  0.347644  0.895084 -1.457772
2000-01-02  0.477167  0.464013 -1.974695
2000-01-03  1.419741  0.470735 -0.309796
2000-01-04  0.641400  0.838864 -0.112582
2000-01-05 -1.313405 -0.678250 -0.306318
2000-01-06 -0.137357 -0.723145  0.982987
2000-01-07 -1.208429 -0.672240  1.331291
2000-01-08 -0.539854 -0.184864 -1.056217

.

To me this isn't an ideal solution, as we can only indexing the DataFrame by one thing! Worryingly the docsseem to suggest you canonly index a DataFrame by one thing, at least using axes:

对我来说，这不是一个理想的解决方案，因为我们只能通过一件事来索引 DataFrame！令人担忧的是，文档似乎建议您只能通过一件事来索引 DataFrame，至少使用axes：

Pass the axes keyword with a list of dimension (currently must by exactly 1 less than the total dimensions of the object).

传递带有维度列表的轴关键字（当前必须正好比对象的总维度小 1）。

I may be reading this incorrectly, in which case hopefully someone can prove me wrong!

我可能读错了，在这种情况下，希望有人能证明我是错的！

.

Note: One way I have found to index a DataFrame by two things (index and columns), is to convert it to a Panel, which can then retrieve using two indices. However then we have to convert to the selected subpanel to a DataFrame each time items are retrieved... again, not ideal.

注意：我发现通过两件事（索引和列）索引 DataFrame 的一种方法是将其转换为面板，然后可以使用两个索引进行检索。但是，每次检索项目时，我们都必须将选定的子面板转换为 DataFrame ......同样，这并不理想。

Answer 3

回答by Eagle Liang

From now on，U can use query expression instead Termconstruction. e.g: store.select('df', "index > Timestamp('20000105')")

从现在开始，你可以使用查询表达式代替Term构造。例如：store.select('df', "index > Timestamp('20000105')")

从 pandas.HDFStore 表中选择列

提问by Zelazny7

采纳答案by Jeff

回答by Andy Hayden

回答by Eagle Liang

相关推荐

最近更新

标签

从 pandas.HDFStore 表中选择列

提问by Zelazny7

采纳答案by Jeff

回答by Andy Hayden

回答by Eagle Liang

相关推荐

在大型 DataFrame 上对 Pandas 进行排列的有效方法

pandas 熊猫滚动应用缺失数据

pandas 熊猫数据框，按值复制

Python Pandas：根据出现次数删除条目

相关推荐

最近更新

标签