
Disclaimer: this page is a translated mirror of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow. Original source: http://stackoverflow.com/questions/15798209/


Pandas "Group By" Query on Large Data in HDFStore?

python, pandas, pytables

Asked by technomalogical

I have about 7 million rows in an HDFStore with more than 60 columns. The data is more than I can fit into memory. I'm looking to aggregate the data into groups based on the value of a column "A". The documentation for pandas splitting/aggregating/combining assumes that I have all my data in a DataFrame already, however I can't read the entire store into an in-memory DataFrame. What is the correct approach for grouping data in an HDFStore?


Accepted answer by Jeff

Here's a complete example.


import numpy as np
import pandas as pd
import os

fname = 'groupby.h5'

# create a frame
df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'foo',
                         'bar', 'bar', 'bar', 'bar',
                         'foo', 'foo', 'foo'],
                   'B': ['one', 'one', 'one', 'two',
                         'one', 'one', 'one', 'two',
                         'two', 'two', 'one'],
                   'C': ['dull', 'dull', 'shiny', 'dull',
                         'dull', 'shiny', 'shiny', 'dull',
                         'shiny', 'shiny', 'shiny'],
                   'D': np.random.randn(11),
                   'E': np.random.randn(11),
                   'F': np.random.randn(11)})


# create the store and append, using data_columns on the columns
# where I could possibly aggregate
# (pd.get_store was removed in later pandas; pd.HDFStore works the same way)
with pd.HDFStore(fname) as store:
    store.append('df', df, data_columns=['A', 'B', 'C'])
    print("store:\n%s" % store)

    print("\ndf:\n%s" % store['df'])

    # get the unique group keys without reading the whole table
    groups = store.select_column('df', 'A').unique()
    print("\ngroups: %s" % groups)

    # iterate over the groups and apply my operations
    l = []
    for g in groups:

        # select only this group's rows into memory
        grp = store.select('df', where='A == %r' % g)

        # this is a regular frame, aggregate however you would like
        l.append(grp[['D', 'E', 'F']].sum())

    print("\nresult:\n%s" % pd.concat(l, keys=groups))

os.remove(fname)

Output


store:
<class 'pandas.io.pytables.HDFStore'>
File path: groupby.h5
/df            frame_table  (typ->appendable,nrows->11,ncols->6,indexers->[index],dc->[A,B,C])

df:
      A    B      C         D         E         F
0   foo  one   dull -0.815212 -1.195488 -1.346980
1   foo  one   dull -1.111686 -1.814385 -0.974327
2   foo  one  shiny -1.069152 -1.926265  0.360318
3   foo  two   dull -0.472180  0.698369 -1.007010
4   bar  one   dull  1.329867  0.709621  1.877898
5   bar  one  shiny -0.962906  0.489594 -0.663068
6   bar  one  shiny -0.657922 -0.377705  0.065790
7   bar  two   dull -0.172245  1.694245  1.374189
8   foo  two  shiny -0.780877 -2.334895 -2.747404
9   foo  two  shiny -0.257413  0.577804 -0.159316
10  foo  one  shiny  0.737597  1.979373 -0.236070

groups:Index([bar, foo], dtype=object)

result:
bar  D   -0.463206
     E    2.515754
     F    2.654810
foo  D   -3.768923
     E   -4.015488
     F   -6.110789
dtype: float64

Some caveats:


1) This methodology makes sense if your group density is relatively low: on the order of hundreds or thousands of groups. If you have more than that, there are more efficient (but more complicated) methods, and the function you are applying (in this case sum) becomes more restrictive.


Essentially you would iterate over the entire store in chunks, grouping as you go, but keeping the groups only semi-collapsed (imagine computing a mean: you would keep a running total plus a running count, then divide at the end). So some operations would be a bit trickier, but this can potentially handle MANY groups (and is really fast).

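A minimal sketch of that chunked, semi-collapsed approach, computing a per-group mean (the store name and columns here are just illustrative, mirroring the example above):

```python
import os

import numpy as np
import pandas as pd

fname = 'chunked_mean.h5'

# a demo table: one grouping column 'A', one value column 'D'
df = pd.DataFrame({'A': np.random.choice(['foo', 'bar', 'baz'], 1000),
                   'D': np.random.randn(1000)})

with pd.HDFStore(fname) as store:
    store.append('df', df, data_columns=['A'])

    # keep the groups only semi-collapsed: a running sum and a
    # running count per group, merged across chunks
    running_sum = pd.Series(dtype=float)
    running_count = pd.Series(dtype=float)

    # with chunksize, select yields in-memory DataFrames one at a time,
    # so only one chunk ever has to fit in memory
    for chunk in store.select('df', chunksize=100):
        g = chunk.groupby('A')['D']
        running_sum = running_sum.add(g.sum(), fill_value=0)
        running_count = running_count.add(g.count(), fill_value=0)

    # collapse at the end: mean = total / count
    group_means = running_sum / running_count
    print(group_means)

os.remove(fname)
```

The result matches an in-memory `df.groupby('A')['D'].mean()`; the same pattern works for any aggregate you can express as combinable partial state.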

2) the efficiency of this could be improved by saving the coordinates (e.g. the group row locations), but this is a bit more complicated

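For example (a sketch using select_as_coordinates, which returns the row numbers matching a query; the coordinates can be kept and reused in later selects without re-running the query — the file and frame here are toy stand-ins):

```python
import os

import numpy as np
import pandas as pd

fname = 'coords.h5'
df = pd.DataFrame({'A': ['foo', 'bar'] * 5, 'D': np.random.randn(10)})

with pd.HDFStore(fname) as store:
    store.append('df', df, data_columns=['A'])

    # run the query once, keeping only the matching row coordinates
    coords = store.select_as_coordinates('df', 'A == "foo"')

    # ...later, pull those rows directly by coordinate
    grp = store.select('df', where=coords)
    print(grp['D'].sum())

os.remove(fname)
```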

3) multi-grouping is not possible with this scheme (it IS possible, but requires an approach more like 2) above)


4) the columns that you want to group on MUST be data_columns!

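To illustrate (a toy sketch: only columns declared via data_columns are individually indexed on disk, so querying any other column raises; recent pandas raises ValueError here):

```python
import os

import numpy as np
import pandas as pd

fname = 'dc.h5'
df = pd.DataFrame({'A': ['foo', 'bar'] * 5, 'D': np.random.randn(10)})

with pd.HDFStore(fname) as store:
    # only 'A' is a data_column; 'D' is packed into the values block
    store.append('df', df, data_columns=['A'])

    store.select('df', where='A == "foo"')   # fine: A is a data_column

    queryable_error = None
    try:
        store.select('df', where='D > 0')    # D is not queryable
    except ValueError as err:
        queryable_error = err
    print('cannot query on D:', queryable_error)

os.remove(fname)
```

Passing data_columns=True when appending indexes every column, at some cost in write speed and file size.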

5) by the way, you can combine any other filter you wish in the select (which is also a sneaky way of doing multi-grouping: form the unique value lists for the two grouping columns and iterate over their product; not extremely efficient if you have lots of groups, but it can work)

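A sketch of that product trick for two grouping columns (a small stand-in frame; combinations absent from the data are simply skipped):

```python
import os
from itertools import product

import numpy as np
import pandas as pd

fname = 'multigroup.h5'
df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar'],
                   'B': ['one', 'two', 'one', 'two'],
                   'D': np.random.randn(4)})

with pd.HDFStore(fname) as store:
    store.append('df', df, data_columns=['A', 'B'])

    # unique values of each grouping column, read without loading the table
    a_vals = store.select_column('df', 'A').unique()
    b_vals = store.select_column('df', 'B').unique()

    pieces, keys = [], []
    for a, b in product(a_vals, b_vals):
        # combine both filters in a single select
        grp = store.select('df', where='A == %r & B == %r' % (a, b))
        if len(grp):
            pieces.append(grp['D'].sum())
            keys.append((a, b))

    result = pd.Series(pieces,
                       index=pd.MultiIndex.from_tuples(keys, names=['A', 'B']))
    print(result)

os.remove(fname)
```

The result matches an in-memory `df.groupby(['A', 'B'])['D'].sum()` for the combinations that exist.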

HTH


let me know if this works for you
