How does one append large amounts of data to a Pandas HDFStore and get a natural unique index?

Note: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/16997048/


Tags: python, indexing, pandas, dataframe, hdfstore

Asked by Ben Scherrey

I'm importing large amounts of HTTP logs (80GB+) into a Pandas HDFStore for statistical processing. Even within a single import file I need to batch the content as I load it. My tactic so far has been to read the parsed lines into a DataFrame, then store the DataFrame into the HDFStore. My goal is to have the index key be unique for a single key in the DataStore, but each DataFrame restarts its own index values again. I was anticipating HDFStore.append() would have some mechanism to tell it to ignore the DataFrame index values and just keep adding to my HDFStore key's existing index values, but I cannot seem to find it. How do I import DataFrames and ignore the index values contained therein, while having the HDFStore increment its existing index values? Sample code below batches every 10 lines. Naturally the real thing would be larger.


if hd_file_name:
    # HDF5 output file specified.
    hdf_output = pd.HDFStore(hd_file_name, complib='blosc')
    print(hdf_output)

    columns = ['source', 'ip', 'unknown', 'user', 'timestamp', 'http_verb', 'path', 'protocol', 'http_result',
               'response_size', 'referrer', 'user_agent', 'response_time']

    # HDF5 Tables don't play nice with unicode, so explicit str(). :(
    source_name = str(log_file.name.rsplit('/')[-1])

    batch = []

    for count, line in enumerate(log_file, 1):
        data = parse_line(line, rejected_output=reject_output)

        # Add our source file name to the beginning.
        data.insert(0, source_name)
        batch.append(data)

        if not (count % 10):
            df = pd.DataFrame(batch, columns=columns)
            hdf_output.append(KEY_NAME, df)
            batch = []

    # Flush any remaining partial batch.
    if count % 10:
        df = pd.DataFrame(batch, columns=columns)
        hdf_output.append(KEY_NAME, df)
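
To see the symptom concretely, here is a tiny sketch (the file and key names are illustrative, not from the code above): appending two DataFrames that each carry a default 0..n index leaves duplicate index values in the stored table.

import pandas as pd

store = pd.HDFStore('demo.h5')
store.append('logs', pd.DataFrame({'a': range(3)}))   # index 0, 1, 2
store.append('logs', pd.DataFrame({'a': range(3)}))   # index 0, 1, 2 again
print(store.select('logs').index.tolist())            # [0, 1, 2, 0, 1, 2]
store.close()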

Answered by Jeff

You can do it like this. The only trick is that the first time through, the store table doesn't exist, so get_storer will raise.


import pandas as pd
import numpy as np
import os

files = ['test1.csv', 'test2.csv']
for f in files:
    pd.DataFrame(np.random.randn(10, 2), columns=list('AB')).to_csv(f)

path = 'test.h5'
if os.path.exists(path):
    os.remove(path)

with pd.HDFStore(path) as store:
    for f in files:
        df = pd.read_csv(f, index_col=0)
        try:
            # Number of rows already stored under this key.
            nrows = store.get_storer('foo').nrows
        except KeyError:
            # First pass: the 'foo' table doesn't exist yet.
            nrows = 0

        # Shift the incoming index past the rows already stored.
        df.index = pd.Series(df.index) + nrows
        store.append('foo', df)


In [10]: pd.read_hdf('test.h5','foo')
Out[10]: 
           A         B
0   0.772017  0.153381
1   0.304131  0.368573
2   0.995465  0.799655
3  -0.326959  0.923280
4  -0.808376  0.449645
5  -1.336166  0.236968
6  -0.593523 -0.359080
7  -0.098482  0.037183
8   0.315627 -1.027162
9  -1.084545 -1.922288
10  0.412407 -0.270916
11  1.835381 -0.737411
12 -0.607571  0.507790
13  0.043509 -0.294086
14 -0.465210  0.880798
15  1.181344  0.354411
16  0.501892 -0.358361
17  0.633256  0.419397
18  0.932354 -0.603932
19 -0.341135  2.453220
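
To apply the same offset trick to the batching loop from the question, one could wrap it in a small helper. A minimal sketch, not part of the original answer; the name append_with_offset is illustrative:

import pandas as pd

def append_with_offset(store, key, df):
    # Shift df's index past the rows already stored under this key,
    # so the table keeps one continuously increasing index.
    try:
        nrows = store.get_storer(key).nrows   # rows already in the table
    except KeyError:
        nrows = 0                             # first batch: table doesn't exist yet
    df.index = pd.RangeIndex(nrows, nrows + len(df))
    store.append(key, df)

The question's inner loop then just builds each batch DataFrame and calls append_with_offset(hdf_output, KEY_NAME, df).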

You actually don't necessarily need a globally unique index (unless you want one), as HDFStore (through PyTables) provides one by uniquely numbering rows. You can always add these selection parameters.


In [11]: pd.read_hdf('test.h5','foo',start=12,stop=15)
Out[11]: 
           A         B
12 -0.607571  0.507790
13  0.043509 -0.294086
14 -0.465210  0.880798
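
For the 80GB+ use case in the question, the same start/stop semantics also make chunked reads straightforward, so the whole table never has to fit in memory at once. A short sketch; the chunk size of 5 is arbitrary, and chunksize is a standard pd.read_hdf parameter for table-format stores:

for chunk in pd.read_hdf('test.h5', 'foo', chunksize=5):
    # Each chunk is an ordinary DataFrame covering a slice of rows.
    print(chunk.shape)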