pandas HDFStore.append(string, DataFrame) fails when a string column's contents are longer than those already stored

Question

Asked by ultra909

I have a Pandas DataFrame stored via an HDFStore that essentially stores summary rows about test runs I am doing.

Several of the fields in each row contain descriptive strings of variable length.

When I do a test run, I create a new DataFrame with a single row in it:

def export_as_df(self):
    return pd.DataFrame(data=[self._to_dict()], index=[datetime.datetime.now()])

I then call HDFStore.append(string, DataFrame) to add the new row to the existing DataFrame.

This works fine, except when the contents of one of the string columns are longer than the longest value already stored, whereupon I get the following error:

File "<ipython-input-302-a33c7955df4a>", line 516, in save_pytables
store.append('tests', test.export_as_df())
File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas/io/pytables.py", line 532, in append
self._write_to_group(key, value, table=True, append=True, **kwargs)
File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas/io/pytables.py", line 788, in _write_to_group
s.write(obj = value, append=append, complib=complib, **kwargs)
File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas/io/pytables.py", line 2491, in write
min_itemsize=min_itemsize, **kwargs)
File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas/io/pytables.py", line 2254, in create_axes
raise Exception("cannot find the correct atom type -> [dtype->%s,items->%s] %s" % (b.dtype.name, b.items, str(detail)))
Exception: cannot find the correct atom type -> [dtype->object,items->Index([bp, id, inst, per, sp, st, title], dtype=object)] [values_block_3] column has a min_itemsize of [51] but itemsize [46] is required!

I can't find any documentation about how to specify string length when creating a DataFrame. What is the solution here?

Update:

Code that is failing:

        store = pd.HDFStore(pytables_store)            
        for test in self.backtests:
            try:
                min_itemsizes = { 'buy_pattern' : 60, 'sell_pattern': 60, 'strategy': 60, 'title': 60 }
                store.append('tests', test.export_as_df(), min_itemsize = min_itemsizes)

Here's the error under 0.11rc1:

File "<ipython-input-110-492b7b6603d7>", line 522, in save_pytables
  store.append('tests', test.export_as_df(), min_itemsize = min_itemsizes)
File "/Users/admin/dev/pandas/pandas-0.11.0rc1/pandas/io/pytables.py", line 610, in append
  self._write_to_group(key, value, table=True, append=True, **kwargs)
File "/Users/admin/dev/pandas/pandas-0.11.0rc1/pandas/io/pytables.py", line 871, in _write_to_group
  s.write(obj = value, append=append, complib=complib, **kwargs)
File "/Users/admin/dev/pandas/pandas-0.11.0rc1/pandas/io/pytables.py", line 2707, in write
  min_itemsize=min_itemsize, **kwargs)
File "/Users/admin/dev/pandas/pandas-0.11.0rc1/pandas/io/pytables.py", line 2447, in create_axes
  self.validate_min_itemsize(min_itemsize)
File "/Users/admin/dev/pandas/pandas-0.11.0rc1/pandas/io/pytables.py", line 2184, in validate_min_itemsize
  raise ValueError("min_itemsize has [%s] which is not an axis or data_column" % k)
ValueError: min_itemsize has [buy_pattern] which is not an axis or data_column

Data sample:

                           all_day              buy_pattern  \
2013-04-14 12:11:44.377695   False  Hammer() and LowerLow()   

                                                           id instrument  \
2013-04-14 12:11:44.377695  tafdcc96ba4eb11e2a86d14109fcecd49     EURUSD   

                            open_margin periodicity sell_pattern strategy  \
2013-04-14 12:11:44.377695       0.0001     1:00:00                 Tsl()   

                           title  top_bottom  wick_body  
2013-04-14 12:11:44.377695   tsl         0.5          2

dtypes:

print prob_test.export_as_df().get_dtype_counts() 

    bool       1
    float64    2
    int64      1
    object     7
    dtype: int64

I am deleting the h5 file each time because I want clean results. Could it be failing for a reason as silly as the df (and hence its columns) not yet existing in the h5 file at the time of the first append?

Answer 1

Accepted answer by Jeff

Here is the link to the new docs section about this: http://pandas.pydata.org/pandas-docs/stable/io.html#string-columns

The issue is that you are specifying a column in min_itemsize that is not a data_column. A simple workaround is to add data_columns=True to your append statement, but I have also updated the code to automatically create the data_columns if you pass a valid column name. I think this makes sense: you want a minimum column size, so let it happen.
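
The workaround can be wrapped in a small helper. This is only a sketch: the function name `append_with_string_widths` and the constant `STRING_WIDTHS` are made up for illustration, while the column names and the width of 60 are copied from the failing code in the question. It assumes `store` is an already opened HDFStore.

```python
# Hypothetical helper; the column names and the width of 60 come from the question.
STRING_WIDTHS = {"buy_pattern": 60, "sell_pattern": 60, "strategy": 60, "title": 60}


def append_with_string_widths(store, key, frame, widths=STRING_WIDTHS):
    """Append `frame` under `key`, reserving a minimum width per string column.

    `store` is expected to be an open pandas HDFStore. Passing
    data_columns=True turns every column into a data column, so the column
    names used as keys in `widths` are accepted by min_itemsize.
    """
    store.append(key, frame, data_columns=True, min_itemsize=dict(widths))
```

With a store opened as in the question (`pd.HDFStore(pytables_store)`), the call inside the loop would become `append_with_string_widths(store, 'tests', test.export_as_df())`.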

I also created a new doc section, String Columns, to show a more complete example with explanation (docs will be updated soon).

# this is the new behavior (after code updates)
In [340]: dfs = DataFrame(dict(A = 'foo', B = 'bar'),index=range(5))

In [341]: dfs
Out[341]: 
     A    B
0  foo  bar
1  foo  bar
2  foo  bar
3  foo  bar
4  foo  bar

# A and B have a size of 30
In [342]: store.append('dfs', dfs, min_itemsize = 30)

In [343]: store.get_storer('dfs').table
Out[343]: 
/dfs/table (Table(5,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": StringCol(itemsize=30, shape=(2,), dflt='', pos=1)}
  byteorder := 'little'
  chunkshape := (963,)
  autoIndex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_CSI=False}

# A is created as a data_column with a size of 30
# B's size is calculated
In [344]: store.append('dfs2', dfs, min_itemsize = { 'A' : 30 })

In [345]: store.get_storer('dfs2').table
Out[345]: 
/dfs2/table (Table(5,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": StringCol(itemsize=3, shape=(1,), dflt='', pos=1),
  "A": StringCol(itemsize=30, shape=(), dflt='', pos=2)}
  byteorder := 'little'
  chunkshape := (1598,)
  autoIndex := True
  colindexes := {
    "A": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_CSI=False}
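
Because the string width is fixed when the table is first created, it can help to pick the widths with some headroom before the first append. The following is a rough sketch of one way to do that from the row dicts (such as the one returned by `_to_dict()` in the question). The name `string_widths` and the `headroom` factor are assumptions made for this example, not part of pandas.

```python
import math


def string_widths(rows, headroom=2.0, encoding="utf-8"):
    """Return {column: width} for every column that holds str values in `rows`.

    `rows` is an iterable of dicts, one per DataFrame row. The width is the
    longest encoded value multiplied by `headroom`, rounded up, and never
    smaller than 1. Non-string values are ignored.
    """
    longest = {}
    for row in rows:
        for column, value in row.items():
            if isinstance(value, str):
                size = len(value.encode(encoding))
                longest[column] = max(longest.get(column, 0), size)
    return {column: max(1, math.ceil(size * headroom))
            for column, size in longest.items()}
```

The resulting dict can be passed as `min_itemsize` on the first `store.append` call. A generous `headroom` trades some file size for fewer surprises when a later row carries a longer description.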
