pandas HDFStore.append(string, DataFrame) 当字符串列的内容比已有的内容长时失败
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/15988871/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
HDFStore.append(string, DataFrame) fails when string column contents are longer than those already there
提问by ultra909
I have a Pandas DataFrame stored via an HDFStore that essentially stores summary rows about test runs I am doing.
我有一个通过 HDFStore 存储的 Pandas DataFrame,它本质上存储了关于我正在做的测试运行的摘要行。
Several of the fields in each row contain descriptive strings of variable length.
每行中的几个字段包含可变长度的描述性字符串。
When I do a test run, I create a new DataFrame with a single row in it:
当我进行测试运行时,我创建了一个新的 DataFrame,其中只有一行:
def export_as_df(self):
return pd.DataFrame(data=[self._to_dict()], index=[datetime.datetime.now()])
And then call HDFStore.append(string, DataFrame)to add the new row to the existing DataFrame.
然后调用HDFStore.append(string, DataFrame)将新行添加到现有 DataFrame。
This works fine, apart from where one of the string columns contents is larger than the longest instance already existing, whereupon I get the following error:
这工作正常,除了其中一个字符串列内容大于已经存在的最长实例,因此我收到以下错误:
File "<ipython-input-302-a33c7955df4a>", line 516, in save_pytables
store.append('tests', test.export_as_df())
File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas/io/pytables.py", line 532, in append
self._write_to_group(key, value, table=True, append=True, **kwargs)
File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas/io/pytables.py", line 788, in _write_to_group
s.write(obj = value, append=append, complib=complib, **kwargs)
File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas/io/pytables.py", line 2491, in write
min_itemsize=min_itemsize, **kwargs)
File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas/io/pytables.py", line 2254, in create_axes
raise Exception("cannot find the correct atom type -> [dtype->%s,items->%s] %s" % (b.dtype.name, b.items, str(detail)))
Exception: cannot find the correct atom type -> [dtype->object,items->Index([bp, id, inst, per, sp, st, title], dtype=object)] [values_block_3] column has a min_itemsize of [51] but itemsize [46] is required!
I can't find any documentation about how to specify string length when creating a DataFrame. What is the solution here?
我找不到关于如何在创建 DataFrame 时指定字符串长度的任何文档。这里的解决方案是什么?
Update:
更新:
Code that is failing:
失败的代码:
store = pd.HDFStore(pytables_store)
for test in self.backtests:
try:
min_itemsizes = { 'buy_pattern' : 60, 'sell_pattern': 60, 'strategy': 60, 'title': 60 }
store.append('tests', test.export_as_df(), min_itemsize = min_itemsizes)
Here's the error under 0.11rc1:
这是0.11rc1下的错误:
File "<ipython-input-110-492b7b6603d7>", line 522, in save_pytables
store.append('tests', test.export_as_df(), min_itemsize = min_itemsizes)
File "/Users/admin/dev/pandas/pandas-0.11.0rc1/pandas/io/pytables.py", line 610, in append
self._write_to_group(key, value, table=True, append=True, **kwargs)
File "/Users/admin/dev/pandas/pandas-0.11.0rc1/pandas/io/pytables.py", line 871, in _write_to_group
s.write(obj = value, append=append, complib=complib, **kwargs)
File "/Users/admin/dev/pandas/pandas-0.11.0rc1/pandas/io/pytables.py", line 2707, in write
min_itemsize=min_itemsize, **kwargs)
File "/Users/admin/dev/pandas/pandas-0.11.0rc1/pandas/io/pytables.py", line 2447, in create_axes
self.validate_min_itemsize(min_itemsize)
File "/Users/admin/dev/pandas/pandas-0.11.0rc1/pandas/io/pytables.py", line 2184, in validate_min_itemsize
raise ValueError("min_itemsize has [%s] which is not an axis or data_column" % k)
ValueError: min_itemsize has [buy_pattern] which is not an axis or data_column
Data sample:
数据样本:
all_day buy_pattern \
2013-04-14 12:11:44.377695 False Hammer() and LowerLow()
id instrument \
2013-04-14 12:11:44.377695 tafdcc96ba4eb11e2a86d14109fcecd49 EURUSD
open_margin periodicity sell_pattern strategy \
2013-04-14 12:11:44.377695 0.0001 1:00:00 Tsl()
title top_bottom wick_body
2013-04-14 12:11:44.377695 tsl 0.5 2
dtypes:
数据类型:
print prob_test.export_as_df().get_dtype_counts()
bool 1
float64 2
int64 1
object 7
dtype: int64
I am deleting the h5 file each time as I want clean results. Wondering if there is something as silly as it is failing because the df does not exist in the h5 (and hence neither do any columns) at the time of the first append?
我每次都删除 h5 文件,因为我想要干净的结果。想知道是否有一些像失败一样愚蠢的事情,因为 df 在第一次追加时不存在于 h5 中(因此也不存在任何列)?
采纳答案by Jeff
Here is the link to the new docs section about this: http://pandas.pydata.org/pandas-docs/stable/io.html#string-columns
这是有关此新文档部分的链接:http: //pandas.pydata.org/pandas-docs/stable/io.html#string-columns
This issue is that you are specifiying a column in min_itemsize that is not a data_column. Simple workaround is to add data_columns=Trueto your append statement, but I have also updated the code to automatically create the data_columns if you pass a valid column name. I think this makes sense, you want to have a minimum column size, so let it happen.
这个问题是您在 min_itemsize 中指定了一个不是 data_column 的列。简单的解决方法是添加data_columns=True到您的 append 语句中,但我还更新了代码以在您传递有效的列名时自动创建 data_columns。我认为这是有道理的,你想要一个最小的列大小,所以让它发生。
Also created a new doc section String Columns to show a more complete example with explanation (docs will be updated soon).
还创建了一个新的文档部分 String Columns 以显示更完整的示例和说明(文档将很快更新)。
# this is the new behavior (after code updates)
n [340]: dfs = DataFrame(dict(A = 'foo', B = 'bar'),index=range(5))
In [341]: dfs
Out[341]:
A B
0 foo bar
1 foo bar
2 foo bar
3 foo bar
4 foo bar
# A and B have a size of 30
In [342]: store.append('dfs', dfs, min_itemsize = 30)
In [343]: store.get_storer('dfs').table
Out[343]:
/dfs/table (Table(5,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": StringCol(itemsize=30, shape=(2,), dflt='', pos=1)}
byteorder := 'little'
chunkshape := (963,)
autoIndex := True
colindexes := {
"index": Index(6, medium, shuffle, zlib(1)).is_CSI=False}
# A is created as a data_column with a size of 30
# B is size is calculated
In [344]: store.append('dfs2', dfs, min_itemsize = { 'A' : 30 })
In [345]: store.get_storer('dfs2').table
Out[345]:
/dfs2/table (Table(5,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": StringCol(itemsize=3, shape=(1,), dflt='', pos=1),
"A": StringCol(itemsize=30, shape=(), dflt='', pos=2)}
byteorder := 'little'
chunkshape := (1598,)
autoIndex := True
colindexes := {
"A": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
"index": Index(6, medium, shuffle, zlib(1)).is_CSI=False}

