Problem with HDFStore and string columns in pandas

Question

Asked by uday

I have a pandas DataFrame myDF with a few string columns (whose dtype is object) and many numeric columns. I tried the following:

import pandas

d = pandas.HDFStore(r"C:\PF\Temp.h5")  # raw string keeps the backslashes literal
d['test'] = myDF

I got this result:

C:\PF\WinPython-64bit-3.3.3.3\python-3.3.3.amd64\lib\site-packages\pandas\io\pytables.py:2446: PerformanceWarning: 

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block2_values] 
[items->[0, 1, 3, 4, 5, 6, 9, 10, 292, 411, 412, 477, 478, 479, 495, 572, 581, 590, 599, 608, 617, 626, 635]]

  warnings.warn(ws, PerformanceWarning)

The warning seems to be triggered by every string column. For example, if I try

myDF[0].dtype

I get

Out[38]: dtype('O')

How can I fix this, i.e. change the dtype of the string columns so that HDFStore treats them as string columns?

* EDIT *

More info as requested

>>> pandas.__version__
Out[49]: '0.13.1'

>>> tables.__version__
Out[53]: '3.1.0'

I construct the DataFrame as follows:

myDF = pandas.read_csv(fName, sep="|", header=None, low_memory=False)

When I try

myDF.info()

I get

Int64Index: 153895 entries, 0 to 153894
Data columns (total 644 columns):
0      object
1      object
2      int64
3      object
4      object
5      object
6      object
7      int64
8      float64
9      object
10     object
11     float64
12     float64
13     float64
14     float64
...
...
642    float64
643    float64
dtypes: float64(619), int64(2), object(23)

All string columns have been read as object.

Answer 1

Answered by Jeff

This warning ONLY occurs when a single column contains mixed types: not just strings, but strings AND numbers.

In [1]: from pandas import DataFrame

In [2]: DataFrame({ 'A' : [1.0,'foo'] }).to_hdf('test.h5','df',mode='w')
pandas/io/pytables.py:2439: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_values] [items->['A']]

  warnings.warn(ws, PerformanceWarning)

In [3]: df = DataFrame({ 'A' : [1.0,'foo'] })

In [4]: df
Out[4]: 
     A
0    1
1  foo

[2 rows x 1 columns]

In [5]: df.dtypes
Out[5]: 
A    object
dtype: object

In [6]: df['A']
Out[6]: 
0      1
1    foo
Name: A, dtype: object

In [7]: df['A'].values
Out[7]: array([1.0, 'foo'], dtype=object)

So you need to make sure that you don't mix types WITHIN a column.
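
To find out which columns are affected, one option is to list the Python types that actually occur in each object column. This is only a minimal sketch: `mixed_type_columns` is a hypothetical helper, not part of pandas, and the sample frame is made up for illustration. On pandas versions that store strings as object, a missing value is a float NaN, so a string column with gaps would be reported here as a mix of str and float.

```python
import pandas as pd


def mixed_type_columns(df):
    """Map each object column holding more than one Python type to its type names.

    Hypothetical helper for illustration; not a pandas API.
    """
    mixed = {}
    for col in df.columns:
        if df[col].dtype != object:
            continue  # numeric (and other non-object) columns cannot be mixed
        # Collect the distinct type names found in this column
        kinds = sorted(set(df[col].map(lambda v: type(v).__name__)))
        if len(kinds) > 1:
            mixed[col] = kinds
    return mixed


# Made-up sample: only column 'A' mixes a float with a string
df = pd.DataFrame({"A": [1.0, "foo"], "B": ["x", "y"], "C": [1, 2]})
print(mixed_type_columns(df))  # {'A': ['float', 'str']}
```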

If you have columns that need conversion, you can do this:

In [9]: columns = ['A']

In [10]: df.loc[:,columns] = df[columns].applymap(str)

In [11]: df
Out[11]: 
     A
0  1.0
1  foo

[2 rows x 1 columns]

In [12]: df['A'].values
Out[12]: array(['1.0', 'foo'], dtype=object)
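
On recent pandas releases `DataFrame.applymap` is deprecated (since 2.1, in favor of `DataFrame.map`), so one version-independent way to write the same conversion is to map `str` over each column with `Series.map`. A minimal sketch, using a made-up frame and the same `columns` list as above:

```python
import pandas as pd

# Made-up sample: column 'A' mixes a float with a string
df = pd.DataFrame({"A": [1.0, "foo"], "B": [1, 2]})
columns = ["A"]  # columns known to mix strings and numbers

for col in columns:
    # str() is applied to every element, so 1.0 becomes '1.0'
    df[col] = df[col].map(str)

print(df["A"].tolist())  # ['1.0', 'foo']
```

Be aware that `str` also turns a missing value (NaN) into the literal string 'nan', which may not be what you want; passing `na_action='ignore'` to `map` leaves missing values untouched.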
