Pandas seems to ignore the first column name when reading tab-delimited data, gives KeyError

Question

Asked by RobTeszka

I am using pandas 0.12.0 in ipython3 on Ubuntu 13.10, in order to wrangle large tab-delimited datasets in txt files. Using read_table to create a DataFrame from the txt appears to work, and the first row is read as a header, but attempting to access the first column using its name as an index throws a KeyError. I don't understand why this happens, given that the column names all appear to have been read correctly, and every other column can be indexed in this way.

The data looks like this:

RECORDING_SESSION_LABEL LEFT_GAZE_X LEFT_GAZE_Y RIGHT_GAZE_X    RIGHT_GAZE_Y    VIDEO_FRAME_INDEX   VIDEO_NAME
73_1    .   .   395.1   302 .   .
73_1    .   .   395 301.9   .   .
73_1    .   .   394.9   301.7   .   .
73_1    .   .   394.8   301.5   .   .
73_1    .   .   394.6   301.3   .   .
73_1    .   .   394.7   300.9   .   .
73_1    .   .   394.9   301.3   .   .
73_1    .   .   395.2   302 1   1_1_just_act.avi
73_1    .   .   395.3   302.3   1   1_1_just_act.avi
73_1    .   .   395.4   301.9   1   1_1_just_act.avi
73_1    .   .   395.7   301.5   1   1_1_just_act.avi
73_1    .   .   395.9   301.5   1   1_1_just_act.avi
73_1    .   .   396 301.5   1   1_1_just_act.avi
73_1    .   .   395.9   301.5   1   1_1_just_act.avi
15_1    395.4   301.7   .   .   .   .

The delimiter is definitely tabs, and there is no trailing or leading whitespace.

The error occurs with this minimal program:

import pandas as pd

samples = pd.read_table('~/datafile.txt')

print(samples['RECORDING_SESSION_LABEL'])

which gives the error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-65-137d3c16b931> in <module>()
----> 1 print(samples['RECORDING_SESSION_LABEL'])

/usr/lib/python3/dist-packages/pandas/core/frame.py in __getitem__(self, key)
   2001             # get column
   2002             if self.columns.is_unique:
-> 2003                 return self._get_item_cache(key)
   2004 
   2005             # duplicate columns

/usr/lib/python3/dist-packages/pandas/core/generic.py in _get_item_cache(self, item)
    665             return cache[item]
    666         except Exception:
--> 667             values = self._data.get(item)
    668             res = self._box_item_values(item, values)
    669             cache[item] = res

/usr/lib/python3/dist-packages/pandas/core/internals.py in get(self, item)
   1654     def get(self, item):
   1655         if self.items.is_unique:
-> 1656             _, block = self._find_block(item)
   1657             return block.get(item)
   1658         else:

/usr/lib/python3/dist-packages/pandas/core/internals.py in _find_block(self, item)
   1934 
   1935     def _find_block(self, item):
-> 1936         self._check_have(item)
   1937         for i, block in enumerate(self.blocks):
   1938             if item in block:

/usr/lib/python3/dist-packages/pandas/core/internals.py in _check_have(self, item)
   1941     def _check_have(self, item):
   1942         if item not in self.items:
-> 1943             raise KeyError('no item named %s' % com.pprint_thing(item))
   1944 
   1945     def reindex_axis(self, new_axis, method=None, axis=0, copy=True):

KeyError: 'no item named RECORDING_SESSION_LABEL'

Simply doing print(samples) gives the expected output: the whole table is printed, complete with the first column and its header. Printing any other column (i.e., the exact same code, but with 'RECORDING_SESSION_LABEL' replaced with 'LEFT_GAZE_X') works as it should. Furthermore, the header seems to have been read in correctly, and pandas appears to recognize 'RECORDING_SESSION_LABEL' as a column name, as shown by calling the .info() method and viewing the .columns attribute of samples after it has been read in:

>samples.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 28 entries, 0 to 27
Data columns (total 7 columns):
RECORDING_SESSION_LABEL    28  non-null values
LEFT_GAZE_X                 28  non-null values
LEFT_GAZE_Y                 28  non-null values
RIGHT_GAZE_X                28  non-null values
RIGHT_GAZE_Y                28  non-null values
VIDEO_FRAME_INDEX           28  non-null values
VIDEO_NAME                  28  non-null values
dtypes: object(7)

>print(samples.columns)

Index(['?RECORDING_SESSION_LABEL', 'LEFT_GAZE_X', 'LEFT_GAZE_Y', 'RIGHT_GAZE_X', 'RIGHT_GAZE_Y', 'VIDEO_FRAME_INDEX', 'VIDEO_NAME'], dtype=object)

Another odd behaviour that I feel is related occurs with ipython's tab completion, which lets me access the columns of samples as if they were attributes. It works for every column except the first, i.e., hitting the tab key after typing samples.R only suggests samples.RIGHT_GAZE_X and samples.RIGHT_GAZE_Y.

So why is it behaving normally when looking at the whole dataframe, but failing when trying to access the first column by its name, even though it appears to have correctly read in that name?

Answer 1

Accepted answer by robbles

Sounds like you just need to conditionally remove the BOM from the start of your files. You can do this with a wrapper around the file like so:

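import codecs
import os

import pandas as pd

def remove_bom(filename):
    # open() does not expand '~', so expand it explicitly
    fp = open(os.path.expanduser(filename), 'rb')
    # skip the 3-byte UTF-8 BOM (EF BB BF) if present, otherwise rewind
    if fp.read(len(codecs.BOM_UTF8)) != codecs.BOM_UTF8:
        fp.seek(0)
    return fp

# read_table also accepts a file pointer, so we can remove the bom first
samples = pd.read_table(remove_bom('~/datafile.txt'))

print(samples['RECORDING_SESSION_LABEL'])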
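
Before stripping anything, it can help to confirm that a BOM is really there and which one it is. The following is a minimal sketch using only the standard library; the helper name detect_bom is hypothetical and not part of pandas, and it only looks at the first four bytes of the file.

```python
import codecs

# Checked in this order because the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(path):
    """Return the name of the BOM at the start of `path`, or None (hypothetical helper)."""
    with open(path, "rb") as fp:
        head = fp.read(4)
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None
```

If this returns "utf-8" for the data file, the first header name almost certainly carries the invisible BOM character, which would explain the KeyError.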

Answer 2

Answered by DSM

This seems to be (related to) a known issue, see GH #4793. Using 'utf-8-sig' as the encoding seems to work. Without it, we have:

>>> df = pd.read_table("datafile.txt")
>>> df.columns
Index([u'RECORDING_SESSION_LABEL', u'LEFT_GAZE_X', u'LEFT_GAZE_Y', u'RIGHT_GAZE_X', u'RIGHT_GAZE_Y', u'VIDEO_FRAME_INDEX', u'VIDEO_NAME'], dtype='object')
>>> df.columns[0]
'\xef\xbb\xbfRECORDING_SESSION_LABEL'

but with it, we have

>>> df = pd.read_table("datafile.txt", encoding="utf-8-sig")
>>> df.columns
Index([u'RECORDING_SESSION_LABEL', u'LEFT_GAZE_X', u'LEFT_GAZE_Y', u'RIGHT_GAZE_X', u'RIGHT_GAZE_Y', u'VIDEO_FRAME_INDEX', u'VIDEO_NAME'], dtype='object')
>>> df.columns[0]
u'RECORDING_SESSION_LABEL'
>>> df["RECORDING_SESSION_LABEL"].max()
u'73_1'

(Used Python 2 for the above, but the same happens with Python 3.)
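
The difference between the two encodings can be reproduced without pandas. This is a small sketch in Python 3 that decodes a made-up header line (the bytes in raw are an illustration, not the actual file) with both codecs and compares the first column name:

```python
# A header line as it would appear on disk with a UTF-8 BOM in front (illustrative bytes).
raw = b"\xef\xbb\xbfRECORDING_SESSION_LABEL\tLEFT_GAZE_X\n"

# Plain "utf-8" keeps the BOM as the character U+FEFF at the start of the first name.
plain = raw.decode("utf-8").rstrip("\n").split("\t")

# "utf-8-sig" drops a leading BOM while decoding.
stripped = raw.decode("utf-8-sig").rstrip("\n").split("\t")

print(ascii(plain[0]))     # '\ufeffRECORDING_SESSION_LABEL'
print(ascii(stripped[0]))  # 'RECORDING_SESSION_LABEL'
print("RECORDING_SESSION_LABEL" in plain)     # False
print("RECORDING_SESSION_LABEL" in stripped)  # True
```

In Python 3 the BOM shows up as the single character '\ufeff' rather than the three bytes '\xef\xbb\xbf' seen in the Python 2 session above, but the effect on the lookup is the same.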

Answer 3

Answered by StefanK

I also stumbled upon a similar problem. When I read the file with df = pandas.read_csv(csvfile, sep), the first column had this strange format in its name:

df.columns[0]

returned this result:

'\xef\xbb\xbfColName'

When I tried selecting this column, I got an error:

df.ColName

returned

AttributeError: 'DataFrame' object has no attribute 'ColName'

After reading this I just used my external program Sublime to change the encoding and save the file as a new file (save with encoding UTF-8, but without BOM).

Afterwards pandas reads the first column name correctly, and I am able to select it with df.ColName and get the correct value. Such a small thing that took 45 minutes to solve.

TLDR: Save file with encoding without BOM.
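
The same re-save can be scripted instead of done in an editor. A minimal sketch, assuming the file is UTF-8 text; the function name strip_utf8_bom and the file names in the usage comment are hypothetical:

```python
def strip_utf8_bom(src_path, dst_path):
    """Copy a UTF-8 text file to dst_path, dropping a leading BOM if there is one."""
    # "utf-8-sig" skips the BOM on read when present and is harmless otherwise.
    # newline="" leaves the original line endings untouched.
    with open(src_path, "r", encoding="utf-8-sig", newline="") as src:
        text = src.read()
    # Plain "utf-8" never writes a BOM.
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


# Example usage (hypothetical file names):
# strip_utf8_bom("datafile.txt", "datafile_nobom.txt")
```

Writing to a new file, as in the answer, keeps the original around in case the encoding assumption turns out to be wrong.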

Answer 4

Answered by tinybike

I think the issue you're having is just that the "tabs" in datafile.txt aren't actually tabs. (When I read it in using your code, the dataframe has 1 column and 15 rows.) You could do a regex search-and-replace, or, alternately, just parse it as-is:

import os

import pandas as pd
from numpy import transpose

# open() does not expand '~', so expand it explicitly
with open(os.path.expanduser('~/datafile.txt'), 'r') as datafile:
    data = datafile.read()
while '  ' in data:
    data = data.replace('  ', ' ')
data = transpose([row.split(' ') for row in data.strip().split('\n')])
datadict = {}
for col in data:
    datadict[col[0]] = col[1:]
samples = pd.DataFrame(datadict)
print(samples['RECORDING_SESSION_LABEL'])

This works ok for me on your datafile.txt: the resulting dataframe has 15 rows x 7 columns.
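
If the columns really are separated by runs of spaces rather than tabs, the replace loop can be avoided by splitting each line on any run of whitespace. A minimal sketch using only the standard library; the function name parse_whitespace_table is hypothetical, and it assumes every row has as many fields as the header and that no field contains whitespace:

```python
def parse_whitespace_table(text):
    """Build a {column name: list of values} dict from whitespace-separated text."""
    # str.split() with no arguments splits on any run of spaces or tabs.
    rows = [line.split() for line in text.strip().splitlines()]
    header, body = rows[0], rows[1:]
    return {name: [row[i] for row in body] for i, name in enumerate(header)}


# Two sample rows in the same shape as the question's data.
sample = (
    "RECORDING_SESSION_LABEL LEFT_GAZE_X    RIGHT_GAZE_X\n"
    "73_1    .   395.1\n"
    "15_1    395.4   .\n"
)
columns = parse_whitespace_table(sample)
print(columns["RECORDING_SESSION_LABEL"])  # ['73_1', '15_1']
```

The resulting dict can be passed to pd.DataFrame in the same way as datadict. Within pandas itself, passing sep=r'\s+' to read_table or read_csv splits on whitespace runs in a similar way.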
