KeyError when setting index in a Pandas dataframe
Source: http://stackoverflow.com/questions/45860035/
Note: this page reproduces a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Asked by Iwan
I'm getting a KeyError when trying to set the index of my dataframe. I haven't encountered this before when setting the index in the same way, and I'm wondering what's going wrong. The data has no column headers, so the DataFrame headers are 0, 1, 2, 3, 4, 5, etc. The error occurs with any column header.
I receive KeyError: '0' when trying to use the first column (which I want to use as the only index).
For context: in the sample below, I'm selecting macro-enabled Excel spreadsheets, squeezing the data, and reading and converting them into DataFrames.
I then want to include the filename in a column, set the index, and strip whitespace so that I can use index labels to extract the data I need. Not every worksheet will have the index labels, so I use try/except to skip the worksheets that don't contain those labels in the index. I then want to concatenate each result into one DataFrame and drop unused columns.
import itertools
import glob
from openpyxl import load_workbook
from pandas import DataFrame
import pandas as pd
import os

def get_data(ws):
    for row in ws.values:
        row_it = iter(row)
        for cell in row_it:
            if cell is not None:
                yield itertools.chain((cell,), row_it)
                break

def read_workbook(file_):
    wb = load_workbook(file_, data_only=True)
    for sheet in wb.worksheets:
        ws = sheet
    return DataFrame(get_data(ws))

path =r'dir'
allFiles = glob.glob(path + "/*.xlsm")
frame = pd.DataFrame()
list_ = []
for file_ in allFiles:
    parsed_file = read_workbook(file_)
    parsed_file['filename'] = os.path.basename(file_)
    parsed_file.set_index(['0'], inplace = True)
    parsed_file.index.str.strip()
    try:
        parsed_file.loc["Staff" : "Total"].copy()
        list_.append(parsed_file)
    except KeyError:
        pass

frame = pd.concat(list_)
print(frame.dropna(axis='columns', thresh=2, inplace = True))
Example dataframe, showing the index position needed and the labels to be extracted:
    index
        0  1  2
0       5  2  4
1   RTJHD  5  9
2    ABCD  4  6
3   Staff  9  3   --- extract from here
4  FHDHSK  3  2
5  IRRJWK  7  1
6  FJDDCN  1  8
7      67  4  7
8   Total  5  3   --- to here
Error
Traceback (most recent call last):
  File "<ipython-input-29-d8fd24ca84ec>", line 1, in <module>
    runfile('dir.py', wdir='C:/dir/Documents')
  File "C:\ProgramData\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 880, in runfile
    execfile(filename, namespace)
  File "C:\ProgramData\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 87, in execfile
    exec(compile(scripttext, filename, 'exec'), glob, loc)
  File "dir.py", line 36, in <module>
    parsed_file.set_index(['0'], inplace = True)
  File "C:\ProgramData\Anaconda2\lib\site-packages\pandas\core\frame.py", line 2830, in set_index
    level = frame[col]._values
  File "C:\ProgramData\Anaconda2\lib\site-packages\pandas\core\frame.py", line 1964, in __getitem__
    return self._getitem_column(key)
  File "C:\ProgramData\Anaconda2\lib\site-packages\pandas\core\frame.py", line 1971, in _getitem_column
    return self._get_item_cache(key)
  File "C:\ProgramData\Anaconda2\lib\site-packages\pandas\core\generic.py", line 1645, in _get_item_cache
    values = self._data.get(item)
  File "C:\ProgramData\Anaconda2\lib\site-packages\pandas\core\internals.py", line 3590, in get
    loc = self.items.get_loc(item)
  File "C:\ProgramData\Anaconda2\lib\site-packages\pandas\core\indexes\base.py", line 2444, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas\_libs\index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5280)
  File "pandas\_libs\index.pyx", line 154, in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5126)
  File "pandas\_libs\hashtable_class_helper.pxi", line 1210, in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas\_libs\hashtable.c:20523)
  File "pandas\_libs\hashtable_class_helper.pxi", line 1218, in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas\_libs\hashtable.c:20477)
KeyError: '0'
Answered by cs95
You're receiving this error because your dataframe is read in without any headers. This means the column labels are integers, held in an Int64Index, so the string '0' does not match the integer label 0:
Int64Index([0, 1, 2, 3, ...], dtype='int64')
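The mismatch can be reproduced without any spreadsheet. The following is only a minimal sketch: the rows are made-up stand-ins for the worksheet data, and lookup_fails is a hypothetical helper written for this illustration, not part of pandas.

```python
import pandas as pd

# Hypothetical rows standing in for the headerless worksheet data.
rows = [("Staff", 9, 3), ("FHDHSK", 3, 2), ("Total", 5, 3)]
df = pd.DataFrame(rows)

# With no headers supplied, pandas labels the columns with integers.
columns_are_integers = pd.api.types.is_integer_dtype(df.columns)


def lookup_fails(frame, key):
    """Hypothetical helper: return True if frame[key] raises KeyError."""
    try:
        frame[key]
    except KeyError:
        return True
    return False


string_key_fails = lookup_fails(df, "0")  # the string '0' is not a column label
int_key_fails = lookup_fails(df, 0)       # the integer 0 is a column label

print(columns_are_integers, string_key_fails, int_key_fails)
```

The same distinction applies to set_index, which looks the label up in the columns before building the index.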
At this point, I would recommend just accessing df.columns by position wherever you're forced to deal with the column labels:
parsed_file.set_index(parsed_file.columns[0], inplace = True)
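As a rough sketch of how that line might fit into the rest of the question's workflow, the example below sets the index by position, strips whitespace, and slices by label. The rows are invented for illustration, and all index labels are assumed to be strings. Note that str.strip() returns a new Index rather than modifying the existing one, so the result is assigned back here.

```python
import pandas as pd

# Hypothetical headerless data; the padded labels imitate untidy spreadsheet cells.
rows = [
    ("ABCD ", 4, 6),
    (" Staff", 9, 3),
    ("FHDHSK", 3, 2),
    ("Total ", 5, 3),
    ("XYZ", 1, 1),
]
parsed_file = pd.DataFrame(rows)

# Refer to the first column by position instead of a hardcoded name.
parsed_file = parsed_file.set_index(parsed_file.columns[0])

# str.strip() returns a new Index, so assign it back to keep the cleaned labels.
parsed_file.index = parsed_file.index.str.strip()

# Label-based slicing with .loc includes both endpoints.
subset = parsed_file.loc["Staff":"Total"]
print(subset)
```

If a worksheet lacks either label, the .loc slice raises KeyError, which is what the try/except in the question relies on.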
Don't hardcode your column names if you're accessing them by position. The alternative is to assign column names of your own and reference those instead.
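A minimal sketch of that alternative follows. The names "label", "value_a" and "value_b" are placeholders invented for this example, and the list assigned to columns must have one entry per existing column.

```python
import pandas as pd

# Hypothetical headerless data with three columns.
rows = [("Staff", 9, 3), ("Total", 5, 3)]
parsed_file = pd.DataFrame(rows)

# Placeholder names chosen for illustration; one name per column.
parsed_file.columns = ["label", "value_a", "value_b"]

# The named column can now be referenced as a string.
parsed_file = parsed_file.set_index("label")
print(parsed_file)
```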