Notice: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/38421170/

Time: 2020-08-19 20:47:02  Source: igfitidea

Can't set index of a pandas data frame - getting "KeyError"

Tags: python, pandas, dataframe, set, row

Asked by Dhruv Ghulati

I generate a data frame that looks like this (summaryDF):

   accuracy        f1  precision    recall
0     0.494  0.722433   0.722433  0.722433
0     0.290  0.826087   0.826087  0.826087
0     0.274  0.629630   0.629630  0.629630
0     0.278  0.628571   0.628571  0.628571
0     0.288  0.718750   0.718750  0.718750
0     0.740  0.740000   0.740000  0.740000
0     0.698  0.765133   0.765133  0.765133
0     0.582  0.778547   0.778547  0.778547
0     0.682  0.748235   0.748235  0.748235
0     0.574  0.767918   0.767918  0.767918
0     0.398  0.711656   0.711656  0.711656
0     0.530  0.780083   0.780083  0.780083

Because I know what each row should represent, I then use this code to set the name of each row (these aren't the actual row names, just placeholders for the sake of argument).

summaryDF = summaryDF.set_index(['A','B','C', 'D','E','F','G','H','I','J','K','L'])

However, I am getting:

level = frame[col].values
  File "/Users/me/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 1797, in __getitem__
    return self._getitem_column(key)
  File "/Users/me/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 1804, in _getitem_column
    return self._get_item_cache(key)
  File "/Users/me/anaconda/lib/python2.7/site-packages/pandas/core/generic.py", line 1084, in _get_item_cache
    values = self._data.get(item)
  File "/Users/me/anaconda/lib/python2.7/site-packages/pandas/core/internals.py", line 2851, in get
    loc = self.items.get_loc(item)
  File "/Users/me/anaconda/lib/python2.7/site-packages/pandas/core/index.py", line 1572, in get_loc
    return self._engine.get_loc(_values_from_object(key))
  File "pandas/index.pyx", line 134, in pandas.index.IndexEngine.get_loc (pandas/index.c:3824)
  File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:3704)
  File "pandas/hashtable.pyx", line 686, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12280)
  File "pandas/hashtable.pyx", line 694, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12231)
KeyError: 'A'

I have no idea what I am doing wrong and have researched far and wide. Any ideas?

Accepted answer by MaxU

I guess you and @jezrael misunderstood an example from the pandas docs:

df.set_index(['A', 'B'])

A and B are column names / labels in this example:

In [55]: df = pd.DataFrame(np.random.randint(0, 10, (5,4)), columns=list('ABCD'))

In [56]: df
Out[56]:
   A  B  C  D
0  6  9  7  4
1  5  1  3  4
2  4  4  0  5
3  9  0  9  8
4  6  4  5  7

In [57]: df.set_index(['A','B'])
Out[57]:
     C  D
A B
6 9  7  4
5 1  3  4
4 4  0  5
9 0  9  8
6 4  5  7

The documentation says it should be a list of column labels / arrays.

so you were looking for:

In [58]: df.set_index([['A','B','C','D','E']])
Out[58]:
   A  B  C  D
A  6  9  7  4
B  5  1  3  4
C  4  4  0  5
D  9  0  9  8
E  6  4  5  7

But, as @jezrael has suggested, df.index = ['A','B',...] is a faster and more idiomatic method.
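
To make the distinction concrete, here is a minimal sketch that reproduces the error and both fixes. The three-row frame `summary` and the `labels` list are hypothetical stand-ins for the question's `summaryDF`, not the asker's real data, and the behavior is as I understand it for reasonably recent pandas versions.

```python
import pandas as pd

# Hypothetical stand-in for summaryDF: three rows that all share index label 0.
summary = pd.DataFrame(
    {"accuracy": [0.494, 0.290, 0.274], "f1": [0.722433, 0.826087, 0.629630]},
    index=[0, 0, 0],
)
labels = ["A", "B", "C"]

# A flat list of strings is interpreted as column names; none of them
# exist in the frame, so pandas raises KeyError.
try:
    summary.set_index(labels)
    flat_list_error = None
except KeyError as exc:
    flat_list_error = exc

# Wrapping the labels in an outer list passes them as a single array of
# index values, one per row.
relabelled = summary.set_index([labels])

# Direct assignment to .index produces the same row labels.
assigned = summary.copy()
assigned.index = labels

print(type(flat_list_error).__name__)
print(relabelled)
```

Note that `set_index` returns a new frame here, so `summary` itself keeps its original index.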

Answer by jezrael

You need to assign a list to summaryDF.index, provided the length of the list is the same as the length of the DataFrame:

summaryDF.index = ['A','B','C', 'D','E','F','G','H','I','J','K','L']
print (summaryDF)
   accuracy        f1  precision    recall
A     0.494  0.722433   0.722433  0.722433
B     0.290  0.826087   0.826087  0.826087
C     0.274  0.629630   0.629630  0.629630
D     0.278  0.628571   0.628571  0.628571
E     0.288  0.718750   0.718750  0.718750
F     0.740  0.740000   0.740000  0.740000
G     0.698  0.765133   0.765133  0.765133
H     0.582  0.778547   0.778547  0.778547
I     0.682  0.748235   0.748235  0.748235
J     0.574  0.767918   0.767918  0.767918
K     0.398  0.711656   0.711656  0.711656
L     0.530  0.780083   0.780083  0.780083

print (summaryDF.index)
Index(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L'], dtype='object')
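
The length condition matters: pandas rejects an index assignment whose length differs from the number of rows. Below is a small sketch of a guard around the assignment; `relabel_rows` is a hypothetical helper written for illustration, not part of pandas, and `frame` is a made-up three-row stand-in for `summaryDF`.

```python
import pandas as pd

# Hypothetical three-row frame standing in for summaryDF.
frame = pd.DataFrame({"accuracy": [0.494, 0.290, 0.274]}, index=[0, 0, 0])


def relabel_rows(df, labels):
    """Return a copy of df whose index is `labels`, checking the length first."""
    if len(labels) != len(df):
        raise ValueError(f"expected {len(df)} labels, got {len(labels)}")
    out = df.copy()
    out.index = list(labels)
    return out


relabelled = relabel_rows(frame, ["A", "B", "C"])

# pandas itself also refuses a list of the wrong length with a ValueError.
try:
    frame.index = ["A", "B"]
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

print(relabelled)
print(type(mismatch_error).__name__)
```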

Timings:

In [117]: %timeit summaryDF.index = ['A','B','C', 'D','E','F','G','H','I','J','K','L']
The slowest run took 6.86 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 76.2 μs per loop

In [118]: %timeit summaryDF.set_index(pd.Index(['A','B','C', 'D','E','F','G','H','I','J','K','L']))
The slowest run took 6.77 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 227 μs per loop
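
Outside IPython, a comparable measurement can be sketched with the standard-library `timeit` module. The frame below is a hypothetical 12-row stand-in, and the absolute numbers will vary by machine and pandas version, so treat them as a rough comparison rather than a benchmark.

```python
import timeit

import pandas as pd

labels = list("ABCDEFGHIJKL")
# Hypothetical 12-row frame; every row starts with index label 0.
frame = pd.DataFrame({"accuracy": [0.5] * len(labels)}, index=[0] * len(labels))


def assign_index():
    # Replaces the index of the existing frame in place.
    frame.index = labels


def call_set_index():
    # Builds and returns a new frame with the given index.
    return frame.set_index(pd.Index(labels))


assign_seconds = timeit.timeit(assign_index, number=1000)
set_index_seconds = timeit.timeit(call_set_index, number=1000)

print(f"index assignment: {assign_seconds / 1000 * 1e6:.1f} us per call")
print(f"set_index:        {set_index_seconds / 1000 * 1e6:.1f} us per call")
```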

Another solution is to convert the list to a numpy array:

summaryDF.set_index(np.array(['A','B','C', 'D','E','F','G','H','I','J','K','L']), inplace=True)
print (summaryDF)
   accuracy        f1  precision    recall
A     0.494  0.722433   0.722433  0.722433
B     0.290  0.826087   0.826087  0.826087
C     0.274  0.629630   0.629630  0.629630
D     0.278  0.628571   0.628571  0.628571
E     0.288  0.718750   0.718750  0.718750
F     0.740  0.740000   0.740000  0.740000
G     0.698  0.765133   0.765133  0.765133
H     0.582  0.778547   0.778547  0.778547
I     0.682  0.748235   0.748235  0.748235
J     0.574  0.767918   0.767918  0.767918
K     0.398  0.711656   0.711656  0.711656
L     0.530  0.780083   0.780083  0.780083
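
The call above uses inplace=True, so it modifies summaryDF directly. As a sketch of the alternative, assuming a hypothetical two-row frame, the same call without `inplace` returns a relabelled copy and leaves the original untouched:

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame standing in for summaryDF.
frame = pd.DataFrame({"accuracy": [0.494, 0.290]}, index=[0, 0])
labels = np.array(["A", "B"])

# Without inplace=True, set_index returns a new frame.
relabelled = frame.set_index(labels)

print(relabelled)
print(frame.index.tolist())  # the original frame keeps its old index
```

Reassigning the result (summaryDF = summaryDF.set_index(...)) is the usual way to keep the change when not using inplace.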