Original URL: http://stackoverflow.com/questions/43128024/
Notice: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Pandas: KeyError when using column names which are included in an index
Asked by James Adams
I have text files that I'm parsing which contain fixed width fields with lines that look like this:
USC00142401201703TMAX 211 H 133 H 161 H 194 H 206 H 161 H 244 H 178 H-9999 250 H 78 H 44 H 67 H 50 H 39 H 106 H 239 H 239 H 217 H 317 H 311 H 178 H 139 H-9999 228 H-9999 -9999 -9999 -9999 -9999 -9999
I'm parsing these into a pandas DataFrame like so:
from collections import OrderedDict
from pandas import DataFrame
import pandas as pd
import numpy as np
def read_into_dataframe(station_filepath):

    # specify the fixed-width fields
    column_specs = [(0, 11),     # ID
                    (11, 15),    # year
                    (15, 17),    # month
                    (17, 21),    # variable (referred to as element in the GHCND readme.txt)
                    (21, 26),    # day 1
                    (29, 34),    # day 2
                    (37, 42),    # day 3
                    (45, 50),    # day 4
                    (53, 58),    # day 5
                    (61, 66),    # day 6
                    (69, 74),    # day 7
                    (77, 82),    # day 8
                    (85, 90),    # day 9
                    (93, 98),    # day 10
                    (101, 106),  # day 11
                    (109, 114),  # day 12
                    (117, 122),  # day 13
                    (125, 130),  # day 14
                    (133, 138),  # day 15
                    (141, 146),  # day 16
                    (149, 154),  # day 17
                    (157, 162),  # day 18
                    (165, 170),  # day 19
                    (173, 178),  # day 20
                    (181, 186),  # day 21
                    (189, 194),  # day 22
                    (197, 202),  # day 23
                    (205, 210),  # day 24
                    (213, 218),  # day 25
                    (221, 226),  # day 26
                    (229, 234),  # day 27
                    (237, 242),  # day 28
                    (245, 250),  # day 29
                    (253, 258),  # day 30
                    (261, 266)]  # day 31

    # create column names to correspond with the fields specified above
    column_names = ['station_id', 'year', 'month', 'variable',
                    '01', '02', '03', '04', '05', '06', '07', '08', '09', '10',
                    '11', '12', '13', '14', '15', '16', '17', '18', '19', '20',
                    '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31']

    # read the fixed width file into a DataFrame columns with the widths and names specified above
    df = pd.read_fwf(station_filepath,
                     header=None,
                     colspecs=column_specs,
                     names=column_names,
                     na_values=-9999)

    # convert the variable column to string data type, all others as integer data type
    df.dropna()  #REVISIT do we really want to do this?
    df['variable'] = df['variable'].astype(str)

    # keep only the rows where the variable value is 'PRCP', 'TMIN', or 'TMAX'
    df = df[df['variable'].isin(['PRCP', 'TMAX', 'TMIN'])]

    # melt the individual day columns into a single day column
    df = pd.melt(df,
                 id_vars=['station_id', 'year', 'month', 'variable'],
                 value_vars=['01', '02', '03', '04', '05', '06', '07', '08', '09', '10',
                             '11', '12', '13', '14', '15', '16', '17', '18', '19', '20',
                             '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31'],
                 var_name='day',
                 value_name='value')

    # pivot the DataFrame on the variable type (PRCP, TMIN, TMAX), so each
    # type has a separate column with the day's value for the type
    df = df.pivot_table(index=['year',
                               'month',
                               'day'],
                        columns='variable',
                        values='value')

    return df
I now get the DataFrame in the shape I want it, except that there are rows for days that don't exist (i.e. February 31st, etc.), and which I'd like to remove.
I've tried to do this using masks, but when I've done so I get a KeyError when I try to use what I think are valid column names. For example if I include the following code in the above function before returning the DataFrame I will get a KeyError:
months_with_31days = [1, 3, 7, 8, 10, 12]
df = df[((df['day'] == 31) & (df['month'] in months_with_31days))
|
((df['day'] == 30) & (df['month'] != 2))
|
((df['day'] == 29) & (df['month'] != 2))
|
((df['day'] == 29) & (df['month'] == 2) & calendar.isleap(df['year']))
|
df['day'] < 29]
The above will result in a KeyError:
KeyError: 'day'
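Separately from the KeyError, several expressions in the mask above would likely fail even with ordinary columns: `df['month'] in months_with_31days` asks for the truth value of a whole Series, `calendar.isleap()` expects a single year rather than a Series, the final `df['day'] < 29` is unparenthesized (`|` binds tighter than `<`), and the month list omits May. The following is only a sketch of a vectorized version, assuming a hypothetical flat frame in which `year`, `month`, and `day` are integer columns (in the question, `day` holds strings such as '01' and would need converting first):

```python
import pandas as pd

# Hypothetical flat frame: year, month, and day are ordinary integer columns
df = pd.DataFrame({
    "year":  [2016, 2017, 2017, 2017, 2017],
    "month": [2,    2,    4,    5,    2],
    "day":   [29,   29,   31,   31,   28],
})

months_with_31_days = [1, 3, 5, 7, 8, 10, 12]

# Vectorized Gregorian leap-year rule (the per-row equivalent of calendar.isleap)
is_leap = (df["year"] % 4 == 0) & ((df["year"] % 100 != 0) | (df["year"] % 400 == 0))

# Every comparison is parenthesized because | and & bind tighter than < and ==
valid = (
    (df["day"] < 29)
    | ((df["day"] == 29) & ((df["month"] != 2) | is_leap))
    | ((df["day"] == 30) & (df["month"] != 2))
    | ((df["day"] == 31) & df["month"].isin(months_with_31_days))
)

filtered = df[valid]
print(filtered)
```

Here `Series.isin()` replaces the `in` test, so the membership check is applied row by row.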
The day variable was created by the melt() call, then used within the index in the call to pivot_table(). How this affects the indexing of the DataFrame and why it goobers up the ability to use the previous column names is not clear to me. [EDIT] I assume that I now have a MultiIndex on the DataFrame, created as a result of the call to pivot_table() via using an index argument.
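A minimal sketch of that effect, using a small made-up frame in place of the melted station data (the values are hypothetical): the names passed as `index` to `pivot_table()` become levels of a MultiIndex rather than columns, so bracket lookup by those names raises a KeyError until the index is reset.

```python
import pandas as pd

# Hypothetical miniature of the melted data: one value per (date, variable)
long_df = pd.DataFrame({
    "year": [2017, 2017, 2017, 2017],
    "month": [2, 2, 2, 2],
    "day": ["01", "01", "02", "02"],
    "variable": ["TMAX", "TMIN", "TMAX", "TMIN"],
    "value": [211.0, 100.0, 133.0, 50.0],
})

wide = long_df.pivot_table(index=["year", "month", "day"],
                           columns="variable",
                           values="value")

# year/month/day are now index levels; only the variable values are columns
print(list(wide.index.names))
print(list(wide.columns))

try:
    wide["day"]
except KeyError as err:
    print("KeyError:", err)

# reset_index() turns the index levels back into ordinary columns
flat = wide.reset_index()
print(flat["day"].tolist())
```

After `reset_index()`, masks written against `flat['year']`, `flat['month']`, and `flat['day']` can look the columns up again.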
Initial lines displayed when printing the DataFrame:
variable        PRCP  TMAX  TMIN
year month day
1893 1     01    NaN  61.0  33.0
           02    NaN  33.0   6.0
           03    NaN  44.0  17.0
           04    NaN  78.0  22.0
           05    NaN  17.0 -94.0
           06    NaN  33.0   0.0
           07    NaN   0.0 -67.0
I've tried referencing the DataFrame's columns using dot notation instead of brackets with quoted column names, but I get similar errors. It seems like the year, month, and day columns have been merged into a single index column and can no longer be referenced individually. Or not, maybe something else is going on here? I'm stumped, maybe not even approaching this in the best way, any help or suggestions will be very appreciated. Thanks.
Answered by gaw89
Yes, you've created a multi-index DataFrame. From looking at your output (without having access to your data), you should be able to access the days by typing:
df['variable']['day']
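If that expression also raises a KeyError (in the printed output, `variable` appears to be the name of the column axis rather than a column), the index levels can be read by name instead. The sketch below is not part of the original answer. It assumes the index levels are named `year`, `month`, and `day` as in the question, uses a small hypothetical frame, and drops impossible dates by letting `pd.to_datetime()` coerce them to NaT:

```python
import pandas as pd

# Hypothetical stand-in for the pivoted frame: MultiIndex (year, month, day)
index = pd.MultiIndex.from_tuples(
    [(2017, 2, "28"), (2017, 2, "29"), (2017, 2, "31"), (2016, 2, "29")],
    names=["year", "month", "day"],
)
df = pd.DataFrame({"TMAX": [211.0, None, None, 150.0]}, index=index)

# Read a single index level by name
days = df.index.get_level_values("day")
print(days.tolist())

# Copy the levels into a plain frame and convert the day strings to integers
parts = df.index.to_frame(index=False).astype(int)

# Assemble dates from the parts; impossible dates become NaT with errors="coerce"
dates = pd.to_datetime(parts[["year", "month", "day"]], errors="coerce")

# Keep only the rows whose (year, month, day) forms a real calendar date
df_valid = df[dates.notna().to_numpy()]
print(df_valid)
```

Letting pandas validate the dates avoids hand-written month-length and leap-year rules, at the cost of building a temporary frame from the index.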