Pandas: KeyError when using column names that are included in the index

Question

Asked by James Adams

I have text files that I'm parsing which contain fixed width fields with lines that look like this:

USC00142401201703TMAX  211  H  133  H  161  H  194  H  206  H  161  H  244  H  178  H-9999     250  H   78  H   44  H   67  H   50  H   39  H  106  H  239  H  239  H  217  H  317  H  311  H  178  H  139  H-9999     228  H-9999   -9999   -9999   -9999   -9999   -9999

I'm parsing these into a pandas DataFrame like so:

from collections import OrderedDict
from pandas import DataFrame
import pandas as pd
import numpy as np

def read_into_dataframe(station_filepath):

    # specify the fixed-width fields
    column_specs = [(0, 11),   # ID
                    (11, 15),  # year
                    (15, 17),  # month
                    (17, 21),  # variable (referred to as element in the GHCND readme.txt)
                    (21, 26),  # day 1
                    (29, 34),  # day 2
                    (37, 42),  # day 3
                    (45, 50),  # day 4
                    (53, 58),  # day 5
                    (61, 66),  # day 6
                    (69, 74),  # day 7
                    (77, 82),  # day 8
                    (85, 90),  # day 9
                    (93, 98),  # day 10
                    (101, 106),  # day 11
                    (109, 114),  # day 12
                    (117, 122),  # day 13
                    (125, 130),  # day 14
                    (133, 138),  # day 15
                    (141, 146),  # day 16
                    (149, 154),  # day 17
                    (157, 162),  # day 18
                    (165, 170),  # day 19
                    (173, 178),  # day 20
                    (181, 186),  # day 21
                    (189, 194),  # day 22
                    (197, 202),  # day 23
                    (205, 210),  # day 24
                    (213, 218),  # day 25
                    (221, 226),  # day 26
                    (229, 234),  # day 27
                    (237, 242),  # day 28
                    (245, 250),  # day 29
                    (253, 258),  # day 30
                    (261, 266)]  # day 31

    # create column names to correspond with the fields specified above
    column_names = ['station_id', 'year', 'month', 'variable',
                    '01', '02', '03', '04', '05', '06', '07', '08', '09', '10',  
                    '11', '12', '13', '14', '15', '16', '17', '18', '19', '20',  
                    '21', '22', '23', '24', '25', '26', '27', '28', '29', '30',  '31']

    # read the fixed-width file into a DataFrame using the column specs and names defined above
    df = pd.read_fwf(station_filepath, 
                     header=None,
                     colspecs=column_specs,
                     names=column_names,
                     na_values=-9999)

    # NOTE: dropna() returns a new DataFrame; the result is not assigned here,
    # so this line currently has no effect
    df.dropna()  #REVISIT do we really want to do this?

    # convert the variable column to string data type
    df['variable'] = df['variable'].astype(str)

    # keep only the rows where the variable value is 'PRCP', 'TMIN', or 'TMAX'
    df = df[df['variable'].isin(['PRCP', 'TMAX', 'TMIN'])]

    # melt the individual day columns into a single day column
    df = pd.melt(df,
                 id_vars=['station_id', 'year', 'month', 'variable'],
                 value_vars=['01', '02', '03', '04', '05', '06', '07', '08', '09', '10',
                             '11', '12', '13', '14', '15', '16', '17', '18', '19', '20',
                             '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31'],
                 var_name='day', 
                 value_name='value')

    # pivot the DataFrame on the variable type (PRCP, TMIN, TMAX), so each
    # type has a separate column with the day's value for the type
    df = df.pivot_table(index=['year',
                               'month',
                               'day'],
                        columns='variable',
                        values='value')

    return df
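The 35 column specs follow a regular pattern: four leading fields, then one 5-character value every 8 characters starting at offset 21. As a hedged sketch only, the same specs and names could be generated rather than typed out. The helper name `build_ghcnd_layout` is hypothetical and not part of the original code.

```python
def build_ghcnd_layout():
    """Hypothetical helper: build the same column specs and names as the hand-written lists."""
    # the four leading fields: ID, year, month, variable (element)
    column_specs = [(0, 11), (11, 15), (15, 17), (17, 21)]
    column_names = ['station_id', 'year', 'month', 'variable']
    # each day occupies 8 characters: a 5-character value, then 3 characters
    # that the original specs skip
    for day in range(1, 32):
        start = 21 + 8 * (day - 1)
        column_specs.append((start, start + 5))
        column_names.append(f'{day:02d}')
    return column_specs, column_names


column_specs, column_names = build_ghcnd_layout()
print(column_specs[4], column_specs[-1])   # first and last day fields
print(column_names[4], column_names[-1])   # '01' and '31'
```

The day names in `column_names[4:]` could then also be passed as `value_vars` to `pd.melt()` instead of repeating the list.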

I now get the DataFrame in the shape I want, except that it contains rows for days that don't exist (e.g. February 31st), which I'd like to remove.

I've tried to do this using masks, but when I've done so I get a KeyError when I try to use what I think are valid column names. For example if I include the following code in the above function before returning the DataFrame I will get a KeyError:

months_with_31days = [1, 3, 7, 8, 10, 12]
df = df[((df['day'] == 31) & (df['month'] in months_with_31days))
        |
       ((df['day'] == 30) & (df['month'] != 2))
        |
       ((df['day'] == 29) & (df['month'] != 2))
        |
       ((df['day'] == 29) & (df['month'] == 2) & calendar.isleap(df['year']))
        | 
        df['day'] < 29]

The above will result in a KeyError:

KeyError: 'day'

The day variable was created by the melt() call, then used within the index in the call to pivot_table(). How this affects the indexing of the DataFrame, and why it goobers up the ability to use the previous column names, is not clear to me. [EDIT] I assume that I now have a MultiIndex on the DataFrame, created by passing an index argument to pivot_table().
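A small sketch of that effect, using a made-up four-row frame in the post-melt shape (the values are illustrative only, not the real station data): after `pivot_table()`, the names passed as `index` become levels of a MultiIndex and are no longer columns, which is why `df['day']` raises a KeyError.

```python
import pandas as pd

# tiny made-up frame in the shape produced by melt() (illustrative values only)
long_df = pd.DataFrame({
    'year': [1893, 1893, 1893, 1893],
    'month': [1, 1, 1, 1],
    'day': ['01', '01', '02', '02'],
    'variable': ['TMAX', 'TMIN', 'TMAX', 'TMIN'],
    'value': [61.0, 33.0, 33.0, 6.0],
})

# same pivot as in the function above
wide_df = long_df.pivot_table(index=['year', 'month', 'day'],
                              columns='variable',
                              values='value')

print(list(wide_df.columns))        # only the pivoted variable names remain as columns
print(list(wide_df.index.names))    # year, month and day are now index levels
print('day' in wide_df.columns)     # False, so wide_df['day'] raises KeyError
```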

Initial lines displayed when printing the DataFrame:

variable         PRCP   TMAX   TMIN
year month day                     
1893 1     01     NaN   61.0   33.0
           02     NaN   33.0    6.0
           03     NaN   44.0   17.0
           04     NaN   78.0   22.0
           05     NaN   17.0  -94.0
           06     NaN   33.0    0.0
           07     NaN    0.0  -67.0

I've tried referencing the DataFrame's columns using dot notation instead of brackets with quoted column names, but I get similar errors. It seems like the year, month, and day columns have been merged into a single index column and can no longer be referenced individually. Or maybe something else is going on here? I'm stumped, and I may not even be approaching this in the best way. Any help or suggestions will be very much appreciated. Thanks.

Answer 1

Answered by gaw89

Yes, you've created a multi-index DataFrame. From looking at your output (without having access to your data), you should be able to access the days by typing:

df['variable']['day']
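In the printed frame, `variable` looks like the name of the column axis rather than a column label, so the index levels may be easier to reach through the index itself, for example with `df.index.get_level_values('day')`, or by calling `df.reset_index()` to turn the levels back into ordinary columns. The following is a hedged sketch of one way to drop the nonexistent days under that assumption. The helper name `drop_nonexistent_days` and the sample data are made up for illustration, and the day labels are assumed to be strings such as '01', as produced by the melt() step in the question.

```python
import calendar

import numpy as np
import pandas as pd


def drop_nonexistent_days(df):
    """Hypothetical helper: keep only rows whose (year, month, day) index is a real date."""
    years = df.index.get_level_values('year')
    months = df.index.get_level_values('month')
    days = df.index.get_level_values('day')
    # calendar.monthrange() returns (weekday of first day, number of days in month)
    # and accounts for leap years
    mask = np.array([int(d) <= calendar.monthrange(int(y), int(m))[1]
                     for y, m, d in zip(years, months, days)], dtype=bool)
    return df[mask]


# made-up sample with the same three-level index as the pivoted frame
index = pd.MultiIndex.from_tuples(
    [(2016, 2, '29'), (2017, 2, '29'), (2017, 2, '31'), (2017, 3, '31')],
    names=['year', 'month', 'day'])
sample = pd.DataFrame({'TMAX': [211.0, 133.0, 161.0, 194.0]}, index=index)

cleaned = drop_nonexistent_days(sample)
print(cleaned)   # keeps 2016-02-29 and 2017-03-31 only
```

Building the mask from the index levels sidesteps the KeyError, and `calendar.monthrange()` replaces the hand-written month and leap-year conditions from the question.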
