Note: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/47807405/

Time: 2020-09-14 04:54:39  Source: igfitidea

How to select ONLY THE INDEX COLUMNS in a pandas multi-index Dataframe?

Tags: python-3.x, pandas, dataframe, multi-index

Asked by Archan Joshi

Okay, so I have a DataFrame with a two-column index, and I am trying to filter its rows and keep ONLY THE INDEX COLUMNS of the original DataFrame in the new, filtered DataFrame.

I created the DataFrame from a CSV file (the original post links to the file) as follows:

import pandas as pd

census_df = pd.read_csv("census.csv", index_col=["STNAME", "CTYNAME"])
census_df = census_df.sort_index(ascending=True)  # sort_index returns a new DataFrame, so keep the result
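
Since census.csv itself is not reproduced here, the loading step can be tried on a small in-memory stand-in. The sketch below is only an illustration: the CSV text and its values are made up, and toy_df is a hypothetical name, not part of the original question.

```python
import io

import pandas as pd

# Hypothetical stand-in for census.csv; the values are invented for illustration.
csv_text = """STNAME,CTYNAME,REGION,POPESTIMATE2014,POPESTIMATE2015
Iowa,Washington County,2,100,110
Iowa,Adams County,2,50,49
Texas,Washington County,3,200,210
"""

# index_col with two names produces a two-level MultiIndex.
toy_df = pd.read_csv(io.StringIO(csv_text), index_col=["STNAME", "CTYNAME"])

# sort_index returns a new DataFrame, so the result is assigned back.
toy_df = toy_df.sort_index(ascending=True)

print(list(toy_df.index.names))  # the two index levels
print(list(toy_df.columns))      # the remaining data columns
```

The two index levels no longer appear among the data columns, which is why selecting columns alone cannot return them.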

Then, I applied some filtering to the DataFrame, which works perfectly fine, and I get the desired rows. The code I used is shown below:

def my_answer():

    mask1 = census_df["REGION"].between(1, 2)
    mask2 = census_df.index.get_level_values("CTYNAME").str.startswith("Washington")
    mask3 = (census_df["POPESTIMATE2015"] > census_df["POPESTIMATE2014"])
    new_df = census_df[mask1 & mask2 & mask3]
    return pd.DataFrame(new_df.iloc[:, -1])

my_answer()

Here is the problem:

The above code returns a DataFrame that contains one data column (the one selected by iloc[:, -1]) IN ADDITION to the two index columns. What I want is JUST THE TWO INDEX COLUMNS. So the final answer should return a DATAFRAME with "STNAME" and "CTYNAME" and 5 rows in it.

Accepted answer by jezrael

You can convert the index to a DataFrame:

def my_answer():

    mask1 = census_df["REGION"].between(1, 2)
    mask2 = census_df.index.get_level_values("CTYNAME").str.startswith("Washington")
    mask3 = (census_df["POPESTIMATE2015"] > census_df["POPESTIMATE2014"])
    new_df = census_df[mask1 & mask2 & mask3]
    return pd.DataFrame(new_df.index.tolist(), columns=['STNAME','CTYNAME'])

print (my_answer())

         STNAME            CTYNAME
0          Iowa  Washington County
1     Minnesota  Washington County
2  Pennsylvania  Washington County
3  Rhode Island  Washington County
4     Wisconsin  Washington County
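
The same conversion can be tried without the census file. The sketch below uses made-up rows (the names index and filtered are hypothetical) and compares the answer's index.tolist() approach with MultiIndex.to_frame(index=False), an alternative not mentioned in the answer that should give an equivalent frame on pandas versions that provide it.

```python
import pandas as pd

# Made-up rows standing in for the filtered census data.
index = pd.MultiIndex.from_tuples(
    [("Iowa", "Washington County"), ("Minnesota", "Washington County")],
    names=["STNAME", "CTYNAME"],
)
filtered = pd.DataFrame({"POPESTIMATE2015": [110, 220]}, index=index)

# Approach from the answer: build a frame from the list of index tuples.
from_list = pd.DataFrame(filtered.index.tolist(), columns=["STNAME", "CTYNAME"])

# Alternative (an assumption, not from the answer): let the MultiIndex build
# the frame itself; index=False asks for a fresh 0..n-1 row index.
from_to_frame = filtered.index.to_frame(index=False)

print(from_list)
print(from_to_frame)
```

Either way the result is an ordinary DataFrame whose only columns are the former index levels.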

If you want the output as a MultiIndex, you need MultiIndex.remove_unused_levels, which is available only in pandas 0.20.0+:

def my_answer():

    mask1 = census_df["REGION"].between(1, 2)
    mask2 = census_df.index.get_level_values("CTYNAME").str.startswith("Washington")
    mask3 = (census_df["POPESTIMATE2015"] > census_df["POPESTIMATE2014"])
    new_df = census_df[mask1 & mask2 & mask3]
    return new_df.index.remove_unused_levels()

print (my_answer())

MultiIndex(levels=[['Iowa', 'Minnesota', 'Pennsylvania', 'Rhode Island', 'Wisconsin'], 
                   ['Washington County']],
           labels=[[0, 1, 2, 3, 4], [0, 0, 0, 0, 0]],
           names=['STNAME', 'CTYNAME'])
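
Why remove_unused_levels is needed may be easier to see on a tiny example: filtering rows keeps the original level values around even when no remaining row uses them. The sketch below uses invented data, and the names toy_df, before and after are hypothetical.

```python
import pandas as pd

# Made-up data: three counties in two states.
index = pd.MultiIndex.from_tuples(
    [
        ("Iowa", "Adams County"),
        ("Iowa", "Washington County"),
        ("Texas", "Washington County"),
    ],
    names=["STNAME", "CTYNAME"],
)
toy_df = pd.DataFrame({"REGION": [2, 2, 3]}, index=index)

# Keep only REGION 2, which drops the single Texas row.
filtered_index = toy_df[toy_df["REGION"] == 2].index

# The filtered index still carries "Texas" as a level value ...
before = list(filtered_index.levels[0])
# ... until the unused levels are removed.
after = list(filtered_index.remove_unused_levels().levels[0])

print(before)
print(after)
```

Note that the printed form of a MultiIndex depends on the pandas version: the labels= argument shown in the output above was renamed to codes= in pandas 0.24.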

Answer by Hermes Morales

Using a list comprehension:

def my_answer():
    mask1 = census_df["REGION"].between(1, 2)
    mask2 = census_df.index.get_level_values("CTYNAME").str.startswith("Washington")
    mask3 = (census_df["POPESTIMATE2015"] > census_df["POPESTIMATE2014"])
    new_df = census_df[mask1 & mask2 & mask3]

    return pd.DataFrame([new_df.index[x] for x in range(len(new_df))])

my_answer()

Output:

              0                  1
0          Iowa  Washington County
1     Minnesota  Washington County
2  Pennsylvania  Washington County
3  Rhode Island  Washington County
4     Wisconsin  Washington County
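
The columns in this output are labelled 0 and 1 because no column names are passed to the DataFrame constructor. A minimal sketch of one way to carry the index names over, using made-up rows (the name named is hypothetical and this variation is not part of the original answer):

```python
import pandas as pd

# Made-up rows standing in for the filtered census data.
index = pd.MultiIndex.from_tuples(
    [("Iowa", "Washington County"), ("Wisconsin", "Washington County")],
    names=["STNAME", "CTYNAME"],
)
new_df = pd.DataFrame({"POPESTIMATE2015": [110, 330]}, index=index)

# Iterating over a MultiIndex yields one tuple per row.
rows = [row for row in new_df.index]

# Reuse the index level names as column labels instead of the default 0 and 1.
named = pd.DataFrame(rows, columns=list(new_df.index.names))

print(named)
```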