
Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me): StackOverFlow. Original: http://stackoverflow.com/questions/48198021/


Filter pandas dataframe with specific column names in python

Tags: python, pandas, dataframe

Asked by J Cena

I have a pandas dataframe and a list as follows:

mylist = ['nnn', 'mmm', 'yyy']
mydata =
   xxx  yyy  zzz  nnn  ddd  mmm
0    0   10    5    5    5    5
1    1    9    2    3    4    4
2    2    8    8    7    9    0

Now, I want to get only the columns mentioned in mylist and save them as a CSV file.

i.e.

   yyy  nnn  mmm
0   10    5    5
1    9    3    4
2    8    7    0

My current code is as follows.

mydata = pd.read_csv( input_file, header=0)

for item in mylist:
    mydata_new = mydata[item]

print(mydata_new)
mydata_new.to_csv(file_name)

It seems to me that my new dataframe produces wrong results. Where am I going wrong? Please help me!

Answered by cs95

Just pass a list of column names to index df:

df[['nnn', 'mmm', 'yyy']]

   nnn  mmm  yyy
0    5    5   10
1    3    4    9
2    7    0    8


If you need to handle non-existent column names in your list, try filtering with df.columns.isin. Names that are not in df are ignored, and the result keeps df's column order rather than the list's:

df.loc[:, df.columns.isin(['nnn', 'mmm', 'yyy', 'zzzzzz'])]

   yyy  nnn  mmm
0   10    5    5
1    9    3    4
2    8    7    0
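
If the order of the list matters, one possible variation (a sketch, not part of the original answer) is to filter the list itself against df.columns instead of building a boolean mask. The sample frame below is reconstructed from the question's data, and the names `wanted` and `present` are only illustrative.

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "xxx": [0, 1, 2],
    "yyy": [10, 9, 8],
    "zzz": [5, 2, 8],
    "nnn": [5, 3, 7],
    "ddd": [5, 4, 9],
    "mmm": [5, 4, 0],
})

wanted = ["nnn", "mmm", "yyy", "zzzzzz"]  # 'zzzzzz' is not a column of df

# Keep only the names that exist in df, in the order given by the list.
present = [col for col in wanted if col in df.columns]
result = df[present]

print(result)
```

Here the columns come back as nnn, mmm, yyy, whereas the isin mask returns them in the frame's own order.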

Answered by Tai

You can just put mylist inside [] and pandas will select those columns for you.

mydata_new = mydata[mylist]
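
Putting this together with the question's read and write steps might look like the following sketch. To keep it self-contained, the input file is replaced by an in-memory string (an assumption; the question reads from `input_file`), and the CSV is returned as text rather than written to `file_name`.

```python
import io

import pandas as pd

# Stand-in for the question's input file, reconstructed from the sample data.
csv_text = """xxx,yyy,zzz,nnn,ddd,mmm
0,10,5,5,5,5
1,9,2,3,4,4
2,8,8,7,9,0
"""
mylist = ["nnn", "mmm", "yyy"]

mydata = pd.read_csv(io.StringIO(csv_text), header=0)
mydata_new = mydata[mylist]  # a DataFrame holding only the listed columns

# index=False leaves the row index out of the output.
# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_out = mydata_new.to_csv(index=False)
print(csv_out)
```

To write to disk instead, pass a path as the first argument, e.g. `mydata_new.to_csv(file_name, index=False)`.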

Not sure whether your yyy is a typo.

The reason your code goes wrong is that you are reassigning mydata_new to a new Series on every iteration of the loop.

for item in mylist:
    mydata_new = mydata[item]  # <- overwritten on every iteration

Thus, after the loop it holds a single Series (the last column named in mylist) rather than the whole DataFrame you want.
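
A small sketch makes the difference visible. The frame here is a cut-down reconstruction of the question's data, and `selected` is an illustrative name.

```python
import pandas as pd

mydata = pd.DataFrame({"yyy": [10, 9, 8], "nnn": [5, 3, 7], "mmm": [5, 4, 0]})
mylist = ["nnn", "mmm", "yyy"]

for item in mylist:
    # Each pass rebinds the name, so the earlier columns are discarded.
    mydata_new = mydata[item]

print(type(mydata_new).__name__)  # Series
print(mydata_new.name)            # yyy, the last entry in mylist

# Indexing with the whole list returns a DataFrame with all three columns.
selected = mydata[mylist]
print(type(selected).__name__)    # DataFrame
```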



If some names in the list are not in your data frame, you can always check for them with

len(set(mylist) - set(mydata.columns)) > 0

and print them out

print(set(mylist) - set(mydata.columns))

Then see if there are typos or other unintended behaviors.
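
The check can also be wrapped in a small helper so the missing names are reported before selecting. This is only a sketch: `find_missing` is a hypothetical function written for illustration, not a pandas API, and the frame is a cut-down reconstruction of the question's data.

```python
import pandas as pd


def find_missing(df, names):
    """Return the names that are not columns of df, in the order given."""
    return [name for name in names if name not in df.columns]


mydata = pd.DataFrame({"yyy": [10, 9, 8], "nnn": [5, 3, 7], "mmm": [5, 4, 0]})
mylist = ["nnn", "mmm", "yyy", "zzzzzz"]  # 'zzzzzz' stands in for a typo

missing = find_missing(mydata, mylist)
if missing:
    print("Columns not found:", missing)
else:
    mydata_new = mydata[mylist]
```

Unlike the set difference, the list version keeps the order in which the names were given, which can make typos easier to locate in a long list.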