Note: this content comes from StackOverflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me) and cite the original StackOverflow URL: http://stackoverflow.com/questions/31551412/

How to select DataFrame columns based on partial matching?

python pandas

Asked by Michele Ancis

I was struggling this afternoon to find a way of selecting a few columns of my Pandas DataFrame by checking for the occurrence of a certain pattern in their names (labels?).

I had been looking for something like contains or isin for nd.arrays / pd.Series, but had no luck.

This frustrated me quite a bit, as I was already checking the columns of my DataFrame for occurrences of specific string patterns, as in:

# True for rows whose target_column contains neither pattern
hp = ~(df.target_column.str.contains('some_text') | df.target_column.str.contains('other_text'))
df_cln = df[hp]

However, no matter how I banged my head, I could not apply .str.contains() to the object returned by df.columns (which is an Index), nor to the one returned by df.columns.values (which is an ndarray). It works fine on what is returned by the "slicing" operation df[column_name], i.e. a Series, though.

My first solution involved a for loop and the creation of a helper list:

ll = []
for a in df.columns:
    # keep columns whose name starts with either prefix
    if a.startswith('start_exp1') or a.startswith('start_exp2'):
        ll.append(a)
df[ll]

(one could apply any of the str functions, of course)

Then, I found the map function and got it to work with the following code:

import re
# one boolean per column label: True where the regex matches
sel = df.columns.map(lambda x: bool(re.search('your_regex', x)))
df[df.columns[sel]]

Of course, in the first solution I could have performed the same kind of regex checking, because I can apply it to the str data type returned by the iteration.
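
As a minimal sketch of that regex check applied while iterating over the column labels (written here as a list comprehension), assuming a hypothetical DataFrame whose column names and pattern are made up for illustration:

```python
import re

import pandas as pd

# Hypothetical frame; the column names are illustrative only.
df = pd.DataFrame({
    'start_exp1_a': [1, 2],
    'start_exp2_b': [3, 4],
    'other': [5, 6],
})

# Compile the pattern once, then test each column label (a plain str).
pattern = re.compile(r'^start_exp[12]')
selected = [name for name in df.columns if pattern.search(name)]

subset = df[selected]
print(subset.columns.tolist())
```

Iterating over df.columns yields ordinary Python strings, so any function from the re module can be used on them directly.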

I am very new to Python and never really programmed anything so I am not too familiar with speed/timing/efficiency, but I tend to think that the second method - using a map - could potentially be faster, besides looking more elegant to my untrained eye.

I am curious to know what you think of it, and what possible alternatives there would be. Given my level of noobness, I would really appreciate it if you could correct any mistakes I may have made in the code and point me in the right direction.

Thanks, Michele

EDIT: I just found the Index method Index.to_series(), which returns (ehm) a Series to which I could apply .str.contains('whatever'). However, this is not quite as powerful as a true regex, and I could not find a way of passing the result of Index.to_series().str to the re.search() function.

Answered by Robert Smith

Your solution using map is very good. If you really want to use str.contains, it is possible to convert Index objects to Series (which have the str.contains method):

In [1]: df
Out[1]: 
   x  y  z
0  0  0  0
1  1  1  1
2  2  2  2
3  3  3  3
4  4  4  4
5  5  5  5
6  6  6  6
7  7  7  7
8  8  8  8
9  9  9  9

In [2]: df.columns.to_series().str.contains('x')
Out[2]: 
x     True
y    False
z    False
dtype: bool

In [3]: df[df.columns[df.columns.to_series().str.contains('x')]]
Out[3]: 
   x
0  0
1  1
2  2
3  3
4  4
5  5
6  6
7  7
8  8
9  9

UPDATE: I just read your last paragraph. According to the documentation, str.contains accepts a regex by default (e.g. str.contains('^myregex')).
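
A small sketch of passing a regex to str.contains on the column labels; the DataFrame and its column names are hypothetical, chosen only for illustration:

```python
import pandas as pd

# Hypothetical frame; column names are illustrative only.
df = pd.DataFrame({'exp1_start': [1], 'exp2_start': [2], 'start_other': [3]})

# str.contains treats the pattern as a regex by default (regex=True),
# so anchors and character classes work.
mask = df.columns.to_series().str.contains(r'^exp[12]_')

# .values turns the label-indexed boolean Series into a plain array
# before it is used to pick column labels.
subset = df[df.columns[mask.values]]

# In recent pandas versions the Index exposes the .str accessor itself,
# so the to_series() step can be skipped.
mask2 = df.columns.str.contains(r'^exp[12]_')

print(subset.columns.tolist())
```

If the pattern should be matched literally instead, str.contains also takes regex=False.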

Answered by Philipp Schwarz

Selecting columns by partial string match can be done simply with:

df.filter(like='hello')  # select columns whose labels contain 'hello'

And to select rows whose index labels partially match a string, pass axis=0 to filter:

df.filter(like='hello', axis=0) 
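
For illustration, a short sketch of the filter variants on a hypothetical DataFrame (the column and index labels are assumptions made up for this example). Besides like, filter also accepts a regex argument:

```python
import pandas as pd

# Hypothetical frame; column and index labels are illustrative only.
df = pd.DataFrame(
    {'hello_a': [1, 2], 'b_hello': [3, 4], 'world': [5, 6]},
    index=['row_hello', 'row_other'],
)

# Columns whose label contains the substring 'hello'.
by_substring = df.filter(like='hello')

# Columns whose label matches the regular expression.
by_regex = df.filter(regex=r'^hello')

# Rows whose index label contains 'hello' (labels, not cell values).
rows = df.filter(like='hello', axis=0)

print(by_substring.columns.tolist())
print(by_regex.columns.tolist())
print(rows.index.tolist())
```

Note that filter only looks at labels; it never inspects the values stored in the frame.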

Answered by Geeocode

I think df.keys().tolist() is the thing you're searching for.

A tiny example:

import pandas as pd

d = pd.DataFrame({'somename': [1, 2, 3], 'othername': [4, 5, 6]})

# keys() returns the column Index; tolist() turns it into plain Python strings
names = d.keys().tolist()

for n in names:
    print(n)
    print(type(n))

Output (Python 3):

somename
<class 'str'>
othername
<class 'str'>

Then with the strings you got, you can do any string operation you want.
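
As one possible sketch of such a string operation, the list of names can be filtered with ordinary str methods and passed back to the DataFrame. The extra 'misc' column and the chosen prefixes are assumptions added for illustration:

```python
import pandas as pd

# Same example frame as above, plus a hypothetical 'misc' column.
d = pd.DataFrame({'somename': [1, 2, 3], 'othername': [4, 5, 6], 'misc': [7, 8, 9]})

names = d.keys().tolist()

# str.startswith accepts a tuple, so several prefixes can be tested at once.
wanted = [n for n in names if n.startswith(('some', 'other'))]

subset = d[wanted]
print(subset.columns.tolist())
```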
