pandas: How to select DataFrame columns based on a partial match?

Question

Asked by Michele Ancis

I was struggling this afternoon to find a way of selecting a few columns of my Pandas DataFrame by checking for the occurrence of a certain pattern in their name (label?).

I had been looking for something like contains or isin for ndarrays / pd.Series, but had no luck.

This frustrated me quite a bit, as I was already checking the columns of my DataFrame for occurrences of specific string patterns, as in:

hp = ~(df.target_column.str.contains('some_text') | df.target_column.str.contains('other_text'))
df_cln = df[hp]

However, no matter how I banged my head, I could not apply .str.contains() to the object returned by df.columns (which is an Index), nor to the one returned by df.columns.values (which is an ndarray). It works fine on what is returned by the "slicing" operation df[column_name], i.e. a Series, though.

My first solution involved a for loop and the creation of a helper list:

ll = []
for a in df.columns:
    # keep every column whose name starts with one of the two prefixes
    if a.startswith('start_exp1') or a.startswith('start_exp2'):
        ll.append(a)
df[ll]
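
As a side note, the loop above can probably be shortened, since str.startswith accepts a tuple of prefixes. This is only a sketch: the DataFrame and its column names below are made up for illustration.

```python
import pandas as pd

# Hypothetical frame; the column names are invented for this example.
df = pd.DataFrame({
    'start_exp1_a': [1, 2],
    'start_exp2_b': [3, 4],
    'other': [5, 6],
})

# str.startswith takes a tuple, so one call checks both prefixes.
prefixes = ('start_exp1', 'start_exp2')
selected = [name for name in df.columns if name.startswith(prefixes)]

# Index the frame with the list of matching labels.
subset = df[selected]
print(selected)
```

The list comprehension does the same work as the helper list, just without the explicit append.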

(one could apply any of the str functions, of course)

Then, I found the map function and got it to work with the following code:

import re
# True for every column name in which the regex matches somewhere
sel = df.columns.map(lambda x: bool(re.search('your_regex', x)))
df[df.columns[sel]]

Of course, in the first solution I could have performed the same kind of regex checking, because I can apply it to the str data type returned by the iteration.

I am very new to Python and have never really programmed anything, so I am not too familiar with speed/timing/efficiency, but I tend to think that the second method (using a map) could potentially be faster, besides looking more elegant to my untrained eye.

I am curious to know what you think of it, and what possible alternatives would be. Given my level of noobness, I would really appreciate if you could correct any mistakes I could have made in the code and point me in the right direction.

Thanks, Michele

EDIT: I just found the Index method Index.to_series(), which returns - ehm - a Series to which I could apply .str.contains('whatever'). However, this is not quite as powerful as a true regex, and I could not find a way of passing the result of Index.to_series().str to the re.search() function.

Answer 1

Answered by Robert Smith

Your solution using map is very good. If you really want to use str.contains, it is possible to convert Index objects to Series (which have the str.contains method):

In [1]: df
Out[1]: 
   x  y  z
0  0  0  0
1  1  1  1
2  2  2  2
3  3  3  3
4  4  4  4
5  5  5  5
6  6  6  6
7  7  7  7
8  8  8  8
9  9  9  9

In [2]: df.columns.to_series().str.contains('x')
Out[2]: 
x     True
y    False
z    False
dtype: bool

In [3]: df[df.columns[df.columns.to_series().str.contains('x')]]
Out[3]: 
   x
0  0
1  1
2  2
3  3
4  4
5  5
6  6
7  7
8  8
9  9

UPDATE: I just read your last paragraph. According to the documentation, str.contains treats the pattern as a regex by default, so you can pass one directly (str.contains('^myregex')).
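
To illustrate the regex point, here is a minimal sketch. It assumes a reasonably recent pandas, where the Index itself exposes the .str accessor, so the to_series() step can be skipped; the column names are invented for the example.

```python
import pandas as pd

# Hypothetical frame; column names are made up for illustration.
df = pd.DataFrame({'x_mean': [0, 1], 'x_std': [2, 3], 'y_mean': [4, 5]})

# The pattern is interpreted as a regular expression by default (regex=True).
mask = df.columns.str.contains(r'^x_')

# A boolean mask over the columns can be passed to .loc on the column axis.
x_cols = df.loc[:, mask]
mean_cols = df.loc[:, df.columns.str.contains(r'_mean$')]

print(list(x_cols.columns))
print(list(mean_cols.columns))
```

Anchors such as ^ and $ work here because the match is done with a regex search on each label.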

Answer 2

Answered by Philipp Schwarz

Selecting columns by partial string can be done simply via:

df.filter(like='hello')  # select columns which contain the word hello

And to select rows by partial string match, you can pass axis=0 to filter:

df.filter(like='hello', axis=0)
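
For a slightly fuller picture, the sketch below runs both calls on a small frame and also shows the regex parameter of filter, for cases where a plain substring is not enough. The frame, its column names, and its index labels are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical data; labels are invented for illustration.
df = pd.DataFrame(
    {'hello_a': [1, 2, 3], 'say_hello': [4, 5, 6], 'bye': [7, 8, 9]},
    index=['row_hello', 'row_x', 'hello_row'],
)

# Columns whose label contains the substring 'hello'.
by_substring = df.filter(like='hello')

# Columns whose label matches a regular expression (here: starts with 'hello').
by_regex = df.filter(regex=r'^hello')

# Rows whose index label contains 'hello'.
rows = df.filter(like='hello', axis=0)

print(list(by_substring.columns), list(by_regex.columns), list(rows.index))
```

Note that filter matches on labels only (column names or index labels), never on the cell values.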

Answer 3

Answered by Geeocode

I think df.keys().tolist() is the thing you're searching for.

A tiny example:

from pandas import DataFrame as df

d = df({'somename': [1,2,3], 'othername': [4,5,6]})

names = d.keys().tolist()

for n in names:
    print(n)
    print(type(n))

Output:

somename
<class 'str'>
othername
<class 'str'>

Then with the strings you got, you can do any string operation you want.
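
For instance, a regex search over that list might look like the sketch below. The extra 'value' column and the pattern are assumptions added only to make the selection visible.

```python
import re
import pandas as pd

# Same idea as the tiny example, plus a hypothetical non-matching column.
d = pd.DataFrame({'somename': [1, 2, 3], 'othername': [4, 5, 6], 'value': [7, 8, 9]})

# Plain Python strings, so the re module can be used directly.
names = d.keys().tolist()
pattern = re.compile(r'name$')  # labels ending in 'name'
matching = [n for n in names if pattern.search(n)]

# Select the matching columns by passing the list of labels.
result = d[matching]
print(matching)
```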
