Python: how to select columns from a DataFrame using regular expressions

Notice: this page is a side-by-side Chinese/English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/30808430/

Time: 2020-08-19 09:00:40  Source: igfitidea

How to select columns from dataframe by regex

python, python-2.7, pandas

Asked by Yan Song

I have a DataFrame in Python pandas. Its structure is as follows:

    a   b   c  d1  d2  d3
   10  14  12  44  45  78

I would like to select the columns whose names begin with d. Is there a simple way to achieve this in Python?

Accepted answer by farhawa

You can use DataFrame.filter this way:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[2, 4, 4], [4, 3, 3], [5, 9, 1]]), columns=['d', 't', 'didi'])
>>
   d  t  didi
0  2  4     4
1  4  3     3
2  5  9     1

df.filter(regex=("d.*"))

>>
   d  didi
0  2     4
1  4     3
2  5     1

The idea is to select the columns whose names match a regex.

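
As a hedged aside: the pandas documentation describes `regex` as keeping the labels for which `re.search(regex, label)` succeeds, so the call above should be roughly equivalent to the list comprehension in this sketch. The frame reuses the column names from the question; the names `by_filter` and `by_search` are mine.

```python
import re

import pandas as pd

# Same columns as in the question.
df = pd.DataFrame([[10, 14, 12, 44, 45, 78]],
                  columns=['a', 'b', 'c', 'd1', 'd2', 'd3'])

# Selection through DataFrame.filter with a regular expression.
by_filter = df.filter(regex='d.*')

# Hand-written counterpart: keep labels where re.search finds a match.
by_search = df[[c for c in df.columns if re.search('d.*', c)]]

print(list(by_filter.columns))
print(by_filter.equals(by_search))
```

Because `re.search` looks for the pattern anywhere in the label, `d.*` keeps every name that contains a `d`, which happens to coincide with "starts with d" for this particular frame.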

Answer by Alexander

You can use a list comprehension to iterate over all of the column names in your DataFrame df and then select only those that begin with 'd'.

df = pd.DataFrame({'a': {0: 10}, 'b': {0: 14}, 'c': {0: 12},
                   'd1': {0: 44}, 'd2': {0: 45}, 'd3': {0: 78}})

Use a list comprehension to iterate over the columns in the DataFrame and return their names (c below is a local variable representing the column name).

>>> [c for c in df]
['a', 'b', 'c', 'd1', 'd2', 'd3']

Then only select those beginning with 'd'.

>>> [c for c in df if c[0] == 'd']  # As an alternative to c[0], use c.startswith(...)
['d1', 'd2', 'd3']

Finally, pass this list of columns to the DataFrame.

>>> df[[c for c in df if c.startswith('d')]]
   d1  d2  d3
0  44  45  78
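
The same list-comprehension idea extends to an arbitrary pattern. A minimal sketch, assuming a helper named `columns_matching` (hypothetical, not part of pandas) that uses `re.match`, which only matches at the start of each name:

```python
import re

import pandas as pd


def columns_matching(frame, pattern):
    """Return the sub-frame whose column names match `pattern` at the start.

    `columns_matching` is a hypothetical helper, not a pandas function.
    """
    compiled = re.compile(pattern)
    # str(c) guards against non-string labels such as integers.
    keep = [c for c in frame.columns if compiled.match(str(c))]
    return frame[keep]


df = pd.DataFrame({'a': [10], 'b': [14], 'c': [12],
                   'd1': [44], 'd2': [45], 'd3': [78]})

# 'd' followed by one or more digits.
print(columns_matching(df, r'd\d+'))
```

Compiling the pattern once keeps the loop cheap when there are many columns.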

===========================================================================

TIMINGS (added Feb 2018 in response to comments from devinbost claiming that this method is slow...)

First, let's create a DataFrame with 30k columns:

import numpy as np

n = 10000
cols = ['{0}_{1}'.format(letters, number)
        for number in range(n) for letters in ('d', 't', 'didi')]
df = pd.DataFrame(np.random.randn(3, n * 3), columns=cols)
>>> df.shape
(3, 30000)

>>> %timeit df[[c for c in df if c[0] == 'd']]  # Simple list comprehension.
# 10 loops, best of 3: 16.4 ms per loop

>>> %timeit df[[c for c in df if c.startswith('d')]]  # More 'pythonic'?
# 10 loops, best of 3: 29.2 ms per loop

>>> %timeit df.select(lambda col: col.startswith('d'), axis=1)  # Solution of gbrener.
# 10 loops, best of 3: 21.4 ms per loop

>>> %timeit df.filter(regex=("d.*"))  # Accepted solution.
# 10 loops, best of 3: 40 ms per loop
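
The `%timeit` lines above only run inside IPython or Jupyter. Below is a rough sketch of the same comparison with the standard-library `timeit` module. It leaves out `df.select`, which as far as I know was removed in pandas 1.0, and uses a smaller `n` (my choice) so that it finishes quickly. Timings depend on the machine and the pandas version, so no numbers are shown.

```python
import timeit

import numpy as np
import pandas as pd

n = 1000  # smaller than the 10000 used above, purely to keep this quick
cols = ['{0}_{1}'.format(letters, number)
        for number in range(n) for letters in ('d', 't', 'didi')]
df = pd.DataFrame(np.random.randn(3, n * 3), columns=cols)

# Each candidate is a zero-argument callable so timeit can call it directly.
candidates = {
    "c[0] == 'd'": lambda: df[[c for c in df if c[0] == 'd']],
    "c.startswith('d')": lambda: df[[c for c in df if c.startswith('d')]],
    "filter(regex='d.*')": lambda: df.filter(regex='d.*'),
}

# Run each once up front so the results can be compared for equality.
results = {name: func() for name, func in candidates.items()}

for name, func in candidates.items():
    seconds = timeit.timeit(func, number=10) / 10
    print('{0:<22} {1:.2f} ms per loop'.format(name, seconds * 1000))
```

Checking that all candidates return the same columns before comparing their speed avoids timing a method that silently selects something else.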

Answer by gbrener

Use select:

import pandas as pd

df = pd.DataFrame([[10, 14, 12, 44, 45, 78]], columns=['a', 'b', 'c', 'd1', 'd2', 'd3'])

df.select(lambda col: col.startswith('d'), axis=1)

Result:

   d1  d2  d3
0  44  45  78

This is a nice solution if you're not comfortable with regular expressions.

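
A note for readers on newer pandas: to my knowledge `DataFrame.select` was deprecated in pandas 0.21 and removed in 1.0, so the call above may fail on current versions. One regex-free alternative, sketched here with `.loc` and a boolean mask built from `df.columns.str.startswith` (the names `mask` and `selected` are mine):

```python
import pandas as pd

df = pd.DataFrame([[10, 14, 12, 44, 45, 78]],
                  columns=['a', 'b', 'c', 'd1', 'd2', 'd3'])

# Boolean mask over the column labels: True where the name starts with 'd'.
mask = df.columns.str.startswith('d')

# ':' keeps all rows; the mask keeps only the columns marked True.
selected = df.loc[:, mask]
print(selected)
```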

Answer by prafi

You can also use:

df.filter(regex='^d')
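
The `^` anchor matters when a name merely contains a `d`. A small sketch with made-up column names (`add` and `bd` are my additions for illustration) showing how the unanchored pattern `d.*` and the anchored `^d` can differ:

```python
import pandas as pd

# 'add' and 'bd' contain a 'd' but do not start with one.
df = pd.DataFrame([[1, 2, 3, 4]], columns=['d1', 'add', 'bd', 't'])

unanchored = list(df.filter(regex='d.*').columns)
anchored = list(df.filter(regex='^d').columns)

print(unanchored)  # every name containing a 'd'
print(anchored)    # only names starting with 'd'
```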

Answer by devinbost

Especially on a larger dataset, a vectorized approach is actually MUCH FASTER (by more than two orders of magnitude) and MUCH more readable. I'm providing a screenshot as proof. (Note: apart from the last few lines at the bottom, which I wrote to make my point with a vectorized approach, the code is derived from the answer by @Alexander.)

(Screenshot of the timing results; image not included.)

Here's that code for reference:

import pandas as pd
import numpy as np
n = 10000
cols = ['{0}_{1}'.format(letters, number) 
        for number in range(n) for letters in ('d', 't', 'didi')]
df = pd.DataFrame(np.random.randn(30000, n * 3), columns=cols)

%timeit df[[c for c in df if c[0] == 'd']]

%timeit df[[c for c in df if c.startswith('d')]]

%timeit df.select(lambda col: col.startswith('d'), axis=1)

%timeit df.filter(regex=("d.*"))

%timeit df.filter(like='d')

%timeit df.filter(like='d', axis=1)

%timeit df.filter(regex=("d.*"), axis=1)

%timeit df.columns.map(lambda x: x.startswith("d"))

columnVals = df.columns.map(lambda x: x.startswith("d"))

%timeit df.filter(columnVals, axis=1)
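
One caveat, stated tentatively: the first positional argument of `DataFrame.filter` is `items`, which is matched against the labels themselves. Passing the boolean `columnVals` there may therefore not select the 'd' columns at all, so it is worth checking the shape of that result before trusting its timing. Below is a sketch of a mask-based selection that does use the booleans, via `.loc`. The much smaller frame and the names `mask`, `regex_mask` and `selected` are my assumptions, not part of the answer above.

```python
import numpy as np
import pandas as pd

n = 100  # far smaller than above; a 30000 x 30000 float frame needs several GB
cols = ['{0}_{1}'.format(letters, number)
        for number in range(n) for letters in ('d', 't', 'didi')]
df = pd.DataFrame(np.random.randn(5, n * 3), columns=cols)

# Vectorized string test on the column Index; returns a boolean array.
mask = df.columns.str.startswith('d')
selected = df.loc[:, mask]

# The same idea with a regular expression instead of a fixed prefix.
regex_mask = df.columns.str.contains(r'^d')
selected_by_regex = df.loc[:, regex_mask]

print(selected.shape)
```

Whichever variant is timed, comparing `selected.shape` against the expected number of columns is a cheap sanity check.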