Python: how to select columns from a dataframe by regular expression

Question

Asked by Yan Song

I have a dataframe in Python pandas. Its structure is as follows:

   a    b    c    d1   d2   d3
   10   14   12   44   45   78

I would like to select the columns whose names begin with d. Is there a simple way to achieve this in Python?

Answer 1

Accepted answer by farhawa

You can use DataFrame.filter this way:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[2,4,4],[4,3,3],[5,9,1]]),columns=['d','t','didi'])
>>
   d  t  didi
0  2  4     4
1  4  3     3
2  5  9     1

df.filter(regex=("d.*"))

>>
   d  didi
0  2     4
1  4     3
2  5     1

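The idea is to select columns by regex. Note that filter matches each label with re.search, so "d.*" keeps any column whose name contains a d; use "^d" to keep only names that begin with d.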
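The difference between an unanchored and an anchored pattern can be checked directly. The following is a minimal sketch, assuming a reasonably recent pandas; the extra column 'add' is a hypothetical addition that is not part of the original answer, included only because its name contains a d without starting with one.

```python
import pandas as pd

# Same frame as above plus a hypothetical 'add' column (not in the original
# answer) whose name contains 'd' but does not start with it.
df = pd.DataFrame([[2, 4, 4, 7], [4, 3, 3, 8], [5, 9, 1, 9]],
                  columns=['d', 't', 'didi', 'add'])

# filter() uses re.search, so "d.*" matches a 'd' anywhere in the label.
contains_d = df.filter(regex="d.*")

# Anchoring with ^ restricts the match to labels that begin with 'd'.
starts_with_d = df.filter(regex="^d")

print(list(contains_d.columns))     # ['d', 'didi', 'add']
print(list(starts_with_d.columns))  # ['d', 'didi']
```

For the column names in the question (a, b, c, d1, d2, d3) both patterns happen to give the same result, which is why the unanchored form appears to work.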

Answer 2

Answer by Alexander

You can use a list comprehension to iterate over all of the column names in your DataFrame df and then select only those that begin with 'd'.

df = pd.DataFrame({'a': {0: 10}, 'b': {0: 14}, 'c': {0: 12},
                   'd1': {0: 44}, 'd2': {0: 45}, 'd3': {0: 78}})

Use a list comprehension to iterate over the columns in the dataframe and return their names (c below is a local variable representing the column name).

>>> [c for c in df]
['a', 'b', 'c', 'd1', 'd2', 'd3']

Then select only those beginning with 'd'.

>>> [c for c in df if c[0] == 'd']  # As an alternative to c[0], use c.startswith(...)
['d1', 'd2', 'd3']

Finally, pass this list of columns to the DataFrame.

>>> df[[c for c in df if c.startswith('d')]]
   d1  d2  d3
0  44  45  78

===========================================================================


TIMINGS (added Feb 2018 per comments from devinbost claiming that this method is slow...)

First, let's create a dataframe with 30k columns:

n = 10000
cols = ['{0}_{1}'.format(letters, number) 
        for number in range(n) for letters in ('d', 't', 'didi')]
df = pd.DataFrame(np.random.randn(3, n * 3), columns=cols)
>>> df.shape
(3, 30000)

>>> %timeit df[[c for c in df if c[0] == 'd']]  # Simple list comprehension.
# 10 loops, best of 3: 16.4 ms per loop

>>> %timeit df[[c for c in df if c.startswith('d')]]  # More 'pythonic'?
# 10 loops, best of 3: 29.2 ms per loop

>>> %timeit df.select(lambda col: col.startswith('d'), axis=1)  # Solution of gbrener (DataFrame.select: deprecated in pandas 0.21, later removed).
# 10 loops, best of 3: 21.4 ms per loop

>>> %timeit df.filter(regex=("d.*"))  # Accepted solution.
# 10 loops, best of 3: 40 ms per loop
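The %timeit lines above only run inside IPython or Jupyter. As a rough sketch of how the same comparison could be run from a plain script, the standard timeit module can be used instead. The smaller n, the candidates dictionary, and the anchored '^d' pattern are choices made for this illustration, not part of the original answer, and the measured times will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller than the 30k-column frame above so the sketch runs quickly;
# n is an arbitrary choice for illustration.
n = 1000
cols = ['{0}_{1}'.format(letters, number)
        for number in range(n) for letters in ('d', 't', 'didi')]
df = pd.DataFrame(np.random.randn(3, n * 3), columns=cols)

# Each candidate is a zero-argument callable so timeit can invoke it directly.
candidates = {
    "c[0] == 'd'": lambda: df[[c for c in df if c[0] == 'd']],
    "c.startswith('d')": lambda: df[[c for c in df if c.startswith('d')]],
    "filter(regex='^d')": lambda: df.filter(regex='^d'),
}

# Record the shape each candidate produces so the results can be compared.
shapes = {name: fn().shape for name, fn in candidates.items()}

for name, fn in candidates.items():
    # Best of 3 repeats of 10 calls each, reported per call in milliseconds.
    best = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    print('{0:<22} {1:.3f} ms'.format(name, best * 1e3))
```

Checking that every candidate returns the same shape before comparing speed guards against timing a method that silently selects the wrong columns.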

Answer 3

Answer by gbrener

Use select (older pandas only; DataFrame.select was deprecated in 0.21 and later removed):

import pandas as pd

df = pd.DataFrame([[10, 14, 12, 44, 45, 78]], columns=['a', 'b', 'c', 'd1', 'd2', 'd3'])

df.select(lambda col: col.startswith('d'), axis=1)

Result:

   d1  d2  d3
0  44  45  78

This is a nice solution if you're not comfortable with regular expressions.
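On pandas versions that no longer provide DataFrame.select, the same predicate-based selection can be sketched with .loc and a boolean list. The helper name select_columns below is hypothetical and introduced only for illustration; it is not a pandas function.

```python
import pandas as pd

df = pd.DataFrame([[10, 14, 12, 44, 45, 78]],
                  columns=['a', 'b', 'c', 'd1', 'd2', 'd3'])


def select_columns(frame, predicate):
    """Hypothetical helper: keep columns whose label satisfies predicate."""
    # Build a boolean list, one entry per column label, and let .loc apply it.
    mask = [bool(predicate(col)) for col in frame.columns]
    return frame.loc[:, mask]


result = select_columns(df, lambda col: col.startswith('d'))
print(result)
```

This keeps the plain-function style of the answer above, so no regular expression is needed.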

Answer 4

Answer by prafi

You can also use:

df.filter(regex='^d')

Answer 5

Answer by devinbost

On a larger dataset especially, a vectorized approach is actually much faster (by more than two orders of magnitude) and much more readable. I'm providing a screenshot as proof. (Note: except for the last few lines at the bottom, which I wrote to make my point about the vectorized approach, the code is derived from the answer by @Alexander.)

Here's that code for reference:

import pandas as pd
import numpy as np
n = 10000
cols = ['{0}_{1}'.format(letters, number) 
        for number in range(n) for letters in ('d', 't', 'didi')]
# Note: 30000 rows x 30000 columns of float64 needs roughly 7.2 GB of RAM.
df = pd.DataFrame(np.random.randn(30000, n * 3), columns=cols)

%timeit df[[c for c in df if c[0] == 'd']]

%timeit df[[c for c in df if c.startswith('d')]]

# Note: DataFrame.select was deprecated in pandas 0.21 and later removed.
%timeit df.select(lambda col: col.startswith('d'), axis=1)

%timeit df.filter(regex=("d.*"))

%timeit df.filter(like='d')

%timeit df.filter(like='d', axis=1)

%timeit df.filter(regex=("d.*"), axis=1)

%timeit df.columns.map(lambda x: x.startswith("d"))

columnVals = df.columns.map(lambda x: x.startswith("d"))

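# Note: the first positional argument of filter() is `items`, a list of labels,
# so columnVals is treated as labels here rather than as a boolean mask.
%timeit df.filter(columnVals, axis=1)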
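The last lines above compute a boolean value per column name. One way to apply such a mask is through .loc, sketched below under the assumption that a much smaller frame is acceptable for illustration (n and the row count are arbitrary choices, not from the original answer). The sketch uses the vectorized string accessor df.columns.str.startswith rather than map with a lambda, and cross-checks the result against the list-comprehension approach.

```python
import numpy as np
import pandas as pd

n = 1000  # smaller than above to keep memory use modest; arbitrary choice
cols = ['{0}_{1}'.format(letters, number)
        for number in range(n) for letters in ('d', 't', 'didi')]
df = pd.DataFrame(np.random.randn(5, n * 3), columns=cols)

# Vectorized string test over the column Index; returns a boolean NumPy array.
mask = df.columns.str.startswith('d')

# .loc interprets a boolean array as a mask along the chosen axis.
by_mask = df.loc[:, mask]

# Cross-check against the list-comprehension approach.
by_listcomp = df[[c for c in df if c.startswith('d')]]
print(by_mask.shape)
print(by_mask.equals(by_listcomp))
```

Comparing the two results before timing them confirms that a faster method is also selecting the intended columns.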
