pandas: best way to select all columns whose names start with X

Original question on Stack Overflow: http://stackoverflow.com/questions/27275236/
This content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow and keep the same license.

Tags: python, pandas, dataframe, selection

Asked by ccsv

I have a DataFrame:

import pandas as pd
import numpy as np

df = pd.DataFrame({'foo.aa': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
                   'foo.fighters': [0, 1, np.nan, 0, 0, 0],
                   'foo.bars': [0, 0, 0, 0, 0, 1],
                   'bar.baz': [5, 5, 6, 5, 5.6, 6.8],
                   'foo.fox': [2, 4, 1, 0, 0, 5],
                   'nas.foo': ['NA', 0, 1, 0, 0, 0],
                   'foo.manchu': ['NA', 0, 0, 0, 0, 0],})

I want to select the rows with a value of 1 in any column whose name starts with 'foo.' Is there a better way to do it than:

df2 = df[(df['foo.aa'] == 1)|
(df['foo.fighters'] == 1)|
(df['foo.bars'] == 1)|
(df['foo.fox'] == 1)|
(df['foo.manchu'] == 1)
]

Ideally it would look something like this:

df2= df[df.STARTS_WITH_FOO == 1]

The answer should print out a DataFrame like this:

   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
0      5.0     1.0         0             0        2         NA      NA
1      5.0     2.1         0             1        4          0       0
2      6.0     NaN         0           NaN        1          0       1
5      6.8     6.8         1             0        5          0       0

[4 rows x 7 columns]

Accepted answer by EdChum

Just perform a list comprehension to create your columns:

In [28]:

filter_col = [col for col in df if col.startswith('foo')]
filter_col
Out[28]:
['foo.aa', 'foo.bars', 'foo.fighters', 'foo.fox', 'foo.manchu']
In [29]:

df[filter_col]
Out[29]:
   foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
0     1.0         0             0        2         NA
1     2.1         0             1        4          0
2     NaN         0           NaN        1          0
3     4.7         0             0        0          0
4     5.6         0             0        0          0
5     6.8         1             0        5          0

Another method is to create a series from the columns and use the vectorised str method startswith:

In [33]:

df[df.columns[pd.Series(df.columns).str.startswith('foo')]]
Out[33]:
   foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
0     1.0         0             0        2         NA
1     2.1         0             1        4          0
2     NaN         0           NaN        1          0
3     4.7         0             0        0          0
4     5.6         0             0        0          0
5     6.8         1             0        5          0

To achieve what you want, you need to add the following to filter out the values that don't meet your == 1 criterion:

In [36]:

df[df[df.columns[pd.Series(df.columns).str.startswith('foo')]]==1]
Out[36]:
   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
0      NaN       1       NaN           NaN      NaN        NaN     NaN
1      NaN     NaN       NaN             1      NaN        NaN     NaN
2      NaN     NaN       NaN           NaN        1        NaN     NaN
3      NaN     NaN       NaN           NaN      NaN        NaN     NaN
4      NaN     NaN       NaN           NaN      NaN        NaN     NaN
5      NaN     NaN         1           NaN      NaN        NaN     NaN

EDIT

OK, after seeing what you want, the (convoluted) answer is this:

In [72]:

df.loc[df[df[df.columns[pd.Series(df.columns).str.startswith('foo')]] == 1].dropna(how='all', axis=0).index]
Out[72]:
   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
0      5.0     1.0         0             0        2         NA      NA
1      5.0     2.1         0             1        4          0       0
2      6.0     NaN         0           NaN        1          0       1
5      6.8     6.8         1             0        5          0       0
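
The same rows can be reached with a shorter expression. The following is a minimal sketch, assuming the df from the question; the names foo_cols and rows_with_one are illustrative only. It reuses the list comprehension from above, compares only those columns with 1, and keeps the rows where any comparison is true:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'foo.aa': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
                   'foo.fighters': [0, 1, np.nan, 0, 0, 0],
                   'foo.bars': [0, 0, 0, 0, 0, 1],
                   'bar.baz': [5, 5, 6, 5, 5.6, 6.8],
                   'foo.fox': [2, 4, 1, 0, 0, 5],
                   'nas.foo': ['NA', 0, 1, 0, 0, 0],
                   'foo.manchu': ['NA', 0, 0, 0, 0, 0]})

# Column names that start with the literal prefix 'foo.'
foo_cols = [col for col in df if col.startswith('foo.')]

# True for each row with at least one value equal to 1 in those columns
rows_with_one = (df[foo_cols] == 1).any(axis=1)

# A Boolean row mask keeps every column of the matching rows
df2 = df[rows_with_one]
print(df2.index.tolist())  # [0, 1, 2, 5]
```

In recent pandas versions the columns keep the insertion order of the dict, so the column order of df2 may differ from the sorted order in the output above.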

Answer by Alex Riley

Now that pandas' indexes support string operations, arguably the simplest and best way to select columns beginning with 'foo' is just:

df.loc[:, df.columns.str.startswith('foo')]
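
To see what that indexer contains, here is a small sketch. It assumes a cut-down stand-in frame (only the column names matter), and the names mask and foo_only are illustrative:

```python
import pandas as pd

# Small stand-in frame; only the column names matter here
df = pd.DataFrame({'foo.aa': [1.0, 2.1],
                   'bar.baz': [5.0, 5.0],
                   'foo.fox': [2, 4],
                   'nas.foo': ['NA', 0]})

# One Boolean per column: True where the name starts with 'foo'
mask = df.columns.str.startswith('foo')
print(mask.tolist())  # [True, False, True, False]

# ':' keeps every row; the mask keeps only the matching columns
foo_only = df.loc[:, mask]
print(list(foo_only.columns))  # ['foo.aa', 'foo.fox']
```

Note that 'nas.foo' is not selected: startswith only looks at the beginning of each name.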


Alternatively, you can filter column (or row) labels with df.filter(). To specify a regular expression to match the names beginning with foo.:

>>> df.filter(regex=r'^foo\.', axis=1)
   foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
0     1.0         0             0        2         NA
1     2.1         0             1        4          0
2     NaN         0           NaN        1          0
3     4.7         0             0        0          0
4     5.6         0             0        0          0
5     6.8         1             0        5          0

To select only the required rows (those containing a 1) and the columns, you can use loc, selecting the columns using filter (or any other method) and the rows using any:

>>> df.loc[(df == 1).any(axis=1), df.filter(regex=r'^foo\.', axis=1).columns]
   foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
0     1.0         0             0        2         NA
1     2.1         0             1        4          0
2     NaN         0           NaN        1          0
5     6.8         1             0        5          0
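
One thing to be aware of: (df == 1) tests every column, not just the foo ones. With the question's data the result happens to be the same, but it need not be in general. The sketch below uses a hypothetical three-row frame in which a non-foo column also contains a 1, and shows how to restrict the row test to the filtered columns; the variable names are illustrative:

```python
import pandas as pd

# Hypothetical frame in which a non-foo column also contains a 1
df = pd.DataFrame({'foo.aa': [1, 0, 0],
                   'foo.bars': [0, 0, 0],
                   'bar.baz': [0, 1, 0]})

# Columns whose names start with the literal prefix 'foo.'
foo = df.filter(regex=r'^foo\.', axis=1)

# Row test over every column: row 1 qualifies because of bar.baz
any_column = df.loc[(df == 1).any(axis=1), foo.columns]
print(any_column.index.tolist())  # [0, 1]

# Row test over the foo columns only: just row 0 qualifies
foo_only = df.loc[(foo == 1).any(axis=1), foo.columns]
print(foo_only.index.tolist())  # [0]
```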

Answer by Robbie Liu

My solution, which may be slower than the others:

a = pd.concat(df[df[c] == 1] for c in df.columns if c.startswith('foo'))
a.sort_index()


   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
0      5.0     1.0         0             0        2         NA      NA
1      5.0     2.1         0             1        4          0       0
2      6.0     NaN         0           NaN        1          0       1
5      6.8     6.8         1             0        5          0       0
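
In the question's data no row has a 1 in more than one foo column, so each row appears once. If a row matched in several columns, concat would return it once per match. The following sketch, using a hypothetical frame built to trigger that case, shows the duplication and one possible way to drop the repeats by index label:

```python
import pandas as pd

# Hypothetical frame: row 0 has a 1 in two different foo columns
df = pd.DataFrame({'foo.aa': [1, 0, 0],
                   'foo.bars': [1, 1, 0],
                   'bar.baz': [5, 6, 7]})

# One piece per foo column, holding the rows where that column equals 1
pieces = [df[df[c] == 1] for c in df.columns if c.startswith('foo')]
a = pd.concat(pieces)
print(a.index.tolist())  # [0, 0, 1]

# Keep the first occurrence of each index label, then restore row order
deduped = a[~a.index.duplicated()].sort_index()
print(deduped.index.tolist())  # [0, 1]
```

This assumes the original index labels are unique; otherwise genuinely distinct rows sharing a label would also be dropped.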

Answer by Cleb

Another option for the selection of the desired entries is to use map:

df.loc[(df == 1).any(axis=1), df.columns.map(lambda x: x.startswith('foo'))]

which gives you the foo columns for the rows that contain a 1:

   foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
0     1.0         0             0        2         NA
1     2.1         0             1        4          0
2     NaN         0           NaN        1          0
5     6.8         1             0        5          0

The row selection is done by

(df == 1).any(axis=1)

as in @ajcr's answer, which gives you:

0     True
1     True
2     True
3    False
4    False
5     True
dtype: bool

meaning that rows 3 and 4 do not contain a 1 and won't be selected.

The selection of the columns is done using Boolean indexing like this:

df.columns.map(lambda x: x.startswith('foo'))

In the example above this returns

array([False,  True,  True,  True,  True,  True, False], dtype=bool)

So, if a column name does not start with foo, False is returned and the column is therefore not selected.

If you just want to return all rows that contain a 1, as your desired output suggests, you can simply do

df.loc[(df == 1).any(axis=1)]

which returns

   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
0      5.0     1.0         0             0        2         NA      NA
1      5.0     2.1         0             1        4          0       0
2      6.0     NaN         0           NaN        1          0       1
5      6.8     6.8         1             0        5          0       0
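
Because map accepts any Python callable, the same pattern extends to conditions beyond a single prefix. A minimal sketch, assuming a small stand-in frame and a hypothetical requirement to keep columns starting with either 'foo.' or 'bar.'. The mask is converted to a NumPy Boolean array as a precaution, since the type returned by map may differ between pandas versions:

```python
import numpy as np
import pandas as pd

# Small stand-in frame; the second prefix is a made-up requirement
df = pd.DataFrame({'foo.aa': [1, 0],
                   'bar.baz': [0, 1],
                   'nas.foo': [0, 0]})

# Python's str.startswith accepts a tuple of prefixes
wanted = df.columns.map(lambda name: name.startswith(('foo.', 'bar.')))

# Plain Boolean array, usable as a column indexer in loc
mask = np.asarray(wanted, dtype=bool)

selected = df.loc[:, mask]
print(list(selected.columns))  # ['foo.aa', 'bar.baz']
```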

Answer by Arturo Sbr

Based on @EdChum's answer, you can try the following solution:

df[df.columns[pd.Series(df.columns).str.contains("foo")]]

This is helpful when not all of the columns you want to select start with foo. The method selects every column whose name contains the substring foo, wherever in the name it appears.

In essence, I replaced .startswith() with .contains().
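
One caveat worth knowing: str.contains treats its pattern as a regular expression by default, so a dot in the pattern matches any character. The sketch below illustrates this on a hypothetical set of column names ('food_cost' is invented for the example) and also shows df.filter(like=...), which does a plain substring match on the labels:

```python
import pandas as pd

# Hypothetical column names; 'food_cost' is made up for illustration
df = pd.DataFrame({'foo.aa': [1], 'bar.baz': [2],
                   'nas.foo': [3], 'food_cost': [4]})
names = df.columns

# Substring 'foo' anywhere in the name
print(names.str.contains('foo').tolist())
# [True, False, True, True]

# Default regex mode: '.' matches any character, so 'food_cost' matches
print(names.str.contains('foo.').tolist())
# [True, False, False, True]

# Literal match: only names containing the exact text 'foo.'
print(names.str.contains('foo.', regex=False).tolist())
# [True, False, False, False]

# filter(like=...) keeps labels that contain the given substring
print(list(df.filter(like='foo', axis=1).columns))
# ['foo.aa', 'nas.foo', 'food_cost']
```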

Answer by mohammed Elsiddieg

The simplest way is to use str directly on the column names; there is no need for pd.Series:

df.loc[:,df.columns.str.startswith("foo")]