Python: how to filter rows containing a string pattern from a Pandas data frame

Question

Asked by John Knight

Assume we have a data frame in Python Pandas that looks like this:

import pandas as pd
df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': [u'aball', u'bball', u'cnut', u'fball']})

Or, in table form:

ids    vals
aball   1
bball   2
cnut    3
fball   4

How do I filter rows which contain the keyword "ball"? For example, the output should be:

ids    vals
aball   1
bball   2
fball   4

Answer 1

Accepted answer by Amit Verma

In [3]: df[df['ids'].str.contains("ball")]
Out[3]:
     ids  vals
0  aball     1
1  bball     2
3  fball     4

Answer 2

Answered by user3820991

>>> mask = df['ids'].str.contains('ball')    
>>> mask
0     True
1     True
2    False
3     True
Name: ids, dtype: bool

>>> df[mask]
     ids  vals
0  aball     1
1  bball     2
3  fball     4
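
Because the mask is an ordinary Boolean Series, it can be inverted or combined with other conditions before indexing. The following is a minimal sketch of that idea; the variable names without_ball and ball_and_large are illustrative only and do not come from the original answer.

```python
import pandas as pd

df = pd.DataFrame({'vals': [1, 2, 3, 4],
                   'ids': ['aball', 'bball', 'cnut', 'fball']})

# Boolean Series: True where 'ids' contains the substring 'ball'
mask = df['ids'].str.contains('ball')

# Invert the mask with ~ to keep rows that do NOT contain 'ball'
without_ball = df[~mask]

# Combine the mask with another condition using & (note the parentheses)
ball_and_large = df[mask & (df['vals'] > 1)]

print(without_ball)
print(ball_and_large)
```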

Answer 3

Answered by Jubbles

df[df['ids'].str.contains('ball', na = False)] # valid for (at least) pandas version 0.17.1

Step-by-step explanation (from inner to outer):

分步说明（从内到外）：

df['ids'] selects the ids column of the data frame (technically, the object df['ids'] is of type pandas.Series).
df['ids'].str allows us to apply vectorized string methods (e.g., lower, contains) to the Series.
df['ids'].str.contains('ball') checks each element of the Series for whether its value has the string 'ball' as a substring. The result is a Series of Booleans indicating True or False for the presence of a 'ball' substring.
df[df['ids'].str.contains('ball')] applies the Boolean 'mask' to the data frame and returns a new data frame containing the matching records.
na = False treats NA / NaN values as non-matches (False); otherwise the mask may itself contain NaN, and indexing with it can raise a ValueError.
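
To make the last point concrete, here is a small sketch with a missing value in the ids column. The None entry is an assumption added for illustration and is not part of the original question's data. Whether the mask without na = False contains NaN depends on the pandas version and column dtype, so the sketch only shows the na=False path.

```python
import pandas as pd

# Same shape as the question's data, but with one missing id (illustrative)
df = pd.DataFrame({'vals': [1, 2, 3, 4],
                   'ids': ['aball', None, 'cnut', 'fball']})

# na=False makes the missing entry count as "does not contain 'ball'"
mask = df['ids'].str.contains('ball', na=False)

# The mask is purely Boolean, so it can be used for indexing safely
result = df[mask]
print(result)
```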

Answer 4

Answered by Cleb

If you want to set the column you filter on as a new index, you could also consider using .filter; if you want to keep it as a separate column then str.contains is the way to go.

Let's say you have

df = pd.DataFrame({'vals': [1, 2, 3, 4, 5], 'ids': [u'aball', u'bball', u'cnut', u'fball', 'ballxyz']})

       ids  vals
0    aball     1
1    bball     2
2     cnut     3
3    fball     4
4  ballxyz     5

and your plan is to filter all rows in which ids contains ball AND set ids as the new index, you can do

df.set_index('ids').filter(like='ball', axis=0)

which gives

         vals
ids          
aball       1
bball       2
fball       4
ballxyz     5

But filter also allows you to pass a regex, so you could also keep only those rows where the column entry ends with ball. In this case you use

df.set_index('ids').filter(regex='ball$', axis=0)

       vals
ids        
aball     1
bball     2
fball     4

Note that the entry with ballxyz is now excluded, as it starts with ball and does not end with it.

If you want to get all entries that start with ball you can simply use

df.set_index('ids').filter(regex='^ball', axis=0)

yielding

         vals
ids          
ballxyz     5
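
If ids should stay a regular column rather than become the index, the same anchored patterns can be passed to str.contains, which interprets its pattern as a regular expression by default. This is a sketch of that alternative, not part of the original answer; the names ends_with_ball and starts_with_ball are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'vals': [1, 2, 3, 4, 5],
                   'ids': ['aball', 'bball', 'cnut', 'fball', 'ballxyz']})

# '$' anchors the match at the end of each string
ends_with_ball = df[df['ids'].str.contains('ball$')]

# '^' anchors the match at the start of each string
starts_with_ball = df[df['ids'].str.contains('^ball')]

print(ends_with_ball)
print(starts_with_ball)
```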

The same works with columns; all you then need to change is the axis=0part. If you filter based on columns, it would be axis=1.
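
As a rough illustration of the column case, the sketch below selects columns by label with axis=1. The patterns 'val' and '^i' are chosen only to match the example data and are not from the original answer.

```python
import pandas as pd

df = pd.DataFrame({'vals': [1, 2, 3, 4, 5],
                   'ids': ['aball', 'bball', 'cnut', 'fball', 'ballxyz']})

# Keep columns whose label contains the substring 'val'
by_like = df.filter(like='val', axis=1)

# Keep columns whose label matches the regex '^i' (starts with 'i')
by_regex = df.filter(regex='^i', axis=1)

print(by_like.columns.tolist())
print(by_regex.columns.tolist())
```

Note that filter matches on the labels along the chosen axis, not on the cell values.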
