How to filter rows containing a string pattern from a Pandas dataframe
Source: http://stackoverflow.com/questions/27975069/
Notice: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Asked by John Knight
Assume we have a data frame in Python Pandas that looks like this:
import pandas as pd
df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': [u'aball', u'bball', u'cnut', u'fball']})
Or, in table form:
ids vals
aball 1
bball 2
cnut 3
fball 4
How do I filter rows which contain the keyword "ball"? For example, the output should be:
ids vals
aball 1
bball 2
fball 4
Accepted answer by Amit Verma
In [3]: df[df['ids'].str.contains("ball")]
Out[3]:
ids vals
0 aball 1
1 bball 2
3 fball 4
Answer by user3820991
>>> mask = df['ids'].str.contains('ball')
>>> mask
0 True
1 True
2 False
3 True
Name: ids, dtype: bool
>>> df[mask]
ids vals
0 aball 1
1 bball 2
3 fball 4
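As a self-contained sketch of the mask approach above (the variable names with_ball and without_ball are only illustrative, not from the original answer), the same Boolean Series can be reused to select either the matching rows or, by inverting it with ~, the non-matching ones:

```python
import pandas as pd

df = pd.DataFrame({'ids': ['aball', 'bball', 'cnut', 'fball'],
                   'vals': [1, 2, 3, 4]})

# Boolean Series: True where the id contains the substring 'ball'.
mask = df['ids'].str.contains('ball')

with_ball = df[mask]       # rows whose id contains 'ball'
without_ball = df[~mask]   # ~ inverts the mask: rows whose id does not

print(with_ball)
print(without_ball)
```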
Answer by Jubbles
df[df['ids'].str.contains('ball', na = False)] # valid for (at least) pandas version 0.17.1
Step-by-step explanation (from inner to outer):
df['ids'] selects the ids column of the data frame (technically, the object df['ids'] is of type pandas.Series).
df['ids'].str allows us to apply vectorized string methods (e.g., lower, contains) to the Series.
df['ids'].str.contains('ball') checks each element of the Series for whether its value has the string 'ball' as a substring. The result is a Series of Booleans indicating True or False for the presence of a 'ball' substring.
df[df['ids'].str.contains('ball')] applies the Boolean 'mask' to the data frame and returns a new data frame containing the matching records.
na = False treats NA / NaN values as non-matches; otherwise the mask may contain NaN and a ValueError may be raised when it is used for indexing.
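To make the role of na=False concrete, here is a minimal sketch assuming a column that contains one missing value (the sample data is made up for illustration). With na=False the missing entry simply counts as a non-match, so the mask is purely Boolean and safe to index with; without it, depending on the pandas version and column dtype, the mask can contain NaN at that position.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: the second id is missing.
df = pd.DataFrame({'ids': ['aball', np.nan, 'cnut', 'fball'],
                   'vals': [1, 2, 3, 4]})

# na=False fills the result for missing ids with False ("no match").
mask = df['ids'].str.contains('ball', na=False)

result = df[mask]
print(mask)
print(result)
```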
Answer by Cleb
If you want to set the column you filter on as a new index, you could also consider using .filter; if you want to keep it as a separate column, then str.contains is the way to go.
Let's say you have
df = pd.DataFrame({'vals': [1, 2, 3, 4, 5], 'ids': [u'aball', u'bball', u'cnut', u'fball', 'ballxyz']})
ids vals
0 aball 1
1 bball 2
2 cnut 3
3 fball 4
4 ballxyz 5
and your plan is to filter all rows in which ids contains ball AND set ids as the new index, you can do
df.set_index('ids').filter(like='ball', axis=0)
which gives
vals
ids
aball 1
bball 2
fball 4
ballxyz 5
But filter also allows you to pass a regex, so you could also keep only those rows where the column entry ends with ball. In this case you use
df.set_index('ids').filter(regex='ball$', axis=0)
vals
ids
aball 1
bball 2
fball 4
Note that the entry ballxyz is now not included, as it starts with ball and does not end with it.
If you want to get all entries that start with ball, you can simply use
df.set_index('ids').filter(regex='^ball', axis=0)
yielding
vals
ids
ballxyz 5
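If you would rather keep ids as an ordinary column, the same anchored patterns should also work with str.contains, which treats its pattern as a regular expression by default. This is only a sketch of that alternative; the names ends and starts are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'ids': ['aball', 'bball', 'cnut', 'fball', 'ballxyz'],
                   'vals': [1, 2, 3, 4, 5]})

# 'ball$' matches ids that end with 'ball'; ids stays a regular column.
ends = df[df['ids'].str.contains('ball$')]

# '^ball' matches ids that start with 'ball'.
starts = df[df['ids'].str.contains('^ball')]

print(ends)
print(starts)
```

For plain prefixes and suffixes like these, str.startswith and str.endswith are a regex-free option.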
The same works with columns; all you then need to change is the axis=0 part. If you filter based on columns, it would be axis=1.
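A small sketch of the column case, assuming a made-up frame (called wide here) whose column labels reuse the ids from above:

```python
import pandas as pd

# Hypothetical frame: the strings are now column labels, not row labels.
wide = pd.DataFrame({'aball': [1, 2], 'cnut': [3, 4], 'ballxyz': [5, 6]})

# axis=1 applies the label filter to columns instead of the index.
contains_ball = wide.filter(like='ball', axis=1)
ends_with_ball = wide.filter(regex='ball$', axis=1)

print(contains_ball)
print(ends_with_ball)
```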

