Original source: http://stackoverflow.com/questions/48020296/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
How to filter a pandas DataFrame by string?
Asked by eh2699
I have a pandas dataframe that I'd like to filter by a specific word (test) in a column. I tried:
df[df[col].str.contains('test')]
But it returns an empty dataframe with just the column names. For the output, I'm looking for a dataframe that'd contain all rows that contain the word 'test'. What can I do?
EDIT (to add samples):
data = pd.read_csv(/...csv)
data has 5 columns, including 'BusinessDescription', and I want to extract all rows that have the word 'dental' (case insensitive) in the BusinessDescription column, so I used:
filtered = data[data['BusinessDescription'].str.contains('dental')==True]
and I get an empty dataframe, with just the header names of the 5 cols.
Answered by jezrael
It seems you need the flags parameter in contains, because the match is case sensitive by default:
import re
filtered = data[data['BusinessDescription'].str.contains('dental', flags = re.IGNORECASE)]
Another solution, thanks to Anton vBR, is to convert to lowercase first:
filtered = data[data['BusinessDescription'].str.lower().str.contains('dental')]
Example:
For future code, I'd recommend naming the variable df rather than data when referring to DataFrames. That is the common convention on SO.
import pandas as pd
data = dict(BusinessDescription=['dental fluss','DENTAL','Dentist'])
df = pd.DataFrame(data)
df[df['BusinessDescription'].str.lower().str.contains('dental')]
BusinessDescription
0 dental fluss
1 DENTAL
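As a minimal sketch of why the original attempt can miss rows, the snippet below compares the default case-sensitive match with three case-insensitive variants. The sample data mirrors the example above. The case=False argument is not used in the answers here; it is shown as an assumption about what may be convenient, and the variable names are illustrative only.

```python
import re

import pandas as pd

# Sample data mirroring the example above.
df = pd.DataFrame({"BusinessDescription": ["dental fluss", "DENTAL", "Dentist"]})
col = df["BusinessDescription"]

# Default matching is case sensitive, so 'DENTAL' is not selected.
case_sensitive = df[col.str.contains("dental")]

# Three ways to ignore case; on this data they select the same rows.
with_flags = df[col.str.contains("dental", flags=re.IGNORECASE)]
with_lower = df[col.str.lower().str.contains("dental")]
with_case = df[col.str.contains("dental", case=False)]

print(len(case_sensitive))
print(with_case.index.tolist())
```

If every value in the real column is capitalised differently from the search term, the case-sensitive version selects nothing, which would match the empty result described in the question.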
Timings:
d = dict(BusinessDescription=['dental fluss','DENTAL','Dentist'])
data = pd.DataFrame(d)
data = pd.concat([data]*10000).reset_index(drop=True)
#print (data)
In [122]: %timeit data[data['BusinessDescription'].str.contains('dental', flags = re.IGNORECASE)]
10 loops, best of 3: 28.9 ms per loop
In [123]: %timeit data[data['BusinessDescription'].str.lower().str.contains('dental')]
10 loops, best of 3: 32.6 ms per loop
Caveat:
Performance really depends on the data: the size of the DataFrame and the number of values matching the condition.
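Since the result depends on the data, it may be worth repeating the comparison on your own DataFrame. The sketch below uses the standard timeit module instead of the IPython %timeit magic so that it runs as a plain script. The data is the same repeated sample as in the timings above, and the helper names with_flags and with_lower are hypothetical.

```python
import re
import timeit

import pandas as pd

# The three sample strings repeated 10000 times, as in the timings above.
base = pd.DataFrame({"BusinessDescription": ["dental fluss", "DENTAL", "Dentist"]})
data = pd.concat([base] * 10000).reset_index(drop=True)
col = data["BusinessDescription"]


def with_flags():
    # Case-insensitive regex match.
    return data[col.str.contains("dental", flags=re.IGNORECASE)]


def with_lower():
    # Lowercase the column first, then do a case-sensitive match.
    return data[col.str.lower().str.contains("dental")]


# Check that both approaches select the same rows before comparing speed.
assert with_flags().equals(with_lower())

# Absolute numbers depend on the machine, the data, and the pandas version.
for func in (with_flags, with_lower):
    seconds = timeit.timeit(func, number=10)
    print(f"{func.__name__}: {seconds / 10 * 1000:.1f} ms per call")
```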
Answered by Nephilim
Keep the string enclosed in quotes.
df[df['col'].str.contains('test')]
Thanks
Answered by Jimmys
It also works if you add a condition:
df[df['col'].str.contains('test') == True]
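One plausible reason the == True comparison helps is missing data: in many pandas versions, str.contains gives a missing result for a missing value, and comparing with True turns that into False so the mask is purely boolean. The sketch below assumes a hypothetical column with one missing entry and shows the na=False argument of str.contains as an alternative that states the same intent directly.

```python
import pandas as pd

# Hypothetical column containing one missing value.
df = pd.DataFrame({"col": ["a test row", None, "other"]})

# Comparing with True maps any missing result to False.
mask_eq = df["col"].str.contains("test") == True  # noqa: E712

# na=False asks str.contains to treat missing values as non-matches.
mask_na = df["col"].str.contains("test", na=False)

print(df[mask_na])
```

Both masks select only the first row here; na=False avoids the comparison with True, which some linters flag.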