pandas: how to filter a pandas DataFrame by string?

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/48020296/

Time: 2020-09-14 04:58:59  Source: igfitidea

how to filter pandas dataframe by string?

python regex pandas filter

Asked by eh2699

I have a pandas dataframe that I'd like to filter by a specific word (test) in a column. I tried:

df[df[col].str.contains('test')]

But it returns an empty dataframe with just the column names. For the output, I'm looking for a dataframe that'd contain all rows that contain the word 'test'. What can I do?

EDIT (to add samples):

data = pd.read_csv(/...csv)

data has 5 columns, including 'BusinessDescription', and I want to extract all rows that have the word 'dental' (case insensitive) in the BusinessDescription column, so I used:

filtered = data[data['BusinessDescription'].str.contains('dental')==True]

and I get an empty dataframe, with just the header names of the 5 columns.

Answered by jezrael

It seems you need the flags parameter of contains:

import re

filtered = data[data['BusinessDescription'].str.contains('dental', flags = re.IGNORECASE)]

Another solution, thanks to Anton vBR, is to convert to lowercase first:

filtered = data[data['BusinessDescription'].str.lower().str.contains('dental')]


Example:
For future programming I'd recommend using the name df instead of data when referring to DataFrames. That notation is the common convention on SO.

import pandas as pd

data = dict(BusinessDescription=['dental fluss','DENTAL','Dentist'])
df = pd.DataFrame(data)
df[df['BusinessDescription'].str.lower().str.contains('dental')]

  BusinessDescription
0        dental fluss
1              DENTAL
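
As an aside to the two solutions above, str.contains also takes a case argument, and case=False should give the same case-insensitive match without importing re. The sketch below reuses the sample column and adds one hypothetical row ('Accidental damage', not in the original data) to show that a plain substring match also hits words that merely contain 'dental'. The word-boundary pattern is an assumption about what "the word 'dental'" means in the question, not part of the original answer.

```python
import pandas as pd

# Sample data from the example above, plus one hypothetical extra row
df = pd.DataFrame({
    'BusinessDescription': ['dental fluss', 'DENTAL', 'Dentist', 'Accidental damage']
})

# case=False makes the match case-insensitive (no re import needed)
by_substring = df[df['BusinessDescription'].str.contains('dental', case=False)]

# \b word boundaries (patterns are treated as regex by default) skip 'Accidental'
by_word = df[df['BusinessDescription'].str.contains(r'\bdental\b', case=False)]

print(by_substring)
print(by_word)
```

If only whole words should match, the second form is probably the safer choice; otherwise the substring form is simpler.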

Timings:

d = dict(BusinessDescription=['dental fluss','DENTAL','Dentist'])
data = pd.DataFrame(d)
data = pd.concat([data]*10000).reset_index(drop=True)

#print (data)

In [122]: %timeit data[data['BusinessDescription'].str.contains('dental', flags = re.IGNORECASE)]
10 loops, best of 3: 28.9 ms per loop

In [123]: %timeit data[data['BusinessDescription'].str.lower().str.contains('dental')]
10 loops, best of 3: 32.6 ms per loop

Caveat:

Performance really depends on the data: the size of the DataFrame and the number of values matching the condition.
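
Since the numbers depend on the data, it may be worth repeating the measurement on your own DataFrame. A minimal sketch of how that could be done outside IPython with the standard timeit module, using the same sample data as in the timings above; the helper names with_flags and with_lower are made up for this illustration:

```python
import re
import timeit

import pandas as pd

d = dict(BusinessDescription=['dental fluss', 'DENTAL', 'Dentist'])
data = pd.concat([pd.DataFrame(d)] * 10000).reset_index(drop=True)
col = data['BusinessDescription']


def with_flags():
    # Case-insensitive regex match
    return data[col.str.contains('dental', flags=re.IGNORECASE)]


def with_lower():
    # Lowercase first, then a case-sensitive match
    return data[col.str.lower().str.contains('dental')]


# Check that both approaches select the same rows before comparing speed
same_rows = with_flags().equals(with_lower())

# Best of 3 repeats, 10 calls each, reported as time per call
t_flags = min(timeit.repeat(with_flags, number=10, repeat=3)) / 10
t_lower = min(timeit.repeat(with_lower, number=10, repeat=3)) / 10
print(f'same rows: {same_rows}')
print(f'flags: {t_flags * 1e3:.1f} ms, lower: {t_lower * 1e3:.1f} ms')
```

Replace data with your real DataFrame to see which variant is faster for your mix of matching and non-matching rows.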

Answered by Nephilim

Keep the string enclosed in quotes.

df[df['col'].str.contains('test')]

Thanks

Answered by Jimmys

It also works OK if you add a condition:

df[df['col'].str.contains('test') == True]
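
One likely reason the == True comparison helps is missing values. On an object-dtype column, str.contains can return NaN for rows that are NaN, and as far as I know indexing with such a mask directly raises an error in many pandas versions. Comparing with True turns those entries into False. The na=False argument of str.contains should have the same effect. A small sketch with made-up data (the column contents are an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value
df = pd.DataFrame({'col': ['a test', np.nan, 'other']})

# Depending on the pandas version, the NaN row may come back as NaN rather than False
mask = df['col'].str.contains('test')

# Comparing with True yields a plain boolean mask (missing entries become False)
via_compare = df[mask == True]

# na=False states the same intent explicitly inside str.contains
via_na = df[df['col'].str.contains('test', na=False)]

print(via_compare)
print(via_na)
```

Both variants should keep only the first row here; na=False is arguably easier to read than the comparison.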