pandas: How to filter a pandas DataFrame by string?

Question

Asked by eh2699

I have a pandas dataframe that I'd like to filter by a specific word (test) in a column. I tried:

df[df[col].str.contains('test')]

But it returns an empty dataframe with just the column names. For the output, I'm looking for a dataframe that'd contain all rows that contain the word 'test'. What can I do?

EDIT (to add samples):

data = pd.read_csv(/...csv)

data has 5 cols, including 'BusinessDescription', and I want to extract all rows that have the word 'dental' (case insensitive) in the BusinessDescription col, so I used:

filtered = data[data['BusinessDescription'].str.contains('dental')==True]

and I get an empty dataframe, with just the header names of the 5 cols.

Answer 1

Answered by jezrael

It seems you need the flags parameter of contains:

import re

filtered = data[data['BusinessDescription'].str.contains('dental', flags = re.IGNORECASE)]

Another solution, thanks to Anton vBR, is to convert to lowercase first:

filtered = data[data['BusinessDescription'].str.lower().str.contains('dental')]

Example:
For future code, I'd recommend naming the DataFrame variable df rather than data when referring to dataframes. That notation is the common convention on SO.

import pandas as pd

data = dict(BusinessDescription=['dental fluss','DENTAL','Dentist'])
df = pd.DataFrame(data)
df[df['BusinessDescription'].str.lower().str.contains('dental')]

  BusinessDescription
0        dental fluss
1              DENTAL
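
As a side note, str.contains also accepts a case argument, so case=False is another way to ignore case without importing re. The following is only a minimal sketch that reuses the sample data from the example above and checks that the three approaches select the same rows; the variable names (by_flags, by_case, by_lower) are made up for illustration.

```python
import re
import pandas as pd

# Same sample data as in the example above
df = pd.DataFrame({'BusinessDescription': ['dental fluss', 'DENTAL', 'Dentist']})
col = df['BusinessDescription']

# Three ways to match 'dental' regardless of case
by_flags = df[col.str.contains('dental', flags=re.IGNORECASE)]  # regex flag
by_case = df[col.str.contains('dental', case=False)]            # built-in case switch
by_lower = df[col.str.lower().str.contains('dental')]           # normalize first

print(by_case)
```

All three keep rows 0 and 1 and drop 'Dentist', since it does not contain the substring 'dental' in any casing.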

Timings:

d = dict(BusinessDescription=['dental fluss','DENTAL','Dentist'])
data = pd.DataFrame(d)
data = pd.concat([data]*10000).reset_index(drop=True)

#print (data)

In [122]: %timeit data[data['BusinessDescription'].str.contains('dental', flags = re.IGNORECASE)]
10 loops, best of 3: 28.9 ms per loop

In [123]: %timeit data[data['BusinessDescription'].str.lower().str.contains('dental')]
10 loops, best of 3: 32.6 ms per loop

Caveat:

Performance really depends on the data: the size of the DataFrame and the number of values matching the condition.

Answer 2

Answered by Nephilim

Keep the column name and the search string enclosed in quotes.

df[df['col'].str.contains('test')]

Thanks

Answer 3

Answered by Jimmys

It also works if you add a condition:

df[df['col'].str.contains('test') == True]
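
One likely reason the comparison helps is missing values: if the column contains NaN, the result of str.contains can carry missing entries as well, and depending on the pandas version such a mask may not be usable for indexing directly. Comparing with True turns those entries into False; passing na=False to str.contains is another way to get the same effect. The sketch below uses hypothetical data (the 'col' values are invented for illustration) and assumes a reasonably recent pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a missing value, for illustration only
df = pd.DataFrame({'col': ['a test row', np.nan, 'other']})

# Option 1: compare with True so that missing results count as "no match"
mask_eq = df['col'].str.contains('test') == True

# Option 2: tell str.contains to treat missing values as False
mask_na = df['col'].str.contains('test', na=False)

with_eq = df[mask_eq]
with_na = df[mask_na]
print(with_na)
```

Both masks keep only the first row here; na=False states the intent more directly than the == True comparison.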
