Notice: this page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original address, and attribute it to the original authors (not me). Original Stack Overflow address: http://stackoverflow.com/questions/38333582/

Time: 2020-09-14 01:33:52  Source: igfitidea

Search for a partial string match in a data frame column from a list - Pandas - Python

python, pandas

Asked by Eric Coy

I have a list:

things = ['A1','B2','C3']

I have a pandas data frame with a column containing values separated by semicolons. Some of the rows will contain a match with one of the items in the list above. It won't be an exact match, since the column also holds other parts of a string; for example, a row in that column may have 'Wow;Here;This=A1;10001;0'.

I want to save the rows that contain a match with items from the list, and then create a new data frame with those selected rows (should have the same headers). This is what I tried:

import re

for_new_df =[]

for x in df['COLUMN']:
    for mp in things:
        if df[df['COLUMN'].str.contains(mp)]:
            for_new_df.append(mp)  #This won't save the whole row - help here too, please.

This code gave me an error:

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I'm very new to coding, so the more explanation and detail in your answer, the better! Thanks in advance.

Answered by EdChum

You can avoid the loop by joining your list of words to create a regex and use str.contains:

pat = '|'.join(things)
for_new_df = df[df['COLUMN'].str.contains(pat)]

This should just work.

So the regex pattern becomes 'A1|B2|C3', and this will match any string in the column that contains any of these substrings, wherever they occur.

Example:

In [65]:
things = ['A1','B2','C3']
pat = '|'.join(things)
df = pd.DataFrame({'a':['Wow;Here;This=A1;10001;0', 'B2', 'asdasda', 'asdas']})
df[df['a'].str.contains(pat)]

Out[65]:
                          a
0  Wow;Here;This=A1;10001;0
1                        B2
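
One caveat worth noting: str.contains treats the joined pattern as a regular expression, so a list item containing characters such as '.' or '+' would be interpreted as regex syntax. The sketch below is not part of the original answer; it assumes a hypothetical extra item 'x.y' and made-up rows, and escapes each item with re.escape before joining.

```python
import re
import pandas as pd

# 'x.y' is a made-up item that contains a regex metacharacter ('.')
things = ['A1', 'B2', 'C3', 'x.y']
df = pd.DataFrame({'COLUMN': ['Wow;Here;This=A1;10001;0', 'B2', 'asdasda', 'x-y', 'k=x.y']})

# Escape every item so it is matched literally, then join with '|'
pat = '|'.join(re.escape(t) for t in things)

# na=False treats missing values as "no match" instead of propagating NaN
matches = df[df['COLUMN'].str.contains(pat, na=False)]
print(matches)
```

Without the escaping, the row 'x-y' would also be selected, because an unescaped '.' matches any character.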

As to why it failed:

if df[df['COLUMN'].str.contains(mp)]:

this line:

df[df['COLUMN'].str.contains(mp)]

returns a df masked by the boolean array from your inner str.contains. The if statement doesn't know how to evaluate an array of booleans, hence the error. If you think about it, what should it do if you have one True, or all but one True? It expects a scalar, not an array-like value.
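
As a minimal sketch of that point (the data and variable names here are illustrative, not from the original answer), the following reproduces the error and shows two ways to reduce the result to a single boolean: calling .any() on the mask, or checking .empty on the filtered frame.

```python
import pandas as pd

df = pd.DataFrame({'COLUMN': ['Wow;Here;This=A1;10001;0', 'B2', 'asdasda']})
mask = df['COLUMN'].str.contains('A1')  # boolean Series, one value per row

try:
    if df[mask]:  # a DataFrame has no single truth value
        pass
except ValueError as err:
    error_message = str(err)
    print(error_message)

has_match = bool(mask.any())  # True if at least one row matches
no_rows = df[mask].empty      # True if the filtered frame has no rows
print(has_match, no_rows)
```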

Answered by emmalg

Pandas is actually amazing but I don't find it very easy to use. However it does have many functions designed to make life easy, including tools for searching through huge data frames.

Though it may not be a full solution to your problem, this may help set you off on the right foot. I have assumed that you know which column you are searching in, column A in my example.

import pandas as pd

df = pd.DataFrame({'A': pd.Categorical(['Wow;Here;This=A1;10001;0', 'Another;C3;Row=Great;100', 'This;D6;Row=bad100']),
                   'B': 'foo'})
print(df)  # Original data frame
print()
print(df['A'].str.contains('A1|B2|C3'))  # Boolean Series showing matches for column A
print()
print(df[df['A'].str.contains('A1|B2|C3')])  # Matching rows

The output:

                          A    B
0  Wow;Here;This=A1;10001;0  foo
1  Another;C3;Row=Great;100  foo
2        This;D6;Row=bad100  foo

0     True
1     True
2    False
Name: A, dtype: bool

                          A    B
0  Wow;Here;This=A1;10001;0  foo
1  Another;C3;Row=Great;100  foo