Use Pandas string method 'contains' on a Series containing lists of strings

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/27300070/

Tags: python, regex, string, pandas

Asked by Dirk

Given a simple Pandas Series that contains some strings which can consist of more than one sentence:

In:
import pandas as pd
s = pd.Series(['This is a long text. It has multiple sentences.','Do you see? More than one sentence!','This one has only one sentence though.'])

Out:
0    This is a long text. It has multiple sentences.
1                Do you see? More than one sentence!
2             This one has only one sentence though.
dtype: object

I use the pandas string method split and a regex pattern to split each row into its individual sentences (which produces unnecessary empty list elements; any suggestions on how to improve the regex?).

In:
s = s.str.split(r'([A-Z][^\.!?]*[\.!?])')

Out:
0    [, This is a long text.,  , It has multiple se...
1        [, Do you see?,  , More than one sentence!, ]
2         [, This one has only one sentence though., ]
dtype: object

This converts each row into lists of strings, each element holding one sentence.
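
On the aside about the empty list elements: they come from splitting on a pattern with a capture group, which keeps both the matches and the (empty or whitespace) text between them. A minimal sketch of one possible alternative, assuming the same example data, is to collect the matches directly with str.findall. The variable name sentences is only for illustration and is not part of the original question.

```python
import pandas as pd

s = pd.Series([
    'This is a long text. It has multiple sentences.',
    'Do you see? More than one sentence!',
    'This one has only one sentence though.',
])

# findall returns only the matched sentences, so the empty strings and
# lone spaces that split leaves between matches do not appear.
# Inside a character class, '.', '!' and '?' need no escaping.
sentences = s.str.findall(r'[A-Z][^.!?]*[.!?]')

print(sentences.tolist())
```

Like the original pattern, this assumes every sentence starts with an uppercase ASCII letter and ends with '.', '!' or '?'.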

Now, my goal is to use the string method contains to check each element in each row separately against a specific regex pattern and to create a new Series of boolean values, each indicating whether the regex matched at least one of that row's list elements.

I would expect something like:

In:
s.str.contains('you')

Out:
0   False
1   True
2   False

<-- Row 0 does not contain 'you' in any of its elements, but row 1 does, while row 2 does not.

However, running the above returns:

0   NaN
1   NaN
2   NaN
dtype: float64

I also tried a list comprehension which does not work:

result = [[x.str.contains('you') for x in y] for y in s]
AttributeError: 'str' object has no attribute 'str'

Any suggestions on how this can be achieved?

Accepted answer by Roman Pekar

You can use Python's str.find() method:

>>> s.apply(lambda x : any((i for i in x if i.find('you') >= 0)))
0    False
1     True
2    False
dtype: bool
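
Note that find() looks for a literal substring, while the question asks about a regex pattern. The following is a rough sketch of two regex-aware variants, not part of the original answer: one tests each list element with re.search, the other flattens the lists with Series.explode and reduces per original row. The word-boundary pattern and the names by_apply and by_explode are assumptions chosen for illustration; the input mimics the lists produced by the split step in the question.

```python
import re
import pandas as pd

# Rows shaped like the output of the split step (lists of strings).
s = pd.Series([
    ['', 'This is a long text.', ' ', 'It has multiple sentences.', ''],
    ['', 'Do you see?', ' ', 'More than one sentence!', ''],
    ['', 'This one has only one sentence though.', ''],
])

pattern = r'\byou\b'  # hypothetical pattern: 'you' as a whole word

# Variant 1: True if the regex matches at least one element of the list.
by_apply = s.apply(
    lambda parts: any(re.search(pattern, p) is not None for p in parts)
)

# Variant 2: one row per list element (the index is repeated), match with
# the vectorized str.contains, then combine rows sharing an index.
# na=False makes missing values (e.g. from empty lists) count as no match.
by_explode = (
    s.explode()
     .str.contains(pattern, na=False)
     .groupby(level=0)
     .any()
)

print(by_apply.tolist())    # [False, True, False]
print(by_explode.tolist())  # [False, True, False]
```

Which variant is faster likely depends on the data; the explode version keeps the matching inside pandas, at the cost of building a longer intermediate Series.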

I guess s.str.contains('you') does not work because the elements of your Series are not strings but lists. But you can also do something like this:

>>> s.apply(lambda x: any(pd.Series(x).str.contains('you')))
0    False
1     True
2    False