Original source: http://stackoverflow.com/questions/27300070/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Use Pandas string method 'contains' on a Series containing lists of strings
Asked by Dirk
Given a simple Pandas Series that contains some strings which can consist of more than one sentence:
In:
import pandas as pd
s = pd.Series(['This is a long text. It has multiple sentences.','Do you see? More than one sentence!','This one has only one sentence though.'])
Out:
0 This is a long text. It has multiple sentences.
1 Do you see? More than one sentence!
2 This one has only one sentence though.
dtype: object
I use the pandas string method split with a regex pattern to split each row into its individual sentences (this produces unnecessary empty list elements; any suggestions on how to improve the regex?).
In:
s = s.str.split(r'([A-Z][^\.!?]*[\.!?])')
Out:
0 [, This is a long text., , It has multiple se...
1 [, Do you see?, , More than one sentence!, ]
2 [, This one has only one sentence though., ]
dtype: object
This converts each row into a list of strings, with each element holding one sentence.
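On the side question about the empty list elements: str.split with a capturing group returns the matched sentences plus whatever lies between them, which is where the empty strings and single spaces come from. One possible way around this, as a sketch rather than a definitive fix, is to collect the matches directly with str.findall. The variable names text and sentences are chosen here for illustration only.

```python
import pandas as pd

# Same example data as in the question, under a separate (illustrative) name.
text = pd.Series([
    'This is a long text. It has multiple sentences.',
    'Do you see? More than one sentence!',
    'This one has only one sentence though.',
])

# findall returns only the matches, so no empty strings or
# whitespace-only elements end up in the lists.
sentences = text.str.findall(r'[A-Z][^.!?]*[.!?]')
print(sentences.tolist())
```

Inside a character class the dot does not need escaping, so [^.!?] is equivalent to the original [^\.!?].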
Now, my goal is to use the string method contains to check each element in each row separately against a specific regex pattern, and to create a new Series that stores the resulting boolean values, each indicating whether the regex matched at least one of the list elements.
I would expect something like:
In:
s.str.contains('you')
Out:
0 False
1 True
2 False
<-- Row 0 does not contain 'you' in any of its elements, but row 1 does, while row 2 does not.
However, when doing the above, the result is:
0 NaN
1 NaN
2 NaN
dtype: float64
I also tried a list comprehension, which does not work:
result = [[x.str.contains('you') for x in y] for y in s]
AttributeError: 'str' object has no attribute 'str'
Any suggestions on how this can be achieved?
Accepted answer by Roman Pekar
You can use Python's str.find() method:
>>> s.apply(lambda x: any(i.find('you') >= 0 for i in x))
0 False
1 True
2 False
dtype: bool
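The question asks for a regex match rather than a plain substring test. Below is a minimal sketch of the same any()-based idea using re.search, assuming the Series holds the split lists shown in the question. The pattern \byou\b is only an illustrative choice, not something taken from the original post.

```python
import re
import pandas as pd

# The Series as it looks after the split step in the question.
s = pd.Series([
    ['', 'This is a long text.', ' ', 'It has multiple sentences.', ''],
    ['', 'Do you see?', ' ', 'More than one sentence!', ''],
    ['', 'This one has only one sentence though.', ''],
])

# Illustrative pattern (an assumption): 'you' as a whole word.
pattern = re.compile(r'\byou\b')

# True for a row if the regex matches at least one of its list elements.
result = s.apply(lambda sentences: any(pattern.search(x) for x in sentences))
print(result)
```

Compiling the pattern once keeps the lambda short and avoids repeating the pattern string for every element.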
I guess s.str.contains('you') is not working because the elements of your Series are not strings, but lists. But you can also do something like this:
>>> s.apply(lambda x: any(pd.Series(x).str.contains('you')))
0 False
1 True
2 False
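Another option avoids building a temporary Series per row. This is a sketch that is not part of the original answer and assumes pandas 0.25 or later, where Series.explode is available: flatten the lists so that each sentence becomes its own row, apply str.contains once, then combine the results per original row.

```python
import pandas as pd

# The Series as it looks after the split step in the question.
s = pd.Series([
    ['', 'This is a long text.', ' ', 'It has multiple sentences.', ''],
    ['', 'Do you see?', ' ', 'More than one sentence!', ''],
    ['', 'This one has only one sentence though.', ''],
])

# One row per list element; the original index label is repeated.
flat = s.explode()

# Now every element is a string, so the .str accessor works as expected.
# na=False treats missing values (e.g. from empty lists) as non-matches.
matches = flat.str.contains('you', na=False)

# Collapse back to one boolean per original row.
result = matches.groupby(level=0).any()
print(result)
```

Because str.contains runs once over the flattened data instead of once per row, this may scale better on large Series, though that depends on the data.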