Searching for a matching string pattern in a DataFrame column in Python pandas
Original question: http://stackoverflow.com/questions/36740680/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by Satya
I have a DataFrame like the one below:
name   genre
satya  |ACTION|DRAMA|IC|
satya  |COMEDY|BIOPIC|SOCIAL|
abc    |CLASSICAL|
xyz    |ROMANCE|ACTION|DARMA|
def    |DISCOVERY|SPORT|COMEDY|IC|
ghj    |IC|
Now I want to query the DataFrame so that I get rows 1, 5 and 6, i.e. I want to find the rows where |IC| appears either alone or in any combination with other genres.
So far I can do either an exact search using
df[df['genre'] == '|ACTION|DRAMA|IC|']  # exact match, yields row 1
or a substring search with str.contains:
df[df['genre'].str.contains('IC')]  # yields rows 1, 2, 3, 5, 6
# because BIOPIC and CLASSICAL also contain "IC"
But I don't want either of these.
# df[df['genre'].str.contains('|IC|')]  # row 6
# This does not meet my need either, because rows 1 and 5 are missing
So my requirement is to find the genres that contain |IC|. (My string search fails because '|' is treated as the OR operator.)
Can somebody suggest a regex or any other method to do that? Thanks in advance.
Answered by jezrael
I think you can add \ to the regex to escape the pipe, because | without \ is interpreted as OR:
'|'
A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the '|' in this way. This can be used inside groups (see below) as well. As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy. To match a literal '|', use \|, or enclose it inside a character class, as in [|].
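The quoted behaviour can be checked directly with Python's re module. The following is only a minimal sketch: the sample strings are copied from the question's table, and the variable names are made up for illustration.

```python
import re

# Sample genre strings taken from the question's table
samples = ["|ACTION|DRAMA|IC|", "|COMEDY|BIOPIC|SOCIAL|", "|CLASSICAL|", "|IC|"]

# Unescaped: '|IC|' is read as three alternatives (empty, 'IC', empty).
# The leading empty alternative matches at position 0 of any string.
unescaped = [re.search("|IC|", s).group() for s in samples]

# Escaped: '\|' matches a literal pipe, so only a whole '|IC|' field matches.
escaped = [bool(re.search(r"\|IC\|", s)) for s in samples]

# Character-class form from the quote: '[|]' also matches a literal pipe.
char_class = [bool(re.search(r"[|]IC[|]", s)) for s in samples]

print(unescaped)   # every entry is the empty string
print(escaped)     # [True, False, False, True]
print(char_class)  # same result as the escaped pattern
```

Both the escaped form and the character-class form reject BIOPIC and CLASSICAL, because the pattern requires a pipe immediately before and after IC.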
print(df['genre'].str.contains(r'\|IC\|'))
0 True
1 False
2 False
3 False
4 True
5 True
Name: genre, dtype: bool
print(df[df['genre'].str.contains(r'\|IC\|')])
name genre
0 satya |ACTION|DRAMA|IC|
4 def |DISCOVERY|SPORT|COMEDY|IC|
5 ghj |IC|
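Besides writing the backslashes by hand, a few other routes appear to select the same rows. This is a sketch rather than part of the original answer: it assumes the frame is called df as in the question and rebuilds it from the table there, and the mask_* names are made up for illustration. regex=False is an existing parameter of str.contains, and re.escape comes from the standard library.

```python
import re
import pandas as pd

# Rebuild the question's frame (values copied from the table in the question)
df = pd.DataFrame({
    "name": ["satya", "satya", "abc", "xyz", "def", "ghj"],
    "genre": [
        "|ACTION|DRAMA|IC|",
        "|COMEDY|BIOPIC|SOCIAL|",
        "|CLASSICAL|",
        "|ROMANCE|ACTION|DARMA|",
        "|DISCOVERY|SPORT|COMEDY|IC|",
        "|IC|",
    ],
})

# Option 1: switch off regex handling so '|IC|' is treated as a plain substring.
mask_literal = df["genre"].str.contains("|IC|", regex=False)

# Option 2: let re.escape add the backslashes instead of typing them.
mask_escaped = df["genre"].str.contains(re.escape("|IC|"))

# Option 3: split each value into its genres and test for membership.
mask_split = df["genre"].map(lambda g: "IC" in g.strip("|").split("|"))

print(df[mask_literal])
```

All three masks should pick index 0, 4 and 5, which are rows 1, 5 and 6 in the question's 1-based counting. Option 1 is probably the simplest when the search term is a fixed string rather than a pattern.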
Answered by apet
Maybe this construction:
import re
df[df['columnName'].str.contains(re.compile('regex_pattern'))]