Original question: http://stackoverflow.com/questions/48236846/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
REGEX filter with Pandas (any numeric combination followed by 'plus' sign)
Asked by Stanleyrr
I have a Pandas dataframe called df with the following 3 columns: id, creation_date and email.
I want to return all rows where the email column contains any strictly numeric combination (must be strictly numbers) followed by a 'plus' sign and then followed by anything.
For example:
- [email protected], [email protected] will meet my criteria.
- [email protected] and [email protected] will not, because they contain non-numeric characters before the 'plus' sign.
I know df.email.str.contains('\+') won't work because it will return everything that contains a 'plus' sign. I tried df.filter(['email'], regex=r'([^0-9])' % '\+', axis=0), but it threw an error message that read TypeError: not all arguments converted during string formatting.
Can anyone advise?
Thanks very much!
Answer by andrew_reece
You can use contains, but match should be sufficient:
import pandas as pd

# example data
data = ["[email protected]", "[email protected]",
"[email protected]", "[email protected]"]
df = pd.DataFrame(data, columns=["email"])
df
email
0 [email protected]
1 [email protected]
2 [email protected]
3 [email protected]
Now use match:
df.email.str.match(r"\d+\+.*")
0 True
1 True
2 False
3 False
Name: email, dtype: bool
Note the difference between contains and match, from the docs:
contains: analogous, but less strict, relying on re.search instead of re.match
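The difference quoted above can be sketched with the standard re module, since the docs describe the two pandas methods in terms of re.match and re.search. The addresses below are hypothetical, made up for this illustration; they are not the ones from the question.

```python
import re

# Hypothetical addresses, invented for this illustration.
emails = ["123+tag@example.com", "abc123+tag@example.com"]

pattern = re.compile(r"\d+\+.*")

# re.match only succeeds if the pattern matches at the start of the string,
# so leading letters before the digits cause a failure.
match_results = [bool(pattern.match(e)) for e in emails]

# re.search succeeds if the pattern matches anywhere in the string,
# so the digits after "abc" are enough for a hit.
search_results = [bool(pattern.search(e)) for e in emails]

print(match_results)   # [True, False]
print(search_results)  # [True, True]
```

With contains, the same pattern would therefore need a leading ^ to reject addresses that start with non-digits.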
Answer by McClAnalytics
Try this:
df.email.str.contains(r'^\d+\+\@')
In breaking down the regular expression:
- ^ ensures that we are starting at the beginning of the email string
- \d+ matches one or more digit (numeric) characters
- \+ escapes the plus sign so that it matches a literal +
- \@ matches a literal @ (the backslash is optional) and ensures that the plus sign previously matched sits immediately before the @
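As a rough check of the breakdown above, the sketch below runs the same pattern through re.search, which is what contains relies on. The sample addresses are hypothetical. Note that this pattern requires the @ directly after the plus sign, so an address with text between + and @ is rejected; whether that is wanted depends on how "followed by anything" in the question is read.

```python
import re

# Same pattern as in the answer; the @ does not need a backslash.
pattern = re.compile(r"^\d+\+@")

# Hypothetical addresses, invented for this illustration.
samples = [
    "123+@example.com",      # digits, plus, then @
    "123+tag@example.com",   # text between + and @
    "abc123+@example.com",   # letters before the digits
    "123@example.com",       # no plus sign
]

results = {e: bool(pattern.search(e)) for e in samples}
for email, matched in results.items():
    print(email, matched)
```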
Answer by Rahul
Since your combination is followed by +, which might itself be followed by digits, you can try the following regex.
Regex: (?:\d+\+?)+@[a-z]+\.[a-z]+
Explanation:
- (?:\d+\+?)+ will match your pattern of digits and +.
- @[a-z]+\.[a-z]+ will match the remaining part.
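The behaviour of this pattern may be easier to judge from a small experiment. The sketch below uses hypothetical addresses and the standard re module. Two things stand out: the plus sign is optional (\+?), so an address with no plus at all still matches, and the pattern has no ^ anchor, so with search semantics (as in contains) leading letters are ignored.

```python
import re

pattern = re.compile(r"(?:\d+\+?)+@[a-z]+\.[a-z]+")

# Hypothetical addresses, invented for this illustration.
samples = [
    "123+@example.com",
    "123+456@example.com",
    "123@example.com",       # no plus sign at all
    "abc123+@example.com",   # letters before the digits
    "123+tag@example.com",   # letters between + and @
]

# search: the pattern may match anywhere in the string (like str.contains).
search_results = {e: bool(pattern.search(e)) for e in samples}

# match: the pattern must match from the start of the string (like str.match).
match_results = {e: bool(pattern.match(e)) for e in samples}

for e in samples:
    print(e, search_results[e], match_results[e])
```

If addresses without a plus sign must be excluded, one option is to make the plus mandatory and anchor the pattern at the start.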
Answer by Srdjan M.
Regex: ^\d+\+\d*@\S+
Details:
- ^ asserts position at start of a line
- \d+ matches one or more digits (equal to [0-9])
- \+ matches the character + literally
- \d* matches zero or more digits (equal to [0-9]); the * quantifier matches between zero and unlimited times
- @ matches the character @
- \S+ matches one or more non-whitespace characters
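To tie this back to the original question, here is a minimal sketch of using the pattern as a boolean mask to return the matching rows. The column names follow the question, but the ids, dates and email values are hypothetical, made up for this illustration.

```python
import pandas as pd

# Hypothetical data: column names follow the question, values are invented.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "creation_date": ["2018-01-01", "2018-01-02", "2018-01-03", "2018-01-04"],
    "email": [
        "123+@example.com",
        "123+456@example.com",
        "abc123+@example.com",   # letters before the digits
        "123+tag@example.com",   # letters between + and @
    ],
})

# Build a boolean mask with the regex, then use it to select rows.
mask = df["email"].str.contains(r"^\d+\+\d*@\S+")
filtered = df[mask]

print(filtered["email"].tolist())
# ['123+@example.com', '123+456@example.com']
```

Because the pattern starts with ^, contains and match should give the same result here.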