REGEX filter with Pandas (any numeric combination followed by 'plus' sign)
Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors: Stack Overflow
Source: http://stackoverflow.com/questions/48236846/
Asked by Stanleyrr
I have a Pandas dataframe called df with the following 3 columns: id, creation_date and email.
I want to return all rows where the email column contains a strictly numeric combination (digits only) followed by a 'plus' sign and then followed by anything.
For example:
- [email protected], [email protected] will meet my criteria.
- [email protected] and [email protected] will not, because they contain non-numeric characters before the 'plus' sign.
I know df.email.str.contains('\+') won't work because it will return everything that contains a 'plus' sign. I tried df.filter(['email'], regex=r'([^0-9])' % '\+', axis=0), but it raised TypeError: not all arguments converted during string formatting.
Can anyone advise?
Thanks very much!
Answered by andrew_reece
You can use contains, but match should be sufficient:
import pandas as pd
# example data
data = ["[email protected]", "[email protected]",
        "[email protected]", "[email protected]"]
df = pd.DataFrame(data, columns=["email"])
df
email
0 [email protected]
1 [email protected]
2 [email protected]
3 [email protected]
Now use match:
df.email.str.match(r"\d+\+.*")
0 True
1 True
2 False
3 False
Name: email, dtype: bool
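Because the sample addresses above are obscured, here is a minimal, self-contained sketch of the same idea. The example.com addresses are hypothetical values invented for illustration, not the original data. It also uses the boolean result as a mask to return the matching rows, which is what the question asks for.

```python
import pandas as pd

# Hypothetical addresses, invented for illustration only.
df = pd.DataFrame({"email": [
    "1234+5@example.com",   # digits, then '+': should match
    "99+abc@example.com",   # digits, then '+': should match
    "abc12+3@example.com",  # letters before the digits: should not match
    "12a+3@example.com",    # a letter between the digits and '+': should not match
]})

# str.match only tries the pattern at the start of each string,
# so the run of digits must begin the address.
mask = df["email"].str.match(r"\d+\+")

# Boolean indexing keeps only the rows where the mask is True.
matching_rows = df[mask]
print(matching_rows)
```

The trailing .* from the pattern above is left out here because it does not change which strings match.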
Note the difference between contains and match, from the docs:
contains
analogous, but less strict, relying on re.search instead of re.match
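The difference quoted from the docs can be seen with the re module directly. This is a small sketch using a hypothetical address (an assumption for illustration): the same pattern succeeds with re.search, which scans the whole string, and fails with re.match, which only tries at the start.

```python
import re

pattern = r"\d+\+"

# Hypothetical address with letters before the digits and '+'.
address = "abc12+3@example.com"

# re.search scans the whole string, so it finds "12+" in the middle.
found_anywhere = re.search(pattern, address) is not None

# re.match only tries at position 0, where "a" is not a digit.
found_at_start = re.match(pattern, address) is not None

print(found_anywhere, found_at_start)
```

This is why an unanchored contains would wrongly accept addresses that have letters before the digits, while match rejects them.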
Answered by McClAnalytics
Try this:
df.email.str.contains(r'^\d+\+\@')
Breaking down the regular expression:
^ ensures that we start at the beginning of the email string
\d+ matches one or more digit characters
\+ escapes the plus sign so that it is matched literally
\@ matches the @ and ensures that the plus sign previously matched comes immediately before the @, at the end of the part of the address that precedes it
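As a quick check of the breakdown above, the following sketch applies the pattern to a few hypothetical addresses (invented for illustration). It writes @ without the backslash, since @ is not a special character in Python regular expressions, and shows that this pattern only accepts a plus sign that sits directly before the @.

```python
import pandas as pd

# Hypothetical addresses, invented for illustration only.
emails = pd.Series([
    "1234+@example.com",   # digits, '+', then '@': matches
    "1234+5@example.com",  # extra character between '+' and '@': no match
    "ab12+@example.com",   # does not start with a digit: no match
])

# '^' anchors the search to the start of the string, so contains
# behaves like a start-anchored match here.
result = emails.str.contains(r"^\d+\+@")
print(result.tolist())
```

If text is allowed between the plus sign and the @, as the question describes, this pattern is stricter than required.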
Answered by Rahul
Since your combination is followed by +, which might in turn be followed by digits, you can try the following regex.
Regex: (?:\d+\+?)+@[a-z]+\.[a-z]+
Explanation:
(?:\d+\+?)+ will match your pattern of digits followed by +. @[a-z]+\.[a-z]+ will match the remaining part.
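To see what this pattern accepts, here is a small sketch with hypothetical addresses (invented for illustration). It assumes the pattern is applied from the start of the string with re.match, which the answer does not state. Note that because \+? makes the plus sign optional, a digits-only address also matches.

```python
import re

pattern = re.compile(r"(?:\d+\+?)+@[a-z]+\.[a-z]+")

# Hypothetical addresses, invented for illustration only.
addresses = [
    "12+34@example.com",  # digits, '+', digits: matches
    "12+@example.com",    # digits, '+': matches
    "1234@example.com",   # no '+' at all: still matches, since '+' is optional
    "12+ab@example.com",  # letters after '+': no match
]

# Assumption: the pattern is tried at the start of each string (re.match).
matches = [pattern.match(a) is not None for a in addresses]
print(matches)
```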
Answered by Srdjan M.
Regex: ^\d+\+\d*@\S+
Details:
^ asserts position at the start of the string (or of a line in multiline mode)
\d+ matches a digit (equal to [0-9]) one or more times
\+ matches the character + literally
\d* matches a digit (equal to [0-9]) zero or more times
@ matches the character @ literally
\S+ matches any non-whitespace character one or more times
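The details above can be checked with a short sketch. The addresses are hypothetical values invented for illustration, and re.search is used because the pattern carries its own ^ anchor.

```python
import re

pattern = re.compile(r"^\d+\+\d*@\S+")

# Hypothetical addresses, invented for illustration only.
addresses = [
    "12+@example.com",     # digits, '+', nothing before '@': matches
    "12+34@example.com",   # digits, '+', digits: matches
    "12+ab@example.com",   # letters after '+': no match
    "ab12+3@example.com",  # letters before the digits: no match
]

# '^' in the pattern anchors the search to the start of the string.
matches = [pattern.search(a) is not None for a in addresses]
print(matches)
```

This pattern only allows digits between the plus sign and the @, so it is narrower than "followed by anything" in the question.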


