Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/48236846/

Time: 2020-09-14 05:03:01  Source: igfitidea

REGEX filter with Pandas (any numeric combination followed by 'plus' sign)

python regex pandas

Asked by Stanleyrr

I have a Pandas dataframe called df with the following 3 columns: id, creation_date and email.

I want to return all rows where the email column contains a strictly numeric combination (digits only) followed by a 'plus' sign and then followed by anything.

For example:
- [email protected] and [email protected] will meet my criteria.
- [email protected] and [email protected] will not, because they contain non-numeric characters before the 'plus' sign.

I know df.email.str.contains('\+') won't work because it will return everything that contains a 'plus' sign. I tried df.filter(['email'], regex=r'([^0-9])' % '\+', axis=0) but it threw an error message that read TypeError: not all arguments converted during string formatting.

Can anyone advise?

Thanks very much!

Answered by andrew_reece

You can use contains, but match should be sufficient:

import pandas as pd

# example data
data = ["[email protected]", "[email protected]",
        "[email protected]", "[email protected]"]
df = pd.DataFrame(data, columns=["email"])

df
                   email
0     [email protected]
1  [email protected]
2   [email protected]
3   [email protected]

Now use match:

df.email.str.match(r"\d+\+.*")

0     True
1     True
2    False
3    False
Name: email, dtype: bool
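
The boolean Series above can be used directly as a mask to return the matching rows, which is what the question asks for. The following is a minimal sketch; the addresses are hypothetical stand-ins, since the ones in the original post are obscured.

```python
import pandas as pd

# Hypothetical sample addresses (the originals are obscured in the post).
df = pd.DataFrame({
    "email": [
        "12345+alerts@example.com",
        "987+x@example.com",
        "abc123+alerts@example.com",
        "john+doe@example.com",
    ]
})

# str.match anchors at the start of each string, so digits must come first
# and must be followed immediately by a literal plus sign.
mask = df["email"].str.match(r"\d+\+")

# Boolean indexing keeps only the rows where the mask is True.
matching_rows = df[mask]
print(matching_rows)
```

Because match only needs the pattern to succeed at the start of the string, the trailing `.*` from the answer is optional here.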

Note the difference between contains and match, from the docs:

contains
analogous, but less strict, relying on re.search instead of re.match
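
To illustrate that difference, here is a small sketch with two hypothetical addresses. The same pattern is passed to both methods; contains finds it anywhere in the string, while match requires it at the start.

```python
import pandas as pd

# Hypothetical addresses: the second one has letters before the digits.
emails = pd.Series(["123+a@example.com", "abc123+a@example.com"])
pattern = r"\d+\+"

# contains behaves like re.search: the pattern may occur anywhere.
contains_result = emails.str.contains(pattern)

# match behaves like re.match: the pattern must start at position 0.
match_result = emails.str.match(pattern)

# Adding an explicit ^ anchor makes contains agree with match.
anchored_contains = emails.str.contains("^" + pattern)

print(contains_result.tolist(), match_result.tolist(), anchored_contains.tolist())
```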

Answered by McClAnalytics

Try this:

df.email.str.contains(r'^\d+\+\@')

Breaking down the regular expression:

^ ensures that we are starting at the beginning of the email string

\d+ matches one or more digit (numeric) characters

\+ escapes the plus sign so that it matches a literal + instead of acting as a quantifier

\@ matches a literal @ (the backslash is optional, since @ is not a regex metacharacter) and ensures that the plus sign previously matched sits at the end of the local part, immediately before the @
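
One consequence of requiring the @ right after the plus sign is that addresses with text between the two will not match. The sketch below, using hypothetical addresses, compares this pattern with a relaxed variant that drops the @. The relaxed variant is an assumption about what "followed by anything" in the question means, not part of this answer.

```python
import pandas as pd

# Hypothetical addresses covering the three interesting cases.
emails = pd.Series([
    "12345+@example.com",       # plus sign directly before the @
    "12345+news@example.com",   # text between the plus sign and the @
    "abc123+@example.com",      # letters before the digits
])

# Pattern from the answer: digits, plus sign, then @ immediately.
strict = emails.str.contains(r"^\d+\+@")

# Relaxed variant (assumption): digits, plus sign, then anything.
relaxed = emails.str.contains(r"^\d+\+")

print(strict.tolist(), relaxed.tolist())
```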

Answered by Rahul

Since your digit combination is followed by a +, which might itself be followed by more digits, you can try the following regex.

Regex: (?:\d+\+?)+@[a-z]+\.[a-z]+

Explanation:

  • (?:\d+\+?)+ will match your pattern of digits followed by an optional +, repeated one or more times.

  • [a-z]+\.[a-z]+ will match the remaining part.
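
As a quick check of how this pattern behaves, the sketch below runs it with Python's re.search against a few hypothetical addresses. Two things are worth noting: it rejects letters between the plus sign and the @, and, because the plus sign is optional, it also accepts an all-digit local part with no plus sign at all.

```python
import re

# Pattern from the answer, compiled once for reuse.
pattern = re.compile(r"(?:\d+\+?)+@[a-z]+\.[a-z]+")

# Hypothetical addresses; the originals in the post are obscured.
samples = [
    "123+456@example.com",   # digits, plus sign, digits
    "123+@example.com",      # digits, plus sign, nothing else
    "123+abc@example.com",   # letters after the plus sign
    "12345@example.com",     # no plus sign at all
]

# re.search returns a match object or None; bool() turns that into a flag.
results = {s: bool(pattern.search(s)) for s in samples}
for address, matched in results.items():
    print(address, matched)
```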

Answered by Srdjan M.

Regex: ^\d+\+\d*@\S+

Details:

^ asserts position at start of a line

\d+ matches a digit (equal to [0-9]) one or more times

\+ matches the character + literally

\d* matches a digit (equal to [0-9]) zero or more times

@ matches the character @

\S+ matches any non-whitespace character one or more times
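
Applied to a DataFrame, the pattern could be used as in the sketch below. The column name follows the question; the addresses are hypothetical. Note that only digits are allowed between the plus sign and the @.

```python
import pandas as pd

# Hypothetical addresses; the originals in the post are obscured.
df = pd.DataFrame({
    "email": [
        "123+@example.com",
        "123+456@example.com",
        "123+abc@example.com",
        "abc+123@example.com",
    ]
})

# Pattern from the answer: leading digits, a plus sign, optional digits, @, rest.
mask = df["email"].str.contains(r"^\d+\+\d*@\S+")

# Keep only the rows whose email matches.
filtered = df[mask]
print(filtered)
```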
