REGEX filter with Pandas (any numeric combination followed by a 'plus' sign)

Question

Asked by Stanleyrr

I have a Pandas dataframe called df with the following 3 columns: id, creation_date and email.

I want to return all rows where the email column contains a strictly numeric combination (digits only) followed by a 'plus' sign, and then followed by anything.

For example:
- [email protected] and [email protected] will meet my criteria.
- [email protected] and [email protected] will not, because they contain non-numeric characters before the 'plus' sign.

I know df.email.str.contains('\+') won't work because it will return everything that contains a 'plus' sign. I tried df.filter(['email'], regex=r'([^0-9])' % '\+', axis=0), but it threw an error message that read TypeError: not all arguments converted during string formatting.

Can anyone advise?

Thanks very much!
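The TypeError in the question appears to come from Python's % string formatting rather than from pandas: the left-hand string has no placeholder to receive the argument. A minimal sketch that reproduces just that part is below (the variable names are illustrative). Separately, df.filter selects rows or columns by their labels and does not look at cell contents, so it is not the right tool for this task in any case.

```python
# Plain Python, no pandas needed: '%' formatting with no placeholder in the
# left-hand string raises the TypeError quoted in the question.
try:
    pattern = r'([^0-9])' % r'\+'
    message = None
except TypeError as exc:
    message = str(exc)

print(message)
```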

Answer 1

Answered by andrew_reece

You can use contains, but match should be sufficient:

# example data
import pandas as pd

data = ["[email protected]", "[email protected]",
        "[email protected]", "[email protected]"]
df = pd.DataFrame(data, columns=["email"])

df
                   email
0     [email protected]
1  [email protected]
2   [email protected]
3   [email protected]

Now use match:

df.email.str.match(r"\d+\+.*")

0     True
1     True
2    False
3    False
Name: email, dtype: bool

Note the difference between contains and match, from the docs:

contains
analogous, but less strict, relying on re.search instead of re.match
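To make the contains/match distinction concrete, here is a small sketch. The addresses are hypothetical stand-ins, since the original ones are obscured in the source, and the names emails, matched and contained are illustrative.

```python
import pandas as pd

# Hypothetical addresses, invented for illustration only.
emails = pd.Series([
    "12345+promo@example.com",   # digits, then '+': the wanted case
    "abc123+promo@example.com",  # letters before the digits: not wanted
])

pattern = r"\d+\+"

# str.match anchors the pattern at the start of each string (re.match).
matched = emails.str.match(pattern).tolist()

# str.contains looks for the pattern anywhere in the string (re.search),
# so the "123+" inside the second address also counts as a hit.
contained = emails.str.contains(pattern).tolist()

print(matched)    # [True, False]
print(contained)  # [True, True]
```

With contains, the same result as match would need an explicit ^ anchor at the start of the pattern.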

Answer 2

Answered by McClAnalytics

Try this:

df.email.str.contains(r'^\d+\+\@')

Breaking down the regular expression:

^ ensures that we start at the beginning of the email string

\d+ matches one or more digit characters

\+ escapes the plus sign so that it matches a literal +

\@ matches the @ (the escape is optional, since @ is not a special character) and ensures that the plus sign matched just before it sits at the end of the local part, immediately before the @
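As a rough illustration of how this pattern behaves when used as a row filter, the sketch below builds a small frame with made-up ids and addresses (all hypothetical) and keeps only the matching rows. Note that, as written, the pattern requires the @ to come directly after the plus sign, so an address with extra text between + and @ is not selected.

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "email": [
        "123+@example.com",     # digits, '+', then '@' right away
        "123+tag@example.com",  # text between '+' and '@'
        "abc+@example.com",     # non-digits before '+'
        "456@example.com",      # no '+' at all
    ],
})

# Boolean mask from the regex, then ordinary boolean indexing.
mask = df["email"].str.contains(r"^\d+\+@")
result = df[mask]

print(result["id"].tolist())  # [1]
```

If anything may follow the plus sign, as the question describes, dropping the trailing @ from the pattern would be one way to loosen it.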

Answer 3

Answered by Rahul

Since your combination is followed by +, which might itself be followed by digits, you can try the following regex.

Regex: (?:\d+\+?)+@[a-z]+\.[a-z]+

Explanation:

(?:\d+\+?)+ matches one or more runs of digits, each optionally followed by +.
[a-z]+\.[a-z]+ matches the remaining part.
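The sketch below tries this pattern on a few hypothetical addresses (invented for illustration) with both contains and match. Two things seem worth checking before relying on it: because the + is optional, a purely numeric local part with no plus sign also matches, and because the pattern has no ^ anchor, contains accepts digits that appear after other characters.

```python
import pandas as pd

# Hypothetical addresses, invented for illustration only.
emails = pd.Series([
    "123+456@example.com",  # digits, '+', digits
    "123+tag@example.com",  # letters after the '+'
    "456@example.com",      # digits only, no '+'
    "abc123@example.com",   # letters before the digits, no '+'
])

pattern = r"(?:\d+\+?)+@[a-z]+\.[a-z]+"

# Unanchored search: the pattern may start anywhere in the string.
contained = emails.str.contains(pattern).tolist()

# Anchored at the start of the string only.
matched = emails.str.match(pattern).tolist()

print(contained)  # [True, False, True, True]
print(matched)    # [True, False, True, False]
```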


Answer 4

Answered by Srdjan M.

Regex: ^\d+\+\d*@\S+

Details:

^ asserts position at the start of a line

\d+ matches one or more digits (a digit is equal to [0-9])

\+ matches the character + literally

\d* matches a digit (equal to [0-9]); the * quantifier matches between zero and unlimited times

@ matches the character @

\S+ matches one or more non-whitespace characters
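A minimal sketch of applying this pattern to return whole rows is shown below. The ids and addresses are hypothetical, and na=False is an added assumption so that a missing email counts as a non-match instead of producing a missing value in the mask. Since \d* only allows digits between the + and the @, an address with letters after the plus sign is excluded.

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "email": [
        "123+@example.com",     # digits, '+', '@'
        "123+456@example.com",  # digits, '+', digits, '@'
        "123+tag@example.com",  # letters between '+' and '@'
        "a123+4@example.com",   # non-digit at the start
        None,                   # missing value
    ],
})

# na=False makes missing emails evaluate to False in the mask.
mask = df["email"].str.match(r"^\d+\+\d*@\S+", na=False)
result = df[mask]

print(result["id"].tolist())  # [1, 2]
```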

