Python: extract email sub-strings from a large document
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original source: http://stackoverflow.com/questions/17681670/
Extract email sub-strings from large document
Asked by user1893148
I have a very large .txt file with hundreds of thousands of email addresses scattered throughout. They all take the format:
...<[email protected]>...
What is the best way to have Python cycle through the entire .txt file looking for all instances of a certain @domain string, grab the entire address within the <...>, and add it to a list? The trouble I have is with the variable length of the different addresses.
Accepted answer by 0x90
This code extracts the email addresses in a string. Use it while reading the file line by line:
>>> import re
>>> line = "should we use regex more often? let me know at [email protected]"
>>> match = re.search(r'[\w\.-]+@[\w\.-]+', line)
>>> match.group(0)
'[email protected]'
If you have several email addresses, use findall:
>>> line = "should we use regex more often? let me know at [email protected]"
>>> match = re.findall(r'[\w\.-]+@[\w\.-]+', line)
>>> match
['[email protected]', '[email protected]']
The regex above probably finds the most common non-fake email addresses. If you want to be fully aligned with RFC 5322, you should check which email addresses follow the specification. See the page linked in the original answer to avoid bugs in finding email addresses correctly.
Edit: as suggested in a comment by @kostek: in the string Contact us at [email protected].
my regex returns [email protected]. (with the dot at the end). To avoid this, use [\w\.,]+@[\w\.,]+\.\w+
Edit II: another wonderful improvement was mentioned in the comments: [\w\.-]+@[\w\.-]+\.\w+
which will capture [email protected] as well.
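To make the difference between the first pattern and the one from Edit II concrete, here is a minimal sketch that runs both on a sentence ending in a period. The address support@example.com is made up for illustration and does not come from the original answer.

```python
import re

# Hypothetical sample text; the address is invented for this demo.
text = "Contact us at support@example.com."

# Original pattern: the trailing period is swallowed into the match.
loose = re.findall(r'[\w\.-]+@[\w\.-]+', text)

# Edit II pattern: the match must end with a dot followed by word characters.
strict = re.findall(r'[\w\.-]+@[\w\.-]+\.\w+', text)

print(loose)   # ['support@example.com.']
print(strict)  # ['support@example.com']
```

The stricter pattern works because the engine backtracks until the final piece, a dot plus at least one word character, can still match.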
Answer by tehsockz
If you're looking for a specific domain:
>>> import re
>>> text = "this is an email [email protected], it will be matched, [email protected] will not, and [email protected] will"
>>> match = re.findall(r'[\w.+%-]+@test\.com', text) # replace test\.com with the domain you're looking for, adding a backslash before periods; the hyphen sits last in the class so it is not parsed as a range
>>> match
['[email protected]', '[email protected]']
Answer by Stryker
You can also use the following to find all the email addresses in a text and print them either as a list or one per line.
import re
line = "why people don't know what regex are? let me know [email protected], [email protected] " \
"[email protected],[email protected]"
match = re.findall(r'[\w\.-]+@[\w\.-]+', line)
for i in match:
    print(i)
findall already returns a list, so if you want the addresses in a list, just use match. This prints the whole list:
print(match)
Hope this helps.
Answer by nischi
Here's another approach for this specific problem, with a regex from emailregex.com:
import re
text = "blabla <[email protected]>><[email protected]> <huhu@fake> bla bla <[email protected]>"
# 1. find all potential email addresses (note: < inside <> is a problem)
matches = re.findall(r'<\S+?>', text) # ['<[email protected]>', '<[email protected]>', '<huhu@fake>', '<[email protected]>']
# 2. apply email regex pattern to string inside <>
emails = [x[1:-1] for x in matches if re.match(r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)", x[1:-1])]
print(emails) # ['[email protected]', '[email protected]', '[email protected]']
Answer by david_adler
import re
rgx = r'(?:\.?)([\w\-_+#~!$&\'\.]+(?<!\.)(@|[ ]?\(?[ ]?(at|AT)[ ]?\)?[ ]?)(?<!\.)[\w]+[\w\-\.]*\.[a-zA-Z-]{2,3})(?:[^\w])'
matches = re.findall(rgx, text)  # text is the document contents, defined elsewhere
# each match is a tuple of groups; the first group is the full address
emails = [m[0] for m in matches]
Please don't hate me for having a go at this infamous regex. The regex works for a decent portion of email addresses. I mostly used this as my basis for the valid chars in an email address.
Feel free to play around with it here
I also made a variation where the regex captures emails like name at example.com:
(?:\.?)([\w\-_+#~!$&\'\.]+(?<!\.)(@|[ ]\(?[ ]?(at|AT)[ ]?\)?[ ])(?<!\.)[\w]+[\w\-\.]*\.[a-zA-Z-]{2,3})(?:[^\w])
Answer by ayoub ELMAJJODI
import re
txt = 'hello from [email protected] to [email protected] about the meeting @2PM'
email = re.findall(r'\S+@\S+', txt)
print(email)
Printed output:
['[email protected]', '[email protected]']
Answer by Laksh Jadhwani
import re
with open("file_name",'r') as f:
    s = f.read()
result = re.findall(r'\S+@\S+',s)
for r in result:
    print(r)
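Reading the whole file with f.read() is fine for moderate sizes. For a very large file, a line-by-line variant keeps memory use flat. The following is a minimal sketch under some assumptions: the helper name iter_emails is hypothetical, the pattern is the one from the accepted answer's Edit II rather than \S+@\S+, and the demo file with its addresses is created only for illustration.

```python
import os
import re
import tempfile

# Pattern taken from the accepted answer's "Edit II".
EMAIL_RE = re.compile(r'[\w\.-]+@[\w\.-]+\.\w+')

def iter_emails(path):
    """Yield addresses line by line so the whole file never sits in memory."""
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            yield from EMAIL_RE.findall(line)

# Demo on a throwaway file; the addresses are made up for illustration.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write("first <ann@example.com> line\n")
    tmp.write("nothing here\n")
    tmp.write("<bob@example.org> and <cy@example.com>\n")
    demo_path = tmp.name

emails = list(iter_emails(demo_path))
os.remove(demo_path)
print(emails)  # ['ann@example.com', 'bob@example.org', 'cy@example.com']
```

This only works when an address never spans a line break, which seems a reasonable assumption for the <...> format in the question.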
Answer by Muneer Ahmad
import re
mess = '''[email protected] [email protected]
abc@gmail'''
email = re.compile(r'([\w\.-][email protected])')
result = email.findall(mess)
if result:  # findall returns a list (possibly empty), never None
    print(result)
The code above returns only the Gmail addresses found in the text.
Answer by Palash Jhamb
import re
reg_pat = r'\S+@\S+\.\S+'
test_text = '[email protected] [email protected] uiufubvcbuw bvkw ko@com m@urice'
emails = re.findall(reg_pat, test_text, re.IGNORECASE)
print(emails)
Output:
['[email protected]', '[email protected]']
Answer by Rishang
You can add \b at the end of the pattern so that the match has to stop at a word boundary, which marks the end of the email address.
The regex:
[\w\.\-]+@[\w\-\.]+\b
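A short sketch of how the trailing \b behaves on addresses followed by punctuation. The sample sentence and addresses are invented for illustration.

```python
import re

# Pattern from this answer: \b forces the match to end on a word character.
pattern = re.compile(r'[\w\.\-]+@[\w\-\.]+\b')

# Made-up sample text with punctuation right after each address.
text = "Write to joe@example.com. Or to ann@example.org, thanks"
found = pattern.findall(text)
print(found)  # ['joe@example.com', 'ann@example.org']
```

Without \b the first match would include the sentence's final period, since the period is part of the character class.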