Find phone numbers in Python script
Source: http://stackoverflow.com/questions/3868753/
Note: this page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, link to the original, and attribute it to the original authors on Stack Overflow.
Asked by Aaron
The following Python script lets me scrape email addresses from a given file using regular expressions.
How could I extend it so that it also finds phone numbers? Say, numbers with either 7 digits or 10 digits (with area code), while also accounting for parentheses?
My current script is below:
import os
import re

# filename variables
filename = 'file.txt'
newfilename = 'result.txt'

# read the file
if os.path.exists(filename):
    with open(filename, 'r') as data:
        bulkemails = data.read()
else:
    print("File not found.")
    raise SystemExit

# regex for email addresses
r = re.compile(r'(\b[\w.]+@+[\w.]+.+[\w.]\b)')
results = r.findall(bulkemails)
emails = ""
for x in results:
    emails += str(x) + "\n"

# function to write file
def writefile():
    with open(newfilename, 'w') as f:
        f.write(emails)
    print("File written.")
Regex for phone numbers:
(\d{3}[-\.\s]\d{3}[-\.\s]\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]\d{4}|\d{3}[-\.\s]\d{4})
Another regex for phone numbers:
(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?
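The longer pattern above has five capture groups, which appear to correspond to the area code (with or without parentheses), the exchange, the four-digit line number, and an optional extension. A minimal sketch of reading those groups in Python follows; the helper name `parse_phone` and the dictionary keys are hypothetical, chosen only for illustration.

```python
import re

# Pattern copied from the line above, split across lines for readability.
NANP_RE = re.compile(
    r"(?:(?:\+?1\s*(?:[.-]\s*)?)?"
    r"(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)"
    r"|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?"
    r"([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?"
    r"([0-9]{4})"
    r"(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?"
)


def parse_phone(text):
    """Return the parts of the first phone number in text, or None."""
    match = NANP_RE.search(text)
    if match is None:
        return None
    # Groups 1 and 2 are alternatives: area code with or without parentheses.
    area_paren, area_bare, exchange, line, extension = match.groups()
    return {
        "area": area_paren or area_bare,
        "exchange": exchange,
        "line": line,
        "extension": extension,
    }


print(parse_phone("Call +1 (212) 555-0123 ext. 45 today"))
print(parse_phone("555-0199"))
```

Parts that are absent from the input, such as the area code of a 7-digit number, come back as `None`.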
Accepted answer by Auguste
If you are interested in learning regex, you could take a stab at writing it yourself. It's not quite as hard as it's made out to be. Sites like RegexPal allow you to enter some test data, then write and test a regular expression against that data. Using RegexPal, try adding some phone numbers in the various formats you expect to find (with brackets, area codes, etc.), grab a regex cheat sheet, and see how far you can get. If nothing else, it will help you read other people's expressions.
Edit: Here is a modified version of your regex, which should also match 7- and 10-digit phone numbers that lack any hyphens, spaces, or dots. I added question marks after the character classes (the []s), which makes each separator optional. I tested it in RegexPal, but as I'm still learning regex, I'm not sure that it's perfect. Give it a try.
(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})
It matched the following values in RegexPal:
000-000-0000
000 000 0000
000.000.0000
(000)000-0000
(000)000 0000
(000)000.0000
(000) 000-0000
(000) 000 0000
(000) 000.0000
000-0000
000 0000
000.0000
0000000
0000000000
(000)0000000
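The same check can be repeated in Python rather than RegexPal. The sketch below assumes Python 3.4 or later (for `re.fullmatch`); the names `PHONE_RE`, `SAMPLES`, `unmatched`, and `find_phone_numbers` are hypothetical helpers, not part of the original answer.

```python
import re

# Pattern copied from the answer above, split across lines for readability.
PHONE_RE = re.compile(
    r"(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}"
    r"|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}"
    r"|\d{3}[-\.\s]??\d{4})"
)

# The values listed above as matching in RegexPal.
SAMPLES = [
    "000-000-0000", "000 000 0000", "000.000.0000",
    "(000)000-0000", "(000)000 0000", "(000)000.0000",
    "(000) 000-0000", "(000) 000 0000", "(000) 000.0000",
    "000-0000", "000 0000", "000.0000",
    "0000000", "0000000000", "(000)0000000",
]


def unmatched(samples):
    """Return the samples that the pattern does not match in full."""
    return [s for s in samples if PHONE_RE.fullmatch(s) is None]


def find_phone_numbers(text):
    """Return every phone-number-like substring found in text."""
    return PHONE_RE.findall(text)


print(unmatched(SAMPLES))  # an empty list means every sample matched
print(find_phone_numbers("call (555) 123-4567 or 555.1234"))
```

In the asker's script, `find_phone_numbers(bulkemails)` could be called next to the existing email `findall` to collect both kinds of results from the same text.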
Answer by dotancohen
This is the process of building a phone number scraping regex.
First, we need to match an area code (3 digits), a trunk (3 digits), and an extension (4 digits):
reg = re.compile(r"\d{3}\d{3}\d{4}")
Now, we want to capture the matched phone number, so we add parentheses around the parts that we're interested in capturing (all of it):
reg = re.compile(r"(\d{3}\d{3}\d{4})")
The area code, trunk, and extension might be separated by up to 3 non-digit characters (for example, when spaces are used along with the hyphen or dot delimiter):
reg = re.compile(r"(\d{3}\D{0,3}\d{3}\D{0,3}\d{4})")
Now, the phone number might actually start with a ( character (if the area code is enclosed in parentheses):
reg = re.compile(r"(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?")
Now, that whole phone number is likely embedded in a bunch of other text:
reg = re.compile(r".*?(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?")
Now, that other text might include newlines:
reg = re.compile(r".*?(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?", re.S)
Enjoy!
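As a quick illustration of the final pattern in use, here is a minimal sketch that runs it over a multi-line string with `findall`, which returns the captured group for each match. The sample text and the helper name `scrape_numbers` are made up for this example.

```python
import re

# Final pattern from the walkthrough above, written as a raw string.
reg = re.compile(r".*?(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?", re.S)

# Hypothetical sample text; the second number sits on a later line.
sample = "Call (555) 123-4567 today,\nor 555.987.6543 tomorrow."


def scrape_numbers(text):
    """Return every captured phone number in text, in order of appearance."""
    return reg.findall(text)


print(scrape_numbers(sample))
# → ['(555) 123-4567', '555.987.6543']
```

Because `\D{0,3}` accepts any non-digit characters, this pattern is permissive about what separates the three parts.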
I personally stop here, but if you really want to be sure that only spaces, hyphens, and dots are used as delimiters, then you could try the following (untested):
reg = re.compile(r".*?(\(?\d{3}\)? ?[\.-]? ?\d{3} ?[\.-]? ?\d{4}).*?", re.S)
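To see what restricting the delimiters changes, the sketch below compares the permissive `\D{0,3}` pattern with a strict one on a made-up string. Two assumptions are worth flagging: the strict pattern is written with an escaped closing parenthesis (`\)?`) after the area code so that the parentheses balance, and both patterns omit the surrounding `.*?`, since `findall` scans the text on its own. The names `PERMISSIVE` and `STRICT` are hypothetical.

```python
import re

# Permissive: any run of up to 3 non-digit characters may separate the parts.
PERMISSIVE = re.compile(r"(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4})")

# Strict (assumed form): only spaces, hyphens, and dots may separate the parts.
STRICT = re.compile(r"(\(?\d{3}\)? ?[\.-]? ?\d{3} ?[\.-]? ?\d{4})")

# Hypothetical input: the first number uses slashes as delimiters.
text = "555/123/4567 and (555) 123 - 4567"

print(PERMISSIVE.findall(text))
# → ['555/123/4567', '(555) 123 - 4567']
print(STRICT.findall(text))
# → ['(555) 123 - 4567']
```

The strict form rejects the slash-separated number, which may or may not be desirable depending on how messy the input text is.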
Answer by user4959
I think this regex is a very simple way to parse phone numbers:
re.findall(r"[(][\d]{3}[)][ ]?[\d]{3}-[\d]{4}", lines)
Answer by Alex Moleiro
For Spanish phone numbers I use this with considerable success:
re.findall(r'[697]\d{1,2}.\d{2,3}.\d{2,3}.\d{0,2}', text)
Answer by J. Doe
You can check http://regex.inginf.units.it/. Given some training data and targets, it constructs an appropriate regex for you. It is not always perfect (check the F-score). Let's try it with 15 examples:
re.findall(r"\w\d \w\w \w\w \w\w \w\d|(?<=[^\d][^_][^_] )[^_]\d[^ ]\d[^ ][^ ]+|(?<= [^<]\w\w \w\w[^:]\w[^_][^ ][^,][^_] )(?: *[^<]\d+)+",
"""Lorem ipsum ? 04-42-00-00-00 dolor 1901 sit amet, consectetur +33 (0)4 42 00 00 00 adipisicing elit. 2016 Sapiente dicta fugit fugiat hic 04 42 00 00 00 aliquam itaque 04.42.00.00.00 facere, 13205 number: 100 000 000 00013 soluta. 4 Totam id dolores!""")
This returns ['04 42 00 00 00', '04.42.00.00.00', '04-42-00-00-00', '50498,']. Add more examples to gain precision.
Answer by Th3Tr1ckst3r
Since nobody has posted this regex yet, I will. This is what I use to find phone numbers. It matches all the regular phone number formats you see in the United States. I did not need it to match international numbers, so I made no adjustments for that purpose.
phone_number_regex_pattern = r"\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}"
Use this pattern instead if you also want to match plain phone numbers with no separator characters, such as "4441234567":
phone_number_regex_pattern = r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}"
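A short sketch of the difference between the two patterns, using a made-up line of text. The names `with_separators` and `separators_optional` are hypothetical; the patterns themselves are the two given above.

```python
import re

# First pattern: a separator is required between the three parts.
with_separators = re.compile(r"\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}")

# Second pattern: each separator is optional.
separators_optional = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")

# Hypothetical input containing three differently formatted numbers.
text = "Office: (444) 123-4567, mobile: 4441234567, fax: 444.123.4567"

print(with_separators.findall(text))
# → ['(444) 123-4567', '444.123.4567']
print(separators_optional.findall(text))
# → ['(444) 123-4567', '4441234567', '444.123.4567']
```

Neither pattern contains a capture group, so `findall` returns the full matched text for each number.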

