How to extract numbers from a string in Python?
Source: http://stackoverflow.com/questions/4289331/
This content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
Asked by pablouche
I want to extract all the numbers contained in a string. Which is better suited to the purpose: regular expressions or the isdigit() method?
Example:
line = "hello 12 hi 89"
Result:
[12, 89]
Accepted answer by fmark
If you want to extract only positive integers, try the following:
>>> str = "h3110 23 cat 444.4 rabbit 11 2 dog"
>>> [int(s) for s in str.split() if s.isdigit()]
[23, 11, 2]
I would argue that this is better than the regex example for three reasons. First, you don't need another module. Second, it's more readable, because you don't need to parse the regex mini-language. Third, it's faster (and thus likely more Pythonic):
python -m timeit -s "str = 'h3110 23 cat 444.4 rabbit 11 2 dog' * 1000" "[s for s in str.split() if s.isdigit()]"
100 loops, best of 3: 2.84 msec per loop
python -m timeit -s "import re" "str = 'h3110 23 cat 444.4 rabbit 11 2 dog' * 1000" "re.findall('\b\d+\b', str)"
100 loops, best of 3: 5.66 msec per loop
This will not recognize floats, negative integers, or integers in hexadecimal format. If you can't accept these limitations, slim's answer below will do the trick.
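For anyone who wants to repeat the comparison, here is a rough sketch using the timeit module from inside Python. The function names are made up for illustration, and no timings are claimed. Note that in the shell command above, '\b' inside a non-raw string is a backspace character rather than a word boundary, and the second setup statement is missing its -s flag, so the two original timings may not be directly comparable. The sketch uses a raw string instead.

```python
import re
import timeit

text = "h3110 23 cat 444.4 rabbit 11 2 dog"


def with_isdigit(s):
    # Keep only whitespace-separated tokens made up entirely of digits.
    return [tok for tok in s.split() if tok.isdigit()]


# Raw string, so \b means "word boundary" and not a backspace character.
WORD_NUMBER = re.compile(r"\b\d+\b")


def with_regex(s):
    return WORD_NUMBER.findall(s)


print(with_isdigit(text))  # ['23', '11', '2']
print(with_regex(text))    # ['23', '444', '4', '11', '2']

big = text * 1000
for fn in (with_isdigit, with_regex):
    seconds = timeit.timeit(lambda: fn(big), number=20)
    print(fn.__name__, seconds)
```

The two approaches do not return the same tokens: the regex splits 444.4 into two matches, so check the output before comparing speed.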
Answer by Vincent Savard
I'd use a regexp:
>>> import re
>>> re.findall(r'\d+', 'hello 42 I\'m a 32 string 30')
['42', '32', '30']
This would also match 42 in bla42bla. If you only want numbers delimited by word boundaries (space, period, comma), you can use \b:
>>> re.findall(r'\b\d+\b', 'he33llo 42 I\'m a 32 string 30')
['42', '32', '30']
To end up with a list of numbers instead of a list of strings:
>>> [int(s) for s in re.findall(r'\b\d+\b', 'he33llo 42 I\'m a 32 string 30')]
[42, 32, 30]
Answer by jmnas
I'm assuming you want floats, not just integers, so I'd do something like this:
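s = "hello 12 hi 89"  # example input from the question
l = []
for t in s.split():
    try:
        # float() accepts integers, decimals and signed values alike.
        l.append(float(t))
    except ValueError:
        # Not a number; skip this token.
        pass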
Note that some of the other solutions posted here don't work with negative numbers:
>>> re.findall(r'\b\d+\b', 'he33llo 42 I\'m a 32 string -30')
['42', '32', '30']
>>> '-3'.isdigit()
False
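If you want to stay with a regex and still keep the sign, something like the following might work. This is only a sketch: the pattern and the function name signed_ints are my own, not from any of the answers.

```python
import re

# Optional sign, then digits. The lookbehind rejects a match that is glued
# to a preceding word character (so the 33 in "he33llo" is skipped).
SIGNED_INT = re.compile(r"(?<!\w)[-+]?\d+\b")


def signed_ints(text):
    return [int(m) for m in SIGNED_INT.findall(text)]


print(signed_ints("he33llo 42 I'm a 32 string -30"))  # [42, 32, -30]
```

A minus sign directly after a word character is treated as a separator rather than a sign, so "3-4" gives [3, 4].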
Answer by ZacSketches
@jmnas, I liked your answer, but it didn't find floats. I'm working on a script to parse code going to a CNC mill and needed to find both X and Y dimensions, which can be integers or floats, so I adapted your code to the following. This finds ints and floats with positive and negative values. It still doesn't find hex-formatted values, but you could add "x" and "A" through "F" to the num_char tuple, and I think it would parse things like '0x23AC'.
s = 'hello X42 I\'m a Y-32.35 string Z30'
xy = ("X", "Y")
num_char = (".", "+", "-")
l = []
tokens = s.split()
for token in tokens:
    if token.startswith(xy):
        num = ""
        for char in token:
            # print(char)
            if char.isdigit() or (char in num_char):
                num = num + char
        try:
            l.append(float(num))
        except ValueError:
            pass
print(l)
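The same idea can probably be written more compactly with a regex that captures the axis letter together with its value. The pattern and the name axis_values below are my own assumptions, not part of the answer above. On the hex idea: float() raises ValueError for a string such as '0x23AC', so extending num_char alone is unlikely to be enough. int() with base 16 does accept that form.

```python
import re

# An X or Y at a word boundary, followed by an optionally signed decimal number.
AXIS_VALUE = re.compile(r"\b([XY])([-+]?\d*\.?\d+)")


def axis_values(line):
    # Map each axis letter to its value; a repeated axis keeps the last value.
    return {axis: float(num) for axis, num in AXIS_VALUE.findall(line)}


print(axis_values("hello X42 I'm a Y-32.35 string Z30"))  # {'X': 42.0, 'Y': -32.35}

# Hexadecimal text needs int() with an explicit base instead of float().
print(int("0x23AC", 16))  # 9132
```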
Answer by aidan.plenert.macdonald
This is more than a bit late, but you can extend the regular expression to account for scientific notation too.
import re
# Format is [(<string>, <expected output>), ...]
ss = [("apple-12.34 ba33na fanc-14.23e-2yapple+45e5+67.56E+3",
       ['-12.34', '33', '-14.23e-2', '+45e5', '+67.56E+3']),
      ('hello X42 I\'m a Y-32.35 string Z30',
       ['42', '-32.35', '30']),
      ('he33llo 42 I\'m a 32 string -30',
       ['33', '42', '32', '-30']),
      ('h3110 23 cat 444.4 rabbit 11 2 dog',
       ['3110', '23', '444.4', '11', '2']),
      ('hello 12 hi 89',
       ['12', '89']),
      ('4',
       ['4']),
      ('I like 74,600 commas not,500',
       ['74,600', '500']),
      ('I like bad math 1+2=.001',
       ['1', '+2', '.001'])]
for s, r in ss:
    # Raw string, so the backslashes reach the regex engine unchanged.
    rr = re.findall(r"[-+]?[.]?[\d]+(?:,\d\d\d)*[\.]?\d*(?:[eE][-+]?\d+)?", s)
    if rr == r:
        print('GOOD')
    else:
        print('WRONG', rr, 'should be', r)
Every case prints GOOD.
Additionally, you can look at the AWS Glue built-in regex.
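The matches from that pattern are strings, so a conversion step is still needed if you want actual numbers. A minimal sketch follows, assuming you want floats and that commas are thousands separators. The helper name to_floats is hypothetical.

```python
import re

# Same pattern as in the answer above, written as a raw string.
NUMBER = re.compile(r"[-+]?[.]?[\d]+(?:,\d\d\d)*[\.]?\d*(?:[eE][-+]?\d+)?")


def to_floats(text):
    # Drop thousands separators first, because float() rejects commas.
    return [float(m.replace(",", "")) for m in NUMBER.findall(text)]


print(to_floats("fanc-14.23e-2yapple+45e5"))      # [-0.1423, 4500000.0]
print(to_floats("I like 74,600 commas not,500"))  # [74600.0, 500.0]
```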
Answer by Ajay Kumar
The best option I found is below. It extracts a number by dropping every character that is not a digit.
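def extract_nbr(input_str):
    if input_str is None or input_str == '':
        return 0
    # Keep every digit character and drop everything else.
    out_number = ''
    for ele in input_str:
        if ele.isdigit():
            out_number += ele
    # Guard against input with no digits, where float('') would raise ValueError.
    return float(out_number) if out_number else 0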
Answer by Menglong Li
This answer also handles the case where the number in the string is a float.
def get_first_nbr_from_str(input_str):
    '''
    :param input_str: string that contains digits and words
    :return: the first number extracted from input_str
    demo:
    'ab324.23.123xyz': 324.23
    '.5abc44': 0.5
    '''
    # Reject empty input and anything that is not a string.
    if not input_str or not isinstance(input_str, str):
        return 0
    out_number = ''
    for ele in input_str:
        if (ele == '.' and '.' not in out_number) or ele.isdigit():
            out_number += ele
        elif out_number:
            # The first run of digits has ended.
            break
    return float(out_number)
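For comparison, the "first number in the string" idea can also be sketched with re.search. The pattern and the name first_number are assumptions of mine. Unlike the loop above, this version returns a default instead of raising when the string contains no digits.

```python
import re

# Digits with an optional fractional part, or a leading-dot form such as ".5".
FIRST_NUMBER = re.compile(r"\d*\.?\d+")


def first_number(text, default=None):
    match = FIRST_NUMBER.search(text)
    return float(match.group()) if match else default


print(first_number("ab324.23.123xyz"))  # 324.23
print(first_number(".5abc44"))          # 0.5
print(first_number("no digits here"))   # None
```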
Answer by dfostic
If you know there will be only one number in the string, e.g. 'hello 12 hi', you can try filter.
For example:
In [1]: int(''.join(filter(str.isdigit, '200 grams')))
Out[1]: 200
In [2]: int(''.join(filter(str.isdigit, 'Counters: 55')))
Out[2]: 55
In [3]: int(''.join(filter(str.isdigit, 'more than 23 times')))
Out[3]: 23
But be careful:
In [4]: int(''.join(filter(str.isdigit, '200 grams 5')))
Out[4]: 2005
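One way to guard against that pitfall is to collect each run of digits separately and fail loudly when there is more than one. This is just a sketch, and single_int is a made-up name.

```python
import re


def single_int(text):
    # Collect each run of digits separately instead of joining every digit.
    runs = re.findall(r"\d+", text)
    if len(runs) != 1:
        raise ValueError(f"expected exactly one number, found {len(runs)}: {runs}")
    return int(runs[0])


print(single_int("200 grams"))  # 200
```

Calling single_int('200 grams 5') raises ValueError instead of silently returning 2005.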
Answer by Moinuddin Quadri
I am amazed to see that no one has yet mentioned itertools.groupby as an alternative way to achieve this.
You can use itertools.groupby() along with str.isdigit() to extract numbers from a string:
from itertools import groupby
my_str = "hello 12 hi 89"
l = [int(''.join(i)) for is_digit, i in groupby(my_str, str.isdigit) if is_digit]
The value held by l will be:
[12, 89]
PS: This is just for illustration, to show that groupby could also be used to achieve this. It is not a recommended solution. If you want to achieve this, you should use fmark's accepted answer, which is based on a list comprehension with str.isdigit as the filter.
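To see why this works, it may help to print the intermediate groups. groupby splits the string into consecutive runs of characters that share the same str.isdigit result. The variable names here are only illustrative.

```python
from itertools import groupby

my_str = "hello 12 hi 89"

# Each group is a run of consecutive characters with the same key value.
runs = [(is_digit, "".join(chars)) for is_digit, chars in groupby(my_str, str.isdigit)]
print(runs)
# [(False, 'hello '), (True, '12'), (False, ' hi '), (True, '89')]

# Keep only the digit runs and convert them.
numbers = [int(run) for is_digit, run in runs if is_digit]
print(numbers)  # [12, 89]
```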
Answer by Marc Maxmeister
Since none of these dealt with the real-world financial numbers in Excel and Word documents that I needed to find, here is my variation. It handles ints, floats, negative numbers, and currency numbers (because it doesn't rely on split), and has the option to drop the decimal part and return just ints, or to return everything.
It also handles the Indian lakh number system, where commas appear irregularly rather than every three digits.
It does not handle scientific notation, or negative numbers written inside parentheses in budgets, which will appear positive.
It also does not extract dates. There are better ways of finding dates in strings.
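import re
def find_numbers(string, ints=True):
    # Optional leading minus, digits with optional commas, optional decimal part.
    # The trailing part is written as \d* here; the original character class
    # [\d{2}] also matched literal braces.
    numexp = re.compile(r'[-]?\d[\d,]*[\.]?\d*')
    numbers = numexp.findall(string)
    numbers = [x.replace(',', '') for x in numbers]
    if ints is True:
        # Drop the decimal part and convert to int.
        return [int(x.split('.')[0]) for x in numbers]
    else:
        return numbers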
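A short walk-through of the steps inside that function, on a made-up input string. The pattern is the one from the answer with the final part written as \d*, which is my assumption about the intent. The sample text is invented for illustration.

```python
import re

# Optional minus, a digit, then digits and commas, then an optional decimal part.
numexp = re.compile(r'[-]?\d[\d,]*[\.]?\d*')

sample = "Revenue was $1,23,456.78 and costs were -2,500"  # made-up input
matches = numexp.findall(sample)
print(matches)  # ['1,23,456.78', '-2,500']

# Commas are removed whatever their spacing, which is why lakh grouping works.
cleaned = [m.replace(",", "") for m in matches]
print(cleaned)  # ['123456.78', '-2500']

# Integer mode: keep only the part before the decimal point.
print([int(c.split(".")[0]) for c in cleaned])  # [123456, -2500]
```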

