Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me), citing the original address: http://stackoverflow.com/questions/15958394/

Time: 2020-08-18 21:27:57  Source: igfitidea

Extracting Data with Python Regular Expressions

Tags: python, regex, parsing

Asked by greyfox

I am having some trouble wrapping my head around Python regular expressions and coming up with one that extracts specific values.

The page I am trying to parse has a number of productIds, which appear in the following format:

\"productId\":\"111111\"

I need to extract all the values, 111111 in this case.

Accepted answer by perreal

import re

t = "\"productId\":\"111111\""
# Raw string keeps \W, \D and \d intact for the regex engine
m = re.match(r"\W*productId[^:]*:\D*(\d+)", t)
if m:
    print(m.group(1))

This means: match non-word characters (\W*), then productId followed by non-colon characters ([^:]*) and a :. Then match non-digits (\D*), and match and capture the digits that follow ((\d+)).

Output

111111
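
Note that re.match only tries the pattern at the start of the string, so it yields at most one id. As a minimal sketch, assuming the page holds several entries in the same format (the sample string page below is made up for illustration), the same pattern without the leading \W* could be handed to findall instead:

```python
import re

# Hypothetical sample text; a real page would come from a file or an HTTP response.
page = '{"productId":"111111","name":"a"},{"productId":"222222","name":"b"}'

# Same idea as above: the key, anything up to the colon, non-digits, then the digits.
pattern = re.compile(r'productId[^:]*:\D*(\d+)')

# With a single capture group, findall returns the captured text of every match.
product_ids = pattern.findall(page)
print(product_ids)  # ['111111', '222222']
```

Dropping \W* is a design choice here: findall scans the whole text anyway, so there is no need to consume the characters in front of the key.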

Answer by frickskit

Try this:

 :\"(\d*)\"

Give more examples of your data if this doesn't do what you want.
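
Whether this pattern matches probably depends on whether the backslashes are really present in the page text. In regex syntax \" is just a literal quote, so the pattern expects :"111111" with no backslash in between. A small sketch with two made-up sample strings shows the difference:

```python
import re

pattern = re.compile(r':\"(\d*)\"')

# Hypothetical inputs: one where the quotes are plain, and one where the
# page text literally contains a backslash before each quote.
plain = '"productId":"111111"'
escaped = r'\"productId\":\"111111\"'

print(pattern.findall(plain))    # ['111111']
print(pattern.findall(escaped))  # []
```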

Answer by Fredrik Pihl

Something like this:

In [13]: s=r'\"productId\":\"111111\"'

In [14]: print(s)
\"productId\":\"111111\"

In [15]: import re

In [16]: re.findall(r'\d+', s)
Out[16]: ['111111']
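
One caveat: \d+ matches every run of digits, not only product ids. If the page holds other numbers, a sketch like the following (the sample text and its price field are invented for illustration) shows how tying the digits to the productId key narrows the result:

```python
import re

# Hypothetical page fragment with an unrelated number next to the id.
s = r'\"productId\":\"111111\",\"price\":\"25\"'

# Every run of digits, wherever it appears.
all_digits = re.findall(r'\d+', s)

# Only digits that follow the productId key (with non-digits in between).
only_ids = re.findall(r'productId\D*(\d+)', s)

print(all_digits)  # ['111111', '25']
print(only_ids)    # ['111111']
```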

Answer by Tobia

The backslashes here might add to the confusion, because they are used as an escape character both by (non-raw) Python strings and by the regexp syntax.

This extracts the product ids from the format you posted:

import re
re_prodId = re.compile(r'\\"productId\\":\\"([^"]+)\\"')

The raw string r'...' does away with one level of backslash escaping; the use of a single quote as the string delimiter does away with the need to escape double quotes; and finally the backslashes are doubled (only once) because of their special meaning in the regexp language.

You can use the regexp object's findall() method to find all matches in some text:

re_prodId.findall(text_to_search)

This will return a list of all product ids.
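
Putting the pieces together, a minimal end-to-end sketch might look like this. It assumes the page text really contains the backslashes shown in the question, and it doubles each backslash in the pattern as described above; the value of text_to_search is a made-up sample, not real data:

```python
import re

# Each \\ in the raw string is the regex escape for one literal backslash.
re_prodId = re.compile(r'\\"productId\\":\\"([^"]+)\\"')

# Hypothetical sample in the format from the question (literal backslashes).
text_to_search = r'[{\"productId\":\"111111\"},{\"productId\":\"222222\"}]'

found = re_prodId.findall(text_to_search)
print(found)  # ['111111', '222222']
```

Because [^"]+ captures anything up to the next quote, this variant would also pick up non-numeric ids; swapping in \d+ would restrict it to digits.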
