Parsing XML in Python with regex
Source: http://stackoverflow.com/questions/18168684/
This content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors.
Asked by user2671656
I'm trying to use regex to parse an XML file (in my case this seems the simplest way).
For example a line might be:
line='<City_State>PLAINSBORO, NJ 08536-1906</City_State>'
To access the text for the tag City_State, I'm using:
attr = re.match('>.*<', line)
but nothing is being returned.
Can someone point out what I'm doing wrong?
Accepted answer by TerryA
You normally don't want to use re.match here, because it only matches at the start of the string. Quoting from the docs:
If you want to locate a match anywhere in string, use search() instead (see also search() vs. match()).
Note:
>>> import re
>>> print(re.match('>.*<', line))
None
>>> print(re.search('>.*<', line))
<re.Match object; span=(11, 38), match='>PLAINSBORO, NJ 08536-1906<'>
>>> print(re.search('>.*<', line).group(0))
>PLAINSBORO, NJ 08536-1906<
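If regex is still preferred for a one-line case like this, a capture group avoids having to strip the surrounding > and < afterwards. The following is a minimal sketch and not part of the original answer: the helper name tag_text is hypothetical, and it assumes the tag carries no attributes and is not nested inside a tag of the same name.

```python
import re

line = '<City_State>PLAINSBORO, NJ 08536-1906</City_State>'

def tag_text(tag, text):
    """Return the text between <tag> and </tag>, or None if absent.

    Hypothetical helper: assumes no attributes and no nesting.
    """
    # The non-greedy (.*?) stops at the first closing tag.
    pattern = r'<{0}>(.*?)</{0}>'.format(re.escape(tag))
    found = re.search(pattern, text)
    return found.group(1) if found else None

print(tag_text('City_State', line))
```

Returning None when nothing matches mirrors what re.search itself does, so the caller can test the result before using it.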
Also, why parse XML with regex when you can use something like BeautifulSoup :).
>>> from bs4 import BeautifulSoup as BS
>>> line='<City_State>PLAINSBORO, NJ 08536-1906</City_State>'
>>> soup = BS(line, 'html.parser')  # name the parser explicitly; html.parser lowercases tag names
>>> print(soup.find('city_state').text)
PLAINSBORO, NJ 08536-1906
Answer by Kyle
re.match returns a match only if the pattern matches at the beginning of the string. To find the pattern anywhere in the string, use re.search.
And yes, this is a simple way to parse XML, but I would highly encourage you to use a library specifically designed for the task.
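The difference between the matching functions can be sketched as follows. This is an illustrative example rather than part of the original answer; re.fullmatch is not mentioned in the thread and is included only for contrast, as the function that requires the whole string to match.

```python
import re

line = '<City_State>PLAINSBORO, NJ 08536-1906</City_State>'

# re.match anchors at position 0; the line starts with '<', not '>'.
at_start = re.match('>.*<', line)

# re.search scans forward until the pattern matches somewhere.
anywhere = re.search('>.*<', line)

# re.fullmatch requires the pattern to cover the whole string.
whole = re.fullmatch('<.*>', line)

print(at_start)
print(anywhere.group(0))
print(whole is not None)
```

All three return None on failure, so checking the result before calling .group() avoids an AttributeError.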
Answer by Viktor Kerkez
Please, just use an XML parser like ElementTree:
>>> from xml.etree import ElementTree as ET
>>> line='<City_State>PLAINSBORO, NJ 08536-1906</City_State>'
>>> ET.fromstring(line).text
'PLAINSBORO, NJ 08536-1906'
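For a file with more than one record, the same parser can collect every matching element. The sketch below is only an illustration: the Addresses and Address wrapper elements and the second city are assumptions made up for the example, and only City_State comes from the question.

```python
from xml.etree import ElementTree as ET

# Hypothetical document: only City_State comes from the question.
doc = """
<Addresses>
  <Address><City_State>PLAINSBORO, NJ 08536-1906</City_State></Address>
  <Address><City_State>PRINCETON, NJ 08540</City_State></Address>
</Addresses>
""".strip()

root = ET.fromstring(doc)

# iter() walks the whole tree, so nesting depth does not matter.
cities = [elem.text for elem in root.iter('City_State')]
print(cities)
```

Unlike a line-by-line regex, this keeps working when an element spans several lines or gains attributes.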