Python regular expression for HTML parsing (BeautifulSoup)
Date: 2020-03-05 18:51:06 Source: igfitidea
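I want to get the value of a hidden input field in HTML.
<input type="hidden" name="fooId" value="12-3456789-1111111111" />
I want to write a regular expression in Python that returns the value of fooId, given that I know the line in the HTML follows this format:
<input type="hidden" name="fooId" value="**[id is here]**" />
Can someone provide a Python example that parses the HTML for the value?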
Solutions
Answer
/<input type="hidden" name="fooId" value="([\d-]+)" \/>/
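The pattern above is written with /.../ delimiters, which Python does not use. A minimal sketch of how it might be applied with Python's re module, assuming the page contains the line exactly in the format given in the question (the sample string html is made up for illustration):

```python
import re

# Sample page content, assumed for illustration only.
html = '<p>form</p>\n<input type="hidden" name="fooId" value="12-3456789-1111111111" />'

# Same pattern as above without the /.../ delimiters; "/" needs no escaping in Python.
pattern = re.compile(r'<input type="hidden" name="fooId" value="([\d-]+)" />')

# search() scans the whole string; it returns None when nothing matches.
match = pattern.search(html)
foo_id = match.group(1) if match else None
print(foo_id)
```

Checking for None before calling group() avoids an AttributeError when the field is missing from the page.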
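Answer
Parsing is one of those areas where you really don't want to roll your own if you can avoid it, because you'll be chasing down edge cases and bugs for years to come.
I'd recommend using BeautifulSoup. It has a very good reputation, and from its documentation it looks pretty easy to use.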
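To make the edge-case concern concrete, here is a small sketch of how a strict pattern silently misses markup that a browser would treat as equivalent. The variant strings are made up for illustration, and the pattern is only one plausible hand-written regex for this field:

```python
import re

# A strict, hand-written pattern for the hidden field.
pattern = re.compile(r'<input type="hidden" name="fooId" value="([\d-]+)" />')

# Equivalent ways the same field could be written in a real page.
variants = {
    "exact": '<input type="hidden" name="fooId" value="12-3456789-1111111111" />',
    "reordered": '<input name="fooId" type="hidden" value="12-3456789-1111111111" />',
    "single quotes": "<input type='hidden' name='fooId' value='12-3456789-1111111111' />",
    "no trailing slash": '<input type="hidden" name="fooId" value="12-3456789-1111111111">',
}

# Record which variants the pattern finds.
results = {label: bool(pattern.search(markup)) for label, markup in variants.items()}
for label, matched in results.items():
    print(f"{label}: {'match' if matched else 'no match'}")
```

Only the exact form matches; each of the other variants would need its own fix to the pattern, which is the kind of maintenance an HTML parser avoids.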
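Answer
import re

# inputHTML holds the page source; the sample line from the question is used here
inputHTML = '<input type="hidden" name="fooId" value="12-3456789-1111111111" />'

# Capture the value attribute of the hidden fooId field
reg = re.compile(r'<input type="hidden" name="fooId" value="([^"]*)" />')
value = reg.search(inputHTML).group(1)
print('Value is', value)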
Answer
For this particular case, BeautifulSoup is harder to write than a regex, but it is much more robust... I'm just contributing the BeautifulSoup example, given that you already know which regex to use :-)
from bs4 import BeautifulSoup

# Or retrieve it from the web, etc.
html_data = open('/yourwebsite/page.html', 'r').read()

# Create the soup object from the HTML data
soup = BeautifulSoup(html_data, 'html.parser')

# Find the proper tag; attrs= is used because find()'s own name parameter is the tag name
fooId = soup.find('input', attrs={'name': 'fooId', 'type': 'hidden'})

# The value of the third attribute of the desired tag
value = list(fooId.attrs.values())[2]
# or index it directly via fooId['value']
Answer
/<input\s+type="hidden"\s+name="([A-Za-z0-9_]+)"\s+value="([A-Za-z0-9_\-]*)"\s*/>/

>>> import re
>>> s = '<input type="hidden" name="fooId" value="12-3456789-1111111111" />'
>>> re.match(r'<input\s+type="hidden"\s+name="([A-Za-z0-9_]+)"\s+value="([A-Za-z0-9_\-]*)"\s*/>', s).groups()
('fooId', '12-3456789-1111111111')
Answer
I agree with Vinko that BeautifulSoup is the way to go. However, I suggest using fooId['value'] to get the attribute rather than relying on value being the third attribute.
from bs4 import BeautifulSoup

# Or retrieve it from the web, etc.
html_data = open('/yourwebsite/page.html', 'r').read()

# Create the soup object from the HTML data
soup = BeautifulSoup(html_data, 'html.parser')

# Find the proper tag; attrs= is used because find()'s own name parameter is the tag name
fooId = soup.find('input', attrs={'name': 'fooId', 'type': 'hidden'})

# The value attribute
value = fooId['value']
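Answer
Pyparsing is a good interim step between BeautifulSoup and regex. It is more robust than plain regexes, since its HTML tag parsing understands variations in case, whitespace, and attribute presence/absence/order, yet this kind of basic tag extraction is simpler to write than with BS.
Your example is especially simple, since everything you are looking for is in the attributes of the opening "input" tag. Here is a pyparsing example showing several variations on your input tag that would give a regex fits, and also showing how to NOT match a tag if it is inside a comment: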
html = """<html><body>
<input type="hidden" name="fooId" value="**[id is here]**" />
<blah>
<input name="fooId" type="hidden" value="**[id is here too]**" />
<input NAME="fooId" type="hidden" value="**[id is HERE too]**" />
<INPUT NAME="fooId" type="hidden" value="**[and id is even here TOO]**" />
<!--
<input type="hidden" name="fooId" value="**[don't report this id]**" />
-->
<foo>
</body></html>"""

from pyparsing import makeHTMLTags, withAttribute, htmlComment

# use makeHTMLTags to create tag expression - makeHTMLTags returns expressions for
# opening and closing tags, we're only interested in the opening tag
inputTag = makeHTMLTags("input")[0]

# only want input tags with special attributes
inputTag.setParseAction(withAttribute(type="hidden", name="fooId"))

# don't report tags that are commented out
inputTag.ignore(htmlComment)

# use searchString to skip through the input
foundTags = inputTag.searchString(html)

# dump out first result to show all returned tags and attributes
print(foundTags[0].dump())
print()

# print out the value attribute for all matched tags
for inpTag in foundTags:
    print(inpTag.value)
Prints:
['input', ['type', 'hidden'], ['name', 'fooId'], ['value', '**[id is here]**'], True]
- empty: True
- name: fooId
- startInput: ['input', ['type', 'hidden'], ['name', 'fooId'], ['value', '**[id is here]**'], True]
  - empty: True
  - name: fooId
  - type: hidden
  - value: **[id is here]**
- type: hidden
- value: **[id is here]**

**[id is here]**
**[id is here too]**
**[id is HERE too]**
**[and id is even here TOO]**
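You can see that pyparsing not only matches these unpredictable variations, it also returns the data in an object that makes it easy to read out the individual tag attributes and their values.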
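For comparison, the same kinds of variation (case, attribute order, commented-out tags) can also be handled with only the standard library's html.parser module. This is a swapped-in technique, not one proposed in the answers above; the class name HiddenInputCollector and the sample markup are hypothetical and only for illustration:

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collects the value of every hidden <input> tag with a given name."""

    def __init__(self, field_name):
        super().__init__()
        self.field_name = field_name
        self.values = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; attribute values stay as written.
        # Self-closing tags such as <input ... /> also reach this method by default.
        attributes = dict(attrs)
        if (tag == "input"
                and (attributes.get("type") or "").lower() == "hidden"
                and attributes.get("name") == self.field_name):
            self.values.append(attributes.get("value"))


# Sample markup, assumed for illustration only.
html = """<html><body>
<input type="hidden" name="fooId" value="first" />
<input NAME="fooId" type="hidden" value="second" />
<INPUT value="third" NAME="fooId" type="hidden">
<!-- <input type="hidden" name="fooId" value="commented out" /> -->
<input type="text" name="fooId" value="not hidden" />
</body></html>"""

collector = HiddenInputCollector("fooId")
collector.feed(html)
collector.close()
print(collector.values)
```

The commented-out tag is skipped because the parser hands comment text to a separate handler instead of parsing it as markup.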