Original question: http://stackoverflow.com/questions/35538/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Validate (X)HTML in Python
Asked by cdleary
What's the best way to validate that a document follows some version of HTML (preferably one that I can specify)? I'd like to be able to know where the failures occur, as in a web-based validator, except in a native Python app.
Accepted answer by John Millikin
XHTML is easy, use lxml.
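from io import StringIO  # the original used the Python 2 StringIO module
from lxml import etree
# With recover=False, malformed markup raises etree.XMLSyntaxError instead of being repaired.
etree.parse(StringIO(html), etree.HTMLParser(recover=False))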
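If lxml is not installed, a rough first check for XHTML is plain XML well-formedness, which the standard library can report together with a line and column. The following is a minimal sketch, not a validator: the name check_wellformed is made up for illustration, no DTD is ever consulted, and named entities such as &nbsp; will be reported as errors because the DTD that defines them is never loaded.

```python
import xml.etree.ElementTree as ET


def check_wellformed(xhtml):
    """Return None if the text parses as XML, else (line, column, message).

    Hypothetical helper: it tests well-formedness only, not DTD validity.
    """
    try:
        ET.fromstring(xhtml)
    except ET.ParseError as err:
        line, column = err.position  # where the parser gave up
        return line, column, str(err)
    return None


good = "<html><head><title>t</title></head><body><p>ok</p></body></html>"
bad = "<html><body><p>unclosed</body></html>"

print(check_wellformed(good))
print(check_wellformed(bad))
```

This answers the "where did it fail" part of the question for XML-level mistakes only; checking against a specific XHTML version still needs a DTD-aware tool.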
HTML is harder, since there's traditionally not been as much interest in validation among the HTML crowd (run StackOverflow itself through a validator, yikes). The easiest solution would be to execute external applications such as nsgmls or OpenJade, and then parse their output.
Answer by Dave Brondsema
PyTidyLib is a nice Python binding for HTML Tidy. Their example:
from tidylib import tidy_document
document, errors = tidy_document('''<p>fõo <img src="bar.jpg">''',
    options={'numeric-entities':1})
print(document)
print(errors)
Moreover, it's compatible with both legacy HTML Tidy and the new tidy-html5.
Answer by Martin Hepp
I think the most elegant way is to invoke the W3C Validation Service at
http://validator.w3.org/
programmatically. Few people know that you do not have to screen-scrape the response to get the results, because the service returns non-standard HTTP header parameters
X-W3C-Validator-Recursion: 1
X-W3C-Validator-Status: Invalid (or Valid)
X-W3C-Validator-Errors: 6
X-W3C-Validator-Warnings: 0
for indicating the validity and the number of errors and warnings.
For instance, the command line
curl -I "http://validator.w3.org/check?uri=http%3A%2F%2Fwww.stalsoft.com"
returns
HTTP/1.1 200 OK
Date: Wed, 09 May 2012 15:23:58 GMT
Server: Apache/2.2.9 (Debian) mod_python/3.3.1 Python/2.5.2
Content-Language: en
X-W3C-Validator-Recursion: 1
X-W3C-Validator-Status: Invalid
X-W3C-Validator-Errors: 6
X-W3C-Validator-Warnings: 0
Content-Type: text/html; charset=UTF-8
Vary: Accept-Encoding
Connection: close
Thus, you can elegantly invoke the W3C Validation Service and extract the results from the HTTP header:
# Programmatic XHTML Validations in Python
# Martin Hepp and Alex Stolz
# (Python 3 port of the original urllib2-based snippet)
import urllib.parse
import urllib.request

URL = "http://validator.w3.org/check?uri=%s"
SITE_URL = "http://www.heppnetz.de"

# pattern for HEAD request taken from
# http://stackoverflow.com/questions/4421170/python-head-request-with-urllib2
request = urllib.request.Request(URL % urllib.parse.quote(SITE_URL), method="HEAD")
response = urllib.request.urlopen(request)

valid = response.headers.get('X-W3C-Validator-Status') == "Valid"
errors = int(response.headers.get('X-W3C-Validator-Errors'))
warnings = int(response.headers.get('X-W3C-Validator-Warnings'))

print("Valid markup: %s (Errors: %i, Warnings: %i)" % (valid, errors, warnings))
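The header handling can be kept separate from the network call, which makes it easy to try without contacting the validator. Below is a small sketch under that assumption; summarize_validator_headers is a hypothetical helper name, and the sample values are the ones from the curl output shown earlier.

```python
def summarize_validator_headers(headers):
    """Turn the X-W3C-Validator-* headers into (valid, errors, warnings).

    `headers` is any mapping of header names to values. Names are matched
    case-insensitively because HTTP clients differ in how they expose them.
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    status = lowered.get("x-w3c-validator-status")
    if status is None:
        raise ValueError("no X-W3C-Validator-Status header in the response")
    valid = status.strip() == "Valid"
    errors = int(lowered.get("x-w3c-validator-errors", 0))
    warnings = int(lowered.get("x-w3c-validator-warnings", 0))
    return valid, errors, warnings


# Header values copied from the curl output shown earlier.
sample = {
    "X-W3C-Validator-Recursion": "1",
    "X-W3C-Validator-Status": "Invalid",
    "X-W3C-Validator-Errors": "6",
    "X-W3C-Validator-Warnings": "0",
}
print(summarize_validator_headers(sample))  # (False, 6, 0)
```

Raising on a missing status header, rather than returning False, keeps "the document is invalid" distinct from "the service did not answer as expected".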
Answer by karlcow
You can decide to install the HTML validator locally and create a client to request the validation.
Here I had made a program to validate a list of URLs in a text file. I was just checking the HEAD to get the validation status, but if you do a GET you would get the full results. Look at the API of the validator; there are plenty of options for it.
import httplib2
import time

h = httplib2.Http(".cache")
with open("urllistfile.txt", "r") as f:
    urllist = f.readlines()

for url in urllist:
    # wait 10 seconds before the next request - be nice with the validator
    time.sleep(10)
    resp = {}
    url = url.strip()
    urlrequest = "http://qa-dev.w3.org/wmvs/HEAD/check?doctype=HTML5&uri=" + url
    try:
        resp, content = h.request(urlrequest, "HEAD")
        if resp['x-w3c-validator-status'] == "Abort":
            print(url, "FAIL")
        else:
            print(url, resp['x-w3c-validator-status'], resp['x-w3c-validator-errors'], resp['x-w3c-validator-warnings'])
    except Exception:
        # skip URLs whose request fails or whose response lacks the validator headers
        pass
Answer by Aaron Maenpaa
Try tidylib. You can get some really basic bindings as part of the elementtidy module (builds ElementTrees from HTML documents). http://effbot.org/downloads/#elementtidy
>>> import _elementtidy
>>> xhtml, log = _elementtidy.fixup("<html></html>")
>>> print log
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 1 column 7 - Warning: discarding unexpected </html>
line 1 column 14 - Warning: inserting missing 'title' element
Parsing the log should give you pretty much everything you need.
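As a sketch of what parsing that log could look like, the snippet below splits each report line into position, severity, and message with a regular expression. The pattern is inferred only from the three log lines shown above, so other Tidy output shapes may not match; parse_tidy_log is a hypothetical helper name.

```python
import re

# Matches lines such as "line 1 column 7 - Warning: discarding unexpected </html>"
LOG_LINE = re.compile(r"^line (\d+) column (\d+) - (\w+): (.*)$")


def parse_tidy_log(log):
    """Return a list of (line, column, severity, message) tuples."""
    entries = []
    for raw in log.splitlines():
        match = LOG_LINE.match(raw.strip())
        if match:  # skip anything that is not a positional report
            line, column, severity, message = match.groups()
            entries.append((int(line), int(column), severity, message))
    return entries


log = """line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 1 column 7 - Warning: discarding unexpected </html>
line 1 column 14 - Warning: inserting missing 'title' element"""

for entry in parse_tidy_log(log):
    print(entry)
```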
Answer by Neall
Answer by speedplane
This is a very basic HTML validator based on lxml's HTMLParser. It is not a complete HTML validator, but it does a few basic checks, doesn't require any internet connection, and doesn't require a large library.
_html_parser = None

def validate_html(html):
    global _html_parser
    from io import StringIO
    from lxml import etree
    if not _html_parser:
        # create the strict parser once and reuse it across calls
        _html_parser = etree.HTMLParser(recover=False)
    return etree.parse(StringIO(html), _html_parser)
Note that this will not check for closing tags, so for example, the following will pass:
validate_html("<a href='example.com'>foo")
> <lxml.etree._ElementTree at 0xb2fd888>
However, the following won't:
validate_html("<a href='example.com'>foo</a")
> XMLSyntaxError: End tag : expected '>', line 1, column 29
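If unclosed tags matter, one possible complement is a tag-balance check built on the standard library's html.parser. This is a different technique from the lxml approach above and only a sketch: the class and function names are made up for illustration, and the check is stricter than HTML itself, which allows the end tags of elements such as p and li to be omitted.

```python
from html.parser import HTMLParser

# Elements that never take an end tag in HTML.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TagBalanceChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of (tag, (line, column))
        self.problems = []    # human-readable findings

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.open_tags.append((tag, self.getpos()))

    def handle_startendtag(self, tag, attrs):
        pass  # <br/> style tags open and close in one step

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1][0] == tag:
            self.open_tags.pop()
        else:
            line, column = self.getpos()
            self.problems.append(
                "line %d column %d: unexpected </%s>" % (line, column, tag))


def find_unbalanced_tags(html):
    """Return a list of problem descriptions; an empty list means balanced."""
    checker = TagBalanceChecker()
    checker.feed(html)
    checker.close()
    for tag, (line, column) in checker.open_tags:
        checker.problems.append(
            "line %d column %d: <%s> is never closed" % (line, column, tag))
    return checker.problems


print(find_unbalanced_tags("<a href='example.com'>foo"))
print(find_unbalanced_tags("<a href='example.com'>foo</a>"))
```

The first call reports the unclosed a element that validate_html lets through; the second returns an empty list.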
Answer by Changaco
The html5lib module can be used to validate an HTML5 document:
>>> import html5lib
>>> html5parser = html5lib.HTMLParser(strict=True)
>>> html5parser.parse('<html></html>')
Traceback (most recent call last):
...
html5lib.html5parser.ParseError: Unexpected start tag (html). Expected DOCTYPE.
Answer by user9869932
In my case the Python W3C/HTML validation packages found with pip search w3c did not work (as of Sept 2016).
I solved this with
$ pip install requests
$ python
Python 2.7.12 (default, Jun 29 2016, 12:46:54)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> r = requests.post('https://validator.w3.org/nu/',
...                   data=file('index.html', 'rb').read(),
...                   params={'out': 'json'},
...                   headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36',
...                            'Content-Type': 'text/html; charset=UTF-8'})
>>> r.text
u'{"messages":[{"type":"info", ...
>>> r.json()
{u'messages': [{u'lastColumn': 59, ...
More documentation here: python requests, W3C Validator API
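Once r.json() has returned, the messages list can be reduced to the errors and warnings with their positions. The sketch below works on an already-decoded dictionary, so it runs without network access. The keys messages, type, and lastColumn appear in the output above; lastLine, subType, and message are assumed from the validator's JSON format, the sample payload is made up, and summarize_nu_messages is a hypothetical helper name.

```python
import json


def summarize_nu_messages(payload):
    """Return (errors, warnings) as lists of "line:column message" strings."""
    errors, warnings = [], []
    for msg in payload.get("messages", []):
        where = "%s:%s" % (msg.get("lastLine", "?"), msg.get("lastColumn", "?"))
        text = "%s %s" % (where, msg.get("message", ""))
        if msg.get("type") == "error":
            errors.append(text)
        elif msg.get("type") == "info" and msg.get("subType") == "warning":
            warnings.append(text)
    return errors, warnings


# Made-up payload in the assumed shape of the validator's JSON output.
sample = json.loads("""
{"messages": [
  {"type": "error", "lastLine": 3, "lastColumn": 59,
   "message": "Hypothetical error text."},
  {"type": "info", "subType": "warning", "lastLine": 1, "lastColumn": 15,
   "message": "Hypothetical warning text."}
]}
""")

errors, warnings = summarize_nu_messages(sample)
print(len(errors), "error(s),", len(warnings), "warning(s)")
```

With a live response, the call would be summarize_nu_messages(r.json()).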