Validating (X)HTML in Python

Question

Asked by cdleary

What's the best way to validate that a document follows some version of HTML (preferably one that I can specify)? I'd like to know where the failures occur, as in a web-based validator, except in a native Python app.

Answer 1

Accepted answer by John Millikin

XHTML is easy: use lxml.

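from io import StringIO  # Python 3; Python 2 used "from StringIO import StringIO"
from lxml import etree

# html holds the document text; recover=False makes lxml raise
# etree.XMLSyntaxError on bad markup instead of repairing it.
etree.parse(StringIO(html), etree.HTMLParser(recover=False))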
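
Since the question asks where failures occur, here is a minimal sketch of location reporting. It swaps in the standard library's xml.etree.ElementTree for lxml (an assumption, made so the example has no dependencies), and it checks only XML well-formedness, not validity against an XHTML DTD. The function name check_wellformed is hypothetical.

```python
import xml.etree.ElementTree as ET


def check_wellformed(xhtml):
    """Return None if xhtml parses as XML, else (line, column, message)."""
    try:
        ET.fromstring(xhtml)
    except ET.ParseError as exc:
        line, column = exc.position  # line is 1-based, column is 0-based
        return line, column, str(exc)
    return None


good = "<html><head><title>t</title></head><body><p>ok</p></body></html>"
bad = "<html><body><p>unclosed</body></html>"

print(check_wellformed(good))
print(check_wellformed(bad))
```

Because no DTD is loaded in this sketch, named entities such as &nbsp; are likely to be reported as undefined, so it suits quick structural checks rather than full XHTML validation.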

HTML is harder, since there has traditionally been less interest in validation among the HTML crowd (run StackOverflow itself through a validator, yikes). The easiest solution would be to execute external applications such as nsgmls or OpenJade, and then parse their output.

Answer 2

Answer by Dave Brondsema

PyTidyLib is a nice Python binding for HTML Tidy. Their example:

from tidylib import tidy_document

# tidy_document returns the cleaned-up markup and Tidy's error/warning log
document, errors = tidy_document('''<p>f&otilde;o <img src="bar.jpg">''',
    options={'numeric-entities':1})
print(document)
print(errors)

Moreover, it's compatible with both the legacy HTML Tidy and the new tidy-html5.

Answer 3

Answer by Martin Hepp

I think the most elegant way is to invoke the W3C Validation Service at

http://validator.w3.org/

programmatically. Few people know that you do not have to screen-scrape the output to get the results, because the service returns non-standard HTTP header parameters

X-W3C-Validator-Recursion: 1
X-W3C-Validator-Status: Invalid (or Valid)
X-W3C-Validator-Errors: 6
X-W3C-Validator-Warnings: 0

for indicating the validity and the number of errors and warnings.

For instance, the command line

curl -I "http://validator.w3.org/check?uri=http%3A%2F%2Fwww.stalsoft.com"

returns

HTTP/1.1 200 OK
Date: Wed, 09 May 2012 15:23:58 GMT
Server: Apache/2.2.9 (Debian) mod_python/3.3.1 Python/2.5.2
Content-Language: en
X-W3C-Validator-Recursion: 1
X-W3C-Validator-Status: Invalid
X-W3C-Validator-Errors: 6
X-W3C-Validator-Warnings: 0
Content-Type: text/html; charset=UTF-8
Vary: Accept-Encoding
Connection: close

Thus, you can elegantly invoke the W3C Validation Service and extract the results from the HTTP header:

# Programmatic XHTML Validations in Python
# Martin Hepp and Alex Stolz
# [email protected] / [email protected]

import urllib.parse
import urllib.request

URL = "http://validator.w3.org/check?uri=%s"
SITE_URL = "http://www.heppnetz.de"

# pattern for HEAD request taken from
# http://stackoverflow.com/questions/4421170/python-head-request-with-urllib2
# (Python 3 accepts the method directly instead of overriding get_method)

request = urllib.request.Request(URL % urllib.parse.quote(SITE_URL), method="HEAD")
response = urllib.request.urlopen(request)

# The validator reports its verdict in non-standard response headers
valid = response.headers.get('X-W3C-Validator-Status') == "Valid"
errors = int(response.headers.get('X-W3C-Validator-Errors'))
warnings = int(response.headers.get('X-W3C-Validator-Warnings'))

print("Valid markup: %s (Errors: %i, Warnings: %i) " % (valid, errors, warnings))
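
The header handling can also be kept separate from the network call, which makes it testable offline. The following is a minimal sketch under that assumption; the function name summarize_validator_headers is hypothetical, and the sample values are the ones from the curl output above.

```python
def summarize_validator_headers(headers):
    """Turn the X-W3C-Validator-* headers into (valid, errors, warnings)."""
    # Header names are case-insensitive in HTTP, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    status = lowered.get("x-w3c-validator-status")
    if status is None:
        raise ValueError("no X-W3C-Validator-Status header in response")
    valid = status == "Valid"
    errors = int(lowered.get("x-w3c-validator-errors", 0))
    warnings = int(lowered.get("x-w3c-validator-warnings", 0))
    return valid, errors, warnings


# Header values copied from the curl output shown above.
sample = {
    "X-W3C-Validator-Recursion": "1",
    "X-W3C-Validator-Status": "Invalid",
    "X-W3C-Validator-Errors": "6",
    "X-W3C-Validator-Warnings": "0",
}
print(summarize_validator_headers(sample))  # (False, 6, 0)
```

Raising on a missing status header, rather than treating it as "invalid", keeps a failed or redirected request from being mistaken for a validation result.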

Answer 4

Answer by karlcow

You can install the HTML validator locally and write a client that requests the validation.

Here is a program I wrote to validate a list of URLs in a text file. I only issue a HEAD request to get the validation status, but a GET returns the full results. Look at the validator's API; it has plenty of options.

import httplib2
import time

h = httplib2.Http(".cache")

# One URL per line
with open("urllistfile.txt", "r") as f:
    urllist = f.readlines()

for url in urllist:
    # wait 10 seconds before the next request - be nice with the validator
    time.sleep(10)
    url = url.strip()
    urlrequest = "http://qa-dev.w3.org/wmvs/HEAD/check?doctype=HTML5&uri="+url
    try:
        resp, content = h.request(urlrequest, "HEAD")
        if resp['x-w3c-validator-status'] == "Abort":
            print(url, "FAIL")
        else:
            print(url, resp['x-w3c-validator-status'], resp['x-w3c-validator-errors'], resp['x-w3c-validator-warnings'])
    except Exception:
        # skip URLs whose request fails or whose response lacks the validator headers
        pass
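
The script above appends each URL to the query string unescaped, which can misbehave when a page URL carries its own query string or spaces. A minimal sketch of building the request URL with percent-encoding follows; the helper names build_check_url and clean_url_list are hypothetical, and the endpoint is simply the one used in the script above.

```python
from urllib.parse import urlencode

# Endpoint taken from the script above; whether it is still reachable is not checked here.
CHECK_ENDPOINT = "http://qa-dev.w3.org/wmvs/HEAD/check"


def build_check_url(page_url, doctype="HTML5"):
    """Build a validator request URL with the page URL percent-encoded."""
    query = urlencode({"doctype": doctype, "uri": page_url})
    return CHECK_ENDPOINT + "?" + query


def clean_url_list(lines):
    """Strip whitespace and drop blank lines from a list of URL lines."""
    return [line.strip() for line in lines if line.strip()]


urls = clean_url_list(["http://www.stalsoft.com\n", "\n", "  http://www.heppnetz.de  \n"])
for url in urls:
    print(build_check_url(url))
```

Dropping blank lines up front also avoids spending a 10-second wait on an empty entry in the URL file.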

Answer 5

Answer by Aaron Maenpaa

Try tidylib. You can get some really basic bindings as part of the elementtidy module (builds elementtrees from HTML documents). http://effbot.org/downloads/#elementtidy

>>> import _elementtidy
>>> xhtml, log = _elementtidy.fixup("<html></html>")
>>> print log
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 1 column 7 - Warning: discarding unexpected </html>
line 1 column 14 - Warning: inserting missing 'title' element

Parsing the log should give you pretty much everything you need.
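
As a rough illustration of parsing that log, the sketch below turns each line into a (line, column, level, message) tuple with a regular expression. It assumes every entry follows the "line N column M - Level: message" shape seen in the output above; the name parse_tidy_log is hypothetical, and lines in any other shape are skipped.

```python
import re

# Matches lines such as: line 1 column 7 - Warning: discarding unexpected </html>
LOG_LINE = re.compile(
    r"^line (?P<line>\d+) column (?P<column>\d+) - (?P<level>\w+): (?P<message>.*)$"
)


def parse_tidy_log(log):
    """Return a list of (line, column, level, message) tuples from a Tidy log."""
    entries = []
    for raw in log.splitlines():
        match = LOG_LINE.match(raw.strip())
        if match:  # lines in any other shape are skipped
            entries.append((int(match["line"]), int(match["column"]),
                            match["level"], match["message"]))
    return entries


log = """line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 1 column 7 - Warning: discarding unexpected </html>
line 1 column 14 - Warning: inserting missing 'title' element"""

for entry in parse_tidy_log(log):
    print(entry)
```

With the entries in tuple form, filtering by level or sorting by position is a one-liner.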

Answer 6

Answer by Neall

I think that HTML Tidy will do what you want. There is a Python binding for it.

Answer 7

Answer by speedplane

This is a very basic HTML validator based on lxml's HTMLParser. It is not a complete HTML validator, but it does a few basic checks, doesn't require an internet connection, and doesn't require a large library.

_html_parser = None
def validate_html(html):
    global _html_parser
    from lxml import etree
    from io import StringIO  # Python 3; Python 2 used the StringIO module
    if not _html_parser:
        # create the strict (non-recovering) parser once and reuse it
        _html_parser = etree.HTMLParser(recover=False)
    return etree.parse(StringIO(html), _html_parser)

Note that this will not check for closing tags, so for example, the following will pass:

validate_html("<a href='example.com'>foo")
> <lxml.etree._ElementTree at 0xb2fd888>

However, the following won't:

validate_html("<a href='example.com'>foo</a")
> XMLSyntaxError: End tag : expected '>', line 1, column 29

Answer 8

Answer by Changaco

The html5lib module can be used to validate an HTML5 document:

>>> import html5lib
>>> html5parser = html5lib.HTMLParser(strict=True)
>>> html5parser.parse('<html></html>')
Traceback (most recent call last):
  ...
html5lib.html5parser.ParseError: Unexpected start tag (html). Expected DOCTYPE.

Answer 9

Answer by user9869932

In my case, the Python W3C/HTML validation packages found with pip search w3c did not work (as of September 2016).

I solved this with

$ pip install requests

$ python
Python 2.7.12 (default, Jun 29 2016, 12:46:54)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

>>> import requests
>>> r = requests.post('https://validator.w3.org/nu/', 
...                    data=file('index.html', 'rb').read(), 
...                    params={'out': 'json'}, 
...                    headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36', 
...                    'Content-Type': 'text/html; charset=UTF-8'})

>>> r.text
u'{"messages":[{"type":"info", ...

>>> r.json()
{u'messages': [{u'lastColumn': 59, ...

More documentation: python requests, W3C Validator API
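
To get from the JSON to failure locations, the messages can be filtered by type. This is a minimal sketch, assuming each message carries "type", "lastLine", "lastColumn" and "message" keys; only "type" and "lastColumn" are visible in the truncated output above, so the other two are assumptions. The function name errors_with_positions and the sample data are hypothetical.

```python
def errors_with_positions(result):
    """Return (line, column, message) for each message of type 'error'."""
    found = []
    for msg in result.get("messages", []):
        if msg.get("type") == "error":
            # .get() returns None for any key a message does not carry
            found.append((msg.get("lastLine"), msg.get("lastColumn"), msg.get("message")))
    return found


# Hypothetical response data, shaped like the output above; not real validator output.
sample = {
    "messages": [
        {"type": "info", "lastLine": 1, "lastColumn": 59, "message": "example info"},
        {"type": "error", "lastLine": 4, "lastColumn": 12, "message": "example error"},
    ]
}
print(errors_with_positions(sample))  # [(4, 12, 'example error')]
```

In the session above, the same function would be called as errors_with_positions(r.json()).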
