Python: How do I fix wrongly nested / unclosed HTML tags?
Original question: http://stackoverflow.com/questions/293482/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
How do I fix wrongly nested / unclosed HTML tags?
Asked by Baishampayan Ghose
I need to sanitize HTML submitted by the user by closing any open tags with correct nesting order. I have been looking for an algorithm or Python code to do this but haven't found anything except some half-baked implementations in PHP, etc.
For example, something like
<p>
<ul>
<li>Foo
becomes
<p>
<ul>
<li>Foo</li>
</ul>
</p>
Any help would be appreciated :)
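The closing step the question asks for can be sketched with nothing but the standard library: keep a stack of open tags while parsing, and at the end emit a closing tag for everything still on the stack, innermost first. This is only a minimal sketch, not a sanitizer. The names TagCloser, close_open_tags and VOID_TAGS are made up for illustration, it does not apply HTML5's implicit-close rules (so the <p> stays wrapped around the <ul>, as in the expected output above), and it does not strip scripts or unsafe attributes.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (hypothetical helper constant).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagCloser(HTMLParser):
    """Re-emit the input while tracking which tags are still open."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; untouched.
        super().__init__(convert_charrefs=False)
        self.out = []        # pieces of the repaired markup
        self.open_tags = []  # stack of currently open tag names

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: emit as-is, nothing to track.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag with no matching opener: drop it
        # Close everything opened after the matching tag, then the tag itself.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")


def close_open_tags(fragment):
    """Return fragment with unclosed tags closed in correct nesting order."""
    parser = TagCloser()
    parser.feed(fragment)
    parser.close()
    # Whatever is still on the stack was never closed: close innermost first.
    while parser.open_tags:
        parser.out.append(f"</{parser.open_tags.pop()}>")
    return "".join(parser.out)


print(close_open_tags("<p><ul><li>Foo"))
```

Wrongly nested input such as <b><i>text</b></i> is handled by the same stack: the </b> forces the inner <i> to close first, and the leftover </i> is dropped.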
Answered by pantsgolem
Using BeautifulSoup:
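from bs4 import BeautifulSoup  # BeautifulSoup 4; the original answer used the old BeautifulSoup 3 import
html = "<p><ul><li>Foo"
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
print(soup.prettify())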
gets you
<p>
 <ul>
  <li>
   Foo
  </li>
 </ul>
</p>
As far as I know, you can't control putting the <li></li> tags on separate lines from Foo.
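If the extra line breaks from prettify() are unwanted, one option is to serialize the soup with str() instead, which adds no whitespace of its own. A minimal sketch, assuming the bs4 package (BeautifulSoup 4) and its built-in "html.parser" backend; repair_compact is a hypothetical helper name.

```python
def repair_compact(fragment):
    """Close unclosed tags and return the markup without added whitespace."""
    # bs4 is a third-party package (pip install beautifulsoup4); it is
    # imported here so the dependency is only needed when this is called.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(fragment, "html.parser")
    # str() serializes the tree as-is, unlike prettify(), which puts every
    # tag and text node on its own line.
    return str(soup)
```

Calling repair_compact("<p><ul><li>Foo") keeps Foo on the same line as its <li> tags.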
Using Tidy:
import tidy  # the tidy module is provided by the uTidylib binding
html = "<p><ul><li>Foo"
print(tidy.parseString(html, show_body_only=True))
gets you
<ul>
<li>Foo</li>
</ul>
Unfortunately, I know of no way to keep the <p> tag in the example. Tidy interprets it as an empty paragraph rather than an unclosed one, so doing
print(tidy.parseString(html, show_body_only=True, drop_empty_paras=False))
comes out as
<p></p>
<ul>
<li>Foo</li>
</ul>
Ultimately, of course, the <p> tag in your example is redundant, so you might be fine with losing it.
Finally, Tidy can also do indenting:
print(tidy.parseString(html, show_body_only=True, indent=True))
becomes
<ul>
<li>Foo
</li>
</ul>
All of these have their ups and downs, but hopefully one of them is close enough.
Answered by Nicholas Piasecki
Answered by u7739221
Use html5lib; it works great. Like this:
soup = BeautifulSoup(data, 'html5lib')
Answered by drt
I tried the method below, but it failed on Python 3:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(page, 'html5lib')
I then tried the following, which worked:
import bs4
soup = bs4.BeautifulSoup(html, 'html5lib')
f_html = soup.prettify()
print(f'Formatted html::: {f_html}')
Answered by Mithril
I just ran into an HTML page that lxml and pyquery could not handle well; the HTML seems to contain some errors. Since Tidy is not easy to install on Windows, I chose BeautifulSoup. But I found that:
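from bs4 import BeautifulSoup
import lxml.html
soup = BeautifulSoup(page)  # no parser named, so bs4 picks the best one installed (lxml, if present)
h = lxml.html.fromstring(soup.prettify())  # lxml.html is a module; fromstring() does the parsing
behaves the same as h = lxml.html.fromstring(page).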
What really solved my problem was soup = BeautifulSoup(page, 'html5lib'). You need to install html5lib first; then you can use it as a parser in BeautifulSoup. The html5lib parser seems to work much better than the others.
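For reference, a hedged sketch of the html5lib route for a fragment rather than a whole page. It assumes both bs4 and html5lib are installed, and repair_with_html5lib is a made-up name. html5lib follows the browsers' parsing rules and wraps the input in a full <html>/<body> document, so the helper returns only the body's contents. Under those rules an opening <ul> closes an open <p>, so the question's example comes back with an empty paragraph, much like the Tidy output in the earlier answer.

```python
def repair_with_html5lib(fragment):
    """Parse fragment the way a browser would and return the repaired body."""
    # Third-party dependencies (pip install beautifulsoup4 html5lib),
    # imported here so they are only needed when this is called.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(fragment, "html5lib")
    # html5lib always builds <html><head></head><body>...</body></html>;
    # decode_contents() serializes just what ended up inside <body>.
    return soup.body.decode_contents()
```

If the wrapper document is wanted instead, str(soup) returns the whole tree.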
Hope this helps someone.