Python: how to fix wrongly nested / unclosed HTML tags?

Question

Asked by Baishampayan Ghose

I need to sanitize HTML submitted by the user by closing any open tags with correct nesting order. I have been looking for an algorithm or Python code to do this but haven't found anything except some half-baked implementations in PHP, etc.

For example, something like

<p>
  <ul>
    <li>Foo

becomes

<p>
  <ul>
    <li>Foo</li>
  </ul>
</p>

Any help would be appreciated :)

Answer 1

Answered by pantsgolem

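Using BeautifulSoup:

# BeautifulSoup 4 is imported from the bs4 package.
from bs4 import BeautifulSoup

html = "<p><ul><li>Foo"
# html.parser keeps the tags as written and closes whatever is still open.
soup = BeautifulSoup(html, "html.parser")
print(soup.prettify())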

gets you

<p>
 <ul>
  <li>
   Foo
  </li>
 </ul>
</p>

As far as I know, you can't control putting the <li></li> tags on separate lines from Foo.

Using Tidy:

# The tidy module is provided by the uTidylib package.
import tidy

html = "<p><ul><li>Foo"
print(tidy.parseString(html, show_body_only=True))

gets you

<ul>
<li>Foo</li>
</ul>

Unfortunately, I know of no way to keep the <p> tag in the example. Tidy interprets it as an empty paragraph rather than an unclosed one, so doing

print(tidy.parseString(html, show_body_only=True, drop_empty_paras=False))

comes out as

<p></p>
<ul>
<li>Foo</li>
</ul>

Ultimately, of course, the <p> tag in your example is redundant, so you might be fine with losing it.

Finally, Tidy can also do indenting:

print(tidy.parseString(html, show_body_only=True, indent=True))

becomes

<ul>
  <li>Foo
  </li>
</ul>

All of these have their ups and downs, but hopefully one of them is close enough.

Answer 2

Answered by Nicholas Piasecki

Run it through Tidy or one of its ported libraries.

Try to code it by hand and you will want to gouge your eyes out.
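
For a sense of what the hand-coded route involves, here is a minimal sketch using only the standard library's html.parser. It keeps a stack of open tags, closes intervening tags when a closing tag arrives out of order, drops stray closing tags, and closes whatever is left at the end of input. TagCloser and close_tags are hypothetical names made up for this illustration, not part of any library. The sketch knows nothing about HTML's implicit-closing rules (for example, it will nest a second <li> inside the first rather than closing it), which is a large part of why a real parser such as Tidy or html5lib is the safer choice for user-submitted HTML.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagCloser(HTMLParser):
    """Re-emit HTML while tracking open tags on a stack (hypothetical helper)."""

    def __init__(self):
        # Keep entities as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []    # pieces of the repaired document
        self.stack = []  # names of the currently open elements

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: emit as written, nothing to track.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag: drop it
        # Close everything opened after the matching start tag, innermost first.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")


def close_tags(html):
    """Return html with unclosed tags closed in correct nesting order."""
    parser = TagCloser()
    parser.feed(html)
    parser.close()
    # Close whatever is still open at end of input, innermost first.
    while parser.stack:
        parser.out.append(f"</{parser.stack.pop()}>")
    return "".join(parser.out)


print(close_tags("<p><ul><li>Foo"))  # → <p><ul><li>Foo</li></ul></p>
```

Mis-nested input is handled the same way: close_tags("<b><i>x</b> y") returns "<b><i>x</i></b> y".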

Answer 3

Answered by u7739221

Use html5lib; it works great. Like this:

soup = BeautifulSoup(data, 'html5lib')

Answer 4

Answered by drt

I tried the method below, but it failed on Python 3:

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(page, 'html5lib')

I then tried the following, which succeeded:

import bs4

soup = bs4.BeautifulSoup(html, 'html5lib')
f_html = soup.prettify()
print(f'Formatted html::: {f_html}')

Answer 5

Answered by Mithril

I just ran into some HTML that lxml and pyquery could not handle well; it seems to contain errors. Since Tidy is not easy to install on Windows, I chose BeautifulSoup. But I found that:

from BeautifulSoup import BeautifulSoup
import lxml.html
soup = BeautifulSoup(page)
h = lxml.html.fromstring(soup.prettify())

behaves the same as h = lxml.html.fromstring(page).

What really solved my problem was soup = BeautifulSoup(page, 'html5lib').
You need to install html5lib first; then you can use it as a parser in BeautifulSoup. The html5lib parser seems to work much better than the others.

Hope this helps someone.
