Python: BeautifulSoup and lxml.html - what to prefer?

Notice: this page reproduces a popular StackOverflow question (originally as a Chinese-English parallel translation) and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/4967103/

Time: 2020-08-18 18:23:08  Source: igfitidea

BeautifulSoup and lxml.html - what to prefer?

Tags: python, beautifulsoup, lxml

Asked by user225312

I am working on a project that will involve parsing HTML.

After searching around, I found two probable options: BeautifulSoup and lxml.html

Is there any reason to prefer one over the other? I used lxml for XML some time back and I feel I will be more comfortable with it; however, BeautifulSoup seems to be much more common.

I know I should use the one that works for me, but I was looking for personal experiences with both.

Accepted answer by simon

The simple answer, imo, is that if you trust your source to be well-formed, go with the lxml solution. Otherwise, BeautifulSoup all the way.

Edit:

This answer is three years old now; it's worth noting, as Jonathan Vanasco does in the comments, that BeautifulSoup4 now supports using lxml as the internal parser, so you can use the advanced features and interface of BeautifulSoup without most of the performance hit, if you wish (although I still reach straight for lxml myself -- perhaps it's just force of habit :)).
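
As a minimal sketch of what that looks like, assuming the bs4 package is installed (the sample markup and the make_soup helper are made up for illustration): the second argument to BeautifulSoup names the underlying parser, and the code falls back to the standard-library parser if lxml is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample markup with an unclosed <b> tag.
SAMPLE_HTML = "<html><body><p class='intro'>Hello, <b>world</p></body></html>"


def make_soup(markup):
    """Build a BeautifulSoup tree, preferring lxml as the underlying parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; fall back to the standard-library parser.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup(SAMPLE_HTML)
intro = soup.find("p", class_="intro")  # same BeautifulSoup API either way
intro_text = intro.get_text()
print(intro_text)
```

The calling code only sees the BeautifulSoup interface, so switching the parser name is the only change needed to compare the two.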

Answer by ymv

Use both? lxml for DOM manipulation, BeautifulSoup for parsing:

http://lxml.de/elementsoup.html

Answer by dfichter

lxml's great. But parsing your input as html is useful only if the dom structure actually helps you find what you're looking for.

Can you use ordinary string functions or regexes? For a lot of html parsing tasks, treating your input as a string rather than an html document is, counterintuitively, way easier.
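
For instance, if all you need is the page title, something like the following may be enough. This is only a sketch using the standard library; the sample page and the extract_title helper are invented for illustration, and a regex like this will not cope with every page.

```python
import re

# Hypothetical input: a page where only the <title> text is needed.
PAGE = (
    "<html><head><title>Release notes v1.2</title></head>"
    "<body><p>...</p></body></html>"
)

# Non-greedy match between the opening and closing title tags.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(page):
    """Return the text of the first <title> element, or None if absent."""
    match = TITLE_RE.search(page)
    return match.group(1).strip() if match else None


title = extract_title(PAGE)
print(title)  # Release notes v1.2
```

Once the target is nested or depends on document structure, a real parser is likely the safer choice.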

Answer by osa

In summary, lxml is positioned as a lightning-fast production-quality html and xml parser that, by the way, also includes a soupparser module to fall back on BeautifulSoup's functionality. BeautifulSoup is a one-person project, designed to save you time when you need to quickly extract data out of poorly-formed html or xml.

The lxml documentation says that both parsers have advantages and disadvantages. For this reason, lxml provides a soupparser so you can switch back and forth. Quoting,

BeautifulSoup uses a different parsing approach. It is not a real HTML parser but uses regular expressions to dive through tag soup. It is therefore more forgiving in some cases and less good in others. It is not uncommon that lxml/libxml2 parses and fixes broken HTML better, but BeautifulSoup has superiour support for encoding detection. It very much depends on the input which parser works better.

In the end they are saying,

The downside of using this parser is that it is much slower than the HTML parser of lxml. So if performance matters, you might want to consider using soupparser only as a fallback for certain cases.
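
One way to read "only as a fallback" is sketched below, assuming lxml is installed. The parse_with_fallback helper and its looks_ok check are hypothetical names, not part of lxml: the caller decides what counts as a bad parse, and soupparser is tried only when the fast parser's result is rejected.

```python
import lxml.html
from lxml.etree import ParserError

try:
    from lxml.html import soupparser  # requires Beautiful Soup to be installed
except ImportError:
    soupparser = None


def parse_with_fallback(markup, looks_ok):
    """Parse with lxml first; retry with soupparser if the result looks wrong.

    `looks_ok` is a hypothetical caller-supplied check on the parsed tree.
    """
    try:
        root = lxml.html.fromstring(markup)
        if looks_ok(root):
            return root
    except ParserError:
        pass  # lxml rejected the input outright (e.g. an empty document)
    if soupparser is None:
        raise ValueError("lxml result rejected and Beautiful Soup is not installed")
    return soupparser.fromstring(markup)


# Hypothetical input with unclosed <p> tags; we expect two paragraphs.
SAMPLE = "<div><p>first<p>second</div>"
root = parse_with_fallback(SAMPLE, lambda tree: len(tree.findall(".//p")) == 2)
paragraphs = [p.text for p in root.iter("p")]
print(paragraphs)
```

The slower parser then costs time only on the inputs where the fast one disappoints.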

If I understand them correctly, it means that the soup parser is more robust (it can deal with a "soup" of malformed tags by using regular expressions), whereas lxml is more straightforward and just parses things and builds a tree as you would expect. I assume this also applies to BeautifulSoup itself, not just to the soupparser for lxml.

They also show how to benefit from BeautifulSoup's encoding detection, while still parsing quickly with lxml:

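>>> import lxml.html
>>> # Beautiful Soup 4 names; the original snippet used Beautiful Soup 3
>>> # ("from BeautifulSoup import UnicodeDammit", isHTML, .unicode).
>>> from bs4 import UnicodeDammit

>>> def decode_html(html_string):
...     # Let Beautiful Soup guess the encoding and decode the bytes to text.
...     converted = UnicodeDammit(html_string, is_html=True)
...     if not converted.unicode_markup:
...         # UnicodeDecodeError needs five arguments, so raise ValueError.
...         raise ValueError(
...             "Failed to detect encoding, tried [%s]"
...             % ', '.join(str(enc) for enc in converted.tried_encodings))
...     # print(converted.original_encoding)
...     return converted.unicode_markup

>>> # tag_soup is assumed to hold the raw HTML bytes to parse.
>>> root = lxml.html.fromstring(decode_html(tag_soup))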

(Same source: http://lxml.de/elementsoup.html).

In the words of BeautifulSoup's creator,

That's it! Have fun! I wrote Beautiful Soup to save everybody time. Once you get used to it, you should be able to wrangle data out of poorly-designed websites in just a few minutes. Send me email if you have any comments, run into problems, or want me to know about your project that uses Beautiful Soup.

 --Leonard

Quoted from the Beautiful Soup documentation.

I hope this is now clear. The soup is a brilliant one-person project designed to save you time when extracting data from poorly-designed websites. The goal is to save you time right now, to get the job done, not necessarily to save you time in the long term, and definitely not to optimize the performance of your software.

Also, from the lxml website,

lxml has been downloaded from the Python Package Index more than two million times and is also available directly in many package distributions, e.g. for Linux or MacOS-X.

And, from Why lxml?,

The C libraries libxml2 and libxslt have huge benefits:... Standards-compliant... Full-featured... fast. fast! FAST! ... lxml is a new Python binding for libxml2 and libxslt...
