Python: Setting lxml as the default BeautifulSoup parser

Question

Asked by Adam Hammes

I'm working on a web scraping project and have run into problems with speed. To try to fix it, I want to use lxml instead of html.parser as BeautifulSoup's parser. I've been able to do this:

soup = bs4.BeautifulSoup(html, 'lxml')

but I don't want to have to type 'lxml' every time I call BeautifulSoup. Is there a way I can set which parser to use once at the beginning of my program?

Answer 1

Accepted answer by alecxe

According to the "Specifying the parser to use" documentation page:

The first argument to the BeautifulSoup constructor is a string or an open filehandle–the markup you want parsed. The second argument is how you'd like the markup parsed.
If you don't specify anything, you'll get the best HTML parser that's installed. Beautiful Soup ranks lxml's parser as being the best, then html5lib's, then Python's built-in parser.

In other words, just installing lxml in the same Python environment makes it the default parser.
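
One way to check which parser was actually picked is to look at the tree builder attached to the soup object. This is only a minimal sketch, assuming bs4 is installed: the helper name default_parser_name is made up for illustration, and it relies on the builder exposing its name as soup.builder.NAME, which I believe holds in current bs4 releases.

```python
import warnings

import bs4


def default_parser_name(markup="<p>probe</p>"):
    """Return the name of the parser Beautiful Soup picks when none is given.

    Hypothetical helper, not part of bs4.
    """
    with warnings.catch_warnings():
        # Omitting the parser makes bs4 warn that it had to guess;
        # silence that warning for this probe only.
        warnings.simplefilter("ignore")
        soup = bs4.BeautifulSoup(markup)
    return soup.builder.NAME


print(default_parser_name())
```

With lxml installed this should report lxml; without it, one of the other installed parsers.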

Note, though, that explicitly specifying a parser is considered best practice. There are differences between parsers that can result in subtle errors, which would be difficult to debug if you let BeautifulSoup choose the best parser by itself. You would also have to remember to have lxml installed. And if it were not installed, you would not even notice: BeautifulSoup would just pick the next available parser without throwing any errors.
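
By contrast, naming the parser explicitly fails loudly when it is missing: bs4 raises bs4.FeatureNotFound if it cannot find a tree builder for the requested name. A small sketch of turning that into a clearer error message; the function name parse_strict and the wording of the message are my own, not anything bs4 provides.

```python
import bs4


def parse_strict(markup, parser="lxml"):
    """Parse markup with exactly the requested parser, or fail with a clear error.

    Hypothetical helper: bs4 itself only raises FeatureNotFound.
    """
    try:
        return bs4.BeautifulSoup(markup, parser)
    except bs4.FeatureNotFound as exc:
        raise RuntimeError(
            f"Parser {parser!r} is not available; install it or pick another one."
        ) from exc


# html.parser ships with Python, so this call works without extra packages.
soup = parse_strict("<p>hello</p>", parser="html.parser")
print(soup.p.string)
```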

If you still don't want to specify the parser explicitly, at least leave a note in the project's README/documentation for your future self and for others who will use your code, and list lxml in your project requirements alongside beautifulsoup4.

Besides: "Explicit is better than implicit."

Answer 2

Answer by Leonid

Obviously, take a look at the accepted answer first. It is pretty good. As for this technicality:

but I don't want to have to repeatedly type 'lxml' every time I call BeautifulSoup. Is there a way I can set which parser to use once at the beginning of my program?

If I understood your question correctly, I can think of two approaches that will save you some keystrokes:
- Define a wrapper function, or
- Create a partial function.

# V1 - define a wrapper function - most straight-forward.
import bs4

def bs_parse(html):
    return bs4.BeautifulSoup(html, 'lxml')
# ...
html = ...
bs_parse(html)

Or if you feel like showing off ...

import bs4
from functools import partial
bs_parse = partial(bs4.BeautifulSoup, features='lxml')
# ...
html = ...
bs_parse(html)
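
A side benefit of the partial version is that the parser stays overridable per call, because keyword arguments passed at call time take precedence over the ones frozen into the partial. A rough sketch of that idea; make_parser is a hypothetical factory name, and html.parser is used in the demo only because it needs no extra install.

```python
from functools import partial

import bs4


def make_parser(default="lxml"):
    """Return a BeautifulSoup callable with the parser preset (hypothetical helper)."""
    return partial(bs4.BeautifulSoup, features=default)


# Set the parser once, near the top of the program.
bs_parse = make_parser("html.parser")

soup = bs_parse("<b>hi</b>")
print(soup.b.string)

# A single call can still choose a different parser by keyword.
other = make_parser("lxml")("<b>hi</b>", features="html.parser")
print(other.builder.NAME)
```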
