Python: Set lxml as default BeautifulSoup parser
Original question: http://stackoverflow.com/questions/27790415/
Notice: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me): Stack Overflow
Asked by Adam Hammes
I'm working on a web scraping project and have run into problems with speed. To try to fix it, I want to use lxml instead of html.parser as BeautifulSoup's parser. I've been able to do this:
soup = bs4.BeautifulSoup(html, 'lxml')
but I don't want to have to repeatedly type 'lxml' every time I call BeautifulSoup. Is there a way I can set which parser to use once at the beginning of my program?
Accepted answer by alecxe
According to the "Specifying the parser to use" documentation page:
The first argument to the BeautifulSoup constructor is a string or an open filehandle–the markup you want parsed. The second argument is how you'd like the markup parsed.
If you don't specify anything, you'll get the best HTML parser that's installed. Beautiful Soup ranks lxml's parser as being the best, then html5lib's, then Python's built-in parser.
In other words, just installing lxml in the same Python environment makes it the default parser.
Note, though, that explicitly stating a parser is considered best practice. There are differences between parsers that can result in subtle errors, which would be difficult to debug if you let BeautifulSoup choose the best parser by itself. You would also have to remember that you need to have lxml installed. And if you did not have it installed, you would not even notice: BeautifulSoup would just pick the next available parser without throwing any errors.
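One way to guard against that silent fallback is to check for lxml once at program start. The following is only a minimal sketch using the standard library; `require_module` is a hypothetical helper name, not part of BeautifulSoup:

```python
import importlib.util


def require_module(name: str) -> None:
    """Raise ImportError if the named top-level module cannot be found.

    Hypothetical helper for illustration; not part of bs4 or lxml.
    """
    if importlib.util.find_spec(name) is None:
        raise ImportError(
            f"{name!r} is not installed; install it or choose another parser"
        )


# Call once at program start, before any parsing happens, e.g.:
# require_module("lxml")
```

With a check like this, a missing lxml installation fails loudly at startup instead of quietly changing how the markup is parsed.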
If you still don't want to specify the parser explicitly, at least leave a note in the project's README/documentation for your future self or others who will use the code you've written, and list lxml in your project requirements alongside beautifulsoup4.
Besides: "Explicit is better than implicit."
Answer by Leonid
Obviously, take a look at the accepted answer first. It is pretty good, and as for this technicality:
but I don't want to have to repeatedly type 'lxml' every time I call BeautifulSoup. Is there a way I can set which parser to use once at the beginning of my program?
If I understood your question correctly, I can think of two approaches that will save you some keystrokes: define a wrapper function, or create a partial function.
# V1 - define a wrapper function - most straightforward.
import bs4

def bs_parse(html):
    # Always parse with lxml so callers never have to name the parser.
    return bs4.BeautifulSoup(html, 'lxml')

# ...
html = ...
bs_parse(html)
Or if you feel like showing off ...
import bs4
from functools import partial
bs_parse = partial(bs4.BeautifulSoup, features='lxml')
# ...
html = ...
bs_parse(html)
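A side note on the partial approach: a keyword bound by `partial` acts as a default that can still be overridden per call. The sketch below shows this with a hypothetical stand-in function, `fake_soup`, rather than the real `bs4.BeautifulSoup`, so it runs without any third-party packages:

```python
from functools import partial


def fake_soup(markup, features=None):
    # Hypothetical stand-in for bs4.BeautifulSoup; it only echoes its arguments.
    return (markup, features)


parse = partial(fake_soup, features="lxml")

# No parser named: the keyword bound by partial is used.
default_call = parse("<p>hi</p>")

# A keyword given at call time overrides the one bound by partial.
override_call = parse("<p>hi</p>", features="html.parser")
```

The override has to be passed by keyword. Passing the parser positionally, as in `parse(markup, "html.parser")`, would raise a TypeError because `features` would receive two values.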