Beautifulsoup: Is there a difference between .find() and .select() - python 3.xx
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/38028384/
Asked by Dieter
I've a simple question:

When you use BeautifulSoup to scrape a certain part of a website, you can use data.find(), data.findAll() or data.select().
Now the question is: is there a significant difference between the .find() and the .select() methods (e.g. in performance or flexibility, or ...)? Or are they just the same?
Kind regards
Answered by Padraic Cunningham
To summarise the comments:
- select finds multiple instances and returns a list, find finds the first, so they don't do the same thing. select_one would be the equivalent to find.
- I almost always use css selectors when chaining tags or using tag.classname; if looking for a single element without a class I use find. Essentially it comes down to the use case and personal preference.
- As far as flexibility goes I think you know the answer: soup.select("div[id=foo] > div > div > div[class=fee] > span > span > a") would look pretty ugly using multiple chained find/find_all calls.
- The only issue with the css selectors in bs4 is the very limited support: nth-of-type is the only pseudo class implemented, and chaining attributes like a[href][src] is also not supported, as are many other parts of css selectors. But things like a[href=..], a[href^=] and a[href$=] etc. are, I think, much nicer than find("a", href=re.compile(....)), but again that is personal preference (a side-by-side sketch follows this list).
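To make the equivalences above concrete, here is a minimal side-by-side sketch; the HTML snippet and variable names are invented for illustration, and it assumes a bs4 version recent enough to have select_one:

from bs4 import BeautifulSoup
import re

html = """
<div id="foo">
  <div class="fee"><span><a href="https://example.com/page">link</a></span></div>
</div>
"""
soup = BeautifulSoup(html, "lxml")

# first match: find() vs select_one()
first_find = soup.find("a")
first_css = soup.select_one("a")    # same tag

# all matches: find_all() and select() both return lists
all_find = soup.find_all("a", href=True)
all_css = soup.select("a[href]")

# chaining tags: nested find() calls vs one css selector
nested = soup.find("div", id="foo").find("div", class_="fee").find("a")
chained = soup.select_one("div#foo div.fee a")

# filtering on an attribute prefix: regex vs a[href^=...]
by_regex = soup.find_all("a", href=re.compile(r"^https"))
by_css = soup.select('a[href^="https"]')

Both styles return the same Tag objects, so the results can be mixed and chained freely.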
For performance we can run some tests. I modified the code from an answer here, running on 800+ html files taken from here; it is not exhaustive but should give a clue to the readability of some of the options and the performance:
The modified functions are:
from bs4 import BeautifulSoup
from glob import iglob

def parse_find(soup):
    # pull the same fields using find/find_all/find_previous
    author = soup.find("h4", class_="h12 talk-link__speaker").text
    title = soup.find("h4", class_="h9 m5").text
    date = soup.find("span", class_="meta__val").text.strip()
    soup.find("footer", class_="footer").find_previous("data", {
        "class": "talk-transcript__para__time"}).text.split(":")
    soup.find_all("span", class_="talk-transcript__fragment")

def parse_select(soup):
    # pull the same fields using css selectors
    author = soup.select_one("h4.h12.talk-link__speaker").text
    title = soup.select_one("h4.h9.m5").text
    date = soup.select_one("span.meta__val").text.strip()
    soup.select_one("footer.footer").find_previous("data", {
        "class": "talk-transcript__para__time"}).text
    soup.select("span.talk-transcript__fragment")

def test(patt, func):
    # parse every saved html file with the given parser function
    for html in iglob(patt):
        with open(html) as f:
            func(BeautifulSoup(f, "lxml"))
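The timings below use IPython's timeit magic. Outside IPython, a rough equivalent, as a sketch assuming the same testing module and a ./talks/*.html folder of saved pages, would be:

import timeit
from testing import test, parse_find, parse_select

# one full pass over the saved pages for each variant
t_find = timeit.timeit(lambda: test("./talks/*.html", parse_find), number=1)
t_select = timeit.timeit(lambda: test("./talks/*.html", parse_select), number=1)
print("find: %.1fs  select: %.1fs" % (t_find, t_select))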
Now for the timings:
In [7]: from testing import test, parse_find, parse_select
In [8]: timeit test("./talks/*.html",parse_find)
1 loops, best of 3: 51.9 s per loop
In [9]: timeit test("./talks/*.html",parse_select)
1 loops, best of 3: 32.7 s per loop
Like I said, not exhaustive, but I think we can safely say the css selectors are definitely more efficient.