Python Beautiful Soup findAll doesn't find them all
Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must follow the same CC BY-SA terms, link the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/16322862/
Beautiful Soup findAll doesn't find them all
Asked by Clepto
I'm trying to parse a website and get some info with BeautifulSoup.findAll, but it doesn't find all of them. I'm using Python 3.
The code is this:
#!/usr/bin/python3
from bs4 import BeautifulSoup
from urllib.request import urlopen

page = urlopen("http://mangafox.me/directory/")
# print(page.read())
soup = BeautifulSoup(page.read())
manga_img = soup.findAll('a', {'class': 'manga_img'}, limit=None)

for manga in manga_img:
    print(manga['href'])
It only prints about half of them...
Accepted answer by Martijn Pieters
Different HTML parsers deal differently with broken HTML. That page serves broken HTML, and the lxml parser is not dealing very well with it:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://mangafox.me/directory/')
>>> soup = BeautifulSoup(r.content, 'lxml')
>>> len(soup.find_all('a', class_='manga_img'))
18
The standard library html.parser has less trouble with this specific page:
>>> soup = BeautifulSoup(r.content, 'html.parser')
>>> len(soup.find_all('a', class_='manga_img'))
44
Translating that to your specific code sample using urllib, you would specify the parser thus:
soup = BeautifulSoup(page, 'html.parser')  # BeautifulSoup can do the reading
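
Putting it together, here is a minimal end-to-end sketch of the question's script with the parser made explicit. The URL and the manga_img class come from the question itself; whether the site still serves that markup today is an assumption of this example.

#!/usr/bin/python3
# Sketch only: the asker's script with an explicit parser choice.
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Fetch the page with the standard library, as in the original code.
page = urlopen("http://mangafox.me/directory/")

# Let BeautifulSoup read the response object itself and parse it with
# html.parser, which copes better with this page's broken HTML than lxml does.
soup = BeautifulSoup(page, 'html.parser')

for manga in soup.find_all('a', class_='manga_img'):
    print(manga['href'])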

