Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not the translator). Original question: http://stackoverflow.com/questions/16322862/

Date: 2020-08-18 22:17:04  Source: igfitidea

Beautiful Soup findAll doesn't find them all

Tags: python, html, python-3.x, beautifulsoup

Asked by Clepto

I'm trying to parse a website and get some info with BeautifulSoup.findAll, but it doesn't find them all. I'm using Python 3.

The code is this:

#!/usr/bin/python3

from bs4 import BeautifulSoup
from urllib.request import urlopen

page = urlopen ("http://mangafox.me/directory/")
# print (page.read ())
soup = BeautifulSoup (page.read ())

manga_img = soup.findAll ('a', {'class' : 'manga_img'}, limit=None)

for manga in manga_img:
    print (manga['href'])

It only prints about half of them...

Accepted answer by Martijn Pieters

Different HTML parsers deal with broken HTML differently. That page serves broken HTML, and the lxml parser does not deal with it very well:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://mangafox.me/directory/')
>>> soup = BeautifulSoup(r.content, 'lxml')
>>> len(soup.find_all('a', class_='manga_img'))
18

The standard-library html.parser has less trouble with this specific page:

>>> soup = BeautifulSoup(r.content, 'html.parser')
>>> len(soup.find_all('a', class_='manga_img'))
44
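To make the leniency concrete, here is a minimal, self-contained check (not from the original answer; the broken snippet is illustrative) showing that the stdlib html.parser still recovers every matching anchor from deliberately malformed markup:

```python
from bs4 import BeautifulSoup

# Deliberately broken HTML: the first <a> is never closed and the <p> is left open
broken = ('<div><a class="manga_img" href="/a">one'
          '<p><a class="manga_img" href="/b">two</div>')

soup = BeautifulSoup(broken, 'html.parser')
links = [a['href'] for a in soup.find_all('a', class_='manga_img')]
print(links)  # both anchors survive the broken nesting
```

Counting the matches like this mirrors the len(soup.find_all(...)) checks above; with a lenient parser no anchors are silently dropped.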

Translating that to your specific code sample, which uses urllib, you would specify the parser like this:

soup = BeautifulSoup(page, 'html.parser')  # BeautifulSoup can read the file-like object itself
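Putting it together, here is a sketch of the corrected script with the parsing step pulled into a function so it can be checked without network access (the function name and the offline sample are mine, not from the answer):

```python
from bs4 import BeautifulSoup

def manga_links(html):
    """Return the href of every <a class="manga_img"> anchor in the page."""
    soup = BeautifulSoup(html, 'html.parser')  # explicit, lenient stdlib parser
    return [a['href'] for a in soup.find_all('a', class_='manga_img')]

# Against the live site (network required), mirroring the question's code:
#     from urllib.request import urlopen
#     for href in manga_links(urlopen('http://mangafox.me/directory/').read()):
#         print(href)

# Quick offline check with a tiny sample:
sample = '<a class="manga_img" href="/manga/one/">One</a>'
print(manga_links(sample))
```

Passing the parser name explicitly also silences the "no parser was explicitly specified" warning that newer BeautifulSoup versions emit for the original code.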