Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not the translator). Original question: http://stackoverflow.com/questions/16322862/

Date: 2020-08-18 22:17:04  Source: igfitidea

Beautiful Soup findAll doesn't find them all

Tags: python, html, python-3.x, beautifulsoup

Asked by Clepto

I'm trying to parse a website and get some info with BeautifulSoup.findAll, but it doesn't find them all. I'm using Python 3.

The code is this:

#!/usr/bin/python3

from bs4 import BeautifulSoup
from urllib.request import urlopen

page = urlopen ("http://mangafox.me/directory/")
# print (page.read ())
soup = BeautifulSoup (page.read ())

manga_img = soup.findAll ('a', {'class' : 'manga_img'}, limit=None)

for manga in manga_img:
    print (manga['href'])

It only prints about half of them...

Accepted answer by Martijn Pieters

Different HTML parsers deal with broken HTML differently. That page serves broken HTML, and the lxml parser does not deal with it very well:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://mangafox.me/directory/')
>>> soup = BeautifulSoup(r.content, 'lxml')
>>> len(soup.find_all('a', class_='manga_img'))
18

The standard-library html.parser has less trouble with this specific page:

>>> soup = BeautifulSoup(r.content, 'html.parser')
>>> len(soup.find_all('a', class_='manga_img'))
44
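To make the leniency concrete, here is a minimal, self-contained check (not from the original answer; the broken snippet is illustrative) showing that the stdlib html.parser still recovers every matching anchor from deliberately malformed markup:

```python
from bs4 import BeautifulSoup

# Deliberately broken HTML: the first <a> is never closed and the <p> is left open
broken = ('<div><a class="manga_img" href="/a">one'
          '<p><a class="manga_img" href="/b">two</div>')

soup = BeautifulSoup(broken, 'html.parser')
links = [a['href'] for a in soup.find_all('a', class_='manga_img')]
print(links)  # both anchors survive the broken nesting
```

Counting the matches like this mirrors the len(soup.find_all(...)) checks above; with a lenient parser no anchors are silently dropped.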

Translating that to your specific code sample, which uses urllib, you would specify the parser like this:

soup = BeautifulSoup(page, 'html.parser')  # BeautifulSoup can read the file-like object itself
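Putting it together, here is a sketch of the corrected script with the parsing step pulled into a function so it can be checked without network access (the function name and the offline sample are mine, not from the answer):

```python
from bs4 import BeautifulSoup

def manga_links(html):
    """Return the href of every <a class="manga_img"> anchor in the page."""
    soup = BeautifulSoup(html, 'html.parser')  # explicit, lenient stdlib parser
    return [a['href'] for a in soup.find_all('a', class_='manga_img')]

# Against the live site (network required), mirroring the question's code:
#     from urllib.request import urlopen
#     for href in manga_links(urlopen('http://mangafox.me/directory/').read()):
#         print(href)

# Quick offline check with a tiny sample:
sample = '<a class="manga_img" href="/manga/one/">One</a>'
print(manga_links(sample))
```

Passing the parser name explicitly also silences the "no parser was explicitly specified" warning that newer BeautifulSoup versions emit for the original code.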