Download files in Python using requests and BeautifulSoup

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/19056031/


Download files using requests and BeautifulSoup

Tags: python, download, beautifulsoup, python-requests

Asked by Filipe Manuel

I'm trying to download a bunch of PDF files from here using requests and beautifulsoup4. This is my code:


import requests
from bs4 import BeautifulSoup as bs

_ANO = '2013/'
_MES = '01/'
_MATERIAS = 'matematica/'
_CONTEXT = 'wp-content/uploads/' + _ANO + _MES
_URL = 'http://www.desconversa.com.br/' + _MATERIAS + _CONTEXT

r = requests.get(_URL)
soup = bs(r.text)

for i, link in enumerate(soup.findAll('a')):
    _FULLURL = _URL + link.get('href')

    for x in range(i):
        output = open('file[%d].pdf' % x, 'wb')
        # _FULLURL is a plain str, not a file object, so .read() fails here
        output.write(_FULLURL.read())
        output.close()

I'm getting AttributeError: 'str' object has no attribute 'read'.


OK, I know why that happens, but... how can I download the file from the generated URL?

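The error happens because _FULLURL is only a string containing the URL; strings have no read() method, so the URL has to be fetched before its bytes can be written. A minimal sketch of that missing step, reusing the requests import already in the question (the chunked iter_content loop is one way to stream the response, not the only way):

response = requests.get(_FULLURL)  # fetch the URL before trying to read it
with open('file[%d].pdf' % i, 'wb') as output:
    for chunk in response.iter_content(1024):  # stream the body in 1 KB chunks
        output.write(chunk)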

Answered by samstav

This will write all the files from the page, under their original filenames, into a pdfs/ directory (which must already exist).


import requests
from bs4 import BeautifulSoup as bs
import urllib2  # Python 2 only; see the Python 3 sketch after this answer


_ANO = '2013/'
_MES = '01/'
_MATERIAS = 'matematica/'
_CONTEXT = 'wp-content/uploads/' + _ANO + _MES
_URL = 'http://www.desconversa.com.br/' + _MATERIAS + _CONTEXT

# functional
r = requests.get(_URL)
soup = bs(r.text)
urls = []
names = []
for i, link in enumerate(soup.findAll('a')):
    _FULLURL = _URL + link.get('href')
    if _FULLURL.endswith('.pdf'):
        urls.append(_FULLURL)
        names.append(soup.select('a')[i].attrs['href'])  # the href itself serves as the filename

names_urls = zip(names, urls)

for name, url in names_urls:
    print url  # Python 2 print statement
    rq = urllib2.Request(url)
    res = urllib2.urlopen(rq)
    pdf = open("pdfs/" + name, 'wb')
    pdf.write(res.read())
    pdf.close()
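Since urllib2 exists only on Python 2, the code above will not run on Python 3. A rough Python 3 sketch of the same idea, using requests for the downloads as well (the explicit 'html.parser' argument and the os.makedirs call are additions, not part of the original answer):

import os
import requests
from bs4 import BeautifulSoup as bs

_ANO = '2013/'
_MES = '01/'
_MATERIAS = 'matematica/'
_CONTEXT = 'wp-content/uploads/' + _ANO + _MES
_URL = 'http://www.desconversa.com.br/' + _MATERIAS + _CONTEXT

os.makedirs('pdfs', exist_ok=True)  # make sure the target directory exists

r = requests.get(_URL)
soup = bs(r.text, 'html.parser')  # explicit parser avoids the bs4 warning

for link in soup.find_all('a'):
    href = link.get('href')
    if href and href.endswith('.pdf'):
        print('Downloading %s' % href)
        res = requests.get(_URL + href)
        with open(os.path.join('pdfs', href), 'wb') as pdf:
            pdf.write(res.content)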

Answered by Balzer82

It might be easier with wget, because then you have the full power of wget (user agent, link following, ignoring robots.txt, ...) if necessary:


import os

names_urls = zip(names, urls)  # names and urls collected as in the previous answer

for name, url in names_urls:
    print('Downloading %s' % url)
    os.system('wget %s' % url)
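One caveat: os.system splices the URL straight into a shell command, so any shell metacharacters in a URL would be interpreted by the shell. A slightly safer variant of the same approach, sketched with subprocess (an editorial substitution, not part of the original answer):

import subprocess

for name, url in names_urls:
    print('Downloading %s' % url)
    # passing the URL as a separate argument avoids shell interpretation
    subprocess.run(['wget', url], check=True)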