Python: Download files using requests and BeautifulSoup

Question

Asked by Filipe Manuel

I'm trying to download a bunch of PDF files from here using requests and beautifulsoup4. This is my code:

import requests
from bs4 import BeautifulSoup as bs

_ANO = '2013/'
_MES = '01/'
_MATERIAS = 'matematica/'
_CONTEXT = 'wp-content/uploads/' + _ANO + _MES
_URL = 'http://www.desconversa.com.br/' + _MATERIAS + _CONTEXT

r = requests.get(_URL)
soup = bs(r.text)

for i, link in enumerate(soup.findAll('a')):
    _FULLURL = _URL + link.get('href')

    for x in range(i):
        output = open('file[%d].pdf' % x, 'wb')
        output.write(_FULLURL.read())
        output.close()

I'm getting AttributeError: 'str' object has no attribute 'read'.

OK, I understand why, but how can I download from the generated URL?

Answer 1

Answered by samstav

This will write all the PDF files linked from the page, with their original filenames, into a pdfs/ directory.

import os
from urllib.request import urlopen  # Python 3 replacement for urllib2

import requests
from bs4 import BeautifulSoup as bs


_ANO = '2013/'
_MES = '01/'
_MATERIAS = 'matematica/'
_CONTEXT = 'wp-content/uploads/' + _ANO + _MES
_URL = 'http://www.desconversa.com.br/' + _MATERIAS + _CONTEXT

# Fetch the directory listing and collect every link that points to a PDF.
r = requests.get(_URL)
soup = bs(r.text, 'html.parser')
urls = []
names = []
for link in soup.find_all('a'):
    href = link.get('href')
    # Skip anchors without an href and anything that is not a PDF.
    if href and href.endswith('.pdf'):
        urls.append(_URL + href)
        names.append(href)

names_urls = zip(names, urls)

# Create the output directory, then save each file under its original name.
os.makedirs('pdfs', exist_ok=True)
for name, url in names_urls:
    print(url)
    with urlopen(url) as res, open(os.path.join('pdfs', name), 'wb') as pdf:
        pdf.write(res.read())
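
Building each download URL as `_URL + href` assumes every href is a bare filename relative to the listing page. A minimal sketch of a more tolerant variant, using only the standard library, might look like the following: `urljoin` resolves relative, root-relative, and absolute hrefs against the page URL, and the saved filename is taken from the last segment of the URL path. The helper names `pdf_targets` and `download` are hypothetical and not part of the original answer.

```python
import os
import posixpath
import shutil
from urllib.parse import unquote, urljoin, urlparse
from urllib.request import urlopen


def pdf_targets(page_url, hrefs):
    """Return (filename, absolute_url) pairs for the hrefs that point to PDFs."""
    targets = []
    for href in hrefs:
        if not href:
            continue  # <a> tags without an href attribute yield None
        absolute = urljoin(page_url, href)
        # Take the name from the URL path so that query strings are ignored,
        # and decode percent-escapes before stripping directory components.
        filename = posixpath.basename(unquote(urlparse(absolute).path))
        if filename.lower().endswith('.pdf'):
            targets.append((filename, absolute))
    return targets


def download(filename, url, directory='pdfs'):
    """Stream one file to directory/filename and return the saved path."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, filename)
    with urlopen(url) as response, open(path, 'wb') as out:
        shutil.copyfileobj(response, out)  # copies in chunks, not all at once
    return path
```

With the `soup` and `_URL` objects from the answer, this could be used as `for name, url in pdf_targets(_URL, [a.get('href') for a in soup.find_all('a')]): download(name, url)`.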

Answer 2

Answered by Balzer82

It might be easier with wget, because then you have the full power of wget (user agent, following links, ignoring robots.txt ...) if you need it:

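import os

# names and urls are the lists built in Answer 1.
names_urls = zip(names, urls)

for name, url in names_urls:
    print('Downloading %s' % url)
    # Hand the download to the external wget program.
    os.system('wget %s' % url)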
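
Note that `os.system('wget %s' % url)` passes the URL through a shell, so characters such as `&` or `;` inside a URL would be interpreted by the shell rather than by wget. A small sketch of the same idea with `subprocess.run` and an argument list, assuming wget is installed and on the PATH, could look like this. The helper names and the `pdfs` default directory are illustrative assumptions, not part of the original answer.

```python
import subprocess


def wget_command(url, directory='pdfs'):
    """Build the wget argument list; -P sets the directory files are saved to."""
    return ['wget', '-P', directory, url]


def fetch_with_wget(url, directory='pdfs'):
    # No shell is involved, so the URL reaches wget as a single argument.
    # check=True raises CalledProcessError if wget exits with a non-zero status.
    subprocess.run(wget_command(url, directory), check=True)
```

Inside the loop above, `fetch_with_wget(url)` would take the place of the `os.system` call.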
