Download file from Blob URL with Python
Disclaimer: this page is a translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA license and attribute it to the original authors (not me): StackOverFlow.
Original URL: http://stackoverflow.com/questions/39517522/
Asked by Winterflags
I wish to have my Python script download the Master data (Download, XLSX) Excel file from this Frankfurt stock exchange webpage.
When retrieving it with urllib and wget, it turns out that the URL leads to a Blob and the file downloaded is only 289 bytes and unreadable.
I'm entirely unfamiliar with Blobs and have these questions:
Can the file "behind the Blob" be successfully retrieved using Python?
If so, is it necessary to uncover the "true" URL behind the Blob – if there is such a thing – and how? My concern here is that the link above won't be static but will actually change often.
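Before assuming the Blob itself is the problem, it can help to check what the server actually sends back for that URL. A minimal check with the standard library (using the blob URL quoted in the accepted answer below, so purely an illustration) might look like this:
# check what the server actually returns for the blob URL
import urllib.request
import urllib.error
url = 'http://www.xetra.com/blob/1193366/b2f210876702b8e08e40b8ecb769a02e/data/All-tradable-ETFs-ETCs-and-ETNs.xlsx'
try:
    resp = urllib.request.urlopen(url)
    body = resp.read()
    print(resp.getcode(), resp.headers.get('Content-Type'), len(body))
except urllib.error.HTTPError as e:
    # a 403 here would explain the tiny, unreadable download
    print(e.code, e.read()[:200])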
Accepted answer by Jeon
That 289-byte thing is probably the HTML of a 403 Forbidden page. This happens because the server is smart and rejects requests whose code does not specify a user agent.
Python 3
# python3
import urllib.request as request
url = 'http://www.xetra.com/blob/1193366/b2f210876702b8e08e40b8ecb769a02e/data/All-tradable-ETFs-ETCs-and-ETNs.xlsx'
# fake user agent of Safari
fake_useragent = 'Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25'
r = request.Request(url, headers={'User-Agent': fake_useragent})
f = request.urlopen(r)
# print or write
print(f.read())
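Since the response body is binary XLSX data, printing it is only useful as a sanity check; to keep the spreadsheet, write the bytes to a file instead. A minimal variation of the snippet above (the local filename is arbitrary):
# same request as above, but save the spreadsheet to disk
import urllib.request as request
url = 'http://www.xetra.com/blob/1193366/b2f210876702b8e08e40b8ecb769a02e/data/All-tradable-ETFs-ETCs-and-ETNs.xlsx'
fake_useragent = 'Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25'
r = request.Request(url, headers={'User-Agent': fake_useragent})
with request.urlopen(r) as f, open('All-tradable-ETFs-ETCs-and-ETNs.xlsx', 'wb') as out:
    out.write(f.read())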
Python 2
# python2
import urllib2
url = 'http://www.xetra.com/blob/1193366/b2f210876702b8e08e40b8ecb769a02e/data/All-tradable-ETFs-ETCs-and-ETNs.xlsx'
# fake user agent of safari
fake_useragent = 'Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25'
r = urllib2.Request(url, headers={'User-Agent': fake_useragent})
f = urllib2.urlopen(r)
print(f.read())
Answer by kiviak
from bs4 import BeautifulSoup
import requests
import re

url = 'http://www.xetra.com/xetra-en/instruments/etf-exchange-traded-funds/list-of-tradable-etfs'
html = requests.get(url)
page = BeautifulSoup(html.content, 'html.parser')
reg = re.compile('Master data')
find = page.find('span', text=reg)  # find the "Master data" download link
file_url = 'http://www.xetra.com' + find.parent['href']
file = requests.get(file_url)
with open(r'C:\Users\user\Downloads\file.xlsx', 'wb') as ff:
    ff.write(file.content)
I recommend requests and BeautifulSoup; both are good libraries.
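This approach scrapes the current download link from the listing page, so it keeps working even if the blob URL changes. If the server also rejects the default requests user agent, the User-Agent header from the accepted answer can be passed to both calls; a minimal sketch combining the two answers (the output filename is chosen arbitrarily):
# scrape the current link and send an explicit User-Agent with both requests
from bs4 import BeautifulSoup
import requests
import re
headers = {'User-Agent': 'Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25'}
listing = 'http://www.xetra.com/xetra-en/instruments/etf-exchange-traded-funds/list-of-tradable-etfs'
page = BeautifulSoup(requests.get(listing, headers=headers).content, 'html.parser')
link = page.find('span', text=re.compile('Master data')).parent['href']
data = requests.get('http://www.xetra.com' + link, headers=headers)
with open('All-tradable-ETFs-ETCs-and-ETNs.xlsx', 'wb') as out:
    out.write(data.content)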