Download file from Blob URL with Python
Disclaimer: this page is a translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA license and attribute it to the original authors (not me): StackOverFlow.
Original URL: http://stackoverflow.com/questions/39517522/
Asked by Winterflags
I wish to have my Python script download the Master data (Download, XLSX) Excel file from this Frankfurt stock exchange webpage.
When retrieving it with urllib and wget, it turns out that the URL leads to a Blob and the file downloaded is only 289 bytes and unreadable.
I'm entirely unfamiliar with Blobs and have these questions:
Can the file "behind the Blob" be successfully retrieved using Python?
If so, is it necessary to uncover the "true" URL behind the Blob – if there is such a thing – and how? My concern here is that the link above won't be static but will actually change often.
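Before assuming the Blob itself is the problem, it can help to check what the server actually sends back for that URL. A minimal check with the standard library (using the blob URL quoted in the accepted answer below, so purely an illustration) might look like this:
# check what the server actually returns for the blob URL
import urllib.request
import urllib.error
url = 'http://www.xetra.com/blob/1193366/b2f210876702b8e08e40b8ecb769a02e/data/All-tradable-ETFs-ETCs-and-ETNs.xlsx'
try:
    resp = urllib.request.urlopen(url)
    body = resp.read()
    print(resp.getcode(), resp.headers.get('Content-Type'), len(body))
except urllib.error.HTTPError as e:
    # a 403 here would explain the tiny, unreadable download
    print(e.code, e.read()[:200])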
Accepted answer by Jeon
That 289-byte thing is probably the HTML of a 403 Forbidden page. This happens because the server is smart and rejects requests whose code does not specify a user agent.
Python 3
# python3
import urllib.request as request
url = 'http://www.xetra.com/blob/1193366/b2f210876702b8e08e40b8ecb769a02e/data/All-tradable-ETFs-ETCs-and-ETNs.xlsx'
# fake user agent of Safari
fake_useragent = 'Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25'
r = request.Request(url, headers={'User-Agent': fake_useragent})
f = request.urlopen(r)
# print or write
print(f.read())
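Since the response body is binary XLSX data, printing it is only useful as a sanity check; to keep the spreadsheet, write the bytes to a file instead. A minimal variation of the snippet above (the local filename is arbitrary):
# same request as above, but save the spreadsheet to disk
import urllib.request as request
url = 'http://www.xetra.com/blob/1193366/b2f210876702b8e08e40b8ecb769a02e/data/All-tradable-ETFs-ETCs-and-ETNs.xlsx'
fake_useragent = 'Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25'
r = request.Request(url, headers={'User-Agent': fake_useragent})
with request.urlopen(r) as f, open('All-tradable-ETFs-ETCs-and-ETNs.xlsx', 'wb') as out:
    out.write(f.read())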
Python 2
# python2
import urllib2
url = 'http://www.xetra.com/blob/1193366/b2f210876702b8e08e40b8ecb769a02e/data/All-tradable-ETFs-ETCs-and-ETNs.xlsx'
# fake user agent of safari
fake_useragent = 'Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25'
r = urllib2.Request(url, headers={'User-Agent': fake_useragent})
f = urllib2.urlopen(r)
print(f.read())
Answer by kiviak
from bs4 import BeautifulSoup
import requests
import re

url = 'http://www.xetra.com/xetra-en/instruments/etf-exchange-traded-funds/list-of-tradable-etfs'
html = requests.get(url)
page = BeautifulSoup(html.content, 'html.parser')
reg = re.compile('Master data')
find = page.find('span', text=reg)  # find the "Master data" download link
file_url = 'http://www.xetra.com' + find.parent['href']
file = requests.get(file_url)
with open(r'C:\Users\user\Downloads\file.xlsx', 'wb') as ff:
    ff.write(file.content)
I recommend requests and BeautifulSoup; both are good libraries.
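This approach scrapes the current download link from the listing page, so it keeps working even if the blob URL changes. If the server also rejects the default requests user agent, the User-Agent header from the accepted answer can be passed to both calls; a minimal sketch combining the two answers (the output filename is chosen arbitrarily):
# scrape the current link and send an explicit User-Agent with both requests
from bs4 import BeautifulSoup
import requests
import re
headers = {'User-Agent': 'Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25'}
listing = 'http://www.xetra.com/xetra-en/instruments/etf-exchange-traded-funds/list-of-tradable-etfs'
page = BeautifulSoup(requests.get(listing, headers=headers).content, 'html.parser')
link = page.find('span', text=re.compile('Master data')).parent['href']
data = requests.get('http://www.xetra.com' + link, headers=headers)
with open('All-tradable-ETFs-ETCs-and-ETNs.xlsx', 'wb') as out:
    out.write(data.content)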