Downloading files from an http server in Python

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/4589241/
Asked by cellular
Using urllib2, we can get the http response from a web server. If that server simply holds a list of files, we could parse through the files and download each individually. However, I'm not sure what the easiest, most pythonic way to parse through the files would be.
When you get the whole HTTP response for a generic file-server listing through urllib2's urlopen() method, how can we neatly download each file?
Answered by Alex Vidal
Can you guarantee that the URL you're requesting is a directory listing? If so, can you guarantee the format of the directory listing?
If so, you could use lxml to parse the returned document and find all of the elements that hold the path to a file, then iterate over those elements and download each file.
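For instance, here is a minimal sketch along those lines, assuming an Apache/IIS-style index page whose anchor elements link directly to the files (the listing URL and the href filtering below are illustrative assumptions):

import urllib2
import urlparse

from lxml import html

listing_url = 'http://server.com/files/'  # hypothetical directory-listing URL

# Fetch the listing and parse it into an element tree.
page = urllib2.urlopen(listing_url).read()
doc = html.fromstring(page)

# Take the href of every <a> element, skipping sort links, absolute links and sub-directories.
for href in doc.xpath('//a/@href'):
    if href.startswith(('?', '/')) or href.endswith('/'):
        continue
    file_url = urlparse.urljoin(listing_url, href)
    print 'Downloading ' + file_url
    data = urllib2.urlopen(file_url).read()
    with open(href.split('/')[-1], 'wb') as handle:
        handle.write(data)

The filtering step matters because directory-listing pages usually also contain sort links and a parent-directory entry that you do not want to download.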
Answered by Blender
Here's an untested solution:
import urllib2

# file.txt is assumed to hold one URL per line.
response = urllib2.urlopen('http://server.com/file.txt')
urls = response.read().replace('\r', '').split('\n')

for url in urls:
    if not url:
        continue
    print 'Downloading ' + url
    response = urllib2.urlopen(url)
    # Save under the last path component of the URL, in binary mode.
    handle = open(url.split('/')[-1], 'wb')
    handle.write(response.read())
    handle.close()
It's untested, and it probably won't work. This is assuming you have an actual list of files inside of another file. Good luck!
Answered by cgohlke
Urllib2 might be OK to retrieve the list of files. For downloading large amounts of binary files, PycURL (http://pycurl.sourceforge.net/) is a better choice. This works for my IIS-based file server:
import re
import urllib2
import pycurl

url = "http://server.domain/"
path = "path/"
# Match the anchors in the IIS directory listing and capture the filenames.
pattern = '<A HREF="/%s.*?">(.*?)</A>' % path
response = urllib2.urlopen(url + path).read()
for filename in re.findall(pattern, response):
    fp = open(filename, "wb")
    curl = pycurl.Curl()
    curl.setopt(pycurl.URL, url + path + filename)
    # Stream the response body directly into the open file.
    curl.setopt(pycurl.WRITEDATA, fp)
    curl.perform()
    curl.close()
    fp.close()
Answered by Hugh Bothwell
1. Download the index file.
   If it's really huge, it may be worth reading a chunk at a time; otherwise it's probably easier to just grab the whole thing into memory.

2. Extract the list of files to get.
   If the list is XML or HTML, use a proper parser; else if there is much string processing to do, use regex; else use simple string methods.
   Again, you can parse it all at once or incrementally. Incrementally is somewhat more efficient and elegant, but unless you are processing multiple tens of thousands of lines it's probably not critical.

3. For each file, download it and save it to a file.
   If you want to try to speed things up, you could try running multiple download threads (a rough sketch follows below); another (significantly faster) approach might be to delegate the work to a dedicated downloader program like Aria2 (http://aria2.sourceforge.net/). Note that Aria2 can be run as a service and controlled via XMLRPC; see http://sourceforge.net/apps/trac/aria2/wiki/XmlrpcInterface#InteractWitharia2UsingPython
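A rough sketch of the threaded variant of steps 1-3 above might look like this (the index URL, its one-URL-per-line format, and the thread count are all assumptions):

import urllib2
import urlparse
from Queue import Queue
from threading import Thread

index_url = 'http://server.com/index.txt'  # hypothetical index file, one URL per line
queue = Queue()

def worker():
    # Each worker repeatedly takes a URL off the queue and saves it to disk.
    while True:
        file_url = queue.get()
        try:
            data = urllib2.urlopen(file_url).read()
            with open(file_url.split('/')[-1], 'wb') as handle:
                handle.write(data)
        finally:
            queue.task_done()

for _ in range(4):  # a handful of download threads
    thread = Thread(target=worker)
    thread.daemon = True
    thread.start()

# Steps 1 and 2: download the index and extract the list of files.
for line in urllib2.urlopen(index_url).read().splitlines():
    if line.strip():
        queue.put(urlparse.urljoin(index_url, line.strip()))

queue.join()  # wait until every queued download has finished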
Answered by Sri Raghavan
My suggestion would be to use BeautifulSoup (which is an HTML/XML parser) to parse the page for a list of files. Then, pycURL would definitely come in handy.
Another method, after you've got the list of files, is to use urllib.urlretrieve in a way similar to wget in order to simply download the file to a location on your filesystem.
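A small sketch combining the two suggestions, assuming a BeautifulSoup 3 install and an index page whose anchors link straight to the files (the URL and the href filtering are illustrative):

import urllib
import urllib2
import urlparse

from BeautifulSoup import BeautifulSoup  # for bs4, use: from bs4 import BeautifulSoup

listing_url = 'http://server.com/files/'  # hypothetical listing page

soup = BeautifulSoup(urllib2.urlopen(listing_url).read())
for anchor in soup.findAll('a', href=True):
    href = anchor['href']
    if href.endswith('/') or href.startswith('?'):  # skip sub-directories and sort links
        continue
    file_url = urlparse.urljoin(listing_url, href)
    # urlretrieve saves straight to a local path, much like wget.
    urllib.urlretrieve(file_url, href.split('/')[-1])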
Answered by Sri Raghavan
You can use urllib.urlretrieve (in Python 3.x: urllib.request.urlretrieve):
import urllib
urllib.urlretrieve('http://site.com/', filename='filez.txt')
This should work :)
and here is a function that can do the same thing (using urllib):
def download(url):
    webFile = urllib.urlopen(url)
    # Save under the last component of the URL, in binary mode.
    localFile = open(url.split('/')[-1], 'wb')
    localFile.write(webFile.read())
    webFile.close()
    localFile.close()
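For completeness, the Python 3 spelling of the same helper would look roughly like this (a sketch; the example URL is made up):

import urllib.request

def download(url):
    # urlretrieve moved into urllib.request in Python 3.
    local_name = url.split('/')[-1] or 'index.html'
    urllib.request.urlretrieve(url, local_name)

download('http://site.com/file.txt')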
Answered by Mark Irkzher
This is a non-conventional way, but it works:
# Using an existing pycurl handle (self.curl), stream the response straight into a file:
fPointer = open(picName, 'wb')
self.curl.setopt(self.curl.WRITEFUNCTION, fPointer.write)

# The simpler, correct way:
urllib.urlretrieve(link, picName)

