Original question: http://stackoverflow.com/questions/40744027/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Using pandas read_csv with zip compression
Asked by itzy
I'm trying to use read_csv in pandas to read a zipped file from an FTP server. The zip file contains just one file, as is required.
Here's my code:
pd.read_csv('ftp://ftp.fec.gov/FEC/2016/cn16.zip', compression='zip')
I get this error:
AttributeError: addinfourl instance has no attribute 'seek'
I get this error in both pandas 0.18.1 and 0.19.0. Am I missing something, or could this be a bug?
Accepted answer by PyNoob
Although I'm not completely sure why you get the error, you can work around it by opening the URL with urllib2 and reading the data into an in-memory binary stream (io.BytesIO), as shown below. In addition, we have to specify the correct separator, or else we would receive another error.
import io
import urllib2 as urllib
import pandas as pd
r = urllib.urlopen('ftp://ftp.fec.gov/FEC/2016/cn16.zip')
df = pd.read_csv(io.BytesIO(r.read()), compression='zip', sep='|', header=None)
As for the error itself, I think pandas tries to call seek on the "zip file" before the URL contents have been downloaded (so it is not really a zip file yet), which results in that error.
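The same workaround can be sketched for Python 3, where urllib2 no longer exists. This is a minimal sketch, not the answerer's code: read_zipped_csv and make_sample_zip are hypothetical helpers, and the sample rows are made up so that the example runs without network access. The key point is that io.BytesIO gives pandas a seekable object.

```python
import io
import zipfile

import pandas as pd


def read_zipped_csv(raw_bytes, **read_csv_kwargs):
    """Parse a single-file zip archive held in memory (hypothetical helper)."""
    # BytesIO supports seek(), which the zip reader needs.
    buffer = io.BytesIO(raw_bytes)
    return pd.read_csv(buffer, compression="zip", **read_csv_kwargs)


def make_sample_zip():
    """Build a small pipe-delimited archive in memory (made-up sample data)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("sample.txt", "H6ID|SMITH, JOHN|DEM\nS8ID|DOE, JANE|REP\n")
    return buffer.getvalue()


df = read_zipped_csv(make_sample_zip(), sep="|", header=None)
print(df.shape)
```

With a real URL in Python 3, the bytes could come from urllib.request.urlopen(url).read() instead of make_sample_zip().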
Answer by Vlad Bezden
pandas now supports loading data directly from zip or other compressed files into a DataFrame.
compression : {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'
For on-the-fly decompression of on-disk data. If 'infer' and filepath_or_buffer is path-like, then detect compression from the following extensions: '.gz', '.bz2', '.zip', or '.xz' (otherwise no decompression). If using 'zip', the ZIP file must contain only one data file to be read in. Set to None for no decompression.
New in version 0.18.1: support for 'zip' and 'xz' compression.
import pandas as pd
df = pd.read_csv("path_to_file.zip")
# or
df = pd.read_csv("path_to_file.zip", compression="zip")
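To check that the inferred and explicit forms behave the same, here is a small self-contained sketch. The file name example.zip and its contents are made up for illustration; the archive is written to a temporary directory so that the path ends in .zip and the 'infer' default can detect it.

```python
import os
import tempfile
import zipfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical archive holding exactly one CSV, as compression='zip' requires.
    zip_path = os.path.join(tmp_dir, "example.zip")
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("example.csv", "a,b\n1,2\n3,4\n")

    # compression defaults to 'infer', which picks 'zip' from the extension.
    inferred = pd.read_csv(zip_path)
    # Same result with the compression stated explicitly.
    explicit = pd.read_csv(zip_path, compression="zip")

print(inferred.equals(explicit))
```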
Answer by Vinod
import io
import zipfile
import pandas as pd
import requests

header = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:32.0) Gecko/20100101 Firefox/54.0.1',}
remotezip = requests.get(url, headers=header)  # url: address of the remote zip file
root = zipfile.ZipFile(io.BytesIO(remotezip.content))
for name in root.namelist():
    df = pd.read_csv(root.open(name))
Taken from my own blog post: Read zipped csv files in python pandas without downloading zipfile
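Note that the loop above rebinds df on every pass, so only the last member of the archive is kept. If the archive holds several CSV files, one possible variation is to collect them in a dict keyed by member name. This is a sketch under assumptions: read_all_members is a hypothetical helper, and the archive is built in memory with made-up data instead of being downloaded.

```python
import io
import zipfile

import pandas as pd


def read_all_members(zip_bytes):
    """Return {member name: DataFrame} for every file in the archive (hypothetical helper)."""
    frames = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            # archive.open() returns a file-like object that read_csv accepts.
            with archive.open(name) as member:
                frames[name] = pd.read_csv(member)
    return frames


# Made-up archive with two CSV members, standing in for remotezip.content.
payload = io.BytesIO()
with zipfile.ZipFile(payload, "w") as archive:
    archive.writestr("first.csv", "x,y\n1,2\n")
    archive.writestr("second.csv", "x,y\n3,4\n5,6\n")

frames = read_all_members(payload.getvalue())
print(sorted(frames))
```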