Python: read a zipped file as a pandas DataFrame

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/18885175/

Read a zipped file as a pandas DataFrame

Tags: python, zip, pandas

Asked by user2793667

I'm trying to unzip a csv file and pass it into pandas so I can work on the file.
The code I have tried so far is:

import requests, zipfile, StringIO
r = requests.get('http://data.octo.dc.gov/feeds/crime_incidents/archive/crime_incidents_2013_CSV.zip')
z = zipfile.ZipFile(StringIO.StringIO(r.content))
crime2013 = pandas.read_csv(z.read('crime_incidents_2013_CSV.csv'))

After the last line, although Python is able to download the file, the error message ends with "does not exist".

Can someone tell me what I'm doing incorrectly?

Answered by Andy Hayden

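I think you want to open the file inside the ZipFile, which returns a file-like object, rather than read it. read returns the file's raw contents, and read_csv then treats that string as a file path, hence the "does not exist" error: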

In [11]: crime2013 = pd.read_csv(z.open('crime_incidents_2013_CSV.csv'))

In [12]: crime2013
Out[12]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 24567 entries, 0 to 24566
Data columns (total 15 columns):
CCN                            24567  non-null values
REPORTDATETIME                 24567  non-null values
SHIFT                          24567  non-null values
OFFENSE                        24567  non-null values
METHOD                         24567  non-null values
LASTMODIFIEDDATE               24567  non-null values
BLOCKSITEADDRESS               24567  non-null values
BLOCKXCOORD                    24567  non-null values
BLOCKYCOORD                    24567  non-null values
WARD                           24563  non-null values
ANC                            24567  non-null values
DISTRICT                       24567  non-null values
PSA                            24567  non-null values
NEIGHBORHOODCLUSTER            24263  non-null values
BUSINESSIMPROVEMENTDISTRICT    3613  non-null values
dtypes: float64(4), int64(1), object(10)
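
The question's code targets Python 2 (the StringIO module). As a minimal Python 3 sketch of the same idea, the downloaded bytes can be wrapped in io.BytesIO before opening the archive. To keep the example runnable offline, it builds a tiny archive in memory instead of downloading one; the member name incidents.csv and its contents are made up for illustration.

```python
import io
import zipfile

import pandas as pd

# Build a small ZIP archive in memory so the sketch runs without network access.
# The member name and the CSV contents are hypothetical.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("incidents.csv", "CCN,OFFENSE\n1,THEFT\n2,ROBBERY\n")
buffer.seek(0)

# With requests, the same buffer would come from io.BytesIO(r.content).
with zipfile.ZipFile(buffer) as z:
    # z.open returns a file-like object that read_csv can consume directly.
    with z.open("incidents.csv") as f:
        df = pd.read_csv(f)

print(df.shape)
```

Passing z.read(...) instead would hand read_csv the raw contents rather than a file-like object, which is the mistake in the question.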

Answered by Suchit

If you want to read a zipped or gzipped (.gz) file into a pandas DataFrame, the read_csv method supports this directly.

df = pd.read_csv('filename.zip')

Or the long form:

df = pd.read_csv('filename.zip', compression='zip', header=0, sep=',', quotechar='"')

Description of the compression argument from the docs:

compression: {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'. For on-the-fly decompression of on-disk data. If 'infer' and filepath_or_buffer is path-like, then detect compression from the following extensions: '.gz', '.bz2', '.zip', or '.xz' (otherwise no decompression). If using 'zip', the ZIP file must contain only one data file to be read in. Set to None for no decompression.

New in version 0.18.1: support for 'zip' and 'xz' compression.
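
As a rough illustration of the compression argument described above, the sketch below writes a single-file ZIP archive to a temporary directory and reads it back twice, once relying on the default 'infer' and once passing 'zip' explicitly. The file names and CSV contents are made up for the example.

```python
import os
import tempfile
import zipfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical archive holding exactly one CSV member.
    path = os.path.join(tmp, "filename.zip")
    with zipfile.ZipFile(path, "w") as archive:
        archive.writestr("filename.csv", "a,b\n1,2\n3,4\n")

    # Default compression='infer' detects ZIP from the .zip extension.
    inferred = pd.read_csv(path)
    # Spelling the compression out gives the same result.
    explicit = pd.read_csv(path, compression="zip")

print(inferred.equals(explicit))  # prints True
```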

Answered by imanzabet

For "zip" files, you can use import zipfile, and your code will work with just these lines:

import zipfile
import pandas as pd
with zipfile.ZipFile("Crime_Incidents_in_2013.zip") as z:
    with z.open("Crime_Incidents_in_2013.csv") as f:
        train = pd.read_csv(f, header=0, delimiter="\t")
        print(train.head())  # print the first 5 rows

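And the result will be the following. Note that every row lands in a single column, because the file is comma-separated but delimiter="\t" was passed; dropping the delimiter argument lets read_csv split on commas: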

X,Y,CCN,REPORT_DAT,SHIFT,METHOD,OFFENSE,BLOCK,XBLOCK,YBLOCK,WARD,ANC,DISTRICT,PSA,NEIGHBORHOOD_CLUSTER,BLOCK_GROUP,CENSUS_TRACT,VOTING_PRECINCT,XCOORD,YCOORD,LATITUDE,LONGITUDE,BID,START_DATE,END_DATE,OBJECTID
0  -77.054968548763071,38.899775938598317,0925135...
1  -76.967309569035052,38.872119553647011,1003352...
2  -76.996184958456539,38.927921847721443,1101010...
3  -76.943077541353617,38.883686046653935,1104551...
4  -76.939209158039446,38.892278093281632,1125028...
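
The output above shows each row as one long comma-joined string. A small sketch of why that can happen, using a made-up comma-separated sample in place of the real file: a tab delimiter never matches, so read_csv keeps each line as a single column, while the default comma delimiter splits it.

```python
import io

import pandas as pd

# Hypothetical comma-separated sample standing in for the real CSV member.
sample = "X,Y,CCN\n-77.05,38.89,1\n-76.96,38.87,2\n"

# A tab delimiter never matches, so each line becomes one wide column.
as_tabs = pd.read_csv(io.StringIO(sample), delimiter="\t")

# The default delimiter is a comma, which splits the columns as intended.
as_commas = pd.read_csv(io.StringIO(sample))

print(as_tabs.shape[1], as_commas.shape[1])  # prints: 1 3
```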

Answered by TDS

It seems you don't even have to specify the compression any more. The following snippet loads the data from filename.zip into df.

import pandas as pd
df = pd.read_csv('filename.zip')

(Of course you will need to specify separator, header, etc. if they are different from the defaults.)
