Read a zipped file as a pandas DataFrame
Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors on Stack Overflow.
Original question: http://stackoverflow.com/questions/18885175/
Asked by user2793667
I'm trying to unzip a csv file and pass it into pandas so I can work on the file.
The code I have tried so far is:
# Python 2 code (the standalone StringIO module does not exist in Python 3)
import requests, zipfile, StringIO
import pandas
r = requests.get('http://data.octo.dc.gov/feeds/crime_incidents/archive/crime_incidents_2013_CSV.zip')
z = zipfile.ZipFile(StringIO.StringIO(r.content))
crime2013 = pandas.read_csv(z.read('crime_incidents_2013_CSV.csv'))
After the last line, although Python is able to fetch the file, the error message ends with "does not exist".
Can someone tell me what I'm doing incorrectly?
Answered by Andy Hayden
I think you want to open the member of the ZipFile, which returns a file-like object, rather than read it, which returns the file's contents as a string that read_csv then treats as a path:
In [11]: crime2013 = pd.read_csv(z.open('crime_incidents_2013_CSV.csv'))
In [12]: crime2013
Out[12]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 24567 entries, 0 to 24566
Data columns (total 15 columns):
CCN 24567 non-null values
REPORTDATETIME 24567 non-null values
SHIFT 24567 non-null values
OFFENSE 24567 non-null values
METHOD 24567 non-null values
LASTMODIFIEDDATE 24567 non-null values
BLOCKSITEADDRESS 24567 non-null values
BLOCKXCOORD 24567 non-null values
BLOCKYCOORD 24567 non-null values
WARD 24563 non-null values
ANC 24567 non-null values
DISTRICT 24567 non-null values
PSA 24567 non-null values
NEIGHBORHOODCLUSTER 24263 non-null values
BUSINESSIMPROVEMENTDISTRICT 3613 non-null values
dtypes: float64(4), int64(1), object(10)
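The difference between read and open can be sketched in Python 3 as follows. This is a minimal illustration rather than the original poster's code: the archive is built in memory to stand in for r.content, and the member name example.csv and its rows are made up.

```python
import io
import zipfile

import pandas as pd

# Build a small ZIP archive in memory; this stands in for `r.content`
# from the question (the member name and rows are hypothetical).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("example.csv", "CCN,SHIFT\n1,DAY\n2,EVENING\n")
zip_bytes = buf.getvalue()

with zipfile.ZipFile(io.BytesIO(zip_bytes)) as z:
    # read() returns the decompressed member as bytes, not a file object.
    raw = z.read("example.csv")
    # open() returns a file-like object that read_csv can consume directly.
    with z.open("example.csv") as f:
        df = pd.read_csv(f)

# The bytes from read() can still be used if they are wrapped in a buffer.
df_from_bytes = pd.read_csv(io.BytesIO(raw))

print(type(raw).__name__)
print(df.shape)
```

In Python 3 the in-memory wrapper is io.BytesIO, because the response body is bytes rather than text.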
Answered by Suchit
If you want to read a zipped or a tar.gz file into a pandas DataFrame, the read_csv method can handle the decompression itself.
df = pd.read_csv('filename.zip')
Or the long form:
df = pd.read_csv('filename.zip', compression='zip', header=0, sep=',', quotechar='"')
Description of the compression argument from the docs:
compression: {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'
For on-the-fly decompression of on-disk data. If 'infer' and filepath_or_buffer is path-like, then detect compression from the following extensions: '.gz', '.bz2', '.zip', or '.xz' (otherwise no decompression). If using 'zip', the ZIP file must contain only one data file to be read in. Set to None for no decompression.
New in version 0.18.1: support for ‘zip' and ‘xz' compression.
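The behaviour described in the docs excerpt can be tried out with a sketch like the one below. It assumes a reasonably recent pandas; the archive names, member names and rows are made up, and the files live in a temporary directory so nothing is left behind.

```python
import os
import tempfile
import zipfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Archive with exactly one member: read_csv can decompress it on the fly.
    single = os.path.join(tmp, "single.zip")
    with zipfile.ZipFile(single, "w") as zf:
        zf.writestr("data.csv", "a,b\n1,2\n3,4\n")

    inferred = pd.read_csv(single)                     # compression inferred from ".zip"
    explicit = pd.read_csv(single, compression="zip")  # same thing, stated explicitly

    # Archive with two members: read_csv cannot tell which one to load.
    multi = os.path.join(tmp, "multi.zip")
    with zipfile.ZipFile(multi, "w") as zf:
        zf.writestr("first.csv", "a,b\n1,2\n")
        zf.writestr("second.csv", "a,b\n3,4\n")

    try:
        pd.read_csv(multi)
        multi_error = None
    except ValueError as exc:
        multi_error = str(exc)

print(inferred.equals(explicit))
print(multi_error)
```

For archives with more than one member, opening the archive with zipfile and passing a single member to read_csv is the usual way around the one-file restriction.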
Answered by imanzabet
For zip files, you can use the zipfile module, and your code will work with just these lines:
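import zipfile
import pandas as pd

with zipfile.ZipFile("Crime_Incidents_in_2013.zip") as z:
    with z.open("Crime_Incidents_in_2013.csv") as f:
        # delimiter="\t" is kept from the original answer; the output suggests
        # the file is comma-separated, so each row lands in a single column
        train = pd.read_csv(f, header=0, delimiter="\t")
        print(train.head())  # print the first 5 rows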
And the result will be:
X,Y,CCN,REPORT_DAT,SHIFT,METHOD,OFFENSE,BLOCK,XBLOCK,YBLOCK,WARD,ANC,DISTRICT,PSA,NEIGHBORHOOD_CLUSTER,BLOCK_GROUP,CENSUS_TRACT,VOTING_PRECINCT,XCOORD,YCOORD,LATITUDE,LONGITUDE,BID,START_DATE,END_DATE,OBJECTID
0 -77.054968548763071,38.899775938598317,0925135...
1 -76.967309569035052,38.872119553647011,1003352...
2 -76.996184958456539,38.927921847721443,1101010...
3 -76.943077541353617,38.883686046653935,1104551...
4 -76.939209158039446,38.892278093281632,1125028...
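The output above shows every row packed into one column, which is what a tab delimiter produces on comma-separated data. The sketch below illustrates that effect and also picks the CSV member by extension with namelist. The archive is built in memory, and the member names and rows are made up for illustration.

```python
import io
import zipfile

import pandas as pd

# Hypothetical archive holding a README next to the CSV (names and rows are made up).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("README.txt", "not tabular data\n")
    zf.writestr("incidents.csv", "X,Y,CCN\n-77.05,38.89,1\n-76.96,38.87,2\n")
buf.seek(0)

with zipfile.ZipFile(buf) as z:
    # Pick the CSV member by extension instead of hard-coding its name.
    csv_name = next(n for n in z.namelist() if n.endswith(".csv"))

    with z.open(csv_name) as f:
        by_tab = pd.read_csv(f, delimiter="\t")   # wrong separator: one wide column
    with z.open(csv_name) as f:
        by_comma = pd.read_csv(f, delimiter=",")  # matches the data: three columns

print(list(by_tab.columns))
print(list(by_comma.columns))
```

The member is opened twice because each read_csv call consumes the file object it is given.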
Answered by TDS
It seems you don't even have to specify the compression any more. The following snippet loads the data from filename.zip into df.
import pandas as pd
df = pd.read_csv('filename.zip')
(Of course you will need to specify separator, header, etc. if they are different from the defaults.)
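As a rough sketch of the non-default case, the example below reads a zipped file that uses semicolons and has no header row. The file contents and the column names id and shift are assumptions made up for illustration; the archive is written to a temporary directory first so that the snippet runs on its own.

```python
import os
import tempfile
import zipfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "filename.zip")
    with zipfile.ZipFile(path, "w") as zf:
        # Semicolon-separated rows with no header line (made-up data).
        zf.writestr("data.csv", "1;DAY\n2;EVENING\n")

    # Compression is inferred from the ".zip" extension; only the
    # parsing options that differ from the defaults are passed.
    df = pd.read_csv(path, sep=";", header=None, names=["id", "shift"])

print(df)
```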