Python: Reading a zipped file as a Pandas DataFrame

Question

Asked by user2793667

I'm trying to unzip a csv file and pass it into pandas so I can work on the file.
The code I have tried so far is:

import requests, zipfile, StringIO
r = requests.get('http://data.octo.dc.gov/feeds/crime_incidents/archive/crime_incidents_2013_CSV.zip')
z = zipfile.ZipFile(StringIO.StringIO(r.content))
crime2013 = pandas.read_csv(z.read('crime_incidents_2013_CSV.csv'))

The last line fails: although Python is able to download the file, the error message ends with "does not exist".

Can someone tell me what I'm doing incorrectly?

Answer 1

Answered by Andy Hayden

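I think you want to open the CSV inside the ZipFile with z.open, which returns a file-like object, rather than z.read, which returns the file's contents. read_csv treats a plain string as a file path, which is why the error ends with "does not exist":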

In [11]: crime2013 = pd.read_csv(z.open('crime_incidents_2013_CSV.csv'))

In [12]: crime2013
Out[12]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 24567 entries, 0 to 24566
Data columns (total 15 columns):
CCN                            24567  non-null values
REPORTDATETIME                 24567  non-null values
SHIFT                          24567  non-null values
OFFENSE                        24567  non-null values
METHOD                         24567  non-null values
LASTMODIFIEDDATE               24567  non-null values
BLOCKSITEADDRESS               24567  non-null values
BLOCKXCOORD                    24567  non-null values
BLOCKYCOORD                    24567  non-null values
WARD                           24563  non-null values
ANC                            24567  non-null values
DISTRICT                       24567  non-null values
PSA                            24567  non-null values
NEIGHBORHOODCLUSTER            24263  non-null values
BUSINESSIMPROVEMENTDISTRICT    3613  non-null values
dtypes: float64(4), int64(1), object(10)
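
For readers on Python 3, where the StringIO module no longer exists, the same idea might look roughly like the sketch below. It is not the original poster's code: the archive is built in memory so the example runs without a network connection, and the member name sample.csv and its rows are made up for illustration.

```python
import io
import zipfile

import pandas as pd

# Build a small ZIP archive in memory. These bytes stand in for r.content
# from the question; the member name and rows are invented for this sketch.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("sample.csv", "CCN,OFFENSE\n1,THEFT\n2,ROBBERY\n")
zip_bytes = buffer.getvalue()

# On Python 3, io.BytesIO takes the place of StringIO.StringIO for binary data.
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as z:
    raw = z.read("sample.csv")        # the member's full contents as bytes
    with z.open("sample.csv") as f:   # a file-like object read_csv can consume
        sample = pd.read_csv(f)

print(type(raw).__name__)
print(sample)
```

The difference is that z.read hands back the data itself, while z.open hands back something read_csv can read from.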

Answer 2

Answered by Suchit

If you want to read a zipped or a tar.gz file into a pandas DataFrame, the read_csv method already handles this for you.

df = pd.read_csv('filename.zip')

Or the long form:

df = pd.read_csv('filename.zip', compression='zip', header=0, sep=',', quotechar='"')

Description of the compression argument from the docs:

compression: {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'. For on-the-fly decompression of on-disk data. If 'infer' and filepath_or_buffer is path-like, then detect compression from the following extensions: '.gz', '.bz2', '.zip', or '.xz' (otherwise no decompression). If using 'zip', the ZIP file must contain only one data file to be read in. Set to None for no decompression.
New in version 0.18.1: support for 'zip' and 'xz' compression.
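
To make the single-file restriction concrete, here is a small sketch under stated assumptions: the archives are written to a temporary directory, and the file names (single.zip, multi.zip, data.csv and so on) are placeholders rather than anything from the question. It reads a one-member archive both ways, then shows one possible fallback, opening a member by name with zipfile, when the archive holds more than one file.

```python
import os
import tempfile
import zipfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # An archive with exactly one member: read_csv can take the path directly.
    single = os.path.join(tmp, "single.zip")
    with zipfile.ZipFile(single, "w") as archive:
        archive.writestr("data.csv", "a,b\n1,2\n3,4\n")

    inferred = pd.read_csv(single)                     # compression inferred from ".zip"
    explicit = pd.read_csv(single, compression="zip")  # same thing, spelled out

    # An archive with two members: read_csv refuses to pick one.
    multi = os.path.join(tmp, "multi.zip")
    with zipfile.ZipFile(multi, "w") as archive:
        archive.writestr("first.csv", "a,b\n1,2\n")
        archive.writestr("second.csv", "a,b\n5,6\n")

    try:
        pd.read_csv(multi)
        multi_error = None
    except ValueError as exc:
        multi_error = exc

    # Fallback: choose the member explicitly and pass the file object.
    with zipfile.ZipFile(multi) as archive:
        with archive.open("second.csv") as f:
            second = pd.read_csv(f)

print(inferred.equals(explicit))
print(multi_error)
print(second)
```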

Answer 3

Answered by imanzabet

For "zip" files, you can use import zipfile, and your code will work with just these lines:

import zipfile
import pandas as pd

with zipfile.ZipFile("Crime_Incidents_in_2013.zip") as z:
    with z.open("Crime_Incidents_in_2013.csv") as f:
        train = pd.read_csv(f, header=0, delimiter="\t")
        print(train.head())    # print the first 5 rows

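And the result will be as follows (each row appears as one unsplit string because the file is comma-separated while the code passes delimiter="\t"; use delimiter="," or omit the argument to get separate columns):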

X,Y,CCN,REPORT_DAT,SHIFT,METHOD,OFFENSE,BLOCK,XBLOCK,YBLOCK,WARD,ANC,DISTRICT,PSA,NEIGHBORHOOD_CLUSTER,BLOCK_GROUP,CENSUS_TRACT,VOTING_PRECINCT,XCOORD,YCOORD,LATITUDE,LONGITUDE,BID,START_DATE,END_DATE,OBJECTID
0  -77.054968548763071,38.899775938598317,0925135...                                                                                                                                                               
1  -76.967309569035052,38.872119553647011,1003352...                                                                                                                                                               
2  -76.996184958456539,38.927921847721443,1101010...                                                                                                                                                               
3  -76.943077541353617,38.883686046653935,1104551...                                                                                                                                                               
4  -76.939209158039446,38.892278093281632,1125028...
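
The output above has a single column whose name is the entire header line, which is what read_csv produces when the delimiter does not match the file. The sketch below reproduces the effect on a few made-up rows (not the real crime data) and compares it with a comma delimiter.

```python
import io

import pandas as pd

# Made-up, comma-separated sample text standing in for the real CSV.
csv_text = "X,Y,CCN\n-77.05,38.89,1\n-76.96,38.87,2\n"

# Tab delimiter on comma-separated text: every line becomes one field.
wrong = pd.read_csv(io.StringIO(csv_text), delimiter="\t")

# Comma delimiter (also the default): the line is split into three fields.
right = pd.read_csv(io.StringIO(csv_text), delimiter=",")

print(wrong.shape, list(wrong.columns))
print(right.shape, list(right.columns))
```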

Answer 4

Answered by TDS

It seems you don't even have to specify the compression any more. The following snippet loads the data from filename.zip into df.

import pandas as pd
df = pd.read_csv('filename.zip')

(Of course you will need to specify separator, header, etc. if they are different from the defaults.)
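
As an illustration of that last point, here is a minimal sketch assuming a zipped file that uses semicolons and has no header row. The archive is created in a temporary directory, and the column names id and label are invented for the example.

```python
import os
import tempfile
import zipfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Create a stand-in for filename.zip: semicolon-separated, no header row.
    path = os.path.join(tmp, "filename.zip")
    with zipfile.ZipFile(path, "w") as archive:
        archive.writestr("data.csv", "1;alpha\n2;beta\n")

    # Compression is inferred from the extension; only the parsing
    # options that differ from the defaults need to be given.
    df = pd.read_csv(path, sep=";", header=None, names=["id", "label"])

print(df)
```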
