Python: How to read a tar.gz file using pandas read_csv with the gzip compression option?

Question

Asked by Geet

I have a very simple csv with the following data, compressed inside a tar.gz file. I need to read it into a DataFrame using pandas.read_csv.

   A  B
0  1  4
1  2  5
2  3  6

import pandas as pd
pd.read_csv("sample.tar.gz",compression='gzip')

However, I get this error:

CParserError: Error tokenizing data. C error: Expected 1 fields in line 440, saw 2

Here are the read_csv calls I tried and the error each one produces:

pd.read_csv("sample.tar.gz",compression='gzip',  engine='python')
Error: line contains NULL byte

pd.read_csv("sample.tar.gz",compression='gzip', header=0)
CParserError: Error tokenizing data. C error: Expected 1 fields in line 440, saw 2

pd.read_csv("sample.tar.gz",compression='gzip', header=0, sep=" ")
CParserError: Error tokenizing data. C error: Expected 2 fields in line 94, saw 14    

pd.read_csv("sample.tar.gz",compression='gzip', header=0, sep=" ", engine='python')
Error: line contains NULL byte

What's going wrong here? How can I fix this?

Answer 1

Answered by Marlon Abeykoon

df = pd.read_csv('sample.tar.gz', compression='gzip', header=0, sep=' ', quotechar='"', error_bad_lines=False)

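Note: error_bad_lines=False will skip the offending rows instead of raising an error. (The parameter was deprecated in pandas 1.3 in favor of on_bad_lines='skip' and removed in pandas 2.0.)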
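
A side note on why rows look "bad" in the first place: a .tar.gz is a tar archive wrapped in gzip, and compression='gzip' only removes the gzip layer, so the parser is handed the raw tar stream rather than the csv. The following is a minimal sketch, not part of the original answer; the in-memory archive and the member name sample.csv are assumptions made for illustration.

```python
import gzip
import io
import tarfile

# Build a tiny tar.gz in memory (hypothetical stand-in for the asker's file).
csv_bytes = b"A B\n1 4\n2 5\n3 6\n"
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    info = tarfile.TarInfo(name="sample.csv")
    info.size = len(csv_bytes)
    tar.addfile(info, io.BytesIO(csv_bytes))

# Undo only the gzip layer; what remains is still a tar archive.
raw = gzip.decompress(buf.getvalue())

print(raw[:10])        # the stream begins with the tar header (member name), not the csv header
print(b"\x00" in raw)  # tar headers and padding contain NUL bytes
print(len(raw) % 512)  # tar data is laid out in 512-byte blocks
```

The tar header and its NUL padding are consistent with the "line contains NULL byte" and field-count errors in the question, so skipping bad lines is likely to hide the symptom rather than read the csv cleanly.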

Answer 2

Answered by user3780389

You can use the tarfile module to read a particular file from the tar.gz archive (as discussed in this resolved issue). If there is only one file in the archive, then you can do this:

import tarfile
import pandas as pd
with tarfile.open("sample.tar.gz", "r:*") as tar:
    csv_path = tar.getnames()[0]
    df = pd.read_csv(tar.extractfile(csv_path), header=0, sep=" ")

The read mode r:* detects the compression transparently, so it handles gzip (or other kinds of compression) appropriately. If there are multiple files in the archive, you could use something like csv_path = [n for n in tar.getnames() if n.endswith('.csv')][-1] to get the last csv file in the archive.
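
The member selection described above can be sketched as a small helper. This is only an illustration under stated assumptions: the function name read_csv_member, the suffix parameter, and the demo archive contents are hypothetical and not from the original answer. It uses only the standard library so it can be run without pandas.

```python
import io
import os
import tarfile
import tempfile

def read_csv_member(archive_path, suffix=".csv"):
    """Return the bytes of the last regular file in the archive whose name ends with suffix."""
    with tarfile.open(archive_path, "r:*") as tar:
        # Keep regular files only, so directories and links are never picked.
        members = [m for m in tar.getmembers()
                   if m.isfile() and m.name.endswith(suffix)]
        if not members:
            raise FileNotFoundError(f"no member ending with {suffix!r} in {archive_path}")
        return tar.extractfile(members[-1]).read()

# Demo: build a small archive with one non-csv file and one csv file.
tmp_dir = tempfile.mkdtemp()
archive_path = os.path.join(tmp_dir, "sample.tar.gz")
with tarfile.open(archive_path, "w:gz") as tar:
    for name, payload in [("notes.txt", b"not a csv\n"),
                          ("data/sample.csv", b"A B\n1 4\n2 5\n3 6\n")]:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

csv_data = read_csv_member(archive_path)
print(csv_data.decode())
```

The returned bytes can then be handed to pandas, for example pd.read_csv(io.BytesIO(csv_data), header=0, sep=" "). Reading the member fully into memory keeps the helper simple; for large files, passing the object returned by tar.extractfile directly to read_csv inside the with block, as in the answer, avoids the extra copy.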
