Original source: http://stackoverflow.com/questions/39263929/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
How can I read tar.gz file using pandas read_csv with gzip compression option?
Asked by Geet
I have a very simple CSV with the following data, compressed inside a tar.gz file. I need to read it into a DataFrame using pandas.read_csv.
A B
0 1 4
1 2 5
2 3 6
import pandas as pd
pd.read_csv("sample.tar.gz",compression='gzip')
However, I get this error:
CParserError: Error tokenizing data. C error: Expected 1 fields in line 440, saw 2
Here are the read_csv variants I tried and the error each one produces:
pd.read_csv("sample.tar.gz",compression='gzip', engine='python')
Error: line contains NULL byte
pd.read_csv("sample.tar.gz",compression='gzip', header=0)
CParserError: Error tokenizing data. C error: Expected 1 fields in line 440, saw 2
pd.read_csv("sample.tar.gz",compression='gzip', header=0, sep=" ")
CParserError: Error tokenizing data. C error: Expected 2 fields in line 94, saw 14
pd.read_csv("sample.tar.gz",compression='gzip', header=0, sep=" ", engine='python')
Error: line contains NULL byte
What's going wrong here? How can I fix this?
Answer by Marlon Abeykoon
df = pd.read_csv('sample.tar.gz', compression='gzip', header=0, sep=' ', quotechar='"', error_bad_lines=False)
Note: error_bad_lines=False will ignore the offending rows. (In pandas 1.3 and later this parameter is deprecated in favor of on_bad_lines='skip', and it was removed in pandas 2.0.)
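For context on where the offending rows come from: compression='gzip' only strips the outer gzip layer, so the parser is handed a raw tar stream (member headers and NUL padding) rather than the CSV alone. The following is a minimal stdlib-only sketch of that effect. The in-memory archive, the member name sample.csv, and the CSV contents are illustrative stand-ins, not the asker's actual file.

```python
import gzip
import io
import tarfile

# Hypothetical stand-in for the CSV inside sample.tar.gz
csv_bytes = b"A B\n1 4\n2 5\n3 6\n"

# Build a small tar.gz in memory with a single member
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    info = tarfile.TarInfo(name="sample.csv")
    info.size = len(csv_bytes)
    tar.addfile(info, io.BytesIO(csv_bytes))

# Undo only the gzip layer, which is all that compression='gzip' does
raw_tar = gzip.decompress(buf.getvalue())

# The CSV text is in there, but surrounded by tar headers and NUL padding
has_csv = csv_bytes in raw_tar
has_nul = b"\x00" in raw_tar
print("CSV bytes present:", has_csv)
print("NUL bytes present:", has_nul)
print("decompressed size vs CSV size:", len(raw_tar), len(csv_bytes))
```

Skipping bad lines can therefore hide the symptom, but the tar framing is still being parsed as if it were data.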
Answer by user3780389
You can use the tarfile module to read a particular file from the tar.gz archive (as discussed in this resolved issue). If there is only one file in the archive, then you can do this:
import tarfile
import pandas as pd
with tarfile.open("sample.tar.gz", "r:*") as tar:
    csv_path = tar.getnames()[0]
    df = pd.read_csv(tar.extractfile(csv_path), header=0, sep=" ")
The read mode r:* detects the compression transparently, so it handles gzip (or other kinds of compression) appropriately. If there are multiple files in the compressed tar file, then you could use something like csv_path = list(n for n in tar.getnames() if n.endswith('.csv'))[-1] to get the last csv file in the archive.
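When the archive holds several CSV files, the same idea extends to loading each one. Below is a minimal sketch, not a definitive implementation: read_csv_members is a hypothetical helper name, and the temporary archive, its member name, and its contents are made up so the example runs on its own.

```python
import io
import os
import tarfile
import tempfile

import pandas as pd


def read_csv_members(archive_path, **read_csv_kwargs):
    """Return {member name: DataFrame} for every regular .csv file in a tar archive.

    Hypothetical helper; extra keyword arguments are passed through to pd.read_csv.
    """
    frames = {}
    # "r:*" lets tarfile detect the compression (gzip, bz2, ...) by itself
    with tarfile.open(archive_path, "r:*") as tar:
        for member in tar.getmembers():
            # Skip directories and anything that is not a .csv file
            if member.isfile() and member.name.endswith(".csv"):
                frames[member.name] = pd.read_csv(
                    tar.extractfile(member), **read_csv_kwargs
                )
    return frames


# Build a throwaway archive so the sketch is self-contained (illustrative data)
tmp_dir = tempfile.mkdtemp()
archive_path = os.path.join(tmp_dir, "sample.tar.gz")
csv_bytes = b"A B\n1 4\n2 5\n3 6\n"
with tarfile.open(archive_path, "w:gz") as tar:
    info = tarfile.TarInfo(name="sample.csv")
    info.size = len(csv_bytes)
    tar.addfile(info, io.BytesIO(csv_bytes))

frames = read_csv_members(archive_path, sep=" ")
print(frames["sample.csv"])
```

Filtering on member.isfile() avoids picking up a directory entry, which can happen when the first name returned by getnames() is a folder rather than a file.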