pandas Python: Error tokenizing data. C error: Calling read(nbytes) on source failed with input gzip file

Question

Asked by add-semi-colons

I am using conda with Python 2.7:

python --version
Python 2.7.12 :: Anaconda 2.4.1 (x86_64)

I have the following method to read large gzip files:

df = pd.read_csv(os.path.join(filePath, fileName),
     sep='|', compression = 'gzip', dtype='unicode', error_bad_lines=False)

But when I read the file, I get the following error:

pandas.parser.CParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.
Segmentation fault: 11

I read all the existing answers, but most of those questions involved errors such as additional columns. I was already handling that case with the error_bad_lines=False option.

What are my options here?

I found something interesting when I tried to decompress the file:

gunzip -k myfile.txt.gz 
gunzip: myfile.txt.gz: unexpected end of file
gunzip: myfile.txt.gz: uncompress failed

Answer 1

Accepted answer by add-semi-colons

I didn't really find a Python solution, but I managed to find one using Unix tools:

First I used zless myfile.txt.gz > uncompressedMyfile.txt, then I used the sed tool to remove the last line, because I could clearly see that the last line was corrupt.

sed '$d' uncompressedMyfile.txt

I gzipped the file again with gzip -k uncompressedMyfile.txt

I was then able to read the file successfully with the following Python code:

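import os
import pandas as pd
# CParserError lives in pandas.parser in the pandas version from the question;
# newer pandas releases expose it as pandas.errors.ParserError instead.
from pandas.parser import CParserError

def read_gzip_csv(filePath, fileName):  # wrapper name is illustrative
    try:
        return pd.read_csv(os.path.join(filePath, fileName),
                           sep='|', compression='gzip', dtype='unicode', error_bad_lines=False)
    except CParserError:
        print("Something is wrong with the file")
        return None  # nothing could be parsed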
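The shell steps in this answer (decompress, drop the damaged last line, recompress) can also be sketched in Python. This is only a rough illustration, not the method the answer used: it assumes Python 3, a single-member gzip file, and a hypothetical helper name `salvage_gzip`. It uses `zlib.decompressobj` directly because that object returns whatever it can decode from truncated input instead of raising, and its `eof` flag shows whether the end-of-stream marker was reached.

```python
import gzip
import zlib


def salvage_gzip(src_path, dst_path, chunk_size=64 * 1024):
    """Re-compress the readable part of a gzip file into dst_path.

    If the source is truncated, the trailing partial line (everything
    after the last newline that could be decoded) is dropped.
    Returns True if the source stream was complete, False if truncated.
    Assumption: the source is a single-member gzip file.
    """
    # Adding 16 to the window size tells zlib to expect a gzip header.
    decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)
    pending = b""  # decoded bytes after the last newline seen so far
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        while not decomp.eof:
            raw = src.read(chunk_size)
            if not raw:
                break  # input ended before the end-of-stream marker
            data = pending + decomp.decompress(raw)
            cut = data.rfind(b"\n") + 1  # index just past the last full line
            dst.write(data[:cut])
            pending = data[cut:]
        if decomp.eof:
            # Intact file: keep a final line even if it lacks a newline.
            dst.write(pending)
    return decomp.eof
```

Calling `salvage_gzip("myfile.txt.gz", "myfile.fixed.txt.gz")` (the output name is made up for illustration) would produce a file that can be passed to the same pd.read_csv call with compression='gzip'. Unlike sed '$d', this sketch keeps the last line when it ends with a newline, so a line that is complete but still contains bad data would need to be handled separately.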

Answer 2

Answered by Boud

The input zip file is corrupted. Get a proper copy of this file from the source, or try to use zip repair tools, before you pass it along to pandas.
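One way to confirm the corruption before involving pandas is to read the whole archive once with the standard library. The following is a minimal sketch, assuming Python 3 and a hypothetical helper name `gzip_is_readable`; it relies on the `gzip` module raising `EOFError` for a truncated stream, `OSError` (including `gzip.BadGzipFile`) for a bad header or checksum, and `zlib.error` for damaged compressed data.

```python
import gzip
import zlib


def gzip_is_readable(path, chunk_size=1024 * 1024):
    """Return True if the whole gzip file decompresses without error."""
    try:
        with gzip.open(path, "rb") as f:
            # Read and discard the data in chunks so memory use stays bounded.
            while f.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        return False
    return True
```

A False result here corresponds to the "unexpected end of file" message that gunzip printed in the question, and suggests fetching a fresh copy or repairing the file before calling pd.read_csv.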

Answer 3

Answered by Aseem Ahir

Sometimes this error shows up if you already have the file open in another program. Try closing the file and re-running the code.
