Python: Error tokenizing data. C error: Calling read(nbytes) on source failed with input gzip file

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/40835287/

Time: 2020-09-14 02:31:15  Source: igfitidea

python pandas

Asked by add-semi-colons

I am using conda with Python 2.7:

python --version
Python 2.7.12 :: Anaconda 2.4.1 (x86_64)

I use the following call to read large gzip files:

import os
import pandas as pd
df = pd.read_csv(os.path.join(filePath, fileName),
                 sep='|', compression='gzip', dtype='unicode', error_bad_lines=False)

but when I read the file I get the following error:

pandas.parser.CParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.
Segmentation fault: 11

I read all the existing answers, but most of those questions involved errors such as additional columns. I was already handling that case with the error_bad_lines=False option.

What are my options here?

I found something interesting when I tried to decompress the file:

gunzip -k myfile.txt.gz 
gunzip: myfile.txt.gz: unexpected end of file
gunzip: myfile.txt.gz: uncompress failed
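
The same integrity check can be run from Python before the file is handed to pandas. The following is a minimal sketch, assuming Python 3 and a single-member gzip file; the helper name gzip_is_intact and the chunk_size parameter are hypothetical and not part of the original question.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the whole gzip file at `path` decompresses cleanly."""
    try:
        with gzip.open(path, "rb") as handle:
            # Read and discard the data in chunks so large files
            # never have to fit in memory at once.
            while handle.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        # EOFError: the file ends before the end-of-stream marker (truncated).
        # OSError / zlib.error: bad header, bad checksum, or corrupt data.
        # Note that a missing or unreadable file is also reported as False.
        return False
    return True
```

A truncated file like the one gunzip rejects above would be expected to make this helper return False, which is a cheaper signal than a parser crash inside pandas.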

Accepted answer by add-semi-colons

I didn't really find a Python solution, but I managed to find one using Unix tools:

First I ran zless myfile.txt.gz > uncompressedMyfile.txt, then I used the sed tool to remove the last line, because I could clearly see that the last line was corrupt.

sed '$d' uncompressedMyfile.txt

Then I gzipped the file again with gzip -k uncompressedMyfile.txt

I was then able to read the file successfully with the following Python code:

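import os
import pandas as pd
from pandas.errors import ParserError  # was pandas.parser.CParserError before pandas 0.20

def read_gzip_csv(filePath, fileName):
    try:
        # pandas >= 1.3 spelling; the original passed error_bad_lines=False and dtype='unicode'
        return pd.read_csv(os.path.join(filePath, fileName), sep='|',
                           compression='gzip', dtype=str, on_bad_lines='skip')
    except ParserError:
        print("Something is wrong with the file")
        return None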
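
The shell steps in this answer (decompress what is readable, drop the damaged last line, compress again) can also be approximated in Python alone. One detail to keep in mind: sed '$d' file, as written, prints the trimmed text to standard output rather than editing the file in place, so the sketch below writes the trimmed data itself. It is only a sketch under stated assumptions: Python 3, a single-member gzip file that is truncated rather than corrupted in the middle, and newline-terminated records. The function name salvage_truncated_gzip and its parameters are hypothetical.

```python
import gzip
import zlib


def salvage_truncated_gzip(src_path, dst_path, chunk_size=1 << 20):
    """Copy every complete line that can still be decompressed from
    src_path into a new gzip file at dst_path.

    Returns True if src_path was intact, False if it was truncated.
    """
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    pending = b""  # bytes seen after the most recent newline
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        while not decompressor.eof:
            chunk = src.read(chunk_size)
            if not chunk:
                break  # input ended before the end-of-stream marker
            data = pending + decompressor.decompress(chunk)
            # Write only whole lines; keep the unfinished tail for later.
            head, newline, pending = data.rpartition(b"\n")
            dst.write(head + newline)
        intact = decompressor.eof
        if intact:
            # In a complete file, a last line without a newline is valid.
            dst.write(pending)
        # In a truncated file, `pending` is the partial last line and is
        # dropped, which mirrors the sed '$d' step.
    return intact
```

The file written to dst_path can then be passed to pd.read_csv with compression='gzip' as in the code above. Corruption in the middle of the stream would raise zlib.error here, which this sketch deliberately leaves unhandled.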

Answer by Boud

The input zip file is corrupted. Get a proper copy of this file from the source, or try to use zip repairing tools before you pass it along to pandas.

Answer by Aseem Ahir

Sometimes the error shows up if you already have the file open. Try closing the file and re-running.