Original URL: http://stackoverflow.com/questions/40835287/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Python: Error tokenizing data. C error: Calling read(nbytes) on source failed with input nzip file
Asked by add-semi-colons
I am using conda with Python 2.7:
python --version
Python 2.7.12 :: Anaconda 2.4.1 (x86_64)
I have the following method to read large gzip files:
df = pd.read_csv(os.path.join(filePath, fileName),
                 sep='|', compression='gzip', dtype='unicode', error_bad_lines=False)
but when I read the file I get the following error:
pandas.parser.CParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.
Segmentation fault: 11
I read all the existing answers, but most of those questions involved errors such as additional columns. I was already handling that case with the error_bad_lines=False option.
What are my options here?
Found something interesting when I tried to uncompress the file:
gunzip -k myfile.txt.gz
gunzip: myfile.txt.gz: unexpected end of file
gunzip: myfile.txt.gz: uncompress failed
Accepted answer by add-semi-colons
I didn't really find a Python solution, but I managed to find one using Unix tools:
First I used zless myfile.txt.gz > uncompressedMyfile.txt, then I used the sed tool to remove the last line, because I could clearly see that the last line was corrupt.
sed '$d' uncompressedMyfile.txt
I gzipped the file again: gzip -k uncompressedMyfile.txt
I was then able to read the file successfully with the following Python code:
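import os
import pandas as pd
from pandas.parser import CParserError  # module path as shown in the error message above

# The function wrapper is implied by the original "return df"; its name is illustrative.
def read_gzip_csv(filePath, fileName):
    try:
        df = pd.read_csv(os.path.join(filePath, fileName),
                         sep='|', compression='gzip', dtype='unicode', error_bad_lines=False)
    except CParserError:
        print("Something is wrong with the file")
        return None  # df was never assigned, so there is nothing to return
    return df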
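The Unix steps in the answer above (decompress what can be read, drop the damaged last line, compress again) could also be sketched in Python. This is only a rough equivalent under some assumptions: it targets Python 3 rather than the 2.7 used in the question, it assumes a single-member gzip file, and the function name salvage_truncated_gzip and its parameters are made up for illustration. Unlike sed '$d', it only drops the final line when that line is incomplete.

```python
import gzip
import zlib


def salvage_truncated_gzip(src_path, dst_path, chunk_size=1024 * 1024):
    """Write every complete line recoverable from src_path to a new gzip file.

    Returns (lines_kept, was_damaged). Hypothetical helper, not a pandas API.
    """
    # wbits=31 tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(31)
    kept = 0
    damaged = False
    pending = b""  # bytes decoded after the last newline written so far
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        while not decomp.eof:
            raw = src.read(chunk_size)
            if not raw:
                damaged = True  # input ended before the gzip end-of-stream marker
                break
            try:
                data = decomp.decompress(raw)
            except zlib.error:
                damaged = True  # corrupt bytes; keep what was decoded earlier
                break
            head, sep, pending = (pending + data).rpartition(b"\n")
            if sep:
                dst.write(head + sep)
                kept += head.count(b"\n") + 1
        if pending and not damaged:
            # Clean end of stream: the last line simply had no trailing newline.
            dst.write(pending)
            kept += 1
    return kept, damaged
```

Something like salvage_truncated_gzip("myfile.txt.gz", "repaired.txt.gz") would then produce a file that can be passed to pd.read_csv with compression='gzip'. The rows lost to the truncation are still gone, so getting a fresh copy from the source remains the better fix when that is possible.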
Answer by Boud
The input zip file is corrupted. Get a proper copy of this file from the source, or try to use zip repair tools before you pass it along to pandas.
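One way to confirm the corruption before handing the file to pandas is to decompress it once and see whether the stream ends cleanly. The following is a minimal sketch, assuming Python 3 (where a truncated gzip stream raises EOFError on read). The helper name is_gzip_intact is hypothetical and not part of pandas or the standard library.

```python
import gzip
import zlib


def is_gzip_intact(path, chunk_size=1024 * 1024):
    """Return True if the whole gzip stream at `path` decompresses cleanly."""
    try:
        with gzip.open(path, "rb") as handle:
            # Read and discard the data in chunks so large files do not fill memory.
            while handle.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        # EOFError: the stream ends early (truncated file).
        # OSError / zlib.error: bad header, failed CRC, or corrupt data.
        # Note that OSError also covers a missing or unreadable file.
        return False
    return True
```

A call such as is_gzip_intact("myfile.txt.gz") returning False would correspond to the "unexpected end of file" message that gunzip printed in the question.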
Answer by Aseem Ahir
Sometimes the error shows up if you already have the file open. Try closing the file and re-running.