Note: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license, link to the original question, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/33440805/
Pandas dataframe read_csv on bad data
Asked by Fonti
I want to read in a very large csv (it cannot be opened in Excel and edited easily), but somewhere around the 100,000th row there is a row with one extra column, which causes the program to crash. That row is erroneous, so I need a way to ignore the fact that it has an extra column. There are around 50 columns, so hardcoding the headers and using names or usecols isn't preferable. I'll also possibly encounter this issue in other csv's and want a generic solution. Unfortunately, I couldn't find anything in read_csv. The code is as simple as this:
import pandas as pd

def loadCSV(filePath):
    dataframe = pd.read_csv(filePath, index_col=False, encoding='iso-8859-1', nrows=1000)
    datakeys = dataframe.keys()
    return dataframe, datakeys
Answered by EdChum
Pass error_bad_lines=False to skip erroneous rows:
error_bad_lines : boolean, default True. Lines with too many fields (e.g. a csv line with too many commas) will by default cause an exception to be raised, and no DataFrame will be returned. If False, then these "bad lines" will be dropped from the DataFrame that is returned. (Only valid with C parser)
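As a minimal sketch of this behavior (using an in-memory CSV rather than the asker's file): note that in pandas 1.3+ the error_bad_lines parameter was deprecated in favor of on_bad_lines, and it was removed entirely in pandas 2.0, so the snippet below uses the newer spelling. On older pandas versions, pd.read_csv(..., error_bad_lines=False) is the equivalent call.

```python
import io
import pandas as pd

# A small CSV where the third data row has one extra field.
csv_data = "a,b,c\n1,2,3\n4,5,6\n7,8,9,10\n11,12,13\n"

# pandas >= 1.3: on_bad_lines="skip" drops malformed rows silently;
# older pandas: use error_bad_lines=False (optionally warn_bad_lines=False).
df = pd.read_csv(io.StringIO(csv_data), on_bad_lines="skip")

print(df.shape)  # the malformed row is dropped, leaving 3 rows x 3 columns
```

With on_bad_lines="warn" instead of "skip", each dropped row also emits a warning, which is useful for auditing how many lines were discarded from a large file.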