Note: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license, link to the original question, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/33440805/
Pandas dataframe read_csv on bad data
Asked by Fonti
I want to read in a very large csv (it cannot be opened in Excel and edited easily), but somewhere around the 100,000th row there is a row with one extra column, which causes the program to crash. That row is erroneous, so I need a way to ignore the fact that it has an extra column. There are around 50 columns, so hardcoding the headers and using names or usecols isn't preferable. I'll also possibly encounter this issue in other csv's and want a generic solution. Unfortunately, I couldn't find anything in read_csv. The code is as simple as this:
import pandas as pd

def loadCSV(filePath):
    dataframe = pd.read_csv(filePath, index_col=False, encoding='iso-8859-1', nrows=1000)
    datakeys = dataframe.keys()
    return dataframe, datakeys
Answered by EdChum
Pass error_bad_lines=False to skip erroneous rows:
error_bad_lines : boolean, default True. Lines with too many fields (e.g. a csv line with too many commas) will by default cause an exception to be raised, and no DataFrame will be returned. If False, then these "bad lines" will be dropped from the DataFrame that is returned. (Only valid with C parser)
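As a minimal sketch of this behavior (using an in-memory CSV rather than the asker's file): note that in pandas 1.3+ the error_bad_lines parameter was deprecated in favor of on_bad_lines, and it was removed entirely in pandas 2.0, so the snippet below uses the newer spelling. On older pandas versions, pd.read_csv(..., error_bad_lines=False) is the equivalent call.

```python
import io
import pandas as pd

# A small CSV where the third data row has one extra field.
csv_data = "a,b,c\n1,2,3\n4,5,6\n7,8,9,10\n11,12,13\n"

# pandas >= 1.3: on_bad_lines="skip" drops malformed rows silently;
# older pandas: use error_bad_lines=False (optionally warn_bad_lines=False).
df = pd.read_csv(io.StringIO(csv_data), on_bad_lines="skip")

print(df.shape)  # the malformed row is dropped, leaving 3 rows x 3 columns
```

With on_bad_lines="warn" instead of "skip", each dropped row also emits a warning, which is useful for auditing how many lines were discarded from a large file.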