Note: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/38902553/

Error tokenizing data during Pandas read_csv. How to actually see the bad lines?

python, csv, pandas

Asked by ??????

I have a large CSV that I load as follows:

df = pd.read_csv('my_data.tsv', sep='\t', header=0, skiprows=[1, 2, 3])

I get several errors during the loading process.

  1. First, if I don't specify warn_bad_lines=True, error_bad_lines=False, I get:

    Error tokenizing data. C error: Expected 22 fields in line 329867, saw 24

  2. Second, if I use the options above, I now get:

    CParserError: Error tokenizing data. C error: EOF inside string starting at line 32357585

The question is: how can I look at these bad lines to understand what's going on? Is it possible to have read_csv return these bogus lines?

I tried the following hint (Pandas ParserError EOF character when reading multiple csv files to HDF5):

import pandas as pd

try:
    df = pd.read_csv('mydata.tsv', sep='\t', header=0, skiprows=[1, 2, 3])
# older pandas releases exposed this exception as pandas.parser.CParserError
except pd.errors.ParserError as detail:
    print(detail)

but I still get:

Error tokenizing data. C error: Expected 22 fields in line 329867, saw 24

Answered by Sameh Farouk

I will give my answer in two parts.

Part 1: The OP asked how to output these bad lines. To do that, we can use Python's csv module in a simple script like this:

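import csv

file = 'your_filename.csv'  # use your filename
lines_set = {100, 200}  # use your bad line numbers here
last_line = max(lines_set)  # compute once instead of on every iteration

# newline='' is what the csv module docs recommend for files passed to csv.reader
with open(file, newline='') as f_obj:
    # start=1 assumes the line numbers in the pandas error message count from 1;
    # pass delimiter='\t' to csv.reader for a tab-separated file
    for line_number, row in enumerate(csv.reader(f_obj), start=1):
        if line_number > last_line:
            break
        elif line_number in lines_set:
            print(line_number, row)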

We can also put it in a more general function like this:

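import csv


def read_my_lines(file, lines_list, reader=csv.reader):
    """Print the rows of `file` whose line numbers are in `lines_list`."""
    lines_set = set(lines_list)
    last_line = max(lines_set)  # compute once, outside the loop
    with open(file, newline='') as f_obj:
        # call the reader that was passed in, so a custom one can be supplied;
        # start=1 assumes the line numbers in the pandas error count from 1
        for line_number, row in enumerate(reader(f_obj), start=1):
            if line_number > last_line:
                break
            elif line_number in lines_set:
                print(line_number, row)


if __name__ == '__main__':
    read_my_lines(file='your_filename.csv', lines_list=[100, 200])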
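
When the bad line numbers are not known in advance, a similar loop can look for them directly. The following is only a minimal sketch and not part of the original answer: the helper name find_bad_rows and its parameters are hypothetical, and it assumes that a "bad" row is simply one whose field count differs from that of the first row, which mirrors the "Expected 22 fields ... saw 24" message.

```python
import csv


def find_bad_rows(path, delimiter='\t', expected=None, limit=10):
    """Return up to `limit` (line_number, row) pairs with an unexpected field count.

    Hypothetical helper: `expected` defaults to the field count of the first
    non-empty row. `line_number` is the physical line on which the row ended.
    """
    bad = []
    with open(path, newline='') as f_obj:
        reader = csv.reader(f_obj, delimiter=delimiter)
        for row in reader:
            if not row:
                continue  # ignore completely empty lines
            if expected is None:
                expected = len(row)  # take the first row as the reference width
            if len(row) != expected:
                # reader.line_num counts lines read from the file so far
                bad.append((reader.line_num, row))
                if len(bad) >= limit:
                    break
    return bad
```

Calling find_bad_rows('mydata.tsv') would return the first few suspicious rows together with their line numbers, which can then be inspected by hand.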

Part 2: The cause of the error you get.

It's hard to diagnose a problem like this without a sample of the file you are using, but you should try this:

pd.read_csv(filename)

Does it parse the file with no error? If so, I will explain why.

The number of columns is inferred from the first line.

By using skiprows=[1,2,3] together with header=0 you skipped three rows near the top of the file. I guess those contain the column names, or the header that should have the correct number of columns.

Basically, you are constraining what the parser does.

So parse without skiprows or header=0, then reindex to what you need later.

Note:

If you are unsure which delimiter is used in the file, use sep=None, but parsing will be slower.

From the pandas.read_csv docs:

sep : str, default ',' Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator by Python's builtin sniffer tool, csv.Sniffer. In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'
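
The sniffing that the docs mention can also be tried directly with the standard library before calling read_csv. This is a minimal sketch, assuming the first few kilobytes of the file are representative; the helper name guess_delimiter, the sample size, and the candidate list are illustrative choices and not something pandas itself exposes.

```python
import csv


def guess_delimiter(path, sample_size=64 * 1024, candidates=',;\t|'):
    """Guess the delimiter of a text file with csv.Sniffer (hypothetical helper)."""
    with open(path, newline='') as f_obj:
        sample = f_obj.read(sample_size)
    # Sniffer.sniff raises csv.Error if it cannot determine the delimiter
    dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    return dialect.delimiter
```

The result could then be passed on explicitly, for example pd.read_csv(filename, sep=guess_delimiter(filename)), which keeps the faster C engine available.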

Answered by Marina

In my case, specifying the separator helped:

data = pd.read_csv('/Users/myfile.csv', encoding='cp1251', sep=';')

Answered by Yugandhar Chaudhari

We can get the line number from the error message and print that line to see what it looks like.

Try:

import re
import subprocess

import pandas as pd

filename = 'mydata.tsv'
try:
    df = pd.read_csv(filename, sep='\t', header=0, skiprows=[1, 2, 3])
except pd.errors.ParserError as detail:  # pandas.parser.CParserError in older pandas
    print(detail)
    # all the numbers in the "Expected N fields" message, e.g. ['22', '329867', '24'];
    # the line number is at index 1
    err = re.findall(r'\b\d+\b', str(detail))
    # runs 'sed -n <N>p filename' to print line N of the file (sed must be installed)
    line = subprocess.check_output(['sed', '-n', err[1] + 'p', filename],
                                   stderr=subprocess.STDOUT, text=True)
    print('Bad line')
    print(line)  # to see the line
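
If sed is not available (for example on Windows), the same idea can be sketched in pure Python. This is an illustrative variant, not the answerer's code: the helper names line_number_from_error and read_line are made up here, and the regular expression assumes the message contains the phrase "line <number>", as both errors quoted in the question do.

```python
import itertools
import re


def line_number_from_error(message):
    """Extract N from '... line N ...' in a parser error message, or None."""
    match = re.search(r'\bline (\d+)', str(message))
    return int(match.group(1)) if match else None


def read_line(path, line_number):
    """Return the 1-based line `line_number` of `path`, or None if the file is shorter."""
    with open(path, newline='') as f_obj:
        # islice skips the first line_number - 1 lines lazily, without loading the file
        return next(itertools.islice(f_obj, line_number - 1, line_number), None)
```

Inside the except block, read_line(filename, line_number_from_error(detail)) would then return the raw text of the offending line.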