
Note: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA terms and attribute it to the original authors (not the translator). Original: http://stackoverflow.com/questions/18039057/

Posted: 2020-08-19 09:47:04  Source: igfitidea

Python Pandas Error tokenizing data

Tags: python, csv, pandas

Asked by abuteau

I'm trying to use pandas to manipulate a .csv file but I get this error:


pandas.parser.CParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 12


I have tried to read the pandas docs, but found nothing.


My code is simple:


path = 'GOOG Key Ratios.csv'
#print(open(path).read())
data = pd.read_csv(path)

How can I resolve this? Should I use the csv module or another language?


File is from Morningstar


Accepted answer by richie

You could also try:


data = pd.read_csv('file1.csv', error_bad_lines=False)

Do note that this will cause the offending lines to be skipped.

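Note that in pandas 1.3 and later, error_bad_lines is deprecated in favor of the on_bad_lines argument. A rough modern equivalent, sketched here with an invented in-memory file rather than the asker's actual CSV:

```python
import io
import pandas as pd

# A small in-memory CSV where the third data line has too many fields
csv_data = io.StringIO("a,b\n1,2\n1,2,3,4\n5,6\n")

# pandas >= 1.3: silently skip malformed lines instead of raising
df = pd.read_csv(csv_data, on_bad_lines="skip")
```

As with error_bad_lines=False, the offending rows are simply dropped from the result.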

Answer by TomAugspurger

The parser is getting confused by the header of the file. It reads the first row and infers the number of columns from that row. But the first two rows aren't representative of the actual data in the file.


Try it with data = pd.read_csv(path, skiprows=2)

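As a sketch of what skiprows does here, using a made-up in-memory file whose first two lines are export metadata rather than data (the real Morningstar file's contents are not shown in the question):

```python
import io
import pandas as pd

# First two lines are junk metadata; the real header starts on line 3.
# Without skiprows, read_csv infers 1 column from line 1 and then fails.
raw = io.StringIO("Morningstar export\n2013-08-03\ndate,open,close\n2013-08-01,10,11\n")

df = pd.read_csv(raw, skiprows=2)
```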

Answer by Legend_Ari

I came across the same issue. Using pd.read_table() on the same source file seemed to work. I could not trace the reason for this, but it was a useful workaround for my case. Perhaps someone more knowledgeable can shed more light on why it worked.


Edit: I found that this error creeps up when you have some text in your file that does not have the same format as the actual data. This is usually header or footer information (more than one line, so skip_header doesn't work) that is not separated by the same number of commas as your actual data (when using read_csv). Using read_table uses a tab as the delimiter, which could circumvent the user's current error but introduce others.


I usually get around this by reading the extra data into a file and then using the read_csv() method.


The exact solution might differ depending on your actual file, but this approach has worked for me in several cases.


Answer by grisaitis

It might be an issue with


  • the delimiters in your data
  • the first row, as @TomAugspurger noted

To solve it, try specifying the sep and/or header arguments when calling read_csv. For instance,


df = pandas.read_csv(fileName, sep='delimiter', header=None)

In the code above, sep defines your delimiter and header=None tells pandas that your source data has no row for headers / column titles. Thus saith the docs: "If file contains no header row, then you should explicitly pass header=None". In this instance, pandas automatically creates whole-number indices for each field {0,1,2,...}.


According to the docs, the delimiter thing should not be an issue. The docs say that "if sep is None [not specified], will try to automatically determine this." I however have not had good luck with this, including instances with obvious delimiters.

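A minimal illustration of both arguments together, using invented semicolon-delimited data (the question's actual delimiter is unknown):

```python
import io
import pandas as pd

# Semicolon-delimited data with no header row
raw = io.StringIO("1;alice;10\n2;bob;20\n")

# sep names the delimiter explicitly; header=None means the first
# line is data, so pandas auto-numbers the columns 0, 1, 2
df = pd.read_csv(raw, sep=";", header=None)
```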

Answer by Piyush S. Wanare

This is definitely a delimiter issue: many of these CSV files are actually created tab-separated, so try read_csv with the tab character (\t) as the separator. So, try to open it using the following line of code.


data=pd.read_csv("File_path", sep='\t')

Answer by Robert Geiger

I had this problem as well but perhaps for a different reason. I had some trailing commas in my CSV that were adding an additional column that pandas was attempting to read. Using the following works but it simply ignores the bad lines:


data = pd.read_csv('file1.csv', error_bad_lines=False)

If you want to keep those lines, an ugly kind of hack for handling the errors is to do something like the following:


line     = []
expected = []
saw      = []
cont     = True

while cont == True:
    try:
        data = pd.read_csv('file1.csv', skiprows=line)
        cont = False
    except Exception as e:
        # e.g. "Error tokenizing data. C error: Expected 2 fields in line 3, saw 12"
        errortype = str(e).split('.')[0].strip()
        if errortype == 'Error tokenizing data':
            cerror = str(e).split(':')[1].strip().replace(',', '')
            nums = [n for n in cerror.split(' ') if str.isdigit(n)]
            expected.append(int(nums[0]))
            saw.append(int(nums[2]))
            line.append(int(nums[1]) - 1)
        else:
            cerror = 'Unknown'
            print('Unknown Error - 222')

if line != []:
    pass  # handle the skipped lines however you want

I proceeded to write a script to reinsert the lines into the DataFrame since the bad lines will be given by the variable 'line' in the above code. This can all be avoided by simply using the csv reader. Hopefully the pandas developers can make it easier to deal with this situation in the future.


Answer by elPastor

I've had this problem a few times myself. Almost every time, the reason is that the file I was attempting to open was not a properly saved CSV to begin with. And by "properly", I mean each row had the same number of separators or columns.


Typically it happened because I had opened the CSV in Excel then improperly saved it. Even though the file extension was still .csv, the pure CSV format had been altered.


Any file saved with pandas to_csv will be properly formatted and shouldn't have that issue. But if you open it with another program, it may change the structure.

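A quick way to check this claim is a round trip through to_csv and read_csv (a sketch with throwaway data, not the asker's file):

```python
import io
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# Write with to_csv and read it straight back: the output parses cleanly
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
df2 = pd.read_csv(buf)
```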

Hope that helps.


Answer by RegularlyScheduledProgramming

Although not the case for this question, this error may also appear with compressed data. Explicitly setting a value for the compression kwarg resolved my problem.


result = pandas.read_csv(data_source, compression='gzip')

Answer by computerist

Your CSV file might have a variable number of columns, and read_csv inferred the number of columns from the first few rows. Two ways to solve it in this case:


1) Change the CSV file to have a dummy first line with the max number of columns (and specify header=[0])


2) Or use names = list(range(0,N)) where N is the max number of columns.

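A sketch of option 2, using an invented in-memory file whose rows have between two and four fields:

```python
import io
import pandas as pd

# Rows have a variable number of fields; the widest row has 4
raw = io.StringIO("1,2\n1,2,3\n1,2,3,4\n")

# Supplying 4 column names up front stops read_csv from inferring
# 2 columns from the first row; short rows are padded with NaN
df = pd.read_csv(raw, names=list(range(4)))
```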

Answer by lotrus28

I've had a similar problem while trying to read a tab-delimited table with spaces, commas and quotes:


1115794 4218    "k__Bacteria", "p__Firmicutes", "c__Bacilli", "o__Bacillales", "f__Bacillaceae", ""
1144102 3180    "k__Bacteria", "p__Firmicutes", "c__Bacilli", "o__Bacillales", "f__Bacillaceae", "g__Bacillus", ""
368444  2328    "k__Bacteria", "p__Bacteroidetes", "c__Bacteroidia", "o__Bacteroidales", "f__Bacteroidaceae", "g__Bacteroides", ""



import pandas as pd
# Same error for read_table
counts = pd.read_csv(path_counts, sep='\t', index_col=2, header=None, engine = 'c')

pandas.io.common.CParserError: Error tokenizing data. C error: out of memory

This says it has something to do with the C parsing engine (the default one). Maybe changing to the python engine will change something:


counts = pd.read_table(path_counts, sep='\t', index_col=2, header=None, engine='python')

Segmentation fault (core dumped)

Now that is a different error.
If we go ahead and try to remove spaces from the table, the error from python-engine changes once again:


1115794 4218    "k__Bacteria","p__Firmicutes","c__Bacilli","o__Bacillales","f__Bacillaceae",""
1144102 3180    "k__Bacteria","p__Firmicutes","c__Bacilli","o__Bacillales","f__Bacillaceae","g__Bacillus",""
368444  2328    "k__Bacteria","p__Bacteroidetes","c__Bacteroidia","o__Bacteroidales","f__Bacteroidaceae","g__Bacteroides",""


_csv.Error: '   ' expected after '"'

And it gets clear that pandas was having problems parsing our rows. To parse a table with python engine I needed to remove all spaces and quotes from the table beforehand. Meanwhile C-engine kept crashing even with commas in rows.

To avoid creating a new file with replacements I did this, as my tables are small:


from io import StringIO
with open(path_counts) as f:
    cleaned = StringIO(f.read().replace('", ""', '').replace('"', '').replace(', ', ','))
    counts = pd.read_table(cleaned, sep='\t', index_col=2, header=None, engine='python')

tl;dr
Change parsing engine, try to avoid any non-delimiting quotes/commas/spaces in your data.
