Original URL: http://stackoverflow.com/questions/55188544/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverFlow
Pandas: How to workaround "error tokenizing data"?
Asked by theozh
A lot of questions have already been asked about this topic on SO (and many others). Among the numerous answers, none of them has really helped me so far. If I missed the useful one, please let me know.
I simply would like to read a CSV file with pandas into a dataframe. Sounds like a simple task.
My file Test.csv
1,2,3,4,5
1,2,3,4,5,6
,,3,4,5
1,2,3,4,5,6,7
,2,,4
My code:
import pandas as pd
df = pd.read_csv('Test.csv',header=None)
My error:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 5 fields in line 2, saw 6
My guess about the issue is that Pandas looks at the first line and expects the same number of tokens in the following rows. If that is not the case, it stops with an error.
In the numerous answers, the suggestions for using options are, e.g.:
error_bad_lines=False
or header=None
or skiprows=3
and more non-helpful suggestions.
However, I don't want to ignore or skip any lines. And I don't know in advance how many columns and rows the datafile has.
So it basically boils down to how to find the maximum number of columns in the datafile. Is this the way to go? I hoped that there was an easy way to simply read a CSV file which does not have the maximum column number in the first line. Thank you for any hints. I'm using Python 3.6.3 and Pandas 0.24.1 on Win7.
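One way to attack exactly that question (a sketch of my own, not taken from any answer in this thread): pre-scan the file with the stdlib csv module to find the widest row, then pass that many explicit column names to read_csv so every shorter row is padded with NaN.

```python
import csv
import pandas as pd

# Recreate the sample file from the question
data = "1,2,3,4,5\n1,2,3,4,5,6\n,,3,4,5\n1,2,3,4,5,6,7\n,2,,4\n"
with open('Test.csv', 'w') as f:
    f.write(data)

# First pass: find the maximum number of columns
with open('Test.csv', newline='') as f:
    max_cols = max(len(row) for row in csv.reader(f))

# Second pass: read_csv accepts ragged rows once it knows the full width
df = pd.read_csv('Test.csv', header=None, names=range(max_cols))
print(df.shape)  # (5, 7)
```

The two passes cost one extra read of the file, but everything else is left to pandas.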
Answered by theozh
Thank you @ALollz for the "very fresh" link (lucky coincidence) and @Rich Andrews for pointing out that my example actually is not "strictly correct" CSV data.
So, the way it works for me for the time being is adapted from @ALollz's compact solution (https://stackoverflow.com/a/55129746/7295599):
### reading an "incorrect" CSV to dataframe having a variable number of columns/tokens
import pandas as pd
df = pd.read_csv('Test.csv', header=None, sep='\n')
df = df[0].str.split(',', expand=True)
# ... do some modifications with df
### end of code
df contains the empty string '' for the missing entries at the beginning and in the middle, and None for the missing tokens at the end.
   0  1  2  3     4     5     6
0  1  2  3  4     5  None  None
1  1  2  3  4     5     6  None
2        3  4     5  None  None
3  1  2  3  4     5     6     7
4     2     4  None  None  None
If you write this again to a file via:
df.to_csv("Test.tab",sep="\t",header=False,index=False)
1	2	3	4	5
1	2	3	4	5	6
		3	4	5
1	2	3	4	5	6	7
	2		4
None will be converted to the empty string '' and everything is fine.
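If you then want numeric columns rather than strings, one possible follow-up (my own addition, not part of the original answer) is to map both placeholders to NaN and convert:

```python
import numpy as np
import pandas as pd

# Same split-and-expand trick as above, on an inline sample
rows = ["1,2,3,4,5", "1,2,3,4,5,6", ",,3,4,5"]
df = pd.Series(rows).str.split(',', expand=True)

# '' (missing entries) and None (missing tokens) both become NaN,
# after which every column parses as a number
df = df.replace('', np.nan).apply(pd.to_numeric)
print(df.shape)  # (3, 6)
```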
The next level would be to account for data strings in quotes which contain the separator, but that's another topic.
1,2,3,4,5
,,3,"Hello, World!",5,6
1,2,3,4,5,6,7
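As a hedged sketch of how that next level could look (my own assumption, using the stdlib csv module rather than anything from this thread): csv.reader honors the quoting rules, so the comma inside "Hello, World!" survives the split, and padding the short rows afterwards gives a rectangular dataframe.

```python
import csv
import io
import pandas as pd

raw = '1,2,3,4,5\n,,3,"Hello, World!",5,6\n1,2,3,4,5,6,7\n'

# csv.reader splits on commas but keeps quoted commas intact
rows = list(csv.reader(io.StringIO(raw)))

# Pad each row with '' up to the widest row
width = max(len(r) for r in rows)
rows = [r + [''] * (width - len(r)) for r in rows]

df = pd.DataFrame(rows)
print(df.iloc[1, 3])  # Hello, World!
```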
Answered by Rich Andrews
Read the csv using the tolerant Python csv module, and fix the loaded data prior to handing it off to pandas, which would otherwise fail on the malformed csv data regardless of which csv engine pandas uses.
import pandas as pd
import csv

not_csv = """1,2,3,4,5
1,2,3,4,5,6
,,3,4,5
1,2,3,4,5,6,7
,2,,4
"""

with open('not_a.csv', 'w') as csvfile:
    csvfile.write(not_csv)

d = []
with open('not_a.csv') as csvfile:
    areader = csv.reader(csvfile)
    max_elems = 0
    for row in areader:
        if max_elems < len(row): max_elems = len(row)
    csvfile.seek(0)
    for i, row in enumerate(areader):
        # fix my csv by padding the rows
        d.append(row + ["" for x in range(max_elems - len(row))])

df = pd.DataFrame(d)
print(df)

# the default engine
# provides "pandas.errors.ParserError: Error tokenizing data. C error: Expected 5 fields in line 2, saw 6"
#df = pd.read_csv('Test.csv', header=None, engine='c')

# the python csv engine
# provides "pandas.errors.ParserError: Expected 6 fields in line 4, saw 7"
#df = pd.read_csv('Test.csv', header=None, engine='python')
Preprocess the file outside of Python if you are concerned about the extra fix-up code cluttering your Python.
Richs-MBP:tmp randrews$ cat test.csv
1,2,3
1,
2
1,2,
,,,
Richs-MBP:tmp randrews$ awk 'BEGIN {FS=","}; {print $1","$2","$3","$4","$5}' < test.csv
1,2,3,,
1,,,,
2,,,,
1,2,,,
,,,,
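The awk one-liner hard-codes five fields; a sketch of a more general variant (my own assumption: the target width n is known in advance) pads every row in a loop:

```shell
printf '1,2,3\n1,\n2\n' | awk -F, -v n=5 '{
  out = $1
  for (i = 2; i <= n; i++) out = out "," $i  # absent fields expand to ""
  print out
}'
```

Every row comes out padded to five comma-separated fields, just like the hard-coded version above.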
Answered by mujjiga
I have a different take on the solution. Let pandas take care of creating the table and handling None values, and let us take care of writing a proper tokenizer.
Tokenizer
def tokenize(str):
    idx = [x for x, v in enumerate(str) if v == '\"']
    if len(idx) % 2 != 0:
        idx = idx[:-1]
    memory = {}
    for i in range(0, len(idx), 2):
        val = str[idx[i]:idx[i+1]+1]
        key = "_"*(len(val)-1)+"{0}".format(i)
        memory[key] = val
        str = str.replace(memory[key], key, 1)
    return [memory.get(token, token) for token in str.split(",")]
Test cases for Tokenizer
print (tokenize("1,2,3,4,5"))
print (tokenize(",,3,\"Hello, World!\",5,6"))
print (tokenize(",,3,\"Hello,,,, World!\",5,6"))
print (tokenize(",,3,\"Hello, World!\",5,6,,3,\"Hello, World!\",5,6"))
print (tokenize(",,3,\"Hello, World!\",5,6,,3,\"Hello,,5,6"))
Output
['1', '2', '3', '4', '5']
['', '', '3', '"Hello, World!"', '5', '6']
['', '', '3', '"Hello,,,, World!"', '5', '6']
['', '', '3', '"Hello, World!"', '5', '6', '', '3', '"Hello, World!"', '5', '6']
['', '', '3', '"Hello, World!"', '5', '6', '', '3', '"Hello', '', '5', '6']
Putting the tokenizer into action
import numpy as np

with open("test1.csv", "r") as fp:
    lines = fp.readlines()

lines = list(map(lambda x: tokenize(x.strip()), lines))
df = pd.DataFrame(lines).replace(np.nan, '')
Advantage:
Now we can tweak the tokenizer function as per our needs.
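One such tweak, sketched here as my own stand-in (not from the answer itself): delegate the quote handling to the stdlib csv.reader, which also strips the quote marks from the tokens.

```python
import csv
import io

# A drop-in alternative tokenizer: csv.reader already splits on commas
# while respecting quotes, and removes the surrounding quote characters
def tokenize_unquoted(line):
    return next(csv.reader(io.StringIO(line)))

print(tokenize_unquoted(',,3,"Hello, World!",5,6'))
# ['', '', '3', 'Hello, World!', '5', '6']
```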
Answered by Corina Roca
In my case:
1. I opened the *.csv in Excel
2. I saved the *.csv as CSV (comma-delimited)
3. I loaded the file in python via:
import pandas as pd
df = pd.read_csv('yourcsvfile.csv', sep=',')
Hope it helps!