Original URL: http://stackoverflow.com/questions/55188544/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverFlow
Pandas: How to workaround "error tokenizing data"?
Asked by theozh
A lot of questions have already been asked about this topic on SO (and many others). Among the numerous answers, none of them has really helped me so far. If I missed the useful one, please let me know.
I simply would like to read a CSV file with pandas into a dataframe. Sounds like a simple task.
My file Test.csv
1,2,3,4,5
1,2,3,4,5,6
,,3,4,5
1,2,3,4,5,6,7
,2,,4
My code:
import pandas as pd
df = pd.read_csv('Test.csv',header=None)
My error:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 5 fields in line 2, saw 6
My guess about the issue is that Pandas looks at the first line and expects the same number of tokens in the following rows. If that is not the case, it stops with an error.
In the numerous answers, the suggestions for using options are, e.g.:
error_bad_lines=False
or header=None
or skiprows=3
and more non-helpful suggestions.
However, I don't want to ignore or skip any lines. And I don't know in advance how many columns and rows the datafile has.
So it basically boils down to how to find the maximum number of columns in the datafile. Is this the way to go? I hoped that there was an easy way to simply read a CSV file which does not have the maximum column number in the first line. Thank you for any hints. I'm using Python 3.6.3 and Pandas 0.24.1 on Win7.
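One way to attack exactly that question (a sketch of my own, not taken from any answer in this thread): pre-scan the file with the stdlib csv module to find the widest row, then pass that many explicit column names to read_csv so every shorter row is padded with NaN.

```python
import csv
import pandas as pd

# Recreate the sample file from the question
data = "1,2,3,4,5\n1,2,3,4,5,6\n,,3,4,5\n1,2,3,4,5,6,7\n,2,,4\n"
with open('Test.csv', 'w') as f:
    f.write(data)

# First pass: find the maximum number of columns
with open('Test.csv', newline='') as f:
    max_cols = max(len(row) for row in csv.reader(f))

# Second pass: read_csv accepts ragged rows once it knows the full width
df = pd.read_csv('Test.csv', header=None, names=range(max_cols))
print(df.shape)  # (5, 7)
```

The two passes cost one extra read of the file, but everything else is left to pandas.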
Answered by theozh
Thank you @ALollz for the "very fresh" link (lucky coincidence) and @Rich Andrews for pointing out that my example actually is not "strictly correct" CSV data.
So, the way it works for me for the time being is adapted from @ALollz's compact solution (https://stackoverflow.com/a/55129746/7295599):
### reading an "incorrect" CSV to dataframe having a variable number of columns/tokens
import pandas as pd
df = pd.read_csv('Test.csv', header=None, sep='\n')
df = df[0].str.split(',', expand=True)
# ... do some modifications with df
### end of code
df contains the empty string '' for the missing entries at the beginning and in the middle, and None for the missing tokens at the end.
   0  1  2  3     4     5     6
0  1  2  3  4     5  None  None
1  1  2  3  4     5     6  None
2        3  4     5  None  None
3  1  2  3  4     5     6     7
4     2     4  None  None  None
If you write this again to a file via:
df.to_csv("Test.tab",sep="\t",header=False,index=False)
1	2	3	4	5
1	2	3	4	5	6
		3	4	5
1	2	3	4	5	6	7
	2		4
None will be converted to the empty string '' and everything is fine.
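If you then want numeric columns rather than strings, one possible follow-up (my own addition, not part of the original answer) is to map both placeholders to NaN and convert:

```python
import numpy as np
import pandas as pd

# Same split-and-expand trick as above, on an inline sample
rows = ["1,2,3,4,5", "1,2,3,4,5,6", ",,3,4,5"]
df = pd.Series(rows).str.split(',', expand=True)

# '' (missing entries) and None (missing tokens) both become NaN,
# after which every column parses as a number
df = df.replace('', np.nan).apply(pd.to_numeric)
print(df.shape)  # (3, 6)
```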
The next level would be to account for data strings in quotes which contain the separator, but that's another topic.
1,2,3,4,5
,,3,"Hello, World!",5,6
1,2,3,4,5,6,7
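As a hedged sketch of how that next level could look (my own assumption, using the stdlib csv module rather than anything from this thread): csv.reader honors the quoting rules, so the comma inside "Hello, World!" survives the split, and padding the short rows afterwards gives a rectangular dataframe.

```python
import csv
import io
import pandas as pd

raw = '1,2,3,4,5\n,,3,"Hello, World!",5,6\n1,2,3,4,5,6,7\n'

# csv.reader splits on commas but keeps quoted commas intact
rows = list(csv.reader(io.StringIO(raw)))

# Pad each row with '' up to the widest row
width = max(len(r) for r in rows)
rows = [r + [''] * (width - len(r)) for r in rows]

df = pd.DataFrame(rows)
print(df.iloc[1, 3])  # Hello, World!
```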
Answered by Rich Andrews
Read the csv using the tolerant Python csv module, and fix the loaded data prior to handing it off to pandas, which would otherwise fail on the malformed csv data regardless of which csv engine pandas uses.
import pandas as pd
import csv

not_csv = """1,2,3,4,5
1,2,3,4,5,6
,,3,4,5
1,2,3,4,5,6,7
,2,,4
"""

with open('not_a.csv', 'w') as csvfile:
    csvfile.write(not_csv)

d = []
with open('not_a.csv') as csvfile:
    areader = csv.reader(csvfile)
    max_elems = 0
    for row in areader:
        if max_elems < len(row): max_elems = len(row)
    csvfile.seek(0)
    for i, row in enumerate(areader):
        # fix my csv by padding the rows
        d.append(row + ["" for x in range(max_elems - len(row))])

df = pd.DataFrame(d)
print(df)

# the default engine
# provides "pandas.errors.ParserError: Error tokenizing data. C error: Expected 5 fields in line 2, saw 6"
#df = pd.read_csv('Test.csv', header=None, engine='c')

# the python csv engine
# provides "pandas.errors.ParserError: Expected 6 fields in line 4, saw 7"
#df = pd.read_csv('Test.csv', header=None, engine='python')
Preprocess the file outside of Python if you are concerned about the extra fix-up code cluttering your Python.
Richs-MBP:tmp randrews$ cat test.csv
1,2,3
1,
2
1,2,
,,,
Richs-MBP:tmp randrews$ awk 'BEGIN {FS=","}; {print $1","$2","$3","$4","$5}' < test.csv
1,2,3,,
1,,,,
2,,,,
1,2,,,
,,,,
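The awk one-liner hard-codes five fields; a sketch of a more general variant (my own assumption: the target width n is known in advance) pads every row in a loop:

```shell
printf '1,2,3\n1,\n2\n' | awk -F, -v n=5 '{
  out = $1
  for (i = 2; i <= n; i++) out = out "," $i  # absent fields expand to ""
  print out
}'
```

Every row comes out padded to five comma-separated fields, just like the hard-coded version above.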
Answered by mujjiga
I have a different take on the solution. Let pandas take care of creating the table and handling None values, and let us take care of writing a proper tokenizer.
Tokenizer
def tokenize(str):
    idx = [x for x, v in enumerate(str) if v == '\"']
    if len(idx) % 2 != 0:
        idx = idx[:-1]
    memory = {}
    for i in range(0, len(idx), 2):
        val = str[idx[i]:idx[i+1]+1]
        key = "_"*(len(val)-1)+"{0}".format(i)
        memory[key] = val
        str = str.replace(memory[key], key, 1)
    return [memory.get(token, token) for token in str.split(",")]
Test cases for Tokenizer
print (tokenize("1,2,3,4,5"))
print (tokenize(",,3,\"Hello, World!\",5,6"))
print (tokenize(",,3,\"Hello,,,, World!\",5,6"))
print (tokenize(",,3,\"Hello, World!\",5,6,,3,\"Hello, World!\",5,6"))
print (tokenize(",,3,\"Hello, World!\",5,6,,3,\"Hello,,5,6"))
Output
['1', '2', '3', '4', '5']
['', '', '3', '"Hello, World!"', '5', '6']
['', '', '3', '"Hello,,,, World!"', '5', '6']
['', '', '3', '"Hello, World!"', '5', '6', '', '3', '"Hello, World!"', '5', '6']
['', '', '3', '"Hello, World!"', '5', '6', '', '3', '"Hello', '', '5', '6']
Putting the tokenizer into action
import numpy as np

with open("test1.csv", "r") as fp:
    lines = fp.readlines()

lines = list(map(lambda x: tokenize(x.strip()), lines))
df = pd.DataFrame(lines).replace(np.nan, '')
Advantage:
Now we can tweak the tokenizer function as per our needs.
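One such tweak, sketched here as my own stand-in (not from the answer itself): delegate the quote handling to the stdlib csv.reader, which also strips the quote marks from the tokens.

```python
import csv
import io

# A drop-in alternative tokenizer: csv.reader already splits on commas
# while respecting quotes, and removes the surrounding quote characters
def tokenize_unquoted(line):
    return next(csv.reader(io.StringIO(line)))

print(tokenize_unquoted(',,3,"Hello, World!",5,6'))
# ['', '', '3', 'Hello, World!', '5', '6']
```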
Answered by Corina Roca
In my case:
1. I opened the *.csv in Excel
2. I saved the *.csv as CSV (comma-delimited)
3. I loaded the file in python via:
import pandas as pd
df = pd.read_csv('yourcsvfile.csv', sep=',')
Hope it helps!