Problems reading a CSV file with commas and special characters in pandas

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/14550441/

Time: 2020-09-13 20:37:16  Source: igfitidea

Problems reading CSV file with commas and characters in pandas

Tags: python, csv, special-characters, pandas

Asked by user1992696

I am trying to read a CSV file using pandas. The file has a column called Tags, which consists of user-provided tags such as -, "", '', 1950's, and 16th-century. Since these are user-provided, they also contain many special characters entered by mistake. The issue is that I cannot open the file with pandas read_csv; it shows the error: CParser, error tokenizing data. Can someone help me read the CSV file into pandas?

Accepted answer by DSM

Okay. Starting from a badly formatted CSV we can't read:

>>> !cat unquoted.csv
1950's,xyz.nl/user_003,bad, 123
17th,red,flower,xyz.nl/user_001,good,203
"",xyz.nl/user_239,not very,345
>>> pd.read_csv("unquoted.csv", header=None)
Traceback (most recent call last):
  File "<ipython-input-40-7d9aadb2fad5>", line 1, in <module>
    pd.read_csv("unquoted.csv", header=None)
[...]
  File "parser.pyx", line 1572, in pandas._parser.raise_parser_error (pandas/src/parser.c:17041)
CParserError: Error tokenizing data. C error: Expected 4 fields in line 2, saw 6
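
The error says that line 2 has 6 fields where 4 were expected. Before repairing anything, it can help to see which rows are affected. The following is a minimal sketch using only the standard csv module; `field_counts` is a hypothetical helper name, the sample text is copied from unquoted.csv above, and treating the most common field count as the intended one is an assumption that may not hold for every file.

```python
import csv
import io
from collections import Counter


def field_counts(lines):
    """Map each 1-based row number to the number of fields csv.reader sees."""
    return {i: len(row) for i, row in enumerate(csv.reader(lines), start=1)}


# Sample copied from unquoted.csv; for a real file, pass an open file object
# (opened with newline="") instead of the StringIO.
sample = io.StringIO(
    "1950's,xyz.nl/user_003,bad, 123\n"
    "17th,red,flower,xyz.nl/user_001,good,203\n"
    '"",xyz.nl/user_239,not very,345\n'
)

counts = field_counts(sample)

# Assumption: the most common field count is the intended number of columns.
expected = Counter(counts.values()).most_common(1)[0][0]
bad_rows = [n for n, c in counts.items() if c != expected]

print(counts)    # field count per row
print(bad_rows)  # rows whose field count differs from the expected one
```

For the sample above, only row 2 is flagged, which matches the line number in the pandas error.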

We can write a cleaner version of the file, taking advantage of the fact that the last three columns are well-behaved:

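import csv

# Python 3: open both files in text mode with newline="" so the csv module
# handles line endings itself.
with open("unquoted.csv", newline="") as infile, open("quoted.csv", "w", newline="") as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    for line in reader:
        # Merge everything before the last three fields back into one tag field;
        # the writer quotes that field whenever it contains a comma.
        merged = [','.join(line[:-3])] + line[-3:]
        writer.writerow(merged)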

which produces

>>> !cat quoted.csv
1950's,xyz.nl/user_003,bad, 123
"17th,red,flower",xyz.nl/user_001,good,203
,xyz.nl/user_239,not very,345

and then we can read it:

>>> pd.read_csv("quoted.csv", header=None)
                 0                1         2    3
0           1950's  xyz.nl/user_003       bad  123
1  17th,red,flower  xyz.nl/user_001      good  203
2              NaN  xyz.nl/user_239  not very  345
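
If the intermediate quoted.csv file is not needed, the same idea can be sketched as a generator that repairs rows in memory. This is only an illustration: `repair_rows` and its `n_tail` parameter are hypothetical names, and it relies on the same assumption as the script above, namely that the last three fields of every row never contain a comma.

```python
import csv
import io


def repair_rows(lines, n_tail=3):
    """Yield rows with all leading fields merged into a single first column.

    Assumes the last n_tail fields of every row never contain a comma.
    """
    for fields in csv.reader(lines):
        yield [",".join(fields[:-n_tail])] + fields[-n_tail:]


# Sample copied from unquoted.csv; for a real file, pass an open file object.
sample = io.StringIO(
    "1950's,xyz.nl/user_003,bad, 123\n"
    "17th,red,flower,xyz.nl/user_001,good,203\n"
    '"",xyz.nl/user_239,not very,345\n'
)

rows = list(repair_rows(sample))
for row in rows:
    print(row)
```

The resulting list of lists can be handed to pd.DataFrame(rows). Unlike read_csv, this route keeps every value as a string, so the empty tag stays '' rather than becoming NaN and the numbers in the last column are not converted.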

I'd look into fixing this problem at the source and getting the data in a tolerable format, though. Tricks like this shouldn't be necessary, and the file could very easily have been impossible to repair.
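
As a rough sketch of what fixing it at the source could look like, assuming the export step is under your control and written in Python: writing through csv.writer quotes any field that contains a comma or a quote character, so the file can be read back without repair. The records below are made up for illustration (the second one is not from the original data), and an in-memory buffer stands in for the real output file.

```python
import csv
import io

# Illustrative records; the first tag contains commas, the second contains quotes.
records = [
    ["17th,red,flower", "xyz.nl/user_001", "good", 203],
    ['say "hi"', "xyz.nl/user_002", "bad", 7],
]

# An in-memory buffer stands in for open("tags.csv", "w", newline="").
buffer = io.StringIO()
writer = csv.writer(buffer)  # default QUOTE_MINIMAL: quote only when needed
writer.writerows(records)

text = buffer.getvalue()
print(text)

# Reading the text back recovers the original fields (numbers come back as strings).
round_trip = list(csv.reader(io.StringIO(text)))
print(round_trip)
```

A file written this way keeps each tag in a single column, which is the property the repair script had to reconstruct after the fact.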
