Python: double-quoted elements in a CSV can't be read with pandas

Notice: this page is a translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/26595819/

Time: 2020-08-19 00:43:24  Source: igfitidea

Double-quoted elements in CSV can't be read with pandas

python csv pandas

Asked by PopcornKing

I have an input file where every value is stored as a string. It is inside a csv file with each entry inside double quotes.

Example file:

"column1","column2", "column3", "column4", "column5", "column6"
"AM", "07", "1", "SD", "SD", "CR"
"AM", "08", "1,2,3", "PR,SD,SD", "PR,SD,SD", "PR,SD,SD"
"AM", "01", "2", "SD", "SD", "SD"

There are only six columns. What options do I need to pass to pandas read_csv to read this correctly?

I am currently trying:

import pandas as pd
df = pd.read_csv(file, quotechar='"')

but this gives me the error message: CParserError: Error tokenizing data. C error: Expected 6 fields in line 3, saw 14

This apparently means that it is ignoring the '"' and treating every comma as a field separator. However, in line 3, columns 3 through 6 should be strings with commas in them ("1,2,3", "PR,SD,SD", "PR,SD,SD", "PR,SD,SD").
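
Editor's sketch: the field count in the error can be reproduced without pandas. The snippet below uses Python's standard csv module as a stand-in for the pandas tokenizer (an assumption: the two parsers are separate implementations, though they treat this case similarly). It suggests the quotes are not ignored outright; a quote that follows a space is simply not at the start of its field, so it is read as a literal character. The names data, naive_rows and clean_rows are illustrative only.

```python
import csv
import io

# Sample rows copied from the question; note the space after most commas.
data = '''"column1","column2", "column3", "column4", "column5", "column6"
"AM", "07", "1", "SD", "SD", "CR"
"AM", "08", "1,2,3", "PR,SD,SD", "PR,SD,SD", "PR,SD,SD"
"AM", "01", "2", "SD", "SD", "SD"
'''

# Default settings: a quote that follows a space is not at the start of the
# field, so it is kept as a literal character and every comma splits.
naive_rows = list(csv.reader(io.StringIO(data)))

# skipinitialspace=True drops the space after each delimiter, so the quote
# opens a quoted field again and embedded commas are preserved.
clean_rows = list(csv.reader(io.StringIO(data), skipinitialspace=True))

# Compare the number of fields found on line 3 of the file under each setting.
print(len(naive_rows[2]), len(clean_rows[2]))
print(clean_rows[2])
```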

How do I get pandas.read_csv to parse this correctly?

Thanks.

Accepted answer by Jeff

This will work. It falls back to the Python parser because your separators are irregular (a comma, sometimes followed by a space). If you only had commas, it would use the C parser and be much faster.

In [1]: import csv; import pandas as pd

In [2]: !cat test.csv
"column1","column2", "column3", "column4", "column5", "column6"
"AM", "07", "1", "SD", "SD", "CR"
"AM", "08", "1,2,3", "PR,SD,SD", "PR,SD,SD", "PR,SD,SD"
"AM", "01", "2", "SD", "SD", "SD"

In [3]: pd.read_csv('test.csv', sep=r',\s+', quoting=csv.QUOTE_ALL)
pandas/io/parsers.py:637: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators; you can avoid this warning by specifying engine='python'.
  ParserWarning)
Out[3]: 
     "column1","column2" "column3"   "column4"   "column5"   "column6"
"AM"                "07"       "1"        "SD"        "SD"        "CR"
"AM"                "08"   "1,2,3"  "PR,SD,SD"  "PR,SD,SD"  "PR,SD,SD"
"AM"                "01"       "2"        "SD"        "SD"        "SD"
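
Editor's sketch of an alternative, not part of Jeff's answer: in the output above the quote characters remain in the values and the first two header names are fused, because the header has no space between "column1" and "column2". One option that may avoid both issues is the skipinitialspace parameter of read_csv, which skips spaces after the delimiter so that the default comma separator and the C parser can be used. The in-memory data string stands in for test.csv, and dtype=str is an assumption based on the question's statement that every value is a string.

```python
import io

import pandas as pd

# Same content as test.csv above, held in memory so the sketch is self-contained.
data = '''"column1","column2", "column3", "column4", "column5", "column6"
"AM", "07", "1", "SD", "SD", "CR"
"AM", "08", "1,2,3", "PR,SD,SD", "PR,SD,SD", "PR,SD,SD"
"AM", "01", "2", "SD", "SD", "SD"
'''

df = pd.read_csv(
    io.StringIO(data),
    quotechar='"',          # the default, shown for clarity
    skipinitialspace=True,  # skip the space that follows each comma
    dtype=str,              # assumption: keep every value as a string, e.g. "07"
)

print(df.columns.tolist())
print(df.loc[1, "column3"])
```

With a real file, the io.StringIO(data) argument would be replaced by the file path.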