numpy genfromtxt/pandas read_csv; ignore commas within quote marks

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep it under the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/24079304/

Date: 2020-09-13 22:08:28  Source: igfitidea

python file-io numpy pandas genfromtxt

Asked by atomh33ls

Consider a file, a.dat, with contents:

address 1, address 2, address 3, num1, num2, num3
address 1, address 2, address 3, 1.0, 2.0, 3
address 1, address 2, "address 3, address4", 1.0, 2.0, 3

I am trying to import it with numpy.genfromtxt. However, the function sees an additional column in row 3. I get a similar error with pandas.read_csv:

np.genfromtxt('a.dat',delimiter=',',dtype=None,skiprows=1)

ValueError: Some errors were detected !
    Line #3 (got 7 columns instead of 6)

and with pandas.read_csv:

pd.read_csv('a.dat')

pandas.parser.CParserError: Error tokenizing data. C error: Expected 6 fields in line 3, saw 7

I'm trying to find an input parameter to compensate for this. I don't mind if I end up with a numpy ndarray or pandas dataframe.

Is there a parameter that I can set within genfromtxt and/or read_csv that will let me ignore the comma within the quotation marks?

I note that read_csv includes a quotechar='"' parameter, defined thus:

quotechar: string (length 1) The character used to denote the start and end of a quoted item. Quoted items can include the delimiter and it will be ignored.

This reads to me like read_csv should work for my case by default - yet it doesn't.

I can see that I could pre-process the file to strip out the commas - I'd like to avoid that if possible but would welcome suggestions if this is the only way.

Answered by atomh33ls

Just managed to find this:

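The key parameter that I was missing is skipinitialspace=True, which "deals with the spaces after the comma-delimiter". Without it, the space after the comma becomes the first character of the field, so the quote mark that follows is not treated as an opening quote and the comma inside the quoted text still splits the field.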

a=pd.read_csv('a.dat',quotechar='"',skipinitialspace=True)

   address 1  address 2            address 3  num1  num2  num3
0  address 1  address 2            address 3     1     2     3
1  address 1  address 2  address 3, address4     1     2     3

This works :-)
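
For anyone who wants to reproduce this without creating a.dat, here is a minimal sketch that feeds the same three lines to read_csv from memory. The SAMPLE string and the default_failed flag are names made up for this illustration, and it assumes a pandas version recent enough to expose pd.errors.ParserError (older releases raised pandas.parser.CParserError, as in the question).

```python
import io

import pandas as pd

# In-memory stand-in for a.dat, copied from the question (illustrative name).
SAMPLE = (
    'address 1, address 2, address 3, num1, num2, num3\n'
    'address 1, address 2, address 3, 1.0, 2.0, 3\n'
    'address 1, address 2, "address 3, address4", 1.0, 2.0, 3\n'
)

# Default settings: the space before the opening quote keeps the quote from
# being recognised, so line 3 splits into 7 fields and parsing fails.
try:
    pd.read_csv(io.StringIO(SAMPLE))
    default_failed = False
except pd.errors.ParserError:
    default_failed = True

# Skipping the space after each delimiter puts the quote at the start of the
# field, so the embedded comma stays inside a single value.
df = pd.read_csv(io.StringIO(SAMPLE), skipinitialspace=True)

print(default_failed)
print(df.shape)
print(df.loc[1, "address 3"])
```

quotechar='"' is already the default, so skipinitialspace=True is the only argument that changes the outcome in this sketch.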

Answered by Sven Marnach

Python's built-in csv module can deal with this kind of data.

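import csv
import numpy

# Python 3 port: map replaces itertools.imap, and zip must be wrapped in list().
with open("a.dat", newline="") as f:
    reader = csv.reader(f, skipinitialspace=True)
    header = next(reader)  # first row holds the column names
    # 'U20' stores text as str; use 'S20' instead if byte strings are wanted.
    dtype = numpy.dtype(list(zip(header, ['U20', 'U20', 'U20', 'f8', 'f8', 'f8'])))
    data = numpy.fromiter(map(tuple, reader), dtype=dtype)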
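
To see what skipinitialspace changes on its own, here is a small standard-library sketch that parses the question's sample twice and compares the field counts. SAMPLE and parse are hypothetical names used only for this illustration; the text is read from memory rather than from a.dat.

```python
import csv
import io

# In-memory copy of a.dat from the question (illustrative name).
SAMPLE = (
    'address 1, address 2, address 3, num1, num2, num3\n'
    'address 1, address 2, address 3, 1.0, 2.0, 3\n'
    'address 1, address 2, "address 3, address4", 1.0, 2.0, 3\n'
)


def parse(text, **kwargs):
    """Parse CSV text with csv.reader and return the rows as lists."""
    return list(csv.reader(io.StringIO(text), **kwargs))


# Default dialect: the field starts with a space, so the quote that follows
# is kept as a literal character and the inner comma splits the field.
default_rows = parse(SAMPLE)

# With skipinitialspace=True the space is dropped first, the quote opens a
# quoted field, and the inner comma is preserved.
fixed_rows = parse(SAMPLE, skipinitialspace=True)

print([len(row) for row in default_rows])  # [6, 6, 7]
print([len(row) for row in fixed_rows])    # [6, 6, 6]
print(fixed_rows[2][2])                    # address 3, address4
```

The extra seventh field in the default parse is the same extra column that genfromtxt and read_csv complain about in the question.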