Using numpy.genfromtxt to read a csv file with strings containing commas

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/17933282/

Time: 2020-08-19 09:31:34  Source: igfitidea

python, numpy, pandas, genfromtxt

Asked by CraigO

I am trying to read in a csv file with numpy.genfromtxt, but some of the fields are strings which contain commas. The strings are in quotes, but numpy is not recognizing the quotes as defining a single string. For example, with the data in 't.csv':

2012, "Louisville KY", 3.5
2011, "Lexington, KY", 4.0

the code

np.genfromtxt('t.csv', delimiter=',')

produces the error:

ValueError: Some errors were detected ! Line #2 (got 4 columns instead of 3)
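
The failure can be reproduced without a file. This is a minimal sketch, assuming the two sample rows above are fed through io.StringIO as a stand-in for 't.csv'; the variable names are illustrative only.

```python
import io
import numpy as np

# The two sample rows from the question, held in memory instead of 't.csv'.
data = '2012, "Louisville KY", 3.5\n2011, "Lexington, KY", 4.0\n'

try:
    np.genfromtxt(io.StringIO(data), delimiter=',')
    message = None
except ValueError as exc:
    # genfromtxt splits on every comma, so row 2 yields 4 fields.
    message = str(exc)

print(message)
```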

The data structure I am looking for is:

array([['2012', 'Louisville KY', '3.5'],
       ['2011', 'Lexington, KY', '4.0']], 
      dtype='|S13')

Looking over the documentation, I don't see any options to deal with this. Is there a way to do it with numpy, or do I just need to read in the data with the csv module and then convert it to a numpy array?

Accepted answer by joris

You can use pandas for this (it is becoming the default library for working with dataframes, i.e. heterogeneous data, in scientific Python). Its read_csv function can handle this. From the docs:

quotechar : string

The character to used to denote the start and end of a quoted item. Quoted items 
can include the delimiter and it will be ignored.

The default value is ". An example:

In [1]: import pandas as pd

In [2]: from io import StringIO  # Python 3 (Python 2: from StringIO import StringIO)

In [3]: s="""year, city, value
   ...: 2012, "Louisville KY", 3.5
   ...: 2011, "Lexington, KY", 4.0"""

In [4]: pd.read_csv(StringIO(s), quotechar='"', skipinitialspace=True)
Out[4]:
   year           city  value
0  2012  Louisville KY    3.5
1  2011  Lexington, KY    4.0

The trick here is that you also have to use skipinitialspace=True to deal with the spaces after the comma delimiter.
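
To see why the spaces matter, here is a small sketch using the standard-library csv module rather than pandas. The csv module has its own skipinitialspace option, and it only treats a quote as opening a quoted field when the quote is the first character of the field. That pandas follows the same convention is an assumption here, not something the answer states.

```python
import csv

# One sample row from the question; note the space after each comma.
line = '2011, "Lexington, KY", 4.0'

# Default: the field starts with a space, so the quote is ordinary text
# and the comma inside the quotes still splits the field.
default_fields = next(csv.reader([line]))

# With skipinitialspace=True the space is dropped, the quote opens a
# quoted field, and the embedded comma is preserved.
trimmed_fields = next(csv.reader([line], skipinitialspace=True))

print(default_fields)
print(trimmed_fields)
```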

Apart from its powerful csv reader, I can also strongly advise using pandas for the heterogeneous data you have (the example output in numpy you give is all strings, although you could use structured arrays).

Answer by Bitwise

The problem is the additional comma; np.genfromtxt does not deal with that.

One simple solution is to read the file with csv.reader() from Python's csv module into a list and then dump it into a numpy array if you like.
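
A minimal sketch of that approach, assuming the sample rows from the question. io.StringIO stands in for open('t.csv', newline=''), and skipinitialspace=True is added because the sample data has a space after each comma.

```python
import csv
import io
import numpy as np

# Sample rows from the question; a real script would open the file instead.
text = '2012, "Louisville KY", 3.5\n2011, "Lexington, KY", 4.0\n'

# csv.reader honours the quotes, so each row comes back with 3 fields.
rows = list(csv.reader(io.StringIO(text), skipinitialspace=True))

# A list of equal-length string lists converts to a 2-D string array.
arr = np.array(rows)
print(arr)
```

Every element is a string here, which matches the array the question asks for; numeric columns would need a separate conversion.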

If you really want to use np.genfromtxt, note that it can take iterators instead of files, e.g. np.genfromtxt(my_iterator, ...). So, you can wrap a csv.reader in an iterator and give it to np.genfromtxt.

That would go something like this:

import csv
import numpy as np

# Keep the file open only while genfromtxt consumes the generator
with open('myfile.csv', newline='') as f:
    data = np.genfromtxt(("\t".join(i) for i in csv.reader(f)), delimiter="\t")

This essentially replaces, on the fly, only the appropriate commas (the delimiters, not those inside quoted strings) with tabs.
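
Applied to the sample data from the question, the idea might look like the sketch below. Two things are additions not found in the answer: skipinitialspace=True (the sample has spaces after the commas) and dtype=str (so the text column is kept as text rather than being parsed as a float). io.StringIO again stands in for the real file.

```python
import csv
import io
import numpy as np

# Sample rows from the question, in memory instead of on disk.
text = '2012, "Louisville KY", 3.5\n2011, "Lexington, KY", 4.0\n'

# csv.reader resolves the quoting; each row is re-joined with tabs so
# that genfromtxt only ever sees the real field boundaries.
reader = csv.reader(io.StringIO(text), skipinitialspace=True)
tab_lines = ("\t".join(row) for row in reader)

result = np.genfromtxt(tab_lines, delimiter="\t", dtype=str)
print(result)
```

This relies on no field containing a tab; a different separator would be needed for data that does.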

Answer by Michael Yurin

If you are using numpy you probably want to work with numpy.ndarray. This will give you a numpy.ndarray:

import pandas
data = pandas.read_csv('file.csv').to_numpy()  # as_matrix() was removed in pandas 1.0

Pandas will handle the "Lexington, KY" case correctly.
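
For the headerless sample in the question, a sketch along these lines may be closer to what is needed. The header=None and skipinitialspace=True arguments are assumptions based on that sample (by default read_csv treats the first row as column names), and io.StringIO stands in for the file.

```python
from io import StringIO
import pandas as pd

# Sample rows from the question; there is no header row.
text = '2012, "Louisville KY", 3.5\n2011, "Lexington, KY", 4.0\n'

df = pd.read_csv(StringIO(text), header=None, skipinitialspace=True)

# Mixed column types give an object array holding ints, strings and floats.
mixed = df.to_numpy()

# Casting to str first gives the all-string layout shown in the question.
as_text = df.astype(str).to_numpy()
print(as_text)
```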

Answer by Mike T

Make a better function that combines the power of the standard csv module and Numpy's recfromcsv. For instance, the csv module has good control and customization of dialects, quotes, escape characters, etc., which you can add to the example below.

The example recfromcsv_mod function below reads in a complicated CSV file similar to what Microsoft Excel sees, which may contain commas within quoted fields. Internally, the function has a generator function that rewrites each row with tab delimiters.

import csv
import numpy as np

def recfromcsv_mod(fname, **kwargs):
    def rewrite_csv_as_tab(fname):
        with open(fname, newline='') as fp:  # text mode; Python 3's csv.reader needs str, not bytes
            reader = csv.reader(fp)
            for row in reader:
                yield '\t'.join(row)
    return np.recfromcsv(rewrite_csv_as_tab(fname), delimiter='\t', **kwargs)

# Use it to read a CSV file into a record array
x = recfromcsv_mod('t.csv', case_sensitive=True)
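
A variant of the same idea built on np.genfromtxt is sketched below, since recfromcsv may not be available in newer NumPy releases. The function name read_quoted_csv, the default options, and the temporary file with a header row are all hypothetical choices for illustration, not part of the answer.

```python
import csv
import os
import tempfile
import numpy as np

def read_quoted_csv(fname, **kwargs):
    """Hypothetical helper: parse quoting with csv, then hand tab-joined rows to genfromtxt."""
    def tab_lines():
        with open(fname, newline='') as fp:
            for row in csv.reader(fp, skipinitialspace=True):
                yield '\t'.join(row)
    # Defaults chosen for this sketch; callers can override any of them.
    options = dict(delimiter='\t', dtype=None, names=True, encoding='utf-8')
    options.update(kwargs)
    return np.genfromtxt(tab_lines(), **options)

# Demo: write the sample data (with a header row) to a temporary file.
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    tmp.write('year, city, value\n'
              '2012, "Louisville KY", 3.5\n'
              '2011, "Lexington, KY", 4.0\n')
try:
    x = read_quoted_csv(tmp.name)
finally:
    os.remove(tmp.name)

print(x['city'])
```

With names=True the first row supplies the field names, so the result is a structured array whose columns are accessed by name.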