Python: using numpy.genfromtxt to read a csv file with strings containing commas

Question

Asked by CraigO

I am trying to read in a csv file with numpy.genfromtxt, but some of the fields are strings which contain commas. The strings are in quotes, but numpy is not recognizing the quotes as defining a single string. For example, with the data in 't.csv':

2012, "Louisville KY", 3.5
2011, "Lexington, KY", 4.0

the code

np.genfromtxt('t.csv', delimiter=',')

produces the error:

ValueError: Some errors were detected ! Line #2 (got 4 columns instead of 3)

The data structure I am looking for is:

array([['2012', 'Louisville KY', '3.5'],
       ['2011', 'Lexington, KY', '4.0']], 
      dtype='|S13')

Looking over the documentation, I don't see any options to deal with this. Is there a way to do it with numpy, or do I just need to read in the data with the csv module and then convert it to a numpy array?

Answer 1

Accepted answer by joris

You can use pandas (which is becoming the default library for working with dataframes (heterogeneous data) in scientific Python) for this. Its read_csv can handle this. From the docs:

quotechar : string

The character used to denote the start and end of a quoted item. Quoted items
can include the delimiter and it will be ignored.

The default value is ". An example:

In [1]: import pandas as pd

In [2]: from io import StringIO

In [3]: s="""year, city, value
   ...: 2012, "Louisville KY", 3.5
   ...: 2011, "Lexington, KY", 4.0"""

In [4]: pd.read_csv(StringIO(s), quotechar='"', skipinitialspace=True)
Out[4]:
   year           city  value
0  2012  Louisville KY    3.5
1  2011  Lexington, KY    4.0

The trick here is that you also have to use skipinitialspace=True to deal with the spaces after the comma delimiter.

Apart from its powerful csv reader, I can also strongly advise using pandas for the heterogeneous data you have (the example output in numpy you give is all strings, although you could use structured arrays).
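
For readers who would rather stay in numpy, the structured-array option mentioned above might look like the following minimal sketch. It is an illustration, not part of the original answer: the field names (year, city, value) are borrowed from the pandas example, the 'U20' string width is an arbitrary assumption, and the sample data is inlined instead of being read from t.csv.

```python
import csv
import io

import numpy as np

# Sample rows mirroring 't.csv' from the question (inlined for illustration).
sample = '2012, "Louisville KY", 3.5\n2011, "Lexington, KY", 4.0\n'

# skipinitialspace=True makes the csv module treat the quote as the first
# character of the field, so the comma inside the quotes is preserved.
reader = csv.reader(io.StringIO(sample), skipinitialspace=True)

# Hypothetical field names and widths; adjust them to the real data.
dtype = [("year", "i4"), ("city", "U20"), ("value", "f8")]

# Each row becomes one record with a typed value per field.
records = np.array(
    [(int(year), city, float(value)) for year, city, value in reader],
    dtype=dtype,
)

print(records["city"])
```

Each column keeps its own type, so records["value"] can be used in arithmetic while records["city"] stays text.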

Answer 2

Answer by Bitwise

The problem is the additional comma; np.genfromtxt does not deal with that.

One simple solution is to read the file with csv.reader() from Python's csv module into a list and then dump it into a numpy array if you like.
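
A minimal sketch of that approach, assuming data shaped like the question's t.csv (inlined here so the snippet is self-contained). The skipinitialspace=True option is an addition for this particular data: without it, the space after each comma stops the csv module from recognizing the opening quote.

```python
import csv
import io

import numpy as np

# Sample rows mirroring 't.csv' from the question (inlined for illustration).
sample = '2012, "Louisville KY", 3.5\n2011, "Lexington, KY", 4.0\n'

# Parse with the csv module, which understands quoted fields.
reader = csv.reader(io.StringIO(sample), skipinitialspace=True)
rows = list(reader)

# Every field is still a string, so this yields a 2-D array of strings.
arr = np.array(rows)

print(arr)
```

With a real file, io.StringIO(sample) would be replaced by a file object opened with open('t.csv', newline='').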

If you really want to use np.genfromtxt, note that it can take iterators instead of files, e.g. np.genfromtxt(my_iterator, ...). So you can wrap a csv.reader in an iterator and give it to np.genfromtxt.

That would go something like this:

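import csv
import numpy as np

# skipinitialspace=True is needed when a space follows each comma, as in t.csv;
# dtype=str keeps the text column as strings instead of nan.
with open('myfile.csv', newline='') as f:
    rows = ("\t".join(row) for row in csv.reader(f, skipinitialspace=True))
    data = np.genfromtxt(rows, delimiter="\t", dtype=str)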

This essentially replaces, on the fly, only the delimiting commas with tabs, leaving the commas inside quoted fields untouched.

Answer 3

Answer by Michael Yurin

If you are using numpy you probably want to work with a numpy.ndarray. This will give you one:

import pandas
# as_matrix() was removed in pandas 1.0; to_numpy() is its replacement
data = pandas.read_csv('file.csv').to_numpy()

Pandas will handle the "Lexington, KY" case correctly (with skipinitialspace=True if a space follows each comma, as noted in the accepted answer).
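
As a rough sketch of how this plays out on the question's data, which has no header row and a space after each comma: the header=None, names, and skipinitialspace arguments below are additions made for that data, and the column names are assumptions taken from the accepted answer's example.

```python
from io import StringIO

import pandas as pd

# Sample rows mirroring 't.csv' from the question (inlined for illustration).
sample = '2012, "Louisville KY", 3.5\n2011, "Lexington, KY", 4.0\n'

df = pd.read_csv(
    StringIO(sample),
    header=None,                      # the sample file has no header row
    names=["year", "city", "value"],  # assumed column names
    skipinitialspace=True,            # ignore the space after each comma
)

# Mixed column types produce an object-dtype ndarray.
data = df.to_numpy()

print(data)
```

Because the columns have different types, the resulting array has dtype object; selecting a single column first (for example df["value"].to_numpy()) gives a typed array.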

Answer 4

Answer by Mike T

Make a better function that combines the power of the standard csv module and Numpy's recfromcsv. For instance, the csv module has good control and customization of dialects, quotes, escape characters, etc., which you can add to the example below.

The example recfromcsv_mod function below reads a complicated CSV file much as Microsoft Excel would, including commas within quoted fields. Internally, the function has a generator function that rewrites each row with tab delimiters.

import csv
import numpy as np

def recfromcsv_mod(fname, **kwargs):
    def rewrite_csv_as_tab(fname):
        # On Python 3, open in text mode with newline='' as the csv docs recommend
        with open(fname, newline='') as fp:
            reader = csv.reader(fp)
            for row in reader:
                yield '\t'.join(row)
    return np.recfromcsv(rewrite_csv_as_tab(fname), delimiter='\t', **kwargs)

# Use it to read a CSV file into a record array
x = recfromcsv_mod('t.csv', case_sensitive=True)
