Using numpy.genfromtxt to read a csv file with strings containing commas
Notice: this page is a Chinese-English translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/17933282/
Asked by CraigO
I am trying to read in a csv file with numpy.genfromtxt, but some of the fields are strings which contain commas. The strings are in quotes, but numpy is not recognizing the quotes as defining a single string. For example, with the data in 't.csv':
2012, "Louisville KY", 3.5
2011, "Lexington, KY", 4.0
the code
np.genfromtxt('t.csv', delimiter=',')
produces the error:
ValueError: Some errors were detected ! Line #2 (got 4 columns instead of 3)
The data structure I am looking for is:
array([['2012', 'Louisville KY', '3.5'],
       ['2011', 'Lexington, KY', '4.0']],
      dtype='|S13')
Looking over the documentation, I don't see any options to deal with this. Is there a way to do it with numpy, or do I just need to read in the data with the csv module and then convert it to a numpy array?
Accepted answer by joris
You can use pandas (which is becoming the default library for working with dataframes (heterogeneous data) in scientific Python) for this. Its read_csv can handle this. From the docs:
quotechar : string
The character used to denote the start and end of a quoted item. Quoted items can include the delimiter and it will be ignored.
The default value is ". An example:
In [1]: import pandas as pd
In [2]: from io import StringIO  # Python 3; on Python 2 this was "from StringIO import StringIO"
In [3]: s = """year, city, value
   ...: 2012, "Louisville KY", 3.5
   ...: 2011, "Lexington, KY", 4.0"""
In [4]: pd.read_csv(StringIO(s), quotechar='"', skipinitialspace=True)
Out[4]:
   year           city  value
0  2012  Louisville KY    3.5
1  2011  Lexington, KY    4.0
The trick here is that you also have to use skipinitialspace=True to deal with the spaces after the comma delimiter.
Apart from its powerful csv reader, I can also strongly advise using pandas for the heterogeneous data you have (the example output in numpy you give is all strings, although you could use structured arrays).
Answer by Bitwise
The problem is the additional comma: np.genfromtxt does not deal with that.
One simple solution is to read the file with csv.reader() from Python's csv module into a list and then dump it into a numpy array if you like.
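A minimal sketch of that csv-module route, assuming the two sample rows from the question. The in-memory io.StringIO buffer is only a stand-in for an opened 't.csv', and skipinitialspace=True is an addition here: without it, the space after each comma keeps the csv module from treating the following quote as the start of a quoted field.

```python
import csv
import io

import numpy as np

# Sample rows from the question; io.StringIO stands in for open('t.csv', newline='')
text = '2012, "Louisville KY", 3.5\n2011, "Lexington, KY", 4.0\n'

# skipinitialspace=True drops the blank after each comma so the quotes are honoured
rows = list(csv.reader(io.StringIO(text), skipinitialspace=True))

# A list of equal-length lists of strings converts directly to a 2-D string array
arr = np.array(rows)
print(arr)
```

Every element is a string at this point; individual columns can be converted afterwards, for example with arr[:, 2].astype(float).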
If you really want to use np.genfromtxt, note that it can take iterators instead of files, e.g. np.genfromtxt(my_iterator, ...). So, you can wrap a csv.reader in an iterator and give it to np.genfromtxt.
That would go something like this:
import csv
import numpy as np
np.genfromtxt(("\t".join(i) for i in csv.reader(open('myfile.csv'))), delimiter="\t")
This essentially replaces on-the-fly only the appropriate commas with tabs.
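For a self-contained check of the iterator approach, here is a sketch run against the question's sample rows. The io.StringIO buffer, dtype=str, and skipinitialspace=True are assumptions added for this illustration (the sample data has a space after each comma), and the approach presumes that no field itself contains a tab.

```python
import csv
import io

import numpy as np

# Sample rows from the question; io.StringIO stands in for open('myfile.csv')
text = '2012, "Louisville KY", 3.5\n2011, "Lexington, KY", 4.0\n'

# csv.reader splits on the real field separators and keeps quoted commas intact
reader = csv.reader(io.StringIO(text), skipinitialspace=True)

# Re-join each parsed row with tabs so genfromtxt sees an unambiguous delimiter
lines = ("\t".join(row) for row in reader)

# dtype=str keeps every field as text, matching the array asked for in the question
arr = np.genfromtxt(lines, delimiter="\t", dtype=str)
print(arr)
```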
Answer by Michael Yurin
If you are using numpy, you probably want to work with a numpy.ndarray. This will give you a numpy.ndarray:
import pandas
# DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() is its replacement
data = pandas.read_csv('file.csv').to_numpy()
Pandas will handle the "Lexington, KY" case correctly.
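As a rough illustration with the question's sample rows, the sketch below goes from CSV text to an ndarray. The io.StringIO buffer, header=None (the sample has no header row), and skipinitialspace=True (the sample has a space after each comma) are assumptions added here, and it uses DataFrame.to_numpy(), since as_matrix() is no longer available in pandas 1.0 and later.

```python
import io

import pandas as pd

# Sample rows from the question; io.StringIO stands in for the file on disk
text = '2012, "Louisville KY", 3.5\n2011, "Lexington, KY", 4.0\n'

# header=None: the sample has no header row
# skipinitialspace=True: lets the quotes after ", " be recognised
df = pd.read_csv(io.StringIO(text), header=None, skipinitialspace=True)

# Mixed column types (int, str, float) give an object-dtype ndarray
data = df.to_numpy()
print(data)
```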
Answer by Mike T
Make a better function that combines the power of the standard csv module and NumPy's recfromcsv. For instance, the csv module has good control and customization of dialects, quotes, escape characters, etc., which you can add to the example below.
The example recfromcsv_mod function below reads in a complicated CSV file similar to what Microsoft Excel sees, which may contain commas within quoted fields. Internally, the function has a generator function that rewrites each row with tab delimiters.
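import csv
import numpy as np

def recfromcsv_mod(fname, **kwargs):
    def rewrite_csv_as_tab(fname):
        # Python 3: the csv module wants text mode with newline=''
        # (the original 'rb' mode only works on Python 2)
        with open(fname, newline='') as fp:
            reader = csv.reader(fp)
            for row in reader:
                # Re-emit each parsed row with tab delimiters
                yield '\t'.join(row)
    # Note: np.recfromcsv is absent from the main namespace in NumPy 2.x;
    # np.genfromtxt(..., names=True, dtype=None) is the suggested substitute there
    return np.recfromcsv(rewrite_csv_as_tab(fname), delimiter='\t', **kwargs)

# Use it to read a CSV file into a record array
x = recfromcsv_mod('t.csv', case_sensitive=True)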