Original question: http://stackoverflow.com/questions/14985233/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Load text file as strings using numpy.loadtxt()
Asked by user1966176
I would like to load a big text file (around 1 GB, with 3*10^6 rows and 10 to 100 columns) as a 2D NumPy array containing strings. However, it seems that numpy.loadtxt() only handles floats by default. Is it possible to specify another data type for the entire array? I've tried the following without luck:
loadedData = np.loadtxt(address, dtype=np.str)
I get the following error message:
/Library/Python/2.7/site-packages/numpy-1.8.0.dev_20224ea_20121123-py2.7-macosx-10.8-x86_64.egg/numpy/lib/npyio.pyc in loadtxt(fname, dtype, comments, delimiter, converters, skiprows, usecols, unpack, ndmin)
833 fh.close()
834
--> 835 X = np.array(X, dtype)
836 # Multicolumn data are returned with shape (1, N, M), i.e.
837 # (1, 1, M) for a single row - remove the singleton dimension there
ValueError: cannot set an array element with a sequence
Any ideas? (I don't know the exact number of columns in my file beforehand.)
Answered by Hooked
Use genfromtxt instead. It's a much more general method than loadtxt:
import numpy as np
# print(...) with a single argument works in both Python 2 and Python 3
print(np.genfromtxt('col.txt', dtype='str'))
Using the file col.txt:
foo bar
cat dog
man wine
This gives:
[['foo' 'bar']
 ['cat' 'dog']
 ['man' 'wine']]
If you expect each row to have the same number of columns, read the first row to find that number and set the filling_values parameter to fill in any missing entries.
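For readers who want to try this without creating col.txt by hand, here is a minimal self-contained sketch. It assumes Python 3 and a reasonably recent NumPy; the temporary file is a hypothetical stand-in for the real data file.

```python
import os
import tempfile

import numpy as np

# Hypothetical sample file with the same contents as col.txt above.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("foo bar\ncat dog\nman wine\n")
    path = tmp.name

# dtype=str keeps every field as text; the default delimiter (None)
# splits each line on any run of whitespace.
arr = np.genfromtxt(path, dtype=str)
os.remove(path)

print(arr.shape)  # rows and columns found in the file
print(arr.dtype)  # fixed-width unicode dtype sized to the longest field
```

Note that a fixed-width unicode array reserves room for the longest field in every cell, at 4 bytes per character, so for a file around 1 GB the array may need considerably more memory than the file itself.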
Answered by flonk
Is it essential that you end up with a NumPy array? If not, you could speed things up by loading the data as a nested list.
def load(fname):
    ''' Load the file using std open'''
    f = open(fname,'r')
    data = []
    for line in f.readlines():
        data.append(line.replace('\n','').split(' '))
    f.close()
    return data
For a text file with 4000x4000 words this is about 10 times faster than loadtxt.
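A variation on the same idea, as a sketch rather than a drop-in replacement: it iterates over the file lazily instead of calling readlines(), uses split() with no argument so that tabs and repeated spaces are handled, and reports the distinct row lengths, which helps when the column count is not known in advance. The names load_rows and column_counts and the sample data are hypothetical.

```python
import os
import tempfile


def load_rows(fname):
    """Read a whitespace-delimited text file into a list of lists of strings."""
    rows = []
    with open(fname, "r") as f:
        for line in f:  # iterate lazily instead of calling readlines()
            fields = line.split()  # split on any whitespace run, drop the newline
            if fields:  # skip blank lines
                rows.append(fields)
    return rows


def column_counts(rows):
    """Return the set of distinct row lengths, to spot ragged rows."""
    return {len(row) for row in rows}


# Hypothetical sample data, mixing single spaces, double spaces and a tab.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("foo bar\ncat  dog\nman\twine\n")
    sample_path = tmp.name

rows = load_rows(sample_path)
counts = column_counts(rows)
os.remove(sample_path)

print(rows)
print(counts)
```

If counts contains more than one value, the rows are ragged and would need padding before being converted to a rectangular array.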
Answered by Alexander Tronchin-James
There is also read_csv in pandas, which is fast and supports non-comma column separators and automatic typing by column:
import pandas as pd
df = pd.read_csv('your_file',sep='\t')
If you prefer a NumPy array, you can convert the DataFrame with:
import numpy as np
arr = np.array(df)
This is by far the easiest and most mature text import approach I've come across.
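Since the question asks for every cell to stay a string, the automatic typing may need to be switched off. The sketch below assumes a headerless tab-separated file; io.StringIO and the sample values are hypothetical stand-ins for the real file, while header, dtype and to_numpy are existing pandas options.

```python
import io

import pandas as pd

# Hypothetical tab-separated sample; io.StringIO stands in for a file path.
sample = io.StringIO("foo\tbar\t007\ncat\tdog\t42\n")

# header=None: treat the first row as data, not as column names.
# dtype=str: turn off per-column type inference so every cell stays text.
df = pd.read_csv(sample, sep="\t", header=None, dtype=str)

arr = df.to_numpy()  # 2D NumPy array holding Python strings
print(arr)
```

Without dtype=str the third column would be parsed as integers, and the leading zeros in 007 would be lost.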

