Python: loading a text file as strings using numpy.loadtxt()

Question

Asked by user1966176

I would like to load a big text file (around 1 GB with 3*10^6 rows and 10 - 100 columns) as a 2D np-array containing strings. However, it seems like numpy.loadtxt() only takes floats as default. Is it possible to specify another data type for the entire array? I've tried the following without luck:

loadedData = np.loadtxt(address, dtype=np.str)

I get the following error message:

/Library/Python/2.7/site-packages/numpy-1.8.0.dev_20224ea_20121123-py2.7-macosx-10.8-x86_64.egg/numpy/lib/npyio.pyc in loadtxt(fname, dtype, comments, delimiter, converters, skiprows, usecols, unpack, ndmin)
    833             fh.close()
    834
--> 835     X = np.array(X, dtype)
    836     # Multicolumn data are returned with shape (1, N, M), i.e.
    837     # (1, 1, M) for a single row - remove the singleton dimension there

ValueError: cannot set an array element with a sequence

Any ideas? (I don't know the exact number of columns in my file beforehand.)

Answer 1

Answered by Hooked

Use genfromtxt instead. It's a much more general method than loadtxt:

import numpy as np
print(np.genfromtxt('col.txt', dtype='str'))

Using the file col.txt:

foo bar
cat dog
man wine

This gives:

[['foo' 'bar']
 ['cat' 'dog']
 ['man' 'wine']]

If you expect each row to have the same number of columns, read the first row to get the column count and set the filling_values argument to fill in any missing values.
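
A self-contained version of the genfromtxt call above might look like the following sketch. It recreates the col.txt sample in a temporary file so it can be run as-is. The explicit encoding argument is an assumption (it requires NumPy 1.14 or newer), and the names sample, path and arr are made up for illustration.

```python
import os
import tempfile

import numpy as np

# Recreate the three-line sample file from the answer in a temporary location.
sample = "foo bar\ncat dog\nman wine\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    # dtype=str requests a string array for every column.
    # encoding is set explicitly (assumption: NumPy >= 1.14, where it exists).
    # The default delimiter splits on any run of whitespace.
    arr = np.genfromtxt(path, dtype=str, encoding="utf-8")
finally:
    os.remove(path)

print(arr.shape)   # number of rows and columns that were read
print(arr[2, 1])   # last row, second column
```

Because the column count is taken from the file itself, this does not require knowing the number of columns in advance, which was one of the constraints in the question.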

Answer 2

Answered by flonk

Do you really need a NumPy array? If not, you could speed things up by loading the data as a nested list.

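def load(fname):
    '''Load the file as a nested list using the built-in open().'''
    data = []
    # The with-block closes the file even if an error occurs.
    with open(fname, 'r') as f:
        # Iterate line by line instead of reading everything with readlines().
        for line in f:
            # Drop the newline, then split on single spaces.
            data.append(line.replace('\n', '').split(' '))

    return data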

For a text file with 4000x4000 words this is about 10 times faster than loadtxt.
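
As a rough sketch of how the nested-list approach could be used when the column count is not known in advance, the variant below also reports which row widths occur. It is not the answer's exact code: it splits on any whitespace with split() rather than on single spaces, skips blank lines, and the names load_rows and widths are made up for illustration.

```python
import os
import tempfile


def load_rows(fname):
    """Read a whitespace-separated file into a list of lists of strings."""
    with open(fname, "r") as f:
        # split() with no argument splits on any run of whitespace and
        # drops the trailing newline; blank lines are skipped.
        return [line.split() for line in f if line.strip()]


# Small sample file so the sketch runs on its own.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("foo bar\ncat dog\nman wine\n")
    path = tmp.name

try:
    rows = load_rows(path)
finally:
    os.remove(path)

# Collect the distinct row lengths; a single value means the data is
# rectangular and could be converted to a 2D array afterwards if needed.
widths = {len(row) for row in rows}
print(rows)
print(widths)
```

If widths contains more than one value, the file is ragged and would need padding or filtering before it could become a 2D array.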

Answer 3

Answered by Alexander Tronchin-James

Pandas also has read_csv, which is fast and supports non-comma column separators and automatic typing by column:

import pandas as pd
df = pd.read_csv('your_file',sep='\t')

If you prefer a NumPy array, you can convert it with:

import numpy as np
arr = np.array(df)

This is by far the easiest and most mature text import approach I've come across.
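
Since the question asks for every value as a string, automatic typing may not be wanted. One possible way to switch it off is sketched below, using an in-memory sample instead of a real file. The choices here are assumptions rather than part of the answer: header=None treats the first line as data, dtype=str keeps every column as text, keep_default_na=False stops entries such as "NA" from becoming NaN, and to_numpy() assumes pandas 0.24 or newer.

```python
import io

import pandas as pd

# Tab-separated sample data held in memory (hypothetical content).
text = "foo\tbar\n1\t2.5\nNA\tdog\n"

df = pd.read_csv(
    io.StringIO(text),
    sep="\t",
    header=None,            # the first line is data, not column names
    dtype=str,              # keep every column as strings
    keep_default_na=False,  # do not turn "NA" and similar into NaN
)

# Convert to a 2D NumPy array of Python strings.
arr = df.to_numpy()
print(arr.shape)
print(arr.tolist())
```

With these settings, numeric-looking fields such as 1 and 2.5 stay as the strings "1" and "2.5" instead of being parsed into numbers.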
