Python: load a text file as strings using numpy.loadtxt()

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/14985233/

Time: 2020-08-18 13:00:05  Source: igfitidea

Load text file as strings using numpy.loadtxt()

python numpy

Asked by user1966176

I would like to load a big text file (around 1 GB, with 3*10^6 rows and 10 - 100 columns) as a 2D NumPy array containing strings. However, numpy.loadtxt() seems to accept only floats by default. Is it possible to specify another data type for the entire array? I've tried the following without luck:

loadedData = np.loadtxt(address, dtype=np.str)

I get the following error message:

/Library/Python/2.7/site-packages/numpy-1.8.0.dev_20224ea_20121123-py2.7-macosx-10.8-x86_64.egg/numpy/lib/npyio.pyc in loadtxt(fname, dtype, comments, delimiter, converters, skiprows, usecols, unpack, ndmin)
    833             fh.close()
    834
--> 835     X = np.array(X, dtype)
    836     # Multicolumn data are returned with shape (1, N, M), i.e.
    837     # (1, 1, M) for a single row - remove the singleton dimension there

ValueError: cannot set an array element with a sequence

Any ideas? (I don't know the exact number of columns in my file beforehand.)

Answered by Hooked

Use genfromtxt instead. It's a much more general method than loadtxt:

import numpy as np
# dtype='str' makes genfromtxt return every field as a string
print(np.genfromtxt('col.txt', dtype='str'))

Using the file col.txt:

foo bar
cat dog
man wine

This gives:

[['foo' 'bar']
 ['cat' 'dog']
 ['man' 'wine']]

If you expect each row to have the same number of columns, read the first row and set the filling_values argument to fill in any missing entries.
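
The "read the first row, then fill what is missing" idea can also be sketched in plain Python. This is only an illustration under stated assumptions: it does not use genfromtxt's own filling_values mechanism, and the helper name load_padded, the whitespace delimiter, the empty-string fill value and the sample data are all hypothetical.

```python
import io
import numpy as np

def load_padded(lines, fill=""):
    """Split whitespace-separated lines and pad short rows with `fill`."""
    # Skip blank lines; split() with no argument splits on any whitespace.
    rows = [line.split() for line in lines if line.strip()]
    ncols = len(rows[0])  # column count is taken from the first row
    # Truncate rows that are too long and pad rows that are too short.
    padded = [row[:ncols] + [fill] * (ncols - len(row)) for row in rows]
    return np.array(padded, dtype=str)

# Hypothetical sample: the second row is missing its last column.
sample = io.StringIO("foo bar\ncat\nman wine\n")
arr = load_padded(sample)
print(arr.shape)
```

Because every row is forced to the width of the first one, the result is always a rectangular 2D array of strings.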

Answered by flonk

Is it essential that you have a NumPy array? If not, you could speed things up by loading the data as a nested list.

def load(fname):
    '''Load a space-separated text file as a nested list of strings.'''
    data = []
    # The with-block closes the file even if an error occurs
    with open(fname, 'r') as f:
        for line in f:
            data.append(line.replace('\n', '').split(' '))

    return data

For a text file with 4000x4000 words this is about 10 times faster than loadtxt.
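
If an array is needed after all, the nested list can still be converted afterwards. The following is a minimal usage sketch, assuming every row has the same number of fields; the temporary file and its contents are made up for illustration, and load is restated here only to keep the example self-contained.

```python
import os
import tempfile
import numpy as np

def load(fname):
    """Read a space-separated text file into a nested list of strings."""
    with open(fname, "r") as f:
        return [line.rstrip("\n").split(" ") for line in f]

# Write a small sample file (hypothetical contents, for illustration only).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("foo bar\ncat dog\nman wine\n")
    path = tmp.name

data = load(path)                # nested list of str
arr = np.array(data, dtype=str)  # rectangular only if all rows have equal length
os.remove(path)
print(arr.shape)
```

Note that the conversion step costs extra time and memory, so it gives back part of the speed advantage.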

Answered by Alexander Tronchin-James

There is also read_csv in Pandas, which is fast and supports non-comma column separators and automatic typing by column:

import pandas as pd
df = pd.read_csv('your_file',sep='\t')

If you prefer a NumPy array, it can be converted with:

import numpy as np
arr = np.array(df)
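
Since the question asks for strings rather than automatically typed columns, the two steps above might be combined roughly as follows. This is a sketch under assumptions: the file is tab-separated with no header row, and the in-memory sample stands in for the real file. The header=None and dtype=str arguments are not part of the original answer.

```python
import io
import pandas as pd

# Hypothetical tab-separated sample; the second row looks numeric.
sample = io.StringIO("foo\tbar\n1\t2\nman\twine\n")

# header=None: treat the first line as data, not as column names.
# dtype=str: keep every column as text instead of inferring numeric types.
df = pd.read_csv(sample, sep="\t", header=None, dtype=str)

arr = df.to_numpy()  # 2D array holding the values as strings
print(arr.shape)
```

With dtype=str the row "1 2" stays as the strings '1' and '2' instead of being parsed as integers.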

This is by far the easiest and most mature text import approach I've come across.
