How to read a dataset from a txt file in Python?
Original question: http://stackoverflow.com/questions/25013792/
Note: this content comes from StackOverflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors and link to the original question.
Asked by Ewybe
I have a dataset in this format:


I need to import the data and work with it.
The main problem is that the first and the fourth columns are strings while the second and third columns are floats and ints, respectively.
I'd like to put the data in a matrix or at least obtain a list of each column's data.
I tried to read the whole dataset as a string but it's a mess:
f = open ( 'input.txt' , 'r')
l = [ map(str,line.split('\t')) for line in f ]
What could be a good solution?
Answered by Padraic Cunningham
Split and transpose the list:
with open('in.txt', 'r') as f:  # use with to open your files; it closes them automatically
    l = [x.split() for x in f]
# Transpose, so that each entry of rows holds one column of the file
rows = [list(x) for x in zip(*l)]
# list(...) is needed on Python 3, where map returns an iterator
rows[1], rows[2] = list(map(float, rows[1])), list(map(int, rows[2]))
In [16]: rows
Out[16]:
[['bbbbffdd', 'bbbWWWff', 'ajkfbdafa'],
 [434343.0, 43545343.0, 2345345.0],
 [228, 289, 2312],
 ['D', 'E', 'F']]
Answered by ford
Here's a solution to read in the data and convert the second and third columns to numeric types:
rows = []
with open('input.txt', 'r') as f:
    for line in f:
        # Split on any whitespace (including tab characters)
        row = line.split()
        # Convert strings to numeric values:
        row[1] = float(row[1])
        row[2] = int(row[2])
        # Append to our list of lists:
        rows.append(row)
print(rows)
With the following input.txt:
string1 5.005069 284 D
string2 5.005049 142 D
string3 5.005066 284 D
string4 5.005037 124 D
It produces the following output:
[['string1', 5.005069, 284, 'D'],
['string2', 5.005049, 142, 'D'],
['string3', 5.005066, 284, 'D'],
['string4', 5.005037, 124, 'D']]
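The same loop can be generalised so that the per-column types are declared once. The following is a minimal sketch, not part of the original answer: the helper name read_typed_rows and the CONVERTERS tuple are hypothetical, and it assumes every non-blank line has exactly four whitespace-separated fields. It also transposes the result to get one list per column, which is what the question asked for.

```python
import os
import tempfile

# Hypothetical per-column converters: str, float, int, str (as in the question).
CONVERTERS = (str, float, int, str)

def read_typed_rows(path, converters=CONVERTERS):
    """Return a list of rows, converting each field with its column's converter."""
    rows = []
    with open(path, 'r') as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            if len(fields) != len(converters):
                raise ValueError('unexpected number of fields: %r' % line)
            rows.append([conv(field) for conv, field in zip(converters, fields)])
    return rows

def demo():
    # Write a small sample file so the sketch is self-contained.
    sample = "string1\t5.005069\t284\tD\nstring2\t5.005049\t142\tD\n"
    with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
        tmp.write(sample)
    try:
        rows = read_typed_rows(tmp.name)
    finally:
        os.remove(tmp.name)
    # Transpose to get one list per column.
    columns = [list(col) for col in zip(*rows)]
    return rows, columns

rows, columns = demo()
print(rows)
print(columns)
```

Raising on a line with the wrong number of fields is a design choice for this sketch; skipping or logging such lines may suit other data better.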
Answered by mhawke
You seem to have CSV data (with tabs as the delimiter), so why not use the csv module?
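import csv
# newline='' is what the csv module documentation recommends on Python 3
with open('data.csv', newline='') as f:
    reader = csv.reader(f, delimiter='\t')
    # Convert column 2 to float and column 3 to int while reading
    data = [(col1, float(col2), int(col3), col4)
            for col1, col2, col3, col4 in reader]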
data is a list of tuples containing the converted data (column 2 -> float, column 3 -> int). If data.csv contains (with tabs, not spaces):
thing1 5.005069 284 D
thing2 5.005049 142 D
thing3 5.005066 248 D
thing4 5.005037 124 D
data would contain:
[('thing1', 5.005069, 284, 'D'),
('thing2', 5.005049, 142, 'D'),
('thing3', 5.005066, 248, 'D'),
('thing4', 5.005037, 124, 'D')]
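One practical difference from line.split() is that csv.reader with delimiter='\t' splits only on tabs, so a field that contains spaces stays intact. Here is a small sketch of that difference, using made-up in-memory data (the 'thing one' values are illustrative and not from the question):

```python
import csv
import io

# Illustrative tab-separated data; the first field contains a space.
text = "thing one\t5.005069\t284\tD\nthing two\t5.005049\t142\tD\n"

# csv.reader splits on tabs only, so 'thing one' stays a single field.
reader = csv.reader(io.StringIO(text), delimiter='\t')
data = [(col1, float(col2), int(col3), col4)
        for col1, col2, col3, col4 in reader]

# str.split() with no arguments splits on any whitespace,
# so the same first line breaks into five fields instead of four.
naive_first_row = text.splitlines()[0].split()

print(data)
print(naive_first_row)
```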
Answered by Sudipta Basak
You can use pandas. It is great for reading CSV files, tab-delimited files, etc. pandas almost always infers the data types correctly, and it gives you the values as a NumPy array when you access a column's .values, as demonstrated here.
I used this tab-delimited 'test.txt' file:
bbbbffdd 434343 228 D
bbbWWWff 43545343 289 E
ajkfbdafa 2345345 2312 F
Here is the pandas code. Your file is read into a DataFrame with a single line of Python. You can change the 'sep' value to suit your file.
import pandas as pd
X = pd.read_csv('test.txt', sep="\t", header=None)
Then try:
print(X)
           0         1     2  3
0   bbbbffdd    434343   228  D
1   bbbWWWff  43545343   289  E
2  ajkfbdafa   2345345  2312  F
print(X[0])
0     bbbbffdd
1     bbbWWWff
2    ajkfbdafa
print(X[2])
0     228
1     289
2    2312
print(X[1][1:])
1    43545343
2     2345345
You can add column names as:
X.columns = ['random_letters', 'number', 'simple_number', 'letter']
And then get the columns as:
X['number'].values
array([ 434343, 43545343, 2345345])
Answered by Rakesh Arya
Use numpy.loadtxt("data.txt") to read the data as a list of rows:
[[row1], [row2], [row3], ...]
Each row holds one element per column:
[row1] = [col1, col2, col3, ...]
Pass dtype=str to read each entry as a string. You can then convert the relevant values to integer, float, etc. with a for loop.
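As a rough sketch of this approach (not the answerer's own code), the snippet below reads whitespace-separated text with dtype=str and then converts the numeric columns. It uses the array method astype for the conversion instead of an explicit for loop, and it reads from an in-memory io.StringIO that stands in for data.txt; both are assumptions made to keep the example short and self-contained.

```python
import io
import numpy as np

# Illustrative in-memory data standing in for data.txt.
text = (
    "bbbbffdd 434343 228 D\n"
    "bbbWWWff 43545343 289 E\n"
    "ajkfbdafa 2345345 2312 F\n"
)

# dtype=str reads every entry as a string; the result is a 2-D array of rows.
raw = np.loadtxt(io.StringIO(text), dtype=str)

# Convert the columns that should be numeric (astype instead of a for loop).
names = raw[:, 0].tolist()
floats = raw[:, 1].astype(float)
ints = raw[:, 2].astype(int)
letters = raw[:, 3].tolist()

print(raw.shape)
print(floats, ints)
```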
Reference: https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.loadtxt.html

