How to read a dataset from a txt file in Python?

Question

Asked by Ewybe

I have a dataset in this format:

example data

I need to import the data and work with it.

The main problem is that the first and the fourth columns are strings while the second and third columns are floats and ints, respectively.

I'd like to put the data in a matrix or at least obtain a list of each column's data.

I tried to read the whole dataset as a string but it's a mess:

f = open ( 'input.txt' , 'r')
l = [ map(str,line.split('\t')) for line in f ]

What could be a good solution?

Answer 1

Answered by Padraic Cunningham

Split each line and transpose the resulting list, so that each inner list holds one column:

with open('in.txt', 'r') as f:  # use with to open your files; it closes them automatically
    l = [x.split() for x in f]
    rows = [list(x) for x in zip(*l)]
    # list() is needed on Python 3, where map returns a lazy iterator
    rows[1], rows[2] = list(map(float, rows[1])), list(map(int, rows[2]))

In [16]: rows
Out[16]: 
[['bbbbffdd', 'bbbWWWff', 'ajkfbdafa'],
 [434343.0, 43545343.0, 2345345.0],
 [228, 289, 2312],
 ['D', 'E', 'F']]

Answer 2

Answered by ford

Here's a solution to read in the data and convert those second and third columns to numeric types:

rows = []
with open('input.txt', 'r') as f:
    for line in f:
        # Split on any whitespace (including tab characters)
        row = line.split()
        # Convert strings to numeric values:
        row[1] = float(row[1])
        row[2] = int(row[2])
        # Append to our list of lists:
        rows.append(row)

print(rows)

With the following input.txt:

string1 5.005069    284 D
string2 5.005049    142 D
string3 5.005066    284 D
string4 5.005037    124 D

It produces the following output:

[['string1', 5.005069, 284, 'D'], 
 ['string2', 5.005049, 142, 'D'], 
 ['string3', 5.005066, 284, 'D'], 
 ['string4', 5.005037, 124, 'D']]
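
If the file may contain blank lines (a trailing empty line, for instance), line.split() returns an empty list and row[1] raises an IndexError. Here is a minimal sketch of the same loop that skips such lines. The helper name parse_rows and the in-memory sample are illustrative assumptions, not part of the original answer:

```python
import io


def parse_rows(lines):
    """Parse whitespace-separated lines into [str, float, int, str] rows."""
    rows = []
    for line in lines:
        row = line.split()
        if not row:  # skip blank or whitespace-only lines
            continue
        row[1] = float(row[1])
        row[2] = int(row[2])
        rows.append(row)
    return rows


# In-memory stand-in for input.txt (assumed sample, with a blank line added)
sample = io.StringIO("string1\t5.005069\t284\tD\n\nstring2\t5.005049\t142\tD\n")
rows = parse_rows(sample)
print(rows)
```

Because parse_rows accepts any iterable of lines, an open file object can be passed in place of the StringIO sample.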

Answer 3

Answered by mhawke

You seem to have CSV data (with tabs as the delimiter) so why not use the csv module?

import csv

with open('data.csv') as f:
    reader = csv.reader(f, delimiter='\t')
    data = [(col1, float(col2), int(col3), col4)
                for col1, col2, col3, col4 in reader]

data is a list of tuples containing the converted data (column 2 -> float, column 3 -> int). If data.csv contains (with tabs, not spaces):

thing1  5.005069    284 D
thing2  5.005049    142 D
thing3  5.005066    248 D
thing4  5.005037    124 D

data would contain:

[('thing1', 5.005069, 284, 'D'),
 ('thing2', 5.005049, 142, 'D'),
 ('thing3', 5.005066, 248, 'D'),
 ('thing4', 5.005037, 124, 'D')]
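
The "tabs, not spaces" requirement matters when a field itself contains a space. The sketch below contrasts str.split() with csv.reader on a hypothetical line whose first field is "thing one". The sample line is an assumption made up for illustration:

```python
import csv
import io

# Hypothetical tab-delimited line whose first field contains a space
text = "thing one\t5.005069\t284\tD\n"

# str.split() with no argument splits on every run of whitespace
split_fields = text.split()

# csv.reader splits only on the tab delimiter
csv_fields = next(csv.reader(io.StringIO(text), delimiter="\t"))

print(split_fields)  # 5 fields: 'thing' and 'one' are separated
print(csv_fields)    # 4 fields: 'thing one' stays intact
```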

Answer 4

Answered by Sudipta Basak

You can use pandas. It is great for reading CSV files, tab-delimited files, etc. Pandas almost always infers the data types correctly, and it returns a NumPy array when you access a column's values, as demonstrated below.

I used this tab-delimited 'test.txt' file:

    bbbbffdd    434343  228 D 
    bbbWWWff    43545343    289 E
    ajkfbdafa   2345345 2312    F

Here is the pandas code. Your file is read into a DataFrame with a single line of Python. You can change the 'sep' value to match the delimiter used in your file.

    import pandas as pd
    X = pd.read_csv('test.txt', sep="\t", header=None)

Then try:

    print(X)
            0         1     2   3
    0   bbbbffdd    434343   228  D 
    1   bbbWWWff  43545343   289   E
    2  ajkfbdafa   2345345  2312   F

    print(X[0])
    0     bbbbffdd
    1     bbbWWWff
    2    ajkfbdafa

    print(X[2])
    0     228
    1     289
    2    2312

    print(X[1][1:])
    1    43545343
    2     2345345

You can add column names as:

    X.columns = ['random_letters', 'number', 'simple_number', 'letter']

And then get the columns as:

    X['number'].values
    array([  434343, 43545343,  2345345])

Answer 5

Answered by Rakesh Arya

Use numpy.loadtxt("data.txt") to read the data as an array of rows:

[[row1],[row2],[row3]...]

Each row holds one element per column:

[row1] = [col1, col2, col3, ...]

Use dtype=str to read each entry as a string. The default dtype is float, which fails on the non-numeric columns.

You can convert corresponding values to integer, float, etc. with a for loop.
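
A minimal sketch of this approach, assuming a tab-delimited file with the column layout from the question. The in-memory sample stands in for data.txt, and the conversion uses the array method astype in place of an explicit for loop:

```python
import io

import numpy as np

# In-memory stand-in for data.txt (assumed tab-delimited sample)
text = "string1\t5.005069\t284\tD\nstring2\t5.005049\t142\tD\n"

# Read every entry as a string; the result is a 2-D array of rows
data = np.loadtxt(io.StringIO(text), dtype=str, delimiter="\t")

# Convert the numeric columns; astype converts each element in turn
floats = data[:, 1].astype(float)
ints = data[:, 2].astype(int)

print(data.shape)
print(floats, ints)
```

The string columns remain available as data[:, 0] and data[:, 3].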

Reference: https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.loadtxt.html
