Reading and doing calculations from a .dat file in Python

Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original: http://stackoverflow.com/questions/37956344/

Time: 2020-08-19 20:09:11  Source: igfitidea

python, csv

Asked by bhjghjh

I need to read a .dat file in Python that has 12 columns in total and millions of rows. I need to divide columns 2, 3, and 4 by column 1 for my calculation. Before I load the .dat file, do I need to delete all the other unwanted columns? If not, how do I select the columns I need and have Python do the math?

An example of the .dat file would be data.dat.

I am new to Python, so some instruction on opening the file, reading it, and doing the calculation would be appreciated.

I have added the code I am using as a starter from your suggestion:

from sys import argv
import pandas as pd

script, filename = argv
txt = open(filename)

print "Here's your file %r:" % filename
print txt.read()

def your_func(row):
    return row['x-momentum'] / row['mass']

columns_to_keep = ['mass', 'x-momentum']
dataframe = pd.read_csv('~/Pictures', delimiter="," , usecols=columns_to_keep)
dataframe['new_column'] = dataframe.apply(your_func, axis=1)

And here is the error I get from it:

Traceback (most recent call last):
  File "flash.py", line 18, in <module>
    dataframe = pd.read_csv('~/Pictures', delimiter="," , usecols=columns_to_keep)
  File "/home/trina/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 529, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/trina/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 295, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/trina/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 612, in __init__
    self._make_engine(self.engine)
  File "/home/trina/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 747, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/home/trina/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 1119, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "pandas/parser.pyx", line 518, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:5030)
ValueError: No columns to parse from file

Answered by Bill

After looking at your flash.dat file, it's clear you need to do a little cleanup before you process it. The following code converts it to a CSV file:

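import csv

# read flash.dat into a list of lists, splitting each line on whitespace
with open("./flash.dat") as dat_file:
    datContent = [line.strip().split() for line in dat_file]

# write it as a new CSV file
# ("wb" is the Python 2 mode; on Python 3 use open("./flash.csv", "w", newline=""))
with open("./flash.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerows(datContent)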

Now use pandas to compute the new column.

import pandas as pd

def your_func(row):
    return row['x-momentum'] / row['mass']

columns_to_keep = ['#time', 'x-momentum', 'mass']
dataframe = pd.read_csv("./flash.csv", usecols=columns_to_keep)
dataframe['new_column'] = dataframe.apply(your_func, axis=1)

print(dataframe)

Answered by ppaulojr

Try something like:

datContent = [i.strip().split() for i in open("filename.dat").readlines()]

Then you'll have your data in a list of lists, with one list of string fields per line.

If you want something more sophisticated, you can use pandas; see the linked cookbook.
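
As a rough sketch of what working with that list of lists might look like without pandas: every field comes back from split() as a string, so it has to be converted to a number before dividing. The sample text, the column order, and the helper name divide_by_first are assumptions made for illustration; they are not taken from the actual file.

```python
# Hypothetical sample standing in for the contents of a .dat file;
# the real file's layout may differ.
sample = """\
1.0 2.0 4.0 8.0
2.0 3.0 6.0 9.0
0.0 1.0 1.0 1.0
"""


def divide_by_first(rows):
    """Divide columns 2-4 of each row by column 1 (hypothetical helper)."""
    results = []
    for fields in rows:
        values = [float(x) for x in fields[:4]]  # fields are strings after split()
        first = values[0]
        if first == 0:
            # avoid ZeroDivisionError; mark the results as missing
            results.append([None, None, None])
        else:
            results.append([v / first for v in values[1:4]])
    return results


datContent = [line.strip().split() for line in sample.splitlines()]
ratios = divide_by_first(datContent)
print(ratios)
```

This is fine for small files, but for millions of rows a Python-level loop like this is likely to be much slower than the pandas approaches in the other answers.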

Answered by Parfait

Consider using the general read_table() function (of which read_csv() is a special type), with which pandas can easily import this .dat file by specifying the whitespace separator, sep='\s+'. Additionally, no user-defined function with apply() is needed for column-by-column calculations.

Below, numpy is used to guard against division by zero. Also, the example .dat file's first column is #time, and columns 2, 3, and 4 are x-momentum, y-momentum, and mass (your code uses a different expression, so revise as needed).

import pandas as pd
import numpy as np

columns_to_keep = ['#time', 'x-momentum', 'y-momentum', 'mass']
df = pd.read_table("flash.dat", sep=r"\s+", usecols=columns_to_keep)

df['mass_per_time'] = np.where(df['#time'] > 0, df['mass']/df['#time'], np.nan)
df['x-momentum_per_time'] = np.where(df['#time'] > 0, df['x-momentum']/df['#time'], np.nan)
df['y-momentum_per_time'] = np.where(df['#time'] > 0, df['y-momentum']/df['#time'], np.nan)
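
To try the same idea without the real file, here is a minimal self-contained sketch on an in-memory sample. The sample data, the plain column name time (instead of #time), and the extra energy column are assumptions for illustration only. It shows that usecols skips unwanted columns at read time, so nothing has to be deleted from the file beforehand, and that the three divisions can be done in a single vectorized step.

```python
import io

import pandas as pd

# Hypothetical whitespace-delimited sample; the real flash.dat has more columns
# and may have a different header layout.
sample = io.StringIO(
    "time x-momentum y-momentum mass energy\n"
    "0.0  1.0  2.0  3.0  9.9\n"
    "2.0  4.0  6.0  8.0  9.9\n"
    "4.0  2.0  1.0  6.0  9.9\n"
)

# usecols keeps only the columns needed; the unwanted 'energy' column is
# skipped while reading.
columns_to_keep = ["time", "x-momentum", "y-momentum", "mass"]
df = pd.read_csv(sample, sep=r"\s+", usecols=columns_to_keep)

# Mask rows where time is not positive to NaN, so they yield NaN instead of inf.
safe_time = df["time"].where(df["time"] > 0)

# Divide columns 2-4 by column 1, row by row, in one vectorized call.
ratios = df[["x-momentum", "y-momentum", "mass"]].div(safe_time, axis=0)
ratios = ratios.add_suffix("_per_time")

df = df.join(ratios)
print(df)
```

Masking the divisor once and calling div() is just one way to write it; the np.where() form above gives the same NaN-for-zero behavior column by column.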

Answered by Nisarg Bhatt

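import pandas as pd

# a multi-character separator is treated as a regular expression,
# which requires the Python parsing engine
train = pd.read_csv("Path", sep=" ::", header=None, engine="python")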

Now you can access the contents of the .dat file.

train.columns = ["A", "B", "C"]  # one name per column you can see in the .dat file

Then you can work with it just like data loaded from a CSV file.
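
A small sketch of this approach on made-up data, assuming a headerless file whose fields are separated by "::". The sample values and the column names A, B, and C are placeholders, not the layout of the asker's file.

```python
import io

import pandas as pd

# Hypothetical headerless sample using "::" as the field separator.
sample = io.StringIO("1::10::100\n2::20::200\n")

# header=None tells pandas the first line is data, not column names.
# A separator longer than one character is treated as a regular expression,
# which needs the Python parsing engine.
train = pd.read_csv(sample, sep="::", header=None, engine="python")

# Assign one name per column, then compute on the columns by name.
train.columns = ["A", "B", "C"]
train["B_over_A"] = train["B"] / train["A"]
print(train)
```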

Answered by Jan Christoph Terasa

The problem you face here is that the column header names contain whitespace. You need to fix or work around that to make pandas.read_csv behave nicely. The following reads the column header names into a list based on the fixed length of the field name strings:

import numpy as np
import pandas

with open('flash.dat') as f:
    header = f.readline()[2:-1]
    header_fixed = [header[i*23:(i+1)*23].strip() for i in range(26)]
    header_fixed[0] = header_fixed[0][1:] # remove '#' from time

    # pandas doesn't handle "Infinity" properly, read Infinity as NaN, then convert back to infinity
    df = pandas.read_csv(f, sep=r'\s+', names=header_fixed, na_values="Infinity")
    # pandas.np was removed from newer pandas releases, so use numpy directly
    df.fillna(np.inf, inplace=True)

# processing
df['new_column'] = df['x-momentum'] / df['mass']
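
To see why fixed-width slicing helps, here is a minimal sketch on a made-up file whose header names contain spaces. The field width of 12, the column names, and the data values are assumptions for this sample only (the code above uses a width of 23 for the real file).

```python
import io

import pandas as pd

WIDTH = 12  # assumed field width for this sample

# Hypothetical file: a "# "-prefixed header whose names contain spaces,
# followed by whitespace-separated numeric rows.
names = ["time", "x momentum", "total mass"]
header_line = "# " + "".join(name.ljust(WIDTH) for name in names) + "\n"
sample = io.StringIO(header_line + "1.0 2.0 4.0\n2.0 3.0 6.0\n")

# Consume the header line, drop the leading "# " and the trailing newline,
# then cut it into fixed-width slices instead of splitting on whitespace
# (splitting would break "x momentum" into two names).
header = sample.readline()[2:-1]
header_fixed = [header[i:i + WIDTH].strip() for i in range(0, len(header), WIDTH)]

# The remaining lines are read from the current position of the file object.
df = pd.read_csv(sample, sep=r"\s+", names=header_fixed)
df["ratio"] = df["x momentum"] / df["time"]
print(df)
```

Because the header line has already been consumed, read_csv only sees the data rows, and the names argument supplies the cleaned-up column names.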