Original question: http://stackoverflow.com/questions/37956344/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on StackOverflow.
reading and doing calculation from .dat file in python
Asked by bhjghjh
I need to read a .dat file in Python that has 12 columns in total and millions of rows. For my calculation I need to divide columns 2, 3 and 4 by column 1. Before I load the .dat file, do I need to delete all the other, unwanted columns? If not, how do I select only the columns I need and have Python do the math?
An example of the .dat file is data.dat.
I am new to Python, so brief instructions on how to open and read the file and do the calculation would be appreciated.
I have added the code I am using as a starting point, based on your suggestion:
from sys import argv
import pandas as pd
script, filename = argv
txt = open(filename)
print "Here's your file %r:" % filename
print txt.read()
def your_func(row):
    return row['x-momentum'] / row['mass']
columns_to_keep = ['mass', 'x-momentum']
dataframe = pd.read_csv('~/Pictures', delimiter="," , usecols=columns_to_keep)
dataframe['new_column'] = dataframe.apply(your_func, axis=1)
And here is the error I get when I run it:
Traceback (most recent call last):
  File "flash.py", line 18, in <module>
    dataframe = pd.read_csv('~/Pictures', delimiter="," , usecols=columns_to_keep)
  File "/home/trina/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 529, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/trina/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 295, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/trina/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 612, in __init__
    self._make_engine(self.engine)
  File "/home/trina/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 747, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/home/trina/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 1119, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "pandas/parser.pyx", line 518, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:5030)
ValueError: No columns to parse from file
Answered by Bill
After looking at your flash.dat file, it's clear you need to do a little cleanup before you process it. The following code converts it to a CSV file:
import csv
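# read flash.dat into a list of lists (one list of whitespace-separated fields per line)
with open("./flash.dat") as f:
    datContent = [line.strip().split() for line in f]
# write it as a new CSV file; on Python 3 the csv module expects text mode with newline=""
with open("./flash.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(datContent)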
Now use pandas to compute the new column.
import pandas as pd
def your_func(row):
    return row['x-momentum'] / row['mass']
columns_to_keep = ['#time', 'x-momentum', 'mass']
dataframe = pd.read_csv("./flash.csv", usecols=columns_to_keep)
dataframe['new_column'] = dataframe.apply(your_func, axis=1)
print(dataframe)
Answered by ppaulojr
Answered by Parfait
Consider using the general read_table() function (the same parser as read_csv(), with a tab as the default separator), which lets pandas import this .dat file directly once you specify a whitespace separator, sep='\s+'. Additionally, no user-defined function with apply() is needed for column-by-column calculations.
Below, numpy is used to guard against division by zero. Also, the example .dat file's first column is #time, and columns 2, 3 and 4 are x-momentum, y-momentum and mass (the expression in your code is different, so revise as needed).
import pandas as pd
import numpy as np
columns_to_keep = ['#time', 'x-momentum', 'y-momentum', 'mass']
df = pd.read_table("flash.dat", sep=r"\s+", usecols=columns_to_keep)
df['mass_per_time'] = np.where(df['#time'] > 0, df['mass']/df['#time'], np.nan)
df['x-momentum_per_time'] = np.where(df['#time'] > 0, df['x-momentum']/df['#time'], np.nan)
df['y-momentum_per_time'] = np.where(df['#time'] > 0, df['y-momentum']/df['#time'], np.nan)
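To try the approach above without the original file, here is a minimal sketch that parses a small, made-up whitespace-separated sample from memory. The sample values, the extra `energy` column and the names `sample` and `safe_time` are assumptions for illustration; only the four column names come from the answer. It uses `read_csv` with `sep=r"\s+"`, and masks the zero times before dividing rather than using `np.where`, which should give the same NaN results.

```python
import io

import pandas as pd

# Made-up sample standing in for flash.dat (values are illustrative only).
sample = io.StringIO(
    "#time x-momentum y-momentum mass energy\n"
    "0.0   1.0        2.0        4.0  9.0\n"
    "2.0   3.0        5.0        8.0  9.0\n"
)

# Only the listed columns are loaded; "energy" is skipped at parse time.
columns_to_keep = ["#time", "x-momentum", "y-momentum", "mass"]
df = pd.read_csv(sample, sep=r"\s+", usecols=columns_to_keep)

# Replace non-positive times with NaN so the division never sees a zero denominator.
safe_time = df["#time"].where(df["#time"] > 0)
for col in ["x-momentum", "y-momentum", "mass"]:
    df[col + "_per_time"] = df[col] / safe_time

print(df)
```

Because `usecols` drops the unwanted columns while parsing, there should be no need to delete them from the file beforehand.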
Answered by Nisarg Bhatt
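import pandas as pd
# "Path" is a placeholder for the file path; a multi-character separator needs the Python parsing engine
train = pd.read_csv("Path", sep=" ::", header=None, engine="python")
Now you can access the .dat file.
train.columns = ["A", "B", "C"]  # one name per column in the .dat file
Then you can work with it as you would with a CSV file.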
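As a rough illustration of the snippet above, the following sketch reads two made-up rows that use " ::" as the field separator. The data and the derived column name `B_per_A` are hypothetical, and `engine="python"` is added here on the assumption that a multi-character separator is in use.

```python
import io

import pandas as pd

# Made-up rows using " ::" between fields (illustrative data only).
sample = io.StringIO("1 ::10.5 ::abc\n2 ::20.0 ::def\n")

# No header row in the sample, so column names are assigned afterwards.
train = pd.read_csv(sample, sep=" ::", header=None, engine="python")
train.columns = ["A", "B", "C"]

# Once loaded, columns can be combined directly, as with any CSV-backed DataFrame.
train["B_per_A"] = train["B"] / train["A"]
print(train)
```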
Answered by Jan Christoph Terasa
The problem you face here is that the column header names have whitespace in them. You need to fix or ignore that to make pandas.read_csv behave nicely. This will read the column header names into a list based on the fixed length of the field name strings:
import numpy as np
import pandas

with open('flash.dat') as f:
    header = f.readline()[2:-1]
    # each header field is 23 characters wide; slice the line into 26 fixed-width names
    header_fixed = [header[i*23:(i+1)*23].strip() for i in range(26)]
    header_fixed[0] = header_fixed[0][1:]  # remove '#' from time
    # pandas doesn't handle "Infinity" properly, read Infinity as NaN, then convert back to infinity
    df = pandas.read_csv(f, sep=r'\s+', names=header_fixed, na_values="Infinity")
df.fillna(np.inf, inplace=True)
# processing
df['new_column'] = df['x-momentum'] / df['mass']
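The fixed-width header slicing can be tried on a synthetic example. This is only a sketch: the column names (including the made-up `total mass`, which contains a space), the single data row and the `FIELD_WIDTH` constant are assumptions, with 23 taken from the field width used above.

```python
import io

import pandas as pd

FIELD_WIDTH = 23  # header field width, as in the answer above
names = ["#time", "x-momentum", "total mass"]  # hypothetical; one name contains a space

# Build a synthetic file: fixed-width header line followed by one whitespace-separated row.
text = "".join(n.ljust(FIELD_WIDTH) for n in names) + "\n" + "1.0 2.0 4.0\n"
lines = text.splitlines()

# Slice the header by position instead of splitting on whitespace.
header = lines[0]
header_fixed = [
    header[i * FIELD_WIDTH:(i + 1) * FIELD_WIDTH].strip() for i in range(len(names))
]
header_fixed[0] = header_fixed[0].lstrip("#")  # drop the leading '#'

# Parse the remaining lines with the recovered names.
df = pd.read_csv(io.StringIO("\n".join(lines[1:])), sep=r"\s+", names=header_fixed)
df["new_column"] = df["x-momentum"] / df["total mass"]
print(df)
```

Splitting the header on whitespace would have produced four names for three data columns here, which is the mismatch the positional slicing avoids.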