pandas: Importing a .DAT file into a pandas DataFrame

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/40409321/

Date: 2020-09-14 02:20:43  Source: igfitidea

Importing .DAT file into pandas dataframe

Tags: python, pandas

Asked by ValientProcess

I have a .DAT file with rows like these:

2016 01 01 00 00 19 348 2.05 7 618.4
2016 01 01 00 01 19 351 2.05 7 618.4
2016 01 01 00 02 18 0 2.05 7 618.4
2016 01 01 00 03 17 353 2.05 7 618.4
2016 01 01 00 04 19 346 2.02 7 618.4
2016 01 01 00 05 20 345 2.00 7 618.4
2016 01 01 00 06 22 348 1.97 7 618.4
.......

The data format is:

year month day hour minute(HST) wind_speed(kts) wind_direction(dec) temperature(C) relative_humidity(%) pressure

I want to import the .DAT file into a pandas dataframe, with the year-month-day-hour-minute as a single index column, and the rest of the values as separate columns.

Any suggestions?

Thanks !!

Accepted answer by jezrael

You can use read_csv:

import datetime as dt
from io import StringIO

import pandas as pd

temp = u"""2016 01 01 00 00 19 348 2.05 7 618.4
2016 01 01 00 01 19 351 2.05 7 618.4
2016 01 01 00 02 18 0 2.05 7 618.4
2016 01 01 00 03 17 353 2.05 7 618.4
2016 01 01 00 04 19 346 2.02 7 618.4
2016 01 01 00 05 20 345 2.00 7 618.4
2016 01 01 00 06 22 348 1.97 7 618.4"""
# after testing, replace StringIO(temp) with the filename

# parses the space-joined date columns, e.g. '2016 01 01 00 00'
# note: date_parser and nested parse_dates lists are deprecated in pandas 2.x
parser = lambda date: dt.datetime.strptime(date, '%Y %m %d %H %M')
df = pd.read_csv(StringIO(temp),
                 sep=r"\s+",  # whitespace separator
                 index_col=0,  # use the combined date column as the DatetimeIndex
                 date_parser=parser,  # function for converting dates
                 parse_dates=[[0, 1, 2, 3, 4]],  # combine these columns into one datetime
                 header=None)  # the file has no header row

Then set the column names afterwards, because passing the names parameter to read_csv here raises:

NotImplementedError: file structure not yet supported

df.columns = ['wind_speed(kts)', 'wind_direction(dec)', 'temperature(C)', 'relative_humidity(%)', 'pressure'] 
#remove index name
df.index.name = None 
print (df)
                     wind_speed(kts)  wind_direction(dec)  temperature(C)  \
2016-01-01 00:00:00               19                  348            2.05   
2016-01-01 00:01:00               19                  351            2.05   
2016-01-01 00:02:00               18                    0            2.05   
2016-01-01 00:03:00               17                  353            2.05   
2016-01-01 00:04:00               19                  346            2.02   
2016-01-01 00:05:00               20                  345            2.00   
2016-01-01 00:06:00               22                  348            1.97   

                     relative_humidity(%)  pressure  
2016-01-01 00:00:00                     7     618.4  
2016-01-01 00:01:00                     7     618.4  
2016-01-01 00:02:00                     7     618.4  
2016-01-01 00:03:00                     7     618.4  
2016-01-01 00:04:00                     7     618.4  
2016-01-01 00:05:00                     7     618.4  
2016-01-01 00:06:00                     7     618.4  

print (df.dtypes)
wind_speed(kts)           int64
wind_direction(dec)       int64
temperature(C)          float64
relative_humidity(%)      int64
pressure                float64
dtype: object

print (df.index)
DatetimeIndex(['2016-01-01 00:00:00', '2016-01-01 00:01:00',
               '2016-01-01 00:02:00', '2016-01-01 00:03:00',
               '2016-01-01 00:04:00', '2016-01-01 00:05:00',
               '2016-01-01 00:06:00'],
              dtype='datetime64[ns]', freq=None)
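
As an alternative sketch (not part of the original answer), the same result can be reached without date_parser, which recent pandas releases deprecate. It assumes the five date fields are read as ordinary integer columns and then assembled with pd.to_datetime, which accepts a DataFrame whose columns are named year, month, day, hour and minute. The helper name read_dat and the shortened sample are made up for illustration.

```python
from io import StringIO

import pandas as pd

# Shortened sample of the rows from the question (illustration only).
SAMPLE = """2016 01 01 00 00 19 348 2.05 7 618.4
2016 01 01 00 01 19 351 2.05 7 618.4
2016 01 01 00 02 18 0 2.05 7 618.4
"""

DATE_COLS = ["year", "month", "day", "hour", "minute"]
VALUE_COLS = ["wind_speed(kts)", "wind_direction(dec)", "temperature(C)",
              "relative_humidity(%)", "pressure"]


def read_dat(source):
    """Hypothetical helper: read the whitespace-separated file and
    build a DatetimeIndex from the five leading date columns."""
    raw = pd.read_csv(source, sep=r"\s+", header=None,
                      names=DATE_COLS + VALUE_COLS)
    # pd.to_datetime assembles timestamps from year/month/day/... columns
    stamps = pd.to_datetime(raw[DATE_COLS])
    df = raw[VALUE_COLS].copy()
    # to_numpy() drops the Series name, so the index stays unnamed
    df.index = pd.DatetimeIndex(stamps.to_numpy())
    return df


df = read_dat(StringIO(SAMPLE))
print(df)
```

After testing, StringIO(SAMPLE) can be replaced with the path to the .DAT file. Because the names are supplied up front, no separate df.columns assignment is needed in this variant.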

Answer by MaxU

Here is a slightly faster version:

In [86]: df = (pd.read_csv(fn, sep=r'\s+', header=None,
    ...:                   parse_dates={'Date':[0,1,2,3,4]},
    ...:                   date_parser=lambda x: pd.to_datetime(x, format='%Y %m %d %H %M'))
    ...:         .set_index('Date'))
    ...:

In [87]: df
Out[87]:
                      5    6     7  8      9
Date
2016-01-01 00:00:00  19  348  2.05  7  618.4
2016-01-01 00:01:00  19  351  2.05  7  618.4
2016-01-01 00:02:00  18    0  2.05  7  618.4
2016-01-01 00:03:00  17  353  2.05  7  618.4
2016-01-01 00:04:00  19  346  2.02  7  618.4
2016-01-01 00:05:00  20  345  2.00  7  618.4
2016-01-01 00:06:00  22  348  1.97  7  618.4

In [88]: cols_str = 'wind_speed(kts) wind_direction(dec) temperature(C) relative_humidity(%) pressure'
    ...: cols = cols_str.split()
    ...:

In [89]: cols
Out[89]:
['wind_speed(kts)',
 'wind_direction(dec)',
 'temperature(C)',
 'relative_humidity(%)',
 'pressure']

In [90]: df.columns = cols

In [91]: df
Out[91]:
                     wind_speed(kts)  wind_direction(dec)  temperature(C)  relative_humidity(%)  pressure
Date
2016-01-01 00:00:00               19                  348            2.05                     7     618.4
2016-01-01 00:01:00               19                  351            2.05                     7     618.4
2016-01-01 00:02:00               18                    0            2.05                     7     618.4
2016-01-01 00:03:00               17                  353            2.05                     7     618.4
2016-01-01 00:04:00               19                  346            2.02                     7     618.4
2016-01-01 00:05:00               20                  345            2.00                     7     618.4
2016-01-01 00:06:00               22                  348            1.97                     7     618.4
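
The answer above relies on pd.to_datetime with an explicit format string. A rough sketch of the same idea applied after loading, rather than through date_parser, might look like the following. It assumes the date fields are read as strings (so the zero padding survives), joined with spaces, and parsed in a single call. The variable names and the shortened sample are illustrative only.

```python
from io import StringIO

import pandas as pd

# Shortened sample of the rows from the question (illustration only).
SAMPLE = """2016 01 01 00 00 19 348 2.05 7 618.4
2016 01 01 00 01 19 351 2.05 7 618.4
2016 01 01 00 02 18 0 2.05 7 618.4
"""

COLS = ("wind_speed(kts) wind_direction(dec) temperature(C) "
        "relative_humidity(%) pressure").split()

# Read the first five columns as strings so '01' is not turned into 1.
raw = pd.read_csv(StringIO(SAMPLE), sep=r"\s+", header=None,
                  dtype={i: str for i in range(5)})

# Join the date pieces into strings such as '2016 01 01 00 00' ...
text = raw[0].str.cat([raw[1], raw[2], raw[3], raw[4]], sep=" ")
# ... and parse the whole column at once with an explicit format.
stamps = pd.to_datetime(text, format="%Y %m %d %H %M")

df = raw.drop(columns=list(range(5)))
df.columns = COLS
df.index = pd.DatetimeIndex(stamps.to_numpy())
print(df)
```

Whether this is faster than the date_parser route depends on the pandas version and file size, so it is worth timing on the real data.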