pandas: Import a .DAT file into a pandas DataFrame

Question

Asked by ValientProcess

I have a .DAT file with rows like these:

2016 01 01 00 00 19 348 2.05 7 618.4
2016 01 01 00 01 19 351 2.05 7 618.4
2016 01 01 00 02 18 0 2.05 7 618.4
2016 01 01 00 03 17 353 2.05 7 618.4
2016 01 01 00 04 19 346 2.02 7 618.4
2016 01 01 00 05 20 345 2.00 7 618.4
2016 01 01 00 06 22 348 1.97 7 618.4
.......

The data format is:

year month day hour minute(HST) wind_speed(kts) wind_direction(dec) temperature(C) relative_humidity(%) pressure

I want to import the .DAT file into a pandas dataframe, with the year-month-day-hour-minute as a single index column, and the rest of the values as separate columns.

Any suggestions?

Thanks !!

Answer 1

Accepted answer by jezrael

You can use read_csv:

import datetime as dt
from io import StringIO  # pandas.compat.StringIO no longer exists in newer pandas

import pandas as pd

temp = u"""2016 01 01 00 00 19 348 2.05 7 618.4
2016 01 01 00 01 19 351 2.05 7 618.4
2016 01 01 00 02 18 0 2.05 7 618.4
2016 01 01 00 03 17 353 2.05 7 618.4
2016 01 01 00 04 19 346 2.02 7 618.4
2016 01 01 00 05 20 345 2.00 7 618.4
2016 01 01 00 06 22 348 1.97 7 618.4"""
# after testing, replace StringIO(temp) with the filename

# NOTE: date_parser and the nested-list form of parse_dates are deprecated
# in pandas 2.x; this code targets the older API the answer was written for
parser = lambda date: dt.datetime.strptime(date, '%Y %m %d %H %M')
df = pd.read_csv(StringIO(temp),
                 sep=r"\s+",  # whitespace separator
                 index_col=0,  # use the combined date column as the DatetimeIndex
                 date_parser=parser,  # function for converting dates
                 parse_dates=[[0,1,2,3,4]],  # combine these columns into one datetime
                 header=None)  # the file has no header row

Then you need to set the column names afterwards, because passing the names parameter to read_csv raises:

NotImplementedError: file structure not yet supported

df.columns = ['wind_speed(kts)', 'wind_direction(dec)', 'temperature(C)', 'relative_humidity(%)', 'pressure'] 
#remove index name
df.index.name = None

print (df)
                     wind_speed(kts)  wind_direction(dec)  temperature(C)  \
2016-01-01 00:00:00               19                  348            2.05   
2016-01-01 00:01:00               19                  351            2.05   
2016-01-01 00:02:00               18                    0            2.05   
2016-01-01 00:03:00               17                  353            2.05   
2016-01-01 00:04:00               19                  346            2.02   
2016-01-01 00:05:00               20                  345            2.00   
2016-01-01 00:06:00               22                  348            1.97   

                     relative_humidity(%)  pressure  
2016-01-01 00:00:00                     7     618.4  
2016-01-01 00:01:00                     7     618.4  
2016-01-01 00:02:00                     7     618.4  
2016-01-01 00:03:00                     7     618.4  
2016-01-01 00:04:00                     7     618.4  
2016-01-01 00:05:00                     7     618.4  
2016-01-01 00:06:00                     7     618.4  

print (df.dtypes)
wind_speed(kts)           int64
wind_direction(dec)       int64
temperature(C)          float64
relative_humidity(%)      int64
pressure                float64
dtype: object

print (df.index)
DatetimeIndex(['2016-01-01 00:00:00', '2016-01-01 00:01:00',
               '2016-01-01 00:02:00', '2016-01-01 00:03:00',
               '2016-01-01 00:04:00', '2016-01-01 00:05:00',
               '2016-01-01 00:06:00'],
              dtype='datetime64[ns]', freq=None)
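
For readers on a recent pandas release, where the date_parser parameter and the column-combining form of parse_dates are deprecated, here is a minimal sketch of an alternative that avoids both. It is not part of the original answer. It assumes the five date fields are read as ordinary integer columns and then assembled with pd.to_datetime, which accepts a DataFrame whose columns are named year, month, day, hour and minute. The names read_dat, date_cols, value_cols and sample are illustrative only.

```python
from io import StringIO

import pandas as pd

# Three sample rows copied from the question; swap StringIO(sample) for a file path
sample = """2016 01 01 00 00 19 348 2.05 7 618.4
2016 01 01 00 01 19 351 2.05 7 618.4
2016 01 01 00 02 18 0 2.05 7 618.4"""

# pd.to_datetime recognises these exact column names when given a DataFrame
date_cols = ["year", "month", "day", "hour", "minute"]
value_cols = ["wind_speed(kts)", "wind_direction(dec)", "temperature(C)",
              "relative_humidity(%)", "pressure"]


def read_dat(source):
    """Read the whitespace-separated file and index it by timestamp (hypothetical helper)."""
    raw = pd.read_csv(source, sep=r"\s+", header=None,
                      names=date_cols + value_cols)
    # Assemble one timestamp per row from the five integer date columns
    stamps = pd.to_datetime(raw[date_cols])
    df = raw.drop(columns=date_cols)
    df.index = pd.DatetimeIndex(stamps)
    return df


df = read_dat(StringIO(sample))
print(df)
```

Because the column names are supplied up front through names, this variant does not need the separate df.columns assignment afterwards.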

Answer 2

Answered by MaxU

Here is a slightly faster version:

In [86]: df = (pd.read_csv(fn, sep=r'\s+', header=None,  # fn is the path to the .DAT file
    ...:                   parse_dates={'Date':[0,1,2,3,4]},
    ...:                   date_parser=lambda x: pd.to_datetime(x, format='%Y %m %d %H %M'))
    ...:         .set_index('Date'))
    ...:

In [87]: df
Out[87]:
                      5    6     7  8      9
Date
2016-01-01 00:00:00  19  348  2.05  7  618.4
2016-01-01 00:01:00  19  351  2.05  7  618.4
2016-01-01 00:02:00  18    0  2.05  7  618.4
2016-01-01 00:03:00  17  353  2.05  7  618.4
2016-01-01 00:04:00  19  346  2.02  7  618.4
2016-01-01 00:05:00  20  345  2.00  7  618.4
2016-01-01 00:06:00  22  348  1.97  7  618.4

In [88]: cols_str = 'wind_speed(kts) wind_direction(dec) temperature(C) relative_humidity(%) pressure'
    ...: cols = cols_str.split()
    ...:

In [89]: cols
Out[89]:
['wind_speed(kts)',
 'wind_direction(dec)',
 'temperature(C)',
 'relative_humidity(%)',
 'pressure']

In [90]: df.columns = cols

In [91]: df
Out[91]:
                     wind_speed(kts)  wind_direction(dec)  temperature(C)  relative_humidity(%)  pressure
Date
2016-01-01 00:00:00               19                  348            2.05                     7     618.4
2016-01-01 00:01:00               19                  351            2.05                     7     618.4
2016-01-01 00:02:00               18                    0            2.05                     7     618.4
2016-01-01 00:03:00               17                  353            2.05                     7     618.4
2016-01-01 00:04:00               19                  346            2.02                     7     618.4
2016-01-01 00:05:00               20                  345            2.00                     7     618.4
2016-01-01 00:06:00               22                  348            1.97                     7     618.4
