将 csv 文件转换为 Pandas 数据框

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/14514046/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-13 20:36:55  来源:igfitidea点击:

Convert csv file to pandas dataframe

pandas

提问by amaity

I have a CSV file in the following format:

我有以下格式的 CSV 文件:

DATES, 01-12-2010, 01-12-2010, 01-12-2010, 02-12-2010, 02-12-2010, 02-12-2010
UNITS, Hz, kV, MW,  Hz, kV, MW
Interval, , , , , ,                                           
00:15, 49.82, 33.73755, 34.65, 49.92, 33.9009, 36.33,
00:30, 49.9, 33.7722, 35.34, 49.89, 33.8382, 37.65,
00:45, 49.94, 33.8316, 33.5, 50.09, 34.07745, 37.41,
01:00, 49.86, 33.94875, 30.91, 50.18,   34.20945, 36.11,
01:15, 49.97, 34.2243,  27.28,  50.11,  34.3596, 33.24,
01:30,  50.02,  34.3332, 26.91, 50.12,  34.452, 31.03,
01:45,  50.01,  34.1286, 31.26, 50, 33.9306, 38.86,
02:00,  50.08,  33.9141, 34.96, 50.14,  33.99165, 38.31,
02:15,  50.07,  33.84975, 35.33, 50.01, 33.9537, 39.78,
02:30,  49.97,  34.0263, 33.63, 50.07,  33.8547, 41.48,

I would like to convert the above to a dataframe in the following format:

我想将上述内容转换为以下格式的数据帧:

                    Hz      kV          MW
DATES_Interval
01-12-2010 00:15    49.82   33.73755    34.65
01-12-2010 00:30    49.9    33.7722     35.34
01-12-2010 00:45    49.94   33.8316     33.5
01-12-2010 01:00    49.86   33.94875    30.91
01-12-2010 01:15    49.97   34.2243     27.28
01-12-2010 01:30    50.02   34.3332     26.91
01-12-2010 01:45    50.01   34.1286     31.26
01-12-2010 02:00    50.08   33.9141     34.96
01-12-2010 02:15    50.07   33.84975    35.33
01-12-2010 02:30    49.97   34.0263     33.63
02-12-2010 00:15    49.92   33.9009     36.33
02-12-2010 00:30    49.89   33.8382     37.65
02-12-2010 00:45    50.09   34.07745    37.41
02-12-2010 01:00    50.09   34.07745    37.41
02-12-2010 01:15    50.11   34.3596     33.24
02-12-2010 01:30    50.12   34.452      31.03
02-12-2010 01:45    50      33.9306     38.86
02-12-2010 02:00    50.14   33.99165    38.31
02-12-2010 02:15    50.01   33.9537     39.78
02-12-2010 02:30    50.07   33.8547     41.48

How do I do this with pandas?

我如何用Pandas做到这一点?

回答by Andy Hayden

The key to doing this kind of thing in pandas is the stack()method:

在pandas中做这种事情的关键是stack()方法:

df.stack(level=0)

However, I found getting to a place where you can use this, at least particular csv is tricky.. to say the least(there is almost certainly a nicer way to do this!):

但是,我发现到了一个可以使用它的地方,至少特定的 csv 很棘手..至少可以说(几乎可以肯定有更好的方法来做到这一点!):

df_data = pd.read_csv('e.csv', sep=',\s+', header=None, skiprows=3)[range(7)].set_index(0)
df_cols = pd.read_csv('e.csv', sep=',\s+', header=None, nrows=2).set_index(0)[:2] #interval causing problems    
df_ = df_cols.append(df_data).T.set_index(['DATES','UNITS','Interval']).T
df = df_.stack(level=0)
df_dates = map(lambda x: pd.to_datetime(' '.join(x[::-1])), df.index)
df.index = df_dates

In [7]: df
Out[7]: 
UNITS                   Hz      MW        kV
2010-01-12 00:15:00  49.82   34.65  33.73755
2010-02-12 00:15:00  49.92  36.33,   33.9009
2010-01-12 00:30:00   49.9   35.34   33.7722
2010-02-12 00:30:00  49.89  37.65,   33.8382
2010-01-12 00:45:00  49.94    33.5   33.8316
2010-02-12 00:45:00  50.09  37.41,  34.07745
2010-01-12 01:00:00  49.86   30.91  33.94875
2010-02-12 01:00:00  50.18  36.11,  34.20945
2010-01-12 01:15:00  49.97   27.28   34.2243
2010-02-12 01:15:00  50.11  33.24,   34.3596
2010-01-12 01:30:00  50.02   26.91   34.3332
2010-02-12 01:30:00  50.12  31.03,    34.452
2010-01-12 01:45:00  50.01   31.26   34.1286
2010-02-12 01:45:00     50  38.86,   33.9306
2010-01-12 02:00:00  50.08   34.96   33.9141
2010-02-12 02:00:00  50.14  38.31,  33.99165
2010-01-12 02:15:00  50.07   35.33  33.84975
2010-02-12 02:15:00  50.01  39.78,   33.9537
2010-01-12 02:30:00  49.97   33.63   34.0263
2010-02-12 02:30:00  50.07  41.48,   33.8547

That's a little messy and has commas in some columns!:

这有点乱,在某些列中有逗号!:

def clean(s):
    try: return float(s.strip(','))
    except: return s

In [9]: df.applymap(clean)
Out[9]: 
                        Hz     MW        kV
2010-01-12 00:15:00  49.82  34.65  33.73755
2010-02-12 00:15:00  49.92  36.33  33.90090
2010-01-12 00:30:00  49.90  35.34  33.77220
2010-02-12 00:30:00  49.89  37.65  33.83820
2010-01-12 00:45:00  49.94  33.50  33.83160
2010-02-12 00:45:00  50.09  37.41  34.07745
2010-01-12 01:00:00  49.86  30.91  33.94875
2010-02-12 01:00:00  50.18  36.11  34.20945
2010-01-12 01:15:00  49.97  27.28  34.22430
2010-02-12 01:15:00  50.11  33.24  34.35960
2010-01-12 01:30:00  50.02  26.91  34.33320
2010-02-12 01:30:00  50.12  31.03  34.45200
2010-01-12 01:45:00  50.01  31.26  34.12860
2010-02-12 01:45:00  50.00  38.86  33.93060
2010-01-12 02:00:00  50.08  34.96  33.91410
2010-02-12 02:00:00  50.14  38.31  33.99165
2010-01-12 02:15:00  50.07  35.33  33.84975
2010-02-12 02:15:00  50.01  39.78  33.95370
2010-01-12 02:30:00  49.97  33.63  34.02630
2010-02-12 02:30:00  50.07  41.48  33.85470

回答by Aman

Another solution would be to

另一种解决方案是

  1. Read the dates (first row of csv)
  2. Read in the rest of the data, including intervals
  3. Construct the index as desired and apply it to the dataframe
  1. 读取日期(csv的第一行)
  2. 读入其余数据,包括间隔
  3. 根据需要构建索引并将其应用于数据帧

Here's some sample code:

这是一些示例代码:

In [1]: from StringIO import StringIO    
In [2]: import pandas as pd
In [3]: pd.__version__
Out[3]: '0.10.1'

In [4]: CSV_SAMPLE = """
DATES, 01-12-2010, 01-12-2010, 01-12-2010, 02-12-2010, 02-12-2010, 02-12-2010
UNITS, Hz, kV, MW,  Hz, kV, MW
Interval, , , , , ,
00:15, 49.82, 33.73755, 34.65, 49.92, 33.9009, 36.33,
00:30, 49.9, 33.7722, 35.34, 49.89, 33.8382, 37.65,
00:45, 49.94, 33.8316, 33.5, 50.09, 34.07745, 37.41,
01:00, 49.86, 33.94875, 30.91, 50.18,   34.20945, 36.11,
01:15, 49.97, 34.2243,  27.28,  50.11,  34.3596, 33.24,
01:30,  50.02,  34.3332, 26.91, 50.12,  34.452, 31.03,
"""

#Create one dataframe from just the dates (and we'll grab the units, too)
In [6]: datesdf = pd.read_csv(StringIO(CSV_SAMPLE), nrows= 2)
In [7]: dates, units = datesdf.index.droplevel()

In [9]: dates, units
Out[9]:
((' 01-12-2010',
  ' 01-12-2010',
  ' 01-12-2010',
  ' 02-12-2010',
  ' 02-12-2010',
  ' 02-12-2010'),
 (' Hz', ' kV', ' MW', '  Hz', ' kV', ' MW'))

#Create a second dataframe from the rest of the data
In [11]: data = pd.read_csv(StringIO(CSV_SAMPLE), skiprows=3)
In [12]: data = data.icol([0,1,2])

#Note: Instead, in pandas 0.10, you can use the usecols paramater in read_csv()
#  to combine the above two steps into one. 

In [14]: data.columns = units[:3]
In [15]: print data
          Hz        kV     MW
00:15  49.82  33.73755  34.65
00:30  49.90  33.77220  35.34
00:45  49.94  33.83160  33.50
01:00  49.86  33.94875  30.91
01:15  49.97  34.22430  27.28
01:30  50.02  34.33320  26.91

Now create the desired index and apply it. Here are several approaches for the index.

现在创建所需的索引并应用它。以下是索引的几种方法。

#We'll need to grab the intervals from this data df
In [16]: intervals = data.index.tolist()  

In [17]: index1 = pd.MultiIndex.from_arrays([dates,intervals])

#This is a multi-index
In [18]: print index1
MultiIndex
[( 01-12-2010, 00:15), ( 01-12-2010, 00:30), ( 01-12-2010, 00:45), ( 02-12-2010, 01:00), ( 02-12-2010, 01:15), ( 02-12-2
010, 01:30)]

#This index is a tuple of date,interval
In [21]: index2 = pd.Index(zip(dates, intervals))
In [22]: print index2
Index([( 01-12-2010, 00:15), ( 01-12-2010, 00:30), ( 01-12-2010, 00:45), ( 02-12-2010, 01:00), ( 02-12-2010, 01:15), ( 0
2-12-2010, 01:30)], dtype=object)

#This index is based on a string concat of date and interval
In [23]: def list_join(x,y):
   ....:         joined = x + ' ' + y
   ....:         return joined.strip()
   ....:

In [24]: index3 = pd.Index(map(list_join, dates, intervals))    
In [25]: print index3
Simple index:
Index([01-12-2010 00:15, 01-12-2010 00:30, 01-12-2010 00:45, 02-12-2010 01:00, 02-12-2010 01:15, 02-12-2010 01:30], dtyp
e=object)

Since third third type of index is what you had in your original request, I'll use that.

由于第三种类型的索引是您在原始请求中拥有的索引,我将使用它。

In [26]: data.index = index3
In [27]: print data
                     Hz        kV     MW
01-12-2010 00:15  49.82  33.73755  34.65
01-12-2010 00:30  49.90  33.77220  35.34
01-12-2010 00:45  49.94  33.83160  33.50
02-12-2010 01:00  49.86  33.94875  30.91
02-12-2010 01:15  49.97  34.22430  27.28
02-12-2010 01:30  50.02  34.33320  26.91

You may have to modify the code above to handle the complete dataset if it complains about non-unique index values. In that case, read in the Intervals csv column as a data column (not an index) and pull it out as an array to create the desired index, same as above.

如果它抱怨非唯一索引值,您可能需要修改上面的代码来处理完整的数据集。在这种情况下,将 Intervals csv 列作为数据列(不是索引)读入并将其作为数组拉出以创建所需的索引,与上述相同。