pandas: Reading multiple Excel files with different sheet names in pandas

Question

Asked by user1345283

To read files from a directory, try the following:

import os
import pandas as pd
path=os.getcwd()
files=os.listdir(path)
files

['wind-diciembre.xls', 'stat_noviembre.xls', 'stat_marzo.xls', 'wind-noviembre.xls', 'wind-enero.xls', 'stat_octubre.xls', 'wind-septiembre.xls', 'stat_septiembre.xls', 'wind-febrero.xls', 'wind-marzo.xls', 'wind-julio.xls', 'wind-octubre.xls', 'stat_diciembre.xls', 'stat_julio.xls', 'wind-junio.xls', 'stat_abril.xls', 'stat_enero.xls', 'stat_junio.xls', 'stat_agosto.xls', 'stat_febrero.xls', 'wind-abril.xls', 'wind-agosto.xls']

where:

stat_enero

     Fecha  HR  PreciAcu  RadSolar     T  Presion  Tmax  HRmax  \
01/01/2011  37         0       162  18.5        0  31.2     86   
02/01/2011  70         0        58  12.0        0  14.6     95   
03/01/2011  62         0       188  15.3        0  24.9     86   
04/01/2011  69         0       181  17.0        0  29.2     97 
     .
     .
     .

          Presionmax  RadSolarmax  Tmin  HRmin  Presionmin  
    0            0          774  12.3      9           0  
    1            0          314   9.2     52           0  
    2            0          713   8.3     32           0  
    3            0          730   7.7     26           0
    .
    .
    .

and

 wind-enero

            Fecha  MagV  MagMax  Rachas  MagRes  DirRes DirWind
01/08/2011 00:00   4.3    14.1    17.9     1.0   281.3     ONO
02/08/2011 00:00   4.2    15.7    20.6     1.5    28.3     NNE
03/08/2011 00:00   4.6    23.3    25.6     2.9    49.2     ENE
04/08/2011 00:00   4.8    17.9    23.0     2.0    30.5     NNE
    .
    .
    .

The next step is to read and parse the files and add them to a dataframe. At the moment I do the following:

for f in files:
    data=pd.ExcelFile(f)
    data1=data.sheet_names
    print data1

[u'diciembre']
[u'Hoja1']
[u'Hoja1']
[u'noviembre']
[u'enero']
[u'Hoja1']
[u'septiembre']
[u'Hoja1']
[u'febrero']
[u'marzo']
[u'julio']
    .
    .
    .

for sheet in data1:
    data2=data.parse(sheet)
data2
                Fecha  MagV  MagMax  Rachas  MagRes  DirRes DirWind
01/08/2011 00:00   4.3    14.1    17.9     1.0   281.3     ONO
02/08/2011 00:00   4.2    15.7    20.6     1.5    28.3     NNE
03/08/2011 00:00   4.6    23.3    25.6     2.9    49.2     ENE
04/08/2011 00:00   4.8    17.9    23.0     2.0    30.5     NNE
05/08/2011 00:00   6.0    22.5    26.3     4.4    68.7     ENE
06/08/2011 00:00   4.9    23.8    23.0     3.3    57.3     ENE
07/08/2011 00:00   3.4    12.9    20.2     1.6   104.0     ESE
08/08/2011 00:00   4.0    20.5    22.4     2.6    79.1     ENE
09/08/2011 00:00   4.1    22.4    25.8     2.9    74.1     ENE
10/08/2011 00:00   4.6    18.4    24.0     2.3    52.1     ENE
11/08/2011 00:00   5.0    22.3    27.8     3.3    65.0     ENE
12/08/2011 00:00   5.4    24.9    25.6     4.1    78.7     ENE
13/08/2011 00:00   5.3    26.0    31.7     4.5    79.7     ENE
14/08/2011 00:00   5.9    31.7    29.2     4.5    59.5     ENE 
15/08/2011 00:00   6.3    23.0    25.1     4.6    70.8     ENE
16/08/2011 00:00   6.3    19.5    30.8     4.8    64.0     ENE
17/08/2011 00:00   5.2    21.2    25.3     3.9    57.5     ENE
18/08/2011 00:00   5.0    22.3    23.7     2.6    59.4     ENE
19/08/2011 00:00   4.4    21.6    27.5     2.4    57.0     ENE

The output above shows only part of one file. How can I parse all the files and add them to a dataframe?

Answer 1

Answered by David Hagan

First off, it appears you have a few different datasets in these files. You may want them all in one dataframe, but for now I am going to assume you want them kept separate (for example, all of the wind*.xls files in one dataframe and all of the stat*.xls files in another). You could parse the data using read_excel and then concatenate the results, using the timestamp as the index, as follows:

import glob, os
import pandas as pd

runDir = "Path to files"

if os.getcwd() != runDir:
    os.chdir(runDir)

files = glob.glob("wind*.xls")

# DataFrame.append was removed in pandas 2.0, so collect the
# sheets in a list and concatenate them once at the end
frames = []

for each in files:
    sheets = pd.ExcelFile(each).sheet_names

    for sheet in sheets:
        frames.append(pd.read_excel(each, sheet_name=sheet, index_col='Fecha'))

# note: pd.concat raises a ValueError if no files matched
df = pd.concat(frames)

You now have a time-indexed dataframe! If you really want all of the data in one dataframe (from all of the file types), you can adjust the glob to include all of the files, using something like glob.glob('*.xls'). I would warn from personal experience that it may be easier to read in each type of data separately and then merge them after you have done some error checking, munging, etc.
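
As a rough sketch of the "read each type separately" advice, the loop above can be wrapped in a helper and called once per file group. The name read_group and its parameters are hypothetical, made up for this illustration. It assumes every sheet has a Fecha column, and it relies on sheet_name=None, which makes read_excel return a dict mapping each sheet name to its dataframe, so the sheet names never have to be known in advance.

```python
import glob

import pandas as pd


def read_group(pattern, index_col="Fecha"):
    """Read every sheet of every workbook matching `pattern` into one DataFrame.

    `read_group` and its parameters are hypothetical names for this sketch.
    """
    frames = []
    for path in sorted(glob.glob(pattern)):
        # sheet_name=None asks read_excel for all sheets at once and
        # returns a dict that maps each sheet name to its DataFrame
        sheets = pd.read_excel(path, sheet_name=None, index_col=index_col)
        frames.extend(sheets.values())
    # pd.concat raises on an empty list, so return an empty frame instead
    return pd.concat(frames) if frames else pd.DataFrame()


# Usage (assumes the wind*.xls / stat*.xls files are in the working directory):
# wind = read_group("wind*.xls")
# stat = read_group("stat*.xls")
```

Keeping wind and stat in separate dataframes leaves room to check each one for problems before deciding how to merge them.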

Answer 2

Answered by ihightower

The solution below is just a minor tweak on @DavidHagan's answer above.

This one adds columns that identify the file number of each file read (F0, F1, etc.) and the sheet number within each file (S0, S1, etc.), so that we know where each row came from.

import glob, os
import pandas as pd

runDir = r'c:\blah\blah'

if os.getcwd() != runDir:
    os.chdir(runDir)

files = glob.glob(r'*.*xls*')

#DataFrame.append was removed in pandas 2.0, so collect each
#sheet in a list and concatenate once at the end
frames = []

#fno is 0, 1, 2, ... (for each file)
for fno, each in enumerate(files):

    sheets = pd.ExcelFile(each).sheet_names

    #sno is 0, 1, 2, ... (for each sheet)
    for sno, sheet in enumerate(sheets):

        FileNo = 'F' + str(fno) #F0, F1, F2, etc.
        SheetNo = 'S' + str(sno) #S0, S1, S2, etc.

        # print(FileNo, SheetNo, each, sheet) #debug info

        #header = None if you don't want header or take this out.
        #dfxl is dataframe of each xl sheet
        dfxl = pd.read_excel(each, sheet_name=sheet, header=None)

        #add column of FileNo and SheetNo to the dataframe
        dfxl['FileNo'] = FileNo
        dfxl['SheetNo'] = SheetNo

        #now add the current xl sheet to the list
        frames.append(dfxl)

#main dataframe with the rows of every sheet
df = pd.concat(frames)

After doing the above, i.e. reading multiple Excel files and sheets into a single dataframe (df), you can do the following to get a sample row from each file and sheet combination. The samples will be available in the dataframe dfs1.

#get unique FileNo and SheetNo in dft2
dft2 = df[['FileNo', 'SheetNo']].drop_duplicates()

#collect one sample row from each of the read file/sheets
samples = []

#loop through each sheet and fileno names
for row in dft2.itertuples():

    #get a sample from each file to view
    #row[0] is the index, row[1] is FileNo, row[2] is SheetNo
    dfts = df[(df.FileNo == row[1]) & (df.SheetNo == row[2])].sample(1)

    samples.append(dfts)

#dfs1 has a sample row from each xl sheet and file
dfs1 = pd.concat(samples, ignore_index=True)

dfs1.to_clipboard()
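
If the installed pandas is 1.1 or newer, the same per-sheet sampling can probably be done without an explicit loop by using groupby followed by sample. The sketch below is only an illustration: the small dataframe is made-up stand-in data, not the real combined data, and the value column is a hypothetical name.

```python
import pandas as pd

# Made-up stand-in for the combined dataframe built above:
# two files (F0, F1), the first with two sheets
df = pd.DataFrame({
    "value": [1, 2, 3, 4, 5, 6],
    "FileNo": ["F0", "F0", "F0", "F0", "F1", "F1"],
    "SheetNo": ["S0", "S0", "S1", "S1", "S0", "S0"],
})

# One random row per (FileNo, SheetNo) group; random_state makes it repeatable
dfs1 = (
    df.groupby(["FileNo", "SheetNo"])
      .sample(n=1, random_state=0)
      .reset_index(drop=True)
)
print(dfs1)
```

The result has one row for each file and sheet combination, just like the loop version.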
