pandas 在pandas中读取多个具有不同工作表名称的excel文件
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/21125596/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Read multiple excel file with different sheets names in pandas
提问by user1345283
To read files from a directory, try the following:
要从目录中读取文件,请尝试以下操作:
import os
import pandas as pd
path=os.getcwd()
files=os.listdir(path)
files
['wind-diciembre.xls', 'stat_noviembre.xls', 'stat_marzo.xls', 'wind-noviembre.xls', 'wind-enero.xls', 'stat_octubre.xls', 'wind-septiembre.xls', 'stat_septiembre.xls', 'wind-febrero.xls', 'wind-marzo.xls', 'wind-julio.xls', 'wind-octubre.xls', 'stat_diciembre.xls', 'stat_julio.xls', 'wind-junio.xls', 'stat_abril.xls', 'stat_enero.xls', 'stat_junio.xls', 'stat_agosto.xls', 'stat_febrero.xls', 'wind-abril.xls', 'wind-agosto.xls']
where:
在哪里:
stat_enero
Fecha HR PreciAcu RadSolar T Presion Tmax HRmax \
01/01/2011 37 0 162 18.5 0 31.2 86
02/01/2011 70 0 58 12.0 0 14.6 95
03/01/2011 62 0 188 15.3 0 24.9 86
04/01/2011 69 0 181 17.0 0 29.2 97
.
.
.
Presionmax RadSolarmax Tmin HRmin Presionmin
0 0 774 12.3 9 0
1 0 314 9.2 52 0
2 0 713 8.3 32 0
3 0 730 7.7 26 0
.
.
.
and
和
wind-enero
Fecha MagV MagMax Rachas MagRes DirRes DirWind
01/08/2011 00:00 4.3 14.1 17.9 1.0 281.3 ONO
02/08/2011 00:00 4.2 15.7 20.6 1.5 28.3 NNE
03/08/2011 00:00 4.6 23.3 25.6 2.9 49.2 ENE
04/08/2011 00:00 4.8 17.9 23.0 2.0 30.5 NNE
.
.
.
The next step is to read, parse and add the files to a dataframe, Now I do the following:
下一步是读取、解析文件并将其添加到数据帧中,现在我执行以下操作:
for f in files:
data=pd.ExcelFile(f)
data1=data.sheet_names
print data1
[u'diciembre']
[u'Hoja1']
[u'Hoja1']
[u'noviembre']
[u'enero']
[u'Hoja1']
[u'septiembre']
[u'Hoja1']
[u'febrero']
[u'marzo']
[u'julio']
.
.
.
for sheet in data1:
data2=data.parse(sheet)
data2
Fecha MagV MagMax Rachas MagRes DirRes DirWind
01/08/2011 00:00 4.3 14.1 17.9 1.0 281.3 ONO
02/08/2011 00:00 4.2 15.7 20.6 1.5 28.3 NNE
03/08/2011 00:00 4.6 23.3 25.6 2.9 49.2 ENE
04/08/2011 00:00 4.8 17.9 23.0 2.0 30.5 NNE
05/08/2011 00:00 6.0 22.5 26.3 4.4 68.7 ENE
06/08/2011 00:00 4.9 23.8 23.0 3.3 57.3 ENE
07/08/2011 00:00 3.4 12.9 20.2 1.6 104.0 ESE
08/08/2011 00:00 4.0 20.5 22.4 2.6 79.1 ENE
09/08/2011 00:00 4.1 22.4 25.8 2.9 74.1 ENE
10/08/2011 00:00 4.6 18.4 24.0 2.3 52.1 ENE
11/08/2011 00:00 5.0 22.3 27.8 3.3 65.0 ENE
12/08/2011 00:00 5.4 24.9 25.6 4.1 78.7 ENE
13/08/2011 00:00 5.3 26.0 31.7 4.5 79.7 ENE
14/08/2011 00:00 5.9 31.7 29.2 4.5 59.5 ENE
15/08/2011 00:00 6.3 23.0 25.1 4.6 70.8 ENE
16/08/2011 00:00 6.3 19.5 30.8 4.8 64.0 ENE
17/08/2011 00:00 5.2 21.2 25.3 3.9 57.5 ENE
18/08/2011 00:00 5.0 22.3 23.7 2.6 59.4 ENE
19/08/2011 00:00 4.4 21.6 27.5 2.4 57.0 ENE
The above output shows only part of the file,how I can parse all files and add them to a dataframe
上面的输出仅显示文件的一部分,我如何解析所有文件并将它们添加到数据帧
回答by David Hagan
First off, it appears you have a few different datasets in these files. You may want them all in one dataframe, but for now, I am going to assume you want them separated. Ex (All of the wind*.xls files in one dataframe and all of the stat*.xls files in another.) You could parse the data using read_exceland then concatenate the results using the timestamp as the index as follows:
首先,这些文件中似乎有几个不同的数据集。您可能希望将它们全部放在一个数据框中,但现在,我假设您希望将它们分开。例如(一个数据框中的所有 wind*.xls 文件和另一个中的所有 stat*.xls 文件。)您可以使用解析数据read_excel,然后使用时间戳作为索引连接结果,如下所示:
import numpy as np
import pandas as pd, datetime as dt
import glob, os
runDir = "Path to files"
if os.getcwd() != runDir:
os.chdir(runDir)
files = glob.glob("wind*.xls")
df = pd.DataFrame()
for each in files:
sheets = pd.ExcelFile(each).sheet_names
for sheet in sheets:
df = df.append(pd.read_excel(each, sheet, index_col='Fecha'))
You now have a time-indexed dataframe! If you really want to have all of the data in one dataframe (from all of the file types), you can just adjust the globto include all of the files using something like glob.glob('*.xls'). I would warn from personal experience that it may be easier for you to read in each type of data separately and then merge them after you have done some error checking/munging etc.
您现在有一个时间索引数据框!如果您真的希望将所有数据都包含在一个数据框中(来自所有文件类型),您可以glob使用类似glob.glob('*.xls'). 我会根据个人经验发出警告,您可能更容易分别读取每种类型的数据,然后在完成一些错误检查/修改等后合并它们。
回答by ihightower
Below solution is just a minor tweak on @DavidHagan's answer above.
下面的解决方案只是对上面@DavidHagan 的回答的一个小调整。
This one includes a column to identify the read File No like F0, F1, etc.and sheet no of each file as S0, S1, etc.So that we can know where the rows came from.
这包括一列,用于标识读取的文件号,如F0、F1 等,以及每个文件的表号为S0、S1 等。这样我们就可以知道行来自哪里。
import numpy as np
import pandas as pd, datetime as dt
import glob, os
import sys
runDir = r'c:\blah\blah'
if os.getcwd() != runDir:
os.chdir(runDir)
files = glob.glob(r'*.*xls*')
df = pd.DataFrame()
#fno is 0, 1, 2, ... (for each file)
for fno, each in enumerate(files):
sheets = pd.ExcelFile(each).sheet_names
# sno iss 0, 1, 2, ... (for each sheet)
for sno, sheet in enumerate(sheets):
FileNo = 'F' + str(fno) #F0, F1, F2, etc.
SheetNo = 'S' + str(sno) #S0, S1, S2, etc.
# print FileNo, SheetNo, each, sheet #debug info
#header = None if you don't want header or take this out.
#dfxl is dataframe of each xl sheet
dfxl = pd.read_excel(each, sheet, header=None)
#add column of FileNo and SheetNo to the dataframe
dfxl['FileNo'] = FileNo
dfxl['SheetNo'] = SheetNo
#now add the current xl sheet to main dataframe
df = df.append(dfxl)
After doing above.. i.e. reading multiple XL Files and Sheets into a single dataframe (df)... you can do this.. to get a sample row from each File, Sheet combination.. and the sample wil be available in dataframe (dfs1).
完成上述操作后..即将多个 XL 文件和工作表读取到单个数据帧 (df) 中......您可以这样做.. 从每个文件、工作表组合中获取一个示例行.. 并且该示例将在数据帧中可用 ( dfs1)。
#get unique FileNo and SheetNo in dft2
dft2 = df.loc[0,['FileNo', 'SheetNo']]
#empty dataframe to collect sample from each of the read file/sheets
dfs1 = pd.DataFrame()
#loop through each sheet and fileno names
for row in dft2.itertuples():
#get a sample from each file to view
dfts = df[(df.FileNo == row[1]) & (df.SheetNo ==row[2])].sample(1)
#append the 1 sample to dfs1. this will have a sample row
# from each xl sheet and file
dfs1 = dfs1.append(dfts, ignore_index = True)
dfs1.to_clipboard()

