Note: this page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must comply with the same license, cite the original URL and authors, and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/17711585/

Date: 2020-09-13 20:59:50  Source: igfitidea

Organizing data read from Excel to Pandas DataFrame

python excel pandas

Asked by pbreach

My goal with this script is to:

1. Read time series data from an Excel file (>100,000k rows), along with the headers (Labels, Units).
2. Convert the Excel numeric dates to the best datetime object for a pandas DataFrame.
3. Be able to use timestamps to reference rows and series labels to reference columns.

So far I have used xlrd to read the Excel data into a list. I made a pandas Series from each list, using the time list as the index, combined the Series with their headers into a Python dictionary, and passed the dictionary to a pandas DataFrame. Despite my efforts, df.index seems to be set to the column headers, and I'm not sure at which point to convert the dates into datetime objects.

I just started using Python 3 days ago, so any advice would be great! Here's my code:

    import pandas as pd
    import xlrd
    from pandas import Series

    #Open excel workbook and first sheet
    wb = xlrd.open_workbook(r"C:\GreenCSV\Calgary\CWater.xlsx")
    sh = wb.sheet_by_index(0)

    #Read rows containing labels and units
    Labels = sh.row_values(1, start_colx=0, end_colx=None)
    Units = sh.row_values(2, start_colx=0, end_colx=None)

    #Initialize list to hold data
    Data = [None] * (sh.ncols)

    #read column by column and store in list
    for colnum in range(sh.ncols):
        Data[colnum] = sh.col_values(colnum, start_rowx=5, end_rowx=None)

    #Delete unnecessary rows and columns
    del Labels[3],Labels[0:2], Units[3], Units[0:2], Data[3], Data[0:2]   

    #Create Pandas Series
    s = [None] * (sh.ncols - 4)
    for colnum in range(sh.ncols - 4):
        s[colnum] = Series(Data[colnum+1], index=Data[0])

    #Create Dictionary of Series
    dictionary = {}
    for i in range(sh.ncols-4):
        dictionary[i]= {Labels[i] : s[i]}

    #Pass Dictionary to Pandas DataFrame
    df = pd.DataFrame.from_dict(dictionary)

Answered by Andy Hayden

You can use pandas directly here. I usually like to create a dictionary of DataFrames (with the keys being the sheet names):

    In [11]: xl = pd.ExcelFile(r"C:\GreenCSV\Calgary\CWater.xlsx")

    In [12]: xl.sheet_names  # in your example it may be different
    Out[12]: [u'Sheet1', u'Sheet2', u'Sheet3']

    In [13]: dfs = {sheet: xl.parse(sheet) for sheet in xl.sheet_names}

    In [14]: dfs['Sheet1']  # access DataFrame by sheet name

You can check out the docs on parse, which offers some more options (for example skiprows); these allow you to parse individual sheets with much more control...
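As a rough illustration of the skiprows option, here is a minimal sketch that keeps one row as the column labels and drops the other header rows above the data. The row numbers (labels in row 1, data from row 5, counted from 0) are taken from the question's xlrd code and are assumptions about the workbook layout; `header_rows_to_skip` and `read_labelled_sheet` are hypothetical helper names, not part of pandas.

```python
import pandas as pd


def header_rows_to_skip(label_row, first_data_row):
    """Return the 0-based rows above the data that are not the label row."""
    return [r for r in range(first_data_row) if r != label_row]


def read_labelled_sheet(path, sheet_name=0, label_row=1, first_data_row=5):
    """Read one sheet, keeping `label_row` as the column names.

    The defaults mirror the question's code (labels in row 1, data starting
    at row 5); they are assumptions about how the workbook is laid out.
    """
    return pd.read_excel(
        path,
        sheet_name=sheet_name,
        skiprows=header_rows_to_skip(label_row, first_data_row),
        header=0,  # first row left after skipping, i.e. the label row
    )
```

Calling `read_labelled_sheet(r"C:\GreenCSV\Calgary\CWater.xlsx")` would then return a single DataFrame for the first sheet. Reading .xlsx files this way needs an Excel engine such as openpyxl to be installed.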

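For the date conversion and the index problem raised in the question, the following is a minimal sketch, assuming the time column holds Excel serial day numbers in the default 1900 date system. The sample values, the labels "Flow" and "Temp", and the helper name `excel_serial_to_datetime` are made up for illustration. The key point is to build one flat dictionary mapping each label to its values and to pass the converted timestamps as the index.

```python
import pandas as pd


def excel_serial_to_datetime(serials):
    """Convert Excel serial day numbers (1900 date system) to a DatetimeIndex."""
    # For serials from 1 March 1900 onward, day 0 corresponds to 1899-12-30.
    return pd.to_datetime(serials, unit="D", origin="1899-12-30")


# Hypothetical sample data standing in for the columns read from the sheet.
time_serials = [41275.0, 41275.5, 41276.0]
labels = ["Flow", "Temp"]
columns = [[1.2, 1.3, 1.1], [4.0, 4.5, 5.0]]

# One flat dict (label -> values); every column shares the timestamp index.
df = pd.DataFrame(dict(zip(labels, columns)),
                  index=excel_serial_to_datetime(time_serials))

# Reference a row by timestamp and a column by label.
noon_flow = df.loc[pd.Timestamp("2013-01-01 12:00"), "Flow"]
```

Converting the dates before building the DataFrame means the index is a DatetimeIndex from the start, so the timestamps can be used directly with `df.loc`.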