How to read an Excel workbook (pandas)

Notice: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow, original URL: http://stackoverflow.com/questions/32976423/

Date: 2020-09-13 23:59:14  Source: igfitidea

How to read Excel Workbook (pandas)

Tags: python, python-2.7, pandas, openpyxl

Asked by R. Nero

First I want to say that I am not an expert by any means. I am versed but carry a burden of schedule and learning Python like I should have at a younger age!

Question:
I have a workbook that will on occasion have more than one worksheet. When reading in the workbook I will not know the number of sheets or their names. The data arrangement will be the same on every sheet, with some columns going by the name of 'Unnamed'. The problem is that everything I try or find online uses pandas.ExcelFile to gather all sheets, which is fine, but I need to be able to skip 4 rows, read only the 42 rows after that, and parse specific columns. Although the sheets have the exact same structure, the column names might be the same or different, and I would like the sheets to be merged.

So here is what I have:

import pandas as pd
from openpyxl import load_workbook

# Load in the file location and name
cause_effect_file = r'C:\Users\Owner\Desktop\C&E Template.xlsx'

# Set up the ability to write dataframe to the same workbook
book = load_workbook(cause_effect_file)
writer = pd.ExcelWriter(cause_effect_file) 
writer.book = book
writer.sheets = dict((ws.title, ws) for ws in book.worksheets)

# Get the file skip rows and parse columns needed
xl_file = pd.read_excel(cause_effect_file, skiprows=4, parse_cols = 'B:AJ', na_values=['NA'], convert_float=False)

# Loop through the sheets loading data in the dataframe
dfi = {sheet_name: xl_file.parse(sheet_name)
          for sheet_name in xl_file.sheet_names}

# Remove columns labeled as un-named
for col in dfi:
    if r'Unnamed' in col:
        del dfi[col]

# Write dataframe to sheet so we can see what the data looks like
dfi.to_excel(writer, "PyDF", index=False)

# Save it back to the book
writer.save()

The link to the file I am working with is below: Excel File

Accepted answer by user29791

Try to modify the following based on your specific need:

import pandas as pd

# `path` is the location of the workbook (cause_effect_file in the question)
df = pd.DataFrame()
xls = pd.ExcelFile(path)

Then iterate over all the available data sheets:

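for x in range(len(xls.sheet_names)):
    # header=4 takes the fifth row as column names; usecols replaces the removed parse_cols
    a = xls.parse(x, header=4, usecols='B:AJ')
    a["Sheet Name"] = xls.sheet_names[x]
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df = pd.concat([df, a])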

You can adjust the header row and the columns to read for each sheet. I added a column that will indicate the name of the data sheet the row came from.
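
For readers on a recent pandas, the same idea can be sketched with pd.read_excel and sheet_name=None, which returns every sheet as a dict without knowing the names in advance. This is only a sketch under assumptions: the helper names combine_sheets and read_workbook are made up for illustration, skiprows=4 with nrows=42 is one reading of "skip 4 rows and read 42 rows" (the first remaining row is taken as the header), and dropping columns whose name starts with "Unnamed" is what the question seems to want.

```python
import pandas as pd


def combine_sheets(frames):
    """Stack a {sheet_name: DataFrame} mapping into a single DataFrame.

    Hypothetical helper, not part of pandas.
    """
    cleaned = []
    for sheet_name, frame in frames.items():
        # pandas labels header-less columns "Unnamed: N"; drop those
        keep = [c for c in frame.columns if not str(c).startswith("Unnamed")]
        frame = frame[keep].copy()
        # Record which sheet each row came from
        frame["Sheet Name"] = sheet_name
        cleaned.append(frame)
    # concat aligns on column names; a column missing from a sheet is filled with NaN
    return pd.concat(cleaned, ignore_index=True, sort=False)


def read_workbook(path):
    """Read every sheet of the workbook at `path` and merge them (hypothetical helper)."""
    # sheet_name=None returns a dict keyed by sheet name
    frames = pd.read_excel(
        path, sheet_name=None, skiprows=4, nrows=42, usecols="B:AJ"
    )
    return combine_sheets(frames)
```

Calling read_workbook(cause_effect_file) would then give one merged DataFrame. Keeping the merge step separate from the file reading makes it easy to try on small hand-built frames first.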

Answer by Charlie Clark

You probably want to look at using read_only mode in openpyxl. This will allow you to load only those sheets that you're interested in and look at only the cells you're interested in.

If you want to work with Pandas dataframes then you'll have to create these yourself but that shouldn't be too hard.
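
A rough sketch of what that could look like, not a definitive implementation: the helper names sheet_to_frame and read_workbook_read_only are invented for illustration, and the defaults assume the layout from the question (header on row 5, 42 data rows after it, columns B to AJ, i.e. columns 2 to 36). It relies on iter_rows(..., values_only=True), which needs a reasonably recent openpyxl.

```python
import pandas as pd
from openpyxl import load_workbook


def sheet_to_frame(ws, header_row=5, n_rows=42, min_col=2, max_col=36):
    """Build a DataFrame from one worksheet (hypothetical helper).

    Only the requested block of cells is read; the first row of the
    block supplies the column names.
    """
    rows = ws.iter_rows(
        min_row=header_row,
        max_row=header_row + n_rows,
        min_col=min_col,
        max_col=max_col,
        values_only=True,  # yield plain values instead of cell objects
    )
    header = next(rows)
    return pd.DataFrame(list(rows), columns=list(header))


def read_workbook_read_only(path):
    """Return {sheet title: DataFrame} for every sheet in the workbook."""
    wb = load_workbook(path, read_only=True)
    try:
        return {ws.title: sheet_to_frame(ws) for ws in wb.worksheets}
    finally:
        # read-only workbooks keep the file handle open until closed
        wb.close()
```

The resulting dict of frames can be merged with pd.concat if a single table is wanted. Header cells that are empty in the sheet come back as None here rather than as 'Unnamed' columns.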