Python Pandas: Looking up the list of sheets in an Excel file

Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/17977540/

Date: 2020-08-19 09:38:59  Source: igfitidea


python, excel, pandas, openpyxl, xlrd

Asked by Amelio Vazquez-Reina

The new version of Pandas uses the following interface to load Excel files:

read_excel('path_to_file.xls', 'Sheet1', index_col=None, na_values=['NA'])

But what if I don't know which sheets are available?

For example, I am working with Excel files that have the following sheets:

Data 1, Data 2, ..., Data N, foo, bar

but I don't know N a priori.

Is there any way to get the list of sheets from an Excel document in Pandas?

Accepted answer by Andy Hayden

You can still use the ExcelFile class (and its sheet_names attribute):

import pandas as pd

xl = pd.ExcelFile('foo.xls')
xl.sheet_names  # see all sheet names
# sheet_name below stands for one of the names listed in xl.sheet_names
xl.parse(sheet_name)  # read a specific sheet to DataFrame

See the docs for parse for more options.
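As a rough sketch of how the ExcelFile approach could answer the original question (sheets named "Data 1" to "Data N" with N unknown), the helpers below list the sheet names and then read only the sheets whose names start with a given prefix. The function names `list_sheets` and `read_matching_sheets` are hypothetical, and the sketch assumes pandas has a suitable Excel engine installed (for example openpyxl for .xlsx files).

```python
import pandas as pd


def list_sheets(path):
    """Return the sheet names of a workbook without building any DataFrame."""
    # ExcelFile can be used as a context manager, which closes the file afterwards.
    with pd.ExcelFile(path) as xl:
        return list(xl.sheet_names)


def read_matching_sheets(path, prefix):
    """Read only the sheets whose names start with `prefix`; returns {name: DataFrame}."""
    with pd.ExcelFile(path) as xl:
        wanted = [name for name in xl.sheet_names if name.startswith(prefix)]
        return {name: xl.parse(name) for name in wanted}
```

Calling `read_matching_sheets('foo.xls', 'Data ')` would then return one DataFrame per "Data ..." sheet and skip foo and bar.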

Answer by Nicholas Lu

You should explicitly specify the second parameter (sheetname, renamed sheet_name in later pandas versions) as None, like this:

import pandas
df = pandas.read_excel("/yourPath/FileName.xlsx", None)

"df" now holds all sheets as a dictionary of DataFrames; you can verify this by running:

df.keys()

The result looks like this:

[u'201610', u'201601', u'201701', u'201702', u'201703', u'201704', u'201705', u'201706', u'201612', u'fund', u'201603', u'201602', u'201605', u'201607', u'201606', u'201608', u'201512', u'201611', u'201604']

Please refer to the pandas docs for more details: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html

Answer by Dhwanil shah

I have tried xlrd, pandas, openpyxl and other such libraries, and all of them seem to take exponentially more time as the file size increases, because they read the entire file. The other solutions mentioned above that used 'on_demand' did not work for me. If you just want to get the sheet names, the following function works for xlsx files.

import os
import shutil
import zipfile

import xmltodict
from django.conf import settings  # assumption: a Django project; only MEDIA_ROOT is used


def get_sheet_details(file_path):
    sheets = []
    file_name = os.path.splitext(os.path.split(file_path)[-1])[0]
    # Make a temporary directory with the file name
    directory_to_extract_to = os.path.join(settings.MEDIA_ROOT, file_name)
    os.mkdir(directory_to_extract_to)

    # Extract the xlsx file as it is just a zip file
    with zipfile.ZipFile(file_path, 'r') as zip_ref:
        zip_ref.extractall(directory_to_extract_to)

    # Open workbook.xml, which is very light and only has metadata, and get the sheets from it
    path_to_workbook = os.path.join(directory_to_extract_to, 'xl', 'workbook.xml')
    with open(path_to_workbook, 'r', encoding='utf-8') as f:
        dictionary = xmltodict.parse(f.read())

    sheet_entries = dictionary['workbook']['sheets']['sheet']
    # xmltodict returns a single dict, not a list, when the workbook has only one sheet
    if isinstance(sheet_entries, dict):
        sheet_entries = [sheet_entries]
    for sheet in sheet_entries:
        sheets.append({'id': sheet['@sheetId'], 'name': sheet['@name']})

    # Delete the extracted files directory
    shutil.rmtree(directory_to_extract_to)
    return sheets

Since xlsx files are basically zip archives, we extract the underlying XML data and read the sheet names directly from the workbook, which takes a fraction of a second compared with the library functions.

Benchmarking (on a 6 MB xlsx file with 4 sheets):
Pandas, xlrd: 12 seconds
openpyxl: 24 seconds
Proposed method: 0.4 seconds

Since my requirement was just to read the sheet names, the unnecessary overhead of reading the entire file was bugging me, so I took this route instead.
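Timings like the ones quoted in this answer depend heavily on the file and the machine, so it may be worth measuring on your own workbook. The helper below is only a minimal sketch of one way to do that with the standard library; `time_call` is a hypothetical name, not part of any of the libraries discussed here.

```python
import time


def time_call(func, *args, repeat=3):
    """Call func(*args) `repeat` times; return (last result, best wall-clock seconds)."""
    best = float("inf")
    result = None
    for _ in range(repeat):
        start = time.perf_counter()
        result = func(*args)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return result, best
```

For example, `time_call(get_sheet_details, path)` and `time_call(pandas.ExcelFile, path)` could be run against the same file, where `path` is whatever workbook you want to compare on.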

Answer by divingTobi

Building on @dhwanil_shah's answer, you do not need to extract the whole file. With zf.open it is possible to read from a zipped file directly.

import xml.etree.ElementTree as ET
import zipfile

def xlsxSheets(f):
    with zipfile.ZipFile(f) as zf:
        with zf.open('xl/workbook.xml') as workbook:
            workbook.readline()  # skip the first line (the XML declaration)
            content = workbook.readline()  # the workbook element is on the second line

    root = ET.fromstring(content)
    sheets = []
    for c in root.findall('{http://schemas.openxmlformats.org/spreadsheetml/2006/main}sheets/*'):
        sheets.append(c.attrib['name'])
    return sheets

The two consecutive readline calls are ugly, but the content is only on the second line of the file. No need to parse the whole file.

This solution seems to be much faster than the read_excel version, and most likely also faster than the full extract version.
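If the workbook.xml in your files is not laid out with the content on the second line, a variant that parses the whole (small) workbook.xml stream may be more robust. The following is a sketch under that assumption rather than the answer author's method; `xlsx_sheet_names` is a hypothetical name, and like the answers above it assumes the workbook part lives at xl/workbook.xml inside the archive.

```python
import xml.etree.ElementTree as ET
import zipfile

# Namespace of the main SpreadsheetML part (same URI as in the answer above).
MAIN_NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def xlsx_sheet_names(source):
    """Return the sheet names of an .xlsx given a path or a binary file object."""
    with zipfile.ZipFile(source) as zf:
        # Only workbook.xml is decompressed; the worksheet data is never read.
        with zf.open("xl/workbook.xml") as workbook:
            root = ET.parse(workbook).getroot()
    sheet_path = f"{MAIN_NS}sheets/{MAIN_NS}sheet"
    return [sheet.attrib["name"] for sheet in root.findall(sheet_path)]
```

Because the function parses the XML as a whole, it does not matter whether the file puts the declaration and the workbook element on separate lines.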

Answer by S.E.A

This is the fastest way I have found, inspired by @divingTobi's answer. All the answers based on xlrd, openpyxl or pandas are slow for me, as they all load the whole file first.

from zipfile import ZipFile
from bs4 import BeautifulSoup  # you also need to install "lxml" for the XML parser

file = "workbook.xlsx"  # hypothetical path; replace with your own .xlsx file

with ZipFile(file) as zipped_file:
    summary = zipped_file.open(r'xl/workbook.xml').read()
soup = BeautifulSoup(summary, "xml")
sheets = [sheet.get("name") for sheet in soup.find_all("sheet")]