Pandas: Looking up the list of sheets in an Excel file
Note: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/17977540/
Asked by Amelio Vazquez-Reina
The new version of Pandas uses the following interface to load Excel files:
read_excel('path_to_file.xls', 'Sheet1', index_col=None, na_values=['NA'])
But what if I don't know which sheets are available?
For example, I am working with Excel files that have the following sheets:
Data 1, Data 2 ..., Data N, foo, bar
but I don't know N a priori.
Is there any way to get the list of sheets from an Excel document in Pandas?
Accepted answer by Andy Hayden
You can still use the ExcelFile class (and its sheet_names attribute):
xl = pd.ExcelFile('foo.xls')
xl.sheet_names # see all sheet names
xl.parse(sheet_name) # read a specific sheet to DataFrame
See the docs for parse for more options.
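Since the question is about sheets named "Data 1" through "Data N" with N unknown, here is a minimal sketch of how the names from xl.sheet_names could be filtered. The helper name data_sheet_names and its prefix parameter are my own illustration, not part of pandas.

```python
import re

def data_sheet_names(sheet_names, prefix="Data"):
    """Return the sheets named '<prefix> <N>', ordered by N (hypothetical helper)."""
    pattern = re.compile(r"^" + re.escape(prefix) + r"\s+(\d+)$")
    matches = []
    for name in sheet_names:
        m = pattern.match(name)
        if m:
            # Keep the numeric part so "Data 10" sorts after "Data 2".
            matches.append((int(m.group(1)), name))
    return [name for _, name in sorted(matches)]

# Possible usage with the accepted answer (file name is illustrative):
#   xl = pd.ExcelFile('foo.xls')
#   frames = {name: xl.parse(name) for name in data_sheet_names(xl.sheet_names)}
```

Sheets such as foo and bar are skipped because they do not match the pattern.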
Answer by Nicholas Lu
You should explicitly pass None as the second parameter (sheetname, renamed sheet_name in later pandas versions), like this:
df = pandas.read_excel("/yourPath/FileName.xlsx", None)
"df" is then a dictionary of DataFrames, one per sheet. You can verify this by running:
df.keys()
The result looks like this:
[u'201610', u'201601', u'201701', u'201702', u'201703', u'201704', u'201705', u'201706', u'201612', u'fund', u'201603', u'201602', u'201605', u'201607', u'201606', u'201608', u'201512', u'201611', u'201604']
Please refer to the pandas docs for more details: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html
Answer by Dhwanil shah
I have tried xlrd, pandas, openpyxl and other such libraries, and all of them seem to take exponentially longer as the file size increases, because they read the entire file. The other solutions mentioned above that use 'on_demand' did not work for me. If you just want to get the sheet names up front, the following function works for xlsx files.
import os
import shutil
import zipfile

import xmltodict  # third-party package

# NOTE: `settings` is not defined in this snippet; it is presumably Django's
# settings object (from django.conf import settings). Any writable directory works.

def get_sheet_details(file_path):
    sheets = []
    file_name = os.path.splitext(os.path.split(file_path)[-1])[0]
    # Make a temporary directory with the file name
    directory_to_extract_to = os.path.join(settings.MEDIA_ROOT, file_name)
    os.mkdir(directory_to_extract_to)
    # Extract the xlsx file, as it is just a zip file
    with zipfile.ZipFile(file_path, 'r') as zip_ref:
        zip_ref.extractall(directory_to_extract_to)
    # Open workbook.xml, which is very light and only holds metadata, and get the sheets from it
    path_to_workbook = os.path.join(directory_to_extract_to, 'xl', 'workbook.xml')
    with open(path_to_workbook, 'r') as f:
        xml = f.read()
    dictionary = xmltodict.parse(xml)
    sheet_entries = dictionary['workbook']['sheets']['sheet']
    # xmltodict returns a single dict (not a list) when there is only one sheet
    if not isinstance(sheet_entries, list):
        sheet_entries = [sheet_entries]
    for sheet in sheet_entries:
        sheets.append({'id': sheet['@sheetId'], 'name': sheet['@name']})
    # Delete the extracted files directory
    shutil.rmtree(directory_to_extract_to)
    return sheets
Since every xlsx file is essentially a zip archive, we extract the underlying XML and read the sheet names directly from workbook.xml, which takes a fraction of a second compared to the library functions.
Benchmarking (on a 6 MB xlsx file with 4 sheets):
Pandas, xlrd: 12 seconds
openpyxl: 24 seconds
Proposed method: 0.4 seconds
Since my requirement was just reading the sheet names, the unnecessary overhead of reading the entire file was bugging me, so I took this route instead.
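The timings above will vary with the file and the machine. To reproduce a comparison like this on your own files, a small timing helper along these lines could be used. This is only a sketch: time_call is a name I made up, and it is not how the numbers above were measured.

```python
import time

def time_call(func, *args, repeat=3, **kwargs):
    """Call func `repeat` times; return (best elapsed seconds, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeat):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        # Keep the fastest run to reduce noise from other processes.
        best = min(best, time.perf_counter() - start)
    return best, result

# Possible usage (the path is illustrative):
#   seconds, sheets = time_call(get_sheet_details, "/yourPath/FileName.xlsx")
```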
Answer by divingTobi
Building on @dhwanil_shah's answer, you do not need to extract the whole file. With zf.open it is possible to read from a zipped file directly.
import xml.etree.ElementTree as ET
import zipfile

def xlsxSheets(f):
    zf = zipfile.ZipFile(f)
    f = zf.open(r'xl/workbook.xml')
    l = f.readline()  # first line: the XML declaration, discarded
    l = f.readline()  # second line: the workbook element itself
    root = ET.fromstring(l)
    sheets = []
    for c in root.findall('{http://schemas.openxmlformats.org/spreadsheetml/2006/main}sheets/*'):
        sheets.append(c.attrib['name'])
    return sheets
The two consecutive readline calls are ugly, but the content is only in the second line of the text, so there is no need to parse the whole file.
This solution seems to be much faster than the read_excel version, and most likely also faster than the full-extract version.
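If relying on the content being on the second line feels fragile, a variant that lets ElementTree parse workbook.xml as a whole might look like the sketch below. It is not the answer's original code: the function name xlsx_sheet_names is mine, and it assumes the workbook part sits at the conventional xl/workbook.xml path. workbook.xml is small, so parsing it fully should still avoid reading any cell data.

```python
import xml.etree.ElementTree as ET
import zipfile

# Namespace used by SpreadsheetML workbook parts.
MAIN_NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"

def xlsx_sheet_names(path_or_file):
    """Return sheet names from an .xlsx file without loading any sheet data."""
    with zipfile.ZipFile(path_or_file) as zf:
        # Assumption: the workbook part is stored at the conventional path.
        with zf.open("xl/workbook.xml") as workbook:
            root = ET.parse(workbook).getroot()
    sheet_path = MAIN_NS + "sheets/" + MAIN_NS + "sheet"
    return [sheet.attrib["name"] for sheet in root.findall(sheet_path)]
```

The context managers also close the archive, which the shorter version above leaves open.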
Answer by S.E.A
This is the fastest way I have found, inspired by @divingTobi's answer. All the answers based on xlrd, openpyxl or pandas are slow for me, as they all load the whole file first.
from zipfile import ZipFile
from bs4 import BeautifulSoup  # you also need to install "lxml" for the XML parser

# `file` is the path to the .xlsx file
with ZipFile(file) as zipped_file:
    summary = zipped_file.open(r'xl/workbook.xml').read()
soup = BeautifulSoup(summary, "xml")
sheets = [sheet.get("name") for sheet in soup.find_all("sheet")]