Python: ExcelFile vs. read_excel in pandas

Notice: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/26474693/

Time: 2020-08-19 00:31:45  Source: igfitidea

ExcelFile Vs. read_excel in pandas

Tags: python, excel, pandas

Asked by Optimesh

I'm diving into pandas and experimenting. When it comes to reading data from an Excel file, I wonder what the difference is between using ExcelFile and read_excel. Both seem to work (albeit with slightly different syntax, as could be expected), and the documentation supports both. In both cases, the documentation describes the method the same way: "Read an Excel table into DataFrame" and "Read an Excel table into a pandas DataFrame" (documentation for read_excel, and for excel_file).

I'm seeing answers here on SO that use either one, without addressing the difference. Also, a Google search didn't turn up a result that discusses this issue.

WRT my testing, these seem equivalent:

import pandas as pd

path = "test/dummydata.xlsx"
xl = pd.ExcelFile(path)
df = xl.parse("dummydata")  # sheet name

and

path = "test/dummydata.xlsx"
df = pd.io.excel.read_excel(path, sheetname=0)  # pandas 0.21+ spells this keyword sheet_name

Other than the fact that the latter saves me a line, is there a difference between the two, and is there a reason to prefer one over the other?

Thanks!

Answered by Bob Haffner

I believe pandas' first implementation of Excel support used the two-step process, and the one-step process, read_excel, was added later. The first one was probably left in because folks were already using it.

Answered by Pranav

ExcelFile.parse is faster.

Suppose you are reading dataframes in a loop. With ExcelFile.parse you just pass the ExcelFile object (xl in your case), so the workbook is loaded only once and you use that object to get your dataframes. With read_excel you pass the path instead of the ExcelFile object, so the workbook is loaded again on every call. That gets painful if your workbook has loads of sheets and tens of thousands of rows.
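
To make the two patterns concrete, here is a minimal sketch rather than a benchmark. The function names are hypothetical, the keyword is written sheet_name (the modern spelling of the sheetname keyword used in the question), and it assumes pandas plus an Excel engine such as openpyxl is installed.

```python
import pandas as pd


def read_sheets_reopening(path, sheet_names):
    """Pass the path on every call, so the workbook is opened once per sheet."""
    return {name: pd.read_excel(path, sheet_name=name) for name in sheet_names}


def read_sheets_reusing(path):
    """Open the workbook once and parse every sheet from the same handle."""
    with pd.ExcelFile(path) as xl:
        return {name: xl.parse(sheet_name=name) for name in xl.sheet_names}
```

Both functions should return the same frames; the difference is only in how many times the file is opened.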

Answered by John Y

There's no particular difference beyond the syntax. Technically, ExcelFile is a class and read_excel is a function. In either case, the actual parsing is handled by the _parse_excel method defined within ExcelFile.

In earlier versions of pandas, read_excel consisted entirely of a single statement (other than comments):

return ExcelFile(path_or_buf,kind=kind).parse(sheetname=sheetname,
                                              kind=kind, **kwds)

And ExcelFile.parse didn't do much more than call ExcelFile._parse_excel.

In recent versions of pandas, read_excel ensures that it has an ExcelFile object (and creates one if it doesn't), and then calls the _parse_excel method directly:

if not isinstance(io, ExcelFile):
    io = ExcelFile(io, engine=engine)

return io._parse_excel(...)

and with the updated (and unified) parameter handling, ExcelFile.parse really is just the single statement:

return self._parse_excel(...)

That is why the docs for ExcelFile.parse now say

Equivalent to read_excel(ExcelFile, ...) See the read_excel docstring for more info on accepted parameters
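
One way to check that equivalence yourself is sketched below. The helper name is hypothetical, sheet_name is the modern keyword spelling, and pandas plus an Excel engine such as openpyxl is assumed to be installed.

```python
import pandas as pd


def same_result_both_ways(path, sheet_name=0):
    """Parse one sheet via read_excel and via ExcelFile.parse, then compare."""
    with pd.ExcelFile(path) as xl:
        via_function = pd.read_excel(xl, sheet_name=sheet_name)
        via_method = xl.parse(sheet_name=sheet_name)
    # DataFrame.equals compares shape, dtypes, and values.
    return via_function.equals(via_method)
```

For an ordinary workbook this is expected to return True, since both routes end up in the same parsing code.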

As for another answer which claims that ExcelFile.parse is faster in a loop, that really just comes down to whether you are creating the ExcelFile object from scratch every time. You could certainly create your ExcelFile once, outside the loop, and pass that to read_excel inside your loop:

xl = pd.ExcelFile(path)
for name in xl.sheet_names:
    df = pd.read_excel(xl, name)

This would be equivalent to

xl = pd.ExcelFile(path)
for name in xl.sheet_names:
    df = xl.parse(name)

If your loop involves different paths (in other words, you are reading many different workbooks, not just multiple sheets within a single workbook), then you can't get around having to create a brand-new ExcelFile instance for each path anyway, and then once again, both ExcelFile.parse and read_excel will be equivalent (and equally slow).
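
The many-workbooks case could be sketched as follows. The function name and the nested-dict return shape are illustrative choices, not anything pandas prescribes; sheet_name is the modern keyword spelling, and an Excel engine such as openpyxl is assumed.

```python
import pandas as pd


def read_many_workbooks(paths):
    """Return {path: {sheet name: DataFrame}} for several workbooks."""
    frames = {}
    for path in paths:
        # Each path needs its own ExcelFile; the sheets inside it share that handle.
        with pd.ExcelFile(path) as xl:
            frames[path] = {
                name: pd.read_excel(xl, sheet_name=name) for name in xl.sheet_names
            }
    return frames
```

Swapping pd.read_excel(xl, sheet_name=name) for xl.parse(sheet_name=name) inside the loop should give the same result; either way each workbook is opened exactly once.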
