Original question: http://stackoverflow.com/questions/26474693/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
ExcelFile Vs. read_excel in pandas
Asked by Optimesh
I'm diving into pandas and experimenting with it. For reading data from an Excel file, I wonder what the difference is between using ExcelFile and read_excel. Both seem to work (albeit with slightly different syntax, as could be expected), and the documentation supports both. In both cases, the documentation describes the method the same way: "Read an Excel table into DataFrame" and "Read an Excel table into a pandas DataFrame". (See the documentation for read_excel and for ExcelFile.)
I'm seeing answers here on SO that use either one without addressing the difference. Also, a Google search didn't turn up a result that discusses this issue.
WRT my testing, these seem equivalent:
path = "test/dummydata.xlsx"
xl = pd.ExcelFile(path)
df = xl.parse("dummydata") # sheet name
and
path = "test/dummydata.xlsx"
df = pd.io.excel.read_excel(path, sheetname=0)
Other than the fact that the latter saves me a line, is there a difference between the two, and is there a reason to prefer one over the other?
Thanks!
Answer by Bob Haffner
I believe pandas' first implementation of Excel reading used the two-step process, and the one-step function, read_excel, was added later. The first one was probably left in because people were already using it.
Answer by Pranav
ExcelFile.parse is faster.
Suppose you are reading dataframes in a loop.
With ExcelFile.parse you just pass the ExcelFile object (xl in your case). So the workbook is loaded only once, and you use it to get your dataframes.
With read_excel you pass the path instead of an ExcelFile object, so the workbook is essentially loaded again on every call. That becomes a problem if your workbook has lots of sheets and tens of thousands of rows.
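The cost difference described above can be illustrated with a toy model that needs no Excel file. This is only a sketch: ToyWorkbook and toy_read are hypothetical stand-ins invented for this illustration, not pandas APIs, and the counter simply tracks how often the "workbook" gets opened.

```python
class ToyWorkbook:
    """Hypothetical stand-in for an opened workbook; opening is the costly step."""

    loads = 0  # how many times a workbook has been opened

    def __init__(self, path):
        ToyWorkbook.loads += 1
        self.path = path
        # Pretend the file contains three small sheets.
        self.sheets = {"a": [1, 2], "b": [3, 4], "c": [5, 6]}

    def parse(self, name):
        return self.sheets[name]


def toy_read(io, name):
    """Accept either a path or an already opened ToyWorkbook (an assumption of this sketch)."""
    if not isinstance(io, ToyWorkbook):
        io = ToyWorkbook(io)  # a bare path forces a fresh load on every call
    return io.parse(name)


def count_loads(reuse):
    """Read all three sheets and report how many times the workbook was opened."""
    ToyWorkbook.loads = 0
    source = ToyWorkbook("book.xlsx") if reuse else "book.xlsx"
    for name in ("a", "b", "c"):
        toy_read(source, name)
    return ToyWorkbook.loads
```

In this model, passing the path on each iteration opens the workbook once per sheet, while passing one opened object opens it a single time for the whole loop.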
Answer by John Y
There's no particular difference beyond the syntax. Technically, ExcelFile is a class and read_excel is a function. In either case, the actual parsing is handled by the _parse_excel method defined within ExcelFile.
In earlier versions of pandas, read_excel consisted entirely of a single statement (other than comments):
return ExcelFile(path_or_buf,kind=kind).parse(sheetname=sheetname,
kind=kind, **kwds)
And ExcelFile.parse didn't do much more than call ExcelFile._parse_excel.
In recent versions of pandas, read_excel ensures that it has an ExcelFile object (and creates one if it doesn't), and then calls the _parse_excel method directly:
if not isinstance(io, ExcelFile):
    io = ExcelFile(io, engine=engine)
return io._parse_excel(...)
and with the updated (and unified) parameter handling, ExcelFile.parse really is just the single statement:
return self._parse_excel(...)
That is why the docs for ExcelFile.parse now say:
Equivalent to read_excel(ExcelFile, ...) See the read_excel docstring for more info on accepted parameters
As for another answer which claims that ExcelFile.parse is faster in a loop, that really just comes down to whether you are creating the ExcelFile object from scratch every time. You could certainly create your ExcelFile once, outside the loop, and pass that to read_excel inside your loop:
xl = pd.ExcelFile(path)
for name in xl.sheet_names:
    df = pd.read_excel(xl, name)
This would be equivalent to
xl = pd.ExcelFile(path)
for name in xl.sheet_names:
    df = xl.parse(name)
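As a sketch of how the single-workbook loop might look in current pandas, the helper below opens the workbook once and collects every sheet into a dict. The function name read_all_sheets is made up for this illustration. It assumes a pandas version where the keyword is spelled sheet_name (older releases, like the ones quoted in this thread, used sheetname) and an installed Excel engine such as openpyxl for .xlsx files.

```python
import pandas as pd


def read_all_sheets(path):
    """Return {sheet name: DataFrame}, opening the workbook only once."""
    # ExcelFile can be used as a context manager, so the file is closed on exit.
    with pd.ExcelFile(path) as xl:
        return {name: pd.read_excel(xl, sheet_name=name) for name in xl.sheet_names}
```

If all you need is every sheet with the same options, pd.read_excel(path, sheet_name=None) also returns a dict of DataFrames keyed by sheet name in a single call.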
If your loop involves different paths (in other words, you are reading many different workbooks, not just multiple sheets within a single workbook), then you can't get around having to create a brand-new ExcelFile instance for each path anyway, and then once again, both ExcelFile.parse and read_excel will be equivalent (and equally slow).

