Python: ExcelFile vs. read_excel in pandas

Question

Asked by Optimesh

I'm diving into pandas and experimenting. When it comes to reading data from an Excel file, I wonder what the difference is between using ExcelFile and read_excel. Both seem to work (albeit with slightly different syntax, as could be expected), and the documentation supports both. In both cases, the documentation describes the method the same way: "Read an Excel table into DataFrame" and "Read an Excel table into a pandas DataFrame". (documentation for read_excel, and for excel_file)

I'm seeing answers here on SO that use either one without addressing the difference. Also, a Google search didn't turn up anything that discusses this issue.

WRT my testing, these seem equivalent:

import pandas as pd

path = "test/dummydata.xlsx"
xl = pd.ExcelFile(path)
df = xl.parse("dummydata")  # sheet name

and

path = "test/dummydata.xlsx"
df = pd.read_excel(path, sheet_name=0)  # first sheet; older pandas spelled this keyword sheetname

Other than the fact that the latter saves me a line, is there a difference between the two, and is there a reason to prefer one over the other?

Thanks!

Answer 1

Answered by Bob Haffner

I believe pandas' first Excel implementation used the two-step process, and the one-step read_excel function was added later. The first one was probably left in because folks were already using it.

Answer 2

Answered by Pranav

ExcelFile.parse is faster.

Suppose you are reading DataFrames in a loop. With ExcelFile.parse you just pass the ExcelFile object (xl in your case), so the workbook is loaded only once and you use it to get all your DataFrames. With read_excel you pass the path instead of the ExcelFile object, so the workbook is loaded again on every call. That gets painful if your workbook has lots of sheets and tens of thousands of rows.
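The load-once versus load-every-time difference can be sketched with a toy model. This is not pandas code: ToyWorkbook and toy_read are hypothetical stand-ins that only count how often a "workbook" gets opened, assuming the reader wraps a path in a fresh object on every call and reuses an object it is handed.

```python
class ToyWorkbook:
    """Hypothetical stand-in for an opened workbook; counts how often one is created."""

    opens = 0  # class-level counter shared by all instances

    def __init__(self, path):
        ToyWorkbook.opens += 1  # pretend this is the expensive file load
        self.path = path
        self.sheet_names = ["a", "b", "c"]

    def parse(self, name):
        # A real parser would return a DataFrame; a string is enough here.
        return f"{self.path}:{name}"


def toy_read(io, name):
    """Hypothetical one-step reader: wraps a path, reuses an existing object."""
    if not isinstance(io, ToyWorkbook):
        io = ToyWorkbook(io)
    return io.parse(name)


# Passing the path on every call opens the workbook once per sheet.
ToyWorkbook.opens = 0
for name in ["a", "b", "c"]:
    toy_read("book.xlsx", name)
opens_with_path = ToyWorkbook.opens

# Creating the object once and passing it in opens the workbook a single time.
ToyWorkbook.opens = 0
wb = ToyWorkbook("book.xlsx")
for name in wb.sheet_names:
    toy_read(wb, name)
opens_with_object = ToyWorkbook.opens

print(opens_with_path, opens_with_object)
```

In this toy model the cost depends only on how many times the workbook object is constructed, not on whether the one-step or the two-step spelling is used.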

Answer 3

Answered by John Y

There's no particular difference beyond the syntax. Technically, ExcelFile is a class and read_excel is a function. In either case, the actual parsing is handled by the _parse_excel method defined within ExcelFile.

In earlier versions of pandas, read_excel consisted entirely of a single statement (other than comments):

return ExcelFile(path_or_buf,kind=kind).parse(sheetname=sheetname,
                                              kind=kind, **kwds)

And ExcelFile.parse didn't do much more than call ExcelFile._parse_excel.

In recent versions of pandas, read_excel ensures that it has an ExcelFile object (and creates one if it doesn't), and then calls the _parse_excel method directly:

if not isinstance(io, ExcelFile):
    io = ExcelFile(io, engine=engine)

return io._parse_excel(...)

and with the updated (and unified) parameter handling, ExcelFile.parse really is just the single statement:

return self._parse_excel(...)

That is why the docs for ExcelFile.parse now say

Equivalent to read_excel(ExcelFile, ...) See the read_excel docstring for more info on accepted parameters

As for another answer which claims that ExcelFile.parse is faster in a loop, that really just comes down to whether you are creating the ExcelFile object from scratch every time. You could certainly create your ExcelFile once, outside the loop, and pass that to read_excel inside your loop:

xl = pd.ExcelFile(path)
for name in xl.sheet_names:
    df = pd.read_excel(xl, name)

This would be equivalent to

xl = pd.ExcelFile(path)
for name in xl.sheet_names:
    df = xl.parse(name)

If your loop involves different paths (in other words, you are reading many different workbooks, not just multiple sheets within a single workbook), then you can't get around having to create a brand-new ExcelFile instance for each path anyway, and then once again, ExcelFile.parse and read_excel will be equivalent (and equally slow).
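A minimal sketch of the many-workbooks case, assuming pandas with an .xlsx engine such as openpyxl is installed. The helper read_all_sheets and the throwaway workbooks book0.xlsx and book1.xlsx are made up for illustration; the point is only that each path gets exactly one ExcelFile, which is then reused for every sheet in that workbook.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_all_sheets(paths):
    """Return {path: {sheet_name: DataFrame}}, opening each workbook once."""
    result = {}
    for path in paths:
        # One ExcelFile per workbook; the with-block closes the file afterwards.
        with pd.ExcelFile(path) as xl:
            result[str(path)] = {name: xl.parse(name) for name in xl.sheet_names}
    return result


# Build two small throwaway workbooks so the sketch is self-contained.
tmp_dir = Path(tempfile.mkdtemp())
paths = []
for i in range(2):
    path = tmp_dir / f"book{i}.xlsx"
    with pd.ExcelWriter(path) as writer:
        pd.DataFrame({"a": [i, i + 1]}).to_excel(writer, sheet_name="first", index=False)
        pd.DataFrame({"b": [10 * i]}).to_excel(writer, sheet_name="second", index=False)
    paths.append(path)

frames = read_all_sheets(paths)
for path, sheets in frames.items():
    print(path, list(sheets))
```

Replacing xl.parse(name) with pd.read_excel(xl, name) inside the helper should behave the same way, since both reuse the already-open ExcelFile.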
