如何在 Pandas 中使用 read_excel 提高处理速度?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/50695778/

Date: 2020-09-14 05:38:59 | Source: igfitidea

How to increase process speed using read_excel in pandas?

Tags: python, excel, pandas, performance, dataframe

Asked by james.peng

I need to use pd.read_excel to process every sheet in one Excel file.
But in most cases I do not know the sheet names,
so I use this to work out how many sheets the file has:


i_sheet_count = 0
i = 0
while True:
    try:
        pd.read_excel('/tmp/1.xlsx', sheetname=i)
        i_sheet_count += 1
        i += 1
    except Exception:
        break
print(i_sheet_count)

During the process, I found that it is quite slow.
So, can read_excel read only a limited number of rows to improve the speed?
I tried nrows but it did not work.. still slow..


Answer by jpp

Read all worksheets without guessing


Use the sheetname=None argument to pd.read_excel. This will read all worksheets into a dictionary of dataframes. For example:


dfs = pd.read_excel('file.xlsx', sheetname=None)

# access 'Sheet1' worksheet
res = dfs['Sheet1']
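This also removes the need for the try/except counting in the question: the number of sheets is just the size of the dictionary. A minimal sketch, using a made-up demo workbook and the newer keyword spelling sheet_name (the parameter was renamed from sheetname in later pandas releases):

```python
import pandas as pd

# Build a small two-sheet demo workbook (hypothetical file name and data).
with pd.ExcelWriter('demo.xlsx') as writer:
    pd.DataFrame({'a': [1, 2]}).to_excel(writer, sheet_name='Sheet1', index=False)
    pd.DataFrame({'b': [3]}).to_excel(writer, sheet_name='Sheet2', index=False)

# sheet_name=None (sheetname=None in older pandas) reads every sheet at once.
dfs = pd.read_excel('demo.xlsx', sheet_name=None)

print(len(dfs))    # number of worksheets, no try/except counting needed
print(list(dfs))   # the sheet names
```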

Limit number of rows or columns


You can use the parse_cols and skip_footer arguments to limit the number of columns and/or rows. This will reduce read time, and also works with sheetname=None.


For example, the following will read the first 3 columns and, if your worksheet has 100 rows, it will read only the first 20.


df = pd.read_excel('file.xlsx', sheetname=None, parse_cols='A:C', skip_footer=80)
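In newer pandas releases these arguments were renamed (parse_cols became usecols, skip_footer became skipfooter), so the equivalent call might look like the sketch below; the file name and data here are made up for illustration:

```python
import pandas as pd

# Write a 100-row, 4-column demo sheet (hypothetical data).
pd.DataFrame({c: range(100) for c in 'abcd'}).to_excel('demo_cols.xlsx', index=False)

# Modern spellings: parse_cols -> usecols, skip_footer -> skipfooter.
df = pd.read_excel('demo_cols.xlsx', usecols='A:C', skipfooter=80)
print(df.shape)    # first 3 columns, last 80 of 100 rows skipped
```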

If you wish to apply worksheet-specific logic, you can do so by extracting the sheet names:


sheet_names = pd.ExcelFile('file.xlsx').sheet_names

dfs = {}
for sheet in sheet_names:
    dfs[sheet] = pd.read_excel('file.xlsx', sheet)

Improving performance


Reading Excel files into Pandas is naturally slower than other options (CSV, Pickle, HDF5). If you wish to improve performance, I strongly suggest you consider these other formats.

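As an illustration of that point, a one-time conversion after the slow Excel read makes every later run fast. This is only a sketch with made-up file names and data standing in for the Excel result:

```python
import pandas as pd

# Stand-in for the result of a slow pd.read_excel(...) call.
df = pd.DataFrame({'x': range(1000), 'y': range(1000)})

# Write once after the slow Excel read...
df.to_pickle('cache.pkl')

# ...then every later run reloads in a fraction of the time.
df2 = pd.read_pickle('cache.pkl')
print(df.equals(df2))   # the round trip is lossless
```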

One option, for example, is to use a VBA script to convert your Excel worksheets to CSV files; then use pd.read_csv.


Answer by Ricardo

Hi james,


I'm running into pandas vs Excel (rather, pandas against Excel) right now. Here's my approach.


Sheet names


In order to avoid try overhead, I'm reading all sheet names with this:


import xlrd
xls = xlrd.open_workbook('file.xlsx', on_demand=True)
Labels = xls.sheet_names()

The on_demand=Trueparameter ensures that no actual data read will occur until I absolutely need it, which is good because all I need here is the list of sheet names.

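(A caveat: recent xlrd releases, 2.0 and later, dropped .xlsx support, so a similar lazy listing can be sketched with openpyxl's read_only mode; the workbook built here is just for illustration:)

```python
from openpyxl import Workbook, load_workbook

# Create a small demo workbook (hypothetical sheet names).
wb = Workbook()
wb.active.title = 'First'
wb.create_sheet('Second')
wb.save('demo_names.xlsx')

# read_only=True avoids loading cell data, much like xlrd's on_demand=True.
names = load_workbook('demo_names.xlsx', read_only=True).sheetnames
print(names)
```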

Data read


Enter pandas. My issue is, I believe, worse than yours, as I have multiple data blobs on each sheet and I need to pinpoint each of them, resulting in crazy loops and coordinate pinpointing. But assuming your case is a simplified (and mentally saner) version of this, you could simply do as above:


dfs = {}
for sheet in sheet_names:
    dfs[sheet] = pd.read_excel('file.xlsx', sheet)

In fact, looking at jpp's solution for reading the sheet names, I'm thinking of borrowing it (imitation is the sincerest form of flattery). I was already doing the dictionary thing, in order to keep the sheet names somewhere.


Performance


Finally, how do I deal with what seems to me an excruciatingly slow experience? As I mentioned, my reads are more complex, but my source file is only one and not changing.


With this in mind, what I do is, as soon as I finish reading, I export everything to csv. For some mystical reason, which I can only guess to be related to the name Microsoft, even reading csv with all the text parsing is many times faster than xlsx.


My exporting code is this:


# Labels holds the sheet names read earlier; CSVS the matching csv file
# names; data the dict of dataframes, one per sheet.
from glob import glob

if glob('*.csv') != CSVS:
    for label, csvlabel in zip(Labels, CSVS):
        print(f'Exporting {label} to {csvlabel}...')
        data[label].to_csv(csvlabel)

CSVS is a list of csv file names, based on the sheet names (but slightly sanitised). So, essentially, I'm testing for the existence of said csvs, but you could just ignore the if and go on overwriting them.


As for the VBA script, I hope you have your psychiatrist on speed dial. You're going to need drugs after that, or perhaps an autopsy. I would sincerely rather select/copy the Excel data blobs and either paste them into notepad or simply pd.read_clipboard them.
