
Disclaimer: this page is a mirror of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/35823835/

Date: 2020-08-19 17:01:03 · Source: igfitidea

Reading Excel file is magnitudes slower using openpyxl compared to xlrd

Tags: python, xlrd, openpyxl

Asked by Ron Johnson

I have an Excel spreadsheet that I need to import into SQL Server on a daily basis. The spreadsheet will contain around 250,000 rows across around 50 columns. I have tested both openpyxl and xlrd using nearly identical code.


Here's the code I'm using (minus debugging statements):


import xlrd
import openpyxl

def UseXlrd(file_name):
    workbook = xlrd.open_workbook(file_name, on_demand=True)
    worksheet = workbook.sheet_by_index(0)
    first_row = []
    for col in range(worksheet.ncols):
        first_row.append(worksheet.cell_value(0,col))
    data = []
    for row in range(1, worksheet.nrows):
        record = {}
        for col in range(worksheet.ncols):
            if isinstance(worksheet.cell_value(row,col), str):
                record[first_row[col]] = worksheet.cell_value(row,col).strip()
            else:
                record[first_row[col]] = worksheet.cell_value(row,col)
        data.append(record)
    return data


def UseOpenpyxl(file_name):
    wb = openpyxl.load_workbook(file_name, read_only=True)
    sheet = wb.active
    first_row = []
    for col in range(1,sheet.max_column+1):
        first_row.append(sheet.cell(row=1,column=col).value)
    data = []
    for r in range(2,sheet.max_row+1):
        record = {}
        for col in range(sheet.max_column):
            if isinstance(sheet.cell(row=r,column=col+1).value, str):
                record[first_row[col]] = sheet.cell(row=r,column=col+1).value.strip()
            else:
                record[first_row[col]] = sheet.cell(row=r,column=col+1).value
        data.append(record)
    return data

xlrd_results = UseXlrd('foo.xls')
openpyxl_results = UseOpenpyxl('foo.xls')

Passing the same Excel file containing 3500 rows gives drastically different run times. Using xlrd I can read the entire file into a list of dictionaries in under 2 seconds. Using openpyxl I get the following results:


Reading Excel File...
Read 100 lines in 114.14509415626526 seconds
Read 200 lines in 471.43183994293213 seconds
Read 300 lines in 982.5288782119751 seconds
Read 400 lines in 1729.3348784446716 seconds
Read 500 lines in 2774.886833190918 seconds
Read 600 lines in 4384.074863195419 seconds
Read 700 lines in 6396.7723388671875 seconds
Read 800 lines in 7998.775000572205 seconds
Read 900 lines in 11018.460735321045 seconds

While I can use xlrd in the final script, I would have to hard-code a lot of formatting because of various issues (i.e. int reads as float, date reads as int, datetime reads as float). Since I need to reuse this code for a few more imports, it doesn't make sense to hard-code specific columns to format them properly and then maintain similar code across 4 different scripts.

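The date problem, at least, is mechanical to fix: xlrd hands back the raw Excel serial number, which is just a day count from a fixed epoch. A minimal stdlib sketch of the conversion, mirroring what xlrd.xldate.xldate_as_datetime does (the helper name here is mine, not part of xlrd):

```python
from datetime import datetime, timedelta

def excel_serial_to_datetime(serial, datemode=0):
    # datemode 0 = 1900 date system (Windows), 1 = 1904 system (old Mac).
    # The 1899-12-30 epoch absorbs Excel's fictitious 1900-02-29 leap day,
    # so the formula is exact for serials >= 61 (dates from 1900-03-01 on).
    epoch = datetime(1899, 12, 30) if datemode == 0 else datetime(1904, 1, 1)
    return epoch + timedelta(days=serial)
```

The fractional part of the serial carries the time of day, so 43831.5 comes out as noon on 2020-01-01.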

Any advice on how to proceed?


Accepted answer by Mike Müller

You can just iterate over the sheet:


def UseOpenpyxl(file_name):
    wb = openpyxl.load_workbook(file_name, read_only=True)
    sheet = wb.active
    rows = sheet.rows
    first_row = [cell.value for cell in next(rows)]
    data = []
    for row in rows:
        record = {}
        for key, cell in zip(first_row, row):
            if cell.data_type == 's':
                record[key] = cell.value.strip()
            else:
                record[key] = cell.value
        data.append(record)
    return data

This should scale to large files. You may want to chunk your result if the list data gets too large.

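One way to chunk is to wrap the row iterator in a small stdlib helper, so only one batch of records ever needs to be in memory; a sketch (the chunked helper is mine, not part of openpyxl):

```python
from itertools import islice

def chunked(iterable, size):
    # Yield successive lists of at most `size` items from any iterable.
    # Works on lazy iterators such as sheet.rows without materializing them.
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch
```

Each batch of records can then be inserted into the database and discarded before the next one is read from the sheet.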

Now the openpyxl version takes about twice as long as the xlrd one:


%timeit xlrd_results = UseXlrd('foo.xlsx')
1 loops, best of 3: 3.38 s per loop

%timeit openpyxl_results = UseOpenpyxl('foo.xlsx')
1 loops, best of 3: 6.87 s per loop

Note that xlrd and openpyxl might interpret what is an integer and what is a float slightly differently. For my test data, I needed to add float() to make the outputs comparable:


def UseOpenpyxl(file_name):
    wb = openpyxl.load_workbook(file_name, read_only=True)
    sheet = wb.active
    rows = sheet.rows
    first_row = [float(cell.value) for cell in next(rows)]
    data = []
    for row in rows:
        record = {}
        for key, cell in zip(first_row, row):
            if cell.data_type == 's':
                record[key] = cell.value.strip()
            else:
                record[key] = float(cell.value)
        data.append(record)
    return data

Now, both versions give the same results for my test data:


>>> xlrd_results == openpyxl_results
True

Answered by MaxU

It sounds to me like a perfect candidate for the Pandas module:


import pandas as pd
from sqlalchemy import create_engine
import pyodbc

# pyodbc
#
# assuming the following:
# username: scott
# password: tiger
# DSN: mydsn
engine = create_engine('mssql+pyodbc://scott:tiger@mydsn')

# pymssql
#
#engine = create_engine('mssql+pymssql://scott:tiger@hostname:port/dbname')


df = pd.read_excel('foo.xls')

# write the DataFrame to a table in the sql database
df.to_sql("table_name", engine)

Description of the DataFrame.to_sql() function


P.S. It should be pretty fast and very easy to use.

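If Pandas is not an option, the same bulk load can be done with a plain DBAPI executemany call on the record dicts the earlier functions produce; a sketch using the stdlib's sqlite3 as a stand-in for the SQL Server connection (the table name and helper are hypothetical):

```python
import sqlite3

def load_records(conn, table, records):
    # Bulk-insert a list of dicts (as produced by UseOpenpyxl) in one
    # executemany call. Column names come from the spreadsheet header,
    # so only use this with trusted input.
    cols = list(records[0])
    placeholders = ", ".join(["?"] * len(cols))
    sql = f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({placeholders})"
    conn.executemany(sql, [tuple(r[c] for c in cols) for r in records])
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a, b)")
load_records(conn, "t", [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}])
```

With pyodbc against SQL Server, enabling fast_executemany on the cursor makes the same pattern considerably faster.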

Answered by soartseng

You call sheet.max_column and sheet.max_row several times. Don't do that; call each one only once. If you call them inside a for loop, max_column or max_row will be recalculated on every iteration.


I modified it as below for your reference:


def UseOpenpyxl(file_name):
    wb = openpyxl.load_workbook(file_name, read_only=True)
    sheet = wb.active
    max_col = sheet.max_column
    max_row = sheet.max_row
    first_row = []
    for col in range(1,max_col +1):
        first_row.append(sheet.cell(row=1,column=col).value)
    data = []
    for r in range(2,max_row +1):
        record = {}
        for col in range(max_col):
            if isinstance(sheet.cell(row=r,column=col+1).value, str):
                record[first_row[col]] = sheet.cell(row=r,column=col+1).value.strip()
            else:
                record[first_row[col]] = sheet.cell(row=r,column=col+1).value
        data.append(record)
    return data
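To confirm that caching max_row/max_column actually helps, the two variants can be timed side by side with a small stdlib stand-in for IPython's %timeit (the helper name is mine):

```python
import time

def best_time(fn, *args, repeats=3):
    # Run fn(*args) `repeats` times and return the best wall-clock time
    # in seconds; the minimum is the least noisy estimate of cost.
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best
```

For example, best_time(UseOpenpyxl, 'foo.xlsx') before and after the change shows how much of the cost was the repeated dimension calculation.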