
Disclaimer: this page is a mirror of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/35823835/

Date: 2020-08-19 17:01:03 · Source: igfitidea

Reading Excel file is magnitudes slower using openpyxl compared to xlrd

Tags: python, xlrd, openpyxl

Asked by Ron Johnson

I have an Excel spreadsheet that I need to import into SQL Server on a daily basis. The spreadsheet will contain around 250,000 rows across around 50 columns. I have tested both openpyxl and xlrd using nearly identical code.


Here's the code I'm using (minus debugging statements):


import xlrd
import openpyxl

def UseXlrd(file_name):
    workbook = xlrd.open_workbook(file_name, on_demand=True)
    worksheet = workbook.sheet_by_index(0)
    first_row = []
    for col in range(worksheet.ncols):
        first_row.append(worksheet.cell_value(0,col))
    data = []
    for row in range(1, worksheet.nrows):
        record = {}
        for col in range(worksheet.ncols):
            if isinstance(worksheet.cell_value(row,col), str):
                record[first_row[col]] = worksheet.cell_value(row,col).strip()
            else:
                record[first_row[col]] = worksheet.cell_value(row,col)
        data.append(record)
    return data


def UseOpenpyxl(file_name):
    wb = openpyxl.load_workbook(file_name, read_only=True)
    sheet = wb.active
    first_row = []
    for col in range(1,sheet.max_column+1):
        first_row.append(sheet.cell(row=1,column=col).value)
    data = []
    for r in range(2,sheet.max_row+1):
        record = {}
        for col in range(sheet.max_column):
            if isinstance(sheet.cell(row=r,column=col+1).value, str):
                record[first_row[col]] = sheet.cell(row=r,column=col+1).value.strip()
            else:
                record[first_row[col]] = sheet.cell(row=r,column=col+1).value
        data.append(record)
    return data

xlrd_results = UseXlrd('foo.xls')
openpyxl_results = UseOpenpyxl('foo.xls')

Passing the same Excel file containing 3500 rows gives drastically different run times. Using xlrd I can read the entire file into a list of dictionaries in under 2 seconds. Using openpyxl I get the following results:


Reading Excel File...
Read 100 lines in 114.14509415626526 seconds
Read 200 lines in 471.43183994293213 seconds
Read 300 lines in 982.5288782119751 seconds
Read 400 lines in 1729.3348784446716 seconds
Read 500 lines in 2774.886833190918 seconds
Read 600 lines in 4384.074863195419 seconds
Read 700 lines in 6396.7723388671875 seconds
Read 800 lines in 7998.775000572205 seconds
Read 900 lines in 11018.460735321045 seconds

While I can use xlrd in the final script, I would have to hard-code a lot of formatting because of various issues (i.e. int reads as float, date reads as int, datetime reads as float). Since I need to reuse this code for a few more imports, it doesn't make sense to hard-code specific columns to format them properly and then maintain similar code across 4 different scripts.

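The date problem, at least, is mechanical to fix: xlrd hands back the raw Excel serial number, which is just a day count from a fixed epoch. A minimal stdlib sketch of the conversion, mirroring what xlrd.xldate.xldate_as_datetime does (the helper name here is mine, not part of xlrd):

```python
from datetime import datetime, timedelta

def excel_serial_to_datetime(serial, datemode=0):
    # datemode 0 = 1900 date system (Windows), 1 = 1904 system (old Mac).
    # The 1899-12-30 epoch absorbs Excel's fictitious 1900-02-29 leap day,
    # so the formula is exact for serials >= 61 (dates from 1900-03-01 on).
    epoch = datetime(1899, 12, 30) if datemode == 0 else datetime(1904, 1, 1)
    return epoch + timedelta(days=serial)
```

The fractional part of the serial carries the time of day, so 43831.5 comes out as noon on 2020-01-01.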

Any advice on how to proceed?


Accepted answer by Mike Müller

You can just iterate over the sheet:


def UseOpenpyxl(file_name):
    wb = openpyxl.load_workbook(file_name, read_only=True)
    sheet = wb.active
    rows = sheet.rows
    first_row = [cell.value for cell in next(rows)]
    data = []
    for row in rows:
        record = {}
        for key, cell in zip(first_row, row):
            if cell.data_type == 's':
                record[key] = cell.value.strip()
            else:
                record[key] = cell.value
        data.append(record)
    return data

This should scale to large files. You may want to chunk your result if the list data gets too large.

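One way to chunk is to wrap the row iterator in a small stdlib helper, so only one batch of records ever needs to be in memory; a sketch (the chunked helper is mine, not part of openpyxl):

```python
from itertools import islice

def chunked(iterable, size):
    # Yield successive lists of at most `size` items from any iterable.
    # Works on lazy iterators such as sheet.rows without materializing them.
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch
```

Each batch of records can then be inserted into the database and discarded before the next one is read from the sheet.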

Now the openpyxl version takes about twice as long as the xlrd one:


%timeit xlrd_results = UseXlrd('foo.xlsx')
1 loops, best of 3: 3.38 s per loop

%timeit openpyxl_results = UseOpenpyxl('foo.xlsx')
1 loops, best of 3: 6.87 s per loop

Note that xlrd and openpyxl might interpret what is an integer and what is a float slightly differently. For my test data, I needed to add float() to make the outputs comparable:


def UseOpenpyxl(file_name):
    wb = openpyxl.load_workbook(file_name, read_only=True)
    sheet = wb.active
    rows = sheet.rows
    first_row = [float(cell.value) for cell in next(rows)]
    data = []
    for row in rows:
        record = {}
        for key, cell in zip(first_row, row):
            if cell.data_type == 's':
                record[key] = cell.value.strip()
            else:
                record[key] = float(cell.value)
        data.append(record)
    return data

Now, both versions give the same results for my test data:


>>> xlrd_results == openpyxl_results
True

Answered by MaxU

It sounds to me like a perfect candidate for the Pandas module:


import pandas as pd
from sqlalchemy import create_engine
import pyodbc

# pyodbc
#
# assuming the following:
# username: scott
# password: tiger
# DSN: mydsn
engine = create_engine('mssql+pyodbc://scott:tiger@mydsn')

# pymssql
#
#engine = create_engine('mssql+pymssql://scott:tiger@hostname:port/dbname')


df = pd.read_excel('foo.xls')

# write the DataFrame to a table in the sql database
df.to_sql("table_name", engine)

Description of the DataFrame.to_sql() function


P.S. It should be pretty fast and very easy to use.

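If Pandas is not an option, the same bulk load can be done with a plain DBAPI executemany call on the record dicts the earlier functions produce; a sketch using the stdlib's sqlite3 as a stand-in for the SQL Server connection (the table name and helper are hypothetical):

```python
import sqlite3

def load_records(conn, table, records):
    # Bulk-insert a list of dicts (as produced by UseOpenpyxl) in one
    # executemany call. Column names come from the spreadsheet header,
    # so only use this with trusted input.
    cols = list(records[0])
    placeholders = ", ".join(["?"] * len(cols))
    sql = f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({placeholders})"
    conn.executemany(sql, [tuple(r[c] for c in cols) for r in records])
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a, b)")
load_records(conn, "t", [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}])
```

With pyodbc against SQL Server, enabling fast_executemany on the cursor makes the same pattern considerably faster.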

Answered by soartseng

You call sheet.max_column and sheet.max_row several times. Don't do that; call each one only once. If you call them inside a for loop, max_column or max_row will be recalculated on every iteration.


I modified it as below for your reference:


def UseOpenpyxl(file_name):
    wb = openpyxl.load_workbook(file_name, read_only=True)
    sheet = wb.active
    max_col = sheet.max_column
    max_row = sheet.max_row
    first_row = []
    for col in range(1,max_col +1):
        first_row.append(sheet.cell(row=1,column=col).value)
    data = []
    for r in range(2,max_row +1):
        record = {}
        for col in range(max_col):
            if isinstance(sheet.cell(row=r,column=col+1).value, str):
                record[first_row[col]] = sheet.cell(row=r,column=col+1).value.strip()
            else:
                record[first_row[col]] = sheet.cell(row=r,column=col+1).value
        data.append(record)
    return data
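To confirm that caching max_row/max_column actually helps, the two variants can be timed side by side with a small stdlib stand-in for IPython's %timeit (the helper name is mine):

```python
import time

def best_time(fn, *args, repeats=3):
    # Run fn(*args) `repeats` times and return the best wall-clock time
    # in seconds; the minimum is the least noisy estimate of cost.
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best
```

For example, best_time(UseOpenpyxl, 'foo.xlsx') before and after the change shows how much of the cost was the repeated dimension calculation.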