java 如何有效地打开一个巨大的excel文件

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/46373895/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-11-03 09:11:22  来源:igfitidea点击:

How to open a huge excel file efficiently

javac#pythonc++excel

提问by David542

I have a 150MB one-sheet excel file that takes about 7 minutes to open on a very powerful machine using the following:

我有一个 150MB 的单页 Excel 文件,在使用以下功能的非常强大的机器上打开大约需要 7 分钟:

# using python
import xlrd
wb = xlrd.open_workbook(file)
sh = wb.sheet_by_index(0)

Is there any way to open the excel file quicker? I'm open to even very outlandish suggestions (such as hadoop, spark, c, java, etc.). Ideally I'm looking for a way to open the file in under 30 seconds if that's not a pipe dream. Also, the above example is using python, but it doesn't have to be python.

有什么办法可以更快地打开excel文件?我对甚至非常古怪的建议(例如 hadoop、spark、c、java 等)持开放态度。理想情况下,如果这不是白日梦,我正在寻找一种在 30 秒内打开文件的方法。另外,上面的例子使用的是python,但它不一定是python。



Note: this is an Excel file from a client. It cannot be converted into any other format before we receive it. It is not our file

注意:这是来自客户端的 Excel 文件。在我们收到它之前,它不能转换成任何其他格式。这不是我们的文件



UPDATE:Answer with a working example of code that will open the following 200MB excel file in under 30 seconds will be rewarded with bounty: https://drive.google.com/file/d/0B_CXvCTOo7_2VW9id2VXRWZrbzQ/view?usp=sharing. This file should have string (col 1), date (col 9), and number (col 11).

更新:使用可在 30 秒内打开以下 200MB excel 文件的有效代码示例的答案将获得奖励:https: //drive.google.com/file/d/0B_CXvCTOo7_2VW9id2VXRWZrbzQ/view?usp =sharing。这个文件应该有字符串(col 1)、日期(col 9)和数字(col 11)。

采纳答案by devsaki

Well, if your excel is going to be as simple as a CSV file like your example (https://drive.google.com/file/d/0B_CXvCTOo7_2UVZxbnpRaEVnaFk/view?usp=sharing), you can try to open the file as a zip file and read directly every xml:

好吧,如果您的 excel 将像您的示例(https://drive.google.com/file/d/0B_CXvCTOo7_2UVZxbnpRaEVnaFk/view?usp=sharing)一样简单,您可以尝试将文件打开为一个 zip 文件并直接读取每个 xml:

Intel i5 4460, 12 GB RAM, SSD Samsung EVO PRO.

英特尔 i5 4460、12 GB 内存、SSD 三星 EVO PRO。

If you have a lot of memory ram:This code needs a lot of ram, but it takes 20~25 seconds. (You need the parameter -Xmx7g)

如果你有很多内存ram:这段代码需要很多ram,但需要20~25秒。(您需要参数 -Xmx7g)

package com.devsaki.opensimpleexcel;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.ZipFile;

public class Multithread {

    public static final char CHAR_END = (char) -1;

    public static void main(String[] args) throws IOException, ExecutionException, InterruptedException {
        String excelFile = "C:/Downloads/BigSpreadsheetAllTypes.xlsx";
        ZipFile zipFile = new ZipFile(excelFile);
        long init = System.currentTimeMillis();
        ExecutorService executor = Executors.newFixedThreadPool(4);
        char[] sheet1 = readEntry(zipFile, "xl/worksheets/sheet1.xml").toCharArray();
        Future<Object[][]> futureSheet1 = executor.submit(() -> processSheet1(new CharReader(sheet1), executor));
        char[] sharedString = readEntry(zipFile, "xl/sharedStrings.xml").toCharArray();
        Future<String[]> futureWords = executor.submit(() -> processSharedStrings(new CharReader(sharedString)));

        Object[][] sheet = futureSheet1.get();
        String[] words = futureWords.get();

        executor.shutdown();

        long end = System.currentTimeMillis();
        System.out.println("only read: " + (end - init) / 1000);

        ///Doing somethin with the file::Saving as csv
        init = System.currentTimeMillis();
        try (PrintWriter writer = new PrintWriter(excelFile + ".csv", "UTF-8");) {
            for (Object[] rows : sheet) {
                for (Object cell : rows) {
                    if (cell != null) {
                        if (cell instanceof Integer) {
                            writer.append(words[(Integer) cell]);
                        } else if (cell instanceof String) {
                            writer.append(toDate(Double.parseDouble(cell.toString())));
                        } else {
                            writer.append(cell.toString()); //Probably a number
                        }
                    }
                    writer.append(";");
                }
                writer.append("\n");
            }
        }
        end = System.currentTimeMillis();
        System.out.println("Main saving to csv: " + (end - init) / 1000);
    }

    private static final DateTimeFormatter formatter = DateTimeFormatter.ISO_DATE_TIME;
    private static final LocalDateTime INIT_DATE = LocalDateTime.parse("1900-01-01T00:00:00+00:00", formatter).plusDays(-2);

    //The number in excel is from 1900-jan-1, so every number time that you get, you have to sum to that date
    public static String toDate(double s) {
        return formatter.format(INIT_DATE.plusSeconds((long) ((s*24*3600))));
    }

    public static String readEntry(ZipFile zipFile, String entry) throws IOException {
        System.out.println("Initialing readEntry " + entry);
        long init = System.currentTimeMillis();
        String result = null;

        try (BufferedReader br = new BufferedReader(new InputStreamReader(zipFile.getInputStream(zipFile.getEntry(entry)), Charset.forName("UTF-8")))) {
            br.readLine();
            result = br.readLine();
        }

        long end = System.currentTimeMillis();
        System.out.println("readEntry '" + entry + "': " + (end - init) / 1000);
        return result;
    }


    public static String[] processSharedStrings(CharReader br) throws IOException {
        System.out.println("Initialing processSharedStrings");
        long init = System.currentTimeMillis();
        String[] words = null;
        char[] wordCount = "Count=\"".toCharArray();
        char[] token = "<t>".toCharArray();
        String uniqueCount = extractNextValue(br, wordCount, '"');
        words = new String[Integer.parseInt(uniqueCount)];
        String nextWord;
        int currentIndex = 0;
        while ((nextWord = extractNextValue(br, token, '<')) != null) {
            words[currentIndex++] = nextWord;
            br.skip(11); //you can skip at least 11 chars "/t></si><si>"
        }
        long end = System.currentTimeMillis();
        System.out.println("SharedStrings: " + (end - init) / 1000);
        return words;
    }


    public static Object[][] processSheet1(CharReader br, ExecutorService executorService) throws IOException, ExecutionException, InterruptedException {
        System.out.println("Initialing processSheet1");
        long init = System.currentTimeMillis();
        char[] dimensionToken = "dimension ref=\"".toCharArray();
        String dimension = extractNextValue(br, dimensionToken, '"');
        int[] sizes = extractSizeFromDimention(dimension.split(":")[1]);
        br.skip(30); //Between dimension and next tag c exists more or less 30 chars
        Object[][] result = new Object[sizes[0]][sizes[1]];

        int parallelProcess = 8;
        int currentIndex = br.currentIndex;
        CharReader[] charReaders = new CharReader[parallelProcess];
        int totalChars = Math.round(br.chars.length / parallelProcess);
        for (int i = 0; i < parallelProcess; i++) {
            int endIndex = currentIndex + totalChars;
            charReaders[i] = new CharReader(br.chars, currentIndex, endIndex, i);
            currentIndex = endIndex;
        }
        Future[] futures = new Future[parallelProcess];
        for (int i = charReaders.length - 1; i >= 0; i--) {
            final int j = i;
            futures[i] = executorService.submit(() -> inParallelProcess(charReaders[j], j == 0 ? null : charReaders[j - 1], result));
        }
        for (Future future : futures) {
            future.get();
        }

        long end = System.currentTimeMillis();
        System.out.println("Sheet1: " + (end - init) / 1000);
        return result;
    }

    public static void inParallelProcess(CharReader br, CharReader back, Object[][] result) {
        System.out.println("Initialing inParallelProcess : " + br.identifier);

        char[] tokenOpenC = "<c r=\"".toCharArray();
        char[] tokenOpenV = "<v>".toCharArray();

        char[] tokenAttributS = " s=\"".toCharArray();
        char[] tokenAttributT = " t=\"".toCharArray();

        String v;
        int firstCurrentIndex = br.currentIndex;
        boolean first = true;

        while ((v = extractNextValue(br, tokenOpenC, '"')) != null) {
            if (first && back != null) {
                int sum = br.currentIndex - firstCurrentIndex - tokenOpenC.length - v.length() - 1;
                first = false;
                System.out.println("Adding to : " + back.identifier + " From : " + br.identifier);
                back.plusLength(sum);
            }
            int[] indexes = extractSizeFromDimention(v);

            int s = foundNextTokens(br, '>', tokenAttributS, tokenAttributT);
            char type = 's'; //3 types: number (n), string (s) and date (d)
            if (s == 0) { // Token S = number or date
                char read = br.read();
                if (read == '1') {
                    type = 'n';
                } else {
                    type = 'd';
                }
            } else if (s == -1) {
                type = 'n';
            }
            String c = extractNextValue(br, tokenOpenV, '<');
            Object value = null;
            switch (type) {
                case 'n':
                    value = Double.parseDouble(c);
                    break;
                case 's':
                    try {
                        value = Integer.parseInt(c);
                    } catch (Exception ex) {
                        System.out.println("Identifier Error : " + br.identifier);
                    }
                    break;
                case 'd':
                    value = c.toString();
                    break;
            }
            result[indexes[0] - 1][indexes[1] - 1] = value;
            br.skip(7); ///v></c>
        }
    }

    static class CharReader {
        char[] chars;
        int currentIndex;
        int length;

        int identifier;

        public CharReader(char[] chars) {
            this.chars = chars;
            this.length = chars.length;
        }

        public CharReader(char[] chars, int currentIndex, int length, int identifier) {
            this.chars = chars;
            this.currentIndex = currentIndex;
            if (length > chars.length) {
                this.length = chars.length;
            } else {
                this.length = length;
            }
            this.identifier = identifier;
        }

        public void plusLength(int n) {
            if (this.length + n <= chars.length) {
                this.length += n;
            }
        }

        public char read() {
            if (currentIndex >= length) {
                return CHAR_END;
            }
            return chars[currentIndex++];
        }

        public void skip(int n) {
            currentIndex += n;
        }
    }


    public static int[] extractSizeFromDimention(String dimention) {
        StringBuilder sb = new StringBuilder();
        int columns = 0;
        int rows = 0;
        for (char c : dimention.toCharArray()) {
            if (columns == 0) {
                if (Character.isDigit(c)) {
                    columns = convertExcelIndex(sb.toString());
                    sb = new StringBuilder();
                }
            }
            sb.append(c);
        }
        rows = Integer.parseInt(sb.toString());
        return new int[]{rows, columns};
    }

    public static int foundNextTokens(CharReader br, char until, char[]... tokens) {
        char character;
        int[] indexes = new int[tokens.length];
        while ((character = br.read()) != CHAR_END) {
            if (character == until) {
                break;
            }
            for (int i = 0; i < indexes.length; i++) {
                if (tokens[i][indexes[i]] == character) {
                    indexes[i]++;
                    if (indexes[i] == tokens[i].length) {
                        return i;
                    }
                } else {
                    indexes[i] = 0;
                }
            }
        }

        return -1;
    }

    public static String extractNextValue(CharReader br, char[] token, char until) {
        char character;
        StringBuilder sb = new StringBuilder();
        int index = 0;

        while ((character = br.read()) != CHAR_END) {
            if (index == token.length) {
                if (character == until) {
                    return sb.toString();
                } else {
                    sb.append(character);
                }
            } else {
                if (token[index] == character) {
                    index++;
                } else {
                    index = 0;
                }
            }
        }
        return null;
    }

    public static int convertExcelIndex(String index) {
        int result = 0;
        for (char c : index.toCharArray()) {
            result = result * 26 + ((int) c - (int) 'A' + 1);
        }
        return result;
    }
}

Old answer (Not need the parameter Xms7g, so take less memory):It takes to open and read the example file about 35 seconds (200MB) with an HDD, with SDD takes a little less (30 seconds).

旧答案(不需要参数 Xms7g,因此占用更少的内存):使用 HDD 打开和读取示例文件大约需要 35 秒(200MB),使用 SDD 需要更少(30 秒)。

Here the code: https://github.com/csaki/OpenSimpleExcelFast.git

这里的代码:https: //github.com/csaki/OpenSimpleExcelFast.git

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.ZipFile;

public class Launcher {

    public static final char CHAR_END = (char) -1;

    public static void main(String[] args) throws IOException, ExecutionException, InterruptedException {
        long init = System.currentTimeMillis();
        String excelFile = "D:/Downloads/BigSpreadsheet.xlsx";
        ZipFile zipFile = new ZipFile(excelFile);

        ExecutorService executor = Executors.newFixedThreadPool(4);
        Future<String[]> futureWords = executor.submit(() -> processSharedStrings(zipFile));
        Future<Object[][]> futureSheet1 = executor.submit(() -> processSheet1(zipFile));
        String[] words = futureWords.get();
        Object[][] sheet1 = futureSheet1.get();
        executor.shutdown();

        long end = System.currentTimeMillis();
        System.out.println("Main only open and read: " + (end - init) / 1000);


        ///Doing somethin with the file::Saving as csv
        init = System.currentTimeMillis();
        try (PrintWriter writer = new PrintWriter(excelFile + ".csv", "UTF-8");) {
            for (Object[] rows : sheet1) {
                for (Object cell : rows) {
                    if (cell != null) {
                        if (cell instanceof Integer) {
                            writer.append(words[(Integer) cell]);
                        } else if (cell instanceof String) {
                            writer.append(toDate(Double.parseDouble(cell.toString())));
                        } else {
                            writer.append(cell.toString()); //Probably a number
                        }
                    }
                    writer.append(";");
                }
                writer.append("\n");
            }
        }
        end = System.currentTimeMillis();
        System.out.println("Main saving to csv: " + (end - init) / 1000);
    }

    private static final DateTimeFormatter formatter = DateTimeFormatter.ISO_DATE_TIME;
    private static final LocalDateTime INIT_DATE = LocalDateTime.parse("1900-01-01T00:00:00+00:00", formatter).plusDays(-2);

    //The number in excel is from 1900-jan-1, so every number time that you get, you have to sum to that date
    public static String toDate(double s) {
        return formatter.format(INIT_DATE.plusSeconds((long) ((s*24*3600))));
    }

    public static Object[][] processSheet1(ZipFile zipFile) throws IOException {
        String entry = "xl/worksheets/sheet1.xml";
        Object[][] result = null;
        char[] dimensionToken = "dimension ref=\"".toCharArray();
        char[] tokenOpenC = "<c r=\"".toCharArray();
        char[] tokenOpenV = "<v>".toCharArray();

        char[] tokenAttributS = " s=\"".toCharArray();
        char[] tokenAttributT = " t=\"".toCharArray();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(zipFile.getInputStream(zipFile.getEntry(entry)), Charset.forName("UTF-8")))) {
            String dimension = extractNextValue(br, dimensionToken, '"');
            int[] sizes = extractSizeFromDimention(dimension.split(":")[1]);
            br.skip(30); //Between dimension and next tag c exists more or less 30 chars
            result = new Object[sizes[0]][sizes[1]];
            String v;
            while ((v = extractNextValue(br, tokenOpenC, '"')) != null) {
                int[] indexes = extractSizeFromDimention(v);

                int s = foundNextTokens(br, '>', tokenAttributS, tokenAttributT);
                char type = 's'; //3 types: number (n), string (s) and date (d)
                if (s == 0) { // Token S = number or date
                    char read = (char) br.read();
                    if (read == '1') {
                        type = 'n';
                    } else {
                        type = 'd';
                    }
                } else if (s == -1) {
                    type = 'n';
                }
                String c = extractNextValue(br, tokenOpenV, '<');
                Object value = null;
                switch (type) {
                    case 'n':
                        value = Double.parseDouble(c);
                        break;
                    case 's':
                        value = Integer.parseInt(c);
                        break;
                    case 'd':
                        value = c.toString();
                        break;
                }
                result[indexes[0] - 1][indexes[1] - 1] = value;
                br.skip(7); ///v></c>
            }
        }
        return result;
    }

    public static int[] extractSizeFromDimention(String dimention) {
        StringBuilder sb = new StringBuilder();
        int columns = 0;
        int rows = 0;
        for (char c : dimention.toCharArray()) {
            if (columns == 0) {
                if (Character.isDigit(c)) {
                    columns = convertExcelIndex(sb.toString());
                    sb = new StringBuilder();
                }
            }
            sb.append(c);
        }
        rows = Integer.parseInt(sb.toString());
        return new int[]{rows, columns};
    }

    public static String[] processSharedStrings(ZipFile zipFile) throws IOException {
        String entry = "xl/sharedStrings.xml";
        String[] words = null;
        char[] wordCount = "Count=\"".toCharArray();
        char[] token = "<t>".toCharArray();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(zipFile.getInputStream(zipFile.getEntry(entry)), Charset.forName("UTF-8")))) {
            String uniqueCount = extractNextValue(br, wordCount, '"');
            words = new String[Integer.parseInt(uniqueCount)];
            String nextWord;
            int currentIndex = 0;
            while ((nextWord = extractNextValue(br, token, '<')) != null) {
                words[currentIndex++] = nextWord;
                br.skip(11); //you can skip at least 11 chars "/t></si><si>"
            }
        }
        return words;
    }

    public static int foundNextTokens(BufferedReader br, char until, char[]... tokens) throws IOException {
        char character;
        int[] indexes = new int[tokens.length];
        while ((character = (char) br.read()) != CHAR_END) {
            if (character == until) {
                break;
            }
            for (int i = 0; i < indexes.length; i++) {
                if (tokens[i][indexes[i]] == character) {
                    indexes[i]++;
                    if (indexes[i] == tokens[i].length) {
                        return i;
                    }
                } else {
                    indexes[i] = 0;
                }
            }
        }

        return -1;
    }

    public static String extractNextValue(BufferedReader br, char[] token, char until) throws IOException {
        char character;
        StringBuilder sb = new StringBuilder();
        int index = 0;

        while ((character = (char) br.read()) != CHAR_END) {
            if (index == token.length) {
                if (character == until) {
                    return sb.toString();
                } else {
                    sb.append(character);
                }
            } else {
                if (token[index] == character) {
                    index++;
                } else {
                    index = 0;
                }
            }
        }
        return null;
    }

    public static int convertExcelIndex(String index) {
        int result = 0;
        for (char c : index.toCharArray()) {
            result = result * 26 + ((int) c - (int) 'A' + 1);
        }
        return result;
    }

}

回答by Jeremy Thompson

Most programming languages that work with Office products have some middle layer and this is usually where the bottleneck is, a good example is using PIA's/Interop or Open XML SDK.

大多数与 Office 产品配合使用的编程语言都有一些中间层,这通常是瓶颈所在,一个很好的例子是使用 PIA/Interop 或 Open XML SDK。

One way to get the data at a lower level (bypassing the middle layer) is using a Driver.

在较低级别(绕过中间层)获取数据的一种方法是使用驱动程序。

150MB one-sheet excel file that takes about 7 minutes.

150MB 的一张 Excel 文件,大约需要 7 分钟。

The best I could do is a 130MB file in 135 seconds, roughly 3 times faster:

我能做的最好的事情是在 135 秒内创建一个 130MB 的文件,大约快 3 倍:

Stopwatch sw = new Stopwatch();
sw.Start();

DataSet excelDataSet = new DataSet();

string filePath = @"c:\temp\BigBook.xlsx";

// For .XLSXs we use =Microsoft.ACE.OLEDB.12.0;, for .XLS we'd use Microsoft.Jet.OLEDB.4.0; with  "';Extended Properties=\"Excel 8.0;HDR=YES;\"";
string connectionString = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source='" + filePath + "';Extended Properties=\"Excel 12.0;HDR=YES;\"";

using (OleDbConnection conn = new OleDbConnection(connectionString))
{
    conn.Open();
    OleDbDataAdapter objDA = new System.Data.OleDb.OleDbDataAdapter
    ("select * from [Sheet1$]", conn);
    objDA.Fill(excelDataSet);
    //dataGridView1.DataSource = excelDataSet.Tables[0];
}
sw.Stop();
Debug.Print("Load XLSX tool: " + sw.ElapsedMilliseconds + " millisecs. Records = "  + excelDataSet.Tables[0].Rows.Count);

enter image description here

在此处输入图片说明

Win 7x64, Intel i5, 2.3ghz, 8GB ram, SSD250GB.

Win 7x64,Intel i5,2.3GHz,8GB 内存,SSD250GB。

If I could recommend a hardware solution as well, try to resolve it with an SSD if you're using standard HDD's.

如果我也可以推荐一个硬件解决方案,如果您使用的是标准 HDD,请尝试使用 SSD 解决它。

Note: I cant download your Excel spreadsheet example as I'm behind a corporate firewall.

注意:我无法下载您的 Excel 电子表格示例,因为我位于公司防火墙后面。

PS. See MSDN - Fastest Way to import xlsx files with 200 MB of Data, the consensusbeing OleDB is the fastest.

附注。请参阅MSDN - 导入具有 200 MB 数据的 xlsx 文件的最快方法共识是 OleDB 是最快的。

PS 2. Here's how you can do it with python: http://code.activestate.com/recipes/440661-read-tabular-data-from-excel-spreadsheets-the-fast/

PS 2. 以下是使用 python 的方法:http: //code.activestate.com/recipes/440661-read-tabular-data-from-excel-spreadsheets-the-fast/

回答by Isma

I managed to read the file in about 30 seconds using .NET core and the Open XML SDK.

我设法使用 .NET core 和 Open XML SDK 在大约 30 秒内读取了该文件。

The following example returns a list of objects containing all rows and cells with the matching types, it supports date, numeric and text cells. The project is available here: https://github.com/xferaa/BigSpreadSheetExample/(Should work on Windows, Linux and Mac OS and does not require Excel or any Excel component to be installed).

下面的示例返回一个包含所有行和具有匹配类型的单元格的对象列表,它支持日期、数字和文本单元格。该项目可在此处获得:https: //github.com/xferaa/BigSpreadSheetExample/(应适用于 Windows、Linux 和 Mac OS,不需要安装 Excel 或任何 Excel 组件)。

public List<List<object>> ParseSpreadSheet()
{
    List<List<object>> rows = new List<List<object>>();

    using (SpreadsheetDocument spreadsheetDocument = SpreadsheetDocument.Open(filePath, false))
    {
        WorkbookPart workbookPart = spreadsheetDocument.WorkbookPart;
        WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();

        OpenXmlReader reader = OpenXmlReader.Create(worksheetPart);

        Dictionary<int, string> sharedStringCache = new Dictionary<int, string>();

        int i = 0;
        foreach (var el in workbookPart.SharedStringTablePart.SharedStringTable.ChildElements)
        {
            sharedStringCache.Add(i++, el.InnerText);
        }

        while (reader.Read())
        {
            if(reader.ElementType == typeof(Row))
            {
                reader.ReadFirstChild();

                List<object> cells = new List<object>();

                do
                {
                    if (reader.ElementType == typeof(Cell))
                    {
                        Cell c = (Cell)reader.LoadCurrentElement();

                        if (c == null || c.DataType == null || !c.DataType.HasValue)
                            continue;

                        object value;

                        switch(c.DataType.Value)
                        {
                            case CellValues.Boolean:
                                value = bool.Parse(c.CellValue.InnerText);
                                break;
                            case CellValues.Date:
                                value = DateTime.Parse(c.CellValue.InnerText);
                                break;
                            case CellValues.Number:
                                value = double.Parse(c.CellValue.InnerText);
                                break;
                            case CellValues.InlineString:
                            case CellValues.String:
                                value = c.CellValue.InnerText;
                                break;
                            case CellValues.SharedString:
                                value = sharedStringCache[int.Parse(c.CellValue.InnerText)];
                                break;
                            default:
                                continue;
                        }

                        if (value != null)
                            cells.Add(value);
                    }

                } while (reader.ReadNextSibling());

                if (cells.Any())
                    rows.Add(cells);
            }
        }
    }

    return rows;
}

I ran the program in a three year old Laptop with a SSD drive, 8GB of RAM and an Intel Core i7-4710 CPU @ 2.50GHz (two cores) on Windows 10 64 bits.

我在一台配备 SSD 驱动器、8GB RAM 和 Intel Core i7-4710 CPU @ 2.50GHz(两个内核)的 3 年前笔记本电脑上运行该程序,Windows 10 64 位。

Note that although opening and parsing the whole file as strings takes a bit less than 30 seconds, when using objects as in the example of my last edit, the time goes up to almost 50 seconds with my crappy laptop. You will probably get closer to 30 seconds in your server with Linux.

请注意,虽然将整个文件作为字符串打开和解析需要不到 30 秒,但在使用我上次编辑示例中的对象时,我的蹩脚笔记本电脑的时间增加了近 50 秒。在使用 Linux 的服务器中,您可能会接近 30 秒。

The trick was to use the SAX approach as explained here:

诀窍是使用 SAX 方法,如下所述:

https://msdn.microsoft.com/en-us/library/office/gg575571.aspx

https://msdn.microsoft.com/en-us/library/office/gg575571.aspx

回答by Martin Evans

Python's Pandas library could be used to hold and process your data, but using it to directly load the .xlsxfile will be quite slow, e.g. using read_excel().

Python 的 Pandas 库可用于保存和处理您的数据,但使用它直接加载.xlsx文件会很慢,例如使用read_excel().

One approach would be to use Python to automate the conversion of your file into CSV using Excel itself and to then use Pandas to load the resulting CSV file using read_csv(). This will give you a good speed up, but not under 30 seconds:

一种方法是使用 Python 使用 Excel 本身将文件自动转换为 CSV,然后使用 Pandas 使用read_csv(). 这会给你一个很好的加速,但不会低于 30 秒:

import win32com.client as win32        
import pandas as pd    
from datetime import datetime    

print ("Starting")
start = datetime.now()

# Use Excel to load the xlsx file and save it in csv format
excel = win32.gencache.EnsureDispatch('Excel.Application')
wb = excel.Workbooks.Open(r'c:\full path\BigSpreadsheet.xlsx')
excel.DisplayAlerts = False
wb.DoNotPromptForConvert = True
wb.CheckCompatibility = False

print('Saving')
wb.SaveAs(r'c:\full path\temp.csv', FileFormat=6, ConflictResolution=2) 
excel.Application.Quit()

# Use Pandas to load the resulting CSV file
print('Loading CSV')
df = pd.read_csv(r'c:\full path\temp.csv', dtype=str)

print(df.shape)
print("Done", datetime.now() - start)

Column types
The types for your columns can be specified by passing dtypeand convertersand parse_dates:

列类型 列
的类型可以通过传递dtypeandconverters和来指定parse_dates

df = pd.read_csv(r'c:\full path\temp.csv', dtype=str, converters={10:int}, parse_dates=[8], infer_datetime_format=True)

You should also specify infer_datetime_format=True, as this will greatly speed up the date conversion.

您还应该指定infer_datetime_format=True,因为这将大大加快日期转换。

nfer_datetime_format: boolean, default False

If True and parse_dates is enabled, pandas will attempt to infer the format of the datetime strings in the columns, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by 5-10x.

nfer_datetime_format: 布尔值,默认为 False

如果启用了 True 和 parse_dates,pandas 将尝试推断列中日期时间字符串的格式,如果可以推断,则切换到解析它们的更快方法。在某些情况下,这可以将解析速度提高 5-10 倍。

Also add dayfirst=Trueif dates are in the form DD/MM/YYYY.

dayfirst=True如果日期在表单中,还要添加DD/MM/YYYY

Selective columns
If you actually only need to work on columns 1 9 11, then you could further reduce resources by specifying usecols=[0, 8, 10]as follows:

选择性列
如果您实际上只需要处理列1 9 11,那么您可以通过usecols=[0, 8, 10]如下指定进一步减少资源:

df = pd.read_csv(r'c:\full path\temp.csv', dtype=str, converters={10:int}, parse_dates=[1], dayfirst=True, infer_datetime_format=True, usecols=[0, 8, 10])

The resulting dataframe would then only contain those 3 columns of data.

生成的数据框将只包含这 3 列数据。

RAM drive
Using a RAM drive to store the temporary CSV file to would further speed up the load time.

RAM 驱动器
使用 RAM 驱动器存储临时 CSV 文件将进一步加快加载时间。

Note: This does assume you are using a Windows PC with Excel available.

注意:这确实假设您使用的是带有 Excel 的 Windows PC。

回答by skadya

I have created an sample Javaprogram which is able to load the file in ~40 seconds my laptop ( Intel i7 4 core, 16 GB RAM).

我创建了一个示例Java程序,它能够在大约 40 秒内加载我的笔记本电脑(Intel i7 4 核,16 GB RAM)文件。

https://github.com/skadyan/largefile

https://github.com/skadyan/largefile

This program uses the Apache POI libraryto load the .xlsx file using the XSSF SAX API.

该程序使用Apache POI 库通过XSSF SAX API加载 .xlsx 文件。

The callback interface com.stackoverlfow.largefile.RecordHandlerimplementation can be used to process the data loaded from the excel. This interface define only one method which take three arguments

回调接口com.stackoverlfow.largefile.RecordHandler实现可用于处理从excel加载的数据。这个接口只定义了一个接受三个参数的方法

  • sheetname : String, excel sheet name
  • row number: int, row number of data
  • and data map: Map: excel cell reference and excel formatted cell value
  • sheetname : 字符串,excel表格名称
  • 行号:int,数据的行号
  • data map: Map: excel 单元格引用和 excel 格式的单元格值

The class com.stackoverlfow.largefile.Maindemonstrate one basic implementation of this interface which just print the row number on console.

该类com.stackoverlfow.largefile.Main演示了此接口的一个基本实现,它只是在控制台上打印行号。

Update

更新

woodstoxparser seems have better performance than standard SAXReader. (code updated in repo).

woodstox解析器似乎具有比标准更好的性能SAXReader。(代码在 repo 中更新)。

Also in order to meet the desired performance requirement, you may consider to re-implement the org.apache.poi...XSSFSheetXMLHandler. In the implementation, more optimized string/text value handling can be implemented and unnecessary text formatting operation may be skipped.

此外,为了满足所需的性能要求,您可以考虑重新实现org.apache.poi...XSSFSheetXMLHandler. 在实现中,可以实现更加优化的字符串/文本值处理,可以跳过不必要的文本格式化操作。

回答by R. Roe

I'm using a Dell Precision T1700 workstation and using c# I was able to open the file and read it's contents in about 24 seconds just using standard code to open a workbook using interop services. Using references to the Microsoft Excel 15.0 Object Library here is my code.

我正在使用 Dell Precision T1700 工作站并使用 c# 我能够在大约 24 秒内打开文件并读取其内容,只需使用标准代码使用互操作服务打开工作簿。在此处使用对 Microsoft Excel 15.0 对象库的引用是我的代码。

My using statements:

我的使用语句:

using System.Runtime.InteropServices;
using Excel = Microsoft.Office.Interop.Excel;

Code to open and read workbook:

打开和阅读工作簿的代码:

public partial class MainWindow : Window {
    public MainWindow() {
        InitializeComponent();

        Excel.Application xlApp;
        Excel.Workbook wb;
        Excel.Worksheet ws;

        xlApp = new Excel.Application();
        xlApp.Visible = false;
        xlApp.ScreenUpdating = false;

        wb = xlApp.Workbooks.Open(@"Desired Path of workbook\Copy of BigSpreadsheet.xlsx");

        ws = wb.Sheets["Sheet1"];

        //string rng = ws.get_Range("A1").Value;
        MessageBox.Show(ws.get_Range("A1").Value);

        Marshal.FinalReleaseComObject(ws);

        wb.Close();
        Marshal.FinalReleaseComObject(wb);

        xlApp.Quit();
        Marshal.FinalReleaseComObject(xlApp);

        GC.Collect();
        GC.WaitForPendingFinalizers();
    }
}

回答by obgnaw

The c# and ole solution still have some bottleneck.So i test it by c++ and ado.

c# 和 ole 解决方案仍然存在一些瓶颈。所以我通过 c++ 和 ado 对其进行了测试。

_bstr_t connStr(makeConnStr(excelFile, header).c_str());

TESTHR(pRec.CreateInstance(__uuidof(Recordset)));       
TESTHR(pRec->Open(sqlSelectSheet(connStr, sheetIndex).c_str(), connStr, adOpenStatic, adLockOptimistic, adCmdText));

while(!pRec->adoEOF)
{
    for(long i = 0; i < pRec->Fields->GetCount(); ++i)
    {   
        _variant_t v = pRec->Fields->GetItem(i)->Value;
        if(v.vt == VT_R8)
            num[i] = v.dblVal;
        if(v.vt == VT_BSTR)
            str[i] = v.bstrVal;          
        ++cellCount;
    }                                    
    pRec->MoveNext();
}

In i5-4460 and HDD machine,i find 500 thousands of cell in xls will take 1.5s.But same data in xlsx will take 2.829s.so it's possible for manipulating your data under 30s.

在 i5-4460 和 HDD 机器中,我发现 xls 中的 50 万个单元将需要 1.5 秒。但 xlsx 中的相同数据将需要 2.829 秒。因此可以在 30 秒内处理您的数据。

If you really need under 30s,use RAM Drive to reduce file IO.It will significantly improve your process. I cannot download your data to test it,so please tell me the result.

如果你真的需要 30 秒以下,使用 RAM Drive 来减少文件 IO。它会显着改善你的进程。我无法下载您的数据进行测试,所以请告诉我结果。

回答by void

Looks like it is hardly achievable in Python at all. If we unpack a sheet data file then it would take all required 30 seconds just to pass it through the C-based iterative SAX parser (using lxml, a very fast wrapper over libxml2):

看起来它在 Python 中几乎无法实现。如果我们解压一个工作表数据文件,那么仅仅通过基于 C 的迭代 SAX 解析器就需要 30 秒的时间(使用lxml,一个非常快速的包装器libxml2):

from __future__ import print_function

from lxml import etree
import time


start_ts = time.time()

for data in etree.iterparse(open('xl/worksheets/sheet1.xml'), events=('start',), 
                            collect_ids=False, resolve_entities=False,
                            huge_tree=True):
    pass

print(time.time() - start_ts)

The sample output: 27.2134890556

样本输出:27.2134890556

By the way, the Excel itself needs about 40 seconds to load the workbook.

顺便说一下,Excel 本身需要大约 40 秒来加载工作簿。

回答by ZEE

Another way that should improve largely the load/operation time is a RAMDrive

另一种应该大大改善加载/操作时间的方法是 RAMDrive

create a RAMDrive with enough space for your file and a 10%..20% extra space...
copy the file for the RAMDrive...
Load the file from there... depending on your drive and filesystem the speed improvement should be huge...

创建一个 RAMDrive,为您的文件提供足够的空间和 10%..20% 的额外空间...
复制 RAMDrive
的文件...从那里加载文件...取决于您的驱动器和文件系统,速度改进应该是巨大的...

My favourite is IMDisk toolkit
(https://sourceforge.net/projects/imdisk-toolkit/) here you have a powerfull command line to script everything...

我最喜欢的是 IMDisk 工具包
( https://sourceforge.net/projects/imdisk-toolkit/) 在这里你有一个强大的命令行来编写所有内容......

I also recommend SoftPerfect ramdisk
(http://www.majorgeeks.com/files/details/softperfect_ram_disk.html)

我还推荐 SoftPerfect ramdisk
( http://www.majorgeeks.com/files/details/softperfect_ram_disk.html)

but that also depends of your OS...

但这也取决于您的操作系统...

回答by ZEE

I would like to have more info about the system where you are opening the file... anyway:

我想了解有关您打开文件的系统的更多信息......无论如何:

look in your system for a Windows update called
"Office File Validation Add-In for Office ..."

在您的系统中查找名为
“Office File Validation Add-In for Office ...”的 Windows 更新

if you have it... uninstall it...
the file should load much more quickly
specially if is loaded froma share

如果你有它......卸载它......
该文件应该加载得更快,
特别是如果从共享加载