Best language to parse extremely large Excel 2007 files
Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/3560950/
Asked by arcdrag
My boss has a habit of running queries on our databases that return tens of thousands of rows and saving the results into Excel files. I, being the intern, constantly have to write scripts that work with the information from these files. Thus far I've tried VBScript and PowerShell for my scripting needs. Both can take several minutes to perform even the simplest of tasks, which would mean that the finished script would take most of an 8-hour day to run.
My workaround right now is simply to write a PowerShell script that removes all of the commas and newline characters from an xlsx file, saves the .xlsx files to .csv, and then have a Java program handle the data gathering and output, and have my script clean up the .csv files when finished. This runs in a matter of seconds for my current project, but I can't help but wonder if there's a more elegant alternative for my next one. Any suggestions?
Accepted answer by trashgod
I kept getting all kinds of weird errors when working with .xlsx files.
Here's a simple example of using Apache POI to traverse an .xlsx file. See also Upgrading to POI 3.5, including converting existing HSSF Usermodel code to SS Usermodel (for XSSF and HSSF).
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DateUtil;
import org.apache.poi.ss.usermodel.FormulaEvaluator;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class XlsxReader {

    public static void main(String[] args) throws IOException {
        // try-with-resources closes the stream even if parsing fails.
        try (InputStream myxls = new FileInputStream("test.xlsx")) {
            // XSSFWorkbook loads the whole workbook into memory.
            Workbook book = new XSSFWorkbook(myxls);
            FormulaEvaluator eval =
                book.getCreationHelper().createFormulaEvaluator();
            Sheet sheet = book.getSheetAt(0);
            // Rows and cells are iterable; print each row on its own line.
            for (Row row : sheet) {
                for (Cell cell : row) {
                    printCell(cell, eval);
                    System.out.print("; ");
                }
                System.out.println();
            }
        }
    }

    // Prints one cell according to its type. Formula cells print the formula
    // text itself; the evaluator is passed in but not used here.
    // Note: these int constants belong to the POI 3.x API; POI 4.x and later
    // use the CellType enum instead.
    private static void printCell(Cell cell, FormulaEvaluator eval) {
        switch (cell.getCellType()) {
            case Cell.CELL_TYPE_BLANK:
                System.out.print("EMPTY");
                break;
            case Cell.CELL_TYPE_STRING:
                System.out.print(cell.getStringCellValue());
                break;
            case Cell.CELL_TYPE_NUMERIC:
                // Dates are stored as numbers, so check the cell format first.
                if (DateUtil.isCellDateFormatted(cell)) {
                    System.out.print(cell.getDateCellValue());
                } else {
                    System.out.print(cell.getNumericCellValue());
                }
                break;
            case Cell.CELL_TYPE_BOOLEAN:
                System.out.print(cell.getBooleanCellValue());
                break;
            case Cell.CELL_TYPE_FORMULA:
                System.out.print(cell.getCellFormula());
                break;
            default:
                System.out.print("DEFAULT");
        }
    }
}
Answer by Benoit Courtine
Your goal is to do "data transformation" on your Excel files.
To solve this, I would use a dedicated ETL tool (Extract Transform Load), such as Talend Open Studio.
You just have to add an "Excel Input" component, a "data transform" component, and a "CSV output" component. Talend ETL will convert this functional description of your problem into Java code. Finally, you just have to execute this program...
Answer by Tal
You should always think about the future of your code...
Who will maintain your script in the future? Does your company have any other developers that are familiar with PowerShell/VBScript?
I would have to say that you should stick to one language that fits your (and your company's) needs. As Nathan suggested, Python would be a great choice for creating fast scripts.
And one more thing: if you can control the SQL statements your boss runs, you can have him produce output that makes your parsers easier to develop and much simpler.
Good luck!
Tal.
Answer by ktsujister
In addition to trashgod's answer, for large files I'd suggest POI SXSSF (available since POI 3.8 beta3; see http://poi.apache.org/spreadsheet/). With SXSSF you can handle large files as streams, which helps avoid out-of-memory errors.
Adding a link to the SXSSF details: http://poi.apache.org/spreadsheet/how-to.html#sxssf
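To illustrate the streaming idea without depending on a particular POI version, here is a minimal sketch that uses only the JDK. It is not POI's API: as far as I know, SXSSF is aimed mainly at writing large workbooks, and POI's event (SAX) API covers streaming reads. The sketch relies on the fact that an .xlsx file is a ZIP container of XML parts and pulls rows from the worksheet part with StAX, so memory use stays flat regardless of row count. The class name, method names, and the assumption that the first sheet lives at xl/worksheets/sheet1.xml are mine.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class StreamingSheetReader {

    // Assumption: the first worksheet is stored under this conventional part name.
    static final String FIRST_SHEET = "xl/worksheets/sheet1.xml";

    /** Counts <row> elements in worksheet XML without building a document tree. */
    static long countRows(InputStream sheetXml) throws XMLStreamException {
        XMLStreamReader reader =
            XMLInputFactory.newInstance().createXMLStreamReader(sheetXml);
        long rows = 0;
        try {
            // Pull events one at a time; only the current event is held in memory.
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "row".equals(reader.getLocalName())) {
                    rows++;
                }
            }
        } finally {
            reader.close();
        }
        return rows;
    }

    /** Opens the .xlsx ZIP container and streams the first worksheet part. */
    static long countRowsInXlsx(String path) throws IOException, XMLStreamException {
        try (ZipFile zip = new ZipFile(path)) {
            ZipEntry entry = zip.getEntry(FIRST_SHEET);
            if (entry == null) {
                throw new IOException("No " + FIRST_SHEET + " in " + path);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return countRows(in);
            }
        }
    }
}
```

Calling StreamingSheetReader.countRowsInXlsx("test.xlsx") would report the row count of the first sheet. Real processing would replace the counter with per-row logic, and string cells would additionally need a lookup in the workbook's shared strings part.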
Answer by Nate
I personally would use Python for this. I have found that it runs fast enough to not be a noticeable problem.
If you don't want to worry about a new language, why not just use Java for the entire thing? Removing commas and newlines is pretty trivial in Java and it would save you a step.
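As a rough sketch of how small that step is in Java, the helper below cleans one cell value and joins a row into a CSV line. The class and method names are hypothetical, and replacing the offending characters with a space (rather than deleting them outright, so that adjacent words do not run together) is my assumption, not something stated in the question.

```java
import java.util.List;
import java.util.stream.Collectors;

final class CsvSanitizer {

    /** Replaces runs of commas and line breaks in a cell value with one space. */
    static String sanitize(String cell) {
        if (cell == null) {
            return "";
        }
        return cell.replaceAll("[,\\r\\n]+", " ").trim();
    }

    /** Joins sanitized cell values into one comma-separated line. */
    static String toCsvLine(List<String> cells) {
        return cells.stream()
                .map(CsvSanitizer::sanitize)
                .collect(Collectors.joining(","));
    }
}
```

Each row read from the workbook could be passed through CsvSanitizer.toCsvLine before it is written out, which removes the separate PowerShell clean-up pass.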
Answer by Robert Harvey
Answer by Skarab
If you need advanced analysis (beyond grouping, joining, and filtering), just go for free data mining tools such as Weka, RapidMiner (based on Weka but with a nicer GUI), or KNIME. These tools have very nice interfaces and provide operators to read CSV files. You can also run the RapidMiner and Weka libraries inside your Java program. If not, go for an embedded database, as I proposed in my other answer.
Using Apache POI is not a bad idea, but I personally prefer to use it only to read the Excel file before loading it into, for example, a database.
Regarding the language: the best language I have found for ad hoc tasks is Groovy. It is a scripting language on top of Java, so you can use all Java libraries (POI, JDBC drivers, ... a very long list) and mix Groovy classes with Java classes.
Answer by Skarab
You can import the data into an embedded database, e.g., Apache Derby (or HSQLDB, http://hsqldb.org/). Depending on the nature of your queries, it can be a little bit faster. It will certainly save you a lot of time if your boss often requests new features: you will simply write most of your new functionality in SQL.
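A minimal sketch of the import step using plain JDBC follows. It assumes the target table already exists with one text column per spreadsheet column, that the table name comes from trusted code (it is concatenated into the SQL), and that the caller supplies a Connection to whichever embedded database is on the classpath. The class, method, and table names are hypothetical.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Collections;
import java.util.List;

final class RowLoader {

    /** Builds "INSERT INTO <table> VALUES (?, ?, ...)" with one placeholder per column. */
    static String buildInsertSql(String table, int columnCount) {
        String placeholders = String.join(", ", Collections.nCopies(columnCount, "?"));
        return "INSERT INTO " + table + " VALUES (" + placeholders + ")";
    }

    /** Inserts all rows in one JDBC batch; every row must have the same length. */
    static void load(Connection conn, String table, List<String[]> rows)
            throws SQLException {
        if (rows.isEmpty()) {
            return;
        }
        String sql = buildInsertSql(table, rows.get(0).length);
        try (PreparedStatement stmt = conn.prepareStatement(sql)) {
            for (String[] row : rows) {
                // JDBC parameter indexes start at 1.
                for (int i = 0; i < row.length; i++) {
                    stmt.setString(i + 1, row[i]);
                }
                stmt.addBatch();
            }
            stmt.executeBatch();
        }
    }
}
```

With the HSQLDB jar available, a connection could be obtained with DriverManager.getConnection("jdbc:hsqldb:mem:demo"). Once the rows are loaded, each new request from the boss becomes an ordinary SQL query against that table instead of a new parsing script.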
Answer by Avdhesh Yadav
I have two options for parsing Excel (.xlsx or .xls) files.
1- You can use the Apache POI API to extract data from them. Apache POI has improved and is now fast.
2- Convert the Excel file to Open XML, then write an XSLT file. I think it should work for a large Excel file.
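The second option can be sketched with the XSLT support built into the JDK. This is only an illustration under several assumptions: the worksheet XML has already been extracted from the .xlsx container, the cells hold inline values (string cells normally store an index into the shared strings part, which this stylesheet does not resolve), and the class and method names are hypothetical. Note also that an XSLT 1.0 transformer typically builds the whole input document in memory, so very large sheets may need a streaming approach instead.

```java
import java.io.InputStream;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class SheetXslt {

    // XSLT 1.0 stylesheet: one output line per <row>, cell <v> values joined by commas.
    static final String SHEET_TO_CSV =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\""
      + " xmlns:s=\"http://schemas.openxmlformats.org/spreadsheetml/2006/main\">"
      + "<xsl:output method=\"text\"/>"
      + "<xsl:template match=\"/\">"
      + "<xsl:for-each select=\"//s:row\">"
      + "<xsl:for-each select=\"s:c\">"
      + "<xsl:if test=\"position() &gt; 1\">,</xsl:if>"
      + "<xsl:value-of select=\"s:v\"/>"
      + "</xsl:for-each>"
      + "<xsl:text>&#10;</xsl:text>"
      + "</xsl:for-each>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the stylesheet to worksheet XML and returns the CSV text. */
    static String toCsv(InputStream sheetXml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(SHEET_TO_CSV)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(sheetXml), new StreamResult(out));
        return out.toString();
    }
}
```

The input stream would usually come from the worksheet part inside the .xlsx ZIP file, for example via java.util.zip.ZipFile.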

