Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/781293/
Parsing Large Text Files in Real-time (Java)
Asked by Christopher McAtackney
I'm interested in parsing a fairly large text file in Java (1.6.x) and was wondering what approach(es) would be considered best practice?
The file will probably be about 1Mb in size, and will consist of thousands of entries along the lines of:
Entry
{
property1=value1
property2=value2
...
}
etc.
My first instinct is to use regular expressions, but I have no prior experience of using Java in a production environment, and so am unsure how powerful the java.util.regex classes are.
To clarify a bit, my application is going to be a web app (JSP) which parses the file in question and displays the various values it retrieves. There is only ever the one file which gets parsed (it resides in a 3rd party directory on the host).
The app will have a fairly low usage (maybe only a handful of users using it a couple of times a day), but it is vital that when they do use it, the information is retrieved as quickly as possible.
Also, are there any precautions to take around loading the file into memory every time it is parsed?
Can anyone recommend an approach to take here?
Thanks
Answer by Neil Coffey
If it's going to be about 1MB and literally in the format you state, then it sounds like you're overengineering things.
Unless your server is a ZX Spectrum or something, just use regular expressions to parse it, whack the data in a hash map (and keep it there), and don't worry about it. It'll take up a few megabytes in memory, but so what...?
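A minimal sketch of that approach, assuming the `Entry { property=value }` layout shown in the question; the class and method names are mine, not from the answer, and values are assumed to contain no whitespace:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexEntryParser {
    // One "Entry { ... }" block; DOTALL lets '.' cross newlines, and the
    // reluctant '.*?' stops at the first closing brace.
    private static final Pattern ENTRY =
            Pattern.compile("Entry\\s*\\{(.*?)\\}", Pattern.DOTALL);
    // One "property=value" pair (values without whitespace).
    private static final Pattern PROP =
            Pattern.compile("(\\w+)\\s*=\\s*(\\S+)");

    public static List<Map<String, String>> parse(String text) {
        List<Map<String, String>> entries = new ArrayList<Map<String, String>>();
        Matcher entry = ENTRY.matcher(text);
        while (entry.find()) {
            Map<String, String> props = new HashMap<String, String>();
            Matcher prop = PROP.matcher(entry.group(1));
            while (prop.find()) {
                props.put(prop.group(1), prop.group(2));
            }
            entries.add(props);
        }
        return entries;
    }
}
```

In the web app you would read the whole file into one String, call `parse` once, and keep the resulting list somewhere that survives between requests (e.g. a servlet context attribute), which is the "keep it there" part of the answer.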
Update: just to give you a concrete idea of performance, some measurements I took of the performance of String.split() (which uses regular expressions) show that on a 2GHz machine, it takes milliseconds to split 10,000 100-character strings (in other words, about 1 megabyte of data -- actually nearer 2MB in pure volume of bytes, since Strings are 2 bytes per char). Obviously, that's not quite the operation you're performing, but you get my point: things aren't that bad...
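For anyone who wants to reproduce that kind of figure, a rough micro-benchmark along those lines; timings will vary by JVM and hardware, and this ignores warm-up effects:

```java
public class SplitBench {
    // Splits `iterations` copies of a 100-character string and returns
    // the total number of parts, so the work cannot be optimised away.
    static int run(int iterations) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10; i++) {
            sb.append("abcdefghi,");          // 10 chars x 10 = 100 chars
        }
        String line = sb.toString();
        int parts = 0;
        for (int i = 0; i < iterations; i++) {
            parts += line.split(",").length;  // split() uses a regex each call
        }
        return parts;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        int parts = run(10000);
        long ms = (System.nanoTime() - start) / 1000000L;
        System.out.println(parts + " parts in " + ms + " ms");
    }
}
```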
Answer by Lucero
If it is a proper grammar, use a parser builder such as the GOLD Parsing System. This allows you to specify the format and use an efficient parser to get the tokens you need, getting error-handling almost for free.
Answer by Brian Agnew
I'm wondering why this isn't in XML, and then you could leverage off the available XML tooling. I'm thinking particularly of SAX, in which case you could easily parse/process this without holding it all in memory.
So can you convert this to XML?
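If converting is an option, a sketch of what the SAX side could look like; the XML element and attribute names here are invented for illustration, not taken from the question:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Assumes the file were converted to XML along the lines of
// <entries><entry property1="value1" property2="value2"/></entries>.
public class EntrySaxHandler extends DefaultHandler {
    final List<Map<String, String>> entries = new ArrayList<Map<String, String>>();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("entry".equals(qName)) {
            Map<String, String> props = new HashMap<String, String>();
            for (int i = 0; i < atts.getLength(); i++) {
                props.put(atts.getQName(i), atts.getValue(i));
            }
            entries.add(props);
        }
    }

    public static List<Map<String, String>> parse(String xml) throws Exception {
        EntrySaxHandler handler = new EntrySaxHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.entries;
    }
}
```

Because SAX fires callbacks as it streams, nothing beyond the current entry needs to be held in memory, which is the point the answer is making.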
If you can't, and you need a parser, then take a look at JavaCC.
Answer by mP.
Use the Scanner class and process your file a line at a time. I'm not sure why you mentioned regex. Regex is almost never the right answer to any parsing question, because of the ambiguity and the lack of semantic control over what's happening in what context.
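A line-at-a-time sketch of that idea for the format in the question; the brace-tracking state machine is my assumption about how the blocks pair up:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Scanner;

public class ScannerParser {
    // Reads one line at a time, tracking whether we are inside an
    // "Entry { ... }" block, and splits property lines on the first '='.
    public static List<Map<String, String>> parse(Scanner in) {
        List<Map<String, String>> entries = new ArrayList<Map<String, String>>();
        Map<String, String> current = null;
        while (in.hasNextLine()) {
            String line = in.nextLine().trim();
            if (line.equals("{")) {
                current = new HashMap<String, String>();
            } else if (line.equals("}")) {
                if (current != null) {
                    entries.add(current);
                }
                current = null;
            } else if (current != null && line.indexOf('=') > 0) {
                int eq = line.indexOf('=');
                current.put(line.substring(0, eq), line.substring(eq + 1));
            }
        }
        return entries;
    }
}
```

For the real file you would construct the Scanner from a `java.io.File` (and close it afterwards); a Scanner over a String works the same way, which makes the sketch easy to try out.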
Answer by pgras
Not answering the question about parsing ... but you could parse the files and generate static pages as soon as new files arrive. So you would have no performance problems... (And I think 1Mb isn't a big file so you can load it in memory, as long as you don't load too many files concurrently...)
Answer by Yuval F
This seems like a simple enough file format, so you may consider using a Recursive Descent Parser. Compared to JavaCC and Antlr, its pros are that you can write a few simple methods, get the data you need, and you do not need to learn a parser generator formalism. Its con is that it may be less efficient. A recursive descent parser is in principle stronger than regular expressions. If you can come up with a grammar for this file type, it will serve you for whatever solution you choose.
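A hand-written recursive descent sketch for a grammar like `file := entry*`, `entry := "Entry" "{" property* "}"`; the tokenisation and error handling are deliberately minimal, and this assumes values contain no whitespace:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DescentParser {
    private final List<String> tokens = new ArrayList<String>();
    private int pos = 0;

    public DescentParser(String input) {
        // Tokenise: braces become their own tokens, the rest splits on whitespace.
        for (String t : input.replace("{", " { ").replace("}", " } ").split("\\s+")) {
            if (t.length() > 0) {
                tokens.add(t);
            }
        }
    }

    private String peek() { return pos < tokens.size() ? tokens.get(pos) : null; }
    private String next() { return tokens.get(pos++); }
    private void expect(String t) {
        if (!t.equals(next())) {
            throw new IllegalStateException("expected " + t);
        }
    }

    // file := entry*
    public List<Map<String, String>> file() {
        List<Map<String, String>> entries = new ArrayList<Map<String, String>>();
        while (peek() != null) {
            entries.add(entry());
        }
        return entries;
    }

    // entry := "Entry" "{" property* "}"
    private Map<String, String> entry() {
        expect("Entry");
        expect("{");
        Map<String, String> props = new HashMap<String, String>();
        while (!"}".equals(peek())) {
            String[] kv = next().split("=", 2);   // property := name "=" value
            if (kv.length == 2) {
                props.put(kv[0], kv[1]);
            }
        }
        expect("}");
        return props;
    }
}
```

Each grammar rule becomes one method, which is the "few simple methods" the answer describes; adding error messages or richer value syntax is a local change to the matching method.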
Answer by Alan Moore
If it's the limitations of Java regexes you're wondering about, don't worry about it. Assuming you're reasonably competent at crafting regexes, performance shouldn't be a problem. The feature set is satisfyingly rich, too--including my favorite, possessive quantifiers.
Answer by Chii
The other solution is to do some form of preprocessing (done offline, or as a cron job) which produces a very optimized data structure, which is then used to serve the many web requests (without having to reparse the file).
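One lightweight variant of that idea, assuming the servlet can check the file's timestamp: cache the parsed structure in memory and reparse only when the file changes. The `parse` stub here stands in for whichever parsing approach from the other answers is chosen:

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;

public class CachedConfig {
    private final File file;
    private long lastModified = -1;
    private Map<String, String> cached;

    public CachedConfig(File file) {
        this.file = file;
    }

    // Re-parses only when the file's timestamp changes, so most web
    // requests are served straight from the in-memory copy.
    public synchronized Map<String, String> get() throws Exception {
        long stamp = file.lastModified();
        if (stamp != lastModified) {
            cached = parse(file);
            lastModified = stamp;
        }
        return cached;
    }

    // Placeholder: plug in the regex, Scanner, or parser-based approach here.
    private Map<String, String> parse(File f) throws Exception {
        return new HashMap<String, String>();
    }
}
```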
Though, looking at the scenario in question, that doesn't seem to be needed.

