Text File Parsing in Java

Note: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/890862/

Time: 2020-08-11 20:44:18  Source: igfitidea

java file parsing

Question

I am reading in a text file using FileInputStream that puts the file contents into a byte array. I then convert the byte array into a String using new String(byte).

Once I have the string, I'm using String.split("\n") to split the file into a String array, then taking that array and parsing each line with String.split(",") and holding the contents in an ArrayList.

I have a 200MB+ file and it is running out of memory when I start the JVM up with 1GB of memory. I know I must be doing something incorrectly somewhere; I'm just not sure whether it's the way I'm parsing or the data structure I'm using.

It is also taking me about 12 seconds to parse the file, which seems like a lot of time. Can anyone point out what I may be doing that is causing me to run out of memory and what may be causing my program to run slowly?

The contents of the file look as shown below:

"12334", "100", "1.233", "TEST", "TEXT", "1234"
"12334", "100", "1.233", "TEST", "TEXT", "1234"
.
.
.
"12334", "100", "1.233", "TEST", "TEXT", "1234"

Thanks

Accepted answer by duffymo

It sounds to me like you're doing something wrong: there's a whole lotta object creation going on.

How representative is that "test" file? What are you really doing with that data? If that's typical of what you really have, I'd say there's lots of repetition in that data.

If it's all going to be in Strings anyway, start with a BufferedReader to read each line. Pre-allocate that List to a size that's close to what you need so you don't waste resources adding to it each time. Split each of those lines at the comma; be sure to strip off the double quotes.
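
For what it's worth, the steps in the paragraph above might look roughly like this. It's only a sketch: the class name QuotedLineReader, the helper names, and the expectedRows guess are made up for illustration, and it assumes no field ever contains an embedded comma or quote (true of the sample data, not of CSV in general).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class QuotedLineReader {
    // Splits one line at the commas and strips the surrounding double quotes.
    // Assumption: fields never contain embedded commas or quotes.
    static String[] parseLine(String line) {
        String[] fields = line.split(",");
        for (int i = 0; i < fields.length; i++) {
            String f = fields[i].trim();
            if (f.length() >= 2 && f.startsWith("\"") && f.endsWith("\"")) {
                f = f.substring(1, f.length() - 1);
            }
            fields[i] = f;
        }
        return fields;
    }

    // expectedRows is only a rough guess used to pre-size the list.
    static List<String[]> readAll(Reader source, int expectedRows) throws IOException {
        List<String[]> rows = new ArrayList<>(expectedRows);
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.isEmpty()) {
                    rows.add(parseLine(line));
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        String sample = "\"12334\", \"100\", \"1.233\", \"TEST\", \"TEXT\", \"1234\"\n";
        List<String[]> rows = readAll(new StringReader(sample + sample), 2);
        System.out.println(rows.size() + " rows, first field: " + rows.get(0)[0]);
    }
}
```

For a real file you would pass something like a FileReader or an InputStreamReader instead of the StringReader used in main.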

You might want to ask yourself: "Why do I need this whole file in memory all at once?" Can you read a little, process a little, and never have the whole thing in memory at once? Only you know your problem well enough to answer.
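
To make "read a little, process a little" concrete, here is a sketch that never keeps more than one line around. The task itself (summing the third column) is invented purely for illustration, as are the names StreamingSum and sumThirdColumn; whether something like this fits depends on what you actually do with the data.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

class StreamingSum {
    // Sums the third column line by line; only the current line is held in memory.
    // Assumption: the third field is numeric once quotes and spaces are removed;
    // Double.parseDouble throws NumberFormatException otherwise.
    static double sumThirdColumn(Reader source) throws IOException {
        double total = 0.0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split(",");
                if (fields.length < 3) {
                    continue; // skip blank or short lines
                }
                total += Double.parseDouble(fields[2].trim().replace("\"", ""));
            }
        }
        return total;
    }
}
```

Each line becomes garbage as soon as the loop moves on, so memory use stays roughly constant no matter how large the file is.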

Maybe you can fire up jvisualvm if you have JDK 6 and see what's going on with memory. That would be a great clue.

Answer by Tom Hawtin - tackline

If you have a 200,000,000 character file and split that every five characters, you have 40,000,000 String objects. Assume they are sharing actual character data with the original 400 MB String (char is 2 bytes). A String is, say, 32 bytes, so that is 1,280,000,000 bytes of String objects.
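
The same back-of-the-envelope arithmetic, written out so the inputs are easy to change. The figures are the ones assumed above (five characters per token, about 32 bytes per String object, 2 bytes per char); they are rough guesses rather than measurements, and the class name is made up.

```java
class StringOverheadEstimate {
    static final long FILE_CHARS = 200_000_000L;  // characters in the file
    static final long CHARS_PER_TOKEN = 5L;       // assumed average token length
    static final long BYTES_PER_STRING = 32L;     // rough per-object size, implementation dependent
    static final long BYTES_PER_CHAR = 2L;        // a Java char is 2 bytes

    static long tokenCount() {
        return FILE_CHARS / CHARS_PER_TOKEN;
    }

    static long stringObjectBytes() {
        return tokenCount() * BYTES_PER_STRING;
    }

    static long characterDataBytes() {
        return FILE_CHARS * BYTES_PER_CHAR;
    }

    public static void main(String[] args) {
        System.out.println("String objects:       " + tokenCount());
        System.out.println("Bytes in objects:     " + stringObjectBytes());
        System.out.println("Bytes in char data:   " + characterDataBytes());
    }
}
```

Under these assumptions the String objects alone already exceed a 1GB heap before the character data is counted.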

(It's probably worth noting that this is very implementation dependent. split could create entirely new strings with entirely new backing char[] or, OTOH, share some common String values. Some Java implementations do not use the slicing of char[]. Some may use a UTF-8-like compact form and give very poor random access times.)

Even assuming longer strings, that's a lot of objects. With that much data, you probably want to work with most of it in compact form like the original (only with indexes). Only convert to objects what you actually need. The implementation should be database-like (although databases traditionally don't handle variable length strings efficiently).
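
One possible reading of "compact form, only with indexes" is sketched below: keep the raw bytes of the file, remember where each line starts, and build a String for a field only when somebody asks for it. The class CompactTable and its methods are hypothetical, and the sketch assumes UTF-8 data, '\n' line endings, and no embedded commas or quotes inside fields.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CompactTable {
    private final byte[] data;               // raw file bytes, kept as-is
    private int[] lineStarts = new int[16];  // offset of the first byte of each non-empty line
    private int lineCount = 0;

    CompactTable(byte[] data) {
        this.data = data;
        int start = 0;
        for (int i = 0; i <= data.length; i++) {
            if (i == data.length || data[i] == '\n') {
                if (i > start) {
                    addLine(start);
                }
                start = i + 1;
            }
        }
    }

    private void addLine(int offset) {
        if (lineCount == lineStarts.length) {
            lineStarts = Arrays.copyOf(lineStarts, lineCount * 2);
        }
        lineStarts[lineCount++] = offset;
    }

    int rows() {
        return lineCount;
    }

    // Creates a String for a single field on demand; nothing is cached.
    String field(int row, int col) {
        if (row < 0 || row >= lineCount) {
            throw new IndexOutOfBoundsException("no row " + row);
        }
        int pos = lineStarts[row];
        for (int c = 0; c < col; c++) {
            // advance to the next comma on this line
            while (pos < data.length && data[pos] != ',' && data[pos] != '\n') {
                pos++;
            }
            if (pos >= data.length || data[pos] == '\n') {
                throw new IndexOutOfBoundsException("no column " + col);
            }
            pos++; // step over the comma
        }
        int end = pos;
        while (end < data.length && data[end] != ',' && data[end] != '\n') {
            end++;
        }
        String raw = new String(data, pos, end - pos, StandardCharsets.UTF_8).trim();
        return raw.replace("\"", "");
    }
}
```

The cost per line is one int instead of several String objects, at the price of rescanning the line on each field lookup.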

Answer by Laurence Gonsalves

It sounds like you currently have 3 copies of the entire file in memory: the byte array, the string, and the array of the lines.

Instead of reading the bytes into a byte array and then converting to characters using new String(), it would be better to use an InputStreamReader, which will convert to characters incrementally, rather than all up-front.

Also, instead of using String.split("\n") to get the individual lines, you should read one line at a time. You can use the readLine() method in BufferedReader.

Try something like this:

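import java.io.BufferedReader;
import java.io.InputStreamReader;

// fileInputStream is the FileInputStream you already have open on the file
BufferedReader reader = new BufferedReader(new InputStreamReader(fileInputStream, "UTF-8"));
try {
  while (true) {
    String line = reader.readLine();  // returns null at end of file
    if (line == null) break;
    String[] fields = line.split(",");  // fields still carry their quotes and spaces
    // process fields here
  }
} finally {
  reader.close();
}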

Answer by Cogsy

I'm not sure how efficient it is memory-wise, but my first approach would be using a Scanner as it is incredibly easy to use:

import java.io.File;
import java.util.Scanner;

File file = new File("/path/to/my/file.txt");
Scanner input = new Scanner(file);

while(input.hasNext()) {
    String nextToken = input.next();
    // or, to process line by line, loop on input.hasNextLine() and use:
    // String nextLine = input.nextLine();
}

input.close();

Check the API for how to alter the delimiter it uses to split tokens.
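
As a rough sketch of what changing the delimiter could look like for the sample data: Scanner.useDelimiter takes a regular expression, and the one below (a comma or a newline, with optional whitespace on either side) is just one pattern that happens to suit this layout. The class name and the quote stripping are additions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerDelimiterDemo {
    // Returns every field in the text, split on commas and line breaks.
    // Assumption: fields contain no embedded commas, quotes, or newlines.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        try (Scanner input = new Scanner(text)) {
            input.useDelimiter("\\s*[,\\n]\\s*");
            while (input.hasNext()) {
                out.add(input.next().replace("\"", "")); // drop the double quotes
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String sample = "\"12334\", \"100\", \"1.233\"\n\"TEST\", \"TEXT\", \"1234\"\n";
        System.out.println(tokens(sample));
    }
}
```

With a file you would construct the Scanner from a File as in the snippet above; note that this flattens all lines into one token stream, so you lose track of where each row ends unless you count fields.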

Answer by stenix

Have a look at these pages. They contain many open source CSV parsers. JSaPar is one of them.

Answer by blackberry dev

When invoking your program you can use this command: java [-options] className [args...]
In place of [-options], provide more memory, e.g. -Xmx1024m or more. But this is just a workaround; you have to change your parsing mechanism.
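
If you want to confirm the option took effect, a tiny check like the following may help. Runtime.getRuntime().maxMemory() is a standard call; the class name HeapLimit is made up, and the reported figure can come out somewhat below the -Xmx value depending on the JVM.

```java
class HeapLimit {
    // Maximum heap the JVM will attempt to use, in megabytes.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Run it as, for example, java -Xmx1024m HeapLimit and compare the number with and without the option.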
