Java: Sort a file with a huge volume of data given a memory constraint

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/2087469/


Sort a file with huge volume of data given memory constraint

Tags: java, file, sorting

Asked by Erika Gomez

Points:

  • We process thousands of flat files in a day, concurrently.
  • Memory constraint is a major issue.
  • We use a thread for each file being processed.
  • We don't sort by columns. Each line (record) in the file is treated as one column.

Can't Do:

  • We cannot use unix/linux's sort commands.
  • We cannot use any database system, no matter how lightweight it is.

Now, we cannot simply load everything into a collection and sort it in memory; that would eat up all the memory and the program would crash with a heap error.


In that situation, how would you sort the records/lines in a file?


Accepted answer by phisch

It looks like what you are looking for is external sorting.


Basically, you sort small chunks of the data first, write them back to disk as sorted runs, and then merge those runs to produce the fully sorted output.

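To make the first phase concrete, here is a minimal Java sketch (not from the original answer) that reads the input in chunks that fit in memory, sorts each chunk, and writes it out as a sorted run in a temporary file. The class and method names (SortedRuns, createSortedRuns) and the chunk size are made up for illustration; the merge phase is sketched further down, under KLE's answer.

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

public class SortedRuns {

    // Phase 1 of an external sort: split the input into sorted runs.
    // maxLinesInMemory is a stand-in for whatever your real memory budget allows.
    static List<Path> createSortedRuns(Path input, int maxLinesInMemory) throws IOException {
        List<Path> runs = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            List<String> chunk = new ArrayList<>(maxLinesInMemory);
            String line;
            while ((line = reader.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() >= maxLinesInMemory) {
                    runs.add(writeRun(chunk));
                    chunk.clear();
                }
            }
            if (!chunk.isEmpty()) {
                runs.add(writeRun(chunk));
            }
        }
        return runs;
    }

    private static Path writeRun(List<String> chunk) throws IOException {
        Collections.sort(chunk);                       // in-memory sort of one chunk
        Path run = Files.createTempFile("sort-run-", ".tmp");
        Files.write(run, chunk, StandardCharsets.UTF_8);
        return run;
    }

    public static void main(String[] args) throws IOException {
        // Example usage: produce sorted runs of at most 100_000 lines each.
        List<Path> runs = createSortedRuns(Paths.get(args[0]), 100_000);
        System.out.println("Created " + runs.size() + " sorted runs");
    }
}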

Answer by danben

I would spin up an EC2 cluster and run Hadoop's MergeSort.


Edit: not sure how much detail you would like, or on what. EC2 is Amazon's Elastic Compute Cloud - it lets you rent virtual servers by the hour at low cost. Here is their website.


Hadoop is an open-source MapReduce framework designed for parallel processing of large data sets. A job is a good candidate for MapReduce when it can be split into subsets that can be processed individually and then merged together, usually by sorting on keys (i.e. the divide-and-conquer strategy). Here is its website.


As mentioned by the other posters, external sorting is also a good strategy. I think the way I would decide between the two depends on the size of the data and speed requirements. A single machine is likely going to be limited to processing a single file at a time (since you will be using up available memory). So look into something like EC2 only if you need to process files faster than that.

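For reference, sorting with Hadoop mostly falls out of the framework's shuffle phase, which sorts mapper output by key. The sketch below is not from the original answer and all class names are invented: each line becomes a key and is written back out. With a single reducer the output is totally sorted; with multiple reducers you would additionally need a total-order partitioner to get one globally sorted result.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LineSortJob {

    // Emit each line as the key; the shuffle phase sorts keys for us.
    public static class LineMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            context.write(line, NullWritable.get());
        }
    }

    // Write each key back out once per occurrence, preserving duplicate lines.
    public static class LineReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text line, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            for (NullWritable ignored : values) {
                context.write(line, NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "line-sort");
        job.setJarByClass(LineSortJob.class);
        job.setMapperClass(LineMapper.class);
        job.setReducerClass(LineReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}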

Answer by x4u

You can read the file in smaller parts, sort these, and write them to temporary files. Then you read two of them sequentially again, merge them into a bigger temporary file, and so on. If there is only one left, your file is sorted. Basically this is the merge sort algorithm performed on external files. It scales quite well with arbitrarily large files but causes some extra file I/O.


Edit: If you have some knowledge about the likely distribution of the lines in your files, you can employ a more efficient algorithm (distribution sort). Simplified: you read the original file once and write each line to a temporary file that takes only lines starting with the same first character (or a certain range of first characters). Then you iterate over all the (now small) temporary files in ascending order, sort each one in memory, and append it directly to the output file. If a temporary file turns out to be too big to sort in memory, you can repeat the same process for it based on the second character of the lines, and so on. So if your first partitioning was good enough to produce small enough files, you will have only 100% I/O overhead regardless of how large the file is, but in the worst case it can become much more than with merge sort, whose performance is stable.


Answer by PaulP1975

I know you mentioned not using a database no matter how light... so, maybe this is not an option. But, what about hsqldb in memory... submit it, sort it by query, purge it. Just a thought.


Answer by FRotthowe

If your restriction is only not to use an external database system, you could try an embedded database (e.g. Apache Derby). That way, you get all the advantages of a database without any external infrastructure dependencies.


Answer by KLE

As others mentioned, you can process this in steps.
I would like to explain it in my own words (I differ on point 3):


  1. Read the file sequentially, processing N records at a time in memory (N is arbitrary, depending on your memory constraint and the number T of temporary files that you want).

  2. Sort the N records in memory and write them to a temp file. Repeat until the whole input has been read (this produces the T temp files).

  3. Open all T temp files at the same time, but read only one record from each file (with buffers, of course). For each of these T records, find the smallest, write it to the final file, and advance only in that file.




Advantages:

  • The memory consumption is as low as you want.
  • You only do double the disk accesses compared to an everything-in-memory policy. Not bad! :-)


Example with numbers:


  1. Original file with 1 million records.
  2. Choose to have 100 temp files, so read and sort 10,000 records at a time, and write each batch to its own temp file.
  3. Open the 100 temp files at the same time, and read the first record of each into memory.
  4. Compare these first records, write the smallest to the output, and advance only that temp file (a Java sketch of this k-way merge follows below).
  5. Loop on step 4, one million times.
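Here is a minimal Java sketch of steps 3-5 above, the T-way merge, assuming the sorted temp files from steps 1-2 already exist. The names (KWayMerge, RunCursor, mergeRuns) are invented for this example; a PriorityQueue holds the current record of each temp file and always yields the smallest.

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

public class KWayMerge {

    // One open temp file plus the record we are currently holding from it.
    private static final class RunCursor {
        final BufferedReader reader;
        String current;
        RunCursor(BufferedReader reader, String current) {
            this.reader = reader;
            this.current = current;
        }
    }

    // Merge already-sorted temp files into one sorted output file.
    static void mergeRuns(List<Path> runs, Path output) throws IOException {
        PriorityQueue<RunCursor> heap =
                new PriorityQueue<>(Comparator.comparing((RunCursor c) -> c.current));
        for (Path run : runs) {
            BufferedReader reader = Files.newBufferedReader(run, StandardCharsets.UTF_8);
            String first = reader.readLine();
            if (first != null) {
                heap.add(new RunCursor(reader, first));  // one buffered record per file
            } else {
                reader.close();
            }
        }
        try (BufferedWriter writer = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            while (!heap.isEmpty()) {
                RunCursor smallest = heap.poll();        // smallest current record wins
                writer.write(smallest.current);
                writer.newLine();
                String next = smallest.reader.readLine();
                if (next != null) {                      // advance only in that file
                    smallest.current = next;
                    heap.add(smallest);
                } else {
                    smallest.reader.close();
                }
            }
        }
    }
}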


EDITED


You mentioned a multi-threaded application, so I wonder...


As we have seen from the discussion of this requirement, using less memory costs performance, by a dramatic factor in this case. So I would also suggest using only one thread to process one sort at a time, rather than running it as a multi-threaded application.


If you run ten threads, each with a tenth of the available memory, your performance will be miserable, much worse than a tenth of the single-sort speed. If you use only one thread, queue the 9 other requests and process them in turn, your overall throughput will be much better and you will finish the ten tasks much faster.




After reading this response (Sort a file with huge volume of data given memory constraint), I suggest you consider the distribution sort described there. It could be a huge gain in your context.


The improvement over my proposal is that you don't need to open all the temp files at once; you only open one at a time. It saves your day! :-)


Answer by user218447

If you can move forward/backward in a file (seek), and rewrite parts of the file, then you should use bubble sort.


You would scan the lines in the file, keeping only 2 records in memory at a time, and swap them if they are not in the right order. Repeat the process until there are no more swaps to make.

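A rough sketch of what that could look like in Java follows. It makes an assumption the answer does not state: every record is exactly recordLength bytes, so two records can be swapped in place by seeking and rewriting. Keep in mind this is O(n²) passes over the disk, so it is more of a curiosity than a practical approach.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class FileBubbleSort {

    // Bubble-sorts a file in place. ASSUMPTION (not stated in the answer above):
    // every record is exactly recordLength bytes, so two records can be swapped
    // by seeking and rewriting without shifting the rest of the file.
    static void bubbleSort(RandomAccessFile file, int recordLength) throws IOException {
        long recordCount = file.length() / recordLength;
        byte[] a = new byte[recordLength];
        byte[] b = new byte[recordLength];
        boolean swapped = true;
        while (swapped) {                              // repeat passes until nothing is swapped
            swapped = false;
            for (long i = 0; i + 1 < recordCount; i++) {
                readRecord(file, i, recordLength, a);
                readRecord(file, i + 1, recordLength, b);
                String first = new String(a, StandardCharsets.UTF_8);
                String second = new String(b, StandardCharsets.UTF_8);
                if (first.compareTo(second) > 0) {     // out of order: swap on disk
                    writeRecord(file, i, recordLength, b);
                    writeRecord(file, i + 1, recordLength, a);
                    swapped = true;
                }
            }
        }
    }

    private static void readRecord(RandomAccessFile f, long index, int len, byte[] buf) throws IOException {
        f.seek(index * len);
        f.readFully(buf);
    }

    private static void writeRecord(RandomAccessFile f, long index, int len, byte[] buf) throws IOException {
        f.seek(index * len);
        f.write(buf);
    }
}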

Answer by VoidPointer

You could use the following divide-and-conquer strategy:


Create a function H() that can assign each record in the input file a number. For a record r2 that should sort after a record r1, H() must return a larger number for r2 than for r1. Use this function to partition all the records into separate files that each fit into memory, so you can sort them. Once you have done that, you can just concatenate the sorted files to get one large sorted file.


Suppose you have this input file, where each line represents a record:


Alan Smith
Jon Doe
Bill Murray
Johnny Cash

Let's build H() so that it uses the first letter of the record. You might get up to 26 files, but in this example you will get just 3:


<file1>
Alan Smith

<file2>
Bill Murray

<file10>
Jon Doe
Johnny Cash

Now you can sort each individual file, which would swap "Jon Doe" and "Johnny Cash" in <file10>. If you then simply concatenate the 3 files, you have a sorted version of the input.


Note that you divide first and only conquer (sort) later. However, make sure to do the partitioning in such a way that the resulting parts you need to sort don't overlap, which makes merging the results much simpler.


The method by which you implement the partitioning function H() depends very much on the nature of your input data. Once you have that part figured out the rest should be a breeze.

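Here is a minimal Java sketch of this divide-and-conquer approach, with H() taken to be the first character of the record as in the example above (class and method names are invented): records are routed to one bucket file per first character, each bucket is sorted in memory, and the buckets are concatenated in key order.

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

public class PartitionSort {

    // H(): here simply the first character of the record (as in the example above).
    static char h(String record) {
        return record.isEmpty() ? '\0' : record.charAt(0);
    }

    static void sortByPartitioning(Path input, Path output) throws IOException {
        // 1. Divide: route every record to the bucket file chosen by H().
        Map<Character, BufferedWriter> buckets = new TreeMap<>();   // TreeMap keeps H() order
        Map<Character, Path> bucketFiles = new TreeMap<>();
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                char key = h(line);
                BufferedWriter w = buckets.get(key);
                if (w == null) {
                    Path bucket = Files.createTempFile("bucket-" + key + "-", ".tmp");
                    bucketFiles.put(key, bucket);
                    w = Files.newBufferedWriter(bucket, StandardCharsets.UTF_8);
                    buckets.put(key, w);
                }
                w.write(line);
                w.newLine();
            }
        }
        for (BufferedWriter w : buckets.values()) {
            w.close();
        }

        // 2. Conquer: sort each (small) bucket in memory and append it to the output.
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (Path bucket : bucketFiles.values()) {
                List<String> lines = Files.readAllLines(bucket, StandardCharsets.UTF_8);
                Collections.sort(lines);
                for (String line : lines) {
                    out.write(line);
                    out.newLine();
                }
                Files.delete(bucket);
            }
        }
    }
}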

Answer by Eduardo

In spite of your restriction, I would use the embedded database SQLite3. Like you, I work weekly with 10-15 million flat-file lines, and it is very, very fast to import and generate sorted data, and you only need a small free executable (sqlite3.exe). For example, once you download the .exe file, you can do this in a command prompt:


C:> sqlite3.exe dbLines.db
sqlite> create table tabLines(line varchar(5000));
sqlite> create index idx1 on tabLines(line);
sqlite> .separator '\r\n'
sqlite> .import 'FileToImport' TabLines

then:


sqlite> select * from tabLines order by line;

or save to a file:
sqlite> .output out.txt
sqlite> select * from tabLines order by line;
sqlite> .output stdout

Answer by user2071703

You can use a SQLite file DB: load the data into the DB and then let it sort and return the results for you. Advantage: no need to worry about writing the best sorting algorithm. Disadvantages: you will need disk space, and processing is slower. https://sites.google.com/site/arjunwebworld/Home/programming/sorting-large-data-files

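For completeness, a rough sketch of doing this from Java with the sqlite-jdbc driver (org.xerial:sqlite-jdbc) might look like the following. The table and file names are made up, and for millions of lines you would want the batched, single-transaction load shown here.

import java.io.BufferedReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.sql.*;

public class SqliteSort {
    public static void main(String[] args) throws Exception {
        Path input = Paths.get("FileToSort.txt");        // hypothetical input file
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:lines.db")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS tab_lines(line TEXT)");
            }
            conn.setAutoCommit(false);                   // batch the load in one transaction
            try (PreparedStatement ps = conn.prepareStatement("INSERT INTO tab_lines(line) VALUES (?)");
                 BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    ps.setString(1, line);
                    ps.addBatch();
                }
                ps.executeBatch();
            }
            conn.commit();

            // Let the database do the sorting and stream the result back.
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery("SELECT line FROM tab_lines ORDER BY line")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }
}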