bash - Fastest way to shuffle lines in a file in Linux
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/14727151/
Fastest way to shuffle lines in a file in Linux
Asked by alpha_cod
I want to shuffle a large file with millions of lines of strings in Linux. I tried 'sort -R', but it is very slow (it takes about 50 minutes for a 16M file). Is there a faster utility that I can use instead?
Answered by dshepherd
Use shuf instead of sort -R (man page).
The slowness of sort -R is probably due to it hashing every line. shuf just does a random permutation, so it doesn't have that problem.
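For example (the file names here are just placeholders), shuffling with shuf is a one-liner:

# Shuffle all lines of input.txt and write the result to shuffled.txt.
# shuf holds the whole input in memory, so this assumes the file fits in RAM.
shuf input.txt > shuffled.txt

# The same thing, using -o to name the output file directly:
shuf -o shuffled.txt input.txt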
(This was suggested in a comment but for some reason not written as an answer by anyone)
Answered by Paul Miller
Based on your description, the 50 minutes is not caused by the actual mechanics of sorting. The time is likely spent waiting on /dev/random to generate enough entropy.
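If waiting for entropy really is the bottleneck, one quick thing to try (just a sketch; the file name is a placeholder) is to point GNU sort at the non-blocking /dev/urandom device:

# Take randomness from /dev/urandom so sort -R never blocks waiting for entropy.
sort -R --random-source=/dev/urandom string_data.txt > shuffled.txt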
One approach is to use an external source of random data (http://random.org, for example) along with a variation on a Schwartzian Transform. The Schwartzian Transform turns the data to be sorted into "enriched" data with the sort key embedded. The data is sorted using the key and then the key is discarded.
To apply this to your problem:
- Generate a text file with random numbers, one per line, with the same number of lines as the file to be sorted. This can be done at any time, run in the background, run on a different server, downloaded from random.org, etc. The point is that this randomness is not generated while you are trying to sort.
- Create an enriched version of the file using paste: paste random_number_file.txt string_data.txt > tmp_string_data.txt
- Sort this file: sort tmp_string_data.txt > sorted_tmp_string_data.txt
- Remove the random data: cut -f2- sorted_tmp_string_data.txt > random_string_data.txt
This is the basic idea. I tried it and it does work, but I don't have 16 million lines of text or 16 million lines of random numbers. You may want to pipeline some of those steps instead of saving it all to disk.
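As a rough sketch of that pipelining (here awk's rand() stands in for a pre-generated random-number file, and the file names are placeholders), the three steps can be chained without writing intermediate files:

# Prepend a random key to every line, sort on the key, then strip the key again.
paste <(awk 'BEGIN { srand() } { print rand() }' string_data.txt) string_data.txt \
    | sort -k1,1g \
    | cut -f2- > random_string_data.txt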
Answered by Mikhail
You may try my tool: HugeFileProcessor. It's capable of shuffling files of hundreds of GBs in a reasonable time.
Here are the details on the shuffling implementation. It requires specifying batchSize - the number of lines to keep in RAM when writing to the output. The more the better (unless you are out of RAM), because the total shuffling time would be (number of lines in sourceFile) / batchSize * (time to fully read sourceFile). Please note that the program shuffles the whole file, not on a per-batch basis.
The algorithm is as follows.
Count lines in sourceFile. This is done simply by reading the whole file line by line. (See some comparisons here.) This also gives a measurement of how much time it takes to read the whole file once, so we can estimate how long a complete shuffle would take, because it requires Ceil(linesCount / batchSize) complete file reads.
Now that we know the total linesCount, we can create an index array of linesCount size and shuffle it using Fisher–Yates (called orderArray in the code). This gives us the order in which we want the lines to appear in the shuffled file. Note that this is a global order over the whole file, not per batch or chunk.
Now the actual code. We need to get all the lines from sourceFile in the order we just computed, but we can't read the whole file into memory. So we just split the task.
- We go through sourceFile, reading all lines but keeping in memory only those lines whose numbers fall in the first batchSize entries of orderArray. Once we have all these lines, we write them into outFile in the required order, and that is batchSize/linesCount of the work done.
- Next we repeat the whole process again and again, taking the next part of orderArray and reading sourceFile from start to end for each part. Eventually the whole orderArray is processed and we are done. (A rough shell sketch of this multi-pass idea follows the list.)
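For illustration only, here is a rough bash/awk sketch of that multi-pass idea (this is not HugeFileProcessor itself; the file names and batch size are made-up placeholders, and shuf -i stands in for the Fisher–Yates shuffle of the index array):

src=source.txt; out=shuffled.txt; batch=1000000

lines=$(wc -l < "$src")                    # count lines in the source file
shuf -i 1-"$lines" > order.txt             # global random permutation of line numbers
split -l "$batch" order.txt order_part_    # one chunk of the permutation per pass
: > "$out"

for part in order_part_*; do
    # One sequential pass over the source per chunk: keep in memory only the lines
    # whose numbers appear in this chunk, then emit them in the chunk's order.
    awk 'NR==FNR { pos[$1] = ++n; next }
         FNR in pos { line[pos[FNR]] = $0 }
         END { for (i = 1; i <= n; i++) print line[i] }' "$part" "$src" >> "$out"
done
rm -f order.txt order_part_*

Each pass reads the source strictly from start to end, which is the sequential access pattern described next.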
Why does it work?
Because all we do is read the source file from start to end. No forward/backward seeks, and that's what HDDs like. The file gets read in chunks according to internal HDD buffers, FS blocks, CPU cache, etc., and everything is read sequentially.
Some numbers
On my machine (Core i5, 16 GB RAM, Win8.1, HDD Toshiba DT01ACA200 2TB, NTFS) I was able to shuffle a file of 132 GB (84 000 000 lines) in around 5 hours using a batchSize of 3 500 000. With a batchSize of 2 000 000 it took around 8 hours. Reading speed was around 118 000 lines per second.
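These figures are consistent with the formula above: with 84 000 000 lines and a batchSize of 3 500 000 the shuffle needs Ceil(84 000 000 / 3 500 000) = 24 full passes over the source; at about 118 000 lines per second one pass takes roughly 84 000 000 / 118 000 ≈ 712 seconds (about 12 minutes), so 24 passes come to roughly 4.8 hours. A batchSize of 2 000 000 means 42 passes, or roughly 8.3 hours.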

