Sorting across multiple files in Linux
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must do so under the same CC BY-SA terms, cite the original URL, and attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/7693600/
sort across multiple files in linux
Asked by Paul
I have multiple (many) files; each very large:
file0.txt
file1.txt
file2.txt
I do not want to join them into a single file because the result would be 10+ GB. Each line in each file contains a 40-byte string. The strings are fairly well ordered already (roughly 1 step in 10 is a decrease in value instead of an increase).
I would like the lines ordered (in-place if possible?). This means some of the lines from the end of file0.txt will be moved to the beginning of file1.txt, and vice versa.
I am working on Linux and fairly new to it. I know about the sort command for a single file, but am wondering if there is a way to sort across multiple files. Or maybe there is a way to make a pseudo-file from the smaller files that Linux will treat as a single file.
What I know I can do: I can sort each file individually, then read into file1.txt to find the values larger than the largest value in file0.txt (and similarly grab the lines from the end of file0.txt), join and sort again. But this is a pain, and it assumes no values from file2.txt belong in file0.txt (though that is highly unlikely in my case).
Edit
To be clear, if the files look like this:
f0.txt
DDD
XXX
AAA
f1.txt
BBB
FFF
CCC
f2.txt
EEE
YYY
ZZZ
I want this:
f0.txt
AAA
BBB
CCC
f1.txt
DDD
EEE
FFF
f2.txt
XXX
YYY
ZZZ
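The desired transformation can be reproduced end to end with the sort-then-merge-then-split approach from the answers below. This is a sketch assuming GNU coreutils `sort` and `split`; the `chunk_` prefix is an arbitrary temporary name:

```shell
# Work in a scratch directory so no real files are touched.
dir=$(mktemp -d)
cd "$dir"

# Recreate the example input files.
printf 'DDD\nXXX\nAAA\n' > f0.txt
printf 'BBB\nFFF\nCCC\n' > f1.txt
printf 'EEE\nYYY\nZZZ\n' > f2.txt

# 1. Sort each file individually, in place.
for f in f?.txt; do sort -o "$f" "$f"; done

# 2. Merge the sorted files, split the stream back into 3-line
#    chunks, and rename the chunks over the original files.
sort -m f?.txt | split -l 3 - chunk_
i=0
for c in chunk_*; do mv "$c" "f$i.txt"; i=$((i+1)); done

cat f0.txt   # AAA, BBB, CCC (one per line)
cat f1.txt   # DDD, EEE, FFF
cat f2.txt   # XXX, YYY, ZZZ
```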
Accepted answer by JBert
I don't know about a command doing in-place sorting, but I think a faster "merge sort" is possible:
for file in *.txt; do
  sort -o "$file" "$file"
done
sort -m *.txt | split -d -l 1000000 - output
- The sort in the for loop makes sure the content of the input files is sorted. If you don't want to overwrite the originals, simply change the value after the -o parameter. (If you expect the files to be sorted already, you could change the sort statement to a check-only one: sort -c $file || exit 1.)
- The second sort does an efficient merge of the input files, all while keeping the output sorted.
- This is piped to the split command, which will then write to suffixed output files. Notice the - character; it tells split to read from standard input (i.e. the pipe) instead of a file.
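The check-only variant mentioned above can be sketched as follows; `sort -c` exits non-zero at the first out-of-order line instead of rewriting the file (the file here is a throwaway temp file, not from the question):

```shell
tmp=$(mktemp)

# A sorted file passes the check silently (exit status 0).
printf 'AAA\nBBB\nCCC\n' > "$tmp"
sort -c "$tmp" && echo "already sorted"

# An unsorted file fails the check with exit status 1.
printf 'BBB\nAAA\n' > "$tmp"
sort -c "$tmp" 2>/dev/null || echo "not sorted"

rm -f "$tmp"
```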
Also, here's a short summary of how the merge sort works:
1. sort reads a line from each file.
2. It orders these lines and selects the one which should come first. This line gets sent to the output, and a new line is read from the file which contained it.
3. Repeat step 2 until there are no more lines in any file.
4. At this point, the output should be a perfectly sorted file.
5. Profit!
Answered by evil otto
If the files are sorted individually, then you can use sort -m file*.txt to merge them together: read the first line of each file, output the smallest one, and repeat.
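A caveat worth making explicit: `sort -m` assumes each input is already sorted and silently produces unsorted output otherwise, which is why the per-file sort must come first. A small sketch using throwaway temp files:

```shell
a=$(mktemp); b=$(mktemp)
printf 'DDD\nAAA\n' > "$a"   # NOT sorted
printf 'BBB\nCCC\n' > "$b"   # sorted

# Merging an unsorted input yields unsorted output, with no error.
sort -m "$a" "$b" | sort -c 2>/dev/null || echo "merge output is not sorted"

# Sorting the offending file first makes the merge correct.
sort -o "$a" "$a"
sort -m "$a" "$b"            # AAA, BBB, CCC, DDD (one per line)

rm -f "$a" "$b"
```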
Answered by Cascabel
I believe that this is your best bet, using stock linux utilities:
- Sort each file individually, e.g. for f in file*.txt; do sort "$f" > "sorted_$f"; done
- Merge everything with sort -m sorted_file*.txt | split -d -l <lines> - <prefix>, where <lines> is the number of lines per output file and <prefix> is the output filename prefix. (The -d tells split to use numeric suffixes.)
The -m option tells sort that the input files are already sorted, so it can be smart about it.
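As a small illustration of the numeric suffixes (with GNU split; the 2-line chunk size and the `output` prefix are arbitrary here), -d switches the suffixes from aa, ab, ... to 00, 01, ...:

```shell
dir=$(mktemp -d)
cd "$dir"

# Five lines split into 2-line chunks with numeric suffixes.
printf 'AAA\nBBB\nCCC\nDDD\nEEE\n' | split -d -l 2 - output

ls output*        # output00 output01 output02
cat output02      # EEE (the last chunk holds the remainder)
```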
Answered by sarnold
It isn't exactly what you asked for, but the sort(1) utility can help a little, using the --merge option. Sort each file individually, then sort the resulting pile of files:
for f in file*.txt ; do sort -o "$f" "$f" ; done
sort --merge file*.txt | split -l 100000 - sorted_file
(That's 100,000 lines per output file. Perhaps that's still way too small.)
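The answer leaves the chunk size open; one way to pick it (an assumption, not part of the answer) is ceiling division of the total line count by the number of input files, so each output file stays about the size of an input file:

```shell
dir=$(mktemp -d)
cd "$dir"
printf 'C\nA\n' > file0.txt
printf 'B\nD\n' > file1.txt

# Lines per output file = ceil(total lines / number of files).
nfiles=$(ls file*.txt | wc -l)
total=$(cat file*.txt | wc -l)
lines=$(( (total + nfiles - 1) / nfiles ))

for f in file*.txt ; do sort -o "$f" "$f" ; done
sort --merge file*.txt | split -l "$lines" - sorted_file

ls sorted_file*   # sorted_fileaa sorted_fileab
```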
Answered by ott--
mmap() the three files; since all lines are 40 bytes long, you can easily sort them in place (SIP :-). Don't forget the msync() at the end.