Sorting across multiple files in Linux
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must do so under the same CC BY-SA terms, cite the original URL, and attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/7693600/
sort across multiple files in linux
Asked by Paul
I have multiple (many) files; each very large:
file0.txt
file1.txt
file2.txt
I do not want to join them into a single file because the result would be 10+ GB. Each line in each file contains a 40-byte string. The strings are fairly well ordered already (roughly 1 step in 10 is a decrease in value instead of an increase).
I would like the lines ordered (in-place if possible?). This means some of the lines from the end of file0.txt will be moved to the beginning of file1.txt, and vice versa.
I am working on Linux and fairly new to it. I know about the sort command for a single file, but am wondering if there is a way to sort across multiple files. Or maybe there is a way to make a pseudo-file from the smaller files that Linux will treat as a single file.
What I know I can do: I can sort each file individually, then read into file1.txt to find the values larger than the largest value in file0.txt (and similarly grab the lines from the end of file0.txt), join and sort again. But this is a pain, and it assumes no values from file2.txt belong in file0.txt (though that is highly unlikely in my case).
Edit
To be clear, if the files look like this:
f0.txt
DDD
XXX
AAA
f1.txt
BBB
FFF
CCC
f2.txt
EEE
YYY
ZZZ
I want this:
f0.txt
AAA
BBB
CCC
f1.txt
DDD
EEE
FFF
f2.txt
XXX
YYY
ZZZ
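The desired transformation can be reproduced end to end with the sort-then-merge-then-split approach from the answers below. This is a sketch assuming GNU coreutils `sort` and `split`; the `chunk_` prefix is an arbitrary temporary name:

```shell
# Work in a scratch directory so no real files are touched.
dir=$(mktemp -d)
cd "$dir"

# Recreate the example input files.
printf 'DDD\nXXX\nAAA\n' > f0.txt
printf 'BBB\nFFF\nCCC\n' > f1.txt
printf 'EEE\nYYY\nZZZ\n' > f2.txt

# 1. Sort each file individually, in place.
for f in f?.txt; do sort -o "$f" "$f"; done

# 2. Merge the sorted files, split the stream back into 3-line
#    chunks, and rename the chunks over the original files.
sort -m f?.txt | split -l 3 - chunk_
i=0
for c in chunk_*; do mv "$c" "f$i.txt"; i=$((i+1)); done

cat f0.txt   # AAA, BBB, CCC (one per line)
cat f1.txt   # DDD, EEE, FFF
cat f2.txt   # XXX, YYY, ZZZ
```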
Accepted answer by JBert
I don't know about a command doing in-place sorting, but I think a faster "merge sort" is possible:
for file in *.txt; do
  sort -o "$file" "$file"
done
sort -m *.txt | split -d -l 1000000 - output
- The sort in the for loop makes sure the content of the input files is sorted. If you don't want to overwrite the originals, simply change the value after the -o parameter. (If you expect the files to be sorted already, you could change the sort statement to a check-only one: sort -c $file || exit 1.)
- The second sort does an efficient merge of the input files, all while keeping the output sorted.
- This is piped to the split command, which will then write to suffixed output files. Notice the - character; it tells split to read from standard input (i.e. the pipe) instead of a file.
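The check-only variant mentioned above can be sketched as follows; `sort -c` exits non-zero at the first out-of-order line instead of rewriting the file (the file here is a throwaway temp file, not from the question):

```shell
tmp=$(mktemp)

# A sorted file passes the check silently (exit status 0).
printf 'AAA\nBBB\nCCC\n' > "$tmp"
sort -c "$tmp" && echo "already sorted"

# An unsorted file fails the check with exit status 1.
printf 'BBB\nAAA\n' > "$tmp"
sort -c "$tmp" 2>/dev/null || echo "not sorted"

rm -f "$tmp"
```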
Also, here's a short summary of how the merge sort works:
1. sort reads a line from each file.
2. It orders these lines and selects the one which should come first. This line gets sent to the output, and a new line is read from the file which contained it.
3. Repeat step 2 until there are no more lines in any file.
4. At this point, the output should be a perfectly sorted file.
5. Profit!
Answered by evil otto
If the files are sorted individually, then you can use sort -m file*.txt to merge them together: read the first line of each file, output the smallest one, and repeat.
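A caveat worth making explicit: `sort -m` assumes each input is already sorted and silently produces unsorted output otherwise, which is why the per-file sort must come first. A small sketch using throwaway temp files:

```shell
a=$(mktemp); b=$(mktemp)
printf 'DDD\nAAA\n' > "$a"   # NOT sorted
printf 'BBB\nCCC\n' > "$b"   # sorted

# Merging an unsorted input yields unsorted output, with no error.
sort -m "$a" "$b" | sort -c 2>/dev/null || echo "merge output is not sorted"

# Sorting the offending file first makes the merge correct.
sort -o "$a" "$a"
sort -m "$a" "$b"            # AAA, BBB, CCC, DDD (one per line)

rm -f "$a" "$b"
```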
Answered by Cascabel
I believe that this is your best bet, using stock linux utilities:
- Sort each file individually, e.g. for f in file*.txt; do sort "$f" > "sorted_$f"; done
- Merge everything with sort -m sorted_file*.txt | split -d -l <lines> - <prefix>, where <lines> is the number of lines per output file and <prefix> is the output filename prefix. (The -d tells split to use numeric suffixes.)
The -m option tells sort that the input files are already sorted, so it can be smart about it.
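As a small illustration of the numeric suffixes (with GNU split; the 2-line chunk size and the `output` prefix are arbitrary here), -d switches the suffixes from aa, ab, ... to 00, 01, ...:

```shell
dir=$(mktemp -d)
cd "$dir"

# Five lines split into 2-line chunks with numeric suffixes.
printf 'AAA\nBBB\nCCC\nDDD\nEEE\n' | split -d -l 2 - output

ls output*        # output00 output01 output02
cat output02      # EEE (the last chunk holds the remainder)
```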
Answered by sarnold
It isn't exactly what you asked for, but the sort(1) utility can help a little, using the --merge option. Sort each file individually, then sort the resulting pile of files:
for f in file*.txt ; do sort -o "$f" "$f" ; done
sort --merge file*.txt | split -l 100000 - sorted_file
(That's 100,000 lines per output file. Perhaps that's still way too small.)
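The answer leaves the chunk size open; one way to pick it (an assumption, not part of the answer) is ceiling division of the total line count by the number of input files, so each output file stays about the size of an input file:

```shell
dir=$(mktemp -d)
cd "$dir"
printf 'C\nA\n' > file0.txt
printf 'B\nD\n' > file1.txt

# Lines per output file = ceil(total lines / number of files).
nfiles=$(ls file*.txt | wc -l)
total=$(cat file*.txt | wc -l)
lines=$(( (total + nfiles - 1) / nfiles ))

for f in file*.txt ; do sort -o "$f" "$f" ; done
sort --merge file*.txt | split -l "$lines" - sorted_file

ls sorted_file*   # sorted_fileaa sorted_fileab
```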
Answered by ott--
mmap() the three files; since all lines are 40 bytes long, you can easily sort them in place (SIP :-). Don't forget the msync() at the end.