Fast concatenation of multiple files on Linux

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must attribute it to the original authors (not the translator). Original question: http://stackoverflow.com/questions/5893531/

Tags: linux, copy, parallel-processing, cat

Question by san

I am using Python multiprocessing to generate a temporary output file per process. These files can be several GB in size, and I make several tens of them. The temporary files need to be concatenated to form the desired output, and this step is proving to be a bottleneck (and a parallelism killer). Is there a Linux tool that will create the concatenated file by modifying the file-system metadata rather than actually copying the content? As long as it works on any Linux system it would be acceptable to me, but a file-system-specific solution won't be of much help.

I am not OS- or CS-trained, but in theory it seems it should be possible to create a new inode, copy over the inode pointer structure from the inodes of the files I want to combine, and then unlink those inodes. Is there any utility that will do this? Given the surfeit of well-thought-out Unix utilities, I fully expected there to be one, but could not find anything. Hence my question on SO. The file system is on a block device, a hard disk actually, in case this information matters. I don't have the confidence to write this on my own, as I have never done any systems-level programming before, so any pointers (to C/Python code snippets) would be very helpful.

Accepted answer by Marc Mutz - mmutz

Even if there were such a tool, it could only work if all the files except the last were guaranteed to have a size that is a multiple of the filesystem's block size.

If you control how the data is written into the temporary files, and you know how large each one will be, you can instead do the following:

  1. Before starting the multiprocessing, create the final output file and grow it to the final size by fseek()ing to the end; this will create a sparse file.

  2. Start the multiprocessing, handing each process the FD and the offset into its particular slice of the file (a minimal sketch of this scheme follows below).

This way, the processes will collaboratively fill the single output file, removing the need to cat them together later.

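A minimal Python sketch of this approach, assuming four workers and a known, fixed slice size (the file name, worker count, and slice size are all illustrative):

    import os
    from multiprocessing import Process

    OUT = "output.bin"        # hypothetical final output file
    SLICE = 1024 * 1024       # assumed size of each worker's slice
    NWORKERS = 4

    def worker(idx):
        # Each worker opens the shared file itself, so the processes do
        # not race on a shared file offset.
        with open(OUT, "r+b") as f:
            f.seek(idx * SLICE)
            f.write(b"\0" * SLICE)  # stand-in for the real generated data

    if __name__ == "__main__":
        # Pre-create the file at its final size; extending with truncate()
        # produces a sparse file on common Linux filesystems.
        with open(OUT, "wb") as f:
            f.truncate(NWORKERS * SLICE)
        procs = [Process(target=worker, args=(i,)) for i in range(NWORKERS)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()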

EDIT

If you can't predict the size of the individual files, but the consumer of the final file can work with sequential (as opposed to random-access) input, you can feed cat tmpfile1 ... tmpfileN to the consumer, either on stdin:

cat tmpfile1 ... tmpfileN | consumer

or via named pipes (using bash's Process Substitution):

consumer <(cat tmpfile1 ... tmpfileN)
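
If the producing script is itself Python, the same sequential hand-off can be driven without a shell. A minimal sketch, assuming a consumer executable named "consumer" that reads from stdin (both the program name and the temp-file names are placeholders):

    import subprocess

    # Equivalent of: cat tmpfile1 ... tmpfileN | consumer
    proc = subprocess.Popen(["consumer"], stdin=subprocess.PIPE)
    for name in ["tmpfile1", "tmpfile2"]:
        with open(name, "rb") as f:
            while True:
                chunk = f.read(1 << 20)  # stream in 1 MiB chunks
                if not chunk:
                    break
                proc.stdin.write(chunk)
    proc.stdin.close()
    proc.wait()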

Answer by janneb

No, there is no such tool or syscall.

You might investigate whether it's possible for each process to write directly into the final file. Say process 1 writes bytes 0 to X-1, process 2 writes bytes X to 2X-1, and so on.

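From Python, one safe way to do that is os.pwrite, which writes at an absolute offset without touching the shared file position. A minimal sketch (the file name, slice index, and slice size X are illustrative):

    import os

    X = 1024 * 1024            # assumed slice size
    i = 1                      # this worker's slice index
    data = b"\x01" * X         # stand-in for this worker's real output
    # The final file is assumed to already exist at its full size.
    fd = os.open("output.bin", os.O_WRONLY)
    # pwrite() combines the seek and the write in one call, so concurrent
    # workers don't race on a shared file position.
    os.pwrite(fd, data, i * X)
    os.close(fd)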

Answer by Xiè Jìléi

I don't think so. File data is block-aligned on disk, so this could only work if you were OK with leaving some zeros (or unknown bytes) between one file's footer and the next file's header.

Instead of concatenating these files, I'd suggest redesigning the analysis tool to support reading from multiple files. Take log files, for example: many log analyzers can read a separate log file for each day.

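For example, if the analysis code is Python, a small generator can present many files as one byte stream, so nothing ever needs to be physically concatenated. A sketch (the file names are placeholders):

    def chained(paths, chunk=1 << 20):
        # Yield the contents of several files as one logical stream.
        for path in paths:
            with open(path, "rb") as f:
                while True:
                    block = f.read(chunk)
                    if not block:
                        break
                    yield block

    for block in chained(["tmpfile1", "tmpfile2"]):
        pass  # hand each block to the analyzer here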

EDIT

@san: Since, as you say, you can't control the code in use, you can concatenate the separate files on the fly using a named pipe:

$ mkfifo /tmp/cat
$ cat file1 file2 ... >/tmp/cat &
$ user_program /tmp/cat
...
$ rm /tmp/cat
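
The same trick works from within Python if you'd rather not shell out. A minimal sketch, assuming the unmodified consumer is called "user_program" and takes a filename argument:

    import os
    import shutil
    import subprocess

    fifo = "/tmp/cat"
    os.mkfifo(fifo)
    # Start the unmodified consumer, pointed at the pipe.
    consumer = subprocess.Popen(["user_program", fifo])
    # Opening a FIFO for writing blocks until the reader opens its end;
    # then each temp file is streamed through in order.
    with open(fifo, "wb") as out:
        for name in ["file1", "file2"]:
            with open(name, "rb") as src:
                shutil.copyfileobj(src, out)
    consumer.wait()
    os.unlink(fifo)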

Answer by NPE

You indicate that you don't know in advance the size of each temporary file. With this in mind, I think your best bet is to write a FUSE filesystem that presents the chunks as a single large file while keeping them as individual files on the underlying filesystem.

In this solution, your producing and consuming apps remain unchanged. The producers write out a bunch of files that the FUSE layer makes appear as a single file. This virtual file is then presented to the consumer.

FUSE has bindings for a bunch of languages, including Python. If you look at some examples here or here (these are for different bindings), you'll see this requires surprisingly little code.

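A minimal sketch of such a filesystem using the third-party fusepy binding (pip install fusepy); the chunk names, the mount point, and the virtual file name "joined" are all illustrative:

    import errno
    import os
    import stat
    from fuse import FUSE, FuseOSError, Operations

    class ConcatFS(Operations):
        # Expose several chunk files as one read-only virtual file.
        def __init__(self, chunks):
            self.chunks = chunks
            self.sizes = [os.path.getsize(c) for c in chunks]

        def getattr(self, path, fh=None):
            if path == "/":
                return {"st_mode": stat.S_IFDIR | 0o755, "st_nlink": 2}
            if path == "/joined":
                return {"st_mode": stat.S_IFREG | 0o444, "st_nlink": 1,
                        "st_size": sum(self.sizes)}
            raise FuseOSError(errno.ENOENT)

        def readdir(self, path, fh):
            return [".", "..", "joined"]

        def read(self, path, size, offset, fh):
            # Map the requested byte range onto the underlying chunks.
            out = b""
            for chunk, csize in zip(self.chunks, self.sizes):
                if offset >= csize:
                    offset -= csize
                    continue
                with open(chunk, "rb") as f:
                    f.seek(offset)
                    out += f.read(min(size - len(out), csize - offset))
                offset = 0
                if len(out) >= size:
                    break
            return out

    if __name__ == "__main__":
        FUSE(ConcatFS(["tmpfile1", "tmpfile2"]), "/mnt/concat",
             foreground=True)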

Answer by Ryan C. Thompson

A potential alternative is to cat all your temp files into a named pipe and then use that named pipe as input to your single-input program, as long as that program just reads the input sequentially and doesn't seek.

Answer by szabozoltan

For 4 files (xaa, xab, xac, xad), a fast concatenation in bash (as root):

losetup -v -f xaa; losetup -v -f xab; losetup -v -f xac; losetup -v -f xad

(Let's suppose that loop0, loop1, loop2, loop3 are the names of the new device files.)

Put http://pastebin.com/PtEDQH7G into a "join_us" script file. Then you can use it like this:

./join_us /dev/loop{0..3}

Then (if this big file is a film) you can give its ownership to a normal user (chown itsme /dev/mapper/joined), who can then play it via: mplayer /dev/mapper/joined

Cleanup afterwards (as root):

dmsetup remove joined; losetup -d /dev/loop[0123]