Note: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/7610507/

Date: 2020-08-05 06:26:27  Source: igfitidea

"find" and "ls" with GNU parallel

Tags: linux, bash, parallel-processing, find, gnu-parallel

Asked by Dave

I'm trying to use GNU parallel to post a lot of files to a web server. In my directory, I have some files:

file1.xml
file2.xml

and I have a shell script that looks like this:

#! /usr/bin/env bash

CMD="curl -X POST -d@$1 http://server/path"

eval $CMD

There's some other stuff in the script, but this was the simplest example. I tried to execute the following command:

ls | parallel -j2 script.sh {}

This is what the GNU parallel pages show as the "normal" way to operate on files in a directory. This seems to pass the name of the file into my script, but curl complains that it can't load the data file passed in. However, if I do:

find . -name '*.xml' | parallel -j2 script.sh {}

it works fine. Is there a difference between how ls and find are passing arguments to my script? Or do I need to do something additional in that script?

Accepted answer by another.anon.coward

I have not used parallel, but there is a difference between ls and find . -name '*.xml'. ls will list all the files and directories, whereas find . -name '*.xml' will list only the files (and directories) whose names end with .xml.
As suggested by Paul Rubel, just print the value of $1 in your script to check this. Additionally, you may want to consider filtering the input to files only in find with the -type f option.
Hope this helps!
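
The difference described above can be reproduced in a scratch directory. This is only a minimal sketch: the directory comes from mktemp, and the names notes.txt and some_directory are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical scratch directory; all names below are for illustration only.
tmp=$(mktemp -d)
touch "$tmp/file1.xml" "$tmp/file2.xml" "$tmp/notes.txt"
mkdir "$tmp/some_directory"

# ls lists every entry, including the directory and the non-XML file.
ls_out=$(ls "$tmp")

# find with -name and -type f keeps only regular files ending in .xml.
find_out=$(find "$tmp" -name '*.xml' -type f | sort)

printf 'ls:\n%s\n\nfind:\n%s\n' "$ls_out" "$find_out"
rm -rf "$tmp"
```

Anything in the ls list that is not a readable data file would make curl fail when it is handed to the script.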

Answer by Casey

Neat.

I had never used parallel before. It appears, though, that there are two of them. One is GNU Parallel, and the one that was installed on my system has Tollef Fog Heen listed as the author in the man pages.

As Paul mentioned, you should use set -x.

Also, the paradigm that you mentioned above doesn't seem to work on my parallel; instead, I have to do the following:

$ cat ../script.sh
+ cat ../script.sh
#!/bin/bash
echo $@
$ parallel -ij2 ../script.sh {} -- $(find -name '*.xml')
++ find -name '*.xml'
+ parallel -ij2 ../script.sh '{}' -- ./b.xml ./c.xml ./a.xml ./d.xml ./e.xml
./c.xml
./b.xml
./d.xml
./a.xml
./e.xml
$ parallel -ij2 ../script.sh {} -- $(ls *.xml)
++ ls --color=auto a.xml b.xml c.xml d.xml e.xml
+ parallel -ij2 ../script.sh '{}' -- a.xml b.xml c.xml d.xml e.xml
b.xml
a.xml
d.xml
c.xml
e.xml

find does provide a different input: it prepends the relative path to the name. Maybe that is what is messing up your script?
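
To see the prefix in isolation, here is a small sketch that compares the two listings and strips the leading ./ with sed. The scratch directory and the file names a.xml and b.xml are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical scratch directory with two sample files.
tmp=$(mktemp -d)
touch "$tmp/a.xml" "$tmp/b.xml"

# ls prints bare names; find prints each name with a leading "./".
ls_names=$(cd "$tmp" && ls *.xml)
find_names=$(cd "$tmp" && find . -name '*.xml' | sort)

# Removing the "./" prefix makes the find output match the ls output.
stripped=$(cd "$tmp" && find . -name '*.xml' | sed 's|^\./||' | sort)

printf 'ls:\n%s\nfind:\n%s\nstripped:\n%s\n' "$ls_names" "$find_names" "$stripped"
rm -rf "$tmp"
```

Both forms name the same files relative to the current directory, so a script that only opens the file should accept either.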

Answer by Swiss

GNU parallel is a variant of xargs. They both have very similar interfaces, and if you're looking for help on parallel, you may have more luck looking up information about xargs.

That being said, the way they both operate is fairly simple. With their default behavior, both programs read input from STDIN, then break the input up into tokens based on whitespace. Each of these tokens is then passed to a provided program as an argument. The default for xargs is to pass as many tokens as possible to the program, and then start a new process when the limit is hit. I'm not sure how the default for parallel works.

Here is an example:

> echo "foo    bar \
  baz" | xargs echo
foo bar baz
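
The batching behavior can be made visible by comparing the default with xargs -n 1, which starts one process per token. This is a small sketch rather than a full description of either tool; as far as I know, GNU parallel by default runs one job per input line, which is closer to the second form.

```shell
#!/usr/bin/env bash
# Default: xargs packs as many tokens as it can into a single echo call.
batched=$(printf 'foo\nbar\nbaz\n' | xargs echo)

# With -n 1, xargs runs echo once per token, producing one line each.
one_each=$(printf 'foo\nbar\nbaz\n' | xargs -n 1 echo)

printf 'batched:\n%s\none per call:\n%s\n' "$batched" "$one_each"
```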

There are some problems with the default behavior, so it is common to see several variations.

The first issue is that because whitespace is used to tokenize, any file names with whitespace in them will cause parallel and xargs to break. One solution is to tokenize around the NULL character instead. find even provides an option to make this easy to do:

> echo "Success!" > bad\ filename
> find . -name "bad filename" -print0 | xargs -0 cat
Success!

The -print0 option tells find to separate files with the NULL character instead of newlines.
The -0 option tells xargs to use the NULL character to tokenize each argument.
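
The following sketch contrasts the default splitting with the NULL-separated form. It assumes a scratch directory from mktemp whose path contains no whitespace; the file name "bad filename" is taken from the example above.

```shell
#!/usr/bin/env bash
# Hypothetical scratch directory holding one file with a space in its name.
tmp=$(mktemp -d)
echo "Success!" > "$tmp/bad filename"

# Default splitting: xargs passes ".../bad" and "filename" as two arguments,
# so cat cannot open either one. Errors are discarded for the demonstration.
broken=$(find "$tmp" -type f | xargs cat 2>/dev/null || true)

# NULL-separated: the whole path arrives as a single argument.
safe=$(find "$tmp" -type f -print0 | xargs -0 cat)

printf 'default: [%s]\nwith -print0/-0: [%s]\n' "$broken" "$safe"
rm -rf "$tmp"
```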

Note that parallel is a little better than xargs in that its default behavior is to tokenize around newlines only, so there is less of a need to change the default behavior.

Another common issue is that you may want to control how the arguments are passed to xargs or parallel. If you need to have a specific placement of the arguments passed to the program, you can use {} to specify where the argument is to be placed (xargs needs the -I {} option for this; parallel recognizes {} by default).

> mkdir new_dir
> find . -name '*.xml' | xargs -I {} mv {} new_dir

This will move all .xml files in the current directory and its subdirectories into the new_dir directory. It actually breaks down into the following:

> find . -name '*.xml' | xargs -I {} echo mv {} new_dir
mv ./foo.xml new_dir
mv ./bar.xml new_dir
mv ./baz.xml new_dir
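
As a self-contained sketch of the placement idea, the snippet below moves two sample files with xargs -I {}. The src and new_dir directories and the file names are assumptions, and -print0 with -0 is added so that names containing spaces would also survive.

```shell
#!/usr/bin/env bash
# Hypothetical layout: XML files in src, destination in new_dir.
tmp=$(mktemp -d)
mkdir "$tmp/src" "$tmp/new_dir"
touch "$tmp/src/foo.xml" "$tmp/src/bar.xml"

# -I {} makes xargs run mv once per file, substituting the path for {}.
find "$tmp/src" -name '*.xml' -print0 | xargs -0 -I {} mv {} "$tmp/new_dir"

moved=$(ls "$tmp/new_dir")
remaining=$(find "$tmp/src" -name '*.xml' | wc -l)

printf 'moved:\n%s\nleft behind: %s\n' "$moved" "$remaining"
rm -rf "$tmp"
```

Keeping the destination outside the searched tree avoids find descending into new_dir while files are being moved there.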

So, taking into consideration how xargs and parallel work, you should hopefully be able to see the issue with your command. find . -name '*.xml' will generate a list of XML files to be passed to the script.sh program.

> find . -name '*.xml' | parallel -j2 echo script.sh {}
script.sh ./foo.xml
script.sh ./bar.xml
script.sh ./baz.xml

However, ls | parallel -j2 script.sh {} will generate a list of ALL files in the current directory to be passed to the script.sh program.

> ls | parallel -j2 echo script.sh {}
script.sh some_directory
script.sh some_file
script.sh foo.xml
...

A more correct variant of the ls version would be as follows:

> ls *.xml | parallel -j2 script.sh {}

However, an important difference between this and the find version is that find will search through all subdirectories for files, while ls will only search the current directory. The equivalent find version of the above ls command would be as follows:

> find -maxdepth 1 -name '*.xml'

This will only search the current directory.
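
A quick way to check the effect of -maxdepth 1 is to compare it with the recursive default. This is a sketch in a scratch directory; top.xml and sub/nested.xml are made-up names.

```shell
#!/usr/bin/env bash
# Hypothetical layout: one XML file at the top level, one in a subdirectory.
tmp=$(mktemp -d)
mkdir "$tmp/sub"
touch "$tmp/top.xml" "$tmp/sub/nested.xml"

# Without -maxdepth, find descends into sub and reports both files.
recursive=$(cd "$tmp" && find . -name '*.xml' | sort)

# With -maxdepth 1, only entries directly inside the starting directory match.
shallow=$(cd "$tmp" && find . -maxdepth 1 -name '*.xml')

printf 'recursive:\n%s\nmaxdepth 1:\n%s\n' "$recursive" "$shallow"
rm -rf "$tmp"
```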

Answer by Ole Tange

Since it works with find, you probably want to see what command GNU Parallel is running (using -v or --dryrun) and then try to run the failing commands manually.

ls *.xml | parallel --dryrun -j2 script.sh
find -maxdepth 1 -name '*.xml' | parallel --dryrun -j2 script.sh
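
Once the dry run shows which arguments reach the script, the script itself can be made to reject bad ones. The following is only a sketch of how script.sh might do that, not the asker's actual script: the function name post_file and the DRY_RUN variable are hypothetical, and http://server/path is the placeholder URL from the question.

```shell
#!/usr/bin/env bash
# Hypothetical variant of script.sh; post_file and DRY_RUN are illustrative names.
post_file() {
    local file=$1
    # Refuse anything that is not a readable regular file (e.g. a directory from ls).
    if [ ! -f "$file" ] || [ ! -r "$file" ]; then
        echo "skip: '$file' is not a readable file" >&2
        return 1
    fi
    # Build the command as an array so file names with spaces survive, without eval.
    local cmd=(curl -X POST -d "@$file" http://server/path)
    if [ -n "${DRY_RUN:-}" ]; then
        # Print the command instead of contacting the server.
        printf '%s\n' "${cmd[*]}"
    else
        "${cmd[@]}"
    fi
}

# Only act when an argument is given, so the file can also be sourced.
if [ "$#" -gt 0 ]; then
    post_file "$1"
fi
```

Running it as DRY_RUN=1 ./script.sh file1.xml prints the curl command it would execute, which plays the same role inside the script as --dryrun does for parallel.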