Linux Shell: find files in a list under a directory

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not the translator). Original: http://stackoverflow.com/questions/9953377/


Shell: find files in a list under a directory

Tags: linux, bash, shell

Asked by Dagang

I have a list of about 1000 file names to search for under a directory and its subdirectories. There are hundreds of subdirectories containing more than 1,000,000 files in total. The following command runs find 1000 times:


cat filelist.txt | while read -r f; do find /dir -name "$f"; done

Is there a much faster way to do it?


Accepted answer by huon

If filelist.txt has a single filename per line:


find /dir | grep -f <(sed 's@^@/@; s/$/$/; s/\([\.[\*]\|\]\)/\\\1/g' filelist.txt)

(The -f option means that grep searches for all the patterns in the given file.)


Explanation of <(sed 's@^@/@; s/$/$/; s/\([\.[\*]\|\]\)/\\\1/g' filelist.txt):


The <( ... ) is called a process substitution, and is a little similar to $( ... ). The following two-step version is equivalent (but using the process substitution is neater and possibly a little faster):


sed 's@^@/@; s/$/$/; s/\([\.[\*]\|\]\)/\\\1/g' filelist.txt > processed_filelist.txt
find /dir | grep -f processed_filelist.txt
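
As a quick illustration of what the shell does here (bash-specific; the exact /dev/fd path may differ on your system), <( ... ) expands to a file-like path that the outer command reads from:

$ echo <(true)
/dev/fd/63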

The call to sed runs the commands s@^@/@, s/$/$/ and s/\([\.[\*]\|\]\)/\\\1/g on each line of filelist.txt and prints them out. These commands convert the filenames into a format that will work better with grep.


  • s@^@/@ means put a / before each filename. (The ^ means "start of line" in a regex.)
  • s/$/$/ means put a $ at the end of each filename. (The first $ means "end of line", the second is just a literal $ which is then interpreted by grep to mean "end of line".)

The combination of these two rules means that grep will only look for matches like .../<filename>, so that a.txt doesn't match ./a.txt.backup or ./abba.txt.

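A quick demonstration of the anchoring (with hypothetical file names):

$ printf '%s\n' ./a.txt ./a.txt.backup ./abba.txt | grep '/a\.txt$'
./a.txt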

s/\([\.[\*]\|\]\)/\\\1/g puts a \ before each occurrence of ., [, ] or *. Grep uses regexes, and those characters are considered special, but we want them to be plain, so we need to escape them (if we didn't escape them, then a file name like a.txt would match files like abtxt).


As an example:


$ cat filelist.txt
file1.txt
file2.txt
blah[2012].txt
blah[2011].txt
lastfile

$ sed 's@^@/@; s/$/$/; s/\([\.[\*]\|\]\)/\\\1/g' filelist.txt
/file1\.txt$
/file2\.txt$
/blah\[2012\]\.txt$
/blah\[2011\]\.txt$
/lastfile$

Grep then uses each line of that output as a pattern when it is searching the output of find.


Answer by majie

Using xargs(1) instead of the while loop can be a bit faster than doing it in bash.


Like this:


xargs -a filelist.txt -I filename find /dir -name filename

Be careful if the file names in filelist.txt contain whitespace; read the second paragraph in the DESCRIPTION section of the xargs(1) manpage about this problem.

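A minimal sketch that sidesteps the whitespace problem, assuming GNU xargs (-d is a GNU extension): with -d '\n', each input line is taken verbatim as a single argument, so names containing spaces survive intact.

xargs -a filelist.txt -d '\n' -I{} find /dir -name {}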

An improvement based on some assumptions. For example, if a.txt is in filelist.txt and you can be sure there is only one a.txt in /dir, then you can tell find(1) to exit early when it finds the instance:


xargs -a filelist.txt -I filename find /dir -name filename -print -quit

Another solution: you can pre-process filelist.txt into a single find(1) argument list like this. This reduces the find(1) invocations to one:


find /dir -name 'a.txt' -or -name 'b.txt' -or -name 'c.txt'
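For a 1000-entry list, building that argument list by hand is impractical. Here is a minimal bash sketch of the same idea (it assumes the names in filelist.txt contain no glob metacharacters, since -name does pattern matching):

args=()
while IFS= read -r f; do
    [ ${#args[@]} -gt 0 ] && args+=(-o)   # join the tests with -o (OR)
    args+=(-name "$f")
done < filelist.txt
find /dir '(' "${args[@]}" ')'            # one find run for the whole list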

Answer by James Morris

I'm not entirely sure of the question here, but I came to this page after trying to find a way to discover which 4 of 13000 files had failed to copy.


Neither of the answers did it for me, so I did this:


cp file-list file-list2            # start with the list of expected files
find dir/ >> file-list2            # append the files actually present
sort file-list2 | uniq -u          # entries appearing in only one of the lists

Which resulted in a list of the 4 files I needed.


The idea is to combine the two file lists to determine the unique entries. sort is used to make duplicate entries adjacent to each other, which is the only way uniq will filter them out.

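A toy run of the same idea (hypothetical names; b stands for a file that failed to copy):

$ printf 'a\nb\nc\n' >  file-list2    # expected files
$ printf 'a\nc\n'    >> file-list2    # files actually found
$ sort file-list2 | uniq -u
b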

Answer by jhoran

If filelist.txt is a plain list:


$ find /dir | grep -F -f filelist.txt

If filelist.txt is a pattern list:


$ find /dir | grep -f filelist.txt
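
The -F flag matters for a plain list because, without it, each line is treated as a regex and the dot becomes a wildcard. A quick illustration with hypothetical names:

$ printf '%s\n' ./a.txt ./abtxt | grep -F 'a.txt'
./a.txt
$ printf '%s\n' ./a.txt ./abtxt | grep 'a.txt'
./a.txt
./abtxt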