bash: find and delete files with non-ASCII names
Original question: http://stackoverflow.com/questions/19146240/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
find and delete files with non-ascii names
Asked by Rohit Chopra
I have some old migrated files that contain non-printable characters. I would like to find all files with such names and delete them completely from the system.
Example:
ls -l
-rwxrwxr-x 1 cws cws 0 Dec 28 2011 ??"??
ls -lb
-rwxrwxr-x 1 cws cws 0 Dec 28 2011 \a1"61
I would like to find all such files.
Here is an example screenshot of what I see when I run ls in such folders: [screenshot]
I want to find these files with the non-printable characters and just delete them.
Answered by ThisSuitIsBlackNot
Non-ASCII characters
ASCII character codes range from 0x00 to 0x7F in hex. Therefore, any character with a code greater than 0x7F is a non-ASCII character. This includes the bulk of the characters in UTF-8 (ASCII is essentially a subset of UTF-8). For example, the Japanese character あ is encoded in UTF-8 as the hex bytes E3 81 82.
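To check a byte sequence like this yourself, here is a minimal sketch. The helper name utf8_hex is made up for illustration, and it assumes the standard od and tr utilities are available.

```shell
# utf8_hex is a hypothetical helper: print the bytes of its argument as hex.
utf8_hex() {
    # od -An -tx1 dumps each byte as two hex digits; tr strips padding and newlines.
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
    echo
}

# The octal escapes below are the UTF-8 bytes of the character あ.
utf8_hex "$(printf '\343\201\202')"   # prints e38182
```

Any byte pair above 7f in the output means the name contains non-ASCII data.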
UTF-8 has been the default character encoding on, among others, Red Hat Linux since version 8.0 (2002), SuSE Linux since version 9.1 (2004), and Ubuntu Linux since version 5.04 (2005).
ASCII control characters
Out of the ASCII codes, 0x00 through 0x1F and 0x7F represent control characters such as ESC (0x1B). These control characters were not originally intended to be printable even though some of them, like the line feed character 0x0A, can be interpreted and displayed.
On my system, ls displays all control characters as ? by default, unless I pass the --show-control-chars option. I'm guessing that the files you want to delete contain ASCII control characters, as opposed to non-ASCII characters. This is an important distinction: if you delete filenames containing non-ASCII characters, you may blow away legitimate files that just happen to be named in another language.
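The distinction can be made concrete with a small sketch. The helper classify_name below is hypothetical (not part of any tool); it counts control bytes and high bytes separately, assuming a POSIX tr and wc.

```shell
# classify_name is a hypothetical helper for illustration only.
# It reports whether a name holds ASCII control bytes, non-ASCII bytes, both, or neither.
classify_name() {
    # Count bytes in 0x01-0x1F or 0x7F (ASCII control characters).
    ctrl=$(printf '%s' "$1" | LC_ALL=C tr -dc '\001-\037\177' | wc -c | tr -d ' ')
    # Count bytes above 0x7F (non-ASCII) by deleting every ASCII byte.
    high=$(printf '%s' "$1" | LC_ALL=C tr -d '\000-\177' | wc -c | tr -d ' ')
    if [ "$ctrl" -gt 0 ] && [ "$high" -gt 0 ]; then echo both
    elif [ "$ctrl" -gt 0 ]; then echo control
    elif [ "$high" -gt 0 ]; then echo non-ascii
    else echo plain
    fi
}

classify_name "$(printf 'a\007b')"         # BEL inside the name
classify_name "$(printf '\343\201\202')"   # UTF-8 bytes of あ
```

A name reported as non-ascii may well be a legitimate file named in another language, which is why the two cases are worth telling apart before deleting anything.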
Regular expressions for character codes
POSIX
POSIX provides a very handy collection of character classes for dealing with these types of characters (thanks to bashophil for pointing this out):
[:cntrl:] Control characters
[:graph:] Graphic printable characters (same as [:print:] minus the space character)
[:print:] Printable characters (same as [:graph:] plus the space character)
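As a rough illustration of how the three classes differ, the sketch below runs grep over three made-up sample names. The samples function is hypothetical, and LC_ALL=C is set so that the classes are interpreted byte by byte.

```shell
# Three sample names: plain, one with a BEL (0x07), one with a trailing space.
samples() { printf 'plain\nbell\007here\ntrailing \n'; }

# Lines containing at least one control character.
samples | LC_ALL=C grep -c '[[:cntrl:]]'     # prints 1
# Lines containing a character that is not graphic (the BEL or the space).
samples | LC_ALL=C grep -c '[^[:graph:]]'    # prints 2
# Lines containing a non-printable character (the space counts as printable).
samples | LC_ALL=C grep -c '[^[:print:]]'    # prints 1
```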
PCRE
Perl Compatible Regular Expressions allow hexadecimal character codes using the syntax
\x00
For example, a PCRE regex for the Japanese character あ would be
\xE3\x81\x82
In addition to the POSIX character classes listed above, PCRE also provides the [:ascii:] character class, which is a convenient shorthand for [\x00-\x7F].
GNU's version of grep supports PCRE using the -P flag, but BSD grep (on Mac OS X, for example) does not. Neither GNU nor BSD find supports PCRE regexes.
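If you are unsure which grep you have, one way to probe it is simply to try the flag. This is only a sketch; the helper name grep_flavor is made up, and it tests whichever grep is first on your PATH.

```shell
# grep_flavor is a hypothetical helper: report whether this grep accepts -P.
grep_flavor() {
    # A grep without PCRE support exits with an error, which is silenced here.
    if printf 'a\n' | grep -qP 'a' 2>/dev/null; then
        echo "pcre"
    else
        echo "no-pcre"
    fi
}

grep_flavor
```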
Finding the files
GNU find supports POSIX regexes (thanks to iscfrc for pointing out the pure find solution, which avoids spawning additional processes). The following command will list all filenames (but not directory names) below the current directory that contain non-printable control characters:
find -type f -regextype posix-basic -regex '^.*/[^/]*[[:cntrl:]][^/]*$'
The regex is a little complicated because the -regex option has to match the entire file path, not just the filename, and because I'm assuming that we don't want to blow away files with normal names simply because they are inside directories with names containing control characters.
To delete the matching files, simply pass the -delete option to find, after all other options (this is critical; passing -delete as the first option will blow away everything in your current directory):
find -type f -regextype posix-basic -regex '^.*/[^/]*[[:cntrl:]][^/]*$' -delete
I highly recommend running the command without the -delete option first, so you can see what will be deleted before it's too late.
If you also pass the -print option, you can see what is being deleted as the command runs:
find -type f -regextype posix-basic -regex '^.*/[^/]*[[:cntrl:]][^/]*$' -print -delete
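One cautious way to try this out is to rehearse it in a throwaway directory first. The sketch below assumes GNU find (for -regextype); the sandbox directory and the file names in it are made up for the demonstration.

```shell
# Rehearse the control-character search in a temporary directory.
# Assumes GNU find; the sandbox and its file names are hypothetical.
sandbox=$(mktemp -d)
touch "$sandbox/normal.txt"
touch "$sandbox/$(printf 'bad\001name')"

# Dry run: list matches only, no -delete.
matches=$(find "$sandbox" -type f -regextype posix-basic \
    -regex '^.*/[^/]*[[:cntrl:]][^/]*$' | wc -l | tr -d ' ')
echo "would delete: $matches file(s)"

# Real run, with -delete as the last option, then count what is left.
find "$sandbox" -type f -regextype posix-basic \
    -regex '^.*/[^/]*[[:cntrl:]][^/]*$' -delete
remaining=$(find "$sandbox" -type f | wc -l | tr -d ' ')
echo "remaining: $remaining file(s)"

rm -rf "$sandbox"
```

Only the file with the 0x01 byte in its name should be removed; normal.txt should survive.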
To blow away any paths (files or directories) that contain control characters, the regex can be simplified and you can drop the -type option:
find -regextype posix-basic -regex '.*[[:cntrl:]].*' -print -delete
With this command, if a directory name contains control characters, everything inside that directory will be deleted too, even if none of the filenames inside it contain control characters.
Update: Finding both non-ASCII and control characters
It looks like your files contain both non-ASCII characters and ASCII control characters. As it turns out, [:ascii:] is not a POSIX character class, but it is provided by PCRE. I couldn't find a POSIX regex to do this, so it's Perl to the rescue. We'll still use find to traverse our directory tree, but we'll pass the results to Perl for processing.
To make sure we can handle filenames containing newlines (which seems likely in this case), we need to use the -print0 argument to find (supported on both GNU and BSD versions); this separates records with a null character (0x00) instead of a newline, since the null character is the only character that can't appear in a valid path on Linux. We need to pass the corresponding flag -0 to our Perl code so it knows how records are separated. The following command will print every path inside the current directory, recursively:
find . -print0 | perl -n0e 'print $_, "\n"'
Note that this command only spawns a single instance of the Perl interpreter, which is good for performance. The starting path argument (in this case, . for the current working directory) is optional in GNU find but is required in BSD find on Mac OS X, so I've included it for the sake of portability.
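To see why null-separated records matter, here is a small sketch that counts the same single file two ways. The sandbox directory and the file name are made up; the file has a newline embedded in its name.

```shell
# One file whose name contains a newline, in a temporary directory.
sandbox=$(mktemp -d)
touch "$sandbox/$(printf 'two\nlines')"

# Counting newline-terminated records sees two "files".
by_newline=$(find "$sandbox" -type f | wc -l | tr -d ' ')
# Counting null terminators from -print0 sees the one real file.
by_nul=$(find "$sandbox" -type f -print0 | tr -cd '\000' | wc -c | tr -d ' ')
echo "newline records: $by_newline, null records: $by_nul"

rm -rf "$sandbox"
```

Any tool downstream of find that splits on newlines would treat the two halves of the name as separate paths.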
Now for our regex. Here is a PCRE regex matching names that contain either non-ASCII or non-printable (i.e. control) characters (or both):
[[:^ascii:][:cntrl:]]
The following command will print all paths (directories or files) in the current directory that match this regex:
find . -print0 | perl -n0e 'chomp; print $_, "\n" if /[[:^ascii:][:cntrl:]]/'
The chomp is necessary because it strips off the trailing null character from each path, which would otherwise match our regex. To delete the matching files and directories, we can use the following:
find . -print0 | perl -MFile::Path=remove_tree -n0e 'chomp; remove_tree($_, {verbose=>1}) if /[[:^ascii:][:cntrl:]]/'
This will also print out what is being deleted as the command runs (although control characters are interpreted, so the output will not quite match the output of ls).
Answered by Alexandre Schmidt
By now, you have probably solved your problem, but the accepted approach didn't work well in my case, as I had files that were not being shown by find when I used the -regex switch. So I developed this workaround using ls. Hope it can be useful to someone.
Basically, what worked for me was this:
ls -1 -R -i | grep -a "[^A-Za-z0-9_.':@ /-]" | while read f; do inode=$(echo "$f" | cut -d ' ' -f 1); find -inum "$inode" -delete; done
Breaking it into parts:
ls -1 -R -i
This will recursively (-R) list (ls) files under the current directory, one file per line (-1), prefixing each file with its inode number (-i). The results are piped to grep.
grep -a "[^A-Za-z0-9_.':@ /-]"
Filter the entries, treating the input as text (-a) even if it looks binary. grep lets a line through if it contains any character outside the listed set. The results are piped to while.
while read f
do
inode=$(echo "$f" | cut -d ' ' -f 1)
find -inum "$inode" -delete
done
This while loop will iterate through all entries, extracting the inode number and passing it to find, which will then delete the file.
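The loop above works with cut because read, when IFS is left at its default, trims leading whitespace from each line before cut sees it. As a sketch of a variant that does not depend on that trimming, the hypothetical helper extract_inode below takes the first whitespace-separated field with awk instead.

```shell
# extract_inode is a hypothetical helper: first whitespace-separated field of a line.
extract_inode() {
    printf '%s\n' "$1" | awk '{ print $1; exit }'
}

# A made-up ls -i style line with leading padding before the inode number.
line='  131077 odd name'
extract_inode "$line"                                  # prints 131077
printf '%s\n' "$line" | cut -d ' ' -f 1 | wc -c        # cut alone yields an empty field
```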
Answered by Dan
It is possible to use PCRE with grep -P, just not with find (unfortunately). You can chain find with grep using exec. With PCRE (perl regex), we can use the ascii class and find any char that is non-ascii.
find . -type f -exec sh -c "echo \"{}\" | grep -qP '[^[:ascii:]]'" \; -exec rm {} \;
The second -exec will not run unless the first one exits successfully (exit status 0). In this case, that means the expression matched the filename. I used sh -c because -exec doesn't like pipes.
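The short-circuit behaviour of chained -exec actions can be checked in isolation. This sketch uses false and true as stand-ins for the grep test, in a made-up temporary directory with two empty files.

```shell
sandbox=$(mktemp -d)
touch "$sandbox/a" "$sandbox/b"

# The first -exec always fails, so the second is never reached.
after_false=$(find "$sandbox" -type f -exec false \; -exec echo reached {} \; | wc -l | tr -d ' ')
# The first -exec always succeeds, so the second runs once per file.
after_true=$(find "$sandbox" -type f -exec true \; -exec echo reached {} \; | wc -l | tr -d ' ')
echo "after false: $after_false, after true: $after_true"

rm -rf "$sandbox"
```

Swapping echo for rm is what turns the test into a deletion, so it is worth running with echo first.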
Answered by kenorb
Based on this answer, try:
LC_ALL=C find . -regex '.*[^ -~].*' -print # -delete
or:
LC_ALL=C find . -type f -regex '.*[^[:alnum:][:punct:]].*' -print # -delete
Note: Once the printed list looks right, remove the # character to enable -delete.
See also: How do I grep for all non-ASCII characters.
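The bracket expression [^ -~] relies on the fact that, in the C locale, the range from space (0x20) to tilde (0x7E) covers exactly the printable ASCII characters. A small sketch with made-up sample names shows what it flags; the samples function is hypothetical.

```shell
# Three sample names: plain ASCII, one with the UTF-8 bytes of é, one with a tab.
samples() { printf 'plain name\ncaf\303\251\ntab\there\n'; }

# Count lines holding any byte outside 0x20-0x7E.
samples | LC_ALL=C grep -c '[^ -~]'    # prints 2
```

Both the non-ASCII name and the name with a tab are flagged, so this pattern catches control characters and non-ASCII bytes alike.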
Answered by dave12345678
You could print only lines containing a backslash with grep:
ls -lb | grep '\\'
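A quick way to see what this catches is to try it in a throwaway directory. The sandbox and file names below are made up; the sketch assumes an ls that supports -b.

```shell
sandbox=$(mktemp -d)
touch "$sandbox/normal.txt" "$sandbox/$(printf 'bad\001name')"

# ls -b prints the 0x01 byte as an escape sequence, which contains a backslash.
flagged=$(ls -b "$sandbox" | grep -c '\\')
echo "names with escapes: $flagged"

rm -rf "$sandbox"
```

On GNU ls, -b also escapes spaces, so names containing spaces may be flagged as well.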