bash: find and delete files with non-ASCII names

Question

Asked by Rohit Chopra

I have some old migrated files that contain non-printable characters. I would like to find all files with such names and delete them completely from the system.

Example:

ls -l
-rwxrwxr-x 1 cws cws      0 Dec 28  2011 ??"??

ls -lb
-rwxrwxr-x 1 cws cws      0 Dec 28  2011 \a1"61

I would like to find all such files.

Here is an example screenshot of what I see when I run ls in such folders:

[screenshot not included]

I want to find these files with the non-printable characters and just delete them.

Answer 1

Answered by ThisSuitIsBlackNot

Non-ASCII characters

ASCII character codes range from 0x00 to 0x7F in hex. Therefore, any character with a code greater than 0x7F is a non-ASCII character. This includes the bulk of the characters in UTF-8 (ASCII codes are essentially a subset of UTF-8). For example, the Japanese character

あ

is encoded in hex in UTF-8 as

E3 81 82
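
One way to check this byte sequence from a shell is to dump the bytes with od. This is only a small sketch: the octal escapes \343\201\202 are E3 81 82 written in a form that keeps the snippet pure ASCII, and the variable name bytes is an illustrative choice.

```shell
# Emit the three UTF-8 bytes of あ and print them as lowercase hex.
# od -An suppresses the offset column, -tx1 prints one byte per hex pair.
bytes=$(printf '\343\201\202' | od -An -tx1 | tr -d ' \n')
echo "$bytes"
```

Every byte in the sequence is above 0x7F, which is why a byte-wise "non-ASCII" test flags the name.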

UTF-8 has been the default character encoding on, among others, Red Hat Linux since version 8.0 (2002), SuSE Linux since version 9.1 (2004), and Ubuntu Linux since version 5.04 (2005).

ASCII control characters

Out of the ASCII codes, 0x00 through 0x1F and 0x7F represent control characters such as ESC (0x1B). These control characters were not originally intended to be printable, even though some of them, like the line feed character 0x0A, can be interpreted and displayed.

On my system, ls displays all control characters as ? by default, unless I pass the --show-control-chars option. I'm guessing that the files you want to delete contain ASCII control characters, as opposed to non-ASCII characters. This is an important distinction: if you delete filenames containing non-ASCII characters, you may blow away legitimate files that just happen to be named in another language.
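
To see this behaviour without touching real data, the sketch below creates a throwaway file whose name contains the control character 0x01 and lists it. The directory and file name are made up for illustration; -q is the POSIX ls flag that forces non-printable characters to be shown as ?, and LC_ALL=C is set only to make the output predictable.

```shell
# Work in a scratch directory so nothing real is affected.
tmp=$(mktemp -d)
touch "$tmp/$(printf 'bad\001name')"

# -q replaces each non-printable character in the name with "?".
listing=$(LC_ALL=C ls -q "$tmp")
echo "$listing"

rm -rf "$tmp"
```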

Regular expressions for character codes

POSIX

POSIX provides a very handy collection of character classes for dealing with these types of characters (thanks to bashophil for pointing this out):

[:cntrl:] Control characters
[:graph:] Graphic printable characters (same as [:print:] minus the space character)
[:print:] Printable characters (same as [:graph:] plus the space character)
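
As a quick illustration of these classes, the following sketch counts the input lines that contain at least one control character. The sample text is invented; inside a bracket expression the class is written [[:cntrl:]], and LC_ALL=C keeps the classification byte-based.

```shell
# Two lines of input: only the second contains a control character (a tab).
# The newline itself is the line separator, so grep never sees it as content.
count=$(printf 'plain\nwith\ttab\n' | LC_ALL=C grep -c '[[:cntrl:]]')
echo "$count"
```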

PCRE

Perl Compatible Regular Expressions allow hexadecimal character codes using the syntax

\x00

For example, a PCRE regex for the Japanese character あ would be

\xE3\x81\x82

In addition to the POSIX character classes listed above, PCRE also provides the [:ascii:] character class, which is a convenient shorthand for [\x00-\x7F].

GNU's version of grep supports PCRE using the -P flag, but BSD grep (on Mac OS X, for example) does not. Neither GNU nor BSD find supports PCRE regexes.

Finding the files

GNU find supports POSIX regexes (thanks to iscfrc for pointing out the pure find solution, which avoids spawning additional processes). The following command will list all filenames (but not directory names) below the current directory that contain non-printable control characters:

find -type f -regextype posix-basic -regex '^.*/[^/]*[[:cntrl:]][^/]*$'

The regex is a little complicated because the -regex option has to match the entire file path, not just the filename, and because I'm assuming that we don't want to blow away files with normal names simply because they are inside directories with names containing control characters.
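
The effect of anchoring the control character to the last path component can be checked in a scratch directory. This is a sketch assuming GNU find (for -regextype); the directory and file names are hypothetical.

```shell
tmp=$(mktemp -d)

# A normally named file inside a directory whose name has a control character...
mkdir "$tmp/$(printf 'dir\001ctl')" "$tmp/clean"
touch "$tmp/$(printf 'dir\001ctl')/normal.txt"
# ...and a file whose own name has a control character inside a clean directory.
touch "$tmp/clean/$(printf 'bad\002file')"

# Only the second file should match: [^/]* on both sides keeps the
# control character within the final path component.
found=$(find "$tmp" -type f -regextype posix-basic \
  -regex '^.*/[^/]*[[:cntrl:]][^/]*$' | wc -l | tr -d ' ')
echo "$found"

rm -rf "$tmp"
```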

To delete the matching files, simply pass the -delete option to find, after all other options (this is critical; passing -delete as the first option will blow away everything in your current directory):

find -type f -regextype posix-basic -regex '^.*/[^/]*[[:cntrl:]][^/]*$' -delete

I highly recommend running the command without -delete first, so you can see what will be deleted before it's too late.

If you also pass the -print option, you can see what is being deleted as the command runs:

find -type f -regextype posix-basic -regex '^.*/[^/]*[[:cntrl:]][^/]*$' -print -delete
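
The list-first, delete-second workflow can be rehearsed safely in a scratch directory. This is a sketch assuming GNU find; the file names and the pattern variable are illustrative only.

```shell
tmp=$(mktemp -d)
touch "$tmp/keep.txt" "$tmp/$(printf 'bad\002file')"
pattern='^.*/[^/]*[[:cntrl:]][^/]*$'

# Step 1: dry run. Nothing is removed; review this list first.
find "$tmp" -type f -regextype posix-basic -regex "$pattern" -print

# Step 2: same expression with -delete appended as the LAST action.
find "$tmp" -type f -regextype posix-basic -regex "$pattern" -print -delete

# Only the normally named file should be left.
left=$(ls -A "$tmp")
echo "$left"

rm -rf "$tmp"
```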

To blow away any paths (files or directories) that contain control characters, the regex can be simplified and you can drop the -type option:

find -regextype posix-basic -regex '.*[[:cntrl:]].*' -print -delete

With this command, if a directory name contains control characters, everything inside that directory will be deleted, even if none of the filenames inside it contain control characters.
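
The difference from the file-only version is easy to observe in a scratch directory. This sketch assumes GNU find; the names are invented. Because -delete implies depth-first traversal, the contents of a matching directory are removed before the directory itself.

```shell
tmp=$(mktemp -d)
mkdir "$tmp/$(printf 'dir\001ctl')" "$tmp/clean"
touch "$tmp/$(printf 'dir\001ctl')/normal.txt" "$tmp/clean/normal.txt"

# Any path with a control character ANYWHERE in it matches, so the
# normally named file under the bad directory is deleted too.
find "$tmp" -regextype posix-basic -regex '.*[[:cntrl:]].*' -print -delete

survivors=$(cd "$tmp" && find . -type f)
echo "$survivors"

rm -rf "$tmp"
```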

Update: Finding both non-ASCII and control characters

It looks like your files contain both non-ASCII characters and ASCII control characters. As it turns out, [:ascii:] is not a POSIX character class, but it is provided by PCRE. I couldn't find a POSIX regex to do this, so it's Perl to the rescue. We'll still use find to traverse our directory tree, but we'll pass the results to Perl for processing.

To make sure we can handle filenames containing newlines (which seems likely in this case), we need to use the -print0 argument to find (supported on both GNU and BSD versions). This separates records with a null character (0x00) instead of a newline, since the null character is the only character that can't appear in a path on Linux. We need to pass the corresponding flag -0 to our Perl code so it knows how records are separated. The following command will print every path inside the current directory, recursively:

find . -print0 | perl -n0e 'print $_, "\n"'

Note that this command only spawns a single instance of the Perl interpreter, which is good for performance. The starting path argument (in this case, . for the current working directory) is optional in GNU find but is required in BSD find on Mac OS X, so I've included it for the sake of portability.
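
To see why null-separated records matter, the sketch below creates one file whose name contains a newline and counts it both ways. The file name and variable names are made up for illustration.

```shell
tmp=$(mktemp -d)
touch "$tmp/$(printf 'two\nlines')"

# Null-separated: Perl reads one record per path, so $. ends at 1.
records=$(find "$tmp" -type f -print0 | perl -n0e 'END { print $. }')

# Newline-separated: the single path is split across two lines.
lines=$(find "$tmp" -type f -print | wc -l | tr -d ' ')

echo "records=$records lines=$lines"
rm -rf "$tmp"
```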

Now for our regex. Here is a PCRE regex matching names that contain either non-ASCII or non-printable (i.e. control) characters (or both):

[[:^ascii:][:cntrl:]]

The following command will print all paths (directories or files) in the current directory that match this regex:

find . -print0 | perl -n0e 'chomp; print $_, "\n" if /[[:^ascii:][:cntrl:]]/'

The chomp is necessary because it strips off the trailing null character from each path, which would otherwise match our regex. To delete the matching files and directories, we can use the following:

find . -print0 | perl -MFile::Path=remove_tree -n0e 'chomp; remove_tree($_, {verbose=>1}) if /[[:^ascii:][:cntrl:]]/'

This will also print out what is being deleted as the command runs (although control characters are interpreted, so the output will not quite match the output of ls).

Answer 2

Answered by Alexandre Schmidt

By now you have probably solved your problem, but the find -regex approach didn't work well in my case: some of my files were not shown by find when I used the -regex switch. So I developed this workaround using ls. I hope it is useful to someone.

Basically, what worked for me was this:

ls -1 -R -i | grep -a "[^A-Za-z0-9_.':@ /-]" | while read f; do inode=$(echo "$f" | cut -d ' ' -f 1); find -inum "$inode" -delete; done

Breaking it into parts:

ls -1 -R -i

This recursively (-R) lists (ls) the files under the current directory, one file per line (-1), prefixing each file with its inode number (-i). The results are piped to grep.

grep -a "[^A-Za-z0-9_.':@ /-]"

This filters the entries, treating the input as text (-a) even if it looks binary. grep lets a line through if it contains any character that is not in the bracketed list. The results are piped to while.

while read f
do
    inode=$(echo "$f" | cut -d ' ' -f 1)
    find -inum "$inode" -delete
done

This while loop iterates through all entries, extracting the inode number and passing it to find, which then deletes the file.
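
The inode lookup at the heart of this approach can be tried on a single throwaway file. This is a sketch assuming GNU find for -delete; the file name is hypothetical, and awk is used here in place of cut only because it tolerates leading padding in the ls output. Inode numbers are unique only within one filesystem, so the sketch adds -xdev to keep find from crossing into another one.

```shell
tmp=$(mktemp -d)
touch "$tmp/$(printf 'bad\001name')"

# ls -i prints "<inode> <name>"; keep only the first field.
inode=$(ls -i "$tmp" | awk '{print $1}')

# Address the file by inode instead of by its awkward name.
find "$tmp" -xdev -inum "$inode" -print -delete

remaining=$(ls -A "$tmp")
rm -rf "$tmp"
```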

Answer 3

Answered by Dan

It is possible to use PCRE with grep -P, just not with find (unfortunately). You can chain find with grep using -exec. With PCRE (Perl regex), we can use the ascii class and find any character that is non-ASCII.

find . -type f -exec sh -c "echo \"{}\" | grep -qP '[^[:ascii:]]'" \; -exec rm {} \;

The second -exec runs only if the first one returns a zero (success) exit status, which in this case means the expression matched the filename. I used sh -c because -exec cannot run a pipeline directly.
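
Embedding {} inside the sh -c string means a name containing a double quote or $ is re-parsed by the shell. A hedged variant of the same two-step chain passes the path as a positional parameter instead. This is a sketch, not the answer's original command: it assumes GNU grep built with PCRE support, sets LC_ALL=C so bytes are matched individually, and uses invented file names.

```shell
tmp=$(mktemp -d)
touch "$tmp/plain.txt" "$tmp/$(printf 'caf\303\251.txt')" "$tmp/quote\"d.txt"

# "$1" receives the path from find, so quotes in the name are harmless.
# The second -exec runs only when grep exits 0 (a non-ASCII byte was found).
find "$tmp" -type f \
  -exec sh -c 'printf "%s" "$1" | LC_ALL=C grep -qP "[^[:ascii:]]"' sh {} \; \
  -exec rm -- {} \;

# plain.txt and the quoted name are pure ASCII and should survive.
remaining=$(ls -A "$tmp" | wc -l | tr -d ' ')
echo "$remaining"
rm -rf "$tmp"
```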

Answer 4

Answered by kenorb

Based on this answer, try:

LC_ALL=C find . -regex '.*[^ -~].*' -print # -delete

or:

LC_ALL=C find . -type f -regex '.*[^[:alnum:][:punct:]].*' -print # -delete

Note: once the right files are printed, remove the # character to enable -delete.
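
In the first command above, the bracket expression [^ -~] means "any byte outside the range from space (0x20) to tilde (0x7E)", that is, anything other than printable ASCII. LC_ALL=C makes find compare raw bytes. The sketch below checks this in a scratch directory, assuming GNU find; the file names are invented.

```shell
tmp=$(mktemp -d)
touch "$tmp/plain.txt" "$tmp/$(printf 'caf\303\251')" "$tmp/$(printf 'bell\007')"

# Both the non-ASCII name and the control-character name should be listed;
# plain.txt consists only of bytes in the space-to-tilde range.
count=$(LC_ALL=C find "$tmp" -type f -regex '.*[^ -~].*' -print | wc -l | tr -d ' ')
echo "$count"

rm -rf "$tmp"
```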

See also: How do I grep for all non-ASCII characters.

Answer 5

Answered by dave12345678

You could print only the lines containing a backslash with grep:

ls -lb | grep \\
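
A small sketch of what this filter catches, assuming GNU ls (whose -b prints C-style escapes such as \a for BEL) and invented file names. Bear in mind that a name containing a literal backslash or a space is also escaped by -b, so this is a way to spot candidates rather than a precise test.

```shell
tmp=$(mktemp -d)
touch "$tmp/plain.txt" "$tmp/$(printf 'bell\007')"

# Only the name with the control character is printed with a backslash escape.
flagged=$(LC_ALL=C ls -b "$tmp" | grep -c '\\')
echo "$flagged"

rm -rf "$tmp"
```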
