php: Elegant way to search for UTF-8 files with BOM?

Question

Asked by vog

For debugging purposes, I need to recursively search a directory for all files which start with a UTF-8 byte order mark (BOM). My current solution is a simple shell script:

find . -type f |
while IFS= read -r file
do
    if [ "$(head -c 3 -- "$file")" = $'\xef\xbb\xbf' ]
    then
        echo "found BOM in: $file"
    fi
done

Or, if you prefer short, unreadable one-liners:

find -type f|while read file;do [ "`head -c3 -- "$file"`" == $'\xef\xbb\xbf' ] && echo "found BOM in: $file";done

It doesn't work with filenames that contain a line break, but such files are not to be expected anyway.

Is there any shorter or more elegant solution?

Are there any interesting text editors or macros for text editors?

Answer 1

Answered by Denis

What about this one simple command which not just finds but clears the nasty BOM? :)

find . -type f -exec sed '1s/^\xEF\xBB\xBF//' -i {} \;

I love "find" :)

Warning: The above will modify binary files which contain those three characters.
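
One way to reduce that risk is to check the first three bytes of a file before rewriting it, so that files which only contain the sequence somewhere else are left alone. The following is a minimal sketch and not part of the original answer: the function name strip_bom is hypothetical, and it assumes head -c, od, tail -c and mktemp are available.

```sh
# strip_bom FILE: remove a leading UTF-8 BOM (EF BB BF) from FILE.
# Files that do not start with the BOM are not rewritten at all.
# (strip_bom is a hypothetical helper name, not an existing tool.)
strip_bom() {
    # Compare a hex dump of the first three bytes instead of raw bytes.
    if [ "$(head -c 3 -- "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]; then
        _strip_bom_tmp=$(mktemp) || return 1
        # tail -c +4 copies everything from the fourth byte onward;
        # writing back with cat keeps the original file's permissions.
        tail -c +4 -- "$1" > "$_strip_bom_tmp" && cat "$_strip_bom_tmp" > "$1"
        rm -f "$_strip_bom_tmp"
    fi
}
```

It could be called on a single file, for example `strip_bom ./index.php`. A file that starts with the BOM is still rewritten whether it is text or binary, so restricting the set of files by name first remains a good idea.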

If you want just to show BOM files, use this one:

grep -rl $'\xEF\xBB\xBF' .

Answer 2

Answered by Jan Przybylo

The best and easiest way to do this on Windows:

Total Commander → go to project's root dir → find files (Alt+F7) → file types *.* → Find text "EF BB BF" → check 'Hex' checkbox → search

And you get the list :)

Answer 3

Answered by Aron Griffis

find . -type f -print0 | xargs -0r awk '
    /^\xEF\xBB\xBF/ {print FILENAME}
    {nextfile}'

Most of the solutions given above test more than the first line of the file, even if some (such as Marcus's solution) then filter the results. This solution only tests the first line of each file so it should be a bit quicker.

Answer 4

Answered by CesarB

If you accept some false positives (in case there are non-text files, or in the unlikely case there is a ZWNBSP in the middle of a file), you can use grep:

fgrep -rl `echo -ne '\xef\xbb\xbf'` .

Answer 5

Answered by theory

You can use grep to find them and Perl to strip them out like so:

grep -rl $'\xEF\xBB\xBF' . | xargs perl -i -pe 's{\xEF\xBB\xBF}{}'

Answer 6

Answered by Marcus Griep

I would use something like:

grep -orHbm1 "^`echo -ne '\xef\xbb\xbf'`" . | sed '/:0:/!d;s/:0:.*//'

This ensures that only files in which the BOM starts at the first byte are reported.

Answer 7

Answered by julien

For a Windows user, see this (good PHP script for finding the BOM in your project).

Answer 8

Answered by mario

An overkill solution to this is phptags (not the vi tool with the same name), which specifically looks for PHP scripts:

phptags --warn ./

Will output something like:

./invalid.php: TRAILING whitespace ("?>\n")
./invalid.php: UTF-8 BOM alone ("\xEF\xBB\xBF")

And the --whitespace mode will automatically fix such issues (recursively, but asserts that it only rewrites .php scripts).

Answer 9

Answered by Refineo

I used this to correct only JavaScript files:

find . -iname '*.js' -type f -exec sed 's/^\xEF\xBB\xBF//' -i.bak {} \; -exec rm {}.bak \;

Answer 10

Answered by Jonathan Wright

find -type f -print0 | xargs -0 grep -l `printf '^\xef\xbb\xbf'` | sed 's/^/found BOM in: /'

find -print0 puts a null \0 between each file name instead of using new lines
xargs -0 expects null separated arguments instead of line separated
grep -l lists the files which match the regex
The regex ^\xef\xbb\xbf isn't entirely correct, as it will match non-BOMed UTF-8 files if they have zero width spaces at the start of a line
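
Both caveats (file names containing line breaks, and the regex matching at the start of any line) can be sidestepped by letting find pass the names straight to a small shell loop that compares only the first three bytes of each file. This is a rough sketch rather than part of the original answer: the function name list_bom_files is hypothetical, and head -c and od are assumed to be available.

```sh
# list_bom_files [DIR]: print every regular file under DIR (default: .)
# whose first three bytes are the UTF-8 BOM (EF BB BF).
# (list_bom_files is a hypothetical helper name.)
list_bom_files() {
    # find passes the names as arguments via -exec ... {} +, so they are
    # never split on newlines or other whitespace.
    find "${1:-.}" -type f -exec sh -c '
        for file do
            # Hex-dump the first three bytes and compare them as text.
            if [ "$(head -c 3 -- "$file" | od -An -tx1 | tr -d " \n")" = "efbbbf" ]; then
                printf "found BOM in: %s\n" "$file"
            fi
        done' sh {} +
}
```

Running `list_bom_files .` prints one "found BOM in: ..." line per matching file, in the same format as the script in the question. Reading three bytes per file also avoids scanning whole files, at the cost of starting a few extra processes for each one.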
