Elegant way to search for UTF-8 files with BOM?
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Original source: http://stackoverflow.com/questions/204765/
Asked by vog
For debugging purposes, I need to recursively search a directory for all files which start with a UTF-8 byte order mark (BOM). My current solution is a simple shell script:
find -type f |
while read file
do
    if [ "`head -c 3 -- "$file"`" == $'\xef\xbb\xbf' ]
    then
        echo "found BOM in: $file"
    fi
done
Or, if you prefer short, unreadable one-liners:
find -type f|while read file;do [ "`head -c3 -- "$file"`" == $'\xef\xbb\xbf' ] && echo "found BOM in: $file";done
It doesn't work with filenames that contain a line break, but such files are not to be expected anyway.
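One way to lift the line-break limitation, sketched under a few assumptions: let find hand the file names straight to a small inline sh script via -exec ... {} +, so the names never pass through a line-oriented pipe. The function name list_bom_files is made up for illustration, and head -c is assumed to be available (it is common, but not guaranteed everywhere). The first three bytes are compared as a hex dump, which also avoids storing raw NUL bytes from binary files in a shell variable.

```shell
# list_bom_files DIR (hypothetical helper): print every regular file under DIR
# whose first three bytes are EF BB BF. File names are passed as arguments by
# find itself, so names containing spaces or line breaks are handled.
list_bom_files() {
    find "${1:-.}" -type f -exec sh -c '
        for file do
            # Hex-dump the first three bytes and strip od padding.
            first3=$(head -c 3 -- "$file" | od -An -tx1 | tr -d " \n")
            if [ "$first3" = "efbbbf" ]; then
                printf "found BOM in: %s\n" "$file"
            fi
        done' sh {} +
}
```

Calling list_bom_files path/to/project should then print one "found BOM in: ..." entry per matching file.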
Is there any shorter or more elegant solution?
Are there any interesting text editors or macros for text editors?
Answered by Denis
What about this one simple command which not just finds but clears the nasty BOM? :)
find . -type f -exec sed '1s/^\xEF\xBB\xBF//' -i {} \;
I love "find" :)
Warning: The above will modify binary files which contain those three characters.
If you want just to show BOM files, use this one:
grep -rl $'\xEF\xBB\xBF' .
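As a more cautious alternative to the sed command above, the following sketch only rewrites a file when its first three bytes really are EF BB BF, and it leaves every other file untouched. strip_bom is a hypothetical helper name, not part of any tool mentioned here; the sketch assumes head -c and mktemp are available.

```shell
# strip_bom FILE... (hypothetical helper): remove a leading UTF-8 BOM in place.
# Files that do not start with EF BB BF are not opened for writing at all.
strip_bom() {
    for file do
        first3=$(head -c 3 -- "$file" | od -An -tx1 | tr -d ' \n')
        if [ "$first3" = "efbbbf" ]; then
            tmp=$(mktemp) || return 1
            # tail -c +4 copies everything from the fourth byte onwards;
            # writing back with cat keeps the original file's permissions.
            tail -c +4 -- "$file" > "$tmp" && cat -- "$tmp" > "$file"
            rm -f -- "$tmp"
        fi
    done
}
```

It could be combined with find, for example: find . -type f -exec sh -c '...' is not needed if the function is called directly on a known list of files, such as strip_bom src/*.php.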
Answered by Jan Przybylo
The best and easiest way to do this on Windows:
Total Commander → go to project's root dir → find files (Alt+F7) → file types *.* → Find text "EF BB BF" → check 'Hex' checkbox → search
And you get the list :)
Answered by Aron Griffis
find . -type f -print0 | xargs -0r awk '
/^\xEF\xBB\xBF/ {print FILENAME}
{nextfile}'
Most of the solutions given above test more than the first line of the file, even if some (such as Marcus's solution) then filter the results. This solution only tests the first line of each file so it should be a bit quicker.
Answered by CesarB
If you accept some false positives (in case there are non-text files, or in the unlikely case there is a ZWNBSP in the middle of a file), you can use grep:
fgrep -rl `echo -ne '\xef\xbb\xbf'` .
Answered by theory
You can use grep to find them and Perl to strip them out like so:
grep -rl $'\xEF\xBB\xBF' . | xargs perl -i -pe 's{\xEF\xBB\xBF}{}'
Answered by Marcus Griep
I would use something like:
grep -orHbm1 "^`echo -ne '\xef\xbb\xbf'`" . | sed '/:0:/!d;s/:0:.*//'
Which will ensure that the BOM occurs starting at the first byte of the file.
Answered by julien
Answered by mario
An overkill solution to this is phptags (not the vi tool with the same name), which specifically looks for PHP scripts:
phptags --warn ./
Will output something like:
./invalid.php: TRAILING whitespace ("?>\n")
./invalid.php: UTF-8 BOM alone ("\xEF\xBB\xBF")
And the --whitespace mode will automatically fix such issues (recursively, but asserts that it only rewrites .php scripts).
Answered by Refineo
I used this to correct only JavaScript files:
find . -iname '*.js' -type f -exec sed 's/^\xEF\xBB\xBF//' -i.bak {} \; -exec rm {}.bak \;
Answered by Jonathan Wright
find -type f -print0 | xargs -0 grep -l `printf '^\xef\xbb\xbf'` | sed 's/^/found BOM in: /'
- find -print0 puts a null \0 between each file name instead of using new lines
- xargs -0 expects null separated arguments instead of line separated
- grep -l lists the files which match the regex
- The regex ^\xef\xbb\xbf isn't entirely correct, as it will match non-BOMed UTF-8 files if they have zero width spaces at the start of a line
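To illustrate the last point: a pattern anchored with ^ can match at the start of any line, not only the first one. A minimal sketch that feeds just the first line of each file to grep avoids this. first_line_bom is a made-up name, and octal escapes are used for the BOM bytes because those are what POSIX printf specifies.

```shell
# first_line_bom FILE... (hypothetical helper): print the names of files whose
# first line starts with the bytes EF BB BF. Later lines are never inspected.
first_line_bom() {
    bom=$(printf '\357\273\277')
    for f do
        # LC_ALL=C makes grep compare raw bytes regardless of the locale.
        if head -n 1 -- "$f" | LC_ALL=C grep -q "^$bom"; then
            printf '%s\n' "$f"
        fi
    done
}
```

A file that only has a zero width space at the start of its second line would be listed by the plain grep -l pipeline, but should not be listed by this function.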

