Original question: http://stackoverflow.com/questions/33977843/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
How to detect if a file has a UTF-8 BOM in Bash?
Asked by James Ko
I'm trying to write a script that will automatically remove UTF-8 BOMs from a file. I'm having trouble detecting whether the file has one in the first place or not. Here is my code:
function has-bom {
    # Test if the file starts with 0xEF, 0xBB, and 0xBF
    head -c 3 "$1" | grep -P '\xef\xbb\xbf'
    return $?
}
For some reason, head seems to be ignoring the BOM at the start of the file. As an example, running this
printf '\xef\xbb\xbf' > file
head -c 3 file
won't print anything.
I tried looking for an option in head --help that would let me work around this, but no luck. Is there anything I can do to make this work?
Answered by John1024
First, let's demonstrate that head is actually working correctly. The BOM is printed but invisible on a terminal, so pipe the output through hexdump to see the bytes:
$ printf '\xef\xbb\xbf' >file
$ head -c 3 file
$ head -c 3 file | hexdump -C
00000000 ef bb bf |...|
00000003
Now, let's create a working function, has_bom. If your grep supports -P, then one option is:
$ has_bom() { head -c3 "$1" | LC_ALL=C grep -qP '\xef\xbb\xbf'; }
$ has_bom file && echo yes
yes
Currently, only GNU grep supports -P.
Another option is to use bash's $'...' quoting:
$ has_bom() { head -c3 "$1" | grep -q $'\xef\xbb\xbf'; }
$ has_bom file && echo yes
yes
ksh and zsh also support $'...', but this construct is not POSIX and dash does not support it.
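Where neither -P nor $'...' is available, the same test can be approximated with tools that POSIX specifies. The following is a minimal sketch, not part of the original answer: the name has_bom_portable is hypothetical, and it assumes od prints lowercase hex digits for -tx1, as common implementations do.

```shell
# Hypothetical helper (not from the original answer): succeed if the file
# named by $1 starts with the bytes EF BB BF.
has_bom_portable() {
    # od -N3 reads only the first three bytes; -An drops the offset column
    # and -tx1 prints each byte as two hex digits. tr strips the whitespace
    # padding so the result can be compared as one string.
    [ "$(od -An -tx1 -N3 < "$1" | tr -d '[:space:]')" = "efbbbf" ]
}
```

It is called the same way as the versions above, for example: has_bom_portable file && echo yes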
Notes:
The explicit return $? is optional. By default, a function returns the exit code of the last command it ran.
I have used the POSIX form for defining functions. It is equivalent to the bash form but gives you one less problem to deal with if you ever have to run the function under another shell.
bash does accept the character - in a function name, but this is a controversial feature. I replaced it with _, which is more widely accepted. (For more on this issue, see this answer.)
The -q option to grep makes it quiet: it still sets a proper exit code but does not send any characters to stdout.
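Since the question's goal is to remove the BOM once it has been detected, the detection and removal steps could be combined roughly as follows. This is a sketch under stated assumptions, not code from the original answer: strip_bom is a hypothetical name, mktemp is assumed to be available (it is not specified by POSIX), and the BOM is written with octal printf escapes so the function does not depend on grep -P or $'...'.

```shell
# Succeed if the file named by $1 starts with a UTF-8 BOM (EF BB BF).
has_bom() {
    head -c3 "$1" | LC_ALL=C grep -q "$(printf '\357\273\277')"
}

# Hypothetical helper: rewrite the file named by $1 without its BOM.
strip_bom() {
    if has_bom "$1"; then
        tmp=$(mktemp) || return 1
        # tail -c +4 prints the file starting at byte 4, skipping the BOM.
        # Copying back with cat keeps the original file's permissions.
        tail -c +4 "$1" > "$tmp" && cat "$tmp" > "$1"
        status=$?
        rm -f "$tmp"
        return "$status"
    fi
}
```

Files without a BOM are left untouched, so strip_bom can be run over a mixed set of files.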
Answered by apexik
I applied the following to the first line read:
read c
if (( "$(printf "%d" "'${c:0:1}")" == 65279 )) ; then c="${c:1}" ; fi
This simply removes the BOM from the variable.
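The comparison with 65279 (U+FEFF) works only when the shell decodes the first character as UTF-8, so it appears to depend on the current locale. A byte-oriented variant is sketched below; it is not from the original answer, and strip_leading_bom is a hypothetical name.

```shell
# Hypothetical helper (not from the original answer): print $1 with a
# leading UTF-8 BOM (bytes EF BB BF) removed, if there is one.
strip_leading_bom() {
    # Build the BOM with octal escapes, which POSIX printf understands.
    bom=$(printf '\357\273\277')
    # ${1#pattern} drops a matching prefix; quoting "$bom" makes the
    # bytes match literally rather than as a glob pattern.
    printf '%s\n' "${1#"$bom"}"
}
```

After read c, it could be applied as c=$(strip_leading_bom "$c").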