Linux: encoding of file shell script
Notice: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/1730878/
Asked by rizidoro
How can I check the file encoding in a shell script? I need to know if a file is encoded in utf-8 or iso-8859-1.
Thanks
Accepted answer by ChristopheD
I'd just use
file -bi myfile.txt
to determine the character encoding of a particular file.
This solution has an external dependency, but I suspect file is very common nowadays on all semi-modern distros.
EDIT:
In response to Laurence Gonsalves' comment: -b is the option to be 'brief' (omit the filename from the output) and -i is the shorthand equivalent of --mime, so the most portable way (including Mac OS X) is probably:
file --mime myfile.txt
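In a script you usually want only the charset part of that output. The following is a minimal Python sketch, assuming the output has the usual "type/subtype; charset=name" shape; the helper name parse_charset is hypothetical and not part of file:

```python
def parse_charset(mime_line):
    """Extract the charset from a line such as 'text/plain; charset=utf-8'.

    Also accepts the non-brief form 'myfile.txt: text/plain; charset=utf-8'.
    Returns None when no charset parameter is present.
    """
    for part in mime_line.split(";"):
        key, sep, value = part.strip().partition("=")
        if sep and key.strip().lower() == "charset":
            return value.strip().lower()
    return None
```

One way to feed it, assuming file is installed, would be the stdout of subprocess.run(["file", "--mime", "myfile.txt"], capture_output=True, text=True).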
Answer by jochil
You can use the file command:
file --mime myfile.text
Answer by Laurence Gonsalves
There's no way to be 100% certain (unless you're dealing with a file format that internally states its encoding).
Most tools that attempt to make this distinction will try to decode the file as utf-8 (as that's the stricter encoding), and if that fails, fall back to iso-8859-1. You can do this with iconv "by hand", or you can use file:
$ file utf8.txt
utf8.txt: UTF-8 Unicode text
$ file latin1.txt
latin1.txt: ISO-8859 text
Note that ASCII files are both UTF-8 and ISO-8859-1 compatible.
$ file ascii.txt
ascii.txt: ASCII text
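The "try utf-8 first, then fall back" strategy described above can be sketched in Python roughly as follows. This is only an illustration of the idea, not what file does internally; the function name guess_encoding is an assumption made for this example:

```python
def guess_encoding(data):
    """Classify a bytes object as 'ascii', 'utf-8' or 'iso-8859-1'."""
    try:
        data.decode("ascii")
        return "ascii"  # valid in utf-8 and iso-8859-1 as well
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")  # the stricter of the two encodings
        return "utf-8"
    except UnicodeDecodeError:
        # Python's iso-8859-1 codec accepts every byte value,
        # so this branch is a fallback guess, not a proof.
        return "iso-8859-1"
```

To check a file, read it in binary mode, for example guess_encoding(open("latin1.txt", "rb").read()).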
Finally: there's no real way to distinguish between ISO-8859-1 and ISO-8859-2, for example, unless you're going to assume it's natural language and use statistical methods. This is probably why file says "ISO-8859".
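To see why the two cannot be told apart from the bytes alone, here is a small Python sketch (the function name decode_both is made up for this example). The same byte decodes without error under both encodings and simply yields a different character:

```python
def decode_both(data):
    """Decode the same bytes as ISO-8859-1 and as ISO-8859-2."""
    return data.decode("iso-8859-1"), data.decode("iso-8859-2")

# Byte 0xE8 is a valid letter in both character sets.
latin1_text, latin2_text = decode_both(b"\xe8")
print(latin1_text, latin2_text)
```

Neither decode raises an error, so only knowledge of the expected language can decide which reading is the intended one.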
Answer by broadband
The file command is not 100% reliable. A simple test:
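#!/bin/bash
# Start the file with several megabytes of pure ASCII lines ...
echo "a" > /tmp/foo
for i in {1..1000000}
do
    echo "asdas" >> /tmp/foo
done
# ... then append one line of non-ASCII characters at the very end.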
echo "ü???ü?? " >> /tmp/foo
file -b --mime-encoding /tmp/foo
This outputs:
us-ascii
ASCII does not contain German umlauts, so this result is wrong.
A file is just a sequence of bytes. Without trusting metadata (a BOM, which is only recommended for utf-16 and utf-32; MIME; a data header) you can't really detect the encoding. A sequence of bytes can be interpreted as utf-8, ISO-8859-1/2, or anything else you want; whether a mapping exists in iso-8859-1 or utf-8 depends on the particular sequence. What you want is to try to decode the whole file content with the desired character encoding. If that fails, the desired encoding has no mapping for this sequence of bytes.
From the shell you could use python, perl or, as Laurence Gonsalves says, iconv. For text files I use this in python:
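import codecs

# path is the file to check; reading from f raises UnicodeDecodeError on invalid bytes.
f = codecs.open(path, encoding='utf-8', errors='strict')

def valid_string(data):
    # data must be a bytes object (a str in Python 2).
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False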
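To apply the "decode the whole file" idea without loading everything into memory, one option is Python's incremental decoders. The sketch below is an assumption about how one might write it, not the answerer's code; file_is_valid and chunk_size are names invented here:

```python
import codecs

def file_is_valid(path, encoding="utf-8", chunk_size=65536):
    """Return True if the whole file at `path` decodes with `encoding`."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    try:
        with open(path, "rb") as handle:
            while True:
                chunk = handle.read(chunk_size)
                if not chunk:
                    break
                # Multi-byte characters split across chunks are buffered.
                decoder.decode(chunk)
        # final=True raises if the file ends in the middle of a character.
        decoder.decode(b"", final=True)
        return True
    except UnicodeDecodeError:
        return False
```

Because every byte is read, a non-ASCII line at the very end of a large file, as in the /tmp/foo test above, is still noticed: file_is_valid("/tmp/foo", "ascii") would be expected to return False.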
How do you know that a file is a text file? You don't. You decode it line by line with the desired character encoding. OK, you can add a little trust and check whether a BOM exists (in which case the file is utf encoded).
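The BOM check could look roughly like this in Python, using the BOM constants from the standard codecs module. The function name detect_bom and the returned labels are choices made for this sketch:

```python
import codecs

# Longer BOMs come first: the utf-32-le BOM begins with the utf-16-le BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def detect_bom(data):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

Passing the first four bytes of a file is enough, for example detect_bom(open(path, "rb").read(4)) for some path. A result of None says nothing about the encoding, since most utf-8 files carry no BOM.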