Linux: file encoding in a shell script

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/1730878/

Date: 2020-08-03 17:53:19  Source: igfitidea

encoding of file shell script

Tags: linux, bash, shell, encoding

Asked by rizidoro

How can I check the file encoding in a shell script? I need to know if a file is encoded in utf-8 or iso-8859-1.

Thanks

Accepted answer by ChristopheD

I'd just use

file -bi myfile.txt

to determine the character encoding of a particular file.

This is a solution with an external dependency, but I suspect file is very common nowadays on all semi-modern distros.

EDIT:

In response to Laurence Gonsalves' comment: -b is the option to be 'brief' (omit the filename) and -i is the shorthand equivalent of --mime, so the most portable way (including Mac OS X) is probably:

file --mime myfile.txt 
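
For use inside a script, the charset can be pulled out of that output. A minimal sketch in Python, assuming the file command is installed and prints the usual "type; charset=name" form; the helper names parse_charset and file_charset are hypothetical and not part of the original answer:

```python
import subprocess


def parse_charset(mime_line):
    """Extract the charset from a line such as 'text/plain; charset=utf-8'."""
    for part in mime_line.split(";"):
        key, _, value = part.strip().partition("=")
        if key == "charset":
            return value.strip()
    return None


def file_charset(path):
    """Run `file -b --mime` on path and return the reported charset, or None."""
    result = subprocess.run(
        ["file", "-b", "--mime", path],
        capture_output=True, text=True, check=True,
    )
    return parse_charset(result.stdout)
```

The parsing is kept separate from the subprocess call so it also works on output captured without -b, where the filename precedes the MIME type.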

Answer by jochil

You can use the file command: file --mime myfile.text

Answer by Laurence Gonsalves

There's no way to be 100% certain (unless you're dealing with a file format that internally states its encoding).

Most tools that attempt to make this distinction will try to decode the file as UTF-8 (as that's the stricter encoding) and, if that fails, fall back to ISO-8859-1. You can do this with iconv "by hand", or you can use file:

$ file utf8.txt
utf8.txt: UTF-8 Unicode text
$ file latin1.txt
latin1.txt: ISO-8859 text

Note that ASCII files are both UTF-8 and ISO-8859-1 compatible.

$ file ascii.txt
ascii.txt: ASCII text

Finally: there's no real way to distinguish between ISO-8859-1 and ISO-8859-2, for example, unless you're going to assume it's natural language and use statistical methods. This is probably why file says "ISO-8859".
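
The "try UTF-8 first, then fall back" approach described in this answer can be sketched in a few lines of Python. This is only an illustration, under the assumption that the data is one of ASCII, UTF-8 or ISO-8859-1; the function name guess_encoding is hypothetical:

```python
def guess_encoding(data):
    """Guess between ASCII, UTF-8 and ISO-8859-1 for a bytes object."""
    try:
        # Pure ASCII is valid in both of the other encodings.
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        # UTF-8 is the stricter encoding, so try it first.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Python decodes any byte sequence as ISO-8859-1,
        # so this is a fallback guess, not a positive detection.
        return "iso-8859-1"
```

To check a file, read it in binary mode and pass the bytes, for example guess_encoding(open("latin1.txt", "rb").read()).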

Answer by broadband

The file command is not 100% reliable. A simple test:

#!/bin/bash

echo "a" > /tmp/foo

for i in {1..1000000}
do
  echo "asdas" >> /tmp/foo
done

echo "ü???ü?? " >> /tmp/foo

file -b --mime-encoding /tmp/foo

This outputs:

us-ascii

ASCII does not include German umlauts.

A file is just a sequence of bytes. Without trusting metadata (a BOM, which is only recommended for UTF-16 and UTF-32; MIME information; a data header) you can't really detect the encoding. A sequence of bytes can be interpreted as UTF-8, ISO-8859-1/2 or anything else you want; whether that works depends on whether ISO-8859-1 or UTF-8 has a mapping for that particular sequence. What you want is to decode the whole file content with the desired character encoding. If that fails, the desired encoding has no mapping for this sequence of bytes.
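
A short Python illustration of that point (the byte values are chosen here for the example and are not from the original answer): the same two bytes decode without error under both encodings but give different text, while a lone 0xFC byte is valid ISO-8859-1 and invalid UTF-8.

```python
utf8_bytes = "\u00fc".encode("utf-8")         # 'ü' as UTF-8: b'\xc3\xbc'
latin1_bytes = "\u00fc".encode("iso-8859-1")  # 'ü' as ISO-8859-1: b'\xfc'

# The same bytes, two interpretations: both decodes succeed.
as_utf8 = utf8_bytes.decode("utf-8")         # 'ü'
as_latin1 = utf8_bytes.decode("iso-8859-1")  # 'Ã¼'

# The ISO-8859-1 byte has no valid interpretation as UTF-8.
try:
    latin1_bytes.decode("utf-8")
    latin1_is_utf8 = True
except UnicodeDecodeError:
    latin1_is_utf8 = False
```

Only the failed decode carries real information: it rules UTF-8 out. A successful decode just means the bytes are consistent with that encoding.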

From the shell you could use Python, Perl or, as Laurence Gonsalves says, iconv. For text files I use this in Python:

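import codecs

path = 'myfile.txt'  # placeholder: the file to check

# With errors='strict', reading from f raises UnicodeDecodeError on invalid UTF-8.
f = codecs.open(path, encoding='utf-8', errors='strict')


def valid_string(data):
  """Return True if the bytes in data decode cleanly as UTF-8."""
  try:
    data.decode('utf-8')
    return True
  except UnicodeDecodeError:
    return False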

How do you know that a file is a text file? You don't. You decode it line by line with the desired character encoding. OK, you can add a little trust and check whether a BOM exists (in which case the file is UTF encoded).
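
Checking for a BOM could look like the following sketch in Python. It uses the BOM constants from the standard codecs module; the function name detect_bom is hypothetical. The UTF-32 marks are tested before the UTF-16 ones because the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian BOM.

```python
import codecs

# Longer BOMs first: BOM_UTF32_LE starts with the bytes of BOM_UTF16_LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data):
    """Return the encoding implied by a leading BOM in data, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

A result of None only means no BOM was found; it says nothing about the actual encoding, since most UTF-8 files carry no BOM.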
