Linux: checking file encoding in a shell script

Question

Asked by rizidoro

How can I check a file's encoding in a shell script? I need to know whether a file is encoded in UTF-8 or ISO-8859-1.

Thanks

Answer 1

Accepted answer by ChristopheD

I'd just use

file -bi myfile.txt

to determine the character encoding of a particular file.

This solution has an external dependency, but I suspect file is very common nowadays on all semi-modern distros.

EDIT:

In response to Laurence Gonsalves' comment: -b is the option to be "brief" (do not include the filename) and -i is the shorthand equivalent of --mime, so the most portable way (including Mac OS X) is probably:

file --mime myfile.txt

Answer 2

Answered by jochil

You can use the file command: file --mime myfile.text

Answer 3

Answered by Laurence Gonsalves

There's no way to be 100% certain (unless you're dealing with a file format that internally states its encoding).

Most tools that attempt to make this distinction will try to decode the file as UTF-8 (as that's the stricter encoding) and, if that fails, fall back to ISO-8859-1. You can do this with iconv "by hand", or you can use file:

$ file utf8.txt
utf8.txt: UTF-8 Unicode text
$ file latin1.txt
latin1.txt: ISO-8859 text

Note that ASCII files are both UTF-8 and ISO-8859-1 compatible.

$ file ascii.txt
ascii.txt: ASCII text

Finally: there's no real way to distinguish between, for example, ISO-8859-1 and ISO-8859-2, unless you're going to assume the content is natural language and use statistical methods. This is probably why file says "ISO-8859".
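The decode-then-fall-back strategy described in this answer can be sketched in Python. This is a minimal illustration, not how file itself is implemented; the function names guess_encoding and guess_file_encoding are made up for this example, and the "iso-8859-1" result only means "not valid UTF-8", since any byte sequence decodes as ISO-8859-1.

```python
def guess_encoding(data):
    """Guess the encoding of a bytes object (hypothetical helper).

    Returns 'us-ascii', 'utf-8', or 'iso-8859-1'. The last value is a
    fallback, not a positive identification: every byte sequence is
    valid ISO-8859-1, so it cannot be told apart from ISO-8859-2 etc.
    """
    try:
        data.decode("ascii")
        return "us-ascii"
    except UnicodeDecodeError:
        pass
    try:
        # UTF-8 is the stricter encoding, so try it first.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "iso-8859-1"


def guess_file_encoding(path):
    """Read the whole file in binary mode and guess its encoding."""
    with open(path, "rb") as fh:
        return guess_encoding(fh.read())
```

Reading the whole file keeps the sketch simple, at the cost of memory use on very large files.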

Answer 4

Answered by broadband

The file command is not 100% reliable. A simple test:

#!/bin/bash

echo "a" > /tmp/foo

for i in {1..1000000}
do
  echo "asdas" >> /tmp/foo
done

echo "ü???ü?? " >> /tmp/foo

file -b --mime-encoding /tmp/foo

This outputs:

us-ascii

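ASCII has no German umlauts, so this result is wrong. Presumably file only inspects the beginning of the file and never reaches the non-ASCII last line.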

A file is just a sequence of bytes. Without trusting metadata (a BOM, which is recommended only for UTF-16 and UTF-32; MIME; a data header), you can't really detect the encoding. A sequence of bytes can be interpreted as UTF-8, ISO-8859-1/2, or anything you want, as long as the chosen encoding has a mapping for that particular sequence. What you want is to decode the whole file content with the desired character encoding. If that fails, the desired encoding has no mapping for this sequence of bytes.
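To make the point about interpretation concrete, here is a small Python sketch that decodes the same bytes under several encodings. The helper name interpretations is invented for this example; it returns None where an encoding has no mapping for the byte sequence.

```python
def interpretations(raw, encodings):
    """Decode raw bytes under each encoding (hypothetical helper).

    Returns a dict mapping encoding name to the decoded string, or to
    None when the bytes are not valid in that encoding.
    """
    result = {}
    for enc in encodings:
        try:
            result[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            result[enc] = None
    return result


# "\u00fc" is the letter u with diaeresis (a German umlaut).
as_utf8 = "\u00fc".encode("utf-8")         # two bytes: C3 BC
as_latin1 = "\u00fc".encode("iso-8859-1")  # one byte: FC

# The UTF-8 bytes decode without error in all three encodings,
# but only the UTF-8 interpretation gives back the umlaut.
print(interpretations(as_utf8, ("utf-8", "iso-8859-1", "iso-8859-2")))

# The ISO-8859-1 byte is not valid UTF-8, so that entry is None.
print(interpretations(as_latin1, ("utf-8", "iso-8859-1")))
```

The first call shows why a successful decode proves little for the single-byte ISO-8859 family, while the second shows why a failed UTF-8 decode is a useful signal.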

In a shell script you could use Python, Perl or, as Laurence Gonsalves says, iconv. For text files, this is what I use in Python:

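import codecs

# With errors='strict', reading raises UnicodeDecodeError on invalid UTF-8.
f = codecs.open(path, encoding='utf-8', errors='strict')


def valid_string(data):
  # data is a bytes object; True means it decodes cleanly as UTF-8.
  try:
    data.decode('utf-8')
    return True
  except UnicodeDecodeError:
    return False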

How do you know that a file is a text file? You don't. You decode it line by line with the desired character encoding. OK, you can add a little trust and check whether a BOM exists (in which case the file is UTF-encoded).
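The BOM check mentioned above might look like the following Python sketch. The name detect_bom is invented for this example; the BOM constants come from the standard codecs module. A missing BOM proves nothing, since UTF-8 files usually have none.

```python
import codecs

# Check the 4-byte UTF-32 BOMs before the 2-byte UTF-16 ones, because
# the UTF-32 LE BOM begins with the same bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data):
    """Return an encoding name if data starts with a known BOM, else None.

    Hypothetical helper: this trusts the BOM and does not verify that
    the rest of the data is valid in the reported encoding.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

Only the first four bytes of a file are needed as input, so this can be run cheaply before attempting a full decode.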
