bash: finding the encoding of text files
Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/12866068/
Finding the encoding of text files
Asked by Hakim
I have a bunch of text files with different encodings, and I want to convert all of them to UTF-8. Since there are about 1000 files, I can't do it manually. I know that there are commands in Linux that convert files from one encoding to another, but my question is how to automatically detect the current encoding of a file. Specifically, I'm looking for a command (say FindEncoding($File)) to do this:
foreach file
do
$encoding=FindEncoding($File);
uconv -f $encoding -t utf-8 $file;
done
Answered by J. Katzwinkel
I usually do something like this:
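for f in *.txt; do
    # file -i prints e.g. "name.txt: text/plain; charset=iso-8859-1"; keep only the charset
    encoding=$(file -i "$f" | sed "s/.*charset=\(.*\)$/\1/")
    # convert the file in place from the detected encoding to UTF-8
    recode "$encoding..utf-8" "$f"
done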
Note that recode converts in place: it overwrites the original file with the re-encoded version.
If the text files cannot be identified by extension, their MIME type can be determined with file -bi "$f" | cut -d ';' -f 1.
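As a rough illustration of the MIME-type approach, the sketch below selects files whose type starts with text/ instead of relying on the .txt extension. The function names is_text_mime and list_text_files are made up for this example and are not part of the original answer; it assumes the file utility is installed.

```shell
#!/usr/bin/env bash
# Sketch: pick text files by MIME type instead of by extension.
# The function names below are illustrative, not from the original answer.

# Succeeds if the MIME type passed as $1 is a text type (text/plain, text/html, ...).
is_text_mime() {
    case "$1" in
        text/*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Prints the regular files in directory $1 (default: .) whose MIME type is text/*.
list_text_files() {
    local dir=${1:-.} f mime
    for f in "$dir"/*; do
        [ -f "$f" ] || continue                    # skip directories and unmatched globs
        mime=$(file -bi "$f" | cut -d ';' -f 1)    # e.g. "text/plain"
        if is_text_mime "$mime"; then
            printf '%s\n' "$f"
        fi
    done
}
```

Calling list_text_files /some/dir would then print one candidate file per line, which could feed the conversion loop in place of the *.txt glob.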
It is also probably a good idea to avoid unnecessary re-encoding by checking for UTF-8 first:
if [ "$encoding" != "utf-8" ]; then
    # encode
    recode "$encoding..utf-8" "$f"
fi
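To show where this check might sit in the whole loop, here is a minimal sketch that wraps the detection and the conversion in functions. The names charset_from and convert_txt_to_utf8 are hypothetical helpers introduced for illustration. Since recode rewrites files in place, it is probably safest to try this on copies first.

```shell
#!/usr/bin/env bash
# Sketch: the conversion loop with the UTF-8 check folded in.
# Function names are illustrative and not part of the original answer.

# Extracts the charset from file's output, e.g.
# "text/plain; charset=iso-8859-1" -> "iso-8859-1".
charset_from() {
    printf '%s\n' "$1" | sed 's/.*charset=\(.*\)$/\1/'
}

# Converts every *.txt file in the current directory that is not already UTF-8.
# recode rewrites files in place, so run this on copies first.
convert_txt_to_utf8() {
    local f encoding
    for f in *.txt; do
        [ -f "$f" ] || continue                  # skip if the glob matched nothing
        encoding=$(charset_from "$(file -bi "$f")")
        if [ "$encoding" != "utf-8" ]; then
            echo "converting $f from $encoding"
            recode "$encoding..utf-8" "$f"
        fi
    done
}
```

Running convert_txt_to_utf8 inside the directory holding the files performs the conversion. Files already reported as utf-8 are left untouched.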
After this treatment, there might still be some files reported with a us-ascii encoding. The reason is that ASCII is a subset of UTF-8: a file containing only ASCII characters is byte-for-byte identical in both encodings, so it continues to be reported as us-ascii. As soon as a character that cannot be expressed in ASCII is introduced, the file is detected as UTF-8.
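The subset relationship can be checked with a small experiment. The sketch below uses iconv and cmp rather than recode, purely as a convenient way to compare bytes; this is a swap made for the demonstration, not something from the original answer. The file names and variable names are arbitrary.

```shell
#!/usr/bin/env bash
# Sketch: pure-ASCII text has the same bytes in us-ascii and in UTF-8.

tmp=$(mktemp -d)
printf 'plain ascii text\n' > "$tmp/ascii.txt"

# Converting from us-ascii to UTF-8 leaves every byte unchanged.
iconv -f us-ascii -t utf-8 "$tmp/ascii.txt" > "$tmp/converted.txt"
if cmp -s "$tmp/ascii.txt" "$tmp/converted.txt"; then
    ascii_unchanged=yes
else
    ascii_unchanged=no
fi
echo "ASCII file unchanged by conversion: $ascii_unchanged"

# A non-ASCII character (here "é", bytes C3 A9 in UTF-8) means the file is
# no longer valid us-ascii, so iconv refuses to read it as such.
printf 'caf\303\251\n' > "$tmp/utf8.txt"
if iconv -f us-ascii -t utf-8 "$tmp/utf8.txt" > /dev/null 2>&1; then
    utf8_is_ascii=yes
else
    utf8_is_ascii=no
fi
echo "File with é is valid us-ascii: $utf8_is_ascii"

rm -rf "$tmp"
```

Before the cleanup step, running file -bi on the two files should typically report charset=us-ascii for the first and charset=utf-8 for the second, which is the switch described above.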

