Linux: convert any encoding to UTF-8 with iconv
Disclaimer: this page is a translation of a popular StackOverflow question. It is provided under the CC BY-SA 4.0 license; if you use or share it, you must do so under the same license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/9824902/
iconv any encoding to UTF-8
Asked by Blainer
I am trying to point iconv at a directory so that all of its files are converted to UTF-8 regardless of their current encoding.
I am using this script, but you have to specify what encoding you are converting FROM. How can I make it autodetect the current encoding?
dir_iconv.sh
#!/bin/bash
ICONVBIN='/usr/bin/iconv' # path to iconv binary

if [ $# -lt 3 ]
then
    echo "$0 dir from_charset to_charset"
    exit
fi

for f in "$1"/*
do
    if test -f "$f"
    then
        echo -e "\nConverting $f"
        /bin/mv "$f" "$f.old"
        $ICONVBIN -f "$2" -t "$3" "$f.old" > "$f"
    else
        echo -e "\nSkipping $f - not a regular file"
    fi
done
terminal line:

sudo convert/dir_iconv.sh convert/books CURRENT_ENCODING utf8

CHARSET="$(file -bi "$i" | awk -F "=" '{print $2}')"
if [ "$CHARSET" != utf-8 ]; then
    iconv -f "$CHARSET" -t utf8 "$i" -o outfile
fi
Accepted answer by Michal Kottman
Maybe you are looking for enca:
Enca is an Extremely Naive Charset Analyser. It detects character set and encoding of text files and can also convert them to other encodings using either a built-in converter or external libraries and tools like libiconv, librecode, or cstocs.
Currently it supports Belarusian, Bulgarian, Croatian, Czech, Estonian, Hungarian, Latvian, Lithuanian, Polish, Russian, Slovak, Slovene, Ukrainian, Chinese, and some multibyte encodings independently on language.
Note that in general, autodetection of the current encoding is a difficult process (the same byte sequence can be valid text in multiple encodings). enca uses heuristics based on the language you tell it to detect (to limit the number of encodings). You can use enconv to convert text files to a single encoding.
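For example, a minimal sketch of the enca workflow (the language flag and file name are illustrative; adjust them to your data):

# Detect the encoding, telling enca which language the text is in
enca -L ru file.txt

# Convert the file to UTF-8 in place
enca -L ru -x UTF-8 file.txt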
Answered by Julian Hughes
You can get what you need using the standard GNU utils file and awk. Example:
file -bi .xsession-errors

gives me:

"text/plain; charset=us-ascii"

so

file -bi .xsession-errors | awk -F "=" '{print $2}'

gives me

"us-ascii"
I use it in scripts like so:
#!/bin/bash
# converting all files in a dir to utf8
for f in *
do
  if test -f "$f"
  then
    echo -e "\nConverting $f"
    CHARSET="$(file -bi "$f" | awk -F "=" '{print $2}')"
    if [ "$CHARSET" != utf-8 ]; then
      iconv -f "$CHARSET" -t utf8 "$f" -o "$f"
    fi
  else
    echo -e "\nSkipping $f - not a regular file";
  fi
done
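One hedged aside: with some iconv builds, writing the output back to the input path with -o can truncate the file before it has been fully read, which may explain the truncated files reported in a later answer. A minimal variant of the same loop that goes through a temporary file (the mktemp usage is my addition, not part of the original answer):

#!/bin/bash
# converting all files in a dir to utf8, via a temporary file
for f in *
do
  if test -f "$f"
  then
    CHARSET="$(file -bi "$f" | awk -F "=" '{print $2}')"
    if [ "$CHARSET" != utf-8 ] && [ "$CHARSET" != binary ]; then
      echo -e "\nConverting $f from $CHARSET"
      tmp="$(mktemp)"
      if iconv -f "$CHARSET" -t utf8 "$f" > "$tmp"; then
        mv "$tmp" "$f"   # replace the original only if the conversion succeeded
      else
        rm -f "$tmp"     # conversion failed, keep the original untouched
      fi
    fi
  fi
done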
Answered by Douglas Fernandes
Compiling them all: go to the dir and create dir2utf8.sh:
#!/bin/bash
apt-get -y install recode uchardet > /dev/null
find "dir" -type f | while read FFN # 'dir' should be changed...
do
  encoding=$(uchardet "$FFN")
  echo "$FFN: $encoding"
  enc=`echo $encoding | sed 's#^x-mac-#mac#'`
  set +x
  recode $enc..UTF-8 "$FFN"
done
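Assuming the script is saved as dir2utf8.sh and the 'dir' argument to find has been edited to point at the target directory, it could be run along these lines (illustrative):

chmod +x dir2utf8.sh
sudo ./dir2utf8.sh   # sudo because the script apt-get installs recode and uchardet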
Answered by demofly
Here is my solution to convert all files in place using recode and uchardet.

Put it into convert-dir-to-utf8.sh and run:

bash convert-dir-to-utf8.sh /pat/to/my/trash/dir
detection_cat ()
{
    DET_OUT=$(chardet "$1");
    ENC=$(echo $DET_OUT | sed "s|^.*: \(.*\) (confid.*$|\1|");
    iconv -f "$ENC" "$1"
}
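A possible usage, assuming the function has been sourced into the current shell (file names are illustrative). Note that without -t, iconv converts to the current locale's encoding, so this produces UTF-8 only when run under a UTF-8 locale:

detection_cat legacy_koi8r.txt > legacy_koi8r.utf8.txt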
Note that sed is a workaround for mac encodings here. Many uncommon encodings need workarounds like this.
Answered by demofly
Check out the tools available for data conversion in a Linux CLI: https://www.debian.org/doc/manuals/debian-reference/ch11.en.html
Also, there is a quest to find out the full list of encodings available in iconv. Just run iconv --list and you will find that the encoding names differ from the names returned by the uchardet tool (for example: x-mac-cyrillic in uchardet vs. mac-cyrillic in iconv).
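As a small sketch of how that naming gap might be bridged before calling iconv (the sed rule only covers the x-mac-* case mentioned above; the file name is illustrative):

#!/bin/bash
f="some_file.txt"                            # illustrative file name
enc=$(uchardet "$f")                         # e.g. "x-mac-cyrillic"
enc=$(echo "$enc" | sed 's#^x-mac-#mac#')    # rename to iconv's convention, e.g. "mac-cyrillic"

if iconv -f "$enc" -t UTF-8 "$f" > "$f.utf8"; then
    echo "Converted $f ($enc) -> $f.utf8"
else
    echo "iconv could not convert $f with encoding '$enc'" >&2
fi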
Answered by Jared Tsai
The enca command doesn't work for my Simplified Chinese text file with GB2312 encoding.
Instead, I use the following function to convert the text file for me. You could of course redirect the output into a file.
It requires the chardet and iconv commands.
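A minimal sketch of such a function (the function name, the chardetect binary name, and the output parsing are assumptions, not taken from the original answer):

#!/bin/bash
# Detect the charset with chardet, then convert to UTF-8 with iconv.
convert_to_utf8 () {
    local enc
    enc=$(chardetect "$1" | awk '{print $2}')   # chardetect prints e.g. "file.txt: GB2312 with confidence 0.99"
    iconv -f "$enc" -t UTF-8 "$1"
}

# usage: redirect the converted output into a new file
convert_to_utf8 input_gb2312.txt > output_utf8.txt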
Answered by Eduardo Lucio

First answer

#!/bin/bash
find "<YOUR_FOLDER_PATH>" -name '*' -type f -exec grep -Iq . {} \; -print0 |
    while IFS= read -r -d $'\0' LINE_FILE; do
        CHARSET=$(uchardet "$LINE_FILE")
        echo "Converting ($CHARSET) $LINE_FILE"
        # NOTE: Convert/reconvert to utf8. By Questor
        iconv -f "$CHARSET" -t utf8 "$LINE_FILE" -o "$LINE_FILE"
        # NOTE: Remove "BOM" if exists as it is unnecessary. By Questor
        # [Refs.: https://stackoverflow.com/a/2223926/3223785 ,
        # https://stackoverflow.com/a/45240995/3223785 ]
        sed -i '1s/^\xEF\xBB\xBF//' "$LINE_FILE"
    done
# [Refs.: https://justrocketscience.com/post/handle-encodings ,
# https://stackoverflow.com/a/9612232/3223785 ,
# https://stackoverflow.com/a/13659891/3223785 ]

FURTHER QUESTION: I do not know if my approach is the safest. I say this because I noticed that some files are not correctly converted (characters will be lost) or are "truncated". I suspect that this has to do with the iconv tool or with the charset information obtained with the uchardet tool. I was curious about the solution presented by @demofly because it could be safer.

Another answer

Based on @demofly's answer:

#!/bin/bash
find "<YOUR_FOLDER_PATH>" -name '*' -type f -exec grep -Iq . {} \; -print0 |
    while IFS= read -r -d $'\0' LINE_FILE; do
        CHARSET=$(uchardet "$LINE_FILE")
        REENCSED=`echo $CHARSET | sed 's#^x-mac-#mac#'`
        echo "\"$CHARSET\" \"$LINE_FILE\""
        # NOTE: Convert/reconvert to utf8. By Questor
        recode $REENCSED..UTF-8 "$LINE_FILE" 2> STDERR_OP 1> STDOUT_OP
        STDERR_OP=$(cat STDERR_OP)
        rm -f STDERR_OP
        if [ -n "$STDERR_OP" ] ; then
            # NOTE: Convert/reconvert to utf8. By Questor
            iconv -f "$CHARSET" -t utf8 "$LINE_FILE" -o "$LINE_FILE" 2> STDERR_OP 1> STDOUT_OP
            STDERR_OP=$(cat STDERR_OP)
            rm -f STDERR_OP
        fi
        # NOTE: Remove "BOM" if exists as it is unnecessary. By Questor
        # [Refs.: https://stackoverflow.com/a/2223926/3223785 ,
        # https://stackoverflow.com/a/45240995/3223785 ]
        sed -i '1s/^\xEF\xBB\xBF//' "$LINE_FILE"
        if [ -n "$STDERR_OP" ] ; then
            echo "ERROR: \"$STDERR_OP\""
        fi
        STDOUT_OP=$(cat STDOUT_OP)
        rm -f STDOUT_OP
        if [ -n "$STDOUT_OP" ] ; then
            echo "RESULT: \"$STDOUT_OP\""
        fi
    done
# [Refs.: https://justrocketscience.com/post/handle-encodings ,
# https://stackoverflow.com/a/9612232/3223785 ,
# https://stackoverflow.com/a/13659891/3223785 ]

Third answer

Hybrid solution with recode and vim:

#!/bin/bash
find "<YOUR_FOLDER_PATH>" -name '*' -type f -exec grep -Iq . {} \; -print0 |
    while IFS= read -r -d $'\0' LINE_FILE; do
        CHARSET=$(uchardet "$LINE_FILE")
        REENCSED=`echo $CHARSET | sed 's#^x-mac-#mac#'`
        echo "\"$CHARSET\" \"$LINE_FILE\""
        # NOTE: Convert/reconvert to utf8. By Questor
        recode $REENCSED..UTF-8 "$LINE_FILE" 2> STDERR_OP 1> STDOUT_OP
        STDERR_OP=$(cat STDERR_OP)
        rm -f STDERR_OP
        if [ -n "$STDERR_OP" ] ; then
            # NOTE: Convert/reconvert to utf8. By Questor
            bash -c "</dev/tty vim -u NONE +\"set binary | set noeol | set nobomb | set encoding=utf-8 | set fileencoding=utf-8 | wq\" \"$LINE_FILE\""
        else
            # NOTE: Remove "BOM" if exists as it is unnecessary. By Questor
            # [Refs.: https://stackoverflow.com/a/2223926/3223785 ,
            # https://stackoverflow.com/a/45240995/3223785 ]
            sed -i '1s/^\xEF\xBB\xBF//' "$LINE_FILE"
        fi
    done

This was the solution with the highest number of perfect conversions. Additionally, we did not have any truncated files.

- WARNING: Make a backup of your files and use a merge tool to check/compare the changes. Problems will probably appear!
- TIP: The command sed -i '1s/^\xEF\xBB\xBF//' "$LINE_FILE" can be executed after a preliminary comparison with the merge tool, after a conversion without it, since it can cause "differences".
- NOTE: The search using find brings in all non-binary files from the given path ("<YOUR_FOLDER_PATH>") and its subfolders.