Linux: convert any encoding to UTF-8 with iconv
Disclaimer: this page is a translation of a popular StackOverflow question. It is provided under the CC BY-SA 4.0 license; if you use or share it, you must do so under the same license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/9824902/
iconv any encoding to UTF-8
Asked by Blainer
I am trying to point iconv at a directory so that all of its files are converted to UTF-8 regardless of their current encoding.
I am using this script, but you have to specify what encoding you are converting FROM. How can I make it autodetect the current encoding?
dir_iconv.sh
#!/bin/bash
ICONVBIN='/usr/bin/iconv' # path to iconv binary

if [ $# -lt 3 ]
then
    echo "$0 dir from_charset to_charset"
    exit
fi

for f in "$1"/*
do
    if test -f "$f"
    then
        echo -e "\nConverting $f"
        /bin/mv "$f" "$f.old"
        $ICONVBIN -f "$2" -t "$3" "$f.old" > "$f"
    else
        echo -e "\nSkipping $f - not a regular file"
    fi
done
terminal line:

sudo convert/dir_iconv.sh convert/books CURRENT_ENCODING utf8

CHARSET="$(file -bi "$i" | awk -F "=" '{print $2}')"
if [ "$CHARSET" != utf-8 ]; then
    iconv -f "$CHARSET" -t utf8 "$i" -o outfile
fi
Accepted answer by Michal Kottman
Maybe you are looking for enca:
Enca is an Extremely Naive Charset Analyser. It detects character set and encoding of text files and can also convert them to other encodings using either a built-in converter or external libraries and tools like libiconv, librecode, or cstocs.
Currently it supports Belarusian, Bulgarian, Croatian, Czech, Estonian, Hungarian, Latvian, Lithuanian, Polish, Russian, Slovak, Slovene, Ukrainian, Chinese, and some multibyte encodings independently on language.
Note that in general, autodetection of the current encoding is a difficult process (the same byte sequence can be valid text in multiple encodings). enca uses heuristics based on the language you tell it to detect (to limit the number of encodings). You can use enconv to convert text files to a single encoding.
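For example, a minimal sketch of the enca workflow (the language flag and file name are illustrative; adjust them to your data):

# Detect the encoding, telling enca which language the text is in
enca -L ru file.txt

# Convert the file to UTF-8 in place
enca -L ru -x UTF-8 file.txt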
Answered by Julian Hughes
You can get what you need using the standard GNU utils file and awk. Example:
file -bi .xsession-errors

gives me:

"text/plain; charset=us-ascii"

so

file -bi .xsession-errors | awk -F "=" '{print $2}'

gives me

"us-ascii"
I use it in scripts like so:
#!/bin/bash
# converting all files in a dir to utf8
for f in *
do
  if test -f "$f"
  then
    echo -e "\nConverting $f"
    CHARSET="$(file -bi "$f" | awk -F "=" '{print $2}')"
    if [ "$CHARSET" != utf-8 ]; then
      iconv -f "$CHARSET" -t utf8 "$f" -o "$f"
    fi
  else
    echo -e "\nSkipping $f - not a regular file";
  fi
done
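One hedged aside: with some iconv builds, writing the output back to the input path with -o can truncate the file before it has been fully read, which may explain the truncated files reported in a later answer. A minimal variant of the same loop that goes through a temporary file (the mktemp usage is my addition, not part of the original answer):

#!/bin/bash
# converting all files in a dir to utf8, via a temporary file
for f in *
do
  if test -f "$f"
  then
    CHARSET="$(file -bi "$f" | awk -F "=" '{print $2}')"
    if [ "$CHARSET" != utf-8 ] && [ "$CHARSET" != binary ]; then
      echo -e "\nConverting $f from $CHARSET"
      tmp="$(mktemp)"
      if iconv -f "$CHARSET" -t utf8 "$f" > "$tmp"; then
        mv "$tmp" "$f"   # replace the original only if the conversion succeeded
      else
        rm -f "$tmp"     # conversion failed, keep the original untouched
      fi
    fi
  fi
done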
Answered by Douglas Fernandes
Compiling them all: go to the dir and create dir2utf8.sh:
#!/bin/bash
apt-get -y install recode uchardet > /dev/null
find "dir" -type f | while read FFN # 'dir' should be changed...
do
  encoding=$(uchardet "$FFN")
  echo "$FFN: $encoding"
  enc=`echo $encoding | sed 's#^x-mac-#mac#'`
  set +x
  recode $enc..UTF-8 "$FFN"
done
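Assuming the script is saved as dir2utf8.sh and the 'dir' argument to find has been edited to point at the target directory, it could be run along these lines (illustrative):

chmod +x dir2utf8.sh
sudo ./dir2utf8.sh   # sudo because the script apt-get installs recode and uchardet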
Answered by demofly
Here is my solution to convert all files in place using recode and uchardet.

Put it into convert-dir-to-utf8.sh and run:

bash convert-dir-to-utf8.sh /pat/to/my/trash/dir
detection_cat ()
{
    DET_OUT=$(chardet "$1");
    ENC=$(echo $DET_OUT | sed "s|^.*: \(.*\) (confid.*$|\1|");
    iconv -f "$ENC" "$1"
}
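A possible usage, assuming the function has been sourced into the current shell (file names are illustrative). Note that without -t, iconv converts to the current locale's encoding, so this produces UTF-8 only when run under a UTF-8 locale:

detection_cat legacy_koi8r.txt > legacy_koi8r.utf8.txt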
Note that sed is a workaround for mac encodings here. Many uncommon encodings need workarounds like this.
Answered by demofly
Check out the tools available for data conversion in a Linux CLI: https://www.debian.org/doc/manuals/debian-reference/ch11.en.html
Also, there is a quest to find out the full list of encodings available in iconv. Just run iconv --list and you will find that the encoding names differ from the names returned by the uchardet tool (for example: x-mac-cyrillic in uchardet vs. mac-cyrillic in iconv).
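As a small sketch of how that naming gap might be bridged before calling iconv (the sed rule only covers the x-mac-* case mentioned above; the file name is illustrative):

#!/bin/bash
f="some_file.txt"                            # illustrative file name
enc=$(uchardet "$f")                         # e.g. "x-mac-cyrillic"
enc=$(echo "$enc" | sed 's#^x-mac-#mac#')    # rename to iconv's convention, e.g. "mac-cyrillic"

if iconv -f "$enc" -t UTF-8 "$f" > "$f.utf8"; then
    echo "Converted $f ($enc) -> $f.utf8"
else
    echo "iconv could not convert $f with encoding '$enc'" >&2
fi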
Answered by Jared Tsai
The enca command doesn't work for my Simplified Chinese text file with GB2312 encoding.
Instead, I use the following function to convert the text file for me. You could of course redirect the output into a file.
It requires the chardet and iconv commands.
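A minimal sketch of such a function (the function name, the chardetect binary name, and the output parsing are assumptions, not taken from the original answer):

#!/bin/bash
# Detect the charset with chardet, then convert to UTF-8 with iconv.
convert_to_utf8 () {
    local enc
    enc=$(chardetect "$1" | awk '{print $2}')   # chardetect prints e.g. "file.txt: GB2312 with confidence 0.99"
    iconv -f "$enc" -t UTF-8 "$1"
}

# usage: redirect the converted output into a new file
convert_to_utf8 input_gb2312.txt > output_utf8.txt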
Answered by Eduardo Lucio

First answer

#!/bin/bash
find "<YOUR_FOLDER_PATH>" -name '*' -type f -exec grep -Iq . {} \; -print0 |
    while IFS= read -r -d $'\0' LINE_FILE; do
        CHARSET=$(uchardet "$LINE_FILE")
        echo "Converting ($CHARSET) $LINE_FILE"
        # NOTE: Convert/reconvert to utf8. By Questor
        iconv -f "$CHARSET" -t utf8 "$LINE_FILE" -o "$LINE_FILE"
        # NOTE: Remove "BOM" if exists as it is unnecessary. By Questor
        # [Refs.: https://stackoverflow.com/a/2223926/3223785 ,
        # https://stackoverflow.com/a/45240995/3223785 ]
        sed -i '1s/^\xEF\xBB\xBF//' "$LINE_FILE"
    done
# [Refs.: https://justrocketscience.com/post/handle-encodings ,
# https://stackoverflow.com/a/9612232/3223785 ,
# https://stackoverflow.com/a/13659891/3223785 ]

FURTHER QUESTION: I do not know if my approach is the safest. I say this because I noticed that some files are not correctly converted (characters will be lost) or are "truncated". I suspect that this has to do with the iconv tool or with the charset information obtained with the uchardet tool. I was curious about the solution presented by @demofly because it could be safer.

Another answer

Based on @demofly's answer:

#!/bin/bash
find "<YOUR_FOLDER_PATH>" -name '*' -type f -exec grep -Iq . {} \; -print0 |
    while IFS= read -r -d $'\0' LINE_FILE; do
        CHARSET=$(uchardet "$LINE_FILE")
        REENCSED=`echo $CHARSET | sed 's#^x-mac-#mac#'`
        echo "\"$CHARSET\" \"$LINE_FILE\""
        # NOTE: Convert/reconvert to utf8. By Questor
        recode $REENCSED..UTF-8 "$LINE_FILE" 2> STDERR_OP 1> STDOUT_OP
        STDERR_OP=$(cat STDERR_OP)
        rm -f STDERR_OP
        if [ -n "$STDERR_OP" ] ; then
            # NOTE: Convert/reconvert to utf8. By Questor
            iconv -f "$CHARSET" -t utf8 "$LINE_FILE" -o "$LINE_FILE" 2> STDERR_OP 1> STDOUT_OP
            STDERR_OP=$(cat STDERR_OP)
            rm -f STDERR_OP
        fi
        # NOTE: Remove "BOM" if exists as it is unnecessary. By Questor
        # [Refs.: https://stackoverflow.com/a/2223926/3223785 ,
        # https://stackoverflow.com/a/45240995/3223785 ]
        sed -i '1s/^\xEF\xBB\xBF//' "$LINE_FILE"
        if [ -n "$STDERR_OP" ] ; then
            echo "ERROR: \"$STDERR_OP\""
        fi
        STDOUT_OP=$(cat STDOUT_OP)
        rm -f STDOUT_OP
        if [ -n "$STDOUT_OP" ] ; then
            echo "RESULT: \"$STDOUT_OP\""
        fi
    done
# [Refs.: https://justrocketscience.com/post/handle-encodings ,
# https://stackoverflow.com/a/9612232/3223785 ,
# https://stackoverflow.com/a/13659891/3223785 ]

Third answer

Hybrid solution with recode and vim:

#!/bin/bash
find "<YOUR_FOLDER_PATH>" -name '*' -type f -exec grep -Iq . {} \; -print0 |
    while IFS= read -r -d $'\0' LINE_FILE; do
        CHARSET=$(uchardet "$LINE_FILE")
        REENCSED=`echo $CHARSET | sed 's#^x-mac-#mac#'`
        echo "\"$CHARSET\" \"$LINE_FILE\""
        # NOTE: Convert/reconvert to utf8. By Questor
        recode $REENCSED..UTF-8 "$LINE_FILE" 2> STDERR_OP 1> STDOUT_OP
        STDERR_OP=$(cat STDERR_OP)
        rm -f STDERR_OP
        if [ -n "$STDERR_OP" ] ; then
            # NOTE: Convert/reconvert to utf8. By Questor
            bash -c "</dev/tty vim -u NONE +\"set binary | set noeol | set nobomb | set encoding=utf-8 | set fileencoding=utf-8 | wq\" \"$LINE_FILE\""
        else
            # NOTE: Remove "BOM" if exists as it is unnecessary. By Questor
            # [Refs.: https://stackoverflow.com/a/2223926/3223785 ,
            # https://stackoverflow.com/a/45240995/3223785 ]
            sed -i '1s/^\xEF\xBB\xBF//' "$LINE_FILE"
        fi
    done

This was the solution with the highest number of perfect conversions. Additionally, we did not have any truncated files.

- WARNING: Make a backup of your files and use a merge tool to check/compare the changes. Problems will probably appear!
- TIP: The command sed -i '1s/^\xEF\xBB\xBF//' "$LINE_FILE" can be executed after a preliminary comparison with the merge tool, after a conversion without it, since it can cause "differences".
- NOTE: The search using find brings in all non-binary files from the given path ("<YOUR_FOLDER_PATH>") and its subfolders.