Bash: how to remove all diacritics from a file?

Question

Asked by Village

I have a file containing many vowels with diacritics. I need to make these replacements:

Replace ā, á, ǎ, and à with a.
Replace ē, é, ě, and è with e.
Replace ī, í, ǐ, and ì with i.
Replace ō, ó, ǒ, and ò with o.
Replace ū, ú, ǔ, and ù with u.
Replace ǖ, ǘ, ǚ, and ǜ with ü.
Replace Ā, Á, Ǎ, and À with A.
Replace Ē, É, Ě, and È with E.
Replace Ī, Í, Ǐ, and Ì with I.
Replace Ō, Ó, Ǒ, and Ò with O.
Replace Ū, Ú, Ǔ, and Ù with U.
Replace Ǖ, Ǘ, Ǚ, and Ǜ with Ü.

I know I can replace them one at a time with this:

sed -i 's/ā/a/g' ./file.txt

Is there a more efficient way to replace all of these?

Answer 1

Answered by Kent

If you check the man page of the tool iconv:

//TRANSLIT
When the string "//TRANSLIT" is appended to --to-code, transliteration is activated. This means that when a character cannot be represented in the target character set, it can be approximated through one or several similarly looking characters.

So we could do:

kent$  cat test1
    Replace ā, á, ǎ, and à with a.
    Replace ē, é, ě, and è with e.
    Replace ī, í, ǐ, and ì with i.
    Replace ō, ó, ǒ, and ò with o.
    Replace ū, ú, ǔ, and ù with u.
    Replace ǖ, ǘ, ǚ, and ǜ with ü.
    Replace Ā, Á, Ǎ, and À with A.
    Replace Ē, É, Ě, and È with E.
    Replace Ī, Í, Ǐ, and Ì with I.
    Replace Ō, Ó, Ǒ, and Ò with O.
    Replace Ū, Ú, Ǔ, and Ù with U.
    Replace Ǖ, Ǘ, Ǚ, and Ǜ with Ü.


kent$  iconv -f utf8 -t ascii//TRANSLIT test1
    Replace a, a, a, and a with a.
    Replace e, e, e, and e with e.
    Replace i, i, i, and i with i.
    Replace o, o, o, and o with o.
    Replace u, u, u, and u with u.
    Replace u, u, u, and u with u.
    Replace A, A, A, and A with A.
    Replace E, E, E, and E with E.
    Replace I, I, I, and I with I.
    Replace O, O, O, and O with O.
    Replace U, U, U, and U with U.
    Replace U, U, U, and U with U.
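
iconv writes its result to standard output, so updating a file the way sed -i does needs a temporary file. The sketch below is one way to do that; translit_in_place is a hypothetical helper name, not part of iconv. The exact replacements depend on the iconv implementation and the current locale, and a conversion to ASCII cannot keep ü (the output above turns it into u).

```shell
#!/usr/bin/env bash
# Hypothetical helper: transliterate a UTF-8 file to ASCII and write it back.
# The original file is only overwritten if iconv succeeds.
translit_in_place() {
    local file=$1 tmp
    tmp=$(mktemp) || return 1
    if iconv -f UTF-8 -t ASCII//TRANSLIT "$file" > "$tmp"; then
        # Copy the contents back instead of moving the temp file,
        # so the original file keeps its permissions.
        cat "$tmp" > "$file"
        rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Usage, with the file name from the question:
# translit_in_place ./file.txt
```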

Answer 2

Answered by potong

This might work for you:

sed -i 'y/āáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜĀÁǍÀĒÉĚÈĪÍǏÌŌÓǑÒŪÚǓÙǕǗǙǛ/aaaaeeeeiiiioooouuuuüüüüAAAAEEEEIIIIOOOOUUUUÜÜÜÜ/' file

Answer 3

Answered by Fedir RYKHTIK

I like iconv, as it handles all accent variations:

cat non-ascii.txt | iconv -f utf8 -t ascii//TRANSLIT//IGNORE > ascii.txt

Answer 4

Answered by ktf

This is what the tr(1) command is for. For example:

tr 'āáǎàēéěèīíǐì...' 'aaaaeeeeiii...' <infile >outfile

You may have to check or change your LANG environment variable to match the character set being used.

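Before picking a tool, it can help to confirm which encoding the file and the session actually use, since byte-oriented tools (GNU tr, for one, works byte by byte) behave differently on UTF-8 than on a single-byte charset. A minimal sketch, assuming iconv is installed; is_valid_utf8 is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the file decodes cleanly as UTF-8.
# iconv exits with a non-zero status when it meets an invalid byte sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Show the character map of the current locale (for example UTF-8).
printf 'current charmap: %s\n' "$(locale charmap 2>/dev/null)"

# Usage:
# is_valid_utf8 infile && echo "infile is valid UTF-8"
```
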
Answer 5

Answered by hungnv

You can use something like this:

  sed -e 's/[àá]/a/g;s/[òó]/o/g;s/[íì]/i/g;s/[êè]/e/g'

Just add more characters to [..] as needed.

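A bracket expression such as [àá] only matches whole characters when sed runs in a locale whose encoding matches the file; in a byte-oriented locale such as C it matches single bytes instead. The sketch below sidesteps that by using alternation, which compares complete byte sequences. It covers the precomposed characters listed in the question and keeps ü and Ü; strip_tone_marks is a hypothetical function name, and the -E flag is assumed to be available (it is in GNU and BSD sed).

```shell
#!/usr/bin/env bash
# Hypothetical helper: remove the tone marks listed in the question,
# keeping the diaeresis on ü/Ü. Reads stdin or the files given as arguments.
strip_tone_marks() {
    sed -E \
        -e 's/ā|á|ǎ|à/a/g' -e 's/Ā|Á|Ǎ|À/A/g' \
        -e 's/ē|é|ě|è/e/g' -e 's/Ē|É|Ě|È/E/g' \
        -e 's/ī|í|ǐ|ì/i/g' -e 's/Ī|Í|Ǐ|Ì/I/g' \
        -e 's/ō|ó|ǒ|ò/o/g' -e 's/Ō|Ó|Ǒ|Ò/O/g' \
        -e 's/ū|ú|ǔ|ù/u/g' -e 's/Ū|Ú|Ǔ|Ù/U/g' \
        -e 's/ǖ|ǘ|ǚ|ǜ/ü/g' -e 's/Ǖ|Ǘ|Ǚ|Ǜ/Ü/g' "$@"
}

printf 'Nǐ hǎo, lǜsè, LǙ\n' | strip_tone_marks   # prints: Ni hao, lüse, LÜ
```
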
Answer 6

Answered by Rich Traube

You can use man iso_8859_1 (or the page for your character set) or od -bc to identify the octal representation of the diacritic. Then use gawk to do the replacing.

{ gsub(/\344/,"a"); print }

This replaces ä with a.

Answer 7

Answered by Fred

#!/bin/bash
# Replace each accented character in the first argument with its plain counterpart.
INPUT="$1"
declare -a acc
declare -a noa
# acc[n] is replaced by noa[n]; the rows of the two arrays line up.
acc=(
    'À' 'Á' 'Â' 'Ã' 'Ä' 'Å' 'Æ' 'Ç' 'È' 'É' 'Ê' 'Ë' 'Ì' 'Í' 'Î' 'Ï' 'Ð' 'Ñ' 'Ò' 'Ó' 'Ô' 'Õ' 'Ö' 'Ø' 'Ù' 'Ú' 'Û' 'Ü' 'Ý' 'ß'
    'à' 'á' 'â' 'ã' 'ä' 'å' 'æ' 'ç' 'è' 'é' 'ê' 'ë' 'ì' 'í' 'î' 'ï' 'ñ' 'ò' 'ó' 'ô' 'õ' 'ö' 'ø' 'ù' 'ú' 'û' 'ü' 'ý' 'ÿ'
    'Ā' 'ā' 'Ă' 'ă' 'Ą' 'ą' 'Ć' 'ć' 'Ĉ' 'ĉ' 'Ċ' 'ċ' 'Č' 'č' 'Ď' 'ď' 'Đ' 'đ'
    'Ē' 'ē' 'Ĕ' 'ĕ' 'Ė' 'ė' 'Ę' 'ę' 'Ě' 'ě' 'Ĝ' 'ĝ' 'Ğ' 'ğ' 'Ġ' 'ġ' 'Ģ' 'ģ' 'Ĥ' 'ĥ' 'Ħ' 'ħ'
    'Ĩ' 'ĩ' 'Ī' 'ī' 'Ĭ' 'ĭ' 'Į' 'į' 'İ' 'ı' 'Ĳ' 'ĳ' 'Ĵ' 'ĵ' 'Ķ' 'ķ'
    'Ĺ' 'ĺ' 'Ļ' 'ļ' 'Ľ' 'ľ' 'Ŀ' 'ŀ' 'Ł' 'ł' 'Ń' 'ń' 'Ņ' 'ņ' 'Ň' 'ň' 'ŉ'
    'Ō' 'ō' 'Ŏ' 'ŏ' 'Ő' 'ő' 'Œ' 'œ' 'Ŕ' 'ŕ' 'Ŗ' 'ŗ' 'Ř' 'ř'
    'Ś' 'ś' 'Ŝ' 'ŝ' 'Ş' 'ş' 'Š' 'š' 'Ţ' 'ţ' 'Ť' 'ť' 'Ŧ' 'ŧ'
    'Ũ' 'ũ' 'Ū' 'ū' 'Ŭ' 'ŭ' 'Ů' 'ů' 'Ű' 'ű' 'Ų' 'ų' 'Ŵ' 'ŵ' 'Ŷ' 'ŷ' 'Ÿ'
    'Ź' 'ź' 'Ż' 'ż' 'Ž' 'ž' 'ſ' 'ƒ' 'Ơ' 'ơ' 'Ư' 'ư'
    'Ǎ' 'ǎ' 'Ǐ' 'ǐ' 'Ǒ' 'ǒ' 'Ǔ' 'ǔ' 'Ǖ' 'ǖ' 'Ǘ' 'ǘ' 'Ǚ' 'ǚ' 'Ǜ' 'ǜ' 'Ǻ' 'ǻ' 'Ǽ' 'ǽ' 'Ǿ' 'ǿ'
)
noa=(
    'A' 'A' 'A' 'A' 'A' 'A' 'AE' 'C' 'E' 'E' 'E' 'E' 'I' 'I' 'I' 'I' 'D' 'N' 'O' 'O' 'O' 'O' 'O' 'O' 'U' 'U' 'U' 'U' 'Y' 's'
    'a' 'a' 'a' 'a' 'a' 'a' 'ae' 'c' 'e' 'e' 'e' 'e' 'i' 'i' 'i' 'i' 'n' 'o' 'o' 'o' 'o' 'o' 'o' 'u' 'u' 'u' 'u' 'y' 'y'
    'A' 'a' 'A' 'a' 'A' 'a' 'C' 'c' 'C' 'c' 'C' 'c' 'C' 'c' 'D' 'd' 'D' 'd'
    'E' 'e' 'E' 'e' 'E' 'e' 'E' 'e' 'E' 'e' 'G' 'g' 'G' 'g' 'G' 'g' 'G' 'g' 'H' 'h' 'H' 'h'
    'I' 'i' 'I' 'i' 'I' 'i' 'I' 'i' 'I' 'i' 'IJ' 'ij' 'J' 'j' 'K' 'k'
    'L' 'l' 'L' 'l' 'L' 'l' 'L' 'l' 'l' 'l' 'N' 'n' 'N' 'n' 'N' 'n' 'n'
    'O' 'o' 'O' 'o' 'O' 'o' 'OE' 'oe' 'R' 'r' 'R' 'r' 'R' 'r'
    'S' 's' 'S' 's' 'S' 's' 'S' 's' 'T' 't' 'T' 't' 'T' 't'
    'U' 'u' 'U' 'u' 'U' 'u' 'U' 'u' 'U' 'u' 'U' 'u' 'W' 'w' 'Y' 'y' 'Y'
    'Z' 'z' 'Z' 'z' 'Z' 'z' 's' 'f' 'O' 'o' 'U' 'u'
    'A' 'a' 'I' 'i' 'O' 'o' 'U' 'u' 'U' 'u' 'U' 'u' 'U' 'u' 'U' 'u' 'A' 'a' 'AE' 'ae' 'O' 'o'
)

i=0
length=${#INPUT}
# Walk the input one character at a time.
while [[ $i -lt $length ]]; do
    char=${INPUT:$i:1};
    #echo $i:$char
    j=0
    # Look the character up in acc and swap in the entry at the same index in noa.
    for letter in "${acc[@]}"
    do
        if [[ "$letter" == "$char" ]]; then
            char="${noa[$j]}"
        fi
        ((j++))
    done
    ((i++))
    OUTPUT=$OUTPUT$char
done
echo "$OUTPUT"

Answer 8

Answered by Bruno

This may not work, simply because your locale must be set!

Use locale to set LC_ALL, for example:

export LC_ALL=en_US.iso88591

Note that the full list of locales is available through:

locale -a

Answer 9

Answered by Thiago Mata

If you, like me, need to replace the accents only in certain places in your text file, you can do that using this kind of regex:

echo '{"doNotReplaceKey":"bábõgêjírù","replaceValueKey":"bábõgêjírù","anotherNotReplaceKey":"bábõgêjírù"}' \
    | sed -e ':a;s/replaceValueKey":"\([a-zA-Z0-9 -_]*\)[áâàãä]/replaceValueKey":"\1a/g;ta' \
    | sed -e ':a;s/replaceValueKey":"\([a-zA-Z0-9 -_]*\)[éêèë]/replaceValueKey":"\1e/g;ta'  \
    | sed -e ':a;s/replaceValueKey":"\([a-zA-Z0-9 -_]*\)[íîìï]/replaceValueKey":"\1i/g;ta'  \
    | sed -e ':a;s/replaceValueKey":"\([a-zA-Z0-9 -_]*\)[óôòõö]/replaceValueKey":"\1o/g;ta' \
    | sed -e ':a;s/replaceValueKey":"\([a-zA-Z0-9 -_]*\)[úûùü]/replaceValueKey":"\1u/g;ta'

Output

{"doNotReplaceKey":"bábõgêjírù","replaceValueKey":"babogejiru","anotherNotReplaceKey":"bábõgêjírù"}

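The per-key replacement in Answer 9 can also be sketched as a single sed program. This is a variant under stated assumptions, not the answer's own code: strip_accents_in_key is a hypothetical function name, the key name must not contain regex metacharacters, and the value must not contain escaped quotes. It uses [^"]* so a match cannot run past the closing quote of the value, and alternation instead of bracket expressions so the result does not depend on the locale.

```shell
#!/usr/bin/env bash
# Hypothetical helper: strip accents only inside the string value of one JSON key.
strip_accents_in_key() {
    # Capture group 1: the quoted key, the colon, and the value up to the accent.
    local prefix="(\"$1\":\"[^\"]*)"
    # After every successful substitution, jump back to the top and try again,
    # so all accented letters in the value are handled one by one.
    sed -E \
        -e ':again' \
        -e "s/${prefix}(á|â|à|ã|ä)/\1a/" -e 't again' \
        -e "s/${prefix}(é|ê|è|ë)/\1e/" -e 't again' \
        -e "s/${prefix}(í|î|ì|ï)/\1i/" -e 't again' \
        -e "s/${prefix}(ó|ô|ò|õ|ö)/\1o/" -e 't again' \
        -e "s/${prefix}(ú|û|ù|ü)/\1u/" -e 't again'
}

echo '{"doNotReplaceKey":"bábõgêjírù","replaceValueKey":"bábõgêjírù"}' \
    | strip_accents_in_key replaceValueKey
# prints: {"doNotReplaceKey":"bábõgêjírù","replaceValueKey":"babogejiru"}
```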