bash: How to remove all of the diacritics from a file?
Original URL: http://stackoverflow.com/questions/10207354/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
How to remove all of the diacritics from a file?
Asked by Village
I have a file containing many vowels with diacritics. I need to make these replacements:
- Replace ā, á, ǎ, and à with a.
- Replace ē, é, ě, and è with e.
- Replace ī, í, ǐ, and ì with i.
- Replace ō, ó, ǒ, and ò with o.
- Replace ū, ú, ǔ, and ù with u.
- Replace ǖ, ǘ, ǚ, and ǜ with ü.
- Replace Ā, Á, Ǎ, and À with A.
- Replace Ē, É, Ě, and È with E.
- Replace Ī, Í, Ǐ, and Ì with I.
- Replace Ō, Ó, Ǒ, and Ò with O.
- Replace Ū, Ú, Ǔ, and Ù with U.
- Replace Ǖ, Ǘ, Ǚ, and Ǜ with Ü.
I know I can replace them one at a time with this:
sed -i 's/ā/a/g' ./file.txt
Is there a more efficient way to replace all of these?
Answered by Kent
If you check the man page of the iconv tool:
//TRANSLIT
When the string "//TRANSLIT" is appended to --to-code, transliteration is activated. This means that when a character cannot be represented in the target character set, it can be approximated through one or several similarly looking characters.
So we could do:
kent$ cat test1
Replace ā, á, ǎ, and à with a.
Replace ē, é, ě, and è with e.
Replace ī, í, ǐ, and ì with i.
Replace ō, ó, ǒ, and ò with o.
Replace ū, ú, ǔ, and ù with u.
Replace ǖ, ǘ, ǚ, and ǜ with ü.
Replace Ā, Á, Ǎ, and À with A.
Replace Ē, É, Ě, and È with E.
Replace Ī, Í, Ǐ, and Ì with I.
Replace Ō, Ó, Ǒ, and Ò with O.
Replace Ū, Ú, Ǔ, and Ù with U.
Replace Ǖ, Ǘ, Ǚ, and Ǜ with Ü.
kent$ iconv -f utf8 -t ascii//TRANSLIT test1
Replace a, a, a, and a with a.
Replace e, e, e, and e with e.
Replace i, i, i, and i with i.
Replace o, o, o, and o with o.
Replace u, u, u, and u with u.
Replace u, u, u, and u with u.
Replace A, A, A, and A with A.
Replace E, E, E, and E with E.
Replace I, I, I, and I with I.
Replace O, O, O, and O with O.
Replace U, U, U, and U with U.
Replace U, U, U, and U with U.
Answered by potong
This might work for you:
sed -i 'y/āáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜĀÁǍÀĒÉĚÈĪÍǏÌŌÓǑÒŪÚǓÙǕǗǙǛ/aaaaeeeeiiiioooouuuuüüüüAAAAEEEEIIIIOOOOUUUUÜÜÜÜ/' file
Answered by Fedir RYKHTIK
I like iconv, as it handles all accent variations:
cat non-ascii.txt | iconv -f utf8 -t ascii//TRANSLIT//IGNORE > ascii.txt
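Since iconv writes to standard output rather than editing the file, the result has to go to a new file. A minimal sketch of that pattern follows. The function name to_ascii_copy and the temporary-file naming are assumptions for illustration, not part of any answer here. What //TRANSLIT substitutes, and whether iconv reports failure for characters it cannot approximate, varies between iconv implementations and locales, so check the output on your own system.

```shell
#!/bin/sh
# Hypothetical helper (not from the answers): to_ascii_copy SRC DST
# Transliterates SRC (assumed to be UTF-8) to ASCII.
# DST is only replaced when iconv reports success.
to_ascii_copy() {
    src=$1
    dst=$2
    tmp="$dst.tmp.$$"   # assumed temp-file naming: DST plus the shell's PID
    if iconv -f UTF-8 -t ASCII//TRANSLIT "$src" > "$tmp"; then
        mv "$tmp" "$dst"
    else
        # Leave DST untouched and clean up the partial output.
        rm -f "$tmp"
        return 1
    fi
}
```

It would be called as to_ascii_copy non-ascii.txt ascii.txt. Because the output goes through a temporary file, passing the same name for both arguments should also work as an in-place conversion.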
Answered by ktf
This is what the tr(1) command is for. For example:
tr 'āáǎàēéěèīíǐì...' 'aaaaeeeeiii...' <infile >outfile
You may have to check/change your LANG environment variable to match the character set being used.
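Whether a tool sees ā as one character or as several bytes depends on the file's encoding, and some tr implementations (GNU coreutils among them) work byte by byte regardless of LANG. One way to check what is actually in the data is to dump the bytes with od. The sketch below assumes a POSIX shell and od. The helper name octal_bytes is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the bytes of a string as octal numbers.
octal_bytes() {
    # Unquoted on purpose: word splitting collapses od's column padding.
    set -- $(printf '%s' "$1" | od -An -to1)
    echo "$*"
}

# U+00E4 (a with diaeresis) encoded as UTF-8 is two bytes.
octal_bytes "$(printf '\303\244')"   # prints: 303 244
# The same letter in ISO-8859-1 is a single byte.
octal_bytes "$(printf '\344')"       # prints: 344
```

If a character from the file shows up as two or more bytes, a byte-oriented tool needs a different approach than a one-to-one character map.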
Answered by hungnv
You can use something like this:
sed -e 's/[àa]/a/g;s/[??]/o/g;s/[í,ì]/i/g;s/[ê,?]/e/g'
Just add more characters to [..] as needed.
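Bracket expressions such as [àá] only behave as intended when sed runs in a locale that matches the file's encoding. One possible variation, sketched below under the assumption that the file is UTF-8 and that sed supports -E (GNU, BSD and busybox sed do), spells out each character as an alternative instead, so every pattern is a literal byte sequence. The function name strip_pinyin_tones is hypothetical, and the character set is the one listed in the question.

```shell
#!/bin/sh
# Hypothetical helper: reads stdin, writes stdout.
# Each alternative is matched as a literal sequence, so the result
# does not depend on how the locale groups bytes into characters.
strip_pinyin_tones() {
    sed -E \
        -e 's/ā|á|ǎ|à/a/g' -e 's/Ā|Á|Ǎ|À/A/g' \
        -e 's/ē|é|ě|è/e/g' -e 's/Ē|É|Ě|È/E/g' \
        -e 's/ī|í|ǐ|ì/i/g' -e 's/Ī|Í|Ǐ|Ì/I/g' \
        -e 's/ō|ó|ǒ|ò/o/g' -e 's/Ō|Ó|Ǒ|Ò/O/g' \
        -e 's/ū|ú|ǔ|ù/u/g' -e 's/Ū|Ú|Ǔ|Ù/U/g' \
        -e 's/ǖ|ǘ|ǚ|ǜ/ü/g' -e 's/Ǖ|Ǘ|Ǚ|Ǜ/Ü/g'
}

printf 'mā má mǎ mà\n' | strip_pinyin_tones   # prints: ma ma ma ma
```

To change a file rather than a stream, the same -e expressions can be passed to sed -i where that option is available.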
Answered by Rich Traube
You can use man iso_8859_1 (or your char set) or od -bc to identify the octal representation of the diacritic. Then use gawk to do the replacing.
{ gsub(/\344/,"a"); print }
This replaces ä with a.
Answered by Fred
#!/bin/bash
# Text to convert, taken from the first argument.
INPUT="$1"
declare -a acc
declare -a noa
acc=('$' '?¨' '?a' '??' 'à' 'á' '?' '?' '?' '?' '?' '?' 'è' 'é' 'ê' '?' 'ì' 'í' '?' '?' 'D' '?' 'ò' 'ó' '?' '?' '?' '?' 'ù' 'ú' '?' 'ü' 'Y' '?' 'à' 'á' 'a' '?' '?' '?' '?' '?' 'è' 'é' 'ê' '?' 'ì' 'í' '?' '?' '?' 'ò' 'ó' '?' '?' '?' '?' 'ù' 'ú' '?' 'ü' 'y' '?' 'ā' 'ā' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' 'ē' 'ē' '?' '?' '?' '?' '?' '?' 'ě' 'ě' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' 'ī' 'ī' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' 'ń' '?' '?' '?' 'ň' '?' 'ō' 'ō' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' 'ū' 'ū' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' '?' 'ǎ' 'ǎ' 'ǐ' 'ǐ' 'ǒ' 'ǒ' 'ǔ' 'ǔ' 'ǖ' 'ǖ' 'ǘ' 'ǘ' 'ǚ' 'ǚ' 'ǜ' 'ǜ' '?' '?' '?' '?' '?' '?');
noa=('S' 'e' 'e' 'e' 'A' 'A' 'A' 'A' 'A' 'A' 'AE' 'C' 'E' 'E' 'E' 'E' 'I' 'I' 'I' 'I' 'D' 'N' 'O' 'O' 'O' 'O' 'O' 'O' 'U' 'U' 'U' 'U' 'Y' 's' 'a' 'a' 'a' 'a' 'a' 'a' 'ae' 'c' 'e' 'e' 'e' 'e' 'i' 'i' 'i' 'i' 'n' 'o' 'o' 'o' 'o' 'o' 'o' 'u' 'u' 'u' 'u' 'y' 'y' 'A' 'a' 'A' 'a' 'A' 'a' 'C' 'c' 'C' 'c' 'C' 'c' 'C' 'c' 'D' 'd' 'D' 'd' 'E' 'e' 'E' 'e' 'E' 'e' 'E' 'e' 'E' 'e' 'G' 'g' 'G' 'g' 'G' 'g' 'G' 'g' 'H' 'h' 'H' 'h' 'I' 'i' 'I' 'i' 'I' 'i' 'I' 'i' 'I' 'i' 'IJ' 'ij' 'J' 'j' 'K' 'k' 'L' 'l' 'L' 'l' 'L' 'l' 'L' 'l' 'l' 'l' 'N' 'n' 'N' 'n' 'N' 'n' 'n' 'O' 'o' 'O' 'o' 'O' 'o' 'OE' 'oe' 'R' 'r' 'R' 'r' 'R' 'r' 'S' 's' 'S' 's' 'S' 's' 'S' 's' 'T' 't' 'T' 't' 'T' 't' 'U' 'u' 'U' 'u' 'U' 'u' 'U' 'u' 'U' 'u' 'U' 'u' 'W' 'w' 'Y' 'y' 'Y' 'Z' 'z' 'Z' 'z' 'Z' 'z' 's' 'f' 'O' 'o' 'U' 'u' 'A' 'a' 'I' 'i' 'O' 'o' 'U' 'u' 'U' 'u' 'U' 'u' 'U' 'u' 'U' 'u' 'A' 'a' 'AE' 'ae' 'O' 'o');
i=0
length=${#INPUT}
OUTPUT=""
# Walk the input one character at a time.
while [[ $i -lt $length ]]; do
  char=${INPUT:$i:1}
  #echo $i:$char
  j=0
  # Look the character up in acc; if found, take the entry at the same index in noa.
  for letter in "${acc[@]}"
  do
    if [[ "$letter" == "$char" ]]; then
      char="${noa[$j]}"
    fi
    ((j++))
  done
  ((i++))
  OUTPUT=$OUTPUT$char
done
echo "$OUTPUT"
Answered by Bruno
This may not work, because your locale must be set!
Use locale to set LC_ALL, for example:
export LC_ALL=en_US.iso88591
Note that the full list of locales is available through:
locale -a
Answered by Thiago Mata
If you, like me, need to replace the accents only in certain places in your text, you can do that using this kind of regex:
echo '{"doNotReplaceKey":"báb?gêjírù","replaceValueKey":"báb?gêjírù","anotherNotReplaceKey":"báb?gêjírù"}' \
| sed -e ':a;s/replaceValueKey":"\([a-zA-Z0-9 -_]*\)[áaà??]/replaceValueKey":"\1a/g;ta' \
| sed -e ':a;s/replaceValueKey":"\([a-zA-Z0-9 -_]*\)[éêè?]/replaceValueKey":"\1e/g;ta' \
| sed -e ':a;s/replaceValueKey":"\([a-zA-Z0-9 -_]*\)[í?ì?]/replaceValueKey":"\1i/g;ta' \
| sed -e ':a;s/replaceValueKey":"\([a-zA-Z0-9 -_]*\)[ó?ò??]/replaceValueKey":"\1o/g;ta' \
| sed -e ':a;s/replaceValueKey":"\([a-zA-Z0-9 -_]*\)[ú?ùü]/replaceValueKey":"\1u/g;ta'
Output
{"doNotReplaceKey":"báb?gêjírù","replaceValueKey":"babogejiru","anotherNotReplaceKey":"báb?gêjírù"}