Original source: http://stackoverflow.com/questions/5428844/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
How to replace all '0xa0' chars with a ' ' in a bunch of text files?
Asked by alvas
I've been trying to mass-convert a bunch of text files to UTF-8 in Python, and this error keeps popping up. Is there a way to replace the offending characters with a Python script or some bash commands? I used this code:
writer = codecs.open(os.path.join(wrd, 'dict.en'), 'wtr', 'utf-8')
for infile in glob.glob(os.path.join(wrd,'*.txt')):
    print infile
    for line in open(infile):
        writer.write(line.encode('utf-8'))
and got these sorts of errors:
Traceback (most recent call last):
  File "dicting.py", line 30, in <module>
    writer.write(line2.encode('utf-8'))
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 216: unexpected code byte
Answered by ncoghlan
OK, first point: your output file is set up to automatically encode text written to it as UTF-8, so don't include an explicit encode('utf-8') method call when passing arguments to the write() method.
So the first thing to try is to simply use the following in your inner loop:
writer.write(line)
If that doesn't work, then the problem is almost certainly that, as others have noted, you aren't decoding your input file properly.
Taking a wild guess and assuming that your input files are encoded in cp1252, you could try the following in the inner loop as a quick test:
for line in codecs.open(infile, 'r', 'cp1252'):
    writer.write(line)
Minor point: 'wtr' is a nonsensical mode string (as write access implies read access). Simplify it to either 'wt' or even just 'w'.
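Putting the points above together, a minimal sketch of the whole conversion might look like the following. It is written for Python 3 (the question's code is Python 2), the function name merge_to_utf8 and its parameters are made up for illustration, and cp1252 is still only a guess at the input encoding.

```python
import glob
import io
import os


def merge_to_utf8(src_dir, out_name="dict.en", src_encoding="cp1252"):
    """Concatenate every *.txt file in src_dir into one UTF-8 file.

    src_encoding is an assumption about the inputs; change it if the
    files turn out to use a different encoding.
    """
    out_path = os.path.join(src_dir, out_name)
    # The writer encodes to UTF-8 itself, so it is handed decoded text
    # and no explicit .encode() call is needed.
    with io.open(out_path, "w", encoding="utf-8") as writer:
        for infile in sorted(glob.glob(os.path.join(src_dir, "*.txt"))):
            # Decode each input explicitly instead of relying on a default.
            with io.open(infile, "r", encoding=src_encoding) as reader:
                for line in reader:
                    writer.write(line)
    return out_path
```

In cp1252 the byte 0xa0 decodes to U+00A0 (no-break space), so it ends up in the output as a valid UTF-8 sequence rather than raising an error.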
Answered by geekosaur
Did you omit some code there? You're reading into line but trying to re-encode line2.
In any case, you're going to have to tell Python what encoding the input file is; if you don't know, then you'll have to open it raw and perform substitutions without help of a codec.
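As a rough illustration of the "open it raw" route, the sketch below reads a file as bytes and swaps each 0xa0 byte for an ASCII space without involving any codec. The helper name replace_nbsp_bytes is hypothetical. It assumes a single-byte encoding such as cp1252 or latin-1; in a file that is already UTF-8, 0xa0 can occur as part of a multi-byte sequence, and a blind byte replacement would corrupt it.

```python
def replace_nbsp_bytes(path):
    """Replace every raw 0xa0 byte in the file at path with an ASCII space.

    Returns the number of bytes that were replaced.
    Assumes a single-byte encoding; not safe for UTF-8 input.
    """
    # Binary mode: no decoding happens, so no UnicodeDecodeError is possible.
    with open(path, "rb") as f:
        data = f.read()
    cleaned = data.replace(b"\xa0", b" ")
    # Rewrite the file in place with the cleaned bytes.
    with open(path, "wb") as f:
        f.write(cleaned)
    return data.count(b"\xa0")
```

Because the file is overwritten in place, it may be wise to try this on a copy first.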
Answered by Andreas Jung
Please be serious - a simple replace() operation will do the job:
line = line.replace(chr(0xa0), ' ')
In addition, codecs.open() accepts an 'errors' parameter that controls how conversion errors are handled. Please read up on it yourself.
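For what it's worth, here is a small sketch of the 'errors' parameter in use, written for Python 3. The helper name read_lenient is invented for illustration. With errors='replace', each undecodable byte becomes U+FFFD (the Unicode replacement character) instead of raising; with errors='ignore' it is silently dropped.

```python
import codecs


def read_lenient(path, encoding="utf-8", errors="replace"):
    """Read a whole file as text, tolerating bytes that do not decode.

    errors is passed straight to codecs.open(); 'strict' (the default
    there) raises UnicodeDecodeError, while 'replace' and 'ignore' do not.
    """
    with codecs.open(path, "r", encoding, errors=errors) as f:
        return f.read()
```

The built-in open() in Python 3 takes the same errors argument. Note that this hides the bad bytes rather than fixing them, so decoding with the correct input encoding is still preferable when it is known.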

