bash: How do I replace all "0xa0" characters with " " in a bunch of text files?

Question

Asked by alvas

I've been trying to mass-convert a bunch of text files to UTF-8 in Python, and this error keeps popping up. Is there a way to replace the offending characters with a Python script or some bash commands? I used this code:

writer = codecs.open(os.path.join(wrd, 'dict.en'), 'wtr', 'utf-8')
for infile in glob.glob(os.path.join(wrd,'*.txt')):
        print infile
        for line in open(infile):
                writer.write(line.encode('utf-8'))

and got these sorts of errors:

Traceback (most recent call last):
  File "dicting.py", line 30, in <module>
    writer.write(line2.encode('utf-8'))
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 216: unexpected code byte

Answer 1

Answered by ncoghlan

OK, first point: your output file is set to automatically encode text written to it as UTF-8, so don't include an explicit encode('utf-8') method call when passing arguments to the write() method.

So the first thing to try is to simply use the following in your inner loop:

writer.write(line)

If that doesn't work, then the problem is almost certainly that, as others have noted, you aren't decoding your input file properly.

Taking a wild guess and assuming that your input files are encoded in cp1252, as a quick test you could try the following in the inner loop:

for line in codecs.open(infile, 'r', 'cp1252'):
    writer.write(line)
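
For anyone on Python 3, the same idea might look roughly like the sketch below, using the built-in open() with an explicit encoding. The function name convert_to_utf8 is made up for illustration, and cp1252 is still only a guess about the input files. In cp1252 the byte 0xa0 decodes to U+00A0 (no-break space), which is swapped for a plain space here, as the question title asks.

```python
def convert_to_utf8(src_path, dst_path, src_encoding="cp1252"):
    """Re-encode one file as UTF-8, turning no-break spaces into plain spaces.

    src_encoding is an assumption about the input; change it if the
    files turn out to use a different legacy encoding.
    """
    # newline="" keeps the original line endings untouched on both sides.
    with open(src_path, "r", encoding=src_encoding, newline="") as reader, \
         open(dst_path, "w", encoding="utf-8", newline="") as writer:
        for line in reader:
            # After decoding, byte 0xa0 is the character U+00A0.
            writer.write(line.replace("\xa0", " "))
```

Decoding first and replacing afterwards means the substitution works on characters rather than raw bytes, so other non-ASCII text in the file is carried over intact.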

Minor point: 'wtr' is a nonsensical mode string (as write access implies read access). Simplify it to either 'wt' or even just 'w'.

Answer 2

Answered by geekosaur

Did you omit some code there? You're reading into line but trying to re-encode line2.

In any case, you're going to have to tell Python what encoding the input file is; if you don't know, then you'll have to open it raw and perform substitutions without help of a codec.
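
A rough Python 3 sketch of the "open it raw" route follows. The helper name replace_byte_a0 is hypothetical. It swaps the byte 0xa0 for an ASCII space without decoding anything, which is only safe if the file is in a single-byte encoding: in UTF-8 data 0xa0 can also appear as a continuation byte inside a multi-byte character, and a blind replacement would corrupt it.

```python
def replace_byte_a0(src_path, dst_path):
    """Copy a file byte for byte, replacing every 0xa0 byte with a space.

    No codec is involved; the data is treated as opaque bytes.
    Assumes the input is NOT already (partly) UTF-8.
    """
    with open(src_path, "rb") as reader:
        data = reader.read()
    with open(dst_path, "wb") as writer:
        writer.write(data.replace(b"\xa0", b" "))
```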

Answer 3

Answered by Andreas Jung

Please be serious - a simple replace() operation will do the job:

line = line.replace(chr(0xa0), '')

In addition, codecs.open() supports the 'errors' parameter for handling conversion errors. Please read up on it (yourself).
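
To illustrate the 'errors' parameter, here is a small Python 3 sketch. It uses the built-in open(), which accepts the same errors argument as codecs.open(). The function name read_lenient is invented for the example, and it assumes the file is meant to be UTF-8 with a few stray bytes in it.

```python
def read_lenient(path, errors="replace"):
    """Read a file as UTF-8 without raising on undecodable bytes.

    errors="replace" turns each bad byte into U+FFFD;
    errors="ignore" drops it silently.
    """
    with open(path, "r", encoding="utf-8", errors=errors, newline="") as reader:
        return reader.read()
```

Either setting loses information about the original byte, so this is more of a way to get past the exception than a real conversion.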
