UnicodeDecodeError in Python when reading a file, how to ignore the error and jump to the next line?

Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/24616678/



Tags: python, python-3.x, file, utf-8

Asked by Chicoscience

I have to read a text file into Python. The file encoding is:


file -bi test.csv 
text/plain; charset=us-ascii

This is a third-party file, and I get a new one every day, so I would rather not change it. The file contains non-ASCII characters, such as ?. I need to read the lines using Python, and I can afford to ignore any line that contains a non-ASCII character.


My problem is that when I read the file in Python, I get a UnicodeDecodeError as soon as a line containing a non-ASCII character is reached, and I cannot read the rest of the file.


Is there a way to avoid this? If I try this:


import codecs

fileHandle = codecs.open("test.csv", encoding='utf-8')
try:
    # the first undecodable line raises UnicodeDecodeError and the loop never resumes
    for line in fileHandle:
        print(line, end="")
except UnicodeDecodeError:
    pass

then when the error is raised, the for loop ends and I cannot read the remainder of the file. I want to skip the line that causes the error and continue. I would rather not make any changes to the input file, if possible.


Is there any way to do this? Thank you very much.


Accepted answer by Martijn Pieters

Your file doesn't appear to use the UTF-8 encoding. It is important to use the correct codec when opening a file.


You can tell open() how to treat decoding errors, with the errors keyword:


errors is an optional string that specifies how encoding and decoding errors are to be handled (this cannot be used in binary mode). A variety of standard error handlers are available, though any error handling name that has been registered with codecs.register_error() is also valid. The standard names are:

  • 'strict' to raise a ValueError exception if there is an encoding error. The default value of None has the same effect.
  • 'ignore' ignores errors. Note that ignoring encoding errors can lead to data loss.
  • 'replace' causes a replacement marker (such as '?') to be inserted where there is malformed data.
  • 'surrogateescape' will represent any incorrect bytes as code points in the Unicode Private Use Area ranging from U+DC80 to U+DCFF. These private code points will then be turned back into the same bytes when the surrogateescape error handler is used when writing data. This is useful for processing files in an unknown encoding.
  • 'xmlcharrefreplace' is only supported when writing to a file. Characters not supported by the encoding are replaced with the appropriate XML character reference &#nnn;.
  • 'backslashreplace' (also only supported when writing) replaces unsupported characters with Python's backslashed escape sequences.

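For illustration, here is a short demo (my addition, not part of the quoted documentation or the original answer) of how a single invalid byte behaves under a few of these handlers:

data = b"caf\xe9"   # 0xE9 is 'é' in Latin-1, but invalid on its own in UTF-8

print(data.decode("utf-8", errors="ignore"))    # 'caf'  - the bad byte is dropped
print(data.decode("utf-8", errors="replace"))   # 'caf�' - a U+FFFD marker is inserted

s = data.decode("utf-8", errors="surrogateescape")
print(ascii(s))                                 # 'caf\udce9' - the byte becomes a lone surrogate
# Encoding with the same handler turns the surrogate back into the original byte:
assert s.encode("utf-8", errors="surrogateescape") == data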

Opening the file with anything other than 'strict' ('ignore', 'replace', etc.) will then let you read the file without exceptions being raised.

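As a minimal sketch of the skip-a-line behaviour the question asks for (this snippet is mine, not from the original answer): opening with errors="replace" marks every undecodable byte with U+FFFD, so any line containing that marker can simply be skipped. "test.csv" is the file name used in the question.

with open("test.csv", encoding="utf-8", errors="replace") as f:
    for line in f:
        # errors="replace" substitutes U+FFFD for each byte that could not be
        # decoded, so its presence flags a line the question wants to skip.
        if "\ufffd" in line:
            continue
        print(line, end="")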

Note that decoding takes place per buffered block of data, not per textual line. If you must detect errors on a line-by-line basis, use the surrogateescape handler and test each line read for codepoints in the surrogate range:


import re

_surrogates = re.compile(r"[\uDC80-\uDCFF]")

def detect_decoding_errors_line(l, _s=_surrogates.finditer):
    """Return decoding errors in a line of text

    Works with text lines decoded with the surrogateescape
    error handler.

    Returns a list of (pos, byte) tuples

    """
    # DC80 - DCFF encode bad bytes 80-FF
    return [(m.start(), bytes([ord(m.group()) - 0xDC00]))
            for m in _s(l)]

E.g.


with open("test.csv", encoding="utf8", errors="surrogateescape") as f:
    for i, line in enumerate(f, 1):
        errors = detect_decoding_errors_line(line)
        if errors:
            print(f"Found errors on line {i}:")
            for (col, b) in errors:
                print(f" {col + 1:2d}: {b[0]:02x}")

Take into account that not all decoding errors can be recovered from gracefully. While UTF-8 is designed to be robust in the face of small errors, other multi-byte encodings such as UTF-16 and UTF-32 can't cope with dropped or extra bytes, which will then affect how accurately line separators can be located. The above approach can then result in the remainder of the file being treated as one long line. If the file is big enough, that can in turn lead to a MemoryError exception if the 'line' is large enough.

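If, as in the question, the goal is simply to skip lines that contain decoding errors rather than report them, the loop from the example above can continue past them (this variation is my addition, not part of the original answer):

with open("test.csv", encoding="utf8", errors="surrogateescape") as f:
    for line in f:
        # Skip any line that contains undecodable bytes; keep the rest.
        if detect_decoding_errors_line(line):
            continue
        print(line, end="")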