Python: How to fix "UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 29815: character maps to <undefined>"?

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must follow the same license and attribute the original authors (not me). Original question: http://stackoverflow.com/questions/49562499/

How to fix ''UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 29815: character maps to <undefined>''?

Tags: python, unicode, file-io, sqlite, decode

Asked by user3443027

At the moment, I am trying to get a Python 3 program to do some manipulations with a text file filled with information, through the Spyder IDE/GUI. However, when trying to read the file I get the following error:

  File "<ipython-input-13-d81e1333b8cd>", line 77, in <module>
    parser(f)

  File "<ipython-input-13-d81e1333b8cd>", line 18, in parser
    data = infile.read()

  File "C:\ProgramData\Anaconda3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 29815: character maps to <undefined>

The code of the program is as follows:

import os

os.getcwd()

import glob
import re
import sqlite3
import csv

def parser(file):

    # Open a TXT file. Store all articles in a list. Each article is an item
    # of the list. Split articles based on the location of such string as
    # 'Document PRN0000020080617e46h00461'

    articles = []
    with open(file, 'r') as infile:
        data = infile.read()
    start = re.search(r'\n HD\n', data).start()
    for m in re.finditer(r'Document [a-zA-Z0-9]{25}\n', data):
        end = m.end()
        a = data[start:end].strip()
        a = '\n   ' + a
        articles.append(a)
        start = end

    # In each article, find all used Intelligence Indexing field codes. Extract
    # content of each used field code, and write to a CSV file.

    # All field codes (order matters)
    fields = ['HD', 'CR', 'WC', 'PD', 'ET', 'SN', 'SC', 'ED', 'PG', 'LA', 'CY', 'LP',
              'TD', 'CT', 'RF', 'CO', 'IN', 'NS', 'RE', 'IPC', 'IPD', 'PUB', 'AN']

    for a in articles:
        used = [f for f in fields if re.search(r'\n   ' + f + r'\n', a)]
        unused = [[i, f] for i, f in enumerate(fields) if not re.search(r'\n   ' + f + r'\n', a)]
        fields_pos = []
        for f in used:
            f_m = re.search(r'\n   ' + f + r'\n', a)
            f_pos = [f, f_m.start(), f_m.end()]
            fields_pos.append(f_pos)
        obs = []
        n = len(used)
        for i in range(0, n):
            used_f = fields_pos[i][0]
            start = fields_pos[i][2]
            if i < n - 1:
                end = fields_pos[i + 1][1]
            else:
                end = len(a)
            content = a[start:end].strip()
            obs.append(content)
        for f in unused:
            obs.insert(f[0], '')
        obs.insert(0, file.split('/')[-1].split('.')[0])  # insert Company ID, e.g., GVKEY
        # print(obs)
        cur.execute('''INSERT INTO articles
                       (id, hd, cr, wc, pd, et, sn, sc, ed, pg, la, cy, lp, td, ct, rf,
                       co, ina, ns, re, ipc, ipd, pub, an)
                       VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?,
                       ?, ?, ?, ?, ?, ?, ?, ?)''', obs)

# Write to SQLITE
conn = sqlite3.connect('factiva.db')
with conn:
    cur = conn.cursor()
    cur.execute('DROP TABLE IF EXISTS articles')
    # Mirror all field codes except changing 'IN' to 'INC' because it is an invalid name
    cur.execute('''CREATE TABLE articles
                   (nid integer primary key, id text, hd text, cr text, wc text, pd text,
                   et text, sn text, sc text, ed text, pg text, la text, cy text, lp text,
                   td text, ct text, rf text, co text, ina text, ns text, re text, ipc text,
                   ipd text, pub text, an text)''')
    for f in glob.glob('*.txt'):
        print(f)
        parser(f)

# Write to CSV to feed Stata
with open('factiva.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    with conn:
        cur = conn.cursor()
        cur.execute('SELECT * FROM articles WHERE hd IS NOT NULL')
        colname = [desc[0] for desc in cur.description]
        writer.writerow(colname)
        for obs in cur.fetchall():
            writer.writerow(obs)

Answered by Giacomo Catenazzi

As you can see from https://en.wikipedia.org/wiki/Windows-1252, the byte 0x9D is not defined in CP1252.
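
For example, decoding just that byte with the cp1252 codec reproduces the same error:

b'\x9d'.decode('cp1252')  # raises UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 0: character maps to <undefined>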

The "error" is in your open call: you do not specify an encoding, so Python (on Windows only) falls back to a system-specific encoding, cp1252 in your case. In general, if you read a file that may not have been created on the same machine, it is really better to specify the encoding explicitly.

I recommend specifying an encoding on the open call you use to write the CSV as well. It is really better to be explicit.

I do not know the original file format, but adding encoding='utf-8' to open is usually a good choice (and it is the default on Linux and macOS).
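
A minimal sketch of that suggestion applied to the two open calls from the question, assuming the input files really are UTF-8 (read_text and write_csv are just illustrative helper names, not from the original code):

import csv

def read_text(path):
    # explicit encoding instead of the platform default (cp1252 on Windows)
    with open(path, 'r', encoding='utf-8') as infile:
        return infile.read()

def write_csv(rows, colnames):
    # be explicit when writing the CSV as well
    with open('factiva.csv', 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(colnames)
        writer.writerows(rows)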

Answered by Romano

The above did not work for me; adding errors='ignore' instead worked wonders!
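
Applied to the read in the question's parser, that looks something like the sketch below (a hypothetical helper; note that errors='ignore' silently drops any bytes the codec cannot decode, so some characters may be lost):

def read_text_lossy(path):
    # undecodable bytes such as 0x9d are dropped instead of raising UnicodeDecodeError
    with open(path, 'r', errors='ignore') as infile:
        return infile.read()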

Answered by Martin Kinuthia

Add an encoding in the open statement. For example:

f=open("filename.txt","r",encoding='utf-8')

Answered by AnksG

I fixed this by changing the format of the file from .csv to .xlsx.

Answered by pkanda

errors='ignore' solved my headache in the following task: finding the word "coma" in all .txt files in a directory and its subdirectories.

import os
# backslashes doubled so they are not treated as escape sequences in the Windows path
rootdir = 'K:\\0\\000.THU.EEG.nedc_tuh_eeg\\000edf.01_tcp_ar\\01_tcp_ar\\'
for folder, dirs, files in os.walk(rootdir):
    for file in files:
        if file.endswith('.txt'):
            fullpath = os.path.join(folder, file)
            with open(fullpath, 'r', errors='ignore') as f:
                for line in f:
                    if "coma" in line:
                        print(fullpath)
                        break