UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 13: ordinal not in range(128)

Original source: http://stackoverflow.com/questions/18649512/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me) on Stack Overflow.
Asked by user2602812
I'm using NLTK to perform kmeans clustering on my text file in which each line is considered as a document. So for example, my text file is something like this:
belong finger death punch
hasty
mike hasty walls jericho
jägermeister rules
rules bands follow performing jägermeister stage
approach
Now the demo code I'm trying to run is this:
import sys

import numpy
from nltk.cluster import KMeansClusterer, GAAClusterer, euclidean_distance
import nltk.corpus
from nltk import decorators
import nltk.stem

stemmer_func = nltk.stem.EnglishStemmer().stem
stopwords = set(nltk.corpus.stopwords.words('english'))

@decorators.memoize
def normalize_word(word):
    return stemmer_func(word.lower())

def get_words(titles):
    words = set()
    for title in job_titles:
        for word in title.split():
            words.add(normalize_word(word))
    return list(words)

@decorators.memoize
def vectorspaced(title):
    title_components = [normalize_word(word) for word in title.split()]
    return numpy.array([
        word in title_components and not word in stopwords
        for word in words], numpy.short)

if __name__ == '__main__':
    filename = 'example.txt'
    if len(sys.argv) == 2:
        filename = sys.argv[1]

    with open(filename) as title_file:
        job_titles = [line.strip() for line in title_file.readlines()]

        words = get_words(job_titles)

        # cluster = KMeansClusterer(5, euclidean_distance)
        cluster = GAAClusterer(5)
        cluster.cluster([vectorspaced(title) for title in job_titles if title])

        # NOTE: This is inefficient, cluster.classify should really just be
        # called when you are classifying previously unseen examples!
        classified_examples = [
            cluster.classify(vectorspaced(title)) for title in job_titles
        ]

        for cluster_id, title in sorted(zip(classified_examples, job_titles)):
            print cluster_id, title
(which can also be found here)
The error I receive is this:
Traceback (most recent call last):
  File "cluster_example.py", line 40, in <module>
    words = get_words(job_titles)
  File "cluster_example.py", line 20, in get_words
    words.add(normalize_word(word))
  File "<string>", line 1, in <lambda>
  File "/usr/local/lib/python2.7/dist-packages/nltk/decorators.py", line 183, in memoize
    result = func(*args)
  File "cluster_example.py", line 14, in normalize_word
    return stemmer_func(word.lower())
  File "/usr/local/lib/python2.7/dist-packages/nltk/stem/snowball.py", line 694, in stem
    word = (word.replace(u"\u2019", u"\x27")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 13: ordinal not in range(128)
What is happening here?
Accepted answer by icktoofay
The file is being read as a bunch of str objects, but they should be unicode objects. Python tries to convert implicitly, but fails. Change:
job_titles = [line.strip() for line in title_file.readlines()]
to explicitly decode the str objects to unicode (here assuming UTF-8):
job_titles = [line.decode('utf-8').strip() for line in title_file.readlines()]
It could also be solved by importing the codecs module and using codecs.open rather than the built-in open.
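For example, a minimal sketch of the codecs.open variant (the filename is the question's example.txt, and the file is assumed to be UTF-8 encoded):

import codecs

# codecs.open decodes the bytes to unicode as the file is read
with codecs.open('example.txt', 'r', encoding='utf-8') as title_file:
    job_titles = [line.strip() for line in title_file]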
Answer by Siva S

You can also try this (note that it only works in Python 2: reload is a builtin there, and sys.setdefaultencoding is gone in Python 3):
import sys
reload(sys)
sys.setdefaultencoding('utf8')
Answer by Aminah Nuraini
You can try this before using the job_titles string:
source = unicode(job_titles, 'utf-8')
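Since unicode() takes a single byte string, applying this idea to the question's job_titles list means decoding each element; a minimal sketch (Python 2, assuming UTF-8 input):

# Decode every raw line before any stemming or clustering
job_titles = [unicode(line, 'utf-8') for line in job_titles]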
Answer by Georgi Karadzhov
For me there was a problem with the terminal encoding. Adding UTF-8 to .bashrc solved the problem:
export LC_CTYPE=en_US.UTF-8
Don't forget to reload .bashrc afterwards:
source ~/.bashrc
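To confirm the new locale is picked up, one way is to check what Python reports as the preferred encoding (locale.getpreferredencoding is part of the standard library):

import locale

# Should report 'UTF-8' once LC_CTYPE is set correctly
print(locale.getpreferredencoding())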
Answer by uestcfei
This works fine for me.
f = open(file_path, 'r+', encoding="utf-8")
You can add the encoding parameter to ensure the file is decoded as 'utf-8'.
Note: this method works fine in Python 3; it does not work in Python 2.7, whose built-in open has no encoding parameter (use io.open there instead).
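If the file might contain bytes that are not valid UTF-8, Python 3's open also accepts an errors parameter; 'replace' and 'ignore' are standard error handlers:

# Undecodable bytes become U+FFFD instead of raising UnicodeDecodeError
f = open(file_path, 'r+', encoding='utf-8', errors='replace')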
Answer by iamigham
For Python 3, the default encoding is 'utf-8'. In case of any problem, the following steps are suggested in the csv documentation: https://docs.python.org/2/library/csv.html#csv-examples
Create a function:

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')

Then use the function inside the reader, e.g.:

csv_reader = csv.reader(utf_8_encoder(unicode_csv_data))
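Put together, a minimal self-contained sketch in the Python 2 style of the linked docs (the filename and the print loop are assumptions for illustration):

import csv
import io

def utf_8_encoder(unicode_csv_data):
    # Re-encode each unicode line to UTF-8 bytes for the csv module
    for line in unicode_csv_data:
        yield line.encode('utf-8')

# io.open yields unicode lines; the generator feeds bytes to csv.reader
with io.open('example.csv', 'r', encoding='utf-8') as f:
    for row in csv.reader(utf_8_encoder(f)):
        print(row)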
Answer by John Greene
To find any and all Unicode-related errors, use the following command; the pattern [^\x00-\x7f] matches every byte outside the 7-bit ASCII range:
grep -r -P '[^\x00-\x7f]' /etc/apache2 /etc/letsencrypt /etc/nginx
Found mine in
/etc/letsencrypt/options-ssl-nginx.conf: # The following CSP directives don't use default-src as
Using shed, I found the offending sequence. It turned out to be an editor mistake: the C2 A0 byte pair at both ends of the dump below is the UTF-8 encoding of a non-breaking space (U+00A0) where an ordinary space was intended.
00008099: C2 194 302 11000010
00008100: A0 160 240 10100000
00008101: d 64 100 144 01100100
00008102: e 65 101 145 01100101
00008103: f 66 102 146 01100110
00008104: a 61 097 141 01100001
00008105: u 75 117 165 01110101
00008106: l 6C 108 154 01101100
00008107: t 74 116 164 01110100
00008108: - 2D 045 055 00101101
00008109: s 73 115 163 01110011
00008110: r 72 114 162 01110010
00008111: c 63 099 143 01100011
00008112: C2 194 302 11000010
00008113: A0 160 240 10100000
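If grep -P is unavailable (BSD grep, for example), a minimal Python sketch that reports non-ASCII bytes and their offsets (the path is a placeholder):

# Scan a file for bytes outside the 7-bit ASCII range
with open('/etc/letsencrypt/options-ssl-nginx.conf', 'rb') as f:
    data = bytearray(f.read())

for offset, byte in enumerate(data):
    if byte > 0x7f:
        print('offset %d: 0x%02X' % (offset, byte))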
Answer by Ganesh Kharad
Use open(fn, 'rb').read().decode('utf-8') instead of just open(fn).read().
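Applied to the question's script, a minimal sketch (example.txt is the question's own filename):

# Read raw bytes, decode once, then split into lines
text = open('example.txt', 'rb').read().decode('utf-8')
job_titles = [line.strip() for line in text.splitlines()]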
Answer by io big
Python 3.x or higher:
Load the file as a byte stream:

def read_body():
    # the bare 'return' in the original snippet implies a function body
    body = ''
    for lines in open('website/index.html', 'rb'):
        decodedLine = lines.decode('utf-8')
        body = body + decodedLine.strip()
    return body

Or use a global setting:

import io
import sys

# Re-wrap stdout so printed text is encoded as UTF-8
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')
Answer by loretoparisi
On Ubuntu 18.04 using Python 3.6, I solved the problem by doing both of the following:
with open(filename, encoding="utf-8") as lines:
and, if you are running the tool from the command line:
export LC_ALL=C.UTF-8
Note that if you are on Python 2.7 you have to handle this differently. First you have to set the default encoding:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
and then, to load the file, you must use io.open to set the encoding:
import io
with io.open(filename, 'r', encoding='utf-8') as lines:
You still need to export the environment variable:
export LC_ALL=C.UTF-8