Python UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 34: unexpected end of data

Notice: this page is derived from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/24004278/

Time: 2020-08-19 03:49:04  Source: igfitidea

UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 34: unexpected end of data

Tags: python, utf-8, character-encoding, decoding

Asked by user3701032

I'm trying to write a scraper, but I'm having issues with encoding. When I tried to copy the string I was looking for into my text file, Python 2.7 told me it didn't recognize the encoding, despite there being no special characters. I don't know if that's useful info.

My code looks like this:

from urllib import FancyURLopener
import os

class MyOpener(FancyURLopener):  # spoofs a real browser on Windows
   version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'

print "What is the webaddress?"
webaddress = raw_input("8::>")

print "Folder Name?"
foldername = raw_input("8::>")

if not os.path.exists(foldername):
    os.makedirs(foldername)

def urlpuller(start, page):
   while page[start]!= '"':
      start += 1
   close = start
   while page[close]!='"':
      close += 1
   return page[start:close]

myopener = MyOpener()

response = myopener.open(webaddress)
site = response.read()

nexturl = ''
counter = 0

while(nexturl!=webaddress):
   counter += 1
   start = 0

   for i in range(len(site)-35):
       if site[i:i+35].decode('utf-8') == u'<img id="imgSized" class="slideImg"':
         start = i + 40
         break
   else:
      print "Something's broken, chief. Error = 1"

   next = 0

   for i in range(start, 8, -1):
      if site[i:i+8] == u'<a href=':
         next = i
         break
   else:
      print "Something's broken, chief. Error = 2"

   nexturl = urlpuller(next, site)

   myopener.retrieve(urlpuller(start,site),foldername+'/'+foldername+str(counter)+'.jpg')

print("Retrieval of "+foldername+" completed.")

When I try to run it using the site I'm using, it returns the error:

Traceback (most recent call last):
  File "yada/yadayada/Python/scraper.py", line 37, in <module>
    if site[i:i+35].decode('utf-8') == u'<img id="imgSized" class="slideImg"':
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 34: unexpected end of data

When pointed at http://google.com, it worked just fine.

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

The page declares the charset shown above, but when I try to decode using utf-8, as you can see, it does not work.

Any suggestions?

Accepted answer by Martin Konecny

site[i:i+35].decode('utf-8')

You cannot slice the received bytes at arbitrary offsets and then ask Python to decode each slice as UTF-8. UTF-8 is a multibyte encoding: a single character takes anywhere from 1 to 4 bytes (the original design allowed up to 6, but RFC 3629 limits it to 4). If a slice ends partway through a character, as site[i:i+35] does when byte 34 is the lead byte 0xc3, Python raises the "unexpected end of data" error.
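
A minimal sketch of the failure described above, written for Python 3 (in the question's Python 2.7 the downloaded page is already a byte string). The sample bytes and the names data, chunk, page, and needle are made up for illustration. The point is that a fixed-width byte window can end in the middle of a multibyte character, whereas comparing bytes with bytes never needs a decode at all.

```python
# 34 ASCII bytes followed by "é", which UTF-8 encodes as the two bytes 0xC3 0xA9.
data = ("x" * 34 + "\u00e9").encode("utf-8")

chunk = data[:35]  # a 35-byte window: it ends on 0xC3, the first half of "é"
try:
    chunk.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc  # the incomplete character at the end of the window is rejected

# Comparing bytes with bytes avoids decoding partial characters altogether.
needle = b'<img id="imgSized" class="slideImg"'
page = b"<p>\xc3\xa9</p>" + needle + b' src="a.jpg">'  # hypothetical page
offset = page.find(needle)  # byte offset of the tag, or -1 if absent
```

Searching the raw bytes this way is probably the smallest change to the original loop; decoding the whole response once and searching the resulting text is the other common option.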

Look into a tool that has this built in for you. BeautifulSoup and lxml are two options.
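
BeautifulSoup and lxml are third-party packages, so as a dependency-free sketch of the same idea this example swaps in the standard library's html.parser module instead (Python 3). It is a stand-in, not what the answer names. The class name SlideImageFinder and the sample markup are hypothetical; the id and class values come from the question.

```python
from html.parser import HTMLParser


class SlideImageFinder(HTMLParser):
    """Collects the src of every <img id="imgSized"> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)  # attrs arrives as a list of (name, value) pairs
        if tag == "img" and attributes.get("id") == "imgSized":
            self.sources.append(attributes.get("src"))


# Made-up markup standing in for the downloaded page.
raw = (
    '<a href="/next">'
    '<img id="imgSized" class="slideImg" src="/pics/\u00e91.jpg">'
    "</a>"
).encode("utf-8")

finder = SlideImageFinder()
finder.feed(raw.decode("utf-8"))  # decode the whole document once, then parse
```

After feed returns, finder.sources holds the image URLs, with no manual index arithmetic over the page.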

Answer by Daniel

Instead of your for-loop, do something like:

start = site.decode('utf-8').find('<img id="imgSized" class="slideImg"') + 40
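
One caveat with the one-liner: find returns -1 when the marker is absent, so adding 40 would silently yield 39. Below is a small sketch of the same approach with that case handled, assuming Python 3 and a hypothetical helper name find_marker_end; the offset of 40 is taken from the question's code. The returned index counts characters in the decoded text, so any later slicing should use the decoded text rather than the raw bytes.

```python
MARKER = '<img id="imgSized" class="slideImg"'


def find_marker_end(raw_bytes, skip=40):
    """Return (text, index), where index is `skip` characters past the marker start.

    index is None when the marker is not found. Hypothetical helper.
    """
    text = raw_bytes.decode("utf-8")  # decode the complete response once
    position = text.find(MARKER)      # -1 means "not found"
    if position == -1:
        return text, None
    return text, position + skip


# Made-up page content containing a non-ASCII character before the marker.
sample = ("<p>caf\u00e9</p>" + MARKER + ' src="http://example.com/1.jpg">').encode("utf-8")
text, start = find_marker_end(sample)
```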

Answer by ssareen

Open the CSV file in Sublime Text and choose "Save with Encoding" -> UTF-8.
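
The same conversion can be scripted. This is a minimal sketch for Python 3 that assumes the file's current encoding is known; the helper name resave_as_utf8 and the cp1252 source encoding are assumptions for illustration. Reading with the wrong source encoding will often produce garbled text rather than an error, so it is worth checking the result.

```python
import os
import tempfile


def resave_as_utf8(path, source_encoding):
    """Rewrite `path` in place as UTF-8, reading it as `source_encoding`."""
    # newline="" leaves the file's line endings untouched in both directions.
    with open(path, "r", encoding=source_encoding, newline="") as handle:
        content = handle.read()
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(content)


# Demo on a throwaway file written in cp1252 (an assumed legacy encoding).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(demo_path, "wb") as handle:
    handle.write("name,city\nZo\u00eb,K\u00f6ln\n".encode("cp1252"))

resave_as_utf8(demo_path, "cp1252")

with open(demo_path, "rb") as handle:
    converted = handle.read()  # now valid UTF-8 bytes
```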