Notice: this page is a translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/38518023/

Time: 2020-09-03 18:31:02  Source: igfitidea

UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 3131: invalid start byte

Tags: json, python-2.7, utf-8, ascii, python-unicode

Asked by wannabhappy

I am trying to read Twitter data from a JSON file using Python 2.7.12.

The code I used is:

    import json
    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')

    def get_tweets_from_file(file_name):
        tweets = []
        with open(file_name, 'rw') as twitter_file:
            for line in twitter_file:
                if line != '\r\n':
                    line = line.encode('ascii', 'ignore')
                    tweet = json.loads(line)
                    if u'info' not in tweet.keys():
                        tweets.append(tweet)
        return tweets

The result I got:

    Traceback (most recent call last):
      File "twitter_project.py", line 100, in <module>
        main()                  
      File "twitter_project.py", line 95, in main
        tweets = get_tweets_from_dir(src_dir, dest_dir)
      File "twitter_project.py", line 59, in get_tweets_from_dir
        new_tweets = get_tweets_from_file(file_name)
      File "twitter_project.py", line 71, in get_tweets_from_file
        line = line.encode('ascii', 'ignore')
    UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 3131: invalid start byte

I went through all the answers to similar issues and came up with this code, and it worked last time. I have no clue why it isn't working now... I would appreciate any help!

Answered by Sung-Ho_Ahn

In my case (macOS), there was a .DS_Store file in my data folder. It is a hidden, auto-generated file, and it caused the issue. I was able to fix the problem after removing it.
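
As an alternative to deleting the file by hand, the directory listing can be filtered before anything is parsed. This is only a sketch of how the directory loop might look: the helper name list_data_files and the '.json' suffix are assumptions for illustration and do not come from the original code.

```python
import os


def list_data_files(src_dir, suffix='.json'):
    """Return sorted paths of regular, non-hidden files in src_dir ending with suffix."""
    paths = []
    for name in sorted(os.listdir(src_dir)):
        path = os.path.join(src_dir, name)
        # Skip hidden entries such as .DS_Store, and anything that is not a file.
        if name.startswith('.') or not os.path.isfile(path):
            continue
        if name.endswith(suffix):
            paths.append(path)
    return paths
```

Each returned path could then be passed to get_tweets_from_file, so binary files that the operating system drops into the folder are never handed to the JSON parser.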

Answered by Alastair McCormack

It doesn't help that you have sys.setdefaultencoding('utf-8'), which is confusing things further. It's a nasty hack and you need to remove it from your code. See https://stackoverflow.com/a/34378962/1554386 for more information.

The error is happening because line is a byte string and you're calling encode(). encode() only makes sense if the string is Unicode, so Python tries to convert it to Unicode first using the default encoding, which in your case is UTF-8, but should be ASCII. Either way, 0x80 is not valid as ASCII, nor as the start of a UTF-8 sequence, so the conversion fails.

0x80 is valid in some character sets. In windows-1252/cp1252 it's € (the euro sign).
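
To make this concrete, here is a small sketch that decodes the single byte 0x80 with the three codecs mentioned above. The variable names are illustrative only, and the byte is taken in isolation rather than from the real tweet file.

```python
raw = b'\x80'  # the single byte the traceback complains about

# Neither ASCII nor UTF-8 accepts this byte on its own.
for codec in ('ascii', 'utf-8'):
    try:
        raw.decode(codec)
    except UnicodeDecodeError as exc:
        print('%s failed: %s' % (codec, exc.reason))

# windows-1252 (alias cp1252) maps 0x80 to U+20AC, the euro sign.
euro = raw.decode('cp1252')
print(euro == u'\u20ac')
```

A successful cp1252 decode does not prove the file really is windows-1252; it only shows that this byte has a meaning there.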

The trick here is to understand the encoding of your data all the way through your code. At the moment, you're leaving too much up to chance. Unicode String types are a handy Python feature that allows you to decode encoded Strings and forget about the encoding until you need to write or transmit the data.
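
The pattern described here is: decode once where the bytes come in, work only with Unicode text in between, and encode once where the data goes out. A minimal sketch, assuming the incoming bytes are windows-1252 (the sample bytes and variable names are made up for illustration):

```python
# Bytes as they might arrive from a windows-1252 source (an assumption for this sketch).
incoming = b'price: 5\x80'

text = incoming.decode('cp1252')   # decode once, at the input boundary
shout = text.upper()               # in between, work with Unicode text only
outgoing = shout.encode('utf-8')   # encode once, at the output boundary
```

Between the two boundaries the code never needs to know which encoding the data originally used.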

Use the io module to open the file in text mode and decode the file as it is read - no more .decode()! You need to make sure the encoding of your incoming data is consistent. You can either re-encode it externally or change the encoding in your script. Here I've set the encoding to windows-1252.

import io
import json

with io.open(file_name, 'r', encoding='windows-1252') as twitter_file:
    for line in twitter_file:
        # line is now a <type 'unicode'>
        tweet = json.loads(line)

The io module also provides universal newlines. This means \r\n sequences are detected as newlines, so you don't have to watch for them.
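
Putting these pieces together, the function from the question might be rewritten roughly as follows. This is a sketch, not a drop-in fix: the encoding parameter and its windows-1252 default are assumptions that have to be checked against the real data, and blank-line skipping is added here because json.loads rejects an empty string.

```python
import io
import json


def get_tweets_from_file(file_name, encoding='windows-1252'):
    """Parse one JSON object per line, skipping blank lines and 'info' records."""
    tweets = []
    # io.open decodes while reading; in its default newline mode a
    # '\r\n' line ending is translated to '\n'.
    with io.open(file_name, 'r', encoding=encoding) as twitter_file:
        for line in twitter_file:
            line = line.strip()
            if not line:
                continue  # blank line between records
            tweet = json.loads(line)
            if u'info' not in tweet:
                tweets.append(tweet)
    return tweets
```

Neither sys.setdefaultencoding() nor line.encode('ascii', 'ignore') is needed here, because every line is already Unicode text when it reaches json.loads.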

Answered by Midhun Mohan

The error occurs when you try to read a tweet containing a sentence like

"@Mike http:\www.google.com \A8&^)((&() how are&^%()( you "

which cannot be read as an ordinary string; you are supposed to read it as a raw string instead. Converting it to a raw string still gives an error, though, so I suggest you read the JSON file something like this:

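import codecs
import json

keys = []      # tweet ids
fulldata = []  # tweet texts
# codecs.open always reads in binary mode, so the 'U' flag is dropped here
with codecs.open('tweetfile', 'r', 'utf-8') as f:
    for line in f:
        data = json.loads(line)
        print(data["tweet"])
        keys.append(data["id"])
        fulldata.append(data["tweet"])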

This will load the data from the JSON file.

You can also write it to a csv using Pandas.

import csv
import pandas as pd

output = pd.DataFrame(data={"tweet": fulldata, "id": keys})
# csv.QUOTE_ALL is the constant behind the literal 1: quote every field
output.to_csv("tweets.csv", index=False, quoting=csv.QUOTE_ALL)

Then read from the CSV to avoid the encoding and decoding problem.

Hope this helps you solve your problem.

Midhun
