Python MongoDB InvalidDocument: Cannot encode object
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the CC BY-SA license, link the original URL, and attribute it to the original authors (not me) at StackOverflow.
Original URL: http://stackoverflow.com/questions/33524517/
MongoDB InvalidDocument: Cannot encode object
Asked by Codious-JR
I am using Scrapy to scrape blogs and then store the data in MongoDB. At first I got an InvalidDocument exception, so the obvious conclusion was that the data was not in the right encoding. So before persisting the object, in my MongoPipeline I check whether the document is in 'utf-8 strict', and only then do I try to persist the object to MongoDB. BUT I still get InvalidDocument exceptions, which is annoying.
This is the code for my MongoPipeline object, which persists the scraped items to MongoDB:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
import pymongo
import sys, traceback

from scrapy.exceptions import DropItem
from crawler.items import BlogItem, CommentItem


class MongoPipeline(object):

    collection_name = 'master'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'posts')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        if type(item) is BlogItem:
            try:
                if 'url' in item:
                    item['url'] = item['url'].encode('utf-8', 'strict')
                if 'domain' in item:
                    item['domain'] = item['domain'].encode('utf-8', 'strict')
                if 'title' in item:
                    item['title'] = item['title'].encode('utf-8', 'strict')
                if 'date' in item:
                    item['date'] = item['date'].encode('utf-8', 'strict')
                if 'content' in item:
                    item['content'] = item['content'].encode('utf-8', 'strict')
                if 'author' in item:
                    item['author'] = item['author'].encode('utf-8', 'strict')
            except:  # catch *all* exceptions
                e = sys.exc_info()[0]
                spider.logger.critical("ERROR ENCODING %s", e)
                traceback.print_exc(file=sys.stdout)
                raise DropItem("Error encoding BLOG %s" % item['url'])

            if 'comments' in item:
                comments = item['comments']
                item['comments'] = []
                try:
                    for comment in comments:
                        if 'date' in comment:
                            comment['date'] = comment['date'].encode('utf-8', 'strict')
                        if 'author' in comment:
                            comment['author'] = comment['author'].encode('utf-8', 'strict')
                        if 'content' in comment:
                            comment['content'] = comment['content'].encode('utf-8', 'strict')
                        item['comments'].append(comment)
                except:  # catch *all* exceptions
                    e = sys.exc_info()[0]
                    spider.logger.critical("ERROR ENCODING COMMENT %s", e)
                    traceback.print_exc(file=sys.stdout)

        self.db[self.collection_name].insert(dict(item))
        return item
And still I get the following exception:
au coeur de l\u2019explosion de la bulle Internet n\u2019est probablement pas \xe9tranger au succ\xe8s qui a suivi. Mais franchement, c\u2019est un peu court comme argument !Ce que je sais dire, compte tenu de ce qui pr\xe9c\xe8de, c\u2019est quelles sont les conditions pour r\xe9ussir si l\u2019on est vraiment contraint de rester en France. Ce sont des sujets que je d\xe9velopperai dans un autre article.',
'date': u'2012-06-27T23:21:25+00:00',
'domain': 'reussir-sa-boite.fr',
'title': u'Peut-on encore entreprendre en France ?\t\t\t ',
'url': 'http://www.reussir-sa-boite.fr/peut-on-encore-entreprendre-en-france/'}
Traceback (most recent call last):
File "h:\program files\anaconda\lib\site-packages\twisted\internet\defer.py", line 588, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "H:\PDS\BNP\crawler\crawler\pipelines.py", line 76, in process_item
self.db[self.collection_name].insert(dict(item))
File "h:\program files\anaconda\lib\site-packages\pymongo\collection.py", line 409, in insert
gen(), check_keys, self.uuid_subtype, client)
InvalidDocument: Cannot encode object: {'author': 'Arnaud Lemasson',
'content': 'Tellement vrai\xe2\x80\xa6 Il faut vraiment \xc3\xaatre motiv\xc3\xa9 aujourd\xe2\x80\x99hui pour monter sa bo\xc3\xaete. On est pr\xc3\xa9lev\xc3\xa9 de partout, je ne pense m\xc3\xaame pas \xc3\xa0 embaucher, cela me co\xc3\xbbterait bien trop cher. Bref, 100% d\xe2\x80\x99accord avec vous. Le probl\xc3\xa8me, je ne vois pas comment cela pourrait changer avec le gouvernement actuel\xe2\x80\xa6 A moins que si, j\xe2\x80\x99ai pu lire il me semble qu\xe2\x80\x99ils avaient en t\xc3\xaate de r\xc3\xa9duire l\xe2\x80\x99IS pour les petites entreprises et de l\xe2\x80\x99augmenter pour les grandes\xe2\x80\xa6 A vtheitroad',
'date': '2012-06-27T23:21:25+00:00'}
2015-11-04 15:29:15 [scrapy] INFO: Closing spider (finished)
2015-11-04 15:29:15 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 259,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 252396,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 11, 4, 14, 29, 15, 701000),
'log_count/DEBUG': 2,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2015, 11, 4, 14, 29, 13, 191000)}
Another funny thing: following the comment by @eLRuLL, I tried the following:
>>> s = "Tellement vrai\xe2\x80\xa6 Il faut vraiment \xc3\xaatre motiv\xc3\xa9 aujourd\xe2\x80\x99hui pour monter sa bo\xc3\xaete. On est pr\xc3\xa9lev\xc3\xa9 de partout, je ne pense m\xc3\xaame pas \xc3\xa0 embaucher, cela me"
>>> s
'Tellement vrai\xe2\x80\xa6 Il faut vraiment \xc3\xaatre motiv\xc3\xa9 aujourd\xe2\x80\x99hui pour monter sa bo\xc3\xaete. On est pr\xc3\xa9lev\xc3\xa9 de partout, je ne pense m\xc3\xaame pas \xc3\xa0 embaucher, cela me'
>>> se = s.encode("utf8", "strict")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 14: ordinal not in range(128)
>>> se = s.encode("utf-8", "strict")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 14: ordinal not in range(128)
>>> s.decode()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 14: ordinal not in range(128)
So my question is: if this text cannot be encoded, why is the try/except in my MongoPipeline not catching this exception? Only objects that don't raise any exception should be appended to item['comments'], right?
Accepted answer by Codious-JR
Finally I figured it out. The problem was not with the encoding; it was with the structure of the documents.
I had based my pipeline on the standard MongoPipeline example, which does not deal with nested Scrapy items.
What I am doing is: a BlogItem with fields like "url" ..., plus comments = [CommentItem].
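For context, here is a minimal sketch of what the item definitions could look like; the exact field lists beyond url and comments are assumptions, not taken from the original project:

import scrapy

class CommentItem(scrapy.Item):
    author = scrapy.Field()
    date = scrapy.Field()
    content = scrapy.Field()

class BlogItem(scrapy.Item):
    url = scrapy.Field()
    domain = scrapy.Field()
    title = scrapy.Field()
    date = scrapy.Field()
    content = scrapy.Field()
    author = scrapy.Field()
    comments = scrapy.Field()  # holds a list of CommentItem objects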
So my BlogItem has a list of CommentItems. The problem arises when persisting the object to the database, which I do with:
self.db[self.collection_name].insert(dict(item))
Here I am converting the BlogItem to a dict, but I am not converting the CommentItems in the list. And because the traceback displays the CommentItem much like a dict, it did not occur to me that the problematic object was not a dict!
So finally, the way to fix this problem is to change the line that appends the comment to the comment list, as such:
item['comments'].append(dict(comment))
Now MongoDB considers it a valid document.
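More generally, dict(item) only converts the top level. Here is a hedged sketch of a recursive converter that would handle arbitrarily nested items (serialize is a hypothetical helper, not part of the original pipeline):

import scrapy

def serialize(value):
    # Recursively convert Scrapy items and lists into plain dicts/lists
    # so that pymongo can encode the whole document
    if isinstance(value, scrapy.Item):
        return {key: serialize(val) for key, val in value.items()}
    if isinstance(value, list):
        return [serialize(element) for element in value]
    return value

# usage inside process_item:
# self.db[self.collection_name].insert(serialize(item))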
Lastly, for the last part, where I ask why I get an exception on the Python console and not in the script: the reason is that I was working in the Python console, which only supports ASCII, and hence the error.
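For what it's worth, the same error can be reproduced outside the console too: in Python 2, calling .encode() on a byte string (str) first decodes it implicitly with the default ASCII codec, and that implicit decode is what raises the UnicodeDecodeError. A small sketch, assuming Python 2:

# s is already a UTF-8 encoded byte string (type str), not unicode
s = 'Tellement vrai\xe2\x80\xa6'

try:
    # str.encode() implicitly does s.decode('ascii').encode('utf-8')
    s.encode('utf-8', 'strict')
except UnicodeDecodeError as e:
    print e  # 'ascii' codec can't decode byte 0xe2 in position 14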
Answered by eLRuLL
First, when you do "somestring".encode(...), it isn't changing "somestring"; it returns a new encoded string. So you should use something like:
item['author'] = item['author'].encode('utf-8', 'strict')
and the same for the other fields.
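A quick illustration of the point (a sketch, assuming Python 2 and a unicode field value):

# encode() does not mutate the string; it returns a new byte string
author = u'Arnaud Lemasson'
encoded = author.encode('utf-8', 'strict')

assert isinstance(author, unicode)  # the original is untouched
assert isinstance(encoded, str)     # the encoded bytes are a new object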
Answered by duhaime
I got this error when running a query
db.collection.find({'attr': {'$gte': 20}})
and some records in collection had a non-numeric value for attr.
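If you run into this variant, one way to locate the offending records is to query by BSON type. A sketch using pymongo, assuming MongoDB 3.4+ (where $type accepts the 'number' alias) and placeholder database/collection names:

from pymongo import MongoClient

client = MongoClient()
coll = client['mydb']['collection']  # placeholder names

# find documents where attr exists but is not numeric
cursor = coll.find({'attr': {'$exists': True, '$not': {'$type': 'number'}}})
for doc in cursor:
    print doc['_id'], repr(doc['attr'])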