Python MongoDB InvalidDocument: Cannot encode object
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the CC BY-SA license, link the original URL, and attribute it to the original authors (not me) at StackOverflow.
Original URL: http://stackoverflow.com/questions/33524517/
MongoDB InvalidDocument: Cannot encode object
Asked by Codious-JR
I am using Scrapy to scrape blogs and then store the data in MongoDB. At first I got an InvalidDocument exception, so the obvious conclusion was that the data was not in the right encoding. So before persisting the object, in my MongoPipeline I check whether the document is in 'utf-8 strict', and only then do I try to persist the object to MongoDB. BUT I still get InvalidDocument exceptions, which is annoying.
This is the code for my MongoPipeline object, which persists the scraped items to MongoDB:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
import pymongo
import sys, traceback

from scrapy.exceptions import DropItem
from crawler.items import BlogItem, CommentItem


class MongoPipeline(object):

    collection_name = 'master'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'posts')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        if type(item) is BlogItem:
            try:
                if 'url' in item:
                    item['url'] = item['url'].encode('utf-8', 'strict')
                if 'domain' in item:
                    item['domain'] = item['domain'].encode('utf-8', 'strict')
                if 'title' in item:
                    item['title'] = item['title'].encode('utf-8', 'strict')
                if 'date' in item:
                    item['date'] = item['date'].encode('utf-8', 'strict')
                if 'content' in item:
                    item['content'] = item['content'].encode('utf-8', 'strict')
                if 'author' in item:
                    item['author'] = item['author'].encode('utf-8', 'strict')
            except:  # catch *all* exceptions
                e = sys.exc_info()[0]
                spider.logger.critical("ERROR ENCODING %s", e)
                traceback.print_exc(file=sys.stdout)
                raise DropItem("Error encoding BLOG %s" % item['url'])

            if 'comments' in item:
                comments = item['comments']
                item['comments'] = []
                try:
                    for comment in comments:
                        if 'date' in comment:
                            comment['date'] = comment['date'].encode('utf-8', 'strict')
                        if 'author' in comment:
                            comment['author'] = comment['author'].encode('utf-8', 'strict')
                        if 'content' in comment:
                            comment['content'] = comment['content'].encode('utf-8', 'strict')
                        item['comments'].append(comment)
                except:  # catch *all* exceptions
                    e = sys.exc_info()[0]
                    spider.logger.critical("ERROR ENCODING COMMENT %s", e)
                    traceback.print_exc(file=sys.stdout)

        self.db[self.collection_name].insert(dict(item))
        return item
And still I get the following exception:
au coeur de l\u2019explosion de la bulle Internet n\u2019est probablement pas \xe9tranger au succ\xe8s qui a suivi. Mais franchement, c\u2019est un peu court comme argument !Ce que je sais dire, compte tenu de ce qui pr\xe9c\xe8de, c\u2019est quelles sont les conditions pour r\xe9ussir si l\u2019on est vraiment contraint de rester en France. Ce sont des sujets que je d\xe9velopperai dans un autre article.',
'date': u'2012-06-27T23:21:25+00:00',
'domain': 'reussir-sa-boite.fr',
'title': u'Peut-on encore entreprendre en France ?\t\t\t ',
'url': 'http://www.reussir-sa-boite.fr/peut-on-encore-entreprendre-en-france/'}
Traceback (most recent call last):
File "h:\program files\anaconda\lib\site-packages\twisted\internet\defer.py", line 588, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "H:\PDS\BNP\crawler\crawler\pipelines.py", line 76, in process_item
self.db[self.collection_name].insert(dict(item))
File "h:\program files\anaconda\lib\site-packages\pymongo\collection.py", line 409, in insert
gen(), check_keys, self.uuid_subtype, client)
InvalidDocument: Cannot encode object: {'author': 'Arnaud Lemasson',
'content': 'Tellement vrai\xe2\x80\xa6 Il faut vraiment \xc3\xaatre motiv\xc3\xa9 aujourd\xe2\x80\x99hui pour monter sa bo\xc3\xaete. On est pr\xc3\xa9lev\xc3\xa9 de partout, je ne pense m\xc3\xaame pas \xc3\xa0 embaucher, cela me co\xc3\xbbterait bien trop cher. Bref, 100% d\xe2\x80\x99accord avec vous. Le probl\xc3\xa8me, je ne vois pas comment cela pourrait changer avec le gouvernement actuel\xe2\x80\xa6 A moins que si, j\xe2\x80\x99ai pu lire il me semble qu\xe2\x80\x99ils avaient en t\xc3\xaate de r\xc3\xa9duire l\xe2\x80\x99IS pour les petites entreprises et de l\xe2\x80\x99augmenter pour les grandes\xe2\x80\xa6 A vtheitroad',
'date': '2012-06-27T23:21:25+00:00'}
2015-11-04 15:29:15 [scrapy] INFO: Closing spider (finished)
2015-11-04 15:29:15 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 259,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 252396,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 11, 4, 14, 29, 15, 701000),
'log_count/DEBUG': 2,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2015, 11, 4, 14, 29, 13, 191000)}
Another funny thing: following the comment by @eLRuLL, I tried the following:
>>> s = "Tellement vrai\xe2\x80\xa6 Il faut vraiment \xc3\xaatre motiv\xc3\xa9 aujourd\xe2\x80\x99hui pour monter sa bo\xc3\xaete. On est pr\xc3\xa9lev\xc3\xa9 de partout, je ne pense m\xc3\xaame pas \xc3\xa0 embaucher, cela me"
>>> s
'Tellement vrai\xe2\x80\xa6 Il faut vraiment \xc3\xaatre motiv\xc3\xa9 aujourd\xe2\x80\x99hui pour monter sa bo\xc3\xaete. On est pr\xc3\xa9lev\xc3\xa9 de partout, je ne pense m\xc3\xaame pas \xc3\xa0 embaucher, cela me'
>>> se = s.encode("utf8", "strict")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 14: ordinal not in range(128)
>>> se = s.encode("utf-8", "strict")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 14: ordinal not in range(128)
>>> s.decode()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 14: ordinal not in range(128)
So my question is: if this text cannot be encoded, why is the try/except in my MongoPipeline not catching this exception? Only objects that don't raise any exception should be appended to item['comments'], right?
Accepted answer by Codious-JR
Finally I figured it out. The problem was not with the encoding; it was with the structure of the documents.
I had based my pipeline on the standard MongoPipeline example, which does not deal with nested Scrapy items.
What I am doing is: a BlogItem with fields like "url" ..., plus comments = [CommentItem].
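For context, here is a minimal sketch of what the item definitions could look like; the exact field lists beyond url and comments are assumptions, not taken from the original project:

import scrapy

class CommentItem(scrapy.Item):
    author = scrapy.Field()
    date = scrapy.Field()
    content = scrapy.Field()

class BlogItem(scrapy.Item):
    url = scrapy.Field()
    domain = scrapy.Field()
    title = scrapy.Field()
    date = scrapy.Field()
    content = scrapy.Field()
    author = scrapy.Field()
    comments = scrapy.Field()  # holds a list of CommentItem objects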
So my BlogItem has a list of CommentItems. The problem arises when persisting the object to the database, which I do with:
self.db[self.collection_name].insert(dict(item))
Here I am converting the BlogItem to a dict, but I am not converting the CommentItems in the list. And because the traceback displays the CommentItem much like a dict, it did not occur to me that the problematic object was not a dict!
So finally, the way to fix this problem is to change the line that appends the comment to the comment list, as such:
item['comments'].append(dict(comment))
Now MongoDB considers it a valid document.
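More generally, dict(item) only converts the top level. Here is a hedged sketch of a recursive converter that would handle arbitrarily nested items (serialize is a hypothetical helper, not part of the original pipeline):

import scrapy

def serialize(value):
    # Recursively convert Scrapy items and lists into plain dicts/lists
    # so that pymongo can encode the whole document
    if isinstance(value, scrapy.Item):
        return {key: serialize(val) for key, val in value.items()}
    if isinstance(value, list):
        return [serialize(element) for element in value]
    return value

# usage inside process_item:
# self.db[self.collection_name].insert(serialize(item))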
Lastly, for the last part, where I ask why I get an exception on the Python console and not in the script: the reason is that I was working in the Python console, which only supports ASCII, and hence the error.
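For what it's worth, the same error can be reproduced outside the console too: in Python 2, calling .encode() on a byte string (str) first decodes it implicitly with the default ASCII codec, and that implicit decode is what raises the UnicodeDecodeError. A small sketch, assuming Python 2:

# s is already a UTF-8 encoded byte string (type str), not unicode
s = 'Tellement vrai\xe2\x80\xa6'

try:
    # str.encode() implicitly does s.decode('ascii').encode('utf-8')
    s.encode('utf-8', 'strict')
except UnicodeDecodeError as e:
    print e  # 'ascii' codec can't decode byte 0xe2 in position 14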
Answered by eLRuLL
First, when you do "somestring".encode(...), it isn't changing "somestring"; it returns a new encoded string. So you should use something like:
item['author'] = item['author'].encode('utf-8', 'strict')
and the same for the other fields.
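A quick illustration of the point (a sketch, assuming Python 2 and a unicode field value):

# encode() does not mutate the string; it returns a new byte string
author = u'Arnaud Lemasson'
encoded = author.encode('utf-8', 'strict')

assert isinstance(author, unicode)  # the original is untouched
assert isinstance(encoded, str)     # the encoded bytes are a new object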
Answered by duhaime
I got this error when running a query
db.collection.find({'attr': {'$gte': 20}})
and some records in collection had a non-numeric value for attr.
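If you run into this variant, one way to locate the offending records is to query by BSON type. A sketch using pymongo, assuming MongoDB 3.4+ (where $type accepts the 'number' alias) and placeholder database/collection names:

from pymongo import MongoClient

client = MongoClient()
coll = client['mydb']['collection']  # placeholder names

# find documents where attr exists but is not numeric
cursor = coll.find({'attr': {'$exists': True, '$not': {'$type': 'number'}}})
for doc in cursor:
    print doc['_id'], repr(doc['attr'])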