MongoDB/NoSQL:保留文档更改历史
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/3507624/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
MongoDB/NoSQL: Keeping Document Change History
提问by Phil Sandler
A fairly common requirement in database applications is to track changes to one or more specific entities in a database. I've heard this called row versioning, a log table or a history table (I'm sure there are other names for it). There are a number of ways to approach it in an RDBMS--you can write all changes from all source tables to a single table (more of a log) or have a separate history table for each source table. You also have the option to either manage the logging in application code or via database triggers.
数据库应用程序中一个相当普遍的要求是跟踪对数据库中一个或多个特定实体的更改。我听说过这称为行版本控制、日志表或历史表(我确定还有其他名称)。在 RDBMS 中有多种方法可以处理它——您可以将所有源表的所有更改写入单个表(更多是日志),或者为每个源表创建一个单独的历史表。您还可以选择在应用程序代码中或通过数据库触发器管理日志记录。
I'm trying to think through what a solution to the same problem would look like in a NoSQL/document database (specifically MongoDB), and how it would be solved in a uniform way. Would it be as simple as creating version numbers for documents, and never overwriting them? Creating separate collections for "real" vs. "logged" documents? How would this affect querying and performance?
我正在尝试思考在 NoSQL/文档数据库(特别是 MongoDB)中解决同一问题的方法,以及如何以统一的方式解决它。它会像为文档创建版本号一样简单,并且永远不会覆盖它们吗?为“真实”和“记录”文档创建单独的集合?这将如何影响查询和性能?
Anyway, is this a common scenario with NoSQL databases, and if so, is there a common solution?
无论如何,这是否是 NoSQL 数据库的常见场景,如果是,是否有通用解决方案?
采纳答案by Niels van der Rest
Good question, I was looking into this myself as well.
好问题,我自己也在研究这个。
Create a new version on each change
在每次更改时创建一个新版本
I came across the Versioning moduleof the Mongoid driver for Ruby. I haven't used it myself, but from what I could find, it adds a version number to each document. Older versions are embedded in the document itself. The major drawback is that the entire document is duplicated on each change, which will result in a lot of duplicate content being stored when you're dealing with large documents. This approach is fine though when you're dealing with small-sized documents and/or don't update documents very often.
我遇到了用于 Ruby 的 Mongoid 驱动程序的版本控制模块。我自己没有使用过,但据我所知,它为每个文档添加了一个版本号。旧版本嵌入在文档本身中。主要缺点是每次更改都会复制整个文档,这将导致在处理大型文档时存储大量重复内容。但是当您处理小型文档和/或不经常更新文档时,这种方法很好。
Only store changes in a new version
仅在新版本中存储更改
Another approach would be to store only the changed fields in a new version. Then you can 'flatten' your history to reconstruct any version of the document. This is rather complex though, as you need to track changes in your model and store updates and deletes in a way that your application can reconstruct the up-to-date document. This might be tricky, as you're dealing with structured documents rather than flat SQL tables.
另一种方法是仅将更改的字段存储在新版本中。然后,您可以“展平”您的历史记录以重建文档的任何版本。但是,这相当复杂,因为您需要跟踪模型中的更改并以应用程序可以重建最新文档的方式存储更新和删除。这可能很棘手,因为您正在处理结构化文档而不是平面 SQL 表。
Store changes within the document
在文档中存储更改
Each field can also have an individual history. Reconstructing documents to a given version is much easier this way. In your application you don't have to explicitly track changes, but just create a new version of the property when you change its value. A document could look something like this:
每个字段也可以有单独的历史记录。通过这种方式将文档重建为给定版本要容易得多。在您的应用程序中,您不必明确跟踪更改,而只需在更改其值时创建该属性的新版本。文档可能如下所示:
{
_id: "4c6b9456f61f000000007ba6"
title: [
{ version: 1, value: "Hello world" },
{ version: 6, value: "Foo" }
],
body: [
{ version: 1, value: "Is this thing on?" },
{ version: 2, value: "What should I write?" },
{ version: 6, value: "This is the new body" }
],
tags: [
{ version: 1, value: [ "test", "trivial" ] },
{ version: 6, value: [ "foo", "test" ] }
],
comments: [
{
author: "joe", // Unversioned field
body: [
{ version: 3, value: "Something cool" }
]
},
{
author: "xxx",
body: [
{ version: 4, value: "Spam" },
{ version: 5, deleted: true }
]
},
{
author: "jim",
body: [
{ version: 7, value: "Not bad" },
{ version: 8, value: "Not bad at all" }
]
}
]
}
Marking part of the document as deleted in a version is still somewhat awkward though. You could introduce a state
field for parts that can be deleted/restored from your application:
不过,在版本中将文档的一部分标记为已删除仍然有些尴尬。您可以state
为可以从您的应用程序中删除/恢复的部件引入一个字段:
{
author: "xxx",
body: [
{ version: 4, value: "Spam" }
],
state: [
{ version: 4, deleted: false },
{ version: 5, deleted: true }
]
}
With each of these approaches you can store an up-to-date and flattened version in one collection and the history data in a separate collection. This should improve query times if you're only interested in the latest version of a document. But when you need both the latest version and historical data, you'll need to perform two queries, rather than one. So the choice of using a single collection vs. two separate collections should depend on how often your application needs the historical versions.
使用这些方法中的每一种,您都可以将最新的扁平化版本存储在一个集合中,并将历史数据存储在一个单独的集合中。如果您只对文档的最新版本感兴趣,这应该会缩短查询时间。但是当您同时需要最新版本和历史数据时,您将需要执行两个查询,而不是一个。因此,选择使用单个集合还是两个单独的集合应该取决于您的应用程序需要历史版本的频率。
Most of this answer is just a brain dump of my thoughts, I haven't actually tried any of this yet. Looking back on it, the first option is probably the easiest and best solution, unless the overhead of duplicate data is very significant for your application. The second option is quite complex and probably isn't worth the effort. The third option is basically an optimization of option two and should be easier to implement, but probably isn't worth the implementation effort unless you really can't go with option one.
这个答案的大部分只是我的想法的大脑转储,我还没有真正尝试过任何这些。回想起来,第一个选项可能是最简单和最好的解决方案,除非重复数据的开销对您的应用程序非常重要。第二种选择非常复杂,可能不值得付出努力。第三个选项基本上是选项二的优化,应该更容易实现,但可能不值得实施,除非你真的不能选择选项一。
Looking forward to feedback on this, and other people's solutions to the problem :)
期待对此的反馈,以及其他人对问题的解决方案:)
回答by Amala
We have partially implemented this on our site and we use the 'Store Revisions in a separate document" (and separate database). We wrote a custom function to return the diffs and we store that. Not so hard and can allow for automated recovery.
我们已经在我们的网站上部分实现了这一点,我们使用“在单独的文档中存储修订”(和单独的数据库)。我们编写了一个自定义函数来返回差异并存储它。没有那么难,可以允许自动恢复。
回答by Paul Taylor
Why not a variation on Store changes within the document?
为什么不在文档中更改 Store 更改?
Instead of storing versions against each key pair, the current key pairs in the document always represents the most recent state and a 'log' of changes is stored within a history array. Only those keys which have changed since creation will have an entry in the log.
文档中的当前密钥对始终表示最新状态,并且更改的“日志”存储在历史数组中,而不是针对每个密钥对存储版本。只有自创建以来发生更改的那些键才会在日志中记录。
{
_id: "4c6b9456f61f000000007ba6"
title: "Bar",
body: "Is this thing on?",
tags: [ "test", "trivial" ],
comments: [
{ key: 1, author: "joe", body: "Something cool" },
{ key: 2, author: "xxx", body: "Spam", deleted: true },
{ key: 3, author: "jim", body: "Not bad at all" }
],
history: [
{
who: "joe",
when: 20160101,
what: { title: "Foo", body: "What should I write?" }
},
{
who: "jim",
when: 20160105,
what: { tags: ["test", "test2"], comments: { key: 3, body: "Not baaad at all" }
}
]
}
回答by Paul Kar.
One can have a current NoSQL database and a historical NoSQL database. There will be a an nightly ETL ran everyday. This ETL will record every value with a timestamp, so instead of values it will always be tuples (versioned fields). It will only record a new value if there was a change made on the current value, saving space in the process. For example, this historical NoSQL database json file can look like so:
一个人可以拥有一个当前的 NoSQL 数据库和一个历史的 NoSQL 数据库。每天都会有一个夜间 ETL 运行。此 ETL 将记录每个带有时间戳的值,因此它始终是元组(版本化字段)而不是值。如果对当前值进行了更改,它只会记录一个新值,从而节省过程中的空间。例如,这个历史 NoSQL 数据库 json 文件可能如下所示:
{
_id: "4c6b9456f61f000000007ba6"
title: [
{ date: 20160101, value: "Hello world" },
{ date: 20160202, value: "Foo" }
],
body: [
{ date: 20160101, value: "Is this thing on?" },
{ date: 20160102, value: "What should I write?" },
{ date: 20160202, value: "This is the new body" }
],
tags: [
{ date: 20160101, value: [ "test", "trivial" ] },
{ date: 20160102, value: [ "foo", "test" ] }
],
comments: [
{
author: "joe", // Unversioned field
body: [
{ date: 20160301, value: "Something cool" }
]
},
{
author: "xxx",
body: [
{ date: 20160101, value: "Spam" },
{ date: 20160102, deleted: true }
]
},
{
author: "jim",
body: [
{ date: 20160101, value: "Not bad" },
{ date: 20160102, value: "Not bad at all" }
]
}
]
}
回答by Dash2TheDot
For users of Python (python 3+, and up of course) , there's HistoricalCollectionthat's an extension of pymongo's Collection object.
对于 Python 用户(当然是 Python 3+ 及以上),HistoricalCollection是 pymongo 的 Collection 对象的扩展。
Example from the docs:
文档中的示例:
from historical_collection.historical import HistoricalCollection
from pymongo import MongoClient
class Users(HistoricalCollection):
PK_FIELDS = ['username', ] # <<= This is the only requirement
# ...
users = Users(database=db)
users.patch_one({"username": "darth_later", "email": "[email protected]"})
users.patch_one({"username": "darth_later", "email": "[email protected]", "laser_sword_color": "red"})
list(users.revisions({"username": "darth_later"}))
# [{'_id': ObjectId('5d98c3385d8edadaf0bb845b'),
# 'username': 'darth_later',
# 'email': '[email protected]',
# '_revision_metadata': None},
# {'_id': ObjectId('5d98c3385d8edadaf0bb845b'),
# 'username': 'darth_later',
# 'email': '[email protected]',
# '_revision_metadata': None,
# 'laser_sword_color': 'red'}]
Full disclosure, I am the package author. :)
完全公开,我是包作者。:)