针对树结构优化的 SQL
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/317322/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Optimized SQL for tree structures
提问by Seb Nilsson
How would you get tree-structured data from a database with the best performance? For example, say you have a folder-hierarchy in a database. Where the folder-database-row has ID, Nameand ParentIDcolumns.
您将如何从具有最佳性能的数据库中获取树结构数据?例如,假设您在数据库中有一个文件夹层次结构。其中文件夹数据库行具有ID、Name和ParentID列。
Would you use a special algorithm to get all the data at once, minimizing the amount of database-calls and process it in code?
您会使用一种特殊的算法一次获取所有数据,最大限度地减少数据库调用量并在代码中处理它吗?
Or would you use do many calls to the database and sort of get the structure done from the database directly?
或者您是否会使用对数据库的多次调用并直接从数据库中完成结构?
Maybe there are different answers based on x amount of database-rows, hierarchy-depth or whatever?
也许有基于 x 数量的数据库行、层次结构深度或其他什么的不同答案?
Edit: I use Microsoft SQL Server, but answers out of other perspectives are interesting too.
编辑:我使用 Microsoft SQL Server,但其他观点的答案也很有趣。
回答by Ned Batchelder
It really depends on how you are going to access the tree.
这实际上取决于您将如何访问该树。
One clever technique is to give every node a string id, where the parent's id is a predictable substring of the child. For example, the parent could be '01', and the children would be '0100', '0101', '0102', etc. This way you can select an entire subtree from the database at once with:
一种巧妙的技术是给每个节点一个字符串 id,其中父节点的 id 是子节点的可预测子字符串。例如,父节点可能是“01”,子节点可能是“0100”、“0101”、“0102”等。这样你就可以一次从数据库中选择整个子树:
SELECT * FROM treedata WHERE id LIKE '0101%';
Because the criterion is an initial substring, an index on the ID column would speed the query.
由于条件是初始子字符串,因此 ID 列上的索引会加快查询速度。
回答by Bernard Igiri
Out of all the ways to store a tree in a RDMS the most common are adjacency lists and nested sets. Nested sets are optimized for reads and can retrieve an entire tree in a single query. Adjacency lists are optimized for writes and can added to with in a simple query.
在 RDMS 中存储树的所有方法中,最常见的是邻接表和嵌套集。嵌套集针对读取进行了优化,可以在单个查询中检索整个树。邻接列表针对写入进行了优化,并且可以在一个简单的查询中添加。
With adjacency lists each node a has column that refers to the parent node or the child node (other links are possible). Using that you can build the hierarchy based on parent child relationships. Unfortunately unless you restrict your tree's depth you cannot pull the whole thing in one query and reading it is usually slower than updating it.
对于邻接列表,每个节点都有一个列,该列引用父节点或子节点(其他链接也是可能的)。使用它,您可以基于父子关系构建层次结构。不幸的是,除非您限制树的深度,否则您无法在一个查询中提取整个内容,并且阅读它通常比更新它慢。
With the nested set model the inverse is true, reading is fast and easy but updates get complex because you must maintain the numbering system. The nested set model encodes both parentage and sort order by enumerating all of the nodes using a preorder based numbering system.
对于嵌套集模型,情况正好相反,读取既快速又容易,但更新变得复杂,因为您必须维护编号系统。嵌套集模型通过使用基于预序的编号系统枚举所有节点来对亲子关系和排序顺序进行编码。
I've used the nested set model and while it is complex for read optimizing a large hierarchy it is worth it. Once you do a few exercises in drawing out the tree and numbering the nodes you should get the hang of it.
我使用了嵌套集模型,虽然读取优化大型层次结构很复杂,但值得。一旦你做了一些绘制树和给节点编号的练习,你就应该掌握它的窍门。
My research on this method started at this article: Managing Hierarchical Data in MySQL.
我对这种方法的研究始于这篇文章:Managing Hierarchical Data in MySQL。
回答by Bernard Igiri
In the product I work on we have some tree structures stored in SQL Server and use the technique mentioned above to store a node's hierarchy in the record. i.e.
在我开发的产品中,我们在 SQL Server 中存储了一些树结构,并使用上述技术将节点的层次结构存储在记录中。IE
tblTreeNode
TreeID = 1
TreeNodeID = 100
ParentTreeNodeID = 99
Hierarchy = ".33.59.99.100."
[...] (actual data payload for node)
Maintaining the the hierarchy is the tricky bit of course and makes use of triggers. But generating it on an insert/delete/move is never recursive, because the parent or child's hierarchy has all the information you need.
维护层次结构当然是一个棘手的问题,并且会使用触发器。但是在插入/删除/移动时生成它永远不会递归,因为父级或子级的层次结构具有您需要的所有信息。
you can get all of node's descendants thusly:
您可以这样获得所有节点的后代:
SELECT * FROM tblNode WHERE Hierarchy LIKE '%.100.%'
Here's the insert trigger:
这是插入触发器:
--Setup the top level if there is any
UPDATE T
SET T.TreeNodeHierarchy = '.' + CONVERT(nvarchar(10), T.TreeNodeID) + '.'
FROM tblTreeNode AS T
INNER JOIN inserted i ON T.TreeNodeID = i.TreeNodeID
WHERE (i.ParentTreeNodeID IS NULL) AND (i.TreeNodeHierarchy IS NULL)
WHILE EXISTS (SELECT * FROM tblTreeNode WHERE TreeNodeHierarchy IS NULL)
BEGIN
--Update those items that we have enough information to update - parent has text in Hierarchy
UPDATE CHILD
SET CHILD.TreeNodeHierarchy = PARENT.TreeNodeHierarchy + CONVERT(nvarchar(10),CHILD.TreeNodeID) + '.'
FROM tblTreeNode AS CHILD
INNER JOIN tblTreeNode AS PARENT ON CHILD.ParentTreeNodeID = PARENT.TreeNodeID
WHERE (CHILD.TreeNodeHierarchy IS NULL) AND (PARENT.TreeNodeHierarchy IS NOT NULL)
END
and here's the update trigger:
这是更新触发器:
--Only want to do something if Parent IDs were changed
IF UPDATE(ParentTreeNodeID)
BEGIN
--Update the changed items to reflect their new parents
UPDATE CHILD
SET CHILD.TreeNodeHierarchy = CASE WHEN PARENT.TreeNodeID IS NULL THEN '.' + CONVERT(nvarchar,CHILD.TreeNodeID) + '.' ELSE PARENT.TreeNodeHierarchy + CONVERT(nvarchar, CHILD.TreeNodeID) + '.' END
FROM tblTreeNode AS CHILD
INNER JOIN inserted AS I ON CHILD.TreeNodeID = I.TreeNodeID
LEFT JOIN tblTreeNode AS PARENT ON CHILD.ParentTreeNodeID = PARENT.TreeNodeID
--Now update any sub items of the changed rows if any exist
IF EXISTS (
SELECT *
FROM tblTreeNode
INNER JOIN deleted ON tblTreeNode.ParentTreeNodeID = deleted.TreeNodeID
)
UPDATE CHILD
SET CHILD.TreeNodeHierarchy = NEWPARENT.TreeNodeHierarchy + RIGHT(CHILD.TreeNodeHierarchy, LEN(CHILD.TreeNodeHierarchy) - LEN(OLDPARENT.TreeNodeHierarchy))
FROM tblTreeNode AS CHILD
INNER JOIN deleted AS OLDPARENT ON CHILD.TreeNodeHierarchy LIKE (OLDPARENT.TreeNodeHierarchy + '%')
INNER JOIN tblTreeNode AS NEWPARENT ON OLDPARENT.TreeNodeID = NEWPARENT.TreeNodeID
END
one more bit, a check constraint to prevent a circular reference in tree nodes:
还有一点,检查约束以防止树节点中的循环引用:
ALTER TABLE [dbo].[tblTreeNode] WITH NOCHECK ADD CONSTRAINT [CK_tblTreeNode_TreeNodeHierarchy] CHECK
((charindex(('.' + convert(nvarchar(10),[TreeNodeID]) + '.'),[TreeNodeHierarchy],(charindex(('.' + convert(nvarchar(10),[TreeNodeID]) + '.'),[TreeNodeHierarchy]) + 1)) = 0))
I would also recommend triggers to prevent more than one root node (null parent) per tree, and to keep related nodes from belonging to different TreeIDs (but those are a little more trivial than the above.)
我还建议使用触发器来防止每棵树有一个以上的根节点(空父节点),并防止相关节点属于不同的 TreeID(但这些比上面的要简单一些。)
You'll want to check for your particular case to see if this solution performs acceptably. Hope this helps!
您需要检查您的特定情况,以查看此解决方案的性能是否可接受。希望这可以帮助!
回答by Gene T
Celko wrote about this (2000):
Celko 写道(2000 年):
http://www.dbmsmag.com/9603d06.html
http://www.dbmsmag.com/9603d06.html
and other people asked:
还有人问:
Joining other tables in oracle tree queries
How to calculate the sum of values in a tree using SQL
How to store directory / hierarchy / tree structure in the database?
Performance of recursive stored procedures in MYSQL to get hierarchical data
What is the most efficient/elegant way to parse a flat table into a tree?
finally, you could look at the rails "acts_as_tree" (read-heavy) and "acts_as_nested_set" (write-heavy) plugins. I don't ahve a good link comparing them.
最后,您可以查看 rails 的“acts_as_tree”(重读)和“acts_as_nested_set”(重写)插件。我没有比较它们的好链接。
回答by S.Lott
There are several common kinds of queries against a hierarchy. Most other kinds of queries are variations on these.
有几种常见的针对层次结构的查询。大多数其他类型的查询都是这些的变体。
From a parent, find all children.
a. To a specific depth. For example, given my immediate parent, all children to a depth of 1 will be my siblings.
b. To the bottom of the tree.
From a child, find all parents.
a. To a specific depth. For example, my immediate parent is parents to a depth of 1.
b. To an unlimited depth.
从父母那里,找到所有的孩子。
一种。到特定深度。例如,给定我的直系父母,深度为 1 的所有孩子都将是我的兄弟姐妹。
湾 到树底。
从一个孩子,找到所有的父母。
一种。到特定深度。例如,我的直接父级是深度为 1 的父级。
湾 到无限的深度。
The (a) cases (a specific depth) are easier in SQL. The special case (depth=1) is trivial in SQL. The non-zero depth is harder. A finite, but non-zero depth, can be done via a finite number of joins. The (b) cases, with indefinite depth (to the top, to the bottom), are really hard.
(a) 情况(特定深度)在 SQL 中更容易。特殊情况 (depth=1) 在 SQL 中是微不足道的。非零深度更难。有限但非零的深度可以通过有限数量的连接来完成。(b) 案例,具有无限深度(到顶部,到底部),真的很难。
If you tree is HUGE(millions of nodes) then you're in a world of hurt no matter what you try to do.
如果您的树很大(数百万个节点),那么无论您尝试做什么,您都会受到伤害。
If your tree is under a million nodes, just fetch it all into memory and work on it there. Life is much simpler in an OO world. Simply fetch the rows and build the tree as the rows are returned.
如果您的树低于一百万个节点,只需将其全部提取到内存中并在那里进行处理。在面向对象的世界中,生活要简单得多。只需获取行并在返回行时构建树。
If you have a Hugetree, you have two choices.
如果你有一棵巨大的树,你有两个选择。
Recursive cursors to handle the unlimited fetching. This means the maintenance of the structure is O(1) -- just update a few nodes and you're done. However fetching is O(n*log(n)) because you have to open a cursor for each node with children.
Clever "heap numbering" algorithms can encode the parentage of each node. Once each node is properly numbered, a trivial SQL SELECT can be used for all four types of queries. Changes to the tree structure, however, require renumbering the nodes, making the cost of a change fairly high compared to the cost of retrieval.
递归游标来处理无限获取。这意味着结构的维护是 O(1) —— 只需更新几个节点就完成了。然而,获取是 O(n*log(n)) 因为你必须为每个有子节点的节点打开一个游标。
聪明的“堆编号”算法可以对每个节点的父系进行编码。一旦每个节点都被正确编号,一个简单的 SQL SELECT 就可以用于所有四种类型的查询。但是,对树结构的更改需要对节点重新编号,与检索成本相比,更改成本相当高。
回答by JeeBee
If you have many trees in the database, and you will only ever get the whole tree out, I would store a tree ID (or root node ID) and a parent node ID for each node in the database, get all the nodes for a particular tree ID, and process in memory.
如果数据库中有很多树,并且只能取出整棵树,我会为数据库中的每个节点存储一个树 ID(或根节点 ID)和一个父节点 ID,获取所有节点特定的树 ID,以及内存中的进程。
However if you will be getting subtrees out, you can only get a subtree of a particular parent node ID, so you either need to store all parent nodes of each node to use the above method, or perform multiple SQL queries as you descend into the tree (hope there are no cycles in your tree!), although you can reuse the same Prepared Statement (assuming that nodes are of the same type and are all stored in a single table) to prevent re-compiling the SQL, so it might not be slower, indeed with database optimisations applied to the query it could be preferable. Might want to run some tests to find out.
但是,如果要取出子树,则只能获取特定父节点 ID 的子树,因此您要么需要存储每个节点的所有父节点以使用上述方法,要么在下降时执行多个 SQL 查询树(希望您的树中没有循环!),尽管您可以重用相同的 Prepared Statement(假设节点具有相同类型并且都存储在单个表中)以防止重新编译 SQL,因此它可能不会变慢,实际上将数据库优化应用于查询可能会更可取。可能想运行一些测试来找出答案。
If you are only storing one tree, your question becomes one of querying subtrees only, and the second answer applied.
如果您只存储一棵树,您的问题将成为仅查询子树之一,并应用第二个答案。
回答by Galwegian
I am a fan of the simple method of storing an ID associated with its parentID:
我喜欢存储与其 parentID 关联的 ID 的简单方法:
ID ParentID
1 null
2 null
3 1
4 2
... ...
It is easy to maintain, and very scalable.
它易于维护,并且具有很强的可扩展性。
回答by Thomas Hansen
Google for "Materialized Path" or "Genetic Trees"...
谷歌搜索“物化路径”或“遗传树”...
回答by Dmitry Khalatov
In Oracle there is SELECT ... CONNECT BY statement to retrieve trees.
在 Oracle 中有 SELECT ... CONNECT BY 语句来检索树。
回答by Turnkey
This articleis interesting as it shows some retrieval methods as well as a way to store the lineage as a derived column. The lineage provides a shortcut method to retrieve the hierarchy without too many joins.
这篇文章很有趣,因为它展示了一些检索方法以及一种将谱系存储为派生列的方法。沿袭提供了一种快捷方法来检索层次结构,而无需太多连接。