postgresql 使用物化路径对树进行排序?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/2797720/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Sorting tree with a materialized path?
提问by Ovid
I have a tree structure in a table and it uses materialized paths to allow me to find children quickly. However, I also need to sort the results depth-first, as one would expect with threaded forum replies.
我在表中有一个树结构,它使用物化路径来让我快速找到孩子。但是,我还需要对结果进行深度优先排序,正如人们对线程论坛回复所期望的那样。
id | parent_id | matpath | created
----+-----------+---------+----------------------------
2 | 1 | 1 | 2010-05-08 15:18:37.987544
3 | 1 | 1 | 2010-05-08 17:38:14.125377
4 | 1 | 1 | 2010-05-08 17:38:57.26743
5 | 1 | 1 | 2010-05-08 17:43:28.211708
7 | 1 | 1 | 2010-05-08 18:18:11.849735
6 | 2 | 1.2 | 2010-05-08 17:50:43.288759
9 | 5 | 1.5 | 2010-05-09 14:02:43.818646
8 | 6 | 1.2.6 | 2010-05-09 14:01:17.632695
So the final results should actually be sorted like this:
所以最终的结果实际上应该是这样排序的:
id | parent_id | matpath | created
----+-----------+---------+----------------------------
2 | 1 | 1 | 2010-05-08 15:18:37.987544
6 | 2 | 1.2 | 2010-05-08 17:50:43.288759
8 | 6 | 1.2.6 | 2010-05-09 14:01:17.632695
3 | 1 | 1 | 2010-05-08 17:38:14.125377
4 | 1 | 1 | 2010-05-08 17:38:57.26743
5 | 1 | 1 | 2010-05-08 17:43:28.211708
9 | 5 | 1.5 | 2010-05-09 14:02:43.818646
7 | 1 | 1 | 2010-05-08 18:18:11.849735
How would I work that out? Can I do that in straight SQL (this is PostgreSQL 8.4) or should additional information be added to this table?
我将如何解决这个问题?我可以用直接的 SQL(这是 PostgreSQL 8.4)来做到这一点,还是应该向这个表中添加额外的信息?
Update: trying to explain sort criteria better.
更新:试图更好地解释排序标准。
Imagine that id '1' is the root post to a forum and everything with a 'matpath' beginning with '1' is a child of that post. So ids 2 through 5 are direct replies to 1 and get matpaths of '1'. However, id 6 is a reply 2, not directly to 1, so it gets a matpath of 1.2. This means that for a threaded forum with proper nesting, with all ids shown in the tables, the structure of the forum would look like this, hence the ordering requirement:
想象一下,id '1' 是论坛的根帖子,并且所有以 '1' 开头的 'matpath' 都是该帖子的子级。所以 ids 2 到 5 是对 1 的直接回复,并获得 '1' 的 matpaths。但是,id 6 是回复 2,而不是直接回复 1,因此它的 matpath 为 1.2。这意味着对于具有正确嵌套的线程论坛,所有 ID 都显示在表中,论坛的结构将如下所示,因此订购要求:
* id 1 (root post)
* id 2
* id 6
* id 8
* id 3
* id 4
* id 5
* id 9
* id 7
采纳答案by RedFilter
I typically create an additional columnn for this, called something like SortPath
. It would contain the data that you need to sort by, concatenated together. That column would be of type varchar
, and get sorted as a string. Something like this:
我通常为此创建一个额外的列,称为SortPath
. 它将包含您需要排序的数据,并连接在一起。该列的类型为varchar
,并按字符串排序。像这样的东西:
id | parent_id | matpath | created | sortpath
---+-----------+---------+-----------------------------+--------------------------------------------------------------------------------------
2 | 1 | 1 | 2010-05-08 15:18:37.987544 | 2010-05-08 15:18:37.987544-2
6 | 2 | 1.2 | 2010-05-08 17:50:43.288759 | 2010-05-08 15:18:37.987544-2.2010-05-08 17:50:43.288759-6
8 | 6 | 1.2.6 | 2010-05-09 14:01:17.632695 | 2010-05-08 15:18:37.987544-2.2010-05-08 17:50:43.288759-6.2010-05-09 14:01:17.632695-8
3 | 1 | 1 | 2010-05-08 17:38:14.125377 | 2010-05-08 17:38:14.125377-3
4 | 1 | 1 | 2010-05-08 17:38:57.26743 | 2010-05-08 17:38:57.267430-4
5 | 1 | 1 | 2010-05-08 17:43:28.211708 | 2010-05-08 17:43:28.211708-5
9 | 5 | 1.5 | 2010-05-09 14:02:43.818646 | 2010-05-08 17:43:28.211708-5.2010-05-09 14:02:43.818646-9
7 | 1 | 1 | 2010-05-08 18:18:11.849735 | 2010-05-08 18:18:11.849735-7
A couple of things to note here:
这里有几点需要注意:
sortpath
will be sorted as a string, so it is important all dates have the same length for it to correctly sort. E.g., observe how2010-05-08 17:38:57.26743
has an extra zero added in thesortpath
column.- I have appended the PK of each node to the end of its date. This is so that if you happen to have two rows with the exact same date, they will always get returned in the same order due to the additional data we are appending.
- To me, the data looks asymmetrical the way I have written it, because we are showing the current node's date in
sortpath
, but it is not inmatpath
. I would prefer to see it in both. - You may want to put the date of node ID 1 at the beginning of each
sortcolumn
as well. This is so that if you ever want to query for more than one forum at a time (you probably won't), then it will still sort correctly.
sortpath
将作为字符串排序,因此重要的是所有日期都具有相同的长度才能正确排序。例如,观察列中如何2010-05-08 17:38:57.26743
添加额外的零sortpath
。- 我已将每个节点的 PK 附加到其日期的末尾。这样,如果您碰巧有两行具有完全相同的日期,由于我们附加了额外的数据,它们将始终以相同的顺序返回。
- 对我来说,数据看起来不对称,因为我们在 中显示了当前节点的日期
sortpath
,但它不在matpath
. 我更愿意在两者中看到它。 - 您可能还想将节点 ID 1 的日期放在每个节点的开头
sortcolumn
。这样一来,如果您想一次查询多个论坛(您可能不会),那么它仍然可以正确排序。
回答by Unreason
I believe your materialized path is not right.
我相信你的物化路径是不对的。
What logic do you get to sort things like this
你有什么逻辑可以对这样的事情进行排序
1
1.2
1
1.5
Why is the second 1 not together with the first one?
为什么第二个 1 不和第一个在一起?
If you had
如果你有
1
1.2
2
2.5
This would be trivial.
这将是微不足道的。
EDIT: I have looked at your example and you are not storing materialized path of a row, but you are storing a materialized path of the parent row. Here's how the materialized path of the row actually should look like. Sorting directly on matpath would work if you would not have more than 9 branches if you stored it as:
编辑:我看过你的例子,你没有存储一行的物化路径,而是存储父行的物化路径。下面是行的实体化路径实际上应该是什么样子。如果您将其存储为以下内容,则如果您的分支不超过 9 个,则直接在 matpath 上排序将起作用:
id | parent_id | matpath | created
----+-----------+-----------+----------------------------
2 | 1 | 1.2 | 2010-05-08 15:18:37.987544
6 | 2 | 1.2.6 | 2010-05-08 17:50:43.288759
8 | 6 | 1.2.6.8 | 2010-05-09 14:01:17.632695
3 | 1 | 1.3 | 2010-05-08 17:38:14.125377
4 | 1 | 1.4 | 2010-05-08 17:38:57.26743
5 | 1 | 1.5 | 2010-05-08 17:43:28.211708
9 | 5 | 1.5.9 | 2010-05-09 14:02:43.818646
7 | 1 | 1.7 | 2010-05-08 18:18:11.849735
otherwise (>9) you would have to turn the matpath
into something like
否则 (>9) 你必须把它matpath
变成类似的东西
001.002.006
001.002.006.008
that would support up to 999 branches.
这将支持多达 999 个分支。
Please note
请注意
- even the approach with 4 fixed digits, such as
0001.0002.0006
would give you a field that is shorter then in the accepted answer - you could parse matpath an produce sorting value on the fly with a user function
- you could directly store matpath in this format (it has some other nice properties, too)
- 即使是具有 4 个固定数字的方法,例如
0001.0002.0006
会给您一个比接受的答案中更短的字段 - 您可以使用用户函数动态解析 matpath 生产排序值
- 您可以直接以这种格式存储 matpath (它也有一些其他不错的属性)
回答by DrFriedParts
I'm not sure I understand why the accepted solution makes any sense at all. It works, but it is even less normalized and less efficient (more disk space, more indexes, etc) than @Unreason's solution (to just pad the ID's in the materialized path).
我不确定我是否理解为什么接受的解决方案有任何意义。它可以工作,但与@Unreason 的解决方案(仅在实体化路径中填充 ID)相比,它的规范化程度更低,效率更低(更多磁盘空间、更多索引等)。
The whole scenario that the OP faces seems to stem from the fact that, as @Unreason correctly points out, the implementation of the materialized path (MP) is incorrect. The OP has provided the MP to the parent, not to the current node. In the accepted solution the SortPath
column corrects this by providing the materialized path to the current node (this time using dates -- why? -- instead of the primary key).
OP 面临的整个场景似乎源于这样一个事实,正如@Unreason 正确指出的那样,物化路径 (MP) 的实现是不正确的。OP 已将 MP 提供给父节点,而不是当前节点。在公认的解决方案中,该SortPath
列通过提供到当前节点的具体化路径来纠正这一点(这次使用日期——为什么?——而不是主键)。
For reference consider the following excerpt:
作为参考,请考虑以下摘录:
Materialized Path
In this approach each record stores the whole path to the root. In our previous example, lets assume that KING is a root node. Then, the record with ename = 'SCOTT' is connected to the root via the path SCOTT->JONES->KING. Modern databases allow representing a list of nodes as a single value, but since materialized path has been invented long before then, the convention stuck to plain character string of nodes concatenated with some separator; most often '.' or '/'.
实体化路径
在这种方法中,每条记录都存储到根的整个路径。在我们前面的例子中,让我们假设 KING 是一个根节点。然后,ename = 'SCOTT' 的记录通过路径 SCOTT->JONES->KING 连接到根。现代数据库允许将节点列表表示为单个值,但由于物化路径早在此之前就已被发明,因此约定坚持使用与某些分隔符连接的节点的纯字符串;最经常 '。' 或者 '/'。
回答by chx
While @Unreason's answer about padding works, I would like to offer another solution which I believe is my own invention to this problem.
虽然@Unreason 关于填充的回答有效,但我想提供另一种解决方案,我认为这是我自己对这个问题的发明。
You are looking for function creating a bytestream, an f(x)=b_1b_2..b_i
(sorry no MatJax on SO) where b_i
is a byte. We know two bytestream compares the same as the first differing byte. We want such a function that f(x)<f(y) iff x<y
.
您正在寻找创建字节流的函数,一个f(x)=b_1b_2..b_i
(很抱歉没有 MatJax),其中b_i
是一个字节。我们知道两个字节流与第一个不同的字节比较相同。我们想要这样一个函数,f(x)<f(y) iff x<y
.
Padding to the same length with 0 definitely obtains this goal, easily. You take two numbers, look at the first nonzero byte and there you are.
用 0 填充到相同的长度肯定可以轻松实现这个目标。你拿两个数字,看看第一个非零字节,你就知道了。
Steven Wittens (acko.net) introduced a different trick to Drupal core some eight years ago: put the number of digits in front of the string as another digit. So, the number 97685 becomes the characters 5 9 7 6 8 5
. This also works: look at the length byte first, if they are not the same then the larger will definitely be the larger. Beyond that, you know the two numbers are equal length. He also used base 36 numbers with 0-9a-z being the digits, much like hex just for every letter. This encoding needs two bytes for the first 36 nodes, three for the next 1260...
大约八年前,Steven Wittens (acko.net) 向 Drupal 核心引入了一个不同的技巧:将字符串前面的数字作为另一个数字。因此,数字 97685 变成了字符5 9 7 6 8 5
。这也有效:首先查看长度字节,如果它们不相同,则越大肯定越大。除此之外,您知道这两个数字的长度相等。他还使用基数为 36 的数字,其中 0-9a-z 是数字,就像每个字母的十六进制一样。此编码需要前 36 个节点的两个字节,接下来的 1260 个节点需要三个字节...
Note that neither padding nor this cunning variable length encoding need separators for the materialized path although for readability they are often included.
请注意,填充和这种狡猾的可变长度编码都不需要物化路径的分隔符,尽管为了可读性,它们经常被包含在内。
numconvsupports a base85 encoding but that requires a case sensitive collation. There are 94 ASCII characters if you remove lower case letters you still have base68.
numconv支持 base85 编码,但这需要区分大小写的排序规则。如果您删除小写字母,则有 94 个 ASCII 字符,您仍然拥有 base68。
But if you use a 'binary' field then you can do base256: instead of a cunning encoding just write the number as a series of bytes and then prefix the whole thing with the length of the bytestream as a single byte. This will allow you to encode any tree smaller than 2^2048 or so. For the first 256 nodes, you are using two bytes, for the next 65280 you are looking at three bytes. This is already quite efficient.
但是,如果您使用“二进制”字段,那么您可以使用 base256:而不是巧妙的编码,只需将数字写入一系列字节,然后将字节流的长度作为单个字节作为整个内容的前缀。这将允许您对小于 2^2048 左右的任何树进行编码。对于前 256 个节点,您使用了两个字节,对于接下来的 65280 个节点,您正在查看三个字节。这已经相当有效了。
I nominate the utf8encode(x)
function. Consider that! You need to descend into bitsorting instead of bytesorting but that doesn't change the outcome: find the leftmost zero bit. If the other string has a 1 there, it'll be a longer UTF-8 encoding so definitely that's larger. If they have the first zero at the same place then we have same length bit strings which compare well for us.
我提名这个utf8encode(x)
功能。考虑一下!您需要进行位排序而不是字节排序,但这不会改变结果:找到最左边的零位。如果另一个字符串在那里有一个 1,它将是一个更长的 UTF-8 编码,所以肯定是更大的。如果它们在同一个地方有第一个零,那么我们有相同长度的位串,这对我们来说比较好。
That's nice but what about separators. The UTF-8 algorithm when looking at it as purely an algorithm creating bitstreams can handle 31 bit numbers -- so it'll work for trees containing less than two billion nodes. Your materialized path will be a bitstream of UTF-8 encoded numbers which compare well: Discard the leftmost identical UTF-8 encoded numbers and we are back at the previous paragraph. Q.E.D.
这很好,但是分隔符呢。将 UTF-8 算法视为纯粹的创建比特流的算法时,它可以处理 31 位数字——因此它适用于包含少于 20 亿个节点的树。您的物化路径将是一个 UTF-8 编码数字的比特流,它们比较好:丢弃最左边的相同 UTF-8 编码数字,我们回到上一段。QED
Because we don't need separators or prefix bytes, we can encode the first 128 nodes into a single byte then the next 1920 into two bytes, and up to 65535, three bytes. For four bytes, base256 will win. For really big trees, you can treat UTF-8 as an encoder of 0-2147483647 into a byte stream. So you can use it as base2147483647 encoding :D
因为我们不需要分隔符或前缀字节,我们可以将前 128 个节点编码为一个字节,然后将接下来的 1920 个节点编码为两个字节,最多为 65535 个,三个字节。对于四个字节,base256 将获胜。对于非常大的树,可以将 UTF-8 视为 0-2147483647 的编码器转换为字节流。所以你可以将它用作 base2147483647 编码:D
To summarize: UTF-8 is best for small trees and not much worse than base256 below two billion nodes. Beyond that until 2^2048 or so prefixed-with-length-base256 wins. Beyond that prefixed-with-length-base2147483647 wins and there's nothing beyond that.
总结一下:UTF-8 最适合小树,并且不比 20 亿节点以下的 base256 差多少。除此之外,直到 2^2048 左右前缀长度 base256 获胜。除了以长度为前缀的 base2147483647 获胜之外,再无其他。
回答by Devdas
I can't think of a simple way to do this in straight SQL. Rather than matpath, I will use node_path here. node_path is matpath||'.'||id
我想不出一个简单的方法来用直接的 SQL 来做到这一点。我将在这里使用 node_path 而不是 matpath。node_path 是 matpath||'.'||id
id | parent_id | node_path | created
----+-----------+---------+----------------------------
2 | 1 | 1.2 | 2010-05-08 15:18:37.987544
3 | 1 | 1.3 | 2010-05-08 17:38:14.125377
4 | 1 | 1.4 | 2010-05-08 17:38:57.26743
5 | 1 | 1.5 | 2010-05-08 17:43:28.211708
7 | 1 | 1.7 | 2010-05-08 18:18:11.849735
6 | 2 | 1.2.6 | 2010-05-08 17:50:43.288759
9 | 5 | 1.5.9 | 2010-05-09 14:02:43.818646
8 | 6 | 1.2.6.8 | 2010-05-09 14:01:17.632695
Now you want to order the tree based on node_path, with the sorting field defined by the number of times you have run the sort.
现在您想根据 node_path 对树进行排序,排序字段由您运行排序的次数定义。
A custom recursive function in plpgsql sorting on split_part(node_path, '.', recursion_depth) will work. You will have to check for NULL values from split_part (and ignore those).
plpgsql 中对 split_part(node_path, '.', recursion_depth) 排序的自定义递归函数将起作用。您必须检查 split_part 中的 NULL 值(并忽略这些值)。