postgresql 与经典的标准化表相比,postgres JSON 索引是否足够有效?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/18799704/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Are postgres JSON indexes efficient enough compared with classic normalized tables?
提问by Peteris
Current Postgresql versions have introduced various features for JSON content, but I'm concerned if I really should use them - I mean, there is not yet "best practice" estabilished on what works and what doesn't, or at least I can't find it.
当前的 Postgresql 版本已经为 JSON 内容引入了各种功能,但我担心我是否真的应该使用它们 - 我的意思是,目前还没有关于哪些有效哪些无效的“最佳实践”,或者至少我不能没找到。
I have a specific example - I have a table about objects which, among other things, contains a list of alternate names for that object. All that data will also be included in a JSON column for retrieval purposes. For example (skipping all the other irrelevant fields).
我有一个具体的例子 - 我有一个关于对象的表格,其中包含该对象的备用名称列表。所有这些数据也将包含在 JSON 列中以供检索。例如(跳过所有其他不相关的字段)。
create table stuff (id serial primary key, data json);
insert into stuff(data) values('{"AltNames":["Name1","Name2","Name3"]}')
I will need some queries in the form "list all objects where one of altnames is 'foobar'." The expected table size is on the order of a few million records. Postgres JSON queries can be used for that, and it can also be indexed (Index for finding an element in a JSON array, for example). However, SHOULD it be done that way or is it a perverse workaround that's not recommended?
我将需要以“列出替代名称之一为‘foobar’的所有对象”形式的一些查询。预期的表大小约为几百万条记录。可以使用 Postgres JSON 查询,也可以对其进行索引(例如,用于在 JSON 数组中查找元素的索引)。但是,是否应该这样做,还是不推荐的反常解决方法?
The classic alternative, of course, is to add an additional table for that one-to-many relation, containing the name and a foreign key to the main table; the performance of that is well understood. However, that has it's own disadvantages as then it means either data duplication between that table and JSON (with possible integrity risk); or creating that JSON return data dynamically at every request, which has it's own performance penalty.
当然,经典的替代方法是为该一对多关系添加一个附加表,其中包含主表的名称和外键;其性能很好理解。但是,这有其自身的缺点,因为这意味着该表和 JSON 之间的数据重复(可能存在完整性风险);或者在每个请求时动态创建该 JSON 返回数据,这有其自身的性能损失。
采纳答案by Denis de Bernardy
I will need some queries in the form "list all objects where one of altnames is 'foobar'." The expected table size is on the order of a few million records. Postgres JSON queries can be used for that, and it can also be indexed (Index For Finding Element in JSON array, for example). However, SHOULD it be done that way or is it a perverse workaround that's not recommended?
我将需要以“列出替代名称之一为‘foobar’的所有对象”形式的一些查询。预期的表大小约为几百万条记录。可以使用 Postgres JSON 查询,也可以对其进行索引(例如,用于在 JSON 数组中查找元素的索引)。但是,是否应该这样做,还是不推荐的反常解决方法?
It canbe done that way but that doesn't mean that you should. In some sense, the best practice is well documented already (see e.g. using hstore vs using XML vs using EAV vs using a separate table) with a new datatype which, for all intents and practical purposes (besides validation and syntax), is no different from prior unstructured or semi-structured options.
它可以做到这样,但这并不意味着你应该。从某种意义上说,最佳实践已经得到了很好的记录(参见例如使用 hstore vs 使用 XML vs 使用 EAV vs 使用单独的表)具有新的数据类型,对于所有意图和实际目的(除了验证和语法),没有什么不同来自先前的非结构化或半结构化期权。
Put another way, it's the same old pig with new makeup.
换句话说,它是一头换了新妆的老猪。
JSON offers the ability to use inverted search tree indexes, in the same way as hstore, array types and tsvectors do. They work fine, but keep in mind that they're primarily designed for extracting points in a neighborhood (think geometry types) ordered by distance, rather than for extracting a list of values in lexicographical order.
JSON 提供了使用倒排搜索树索引的能力,就像 hstore、数组类型和 tsvectors 一样。它们工作正常,但请记住,它们主要用于提取按距离排序的邻域(想想几何类型)中的点,而不是按字典顺序提取值列表。
To illustrate, take the two plans that Roman's answer outlines:
为了说明,以 Roman 的回答概述的两个计划为例:
- The one that does an index scanplows through disk pages directly, retrieving the rows in the order indicated by the index.
- The one that does a bitmap index scanstarts by identifying every disk page that might contain a row, and reads them as they appear on disk, as if it was (and in fact, precisely like) doing a sequence scan that skips useless areas.
- 执行索引扫描的程序直接遍历磁盘页面,按照索引指示的顺序检索行。
- 进行位图索引扫描的程序首先识别可能包含一行的每个磁盘页面,并在它们出现在磁盘上时读取它们,就好像它正在(实际上,完全像)执行跳过无用区域的序列扫描。
Getting back to your question: Cluttered and oversized inverted tree indexeswill indeed improve the performance of your app if you use Postgres tables as giant JSON stores. But they're not a silver bullet either, and they won't get you as far as proper relational design when dealing with bottlenecks.
回到您的问题:如果您将 Postgres 表用作巨大的 JSON 存储,那么杂乱且过大的倒排树索引确实会提高您的应用程序的性能。但它们也不是灵丹妙药,在处理瓶颈时,它们不会让您达到适当的关系设计。
The bottom line, in the end, is no different from what you'd get when deciding to use hstore or an EAV:
最后,与您决定使用 hstore 或 EAV 时得到的结果没有什么不同:
- If it needs an index (i.e. it frequently appears in a where clause or, even more importantly, in a join clause), you likely want the data in a separate field.
- If it's primarily cosmetic, JSON/hstore/EAV/XML/whatever-makes-you-sleep-at-night works fine.
- 如果它需要索引(即它经常出现在 where 子句中,或者更重要的是,出现在 join 子句中),您可能希望数据位于单独的字段中。
- 如果它主要是装饰性的,JSON/hstore/EAV/XML/whatever-makes-you-sleep-at-night 工作正常。
回答by Roman Pekar
I'd say it worth a try. I've created some test (100000 records, ~10 elements in JSON array) and checked how it's working:
我会说值得一试。我创建了一些测试(100000 条记录,JSON 数组中的约 10 个元素)并检查了它是如何工作的:
create table test1 (id serial primary key, data json);
create table test1_altnames (id int, name text);
create or replace function array_from_json(_j json)
returns text[] as
$func$
select array_agg(x.elem::text)
from json_array_elements(_j) as x(elem)
$func$
language sql immutable;
with cte as (
select
(random() * 100000)::int as grp, (random() * 1000000)::int as name
from generate_series(1, 1000000)
), cte2 as (
select
array_agg(Name) as "AltNames"
from cte
group by grp
)
insert into test1 (data)
select row_to_json(t)
from cte2 as t
insert into test1_altnames (id, name)
select id, json_array_elements(data->'AltNames')::text
from test1
create index ix_test1 on test1 using gin(array_from_json(data->'AltNames'));
create index ix_test1_altnames on test1_altnames (name);
Query JSON (30mson my machine):
查询 JSON(在我的机器上为 30毫秒):
select * from test1 where '{489147}' <@ array_from_json(data->'AltNames');
"Bitmap Heap Scan on test1 (cost=224.13..1551.41 rows=500 width=36)"
" Recheck Cond: ('{489147}'::text[] <@ array_from_json((data -> 'AltNames'::text)))"
" -> Bitmap Index Scan on ix_test1 (cost=0.00..224.00 rows=500 width=0)"
" Index Cond: ('{489147}'::text[] <@ array_from_json((data -> 'AltNames'::text)))"
Query table with names (15mson my machine):
带有名称的查询表(在我的机器上为 15毫秒):
select * from test1 as t where t.id in (select tt.id from test1_altnames as tt where tt.name = '489147');
"Nested Loop (cost=12.76..20.80 rows=2 width=36)"
" -> HashAggregate (cost=12.46..12.47 rows=1 width=4)"
" -> Index Scan using ix_test1_altnames on test1_altnames tt (cost=0.42..12.46 rows=2 width=4)"
" Index Cond: (name = '489147'::text)"
" -> Index Scan using test1_pkey on test1 t (cost=0.29..8.31 rows=1 width=36)"
" Index Cond: (id = tt.id)"
Also I have to note, there're some cost to insert/delete rows into table with names (test1_altnames
), so it's a bit more complicated than just selecting rows. Personally I like solution with JSON.
另外我必须注意,将行插入/删除到带有名称 ( test1_altnames
) 的表中需要一些成本,因此它比仅选择行要复杂一些。我个人喜欢 JSON 的解决方案。