postgresql 如何确保物化视图始终是最新的?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/29437650/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
How can I ensure that a materialized view is always up to date?
提问by John Bachir
I'll need to invoke REFRESH MATERIALIZED VIEW
on each change to the tables involved, right? I'm surprised to not find much discussion of this on the web.
我需要对REFRESH MATERIALIZED VIEW
所涉及的表格进行每次更改,对吗?我很惊讶在网上没有找到太多关于这个的讨论。
How should I go about doing this?
我该怎么做呢?
I think the top half of the answer here is what I'm looking for: https://stackoverflow.com/a/23963969/168143
我认为这里答案的上半部分是我正在寻找的:https: //stackoverflow.com/a/23963969/168143
Are there any dangers to this? If updating the view fails, will the transaction on the invoking update, insert, etc. be rolled back? (this is what I want... I think)
这有什么危险吗?如果更新视图失败,调用更新、插入等的事务会回滚吗?(这就是我想要的……我想)
回答by MatheusOl
I'll need to invoke
REFRESH MATERIALIZED VIEW
on each change to the tables involved, right?
我需要对
REFRESH MATERIALIZED VIEW
所涉及的表格进行每次更改,对吗?
Yes, PostgreSQL by itself will never call it automatically, you need to do it some way.
是的,PostgreSQL 本身永远不会自动调用它,您需要以某种方式调用它。
How should I go about doing this?
我该怎么做呢?
Many ways to achieve this. Before giving some examples, keep in mind that REFRESH MATERIALIZED VIEW
commanddoes block the view in AccessExclusive mode, so while it is working, you can't even do SELECT
on the table.
实现这一目标的方法很多。在举一些例子之前,请记住该REFRESH MATERIALIZED VIEW
命令在 AccessExclusive 模式下确实会阻止视图,因此在它工作时,您甚至不能SELECT
在桌子上做。
Although, if you are in version 9.4 or newer, you can give it the CONCURRENTLY
option:
但是,如果您使用的是 9.4 或更高版本,则可以为其提供以下CONCURRENTLY
选项:
REFRESH MATERIALIZED VIEW CONCURRENTLY my_mv;
This will acquire an ExclusiveLock, and will not block SELECT
queries, but may have a bigger overhead (depends on the amount of data changed, if few rows have changed, then it might be faster). Although you still can't run two REFRESH
commands concurrently.
这将获得一个 ExclusiveLock,并且不会阻塞SELECT
查询,但可能会有更大的开销(取决于更改的数据量,如果更改的行很少,那么它可能会更快)。尽管您仍然无法同时运行两个REFRESH
命令。
Refresh manually
手动刷新
It is an option to consider. Specially in cases of data loading or batch updates (e.g. a system that only loads tons of information/data after long periods of time) it is common to have operations at end to modify or process the data, so you can simple include a REFRESH
operation in the end of it.
这是一个可以考虑的选项。特别是在数据加载或批量更新的情况下(例如,一个系统在很长一段时间后才加载大量信息/数据),通常会在末尾进行操作来修改或处理数据,因此您可以简单地将REFRESH
操作包含在它的结束。
Scheduling the REFRESH operation
调度 REFRESH 操作
The first and widely used option is to use some scheduling system to invoke the refresh, for instance, you could configure the like in a cron job:
第一个也是广泛使用的选项是使用一些调度系统来调用刷新,例如,您可以在 cron 作业中配置类似的内容:
*/30 * * * * psql -d your_database -c "REFRESH MATERIALIZED VIEW CONCURRENTLY my_mv"
And then your materialized view will be refreshed at each 30 minutes.
然后您的物化视图将每 30 分钟刷新一次。
Considerations
注意事项
This option is really good, specially with CONCURRENTLY
option, but only if you can accept the data not being 100% up to date all the time. Keep in mind, that even with or without CONCURRENTLY
, the REFRESH
command does need to run the entire query, so you have to take the time needed to run the inner query before considering the time to schedule the REFRESH
.
这个选项真的很好,特别是有CONCURRENTLY
选项,但前提是你可以接受数据不是 100% 一直是最新的。请记住,即使有或没有CONCURRENTLY
,该REFRESH
命令确实需要运行整个查询,因此在考虑安排REFRESH
.
Refreshing with a trigger
用触发器刷新
Another option is to call the REFRESH MATERIALIZED VIEW
in a trigger function, like this:
另一种选择是REFRESH MATERIALIZED VIEW
在触发器函数中调用 ,如下所示:
CREATE OR REPLACE FUNCTION tg_refresh_my_mv()
RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
REFRESH MATERIALIZED VIEW CONCURRENTLY my_mv;
RETURN NULL;
END;
$$;
Then, in any table that involves changes on the view, you do:
然后,在涉及视图更改的任何表中,您执行以下操作:
CREATE TRIGGER tg_refresh_my_mv AFTER INSERT OR UPDATE OR DELETE
ON table_name
FOR EACH STATEMENT EXECUTE PROCEDURE tg_refresh_my_mv();
Considerations
注意事项
It has some critical pitfalls for performance and concurrency:
它在性能和并发性方面存在一些关键缺陷:
- Any INSERT/UPDATE/DELETE operation will have to execute the query (which is possible slow if you are considering MV);
- Even with
CONCURRENTLY
, oneREFRESH
still blocks another one, so any INSERT/UPDATE/DELETE on the involved tables will be serialized.
- 任何 INSERT/UPDATE/DELETE 操作都必须执行查询(如果您正在考虑 MV,这可能会很慢);
- 即使使用
CONCURRENTLY
,一个REFRESH
仍然会阻塞另一个,因此涉及的表上的任何 INSERT/UPDATE/DELETE 都将被序列化。
The only situation I can think that as a good idea is if the changes are really rare.
我能想到的唯一一种好主意是,如果这些变化真的很少见。
Refresh using LISTEN/NOTIFY
使用 LISTEN/NOTIFY 刷新
The problem with the previous option is that it is synchronous and impose a big overhead at each operation. To ameliorate that, you can use a trigger like before, but that only calls a NOTIFY
operation:
前一个选项的问题在于它是同步的,并且在每个操作中都会产生很大的开销。为了改善这种情况,您可以像以前一样使用触发器,但它只会调用一个NOTIFY
操作:
CREATE OR REPLACE FUNCTION tg_refresh_my_mv()
RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
NOTIFY refresh_mv, 'my_mv';
RETURN NULL;
END;
$$;
So then you can build an application that keep connected and uses LISTEN
operationto identify the need to call REFRESH
. One nice project that you can use to test this is pgsidekick, with this project you can use shell script to do LISTEN
, so you can schedule the REFRESH
as:
因此,您可以构建一个保持连接的应用程序并使用LISTEN
操作来识别调用REFRESH
. 一个可以用来测试的不错的项目是pgsidekick,在这个项目中你可以使用 shell 脚本来做LISTEN
,所以你可以安排REFRESH
如下:
pglisten --listen=refresh_mv --print0 | xargs -0 -n1 -I? psql -d your_database -c "REFRESH MATERIALIZED VIEW CONCURRENTLY ?;"
Or use pglater
(also inside pgsidekick
) to make sure you don't call REFRESH
very often. For example, you can use the following trigger to make it REFRESH
, but within 1 minute (60 seconds):
或者使用pglater
(也在 inside pgsidekick
)来确保你不REFRESH
经常打电话。例如,您可以使用以下触发器来实现REFRESH
,但在 1 分钟(60 秒)内:
CREATE OR REPLACE FUNCTION tg_refresh_my_mv()
RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
NOTIFY refresh_mv, '60 REFRESH MATERIALIZED VIEW CONCURRENLTY my_mv';
RETURN NULL;
END;
$$;
So it will not call REFRESH
in less the 60 seconds apart, and also if you NOTIFY
many times in less than 60 seconds, the REFRESH
will be triggered only once.
所以它不会REFRESH
在少于 60 秒的时间内调用,而且如果你NOTIFY
在少于 60 秒内多次调用,REFRESH
只会触发一次。
Considerations
注意事项
As the cron option, this one also is good only if you can bare with a little stale data, but this has the advantage that the REFRESH
is called only when really needed, so you have less overhead, and also the data is updated more closer to when needed.
作为 cron 选项,只有当您可以保留一些陈旧数据时,这个选项也很好,但是这样做的优点REFRESH
是仅在真正需要时才调用它,因此您的开销较小,并且数据更新得更接近需要的时候。
OBS: I haven't really tried the codes and examples yet, so if someone finds a mistake, typo or tries it and works (or not), please let me know.
OBS:我还没有真正尝试过代码和示例,所以如果有人发现错误、打字错误或尝试并成功(或失败),请告诉我。
回答by Park JongBum
Let me point out three things on the previous answer by MatheusOl - the pglater technology.
让我在 MatheusOl 之前的回答中指出三件事 - pglater 技术。
As the last element of long_options array it should include "{0, 0, 0, 0}" element as pointed at https://linux.die.net/man/3/getopt_longby the phrase "The last element of the array has to be filled with zeros." So, it should read -
static struct option long_options[] = { //...... {"help", no_argument, NULL, '?'}, {0, 0, 0, 0} };
On the malloc/free thing -- one free(for char listen = malloc(...);) is missing. Anyhow, malloc caused pglater process to crash on CentOS (but not on Ubuntu - I don't know why). So, I recommend using char array and assign the array name to the char pointer(to both charand char**). You many need to force type conversion while you do that(pointer assignment).
char block4[100]; ... password_prompt = block4; ... char block1[500]; const char **keywords = (const char **)&block1; ... char block3[300]; char *listen = block3; sprintf(listen, "listen %s", id); PQfreemem(id); res = PQexec(db, listen);
Use below table to calculate timeout where md is mature_duration which is the time difference between the latest refresh(lr) time point and current time.
when md >= callback_delay(cd) ==> timeout: 0
when md + PING_INTERVAL >= cd ==> timeout: cd-md[=cd-(now-lr)]
when md + PING_INTERVAL < cd ==> timeout: PI
作为 long_options 数组的最后一个元素,它应该包含“{0, 0, 0, 0}”元素,如https://linux.die.net/man/3/getopt_long 中的短语“数组的最后一个元素必须用零填充。” 所以,它应该读-
static struct option long_options[] = { //...... {"help", no_argument, NULL, '?'}, {0, 0, 0, 0} };
关于 malloc/free 的事情——一个 free(for char listen = malloc(...);) 丢失了。无论如何, malloc 导致 pglater 进程在 CentOS 上崩溃(但不是在 Ubuntu 上 - 我不知道为什么)。因此,我建议使用 char 数组并将数组名称分配给 char 指针(同时指向 char和 char**)。您在执行此操作时需要强制进行类型转换(指针分配)。
char block4[100]; ... password_prompt = block4; ... char block1[500]; const char **keywords = (const char **)&block1; ... char block3[300]; char *listen = block3; sprintf(listen, "listen %s", id); PQfreemem(id); res = PQexec(db, listen);
使用下表计算超时,其中 md 是成熟的持续时间,即最新的 refresh(lr) 时间点与当前时间之间的时间差。
当 md >= callback_delay(cd) ==> 超时:0
当 md + PING_INTERVAL >= cd ==> 超时:cd-md[=cd-(now-lr)]
当 md + PING_INTERVAL < cd ==> 超时:PI
To implement this algorithm(3rd point), you should init 'lr' as follows -
要实现此算法(第三点),您应该按如下方式初始化 'lr' -
res = PQexec(db, command); latest_refresh = time(0); if (PQresultStatus(res) == PGRES_COMMAND_OK) {
res = PQexec(db, command); latest_refresh = time(0); if (PQresultStatus(res) == PGRES_COMMAND_OK) {