SQL 使用 Tablefunc 在多列上透视

Question

提问by ideamotor

Has anyone used tablefuncto pivot on multiple variables as opposed to only using row name? The documentation notes:

有没有人曾经使用tablefunc多个变量而不是仅使用行名称？文档说明：

The "extra" columns are expected to be the same for all rows with the same row_name value.

对于具有相同 row_name 值的所有行，“额外”列应该是相同的。

I'm not sure how to do this without combining the columns that I want to pivot on (which I highly doubt will give me the speed I need). One possible way to do this would be to make the entity numeric and add it to the localt as milliseconds, but this seems like a shaky way to proceed.

我不知道如何在不组合我想要旋转的列的情况下执行此操作（我非常怀疑这会给我所需的速度）。一种可能的方法是将实体设为数字并将其以毫秒为单位添加到本地，但这似乎是一种不稳定的方法。

I've edited the data used in a response to this question: PostgreSQL Crosstab Query.

我已经编辑了在回答这个问题时使用的数据：PostgreSQL Crosstab Query。

 CREATE TEMP TABLE t4 (
  timeof   timestamp
 ,entity    character
 ,status    integer
 ,ct        integer);

 INSERT INTO t4 VALUES 
  ('2012-01-01', 'a', 1, 1)
 ,('2012-01-01', 'a', 0, 2)
 ,('2012-01-02', 'b', 1, 3)
 ,('2012-01-02', 'c', 0, 4);

 SELECT * FROM crosstab(
     'SELECT timeof, entity, status, ct
      FROM   t4
      ORDER  BY 1,2,3'
     ,$$VALUES (1::text), (0::text)$$)
 AS ct ("Section" timestamp, "Attribute" character, "1" int, "0" int);

Returns:

返回：

 Section                   | Attribute | 1 | 0
---------------------------+-----------+---+---
 2012-01-01 00:00:00       |     a     | 1 | 2
 2012-01-02 00:00:00       |     b     | 3 | 4

So as the documentation states, the extracolumn aka 'Attribute' is assumed to be the same for each row nameaka 'Section'. Thus, it reports bfor the second row even though 'entity' also has a 'c'value for that 'timeof' value.

因此，正如文档所述，假设每行名称（又名“部分”）的额外列（又名“属性”）是相同的。因此，它报告第二行的b，即使 'entity' 也具有该 'timeof' 值的'c' 值。

Desired Output:

期望输出：

Section                   | Attribute | 1 | 0
--------------------------+-----------+---+---
2012-01-01 00:00:00       |     a     | 1 | 2
2012-01-02 00:00:00       |     b     | 3 |  
2012-01-02 00:00:00       |     c     |   | 4

Any thoughts or references?

任何想法或参考？

A little more background: I potentially need to do this for billionsof rows and I'm testing out storing this data in long and wide formats and seeing if I can use tablefuncto go from long to wide format more efficiently than with regular aggregate functions.
I'll have about 100 measurements made every minute for around 300 entities. Often, we will need to compare the different measurements made for a given second for a given entity, so we will need to go to wide format very often. Also, the measurements made on a particular entity are highly variable.

更多背景知识：我可能需要对数十亿行执行此操作，并且我正在测试以长格式和宽格式存储这些数据，并查看是否可以使用tablefunc比常规聚合函数更有效地从长格式转换为宽格式。
我将每分钟对大约 300 个实体进行大约 100 次测量。通常，我们需要比较给定实体在给定秒内进行的不同测量，因此我们需要经常使用宽格式。此外，对特定实体进行的测量是高度可变的。

EDIT: I found a resource on this: http://www.postgresonline.com/journal/categories/24-tablefunc.

编辑：我找到了一个资源：http: //www.postgresonline.com/journal/categories/24-tablefunc。

Answer 1

采纳答案by Erwin Brandstetter

The problem with your query is that band cshare the same timestamp 2012-01-02 00:00:00, and you have the timestampcolumn timeoffirst in your query, so - even though you added bold emphasis - band care just extra columns that fall in the same group 2012-01-02 00:00:00. Only the first (b) is returned since (quoting the manual):

您的查询的问题在于b和c共享相同的时间戳2012-01-02 00:00:00，并且您在查询中首先拥有该timestamp列timeof，因此 - 即使您添加了粗体强调 -b并且c只是属于同一组的额外列2012-01-02 00:00:00。b自（引用手册）以来，仅返回第一个 ( )：

The row_namecolumn must be first. The categoryand valuecolumns must be the last two columns, in that order. Any columns between row_nameand categoryare treated as "extra". The "extra" columns are expected to be the samefor all rows with the same row_namevalue.

该row_name列必须是第一个。在category与value列必须是最后两列的顺序。row_name和之间的任何列category都被视为“额外”。对于具有相同值的所有行，“额外”列应该是相同的row_name。

Bold emphasis mine.
Just revert the order of the first two columns to make entitythe row name and it works as desired:

大胆强调我的。
只需恢复前两列的顺序即可制作entity行名称，它可以按需要工作：

SELECT * FROM crosstab(
      'SELECT entity, timeof, status, ct
       FROM   t4
       ORDER  BY 1'
      ,'VALUES (1), (0)')
 AS ct (
    "Attribute" character
   ,"Section" timestamp
   ,"status_1" int
   ,"status_0" int);

entitymust be unique, of course.

entity当然，必须是唯一的。

Reiterate

重申

row_namefirst
(optional) extracolumns next
category(as defined by the second parameter) and valuelast.

row_name第一的
（可选）extra列下一个
category（由第二个参数定义）和valuelast。

Extra columns are filled from the firstrow from each row_namepartition. Values from other rows are ignored, there is only one column per row_nameto fill. Typically those would be the same for every row of one row_name, but that's up to you.

从每个分区的第一行填充额外的列row_name。其他行的值将被忽略，每列仅row_name填充一列。通常，对于 one 的每一行，这些都是相同的row_name，但这取决于您。

For the different setup in your answer:

对于答案中的不同设置：

SELECT localt, entity
     , msrmnt01, msrmnt02, msrmnt03, msrmnt04, msrmnt05  -- , more?
FROM   crosstab(
        'SELECT dense_rank() OVER (ORDER BY localt, entity)::int AS row_name
              , localt, entity -- additional columns
              , msrmnt, val
         FROM   test
         -- WHERE  ???   -- instead of LIMIT at the end
         ORDER  BY localt, entity, msrmnt
         -- LIMIT ???'   -- instead of LIMIT at the end
     , $$SELECT generate_series(1,5)$$)  -- more?
     AS ct (row_name int, localt timestamp, entity int
          , msrmnt01 float8, msrmnt02 float8, msrmnt03 float8, msrmnt04 float8, msrmnt05 float8 -- , more?
            )
LIMIT 1000  -- ??!!

No wonder the queries in your test perform terribly. Your test setup has 14M rows and you process allof them before throwing most of it away with LIMIT 1000. For a reduced result set add WHERE conditions or a LIMIT to the source query!

难怪您的测试中的查询执行得非常糟糕。测试设置有14M行，你处理所有他们投掷大部分逃脱之前LIMIT 1000。对于简化的结果集，向源查询添加 WHERE 条件或 LIMIT！

Plus, the array you work with is needlessly expensive on top of it. I generate a surrogate row name with dense_rank() instead.

另外，您使用的阵列在它之上是不必要的昂贵。我用dense_rank() 生成了一个代理行名称。

db<>fiddlehere - with a simpler test setup and fewer rows.

db<>fiddlehere - 具有更简单的测试设置和更少的行。

Answer 2

回答by ideamotor

In my original question I should have used this for my sample data:

在我最初的问题中，我应该将它用于我的示例数据：

CREATE TEMP TABLE t4 (
 timeof    date
,entity    integer
,status    integer
,ct        integer);
INSERT INTO t4 VALUES 
 ('2012-01-01', 1, 1, 1)
,('2012-01-01', 1, 0, 2)
,('2012-01-01', 3, 0, 3)
,('2012-01-02', 2, 1, 4)
,('2012-01-02', 3, 1, 5)
,('2012-01-02', 3, 0, 6);

With this I have to pivot on both timeof and entity. Since tablefunconly uses one column for the pivoting, you need to find a way to stuff both dimensions in that column. (http://www.postgresonline.com/journal/categories/24-tablefunc). I went with the array, just like the example on in that link.

有了这个，我必须以时间和实体为中心。由于tablefunc只使用一列进行旋转，您需要找到一种方法来填充该列中的两个维度。（http://www.postgresonline.com/journal/categories/24-tablefunc）。我使用了数组，就像该链接中的示例一样。

SELECT (timestamp 'epoch' + row_name[1] * INTERVAL '1 second')::date 
           as localt, 
           row_name[2] As entity, status1, status0
FROM crosstab('SELECT ARRAY[extract(epoch from timeof), entity] as row_name,
                    status, ct
               FROM t4 
               ORDER BY timeof, entity, status'
     ,$$VALUES (1::text), (0::text)$$) 
          as ct (row_name integer[], status1 int, status0 int)

FWIW, I tried using a character array and so far it looks like this is faster for my setup; 9.2.3 Postgresql.

FWIW，我尝试使用字符数组，到目前为止，我的设置看起来更快；9.2.3 PostgreSQL。

This is the result and desired output.

这是结果和所需的输出。

localt           | entity | status1 | status0
--------------------------+---------+--------
2012-01-01       |   1    |    1    |   2
2012-01-01       |   3    |         |   3
2012-01-02       |   2    |    4    |  
2012-01-02       |   3    |    5    |   6

I'm curious how this performs on a much larger data-set and will report back at a later date.

我很好奇这在更大的数据集上的表现如何，并将在以后报告。

Answer 3

回答by ideamotor

Ok, so I ran this on a table closer to my use case. Either I'm doing it wrong or crosstab is not suitable for my use.

好的，所以我在靠近我的用例的桌子上运行它。要么我做错了，要么交叉表不适合我使用。

First I made some similar data:

首先我做了一些类似的数据：

CREATE TABLE public.test (
    id serial primary key,
    msrmnt integer,
    entity integer,
    localt timestamp,
    val    double precision
);
CREATE INDEX ix_test_msrmnt
   ON public.test (msrmnt);
 CREATE INDEX ix_public_test_201201_entity
   ON public.test (entity);
CREATE INDEX ix_public_test_201201_localt
  ON public.test (localt);
insert into public.test (msrmnt, entity, localt, val)
select *
from(
SELECT msrmnt, entity, localt, random() as val 
FROM generate_series('2012-01-01'::timestamp, '2012-01-01 23:59:00'::timestamp, interval '1 minutes') as localt
join 
(select *
FROM generate_series(1, 50, 1) as msrmnt) as msrmnt
on 1=1
join 
(select *
FROM generate_series(1, 200, 1) as entity) as entity
on 1=1) as data;

Then I ran the crosstab code a couple times:

然后我运行了几次交叉表代码：

explain analyze
SELECT (timestamp 'epoch' + row_name[1] * INTERVAL '1 second')::date As localt, row_name[2] as entity
    ,msrmnt01,msrmnt02,msrmnt03,msrmnt04,msrmnt05,msrmnt06,msrmnt07,msrmnt08,msrmnt09,msrmnt10
    ,msrmnt11,msrmnt12,msrmnt13,msrmnt14,msrmnt15,msrmnt16,msrmnt17,msrmnt18,msrmnt19,msrmnt20
    ,msrmnt21,msrmnt22,msrmnt23,msrmnt24,msrmnt25,msrmnt26,msrmnt27,msrmnt28,msrmnt29,msrmnt30
    ,msrmnt31,msrmnt32,msrmnt33,msrmnt34,msrmnt35,msrmnt36,msrmnt37,msrmnt38,msrmnt39,msrmnt40
    ,msrmnt41,msrmnt42,msrmnt43,msrmnt44,msrmnt45,msrmnt46,msrmnt47,msrmnt48,msrmnt49,msrmnt50
    FROM crosstab('SELECT ARRAY[extract(epoch from localt), entity] as row_name, msrmnt, val
               FROM public.test
               ORDER BY localt, entity, msrmnt',$$VALUES  ( 1::text),( 2::text),( 3::text),( 4::text),( 5::text),( 6::text),( 7::text),( 8::text),( 9::text),(10::text)
                                                         ,(11::text),(12::text),(13::text),(14::text),(15::text),(16::text),(17::text),(18::text),(19::text),(20::text)
                                                         ,(21::text),(22::text),(23::text),(24::text),(25::text),(26::text),(27::text),(28::text),(29::text),(30::text)
                                                         ,(31::text),(32::text),(33::text),(34::text),(35::text),(36::text),(37::text),(38::text),(39::text),(40::text)
                                                         ,(41::text),(42::text),(43::text),(44::text),(45::text),(46::text),(47::text),(48::text),(49::text),(50::text)$$)
        as ct (row_name integer[],msrmnt01 double precision, msrmnt02 double precision,msrmnt03 double precision, msrmnt04 double precision,msrmnt05 double precision, 
                    msrmnt06 double precision,msrmnt07 double precision, msrmnt08 double precision,msrmnt09 double precision, msrmnt10 double precision
                 ,msrmnt11 double precision, msrmnt12 double precision,msrmnt13 double precision, msrmnt14 double precision,msrmnt15 double precision, 
                    msrmnt16 double precision,msrmnt17 double precision, msrmnt18 double precision,msrmnt19 double precision, msrmnt20 double precision
                 ,msrmnt21 double precision, msrmnt22 double precision,msrmnt23 double precision, msrmnt24 double precision,msrmnt25 double precision, 
                    msrmnt26 double precision,msrmnt27 double precision, msrmnt28 double precision,msrmnt29 double precision, msrmnt30 double precision
                 ,msrmnt31 double precision, msrmnt32 double precision,msrmnt33 double precision, msrmnt34 double precision,msrmnt35 double precision, 
                    msrmnt36 double precision,msrmnt37 double precision, msrmnt38 double precision,msrmnt39 double precision, msrmnt40 double precision
                 ,msrmnt41 double precision, msrmnt42 double precision,msrmnt43 double precision, msrmnt44 double precision,msrmnt45 double precision, 
                    msrmnt46 double precision,msrmnt47 double precision, msrmnt48 double precision,msrmnt49 double precision, msrmnt50 double precision)
limit 1000

Obtaining this on the third try:

在第三次尝试时获得这个：

QUERY PLAN
Limit  (cost=0.00..20.00 rows=1000 width=432) (actual time=110236.673..110237.667 rows=1000 loops=1)
  ->  Function Scan on crosstab ct  (cost=0.00..20.00 rows=1000 width=432) (actual time=110236.672..110237.598 rows=1000 loops=1)
Total runtime: 110699.598 ms

Then I ran the standard solution a couple times:

然后我运行了几次标准解决方案：

explain analyze
select localt, entity, 
 max(case when msrmnt =  1 then val else null end) as msrmnt01
,max(case when msrmnt =  2 then val else null end) as msrmnt02
,max(case when msrmnt =  3 then val else null end) as msrmnt03
,max(case when msrmnt =  4 then val else null end) as msrmnt04
,max(case when msrmnt =  5 then val else null end) as msrmnt05
,max(case when msrmnt =  6 then val else null end) as msrmnt06
,max(case when msrmnt =  7 then val else null end) as msrmnt07
,max(case when msrmnt =  8 then val else null end) as msrmnt08
,max(case when msrmnt =  9 then val else null end) as msrmnt09
,max(case when msrmnt = 10 then val else null end) as msrmnt10
,max(case when msrmnt = 11 then val else null end) as msrmnt11
,max(case when msrmnt = 12 then val else null end) as msrmnt12
,max(case when msrmnt = 13 then val else null end) as msrmnt13
,max(case when msrmnt = 14 then val else null end) as msrmnt14
,max(case when msrmnt = 15 then val else null end) as msrmnt15
,max(case when msrmnt = 16 then val else null end) as msrmnt16
,max(case when msrmnt = 17 then val else null end) as msrmnt17
,max(case when msrmnt = 18 then val else null end) as msrmnt18
,max(case when msrmnt = 19 then val else null end) as msrmnt19
,max(case when msrmnt = 20 then val else null end) as msrmnt20
,max(case when msrmnt = 21 then val else null end) as msrmnt21
,max(case when msrmnt = 22 then val else null end) as msrmnt22
,max(case when msrmnt = 23 then val else null end) as msrmnt23
,max(case when msrmnt = 24 then val else null end) as msrmnt24
,max(case when msrmnt = 25 then val else null end) as msrmnt25
,max(case when msrmnt = 26 then val else null end) as msrmnt26
,max(case when msrmnt = 27 then val else null end) as msrmnt27
,max(case when msrmnt = 28 then val else null end) as msrmnt28
,max(case when msrmnt = 29 then val else null end) as msrmnt29
,max(case when msrmnt = 30 then val else null end) as msrmnt30
,max(case when msrmnt = 31 then val else null end) as msrmnt31
,max(case when msrmnt = 32 then val else null end) as msrmnt32
,max(case when msrmnt = 33 then val else null end) as msrmnt33
,max(case when msrmnt = 34 then val else null end) as msrmnt34
,max(case when msrmnt = 35 then val else null end) as msrmnt35
,max(case when msrmnt = 36 then val else null end) as msrmnt36
,max(case when msrmnt = 37 then val else null end) as msrmnt37
,max(case when msrmnt = 38 then val else null end) as msrmnt38
,max(case when msrmnt = 39 then val else null end) as msrmnt39
,max(case when msrmnt = 40 then val else null end) as msrmnt40
,max(case when msrmnt = 41 then val else null end) as msrmnt41
,max(case when msrmnt = 42 then val else null end) as msrmnt42
,max(case when msrmnt = 43 then val else null end) as msrmnt43
,max(case when msrmnt = 44 then val else null end) as msrmnt44
,max(case when msrmnt = 45 then val else null end) as msrmnt45
,max(case when msrmnt = 46 then val else null end) as msrmnt46
,max(case when msrmnt = 47 then val else null end) as msrmnt47
,max(case when msrmnt = 48 then val else null end) as msrmnt48
,max(case when msrmnt = 49 then val else null end) as msrmnt49
,max(case when msrmnt = 50 then val else null end) as msrmnt50
from sample
group by localt, entity
limit 1000

Obtaining this on the third try:

在第三次尝试时获得这个：

QUERY PLAN
Limit  (cost=2257339.69..2270224.77 rows=1000 width=24) (actual time=19795.984..20090.626 rows=1000 loops=1)
  ->  GroupAggregate  (cost=2257339.69..5968242.35 rows=288000 width=24) (actual time=19795.983..20090.496 rows=1000 loops=1)
        ->  Sort  (cost=2257339.69..2293339.91 rows=14400088 width=24) (actual time=19795.626..19808.820 rows=50001 loops=1)
              Sort Key: localt
              Sort Method: external merge  Disk: 478568kB
              ->  Seq Scan on sample  (cost=0.00..249883.88 rows=14400088 width=24) (actual time=0.013..2245.247 rows=14400000 loops=1)
Total runtime: 20197.565 ms

So, for my case, it appears so far that crosstab is not a solution. And this is just one day when I'll have multiple years. In fact, I will probably have to go with wide format (not normalized) tables, despite the fact that which measurements are made for entities is variable and new ones are introduced, but I won't go into that here.

因此，就我而言，到目前为止，交叉表似乎不是解决方案。这只是我将拥有多年的一天。事实上，我可能不得不使用宽格式（非规范化）表，尽管为实体进行的测量是可变的并且引入了新的测量，但我不会在这里讨论。

Here's some of my settings using Postgres 9.2.3:

这是我使用 Postgres 9.2.3 的一些设置：

name                    setting
max_connections             100
shared_buffers          2097152
effective_cache_size    6291456
maintenance_work_mem    1048576
work_mem                 262144

SQL 使用 Tablefunc 在多列上透视

提问by ideamotor

采纳答案by Erwin Brandstetter

Reiterate

重申

For the different setup in your answer:

对于答案中的不同设置：

回答by ideamotor

回答by ideamotor

相关推荐

最近更新

标签

SQL 使用 Tablefunc 在多列上透视

提问by ideamotor

采纳答案by Erwin Brandstetter

Reiterate

重申

For the different setup in your answer:

对于答案中的不同设置：

回答by ideamotor

回答by ideamotor

相关推荐

SQL DB2：使用选择插入，每次插入时将每个新行的列增加一？

SQL 使用带条件的内部连接的 Doctrine 查询构建器

SQL 插入语句中表名的别名

SQL 插入多行时，语句“select 1 from dual”有什么作用？

相关推荐

最近更新

标签