关于在 Postgresql 中存储 Lat / Lng 坐标(列类型)

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/21368385/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-10-21 01:17:08  来源:igfitidea点击:

Regarding storing Lat / Lng coordinates in Postgresql (Column type)

performancepostgresqltypescoordinates

提问by alexisdevarennes

I am relatively new to postgresql which is why I turn to those more experienced with it than I.

我对 postgresql 比较陌生,这就是为什么我求助于那些比我更有经验的人。

I am storing coordinates in a postgresql database.

我将坐标存储在 postgresql 数据库中。

They look like this: 35.21076593772987,11.22855348629825 35.210780222605616,11.22826420209139 35.210777635062875,11.228241328291957 35.210766843596794,11.228219799676775 35.210765045075604,11.228213072050166 35.21076234732945,11.228200962345223 35.21076324691649,11.228186161764323 35.21077314123606,11.228083902231146 35.210863083636866,11.227228492401766

它们看起来像这样: 35.21076593772987,11.22855348629825 35.210780222605616,11.22826420209139 35.210777635062875,11.228241328291957 35.210766843596794,11.228219799676775 35.210765045075604,11.228213072050166 35.21076234732945,11.228200962345223 35.21076324691649,11.228186161764323 35.21077314123606,11.228083902231146 35.210863083636866,11.227228492401766

They can range in length from around 800characters up to 7000.

它们的长度范围可以从大约800 个字符到7000字符。

They always include:

它们总是包括:

  • Numbers (0-9)
  • Spaces ( )
  • Punctuation marks and commas (. ,)
  • 数字 ( 0-9)
  • 空格 ( )
  • 标点符号和逗号 ( . ,)

But they can also include:

但它们也可以包括:

  • Vertical bars ( |)
  • 竖线 ( |)

Right now I am storing them as TEXT, but to my understanding TEXTis stored externally which has an effect on performance. Would you recommend switching to another column type? If so, which one?

现在我将它们存储为TEXT,但据我了解TEXT存储在外部,这对性能有影响。您会建议切换到另一种列类型吗?如果有,是哪一个?

Thank you very much.

非常感谢你。

回答by Craig Ringer

Why not use PostGIS for this?

为什么不为此使用 PostGIS?

You're overlooking what's possibly the ideal storage for this kind of data - PostGIS's data types, particularly the geographytype.

您忽略了这种数据的理想存储可能是什么 - PostGIS 的数据类型,尤其是geography类型。

SELECT ST_GeogFromText('POINT(35.21076593772987 11.22855348629825)');

By using geographyyou're storing your data in a representative type that supports all sorts of powerful operations and indexes on the type. Of course, that's only one point; I strongly suspect your data is actually a lineor a shapein which case you should use the appropriate PostGIS geography constructorand input format.

通过使用,geography您将数据存储在支持该类型的各种强大操作和索引的代表性类型中。当然,那只是一个point;我强烈怀疑您的数据实际上是一条线或一个形状,在这种情况下,您应该使用适当的 PostGIS 地理构造函数和输入格式。

The big advantage to using geographyis that it's a type designed specifically for asking real world questions about things like distance, "within", etc; you can use things like ST_Distance_Spheroidto get real earth-distance between points.

使用的最大优点geography是它是一种专为询问有关距离、“内部”等现实世界问题而设计的类型;您可以使用诸如ST_Distance_Spheroid获取点之间的真实地球距离之类的东西。

Avoiding PostGIS?

避免 PostGIS?

If you want to avoid PostGIS, and just store it with native types, I'd recommend an array of point:

如果您想避免使用 PostGIS,而只是将其存储为本机类型,我建议您使用以下数组point

postgres=> SELECT ARRAY[
     point('35.21076593772987','11.22855348629825'), 
     point('35.210780222605616','11.22826420209139'), 
     point('35.210777635062875','11.228241328291957') 
];
                                                       array                                                        
--------------------------------------------------------------------------------------------------------------------
 {"(35.2107659377299,11.2285534862982)","(35.2107802226056,11.2282642020914)","(35.2107776350629,11.228241328292)"}
(1 row)

... unless your points actually represent a lineor shapein which case, use the appropriate type - pathor polygonrespectively.

...除非您的点实际上代表一条线形状,在这种情况下,请使用适当的类型 -pathpolygon分别使用。

This remains a useful compact representation - much more so than textin fact - that is still easily worked with within the DB.

这仍然是一个有用的紧凑表示 - 比text实际上更重要 - 在数据库中仍然很容易使用。

Compare storage:

比较存储:

CREATE TABLE points_text AS SELECT '35.21076593772987,11.22855348629825 35.210780222605616,11.22826420209139 35.210777635062875,11.228241328291957 35.210766843596794,11.228219799676775 35.210765045075604,11.228213072050166 35.21076234732945,11.228200962345223 35.21076324691649,11.228186161764323 35.21077314123606,11.228083902231146 35.210863083636866,11.227228492401766'::text AS p

postgres=> SELECT pg_column_size(points_text.p) FROM points_text;
 pg_column_size 
----------------
            339
(1 row)

CREATE TABLE points_array AS
SELECT array_agg(point(px)) AS p from points_text, LATERAL regexp_split_to_table(p, ' ') split(px);

postgres=> SELECT pg_column_size(p) FROM points_array;
 pg_column_size 
----------------
            168
(1 row)

pathis even more compact, and probably a truer way to model what your data really is:

path更紧凑,并且可能是一种更真实的方式来模拟您的数据的真实情况

postgres=> SELECT pg_column_size(path('35.21076593772987,11.22855348629825 35.210780222605616,11.22826420209139 35.210777635062875,11.228241328291957 35.210766843596794,11.228219799676775 35.210765045075604,11.228213072050166 35.21076234732945,11.228200962345223 35.21076324691649,11.228186161764323 35.21077314123606,11.228083902231146 35.210863083636866,11.227228492401766'));
 pg_column_size 
----------------
             96
(1 row)

unless it's a closed shape, in which case use polygon.

除非它是一个封闭的形状,在这种情况下使用polygon.

Don't...

别...

Either way, please don't just model this as text. It'll make you cry later, when you're trying to solve problems like "how do I determine if this point falls within x distance of the path in this column". PostGIS makes this sort of thing easy, but only if you store your data sensibly in the first place.

无论哪种方式,请不要只是将其建模为文本。稍后,当您尝试解决诸如“我如何确定此点是否位于此列中路径的 x 距离之内”之类的问题时,它会让您哭泣。PostGIS 使这种事情变得容易,但前提是您首先明智地存储数据。

See this closely related question, which discusses the good reasons notto just shove stuff in textfields.

看到这个密切相关的问题,它讨论了只是在text字段中塞东西的充分理由。

Also don't worry too much about in-line vs out-of-line storage. There isn't tons you can do about it, and it's something you should be dealing with only once you get the semantics of your data model right.

也不要太担心在线与离线存储。对此您无能为力,而且只有在您获得正确的数据模型语义后才应该处理它。

回答by harmic

All of the character types(TEXT, VARCHAR, CHAR) behave similarly from a performance point of view. They are normally stored in-line in the table row, unless they are very large, in which case they may be stored in a separate file (called a TOAST file).

从性能的角度来看,所有字符类型(TEXT、VARCHAR、CHAR)的行为都相似。它们通常在线存储在表行中,除非它们非常大,在这种情况下它们可能存储在单独的文件中(称为 TOAST 文件)。

The reasons for this are:

原因如下:

  1. Table rows have to be able to fit inside the database page size (8kb by default)

  2. Having a very large field in a row stored inline would make it slower to access other fields in the table. Imagine a table which contains two columns - a filename and the file content - and you wanted to locate a particular file. If you had the file content stored inline, then you would have to scan every file to find the one you wanted. (Ignoring the effect of indexes that might exist for this example).

  1. 表行必须能够适应数据库页面大小(默认为 8kb)

  2. 在行中存储一个非常大的字段会使访问表中的其他字段变慢。想象一个包含两列(文件名和文件内容)的表,您想定位一个特定的文件。如果您将文件内容内联存储,那么您必须扫描每个文件以找到您想要的文件。(忽略本示例中可能存在的索引的影响)。

Details of TOAST storage can be found here. Note that out of line storage is not the only strategy - the data may be compressed and/or stored out of line.

可以在此处找到 TOAST 存储的详细信息。请注意,离线存储并不是唯一的策略——数据可能会被压缩和/或离线存储。

TOAST-ing kicks in when a row exceeds a threshold (2kb by default), so it is likely that your rows will be affected by this since you state they can be up to 7000 chars (although it might be that most of them are only compressed, not stored out of line).

当一行超过阈值(默认为 2kb)时,TOAST 会启动,因此您的行很可能会受到此影响,因为您声明它们最多可以有 7000 个字符(尽管它们中的大多数可能只是压缩,而不是存储在行外)。

You can affect how tables are subjected to this treatment using the command ALTER TABLE ... SET STORAGE.

您可以使用命令ALTER TABLE ... SET STORAGE来影响表如何进行这种处理。

This storage strategy applies to all of the data types which you might use to store the type of data you are describing. It would take a better knowledge of your application to make reliable suggestions for other strategies, but here are some ideas:

此存储策略适用于您可能用于存储您所描述的数据类型的所有数据类型。需要对您的应用程序有更好的了解才能为其他策略提供可靠的建议,但这里有一些想法:

  • It might be better to re-factor the data - instead of storing all of the co-ordinates into a large string and processing it in your application, store them as individual rows in a referenced table. Since in any case your application is splitting and parsing the data into co-ordinate pairs for use, letting the database do this for you makes a kind of sense.

    This would particularly be a good idea if subsets of the data in each co-ordinate set need to be selected or updated instead of always consumed or updated in a single operation, or if doing so allowed you to index the data more effectively.

  • Since we are talking about co-ordinate data, you could consider using PostGIS, an extension for PostgreSQL which specifically caters for this kind of data. It also includes operators allowing you to filter rows which are, for example, inside or outside bounding boxes.

  • 重新分解数据可能会更好 - 与其将所有坐标存储到一个大字符串中并在您的应用程序中处理它,不如将它们作为单独的行存储在引用表中。由于在任何情况下您的应用程序都将数据拆分并解析为坐标对以供使用,因此让数据库为您执行此操作是有意义的。

    如果需要选择或更新每个坐标集中的数据子集而不是总是在单个操作中使用或更新,或者如果这样做可以让您更有效地索引数据,这将是一个特别好的主意。

  • 由于我们谈论的是坐标数据,您可以考虑使用PostGIS,它是 PostgreSQL 的一个扩展,专门针对此类数据。它还包括允许您过滤位于边界框内部或外部的行的运算符。

回答by Floris

Don't focus on the fact that these numbers are coordinates. Instead, notice that they are strings of numbers in a very limited range, and all of roughly the same magnitude. You are most likely interested in how these numbers change (looks like a trajectory of an object off the coast of Tunisia if I just punch these coordinates into a map).

不要关注这些数字是坐标这一事实。相反,请注意它们是非常有限范围内的数字串,并且所有数字都大致相同。您最有可能对这些数字如何变化感兴趣(如果我只是将这些坐标打入地图,它看起来就像突尼斯海岸附近物体的轨迹)。

I would recommend that you convert the numbers to double precision (53 bits of precision ~ 9 parts in 10^15 - close to the LSD of your numbers), and subtract each value from the first value in the series. This will result in much smaller numbers being stored, and greater relative accuracy. You could get away with storing the differences as long integers, probably (multiplying appropriately) but it will be faster to keep them as doubles.

我建议您将数字转换为双精度(53 位精度 ~ 10^15 中的 9 个部分 - 接近数字的 LSD),并从系列中的第一个值中减去每个值。这将导致存储的数字小得多,并且相对准确度更高。您可以将差异存储为长整数,可能(适当相乘),但将它们保留为双精度会更快。

And if you just take each 'trajectory' (I am just calling a collection of GPS points a trajectory, I have no idea if that is what they represent in your case) and give it a unique ID, then you can have a table with columns:

如果您只是采用每个“轨迹”(我只是将 GPS 点的集合称为轨迹,我不知道这是否是它们在您的情况下代表的内容)并为其指定一个唯一 ID,那么您就可以拥有一张表列:

unique ID  |  trajectory ID  |     latitude      |      longitude
   1              1            11.2285534862982     35.2107802226056
   2              1            11.2282642020913     35.2107776350628
   3              1            11.2282413282919     35.2107668435967
   4              1            11.2282197996767     35.2107650450756
   5              1            11.2282130720501     35.2107623473294
   6              1            11.2282009623452     35.2107632469164
   7              1            11.2281861617643     35.2107731412360
   8              1            11.2280839022311     35.2108630836368

Conversion from text to string is MUCH slower than you think - it requires many operations. If you end up using the data as numbers, I highly recommend storing them as numbers...

从文本到字符串的转换比你想象的要慢得多——它需要很多操作。如果您最终将数据用作数字,我强烈建议将它们存储为数字...