Notice: this page is based on a Chinese-English translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/2229420/

Time: 2020-09-01 05:19:46  Source: igfitidea

Database for super-fast querying

Tags: sql, search, nosql

Asked by Anton Gogolev

We have a 300 GB+ data array we'd like to query as fast as possible. Traditional SQL databases (specifically, SQL Server) cannot handle this volume as effectively as we need (for example, performing a select with 10-20 conditions in the where clause in less than 10 seconds), so I'm investigating other solutions for this problem.

I've been reading about NoSQL and this whole thing looks promising, but I'd prefer to hear from those who have used it in real life.

What can you suggest here?

EDIT to clarify what we're after.

We're a company developing an app whereby users can search for tours and perform bookings of said tours, paying for them with their plastic cards. This whole thing can surely be Russia-specific, so bear with me.

When a user logs on to the site, she is presented with a form similar to this:

alt text http://queenbee.alponline.ru/searchform.png

Here, the user selects where she leaves from and where she goes to, dates, duration and all that.

After hitting "Search", a request goes to our DB server, which cannot handle such load: queries include various kinds of parameters. Sharding doesn't work well either.

So what I'm after is some kind of pseudo-database which can do lightning-fast queries.

Answered by ConcernedOfTunbridgeWells

If you want to do ad-hoc queries for reporting or analysis you're probably better off using something that will play nicely with off-the-shelf reporting tools. Otherwise you are likely to find yourself getting dragged off all the time to write little report programs to query the data. This is a strike against NoSQL type databases, but it may or may not be an issue depending on your circumstances.

300GB should not be beyond the capabilities of modern RDBMS platforms, even MS SQL Server. Some other options for large database queries of this type are:

  • See if you can use an SSAS cube and aggregations to mitigate your query performance issues. Usage-based optimisation might get you adequate performance without having to get another database system. SSAS can also be used in shared-nothing configurations, allowing you to stripe your queries across a cluster of relatively cheap servers with direct-attach disks. Look at ProClarity for a front-end if you do go this way.

  • Sybase IQ is an RDBMS platform that uses an underlying data structure optimised for reporting queries. It has the advantage that it plays nicely with a reasonable variety of conventional reporting tools. Several other systems of this type exist, such as Red Brick, Teradata or Greenplum (which uses a modified version of PostgreSQL). The principal strike against these systems is that they are not exactly mass market items and can be quite expensive.

  • Microsoft has a shared-nothing version of SQL Server in the pipeline, which you might be able to use. However they've tied it to third party hardware manufacturers so you can only get it with dedicated (and therefore expensive) hardware.

  • Look for opportunities to build data marts with aggregated data to reduce the volumes for some of the queries.

  • Look at tuning your hardware. Direct attach SAS arrays and RAID controllers can put through streaming I/O of the sort used in table scans pretty quickly. If you partition your tables over a large number of mirrored pairs you can get very fast streaming performance - easily capable of saturating the SAS channels.

    Practically, you're looking at getting 10-20GB/sec from your I/O subsystem if you want the performance targets you describe, and it is certainly possible to do this without resorting to really exotic hardware.
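
As a rough illustration of the throughput figure above, scan time is simply the amount of data read divided by sustained throughput. The 300 GB size and the 10-20GB/sec range come from this thread; the fractions of the data touched per query are assumptions made up for the sketch.

```python
# Back-of-envelope scan-time estimate: seconds = gigabytes read / (GB per second).
# DATASET_GB and the throughput range come from the thread; the fractions of
# the data touched per query are illustrative assumptions.

def scan_seconds(gigabytes_read: float, gb_per_sec: float) -> float:
    """Time to stream `gigabytes_read` at a sustained `gb_per_sec`."""
    return gigabytes_read / gb_per_sec

DATASET_GB = 300

for throughput in (10, 20):
    for fraction in (1.0, 0.5):
        gb = DATASET_GB * fraction
        secs = scan_seconds(gb, throughput)
        print(f"{gb:.0f} GB at {throughput} GB/sec: {secs:.1f} s")
```

Under these assumptions a full 300 GB scan at 10GB/sec takes 30 seconds, so a 10-second target at that rate means each query may touch about a third of the data at most, which is where partitioning and aggregation come in.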

Answered by Andrew

I'm not sure I would agree that traditional SQL databases cannot handle these volumes. I can query much larger datasets within those timeframes, but the database was designed specifically for that kind of work and placed on suitable hardware, in particular an I/O subsystem designed to handle large data requests.

Answered by HLGEM

A properly set up SQL Server should be able to handle data in the terabytes without having performance problems. I have several friends who manage SQL Server databases of that size with no performance issues.

Your problem may be one or more of these:

  • Inadequate server specs
  • Lack of good partitioning
  • Poor indexing
  • Poor database design
  • Poor query design including using tools like LINQ which may write poorly performing code for a database that size.

It assuredly is NOT the ability of SQL Server to handle these loads. If you have a database that size, you need to hire a professional DBA with experience in optimizing large systems.

Answered by MarkR

I expect a "conventional" database can do what you want, provided you structure your data appropriately for the queries you're doing.

You may find that in order to generate reports respectably, you need to summarise your data as it is generated (or loaded, transformed etc) and report off the summary data.
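
A minimal sketch of the summarise-at-load idea, using Python's built-in `sqlite3` as a stand-in for whatever database you run. The `sales` and `sales_daily` tables and their columns are hypothetical, invented only for this illustration.

```python
import sqlite3

# In-memory SQLite stands in for the real database (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, day TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("north", "2010-02-01", 100.0),
        ("north", "2010-02-01", 50.0),
        ("south", "2010-02-01", 70.0),
        ("north", "2010-02-02", 30.0),
    ],
)

# Summarise once, at load time, into a much smaller table.
conn.execute(
    """
    CREATE TABLE sales_daily AS
    SELECT region, day, COUNT(*) AS n_rows, SUM(amount) AS total
    FROM sales
    GROUP BY region, day
    """
)

# Reports then read the summary table instead of the detail rows.
report = conn.execute(
    "SELECT region, day, n_rows, total FROM sales_daily ORDER BY region, day"
).fetchall()
for line in report:
    print(line)
```

The report query reads three summary rows here instead of four detail rows; on real data the ratio is what matters, since the summary can be orders of magnitude smaller than the detail table.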

The speed of a SELECT is usually not directly related to the number of conditions in the WHERE clause; it has to do with the explain plan and the number of rows examined. There are tools which will analyse this for you.
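
To make the point about plans concrete, here is a small sketch using SQLite's `EXPLAIN QUERY PLAN` through Python's `sqlite3` module. SQL Server has its own execution plan tooling; SQLite is used here only because it runs anywhere. The `tours` table, its columns, and the index name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tours (id INTEGER PRIMARY KEY, country TEXT, nights INTEGER, price REAL)"
)
conn.executemany(
    "INSERT INTO tours (country, nights, price) VALUES (?, ?, ?)",
    [("TR" if i % 2 else "EG", i % 14 + 1, 100.0 + i) for i in range(1000)],
)

query = "SELECT id FROM tours WHERE country = 'TR' AND nights = 7"

def plan(sql):
    """Return the human-readable detail column of SQLite's query plan."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Same query, same WHERE clause: only the available index changes.
before = plan(query)
conn.execute("CREATE INDEX idx_tours_country_nights ON tours (country, nights)")
after = plan(query)

print("without index:", before)
print("with index:   ", after)
```

The query text is identical in both runs; what changes is whether the engine scans every row or searches an index, which is what the plan shows.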

Ultimately, at 300G (which is not THAT big) you will probably need to keep some of your data on disc (=slow) at least some of the time so you want to start reducing the number of IO operations required. Reducing IO operations may mean making covering indexes, summary tables and copies of data with differing clustered indexes. This makes your 300G bigger, but who cares.
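
A covering index is one that contains every column the query needs, so the engine never has to visit the base table. The sketch below shows the difference in SQLite via Python's `sqlite3`; the `tours` table and both index names are hypothetical, and other engines report this differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tours (id INTEGER PRIMARY KEY, country TEXT, nights INTEGER, price REAL)"
)
conn.executemany(
    "INSERT INTO tours (country, nights, price) VALUES (?, ?, ?)",
    [("TR" if i % 2 else "EG", i % 14 + 1, 100.0 + i) for i in range(1000)],
)

QUERY = "SELECT price FROM tours WHERE country = 'TR'"

def plan_details(sql):
    """Join the detail column of SQLite's query plan into one string."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Index on the filter column only: each match still needs a table lookup for price.
conn.execute("CREATE INDEX idx_country ON tours (country)")
narrow_plan = plan_details(QUERY)

# Index that also carries price: the query is answered from the index alone.
conn.execute("DROP INDEX idx_country")
conn.execute("CREATE INDEX idx_country_price ON tours (country, price)")
covering_plan = plan_details(QUERY)

print(narrow_plan)
print(covering_plan)
```

The wider index takes more space, which is the trade the answer describes: more storage in exchange for fewer IO operations per query.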

IO ops are king :)

Clearly doing these things is very expensive in terms of developer time, so you should start by throwing lots of hardware at the problem, and only try to fix it with software once that becomes insufficient. Lots of RAM is a start (but it won't be able to store more than 10-20% of your data set at a time at current cost-effective levels). Even SSDs are not that expensive these days.

Answered by Peter M

From what little I understand, traditional RDBMSs are row-based, which optimizes for insertion speed. Retrieval speed, however, is best optimized with a column-based storage system.
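
A toy in-memory model of the two layouts, only to illustrate the trade-off. It is not how any real engine is implemented, and the sample records and field names are made up.

```python
# Row-oriented: each record is stored together.
rows = [
    (1, "TR", 7, 450.0),
    (2, "EG", 10, 520.0),
    (3, "TR", 14, 610.0),
]
names = ("id", "country", "nights", "price")

# Column-oriented: each field is stored together.
columns = {name: [row[i] for row in rows] for i, name in enumerate(names)}

# Aggregating one field: the row layout walks every whole record,
# the column layout reads a single contiguous list.
total_row_store = sum(row[3] for row in rows)
total_column_store = sum(columns["price"])

# Inserting one record: a single append for rows, one append per column otherwise.
new_record = (4, "EG", 7, 480.0)
rows.append(new_record)
for name, value in zip(names, new_record):
    columns[name].append(value)

print(total_row_store, total_column_store, len(rows), len(columns["price"]))
```

Reading one field out of many is where the column layout wins, and writing a whole record is where the row layout wins, which matches the insertion versus retrieval split described above.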

See Column oriented DBMS for a more thorough explanation than I could give.

Answered by David Schmitt

That really depends on what clauses you have in your WHERE and what kind of projection you need on your data.

It might be good enough to create the appropriate index on your table.

Also, even an optimal data structure is of no use if you have to read 100GB per query, as that will take its time too.

Answered by Quassnoi

NoSQL, as you may have read, is not a relational database.

It is a database which stores key-value pairs which you can traverse using a proprietary API.

This implies you will need to define the physical layout of the data yourself, as well as do any code optimizations.
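
To give a feel for what defining the layout yourself means, here is a sketch where a plain Python `dict` stands in for the key-value store. It is not BerkeleyDB's API, and the key scheme (`tour:<id>`, `idx:country:<code>`) and helper names are hypothetical.

```python
# A plain dict stands in for the key-value store (an assumption of this sketch).
store = {}

def put_tour(tour_id, country, nights, price):
    """Write the record and maintain a secondary index by hand."""
    store[f"tour:{tour_id}"] = {"country": country, "nights": nights, "price": price}
    # No query planner here: any lookup other than by primary key
    # needs an index that the application writes and keeps in sync itself.
    store.setdefault(f"idx:country:{country}", []).append(tour_id)

def tours_in(country):
    """Look up tours by country through the hand-built index."""
    ids = store.get(f"idx:country:{country}", [])
    return [store[f"tour:{tour_id}"] for tour_id in ids]

put_tour(1, "TR", 7, 450.0)
put_tour(2, "EG", 10, 520.0)
put_tour(3, "TR", 14, 610.0)

print(tours_in("TR"))
```

Every additional search criterion means another index of this kind, designed and maintained in application code, which is the work a relational engine would otherwise do for you.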

I'm quite outdated on this, but several years ago I participated in a BerkeleyDB project dealing with a slightly smaller but still high volume of data (about 100 GB).

It was perfectly OK for our needs.

Please also note, though it may seem obvious to you, that the queries can be optimized. Could you please post the query you use here?

Answered by Kokizzu

Try ClickHouse: its benchmark results are faster in most cases, even compared with MemSQL, but you cannot update a record, only insert/delete.
