Original URL: http://stackoverflow.com/questions/8987727/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Advantages of databases like Greenplum or Vertica compared to MongoDB or Cassandra
Asked by H6.
I am currently working on a few projects with MongoDB and Apache Cassandra respectively. I am also using Solr a lot, and I am handling "lots" of data with them (approx. 1-2 TB). I heard of Greenplum and Vertica for the first time last week, and I am not really sure where to place them in my head. They seem to me like data warehouse (DWH) solutions, and I haven't really worked with DWHs. They also seem to cost a lot of money (e.g. $60k for 1 TB of storage in Greenplum). I am currently not handling petabytes of data and don't think I will, but products like Cassandra also seem to be able to handle this:
Cassandra is the acknowledged NoSQL leader when it comes to comfortably scaling to terabytes or petabytes of data.
So my question: Why should people use Greenplum & Co? Is there a huge advantage in comparison to these other products?
Thanks.
Answered by serbaut
Cassandra, Greenplum and Vertica all handle huge amounts of data but in very different ways.
Some made-up use cases where each database has its strengths:
Use Cassandra for:
tweets.insert(key:user, data:blob);
tweets.get(key:user)
Use Greenplum for:
begin;
update account set balance = balance - 10 where account_id = 1;
update account set balance = balance + 10 where account_id = 2;
commit;
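The transfer above depends on atomicity: either both updates apply or neither does. Below is a minimal sketch of that behaviour, using Python's built-in sqlite3 module as a stand-in for Greenplum, purely so the example is self-contained. The table layout, the starting balances, the overdraft rule, and the `transfer` helper are all hypothetical and not part of the original answer.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` between two accounts atomically (hypothetical helper)."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE account_id = ?",
                (amount, src),
            )
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE account_id = ?",
                (amount, dst),
            )
            # Illustrative rule (an assumption): refuse to overdraw.
            # Raising here undoes BOTH updates, not just the first one.
            (balance,) = conn.execute(
                "SELECT balance FROM account WHERE account_id = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False


def balances(conn):
    """Return {account_id: balance} for every account."""
    return dict(conn.execute("SELECT account_id, balance FROM account"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (account_id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 15), (2, 0)])
conn.commit()

first = transfer(conn, 1, 2, 10)   # enough funds, so this commits
second = transfer(conn, 1, 2, 10)  # would overdraw account 1, so it rolls back
```

After the failed second transfer, account 2 has not been credited either, which is the property the `begin; ... commit;` block is meant to guarantee.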
Use Vertica for:
select sum(balance)
over (partition by region order by account rows unbounded preceding)
from transactions;
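To make the window query above concrete, here is a rough sketch in plain Python of what it computes: a running total of `balance` within each region, ordered by account. The sample rows are invented for illustration, and the sketch returns the region and account next to each total for readability, whereas the SQL as written returns only the sum column.

```python
from itertools import accumulate, groupby
from operator import itemgetter

# Hypothetical rows: (region, account, balance)
transactions = [
    ("east", 2, 50),
    ("west", 1, 20),
    ("east", 1, 100),
    ("west", 2, 30),
    ("east", 3, 25),
]


def running_balance(rows):
    """Emulate sum(balance) over (partition by region order by account
    rows unbounded preceding) for an in-memory list of rows."""
    # Sorting by (region, account) gives both the partitioning and the ordering.
    ordered = sorted(rows, key=itemgetter(0, 1))
    result = []
    for region, group in groupby(ordered, key=itemgetter(0)):
        group = list(group)
        # accumulate() yields the running sum from the first row of the partition.
        totals = accumulate(balance for _, _, balance in group)
        result.extend(
            (region, account, total)
            for (_, account, _), total in zip(group, totals)
        )
    return result


totals = running_balance(transactions)
```

A columnar analytic database runs this kind of scan-and-aggregate over very large tables, which is a different workload from the single-key lookups and short transactions in the other two examples.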
Answered by Arun
I work in the telecom industry. We deal with large data sets and complex EDW (enterprise data warehouse) models. We started with Teradata, and it was good for a few years. Then the data grew exponentially, and as you know, expanding Teradata is expensive. So we evaluated the alternatives, namely EMC Greenplum, Oracle Exadata, HP Vertica, and IBM Netezza.
Speed: ranked by the time taken to generate 20 reports: 1. Vertica, 2. Netezza, 3. Greenplum, 4. Oracle.
Compression ratio: Vertica had a natural advantage. Of the others, IBM is good too. The worst in the benchmarks were EMC and Oracle, which is to be expected, since both want to sell tons of storage and hardware.
Scalability: All do scale well.
Loading time: EMC is the best here; the others (Teradata, Vertica, Oracle, IBM) are good too.
Concurrent user queries: Vertica, then EMC Greenplum, and only then IBM. Oracle Exadata is comparatively slow for every type of query, but much better than old-school 10g.
Price: Teradata > Oracle > IBM > HP > EMC
Note: you need to compare apples to apples: the same number of cores, RAM, data volume, and reports.
We chose Vertica for its hardware-independent pricing model, lower price, and good performance. Now all 40+ users are happy to generate reports without waiting, and it all fits on low-cost HP DL380 servers. It is great for the OLAP/EDW use case.
All this analysis applies only to the EDW/analytics/OLAP case. I am still an Oracle fanboy for all OLTP, rich PL/SQL, connectivity, etc. on any hardware or system. Exadata handles a mixed workload decently, but its price/performance ratio is unreasonable, and you still need to migrate 10g code to Exadata best practices (MPP-like design, bulk processing, etc.), which is more time-consuming than they claim.
Answered by kimbo305
We've been working in Hadoop for 4 years, and Vertica for 2. We had massive loading and indexing problems with our tables in MySQL. We were running on fumes with our home-grown sharding solution. We could have invested heavily in developing a more sophisticated sharding solution, which would have been quite painful, imo. We could have thought harder about what data we absolutely needed to keep in a SQL database.
But at the end of the day, switching from MySQL to Vertica was what we chose. Vertica performance patterns are quite different from MySQL's, which comes with its own headaches. But it can load a lot of data very quickly, and it is good at heavy duty queries that would make MySQL's head spin.
The way I see it, Vertica is a solution when you are already invested in SQL and need a heavier-duty SQL database. I'm not an expert, so I couldn't tell you what a transition to Oracle or DB2 would have been like compared to Vertica, either in terms of integration effort or monetary cost.
Vertica offers a lot of features we've barely looked into. Those might be very attractive to others with use cases different to ours.
Answered by geoffrobinson
I'm a Vertica DBA and prior to that was a developer with Vertica. Michael Stonebraker (the guy behind Ingres, Vertica, and other databases) has some critiques of NoSQL that are worth listening to.
Basically, here are the advantages of Vertica as I see them:
- It's rather fast on large amounts of data.
- Its performance is similar (as far as I can gather) to other data warehousing solutions, but its advantage is clustering and commodity hardware. So you can scale by adding more commodity hardware. It looks cheap in terms of overall cost per TB. (Going from memory, not an exact quote.)
- Again, it's for data warehousing.
- You get to use traditional SQL and tables. It's what's under the hood that's different.
I can't speak to the other products, but I'm sure a lot of them are fine too.
Edit: Here's a talk from Stonebraker: http://www.slideshare.net/Dataversity/newsql-vs-nosql-for-new-oltp-michael-stonebraker-voltdb
Answered by Steve Wright
Pivotal, formerly Greenplum, is the well-funded spinoff from EMC, VMware and GE. Pivotal's market is enterprises (and homeland cybersecurity agencies) with multi-petabyte databases needing complex analytics and high-speed ETL. Greenplum originated as a PostgreSQL DB redesigned for MapReduce-style MPP, with later additions for columnar support and HDFS. It marries the best of SQL and NoSQL, making NewSQL.
Features:
- In 2015H1 most of their code, including Greenplum DB & HAWQ, will go Open Source. Some advanced management & performance features at the top of the stack will remain proprietary.
- MPP (Massively Parallel Processing) shared-nothing RDBMS designed for multi-terabyte to multi-petabyte environments.
- Full SQL compliance: supports all versions of SQL ('92, '99, 2003 OLAP, etc.) and is 100% compatible with PostgreSQL 8.2. It is the only SQL-over-Hadoop offering capable of handling all 99 queries used by the TPC-DS benchmark standard without rewriting. The competition cannot run many of them and is significantly slower. SIGMON whitepaper.
- ACID compliance.
- Supports data stored in HDFS, Hive, HBase, Avro, ProtoBuf, Delimited Text and Sequence Files.
- Solr/Lucene integration for multi-lingual full-text search embedded in the SQL.
- Incorporates Open Source Software: Spring, Cloud Foundry, Redis.io, RabbitMQ, Grails, Groovy, Open Chorus, Pig, ZooKeeper, Mahout, MADlib, MapR. Some of these are used at EBSCO.
- Native connectivity to HBase, which is a popular column-store-like technology for Hadoop.
- VMware's participation in $150m investment in MongoDB will likely lead to integration of petabyte-scale XML files.
- Table-by-table specification of distribution keys allows you to design your table schemas to take advantage of node-local joins and group-bys, but it will perform well even without this.
- Row and/or Column-oriented data storage. It is the only database where a table can be polymorphic with both columnar and row-based partitions as defined by the DBA.
- A column-store table can have a different compression algorithm per column because different datatypes have different compression characteristics to optimize their storage.
- Advanced Map-Reduce-like CBO Query Optimizer – queries can be run on hundreds of thousands of nodes.
- It is the only database with a dynamic distributed pipeline execution model for query processing. While older databases rely on materialized execution, Greenplum doesn't have to write data to disk at every intermediate query step. It streams data to the next stage of a query plan in memory and never has to materialize the data to disk, so it's much faster than what anybody has demonstrated on Hadoop.
- Complex queries on large data sets are solved in seconds or even sub-seconds.
- Data management – provides table statistics, table security.
- Deep analytics – including data mining or machine learning algorithms using MADlib. Deep Semantic Textual Analytics using GPText.
- Graphical Analysis - billion edge distributed in-memory graph database and algorithms using GraphLab.
- Integration of SQL, Solr indexes, GPText, MADlib and GraphLab in a single query for massive syntactical parsing and graph/matrix affinity analysis for deep search analytics.
- Fully ODBC/JDBC compliant.
- Distributed ETL rate of 16 TB/hr!! Integration with Talend available.
- Cloud support: Pivotal plans to package its Cloud Foundry software so that it can be used to host Pivotal atop other clouds as well, including Amazon Web Services' EC2. Pivotal data management will be available for use in a variety of cloud settings and will not be dependent on a proprietary VMware system. Will target OpenStack, vSphere, vCloud Director, or private brands. IBM announced it has standardized on Cloud Foundry for its PaaS. Confluence page.
- Two hardware "appliance" offerings: Isilon NAS & Greenplum DCA.
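The distribution-key point in the feature list above can be illustrated with a toy model. The sketch below assumes a four-segment cluster and a CRC32-based hash; neither is Greenplum's real hashing scheme, and the tables, rows, and helper names are invented for illustration. The idea it shows is that when two tables are distributed on the same key, matching rows land on the same segment, so each segment can join its own rows without moving data.

```python
from collections import defaultdict
from zlib import crc32

NUM_SEGMENTS = 4  # hypothetical cluster size


def segment_for(key):
    """Pick a segment from a stable hash of the distribution key."""
    return crc32(str(key).encode()) % NUM_SEGMENTS


def distribute(rows, key_index):
    """Place each row on the segment chosen by its distribution key."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_for(row[key_index])].append(row)
    return segments


customers = [(1, "alice"), (2, "bob"), (3, "carol")]  # (customer_id, name)
orders = [(101, 1, 30), (102, 2, 15), (103, 1, 5), (104, 3, 40)]  # (order_id, customer_id, amount)

# Both tables use customer_id as the distribution key.
customer_segs = distribute(customers, key_index=0)
order_segs = distribute(orders, key_index=1)


def local_join(seg):
    """Join orders to customers using only the rows stored on one segment."""
    names = dict(customer_segs.get(seg, []))
    # names[cust_id] always exists: equal keys hash to the same segment.
    return [(names[cust_id], amount) for _, cust_id, amount in order_segs.get(seg, [])]


joined = sorted(row for seg in range(NUM_SEGMENTS) for row in local_join(seg))
```

If `orders` were distributed on `order_id` instead, `local_join` would raise a `KeyError` for any order whose customer lives on another segment, which is the situation where a real MPP database has to redistribute rows across the network before joining.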
Answered by SusanIB
There is a lot of confusion about when to use a row database like MySQL or Oracle, a columnar DB like Infobright or Vertica, a NoSQL variant, or Hadoop. We wrote a white paper to try to help sort out which technologies are best suited to which use cases. You can download Emerging Database Landscape (scroll halfway down) or watch an on-demand webinar on the same topic.
Hope either is useful to you.