Hadoop (+HBase/HDFS) vs Mysql (or Postgres) - Loads of independent, structured data to be processed and queried

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/4884967/



Tags: mysql, database-design, postgresql, hadoop, distributed

Asked by MalteseUnderdog

Hi there at SO,


I would like some ideas/comments on the following from you, honorable and venerable bunch.


I have 100M records which I need to process. I have 5 nodes (in a Rocks cluster) to do this. The data is very structured and fits nicely into the relational data model. I want to do things in parallel since my processing takes some time.


As I see it, I have two main options:


Install mysql on each node and put 20M records on each. Use the head node to delegate queries to the nodes and aggregate the results. Query capabilities++, but I might risk some headaches when I come to choose partitioning strategies etc. (Q: is this what they call a mysql/postgres cluster?). The really bad part is that the processing of the records is now left up to me to take care of (how to distribute it across machines etc.)...

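For illustration only, here is a minimal sketch of the "head node delegates and aggregates" idea in Python, assuming each of the 5 nodes runs its own MySQL instance holding a 20M-record shard. The host names, credentials, database and table are hypothetical, and it only covers queries whose results can simply be concatenated:

    # Hypothetical scatter-gather over 5 MySQL shards (one per node).
    # Host names, credentials and schema are made up for the example.
    import concurrent.futures
    import mysql.connector

    SHARD_HOSTS = ["node1", "node2", "node3", "node4", "node5"]

    def query_shard(host, sql, params=()):
        # Run the same query against a single shard and return its rows.
        conn = mysql.connector.connect(host=host, user="reader",
                                       password="secret", database="wine_db")
        try:
            cur = conn.cursor()
            cur.execute(sql, params)
            return cur.fetchall()
        finally:
            conn.close()

    def scatter_gather(sql, params=()):
        # Fan the query out to every shard in parallel, then concatenate.
        with concurrent.futures.ThreadPoolExecutor(len(SHARD_HOSTS)) as pool:
            futures = [pool.submit(query_shard, h, sql, params) for h in SHARD_HOSTS]
            rows = []
            for f in futures:
                rows.extend(f.result())
            return rows

    rows = scatter_gather("SELECT * FROM wine WHERE colour = %s", ("brown",))

Anything beyond concatenating rows - cross-shard joins, GROUP BY, ORDER BY with LIMIT - has to be re-implemented on the head node, which is exactly the headache mentioned above.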

Alternatively, install Hadoop, Hive and HBase (note that this might not be the most efficient way to store my data, since HBase is column oriented) and just define the nodes. We write everything in the MapReduce paradigm and, bang, we live happily ever after. The problem here is that we lose the "real time" query capabilities (I know you can use Hive, but that is not suggested for real time queries - which I need) - since I also have some normal sql queries to execute at times, e.g. "select * from wine where colour = 'brown'".

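To make the MapReduce side concrete, here is a minimal, hypothetical Hadoop Streaming mapper in Python that reproduces the colour = 'brown' filter. It assumes the records sit in HDFS as tab-separated lines with the colour in a fixed column (the column index is made up):

    #!/usr/bin/env python
    # Hypothetical Hadoop Streaming mapper: emit only records whose
    # colour field equals 'brown'. Assumes tab-separated input lines.
    import sys

    COLOUR_COLUMN = 2  # made-up position of the colour field

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > COLOUR_COLUMN and fields[COLOUR_COLUMN] == "brown":
            sys.stdout.write(line)

A filter like this is trivially parallel (map-only, no reducer needed), but it is still a batch scan of the whole data set, which is the trade-off described above: you get throughput, not the sub-second response of a SQL SELECT against an indexed table.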

Note that in theory - if I had 100M machines I could do the whole thing instantly, since for each record the processing is independent of the others. Also - my data is read-only. I do not envisage any updates happening. I do not need/want 100M records on one node. I do not want there to be redundant data (since there is lots of it), so keeping it in BOTH mysql/postgres and Hadoop/HBase/HDFS is not a real option.


Many Thanks


Answered by bajafresh4life

Can you prove that MySQL is the bottleneck? 100M records is not that many, and it looks like you're not performing complex queries. Without knowing exactly what kind of processing you are doing, here is what I would do, in this order:


  1. Keep the 100M records in MySQL. Take a look at Cloudera's Sqoop utility to import records from the database and process them in Hadoop.
  2. If MySQL is the bottleneck in (1), consider setting up slave replication, which will let you parallelize reads without the complexity of a sharded database (a small read-balancing sketch follows after this list). Since you've already stated that you don't need to write back to the database, this should be a viable solution. You can replicate your data to as many servers as needed.
  3. If you are running complex select queries from the database, and (2) is still not viable, then consider using Sqoop to import your records and do whatever query transformations you require in Hadoop.
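As a rough illustration of point 2, a client can spread read-only queries across replicas without any sharding logic, because every replica holds the full data set. This is only a sketch; the replica host names and credentials are hypothetical:

    # Hypothetical round-robin reads over MySQL slave replicas (point 2 above).
    # Every replica holds a full copy, so any query can go to any of them.
    import itertools
    import mysql.connector

    REPLICAS = itertools.cycle(["replica1", "replica2", "replica3", "replica4"])

    def read_query(sql, params=()):
        # Send each read to the next replica in round-robin order.
        host = next(REPLICAS)
        conn = mysql.connector.connect(host=host, user="reader",
                                       password="secret", database="wine_db")
        try:
            cur = conn.cursor()
            cur.execute(sql, params)
            return cur.fetchall()
        finally:
            conn.close()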

In your situation, I would resist the temptation to jump off of MySQL, unless it is absolutely necessary.


Answered by David Gruzman

There are a few questions to ask before suggesting anything.
Can you formulate your queries to access by primary key only? In other words, can you avoid all joins and table scans? If so, HBase is an option, provided you need a very high rate of read/write accesses (a small lookup sketch follows below).
I do not think that Hive is a good option, taking into consideration the low data volume. If you expect it to grow significantly, you can consider it. In any case, Hive is good for analytical workloads, not for OLTP-type processing.
If you do need a relational model with joins and scans, I think a good solution might be one master node and 4 slaves, with replication between them. You would direct all writes to the master and balance reads across the whole cluster. This is especially good if you have many more reads than writes.
In this scheme you will have all 100M records (not that much) on each node. Within each node you can employ partitioning if appropriate.
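To illustrate the primary-key-only access pattern, here is a minimal sketch using the happybase Python client, which talks to HBase through its Thrift gateway. The table name, row keys and column family are invented for the example:

    # Hypothetical key-only lookups against HBase via happybase/Thrift.
    # Table name, row keys and column family are made up for illustration.
    import happybase

    connection = happybase.Connection("hbase-thrift-host")
    table = connection.table("wine_records")

    # Cheap single-row fetch by primary key (the HBase row key).
    row = table.row(b"record-00042")

    # A condition like colour = 'brown' on a non-key column instead
    # requires scanning the table.
    brown = [key for key, data in table.scan()
             if data.get(b"d:colour") == b"brown"]

The contrast between the two statements is the point of this answer: row-key lookups are cheap, while anything resembling a WHERE clause on a non-key column turns into a scan.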


Answered by shadanan

You may also want to consider using Cassandra. I recently discovered this article on HBase vs. Cassandra, which I was reminded of when I read your post.


The gist of it is that Cassandra is a highly scalable NoSQL solution with fast querying, which sort of sounds like the solution you're looking for.


So, it all depends on whether you need to maintain your relational model or not.


Answered by Faheem Mitha

Hi,


I had a situation where I had many tables which I created in parallel using sqlalchemy and the Python multiprocessing library. I had multiple files, one per table, and loaded them using parallel COPY processes. If each process corresponds to a separate table, that works well. With one table, using COPY would be more difficult. You could use table partitioning in PostgreSQL, I guess. If you are interested I can give more details.

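For the curious, a minimal sketch of the approach described above might look like the following, using psycopg2 directly (rather than sqlalchemy, for brevity) and one worker process per (file, table) pair. The connection string, file names and table names are placeholders:

    # Hypothetical parallel loading: one COPY process per (file, table) pair.
    # The DSN, file names and table names are placeholders.
    import multiprocessing
    import psycopg2

    DSN = "dbname=mydb user=loader"
    JOBS = [("table_a.csv", "table_a"), ("table_b.csv", "table_b")]

    def load_table(job):
        path, table = job
        conn = psycopg2.connect(DSN)
        try:
            with conn, conn.cursor() as cur, open(path) as f:
                # COPY ... FROM STDIN streams the whole file into the table.
                cur.copy_expert("COPY %s FROM STDIN WITH CSV" % table, f)
        finally:
            conn.close()

    if __name__ == "__main__":
        with multiprocessing.Pool(len(JOBS)) as pool:
            pool.map(load_table, JOBS)

Each worker gets its own connection, so the COPY statements run concurrently; with a single large table you would need either PostgreSQL table partitioning or pre-split input files instead.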

Regards.
