Original source: http://stackoverflow.com/questions/2332113/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Switching from MySQL to Cassandra - Pros/Cons?
Asked by viksit
For a bit of background - this question deals with a project that runs on a single small EC2 instance and is about to migrate to a medium one. The main components are Django, MySQL and a large number of custom analysis tools written in Python and Java, which do the heavy lifting. The same machine is running Apache as well.
The data model looks like the following - a large amount of real-time data comes in streamed from various networked sensors, and ideally, I'd like to establish a long-poll approach rather than the current poll-every-15-minutes approach (a limitation of computing stats and writing into the database itself). Once the data comes in, I store the raw version in MySQL, let the analysis tools loose on this data, and store statistics in another few tables. All of this is rendered using Django.
Relational features I would need -
- Order by [SliceRange in Cassandra's API seems to satisfy this]
- Group by
- Many-to-many relations between multiple tables [Cassandra SuperColumns seem to do well for one-to-many]
- Sphinx on this gives me a nice full-text engine, so that's a necessity too. [On Cassandra, the Lucandra project seems to satisfy this need]
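To make the "order by" point concrete: within a row, Cassandra keeps columns sorted by column name, and a SliceRange asks for a contiguous run of them (start, finish, count). Here is a toy in-memory model of that idea, not the Cassandra client API. The class `SortedRow` and its methods are hypothetical names made up for illustration, and the timestamps are invented sample data.

```python
import bisect


class SortedRow:
    """Toy stand-in for one row whose columns stay sorted by name."""

    def __init__(self):
        self._names = []   # column names, always kept in sorted order
        self._values = {}  # column name -> value

    def insert(self, name, value):
        # Only new names need a slot in the sorted list; updates just overwrite.
        if name not in self._values:
            bisect.insort(self._names, name)
        self._values[name] = value

    def slice(self, start, finish, count=100):
        """Return up to `count` (name, value) pairs with start <= name <= finish."""
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_right(self._names, finish)
        return [(n, self._values[n]) for n in self._names[lo:hi][:count]]


# One row per sensor, one column per reading, column name = timestamp.
row = SortedRow()
for ts, reading in [(1030, 21.5), (1000, 20.1), (1015, 20.9), (1045, 22.0)]:
    row.insert(ts, reading)

# "ORDER BY ts" comes for free: the slice is already in column-name order.
recent = row.slice(1010, 1040)
```

The catch is that the sort order is fixed when you design the row. Ordering by anything other than the column name means storing the data a second time under a different name scheme.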
My major problem is that data reads are extremely slow (and writes aren't that hot either). I don't want to throw a lot of money and hardware at it right now, and I'd prefer something that can scale easily with time. Vertically scaling MySQL is not trivial in that sense (or cheap).
So essentially, after having read a lot about NoSQL and experimented with things like MongoDB, Cassandra and Voldemort, my questions are:
On a medium EC2 instance, would I gain any benefits in reads/writes by shifting to something like Cassandra? This article (pdf) definitely seems to suggest that. Currently, I'd say a few hundred writes per minute would be the norm. For reads - since the data changes every 5 minutes or so, cache invalidation has to happen pretty quickly. At some point, it should be able to handle a large number of concurrent users as well. The app performance currently gets killed on MySQL doing some joins on large tables even if indexes are created - something on the order of 32k rows takes more than a minute to render. (This may be an artifact of EC2 virtualized I/O as well.) Size of tables is around 4-5 million rows, and there are about 5 such tables.
Everyone talks about using Cassandra on multiple nodes, given the CAP theorem and eventual consistency. But, for a project that is just beginning to grow, does it make sense to deploy a one-node Cassandra server? Are there any caveats? For instance, can it replace MySQL as a backend for Django? [Is this recommended?]
If I do shift, I'm guessing I'll have to rewrite parts of the app to do a lot more "administrivia" since I'd have to do multiple lookups to fetch rows.
Would it make any sense to just use MySQL as a key-value store rather than a relational engine, and go with that? That way I could utilize a large number of stable APIs available, as well as a stable engine (and go relational as needed). (Bret Taylor's post from FriendFeed on this - http://bret.appspot.com/entry/how-friendfeed-uses-mysql)
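For the last question, a rough sketch of what "MySQL as a key-value store" could look like, loosely in the spirit of the linked FriendFeed post: one table of opaque serialized entities, plus a separate narrow index table for each property you need to look up by. This is an illustration under assumptions, not that post's actual schema. The table names `entities` and `index_sensor`, the helper functions, and the JSON encoding are all made up here, and `sqlite3` stands in for MySQL only so the snippet runs on its own.

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")  # stand-in for a real MySQL connection
conn.execute("CREATE TABLE entities (id TEXT PRIMARY KEY, body TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE index_sensor ("
    "sensor_id TEXT, entity_id TEXT, PRIMARY KEY (sensor_id, entity_id))"
)


def put(entity):
    """Store the whole entity as one opaque blob and update its index row."""
    entity_id = entity.setdefault("id", uuid.uuid4().hex)
    with conn:  # one transaction for the blob and the index entry
        # INSERT OR REPLACE is SQLite syntax; MySQL would use REPLACE INTO.
        conn.execute(
            "INSERT OR REPLACE INTO entities (id, body) VALUES (?, ?)",
            (entity_id, json.dumps(entity)),
        )
        conn.execute(
            "INSERT OR REPLACE INTO index_sensor (sensor_id, entity_id) VALUES (?, ?)",
            (entity["sensor_id"], entity_id),
        )
    return entity_id


def get(entity_id):
    """Key-value lookup by primary key; returns None if the key is unknown."""
    row = conn.execute(
        "SELECT body FROM entities WHERE id = ?", (entity_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


def find_by_sensor(sensor_id):
    """Two-step read: ids from the index table, then one get() per id."""
    rows = conn.execute(
        "SELECT entity_id FROM index_sensor WHERE sensor_id = ?", (sensor_id,)
    ).fetchall()
    return [get(r[0]) for r in rows]


first = put({"sensor_id": "s1", "reading": 20.1})
put({"sensor_id": "s2", "reading": 18.4})
put({"sensor_id": "s1", "reading": 20.9})
s1_readings = sorted(e["reading"] for e in find_by_sensor("s1"))
```

Note that `find_by_sensor` already shows the "multiple lookups to fetch rows" cost from the previous question: there are no joins, so the application does the second hop itself.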
Any insights from people who've done a shift would be greatly appreciated!
Thanks.
Accepted answer by jbellis
Cassandra and the other distributed databases available today do not provide the kind of ad-hoc query support you are used to from SQL. This is because you can't distribute queries with joins performantly, so the emphasis is on denormalization instead.
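To illustrate what "denormalization instead of joins" means in practice, here is a small pure-Python sketch. It does not use Cassandra at all. The sensor/location data and every function name are invented for the example. The point is only the shape of the trade: do the matching work once at write time so that a read becomes a single key lookup.

```python
# Normalized layout: readings only carry a sensor id, as in a relational schema.
sensors = {"s1": {"location": "roof"}, "s2": {"location": "lab"}}
readings = [
    {"sensor_id": "s1", "ts": 1000, "value": 20.1},
    {"sensor_id": "s2", "ts": 1000, "value": 18.4},
    {"sensor_id": "s1", "ts": 1015, "value": 20.9},
]


def read_with_join(location):
    """Relational style: match each reading to its sensor at read time."""
    return [
        r["value"]
        for r in readings
        if sensors[r["sensor_id"]]["location"] == location
    ]


# Denormalized layout: one pre-built list per location, keyed the way it is read.
readings_by_location = {}


def write_denormalized(reading):
    """Copy the sensor's location into the reading and file it under that key."""
    location = sensors[reading["sensor_id"]]["location"]
    row = dict(reading, location=location)
    readings_by_location.setdefault(location, []).append(row)


for r in readings:
    write_denormalized(r)


def read_denormalized(location):
    """Single key lookup, no join."""
    return [r["value"] for r in readings_by_location.get(location, [])]
```

The price is duplicated data and more work on every write, and if a sensor moves, the copies filed under the old location have to be handled by the application.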
However, Cassandra 0.6 (beta officially out tomorrow, but you can build from the 0.6 branch yourself if you're impatient) supports Hadoop map/reduce for analytics, which actually sounds like a good fit for you.
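As a rough picture of why map/reduce fits the "group by" style of analytics, here is the pattern in plain Python. This is not Hadoop and not Cassandra's Hadoop integration, just the map, group and reduce steps run in one process. The function names and sample readings are made up for illustration.

```python
from collections import defaultdict


def map_reading(reading):
    """Map step: emit (key, value) pairs; here the key is the sensor id."""
    yield reading["sensor_id"], reading["value"]


def reduce_values(sensor_id, values):
    """Reduce step: collapse all values for one key into summary statistics."""
    return sensor_id, {"count": len(values), "mean": sum(values) / len(values)}


def run_map_reduce(records, mapper, reducer):
    """Single-process driver: map every record, group by key, reduce each group."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())


readings = [
    {"sensor_id": "s1", "value": 20.0},
    {"sensor_id": "s2", "value": 18.0},
    {"sensor_id": "s1", "value": 22.0},
]

# Roughly: SELECT sensor_id, COUNT(*), AVG(value) ... GROUP BY sensor_id
stats = run_map_reduce(readings, map_reading, reduce_values)
```

In a real cluster the map and reduce steps run on many machines and the grouping is the shuffle between them, which suits batch statistics and not per-request queries.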
Cassandra provides excellent support for adding new nodes painlessly, even to an initial group of one.
That said, at a few hundred writes/minute you're going to be fine on MySQL for a long, long time. Cassandra is much better at being a key/value store (even better, key/columnfamily) but MySQL is much better at being a relational database. :)
There is no Django support for Cassandra (or any other NoSQL database) yet. They are talking about doing something for the next version after 1.2, but based on talking to Django devs at PyCon, nobody is really sure what that will look like yet.
Answer by codemonkey
If you're a relational database developer (as I am), I'd suggest/point out:
- Get some experience working with Cassandra before you commit to its use on a production system... especially if that production system has a hard deadline for completion. Maybe use it as the backend for something unimportant first.
- It's proving more challenging than I'd anticipated to do simple things that I take for granted about data manipulation using SQL engines. In particular, indexing data and sorting result sets are non-trivial.
- Data modelling has proven challenging as well. As a relational database developer you come to the table with a lot of baggage... you need to be willing to learn how to model data very differently.
That said, I strongly recommend building something in Cassandra. If you're like me, then doing so will challenge your understanding of data storage and make you rethink a relational-database-fits-all-situations outlook that I didn't even realize I held.
Some good resources I've found include:
Answer by logan
Django-cassandra is in early beta. Also, Django wasn't made for NoSQL databases. The core of the Django ORM is based on SQL (Django recommends using PostgreSQL). If you need to use ONLY NoSQL (you can mix SQL and NoSQL in the same app), you either have to take the risk of using a NoSQL ORM (which is significantly slower than a traditional SQL ORM or direct use of the NoSQL storage), or completely rewrite the Django ORM. But in that case I can't see why you need Django. Maybe you could use something else, like Tornado?