Notice: this page is derived from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/24949676/

Time: 2020-09-08 09:17:18. Source: igfitidea

Difference between partition key, composite key and clustering key in Cassandra?

Tags: database, cassandra, cql

Asked by brain storm

I have been reading articles around the net to understand the differences between the following key types, but they are hard for me to grasp. Examples would definitely help.

  • primary key
  • partition key
  • composite key
  • clustering key

Answered by Carlo Bertuccini

There is a lot of confusion around this; I will try to make it as simple as possible.

The primary key is a general concept to indicate one or more columns used to retrieve data from a Table.

The primary key may be SIMPLE and even declared inline:

 create table stackoverflow_simple (
      key text PRIMARY KEY,
      data text      
  );

That means it consists of a single column.

But the primary key can also be COMPOSITE (aka COMPOUND), built from more than one column.

 create table stackoverflow_composite (
      key_part_one text,
      key_part_two int,
      data text,
      PRIMARY KEY(key_part_one, key_part_two)      
  );

With a COMPOSITE primary key, the "first part" of the key is called the PARTITION KEY (in this example key_part_one is the partition key) and the second part of the key is the CLUSTERING KEY (in this example key_part_two).

Please note that both the partition key and the clustering key can be made of more than one column. Here's how:

 create table stackoverflow_multiple (
      k_part_one text,
      k_part_two int,
      k_clust_one text,
      k_clust_two int,
      k_clust_three uuid,
      data text,
      PRIMARY KEY((k_part_one, k_part_two), k_clust_one, k_clust_two, k_clust_three)      
  );

Behind these names ...

  • The Partition Key is responsible for data distribution across your nodes.
  • The Clustering Key is responsible for data sorting within the partition.
  • The Primary Key is equivalent to the Partition Key in a single-field-key table (i.e. Simple).
  • The Composite/Compound Key is just any multiple-column key.
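
The roles listed above can be pictured with a toy model. This is a minimal Python sketch, not Cassandra's implementation: the node count, the use of MD5 as a stand-in hash, and the helper names `node_for` and `insert` are assumptions made for illustration (a real cluster uses a partitioner and token ranges).

```python
import hashlib
from bisect import insort
from collections import defaultdict

NUM_NODES = 3  # assumed cluster size, for illustration only


def node_for(partition_key):
    """Map a partition key to a node index with a stable hash (stand-in for the partitioner)."""
    digest = hashlib.md5(repr(partition_key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_NODES


# nodes[node index][partition key] -> list of (clustering key, data), kept sorted
nodes = defaultdict(lambda: defaultdict(list))


def insert(partition_key, clustering_key, data):
    """Place the row on the node chosen by the partition key, ordered by clustering key."""
    rows = nodes[node_for(partition_key)][partition_key]
    insort(rows, (clustering_key, data))


# Inserted out of order on purpose: the partition still ends up sorted.
insert("ronaldo", 10, "ex-football player")
insert("ronaldo", 9, "football player")
insert("messi", 10, "football player")

ronaldo_rows = nodes[node_for("ronaldo")]["ronaldo"]
print(ronaldo_rows)
```

The partition key alone decides where the rows live; the clustering key only decides their order inside that partition.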

Further usage information: DATASTAX DOCUMENTATION

Small usage and content examples

SIMPLE KEY:

insert into stackoverflow_simple (key, data) VALUES ('han', 'solo');
select * from stackoverflow_simple where key='han';

table content

key | data
----+------
han | solo

A COMPOSITE/COMPOUND KEY can retrieve "wide rows" (i.e. you can query by just the partition key, even if you have clustering keys defined):

insert into stackoverflow_composite (key_part_one, key_part_two, data) VALUES ('ronaldo', 9, 'football player');
insert into stackoverflow_composite (key_part_one, key_part_two, data) VALUES ('ronaldo', 10, 'ex-football player');
select * from stackoverflow_composite where key_part_one = 'ronaldo';

table content

 key_part_one | key_part_two | data
--------------+--------------+--------------------
      ronaldo |            9 |    football player
      ronaldo |           10 | ex-football player

But you can also query with all keys (both partition and clustering):

select * from stackoverflow_composite 
   where key_part_one = 'ronaldo' and key_part_two  = 10;

query output

 key_part_one | key_part_two | data
--------------+--------------+--------------------
      ronaldo |           10 | ex-football player

Important note: the partition key is the minimum specifier needed to perform a query using a where clause. Suppose you have a composite partition key like the following:

e.g. PRIMARY KEY((col1, col2), col10, col4)

You can query only by passing at least both col1 and col2, the two columns that define the partition key. The "general" rule for queries is that you must pass at least all partition key columns; you can then optionally add each clustering key, in the order in which they are declared.

So the valid queries are (excluding secondary indexes):

  • col1 and col2
  • col1 and col2 and col10
  • col1 and col2 and col10 and col4

Invalid:

  • col1 and col2 and col4
  • anything that does not contain both col1 and col2
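
The rule above can be written down as a small checker. This is a hedged Python sketch under simplifying assumptions: it only models equality restrictions in a where clause, it ignores secondary indexes and filtering options, and the function name `is_valid_restriction` is hypothetical rather than anything Cassandra exposes.

```python
def is_valid_restriction(partition_cols, clustering_cols, restricted):
    """Return True if equality restrictions on `restricted` follow the rule above."""
    restricted = set(restricted)
    known = set(partition_cols) | set(clustering_cols)
    if not restricted <= known:
        return False  # a non-key column would need a secondary index
    if not set(partition_cols) <= restricted:
        return False  # every partition key column is mandatory
    # Clustering columns must be given as a prefix of their declared order.
    seen_gap = False
    for column in clustering_cols:
        if column not in restricted:
            seen_gap = True
        elif seen_gap:
            return False  # a later clustering column was given after skipping one
    return True


# Key layout from PRIMARY KEY((col1, col2), col10, col4)
PARTITION = ["col1", "col2"]
CLUSTERING = ["col10", "col4"]

print(is_valid_restriction(PARTITION, CLUSTERING, ["col1", "col2"]))           # True
print(is_valid_restriction(PARTITION, CLUSTERING, ["col1", "col2", "col10"]))  # True
print(is_valid_restriction(PARTITION, CLUSTERING, ["col1", "col2", "col4"]))   # False
print(is_valid_restriction(PARTITION, CLUSTERING, ["col1"]))                   # False
```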

Hope this helps.

Answered by OrangeDog

Adding a summary answer, as the accepted one is quite long. The terms "row" and "column" are used here in the CQL sense, not in the sense of how Cassandra is actually implemented.

  • A primary key uniquely identifies a row.
  • A composite key is a key formed from multiple columns.
  • A partition key is the primary lookup to find a set of rows, i.e. a partition.
  • A clustering key is the part of the primary key that isn't the partition key (and defines the ordering within a partition).

Examples:

  • PRIMARY KEY (a): The partition key is a.
  • PRIMARY KEY (a, b): The partition key is a, the clustering key is b.
  • PRIMARY KEY ((a, b)): The composite partition key is (a, b).
  • PRIMARY KEY (a, b, c): The partition key is a, the composite clustering key is (b, c).
  • PRIMARY KEY ((a, b), c): The composite partition key is (a, b), the clustering key is c.
  • PRIMARY KEY ((a, b), c, d): The composite partition key is (a, b), the composite clustering key is (c, d).
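
The reading rule behind these examples (inner parentheses mark the partition key; otherwise only the first column does) can be sketched in a few lines of Python. The function name `split_primary_key` is hypothetical, and this is a simplified string parser for illustration, not a CQL grammar.

```python
import re


def split_primary_key(clause):
    """Split 'PRIMARY KEY (...)' into (partition key columns, clustering columns)."""
    match = re.fullmatch(r"\s*PRIMARY\s+KEY\s*\((.*)\)\s*", clause,
                         flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        raise ValueError("not a PRIMARY KEY clause: %r" % clause)
    body = match.group(1).strip()
    if body.startswith("("):
        # Inner parentheses group the composite partition key.
        close = body.index(")")
        partition = [c.strip() for c in body[1:close].split(",") if c.strip()]
        rest = body[close + 1:]
    else:
        # Without inner parentheses only the first column is the partition key.
        first, _, rest = body.partition(",")
        partition = [first.strip()]
    clustering = [c.strip() for c in rest.split(",") if c.strip()]
    return partition, clustering


for clause in ["PRIMARY KEY (a)", "PRIMARY KEY (a, b, c)", "PRIMARY KEY ((a, b), c, d)"]:
    print(clause, "->", split_primary_key(clause))
```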

Answered by Big Data Guy

In Cassandra, the difference between primary key, partition key, composite key, and clustering key always causes some confusion, so I am going to explain them below and relate them to each other. We use CQL (Cassandra Query Language) for Cassandra database access. Note: this answer is as per the updated version of Cassandra.

Primary Key:

In Cassandra there are two different ways to declare a primary key.

CREATE TABLE Cass (
    id int PRIMARY KEY,
    name text 
);


Create Table Cass (
   id int,
   name text,
   PRIMARY KEY(id) 
);


In CQL, the order in which columns are defined for the PRIMARY KEY matters. The first column of the key is called the partition key. It has the property that all the rows sharing the same partition key (even across tables, in fact) are stored on the same physical node. Also, insertions, updates, and deletions on rows sharing the same partition key for a given table are performed atomically and in isolation. Note that it is possible to have a composite partition key, i.e. a partition key formed of multiple columns, by using an extra set of parentheses to define which columns form the partition key.

Partitioning and Clustering

The PRIMARY KEY definition is made up of two parts: the Partition Key and the Clustering Columns. The first part maps to the storage engine row key, while the second is used to group columns in a row.

CREATE TABLE device_check (
  device_id   int,
  checked_at  timestamp,
  is_power    boolean,
  is_locked   boolean,
  PRIMARY KEY (device_id, checked_at)
);

Here device_id is the partition key and checked_at is the clustering key.

We can have multiple clustering keys as well as multiple partition key columns, depending on the declaration.

Answered by Chandan Hegde

Primary Key: Is composed of partition key(s) [and optional clustering keys (or columns)].
Partition Key: The hash value of the partition key is used to determine the specific node in a cluster that stores the data.
Clustering Key: Is used to sort the data in each of the partitions (on the responsible node and its replicas).

Compound Primary Key: As said above, the clustering keys are optional in a Primary Key. If they aren't mentioned, it's a simple primary key. If clustering keys are mentioned, it's a Compound primary key.

Composite Partition Key: Using just one column as a partition key might result in wide row issues (depending on the use case and data modeling). Hence the partition key is sometimes specified as a combination of more than one column.

Regarding the confusion about which keys are mandatory and which can be skipped in a query, it helps to imagine Cassandra as a giant HashMap. In a HashMap, you can't retrieve the values without the key.
Here, the partition keys play the role of that key, so each query needs to specify them. Without them, Cassandra won't know which node to search.
The clustering keys (columns, which are optional) help narrow your query further after Cassandra finds the specific node (and its replicas) responsible for that specific partition key.
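
The HashMap analogy can be made concrete with a plain dictionary. This is a minimal sketch, assuming a single in-memory `table` whose values are lists already sorted by clustering key; the sample data and the `select` helper are hypothetical and only mimic the shape of a lookup.

```python
from bisect import bisect_left, bisect_right

# partition key -> rows sorted by clustering key, stored as (clustering key, value)
table = {
    1: [(100, "on"), (200, "off"), (300, "on")],
    2: [(150, "on")],
}


def select(partition_key, start=None, end=None):
    """The partition key is mandatory; the clustering range only narrows the result."""
    rows = table.get(partition_key, [])
    keys = [clustering_key for clustering_key, _ in rows]
    lo = 0 if start is None else bisect_left(keys, start)
    hi = len(rows) if end is None else bisect_right(keys, end)
    return rows[lo:hi]


print(select(1))                      # whole partition
print(select(1, start=150))           # narrowed by clustering key
print(select(1, start=200, end=200))  # a single row
```

There is no way to call `select` without a partition key, which mirrors why a query has to name it first.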

Answered by Sun

In brief:

The partition key is nothing but the identification for a row. Most of the time it is a single column (in which case it is also the primary key); sometimes it is a combination of multiple columns (called a composite partition key).

The cluster key is nothing but indexing and sorting. Cluster keys depend on a few things:

  1. Which columns you use in the where clause other than the primary key columns.

  2. If you have a very large number of records, on what basis you can divide the data for easy management. For example, I have 1 million county population records. For easy management, I cluster the data by state, then by pincode, and so on.

Answered by kboom

Worth noting: you will probably use these a lot more than the similar concept in the relational world (composite keys).

Example: suppose you have to find the last N users who recently joined user group X. How would you do this efficiently, given that reads are predominant in this case? Like this (from the official Cassandra guide):

CREATE TABLE group_join_dates (
    groupname text,
    joined timeuuid,
    join_date text,
    username text,
    email text,
    age int,
    PRIMARY KEY ((groupname, join_date), joined)
) WITH CLUSTERING ORDER BY (joined DESC);

Here, the partitioning key is itself compound and the clustering key is the joined date. The reason the clustering key is the join date is that results are already sorted (and stored that way, which makes lookups fast). But why do we use a compound key for the partitioning key? Because we always want to read as few partitions as possible. How does putting join_date in there help? Now users from the same group and the same join date will reside in a single partition! This means we will always read as few partitions as possible (start with the newest, then move to older ones, and so on, rather than jumping between them).

In fact, in extreme cases you would also need to use the hash of a join_date rather than the join_date alone, so that if you often query for the last 3 days, those days share the same hash and are therefore available from the same partition!
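
The read pattern described above (newest partition first, then older ones) can be sketched as follows. This is a toy Python model under stated assumptions: `partitions` is a hypothetical in-memory stand-in for the table, integers stand in for the timeuuid `joined` values, dates stand in for the text `join_date`, and `last_joined` and `max_days` are made-up names.

```python
from datetime import date, timedelta

# (groupname, join_date) -> rows of (joined, username), newest first within the partition
partitions = {
    ("X", date(2020, 9, 8)): [(5, "eve"), (4, "dan")],
    ("X", date(2020, 9, 7)): [(3, "carol")],
    ("X", date(2020, 9, 5)): [(2, "bob"), (1, "alice")],
}


def last_joined(groupname, newest_day, n, max_days=30):
    """Collect the last n joiners by reading one day partition at a time, newest first."""
    found = []
    day = newest_day
    for _ in range(max_days):
        rows = partitions.get((groupname, day), [])
        found.extend(rows[: n - len(found)])  # rows are already in DESC order
        if len(found) >= n:
            break  # stop as soon as enough rows were read
        day -= timedelta(days=1)
    return [username for _, username in found]


print(last_joined("X", date(2020, 9, 8), 4))
```

Each loop iteration touches exactly one partition, so a small N usually means reading only the most recent day.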

Answered by Sumon Saikan

The primary key in Cassandra usually consists of two parts - Partition key and Clustering columns.

primary_key((partition_key), clustering_col )

Partition key - The first part of the primary key. The main aim of a partition key is to identify the node which stores the particular row.

CREATE TABLE phone_book ( phone_num bigint, name text, age int, city text, PRIMARY KEY ((phone_num, name), age) );

Here, (phone_num, name) is the partition key. While inserting the data, the hash value of the partition key is generated and this value decides which node the row should go into.

Consider a 4-node cluster in which each node has a range of hash values it can store. (Write) INSERT INTO phone_book (phone_num, name, age, city) VALUES (7826573732, 'Joey', 25, 'New York');

Now the hash value of the partition key is calculated by the Cassandra partitioner. Say hash(7826573732, 'Joey') → 12; this row will then be inserted into Node C.

(Read) SELECT * FROM phone_book WHERE phone_num=7826573732 and name='Joey';

Now the hash value of the partition key (7826573732, 'Joey') is calculated again. It is 12 in our case, which resides in Node C, so the read is served from there.
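
The write and read paths above both come down to "hash the partition key, then find the node that owns that value". A minimal Python sketch, assuming a made-up ring in which the four nodes own the token ranges 0-4, 5-9, 10-14 and 15-19, with CRC32 as a stand-in hash; the real partitioner, token space, and range assignment are different.

```python
import zlib
from bisect import bisect_left

# Assumed ring: each node owns the tokens up to and including its upper bound.
UPPER_BOUNDS = [4, 9, 14, 19]
NODE_NAMES = ["A", "B", "C", "D"]
TOKEN_SPACE = 20


def token_for(partition_key):
    """Stand-in partitioner: a stable hash of the whole partition key."""
    return zlib.crc32(repr(partition_key).encode("utf-8")) % TOKEN_SPACE


def node_for_token(token):
    """Find the node whose range contains the token."""
    return NODE_NAMES[bisect_left(UPPER_BOUNDS, token)]


# The answer's example: a hash value of 12 falls in Node C's range.
print(node_for_token(12))

# Write and read hash the same partition key, so they reach the same node.
key = (7826573732, "Joey")
write_node = node_for_token(token_for(key))
read_node = node_for_token(token_for(key))
print(write_node == read_node)
```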

Clustering columns - The second part of the primary key. The main purpose of clustering columns is to store the data in sorted order. By default, the order is ascending.

There can be more than one partition key column and more than one clustering column in a primary key, depending on the query you are solving.

primary_key((pk1, pk2), col1, col2)

Answered by Khurana

In database design, a compound key is a set of superkeys that is not minimal.

A composite key is a set that contains a compound key and at least one attribute that is not a superkey.

Given table: EMPLOYEES {employee_id, firstname, surname}

Possible superkeys are:

{employee_id}
{employee_id, firstname}
{employee_id, firstname, surname}

{employee_id} is the only minimal superkey, which also makes it the only candidate key--given that {firstname} and {surname} do not guarantee uniqueness. Since a primary key is defined as a chosen candidate key, and only one candidate key exists in this example, {employee_id} is the minimal superkey, the only candidate key, and the only possible primary key.

The exhaustive list of compound keys is:

{employee_id, firstname}
{employee_id, surname}
{employee_id, firstname, surname}

The only composite key is {employee_id, firstname, surname} since that key contains a compound key ({employee_id,firstname}) and an attribute that is not a superkey ({surname}).

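
The enumeration in this answer can be reproduced mechanically. This is a small Python sketch that assumes, as the answer does, that {employee_id} is the only candidate key of EMPLOYEES; the function name `superkeys` is hypothetical. It lists every attribute set that contains the candidate key, and the sets with more than one attribute coincide with the answer's list of compound keys.

```python
from itertools import combinations

ATTRIBUTES = ("employee_id", "firstname", "surname")
CANDIDATE_KEY = frozenset({"employee_id"})  # assumed to be the only candidate key


def superkeys(attributes, candidate_key):
    """Every attribute set that contains the candidate key is a superkey."""
    result = []
    for size in range(1, len(attributes) + 1):
        for combo in combinations(attributes, size):
            if candidate_key <= set(combo):
                result.append(combo)
    return result


all_superkeys = superkeys(ATTRIBUTES, CANDIDATE_KEY)
non_minimal = [key for key in all_superkeys if len(key) > 1]

print(all_superkeys)
print(non_minimal)
```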