Java - How to efficiently use batch writes to Cassandra using the DataStax Java driver?

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me) at StackOverflow, citing the original address: http://stackoverflow.com/questions/26265224/

Date: 2020-11-02 09:36:16  Source: igfitidea

How to efficiently use Batch writes to cassandra using datastax java driver?

Tags: java, cassandra, datastax-java-driver

Asked by john

I need to write in batches to Cassandra using the Datastax Java driver, and this is the first time I am trying to use batches with the Datastax Java driver, so I am having some confusion -

Below is my code, in which I am trying to create a Statement object, add it to a Batch, and set the ConsistencyLevel to QUORUM as well.

Session session = null;
Cluster cluster = null;

// we build cluster and session object here and we use  DowngradingConsistencyRetryPolicy as well
// cluster = builder.withSocketOptions(socketOpts).withRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE)

public void insertMetadata(List<AddressMetadata> listAddress) {
    // what is the purpose of unloggedBatch here?
    Batch batch = QueryBuilder.unloggedBatch();

    try {
        for (AddressMetadata data : listAddress) {
            // note: requires a static import of QueryBuilder.insertInto;
            // insertInto returns a RegularStatement, which is what Batch.add expects
            RegularStatement insert = insertInto("test_table").values(
                    new String[] { "address", "name", "last_modified_date", "client_id" },
                    new Object[] { data.getAddress(), data.getName(), data.getLastModifiedDate(), 1 });
            // is this the right way to set consistency level for Batch?
            insert.setConsistencyLevel(ConsistencyLevel.QUORUM);
            batch.add(insert);
        }

        // now execute the batch
        session.execute(batch);
    } catch (NoHostAvailableException e) {
        // log an exception
    } catch (QueryExecutionException e) {
        // log an exception
    } catch (QueryValidationException e) {
        // log an exception
    } catch (IllegalStateException e) {
        // log an exception
    } catch (Exception e) {
        // log an exception
    }
}

And below is my AddressMetadata class -

public class AddressMetadata {

    private String name;
    private String address;
    private Date lastModifiedDate;

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public String getAddress() {
        return address;
    }

    public void setAddress(String address) {
        this.address = address;
    }

    public Date getLastModifiedDate() {
        return lastModifiedDate;
    }

    public void setLastModifiedDate(Date lastModifiedDate) {
        this.lastModifiedDate = lastModifiedDate;
    }
}

Now my question is - is the way I am using Batch to insert into Cassandra with the Datastax Java driver correct? And what about retry policies, meaning if the batch statement execution fails, what will happen - will it retry again?

And is there any better way of doing batch writes to Cassandra using the Java driver?

Answered by phact

First, a bit of a rant:

The batch keyword in Cassandra is not a performance optimization for batching together large buckets of data for bulk loads.

Batches are used to group together atomic operations, actions that you expect to occur together. Batches guarantee that if a single part of your batch is successful, the entire batch is successful.


Using batches will probably not make your mass ingestion run faster.

Now for your questions:


what is the purpose of unloggedBatch here?


Cassandra uses a mechanism called batch logging in order to ensure a batch's atomicity. By specifying an unlogged batch, you are turning off this functionality, so the batch is no longer atomic and may fail with partial completion. Naturally, there is a performance penalty for logging your batches and ensuring their atomicity; using unlogged batches removes this penalty.
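
For contrast, both batch types come from the same builder in the 2.x driver API used in the question. A minimal fragment (not runnable on its own, since executing it needs a live session):

```java
import com.datastax.driver.core.querybuilder.Batch;
import com.datastax.driver.core.querybuilder.QueryBuilder;

// Logged (atomic) batch: written to Cassandra's batch log first.
Batch logged = QueryBuilder.batch();

// Unlogged batch: skips the batch log, so it may partially apply on failure,
// but avoids the batch-log overhead.
Batch unlogged = QueryBuilder.unloggedBatch();
```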

There are some cases in which you may want to use unlogged batches to ensure that requests (inserts) that belong to the same partition are sent together. If you batch operations together and they need to be performed in different partitions / nodes, you are essentially creating more work for your coordinator. See specific examples of this in Ryan's blog:

Read this post

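
The "keep a batch to one partition" advice can be sketched without the driver at all. In this runnable sketch the records are bucketed by their partition key before batching; it assumes address is the partition key (hypothetical - the real key depends on the table schema), and each resulting bucket would then become one single-partition unlogged batch:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class PartitionGrouping {
    // Minimal stand-in for the question's AddressMetadata; only the
    // assumed partition key ("address") matters for grouping.
    record Record(String address, String name) {}

    // Bucket records by partition key so that each bucket can be sent
    // as one single-partition unlogged batch.
    static Map<String, List<Record>> groupByPartition(List<Record> records) {
        return records.stream().collect(Collectors.groupingBy(Record::address));
    }

    public static void main(String[] args) {
        List<Record> records = List.of(
                new Record("10 Main St", "alice"),
                new Record("10 Main St", "bob"),
                new Record("99 Side Rd", "carol"));
        Map<String, List<Record>> buckets = groupByPartition(records);
        // With the driver, each bucket would be built into a Batch of
        // inserts and passed to session.execute(batch).
        System.out.println(buckets.get("10 Main St").size()); // 2
        System.out.println(buckets.size());                   // 2
    }
}
```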

Now my question is - is the way I am using Batch to insert into Cassandra with the Datastax Java driver correct?

I don't see anything wrong with your code here; it just depends on what you're trying to achieve. Dig into that blog post I shared for more insight.
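
One refinement worth noting (my addition, not from the answer): prepare the insert once and bind each row, and set the consistency level once on the batch rather than on every statement. A sketch against the same 2.x driver API, assuming the question's table and fields (not runnable on its own - it needs the question's session and listAddress):

```java
import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.PreparedStatement;

PreparedStatement ps = session.prepare(
        "INSERT INTO test_table (address, name, last_modified_date, client_id) VALUES (?, ?, ?, ?)");
BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
for (AddressMetadata data : listAddress) {
    batch.add(ps.bind(data.getAddress(), data.getName(), data.getLastModifiedDate(), 1));
}
batch.setConsistencyLevel(ConsistencyLevel.QUORUM); // set once, on the whole batch
session.execute(batch);
```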

And what about retry policies, meaning if the batch statement execution fails, what will happen - will it retry again?

A batch will not retry on its own if it fails. The driver does have retry policies, but you have to apply those separately.
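
Retry policies are applied once, at cluster construction time (the builder comment in the question already does this with DowngradingConsistencyRetryPolicy). A minimal fragment against the 2.x driver API, with a placeholder contact point:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.DefaultRetryPolicy;

Cluster cluster = Cluster.builder()
        .addContactPoint("127.0.0.1")                 // placeholder contact point
        .withRetryPolicy(DefaultRetryPolicy.INSTANCE) // or a less conservative policy
        .build();
```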

The default policy in the java driver only retries in these scenarios:


  • On a read timeout, if enough replica replied but data was not retrieved.
  • On a write timeout, if we timeout while writing the distributed log used by batch statements.

Read more about the default policy and consider less conservative policies based on your use case.

Answered by Chandra

We debated for a while between using async and batches. We tried out both to compare. We got better throughput using "unlogged batches" compared to individual "async" requests. We don't know why, but based on Ryan's blog, I am guessing it has to do with the write size. We were probably doing too many small writes, so batching them probably gave us better performance, as it does reduce network traffic.

I have to mention that we are not even doing "unlogged batches" in the recommended way. The recommended way is to batch against a single partition key - basically, batch all the records which belong to the same partition key. But we were just batching records which probably belong to different partitions.
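
The async alternative discussed here is usually paired with a cap on in-flight requests, so the client does not overwhelm the cluster. A runnable sketch of that throttling pattern using a Semaphore and CompletableFuture; the simulated task is a stand-in for session.executeAsync, and the pool size and cap are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

public class ThrottledAsync {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        Semaphore inFlight = new Semaphore(4); // cap concurrent "writes" at 4
        AtomicInteger completed = new AtomicInteger();
        List<CompletableFuture<Void>> futures = new ArrayList<>();

        for (int i = 0; i < 20; i++) {
            inFlight.acquire(); // block while 4 requests are outstanding
            CompletableFuture<Void> f = CompletableFuture
                    .runAsync(completed::incrementAndGet, pool) // stand-in for executeAsync
                    .whenComplete((r, t) -> inFlight.release());
            futures.add(f);
        }
        CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
        pool.shutdown();
        System.out.println(completed.get()); // 20
    }
}
```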

Someone did some benchmarking comparing async and "unlogged batches", and we found it quite useful. Here is the link.