
Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/45628221/


Akka Stream Kafka vs Kafka Streams

scala, akka-stream, apache-kafka-streams, typesafe, stream-processing

Asked by nsanglar

I am currently working with Akka Stream Kafka to interact with Kafka, and I was wondering how it differs from Kafka Streams.


I know that the Akka-based approach implements the Reactive Streams specification and handles back-pressure, functionality that Kafka Streams seems to be lacking.


What would be the advantage of using Kafka Streams over Akka Streams Kafka?


Answered by Frederic A.

Your question is very general, so I'll give a general answer from my point of view.


First, I've got two usage scenarios:


  1. cases where I'm reading data from Kafka, processing it, and writing some output back to Kafka; for these I'm using Kafka Streams exclusively.
  2. cases where either the data source or the sink is not Kafka; for those I'm using Akka Streams. (A sketch of both scenarios follows this list.)
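To make the two scenarios concrete, here is a minimal sketch (not from the original answer). It assumes a local broker at localhost:9092, made-up topic names "orders" and "orders-uppercased", String keys and values, Kafka Streams 2.x, and Alpakka Kafka on Akka 2.6+ (where the implicit ActorSystem provides the materializer):

```scala
import java.util.Properties

import akka.actor.ActorSystem
import akka.kafka.scaladsl.Consumer
import akka.kafka.{ConsumerSettings, Subscriptions}
import akka.stream.scaladsl.Sink
import org.apache.kafka.common.serialization.{Serdes, StringDeserializer}
import org.apache.kafka.streams.kstream.{KStream, ValueMapper}
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}

object TwoScenarios extends App {

  // Scenario 1: Kafka in, Kafka out -> a Kafka Streams topology.
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-uppercaser")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
  props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

  val builder = new StreamsBuilder()
  val orders: KStream[String, String] = builder.stream[String, String]("orders")
  orders
    .mapValues(new ValueMapper[String, String] {
      override def apply(value: String): String = value.toUpperCase
    })
    .to("orders-uppercased")
  new KafkaStreams(builder.build(), props).start()

  // Scenario 2: Kafka in, non-Kafka out -> an Akka Stream Kafka (Alpakka Kafka) source.
  implicit val system: ActorSystem = ActorSystem("orders-consumer")
  val consumerSettings =
    ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
      .withBootstrapServers("localhost:9092")
      .withGroupId("orders-printer")

  Consumer
    .plainSource(consumerSettings, Subscriptions.topics("orders-uppercased"))
    .map(_.value)
    .runWith(Sink.foreach(println)) // stand-in for any non-Kafka sink, e.g. a database writer
}
```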

This already allows me to answer the part about back-pressure: for the first scenario above, there is a back-pressure mechanism in Kafka Streams.


Let's now focus only on the first scenario described above, and see what I would lose if I decided to stop using Kafka Streams:


  • some of my stream processor stages need a persistent (distributed) state store; Kafka Streams provides it for me. It is something that Akka Streams doesn't provide.
  • scaling: Kafka Streams automatically rebalances the load as soon as a new instance of a stream processor is started, or as soon as one gets killed. This works inside the same JVM as well as on other nodes: scaling up and out. Akka Streams does not provide this. (A sketch illustrating both points follows this list.)
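A minimal sketch of these two points, using the plain Java Kafka Streams API from Scala (Kafka 2.x assumed; the topic names "clicks" and "clicks-per-user" are made up). The named store is persisted locally and backed by a compacted changelog topic, and starting a second instance with the same application.id makes the brokers rebalance partitions, and their state, across instances:

```scala
import java.util.Properties

import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.common.utils.Bytes
import org.apache.kafka.streams.kstream.{KStream, Materialized, Produced}
import org.apache.kafka.streams.state.KeyValueStore
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}

object ClickCounter extends App {
  val props = new Properties()
  // Every instance started with the same application.id joins the same group;
  // Kafka rebalances partitions (and the state that goes with them) across instances.
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "click-counter")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
  props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

  val builder = new StreamsBuilder()
  val clicks: KStream[String, String] = builder.stream[String, String]("clicks")

  // Count clicks per key in a named, persistent (RocksDB-backed) store;
  // Kafka Streams also backs it up to a compacted changelog topic.
  clicks
    .groupByKey()
    .count(Materialized.as[String, java.lang.Long, KeyValueStore[Bytes, Array[Byte]]]("clicks-per-user-store"))
    .toStream()
    .to("clicks-per-user", Produced.`with`(Serdes.String(), Serdes.Long()))

  new KafkaStreams(builder.build(), props).start()
}
```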

Those are the biggest differences that matter to me; I hope this makes sense to you!


Answered by vgkowski

The big advantage of Akka Streams over Kafka Streams is the possibility to implement very complex processing graphs, which can be cyclic with fan-in/fan-out and feedback loops. Kafka Streams only allows acyclic graphs, if I am not wrong. It would be very complicated to implement a cyclic processing graph on top of Kafka Streams.
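As an illustration, a feedback loop is easy to express with the Akka Streams GraphDSL. This is a toy sketch (not from the answer), assuming Akka 2.6+ where the implicit ActorSystem provides the materializer; the buffer on the feedback edge is what keeps the cycle from deadlocking:

```scala
import akka.actor.ActorSystem
import akka.stream.{ClosedShape, OverflowStrategy}
import akka.stream.scaladsl.{Broadcast, Flow, GraphDSL, MergePreferred, RunnableGraph, Sink, Source}

object FeedbackLoopSketch extends App {
  implicit val system: ActorSystem = ActorSystem("cycles")

  val graph = RunnableGraph.fromGraph(GraphDSL.create() { implicit b =>
    import GraphDSL.Implicits._

    val merge  = b.add(MergePreferred[Int](1))
    val bcast  = b.add(Broadcast[Int](2))
    val double = Flow[Int].map(_ * 2)
    // Feedback edge: values below 100 are sent around the loop again.
    // The buffer avoids the deadlock a naked cycle would cause.
    val feedback = Flow[Int].filter(_ < 100).buffer(16, OverflowStrategy.dropHead)

    Source(1 to 5) ~> merge ~> double ~> bcast ~> Sink.foreach[Int](println)
    merge.preferred <~ feedback <~ bcast

    ClosedShape
  })

  graph.run() // the demo keeps the ActorSystem alive; terminate it manually when done
}
```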


Answered by SemanticBeeng

Found this article to give a good summary of the distributed design concerns that Kafka Streams addresses (complementing Akka Streams).


https://www.beyondthelines.net/computing/kafka-streams/


Message ordering: Kafka maintains a sort of append-only log where it stores all the messages. Each message has a sequence id, also known as its offset. The offset is used to indicate the position of a message in the log. Kafka Streams uses these message offsets to maintain ordering.

Partitioning: Kafka splits a topic into partitions, and each partition is replicated among different brokers. Partitioning allows the load to be spread, and replication makes the application fault-tolerant (if a broker is down, the data is still available). That's good for data partitioning, but we also need to distribute the processes in a similar way. Kafka Streams uses the processor topology, which relies on Kafka group management. This is the same group management used by the Kafka consumer to distribute load evenly among brokers (this work is mainly managed by the brokers).

Fault tolerance: data replication ensures data fault tolerance. Group management has fault tolerance built-in as it redistributes the workload among remaining live broker instances.

State management: Kafka Streams provides local storage backed up by a Kafka change-log topic which uses log compaction (keeps only the latest value for a given key; see "Kafka log compaction"). A compacted-topic sketch follows after this summary.

Reprocessing: when starting a new version of the app, we can reprocess the logs from the start to compute a new state, then redirect the traffic to the new instance and shut down the old application.

Time management: “Stream data is never complete and can always arrive out-of-order”, therefore one must distinguish event time from processing time and handle it correctly. (A timestamp-extractor sketch follows below.)
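To illustrate the log-compaction point from the state-management item: Kafka Streams creates its changelog topics automatically, but the compaction setting it relies on can be seen by creating such a topic by hand with the admin client. A minimal sketch, with made-up topic name, partition count, and replication factor:

```scala
import java.util.{Collections, Properties}

import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}
import org.apache.kafka.common.config.TopicConfig

object CreateCompactedTopic extends App {
  val props = new Properties()
  props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  val admin = AdminClient.create(props)

  // A compacted topic keeps only the latest value per key, which is what
  // Kafka Streams relies on for its state-store changelog topics.
  val changelog = new NewTopic("user-profile-changelog", 3, 1.toShort)
    .configs(Collections.singletonMap(TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT))

  admin.createTopics(Collections.singletonList(changelog)).all().get()
  admin.close()
}
```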

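For the time-management point, Kafka Streams lets you plug in a TimestampExtractor so that windowing is driven by event time rather than processing or ingestion time. A hedged sketch, with a made-up SensorReading payload type:

```scala
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.streams.processor.TimestampExtractor

// Hypothetical payload carrying its own event time.
final case class SensorReading(sensorId: String, eventTimeMillis: Long, value: Double)

// Use the timestamp embedded in the event instead of the broker's ingestion
// time, so out-of-order records are assigned to the right windows.
class SensorTimestampExtractor extends TimestampExtractor {
  override def extract(record: ConsumerRecord[AnyRef, AnyRef], partitionTime: Long): Long =
    record.value() match {
      case reading: SensorReading => reading.eventTimeMillis
      case _                      => partitionTime // fall back to the stream's previous timestamp
    }
}

// Registered via the Streams config, e.g.:
// props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, classOf[SensorTimestampExtractor])
```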

The author also says: "Using this change-log topic Kafka Stream is able to maintain a 'table view' of the application state."

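A small sketch of that "table view": reading a topic as a KTable (rather than a KStream) keeps only the latest value per key in a named local store (the topic and store names here are invented):

```scala
import org.apache.kafka.common.utils.Bytes
import org.apache.kafka.streams.StreamsBuilder
import org.apache.kafka.streams.kstream.{KTable, Materialized}
import org.apache.kafka.streams.state.KeyValueStore

object TableViewSketch {
  val builder = new StreamsBuilder()

  // For each key only the latest value is retained, both in the local store
  // and (via compaction) in the backing changelog topic.
  val profiles: KTable[String, String] =
    builder.table(
      "user-profiles",
      Materialized.as[String, String, KeyValueStore[Bytes, Array[Byte]]]("user-profiles-store")
    )
}
```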

My take is that this applies mostly to an enterprise application where the "application state" is ... small.


For a data science application working with "big data", the "application state" produced by a combination of data munging, machine learning models, and the business logic orchestrating all of this will likely not be managed well with Kafka Streams.


Also, I am thinking that using a "pure functional event sourcing runtime" like https://github.com/notxcain/aecor will help make the mutations explicit and separate the application logic from the technology used to manage the persistent form of the state, through the principled management of state mutation and IO "effects" (functional programming).


In other words, the business logic does not become tangled with the Kafka APIs.
