When to use Kryo serialization in Spark?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original address, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/40261987/

Date: 2020-10-22 08:47:37  Source: igfitidea

When to use Kryo serialization in Spark?

Tags: scala, apache-spark, rdd, kryo

Asked by pythonic

I am already compressing RDDs using conf.set("spark.rdd.compress","true") and persist(MEMORY_AND_DISK_SER). Will using Kryo serialization make the program even more efficient, or is it not useful in this case? I know that Kryo is for sending data between nodes in a more efficient way. But if the communicated data is already compressed, is it even needed?

Answered by Tim

Both of the RDD states you described (compressed and persisted) use serialization. When you persist an RDD, you are serializing it and saving it to disk (in your case, compressing the serialized output as well). You are right that serialization is also used for shuffles (sending data between nodes): any time data needs to leave a JVM, whether it's going to local disk or through the network, it needs to be serialized.
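
As a minimal sketch of the two settings from the question together (the app name and local master are illustrative assumptions, not from the original post):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Minimal sketch, assuming a local Spark context; "persist-demo" is illustrative.
val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("persist-demo")
  .set("spark.rdd.compress", "true") // compress the serialized partitions

val sc = new SparkContext(conf)

val rdd = sc.parallelize(1 to 1000000)
// MEMORY_AND_DISK_SER stores partitions in serialized form, so both the
// serializer choice and the compression setting apply to this cache.
rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)
println(rdd.count()) // materialize and cache the RDD
```

Because the storage level is serialized, whichever serializer is configured (Java or Kryo) determines the bytes that get compressed and written.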

Kryo is a significantly optimized serializer, and performs better than the standard Java serializer for just about everything. In your case, you may actually be using Kryo already. You can check your Spark configuration parameter:

"spark.serializer" should be "org.apache.spark.serializer.KryoSerializer".

If it's not, then you can set this internally with:

conf.set( "spark.serializer", "org.apache.spark.serializer.KryoSerializer" )
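
A fuller configuration sketch is below; the `MyKey` and `MyRecord` classes are hypothetical placeholders for your own application classes:

```scala
import org.apache.spark.SparkConf

// Hypothetical application classes standing in for your own types.
case class MyKey(id: Int)
case class MyRecord(key: MyKey, payload: String)

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering classes lets Kryo write a small numeric ID instead of
  // the full class name with every serialized object.
  .registerKryoClasses(Array(classOf[MyKey], classOf[MyRecord]))
  // Optional: fail fast on any unregistered class instead of silently
  // falling back to writing full class names (can be noisy in practice,
  // since some internal Spark classes may also need registering).
  .set("spark.kryo.registrationRequired", "true")
```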

Regarding your last question ("is it even needed?"), it's hard to make a general claim. Kryo optimizes one of the slow steps in communicating data, but it's entirely possible that in your use case other factors are the bottleneck. Still, there's no downside to trying Kryo and benchmarking the difference!
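
One quick way to get a feel for the difference without running a full job is to compare serialized sizes of the same data under both of Spark's serializers. This is only a rough sketch; real RDD data and shuffle traffic will behave differently:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.serializer.{JavaSerializer, KryoSerializer}

// Rough sketch: serialize the same array with both serializers and
// compare the resulting byte counts.
val conf = new SparkConf()
val data = (1 to 10000).map(i => (i, s"value-$i")).toArray

val javaBytes = new JavaSerializer(conf).newInstance().serialize(data).limit()
val kryoBytes = new KryoSerializer(conf).newInstance().serialize(data).limit()

println(s"Java serializer: $javaBytes bytes, Kryo: $kryoBytes bytes")
```

For data like this, Kryo's output is typically noticeably smaller, but the gap depends heavily on the shape of your own data.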

Answered by Sandeep Purohit

Kryo serialization is a more optimized serialization technique, so you can use it to serialize any class that is used in an RDD or DataFrame closure. Some specific situations where Kryo serialization is useful:

  1. Use it when serializing third-party, non-serializable classes inside an RDD or DataFrame closure.
  2. Use it when you want a more efficient serialization technique in general.
  3. If you ever get a serialization error caused by some class, you can register that class with the Kryo serializer.
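
Point 3 above can be sketched with a custom registrator; `ThirdPartyPoint` and `MyRegistrator` are hypothetical names for illustration:

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical third-party class that does not extend java.io.Serializable
// and therefore caused a serialization error in a closure.
class ThirdPartyPoint(val x: Double, val y: Double)

// A custom registrator gives full control over registration, e.g. for
// attaching custom Kryo serializers to awkward classes.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[ThirdPartyPoint])
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[MyRegistrator].getName)
```

For the common case, `conf.registerKryoClasses(Array(classOf[ThirdPartyPoint]))` achieves the same thing with less ceremony; the registrator is the escape hatch when you need custom serializers.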

Answered by yanghaogn

Considering another point: Kryo is faster than the default Java serializer at both serialization and deserialization, so it is generally better to use Kryo. But the performance gain may not be as large as advertised; other factors also influence program speed, such as how you write your Spark code and which libraries you choose.
