Scala: Mock a Spark RDD in unit tests

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/30944931/

Date: 2020-10-22 07:16:32  Source: igfitidea


Tags: scala, unit-testing, mocking, apache-spark, scalatest

Asked by Edamame

Is it possible to mock an RDD without using a SparkContext?


I want to unit test the following utility function:


 def myUtilityFunction(data1: org.apache.spark.rdd.RDD[myClass1], data2: org.apache.spark.rdd.RDD[myClass2]): org.apache.spark.rdd.RDD[myClass1] = {...}

So I need to pass data1 and data2 to myUtilityFunction. How can I create data1 from a mock org.apache.spark.rdd.RDD[myClass1], instead of creating a real RDD from a SparkContext? Thank you!


Answered by Holden

RDDs are pretty complex; mocking them is probably not the best way to go about creating test data. Instead, I'd recommend using sc.parallelize with your data. I also (somewhat biasedly) think that https://github.com/holdenk/spark-testing-base can help by providing a trait to set up & tear down the Spark context for your tests.

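A minimal sketch of this approach applied to the question's utility function. The case classes and the expected result here are hypothetical stand-ins for myClass1/myClass2 and whatever myUtilityFunction actually computes; the point is that sc.parallelize turns small in-memory collections into real RDDs without any cluster:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical stand-ins for the question's myClass1 / myClass2.
case class MyClass1(id: Int)
case class MyClass2(id: Int)

object MyUtilityFunctionTest {
  def main(args: Array[String]): Unit = {
    // A local master means no cluster is needed; "local[2]" uses two threads.
    val conf = new SparkConf().setMaster("local[2]").setAppName("utility-test")
    val sc = new SparkContext(conf)
    try {
      // Build small, deterministic RDDs from in-memory data.
      val data1: RDD[MyClass1] = sc.parallelize(Seq(MyClass1(1), MyClass1(2)))
      val data2: RDD[MyClass2] = sc.parallelize(Seq(MyClass2(1)))

      // Call the function under test and assert on collected output, e.g.:
      // val result = myUtilityFunction(data1, data2)
      // assert(result.collect().toSet == Set(MyClass1(1)))
    } finally {
      sc.stop() // Spark does not allow two concurrent contexts in one JVM.
    }
  }
}
```

This requires Apache Spark on the classpath, but nothing else; the try/finally mirrors the teardown advice quoted below from the programming guide.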

Answered by eliasah

I totally agree with @Holden on that!


Mocking RDDs is difficult; executing your unit tests against a local Spark context is preferred, as recommended in the programming guide.


I know this may not technically be a unit test, but it is hopefully close enough.


Unit Testing

Spark is friendly to unit testing with any popular unit test framework. Simply create a SparkContext in your test with the master URL set to local, run your operations, and then call SparkContext.stop() to tear it down. Make sure you stop the context within a finally block or the test framework's tearDown method, as Spark does not support two contexts running concurrently in the same program.

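The setup/teardown pattern from the guide, sketched with ScalaTest (which the question is tagged with). The class and test names are illustrative, and the package paths assume ScalaTest 3.x and Spark on the classpath:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.BeforeAndAfterAll
import org.scalatest.funsuite.AnyFunSuite

class SparkLocalSuite extends AnyFunSuite with BeforeAndAfterAll {
  @transient private var sc: SparkContext = _

  override def beforeAll(): Unit = {
    // One context for the whole suite, using a local master URL.
    sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("test"))
  }

  override def afterAll(): Unit = {
    // Teardown happens even if a test fails, satisfying the
    // "stop the context in tearDown" requirement from the guide.
    if (sc != null) sc.stop()
  }

  test("operations run against a real local RDD") {
    val rdd = sc.parallelize(Seq(1, 2, 3))
    assert(rdd.map(_ * 2).collect().toSeq == Seq(2, 4, 6))
  }
}
```

This is essentially what the spark-testing-base trait mentioned above packages up for you.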

But if you are really interested and you still want to try mocking RDDs, I'd suggest that you read the ImplicitSuite test code.


The only reason they pseudo-mock the RDD there is to test whether implicit conversions work well with the compiler; they don't actually need a real RDD.


def mockRDD[T]: org.apache.spark.rdd.RDD[T] = null

And it's not even a real mock. It just creates a null reference of type RDD[T].

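A sketch of how that null placeholder is used, in the spirit of Spark's ImplicitSuite. The method body is never executed; the code only needs to compile, which is enough to verify that the compiler resolves the expected implicits (the commented call is illustrative, not from the actual suite):

```scala
import org.apache.spark.rdd.RDD

object PseudoMockSketch {
  // Compile-time-only placeholder: the RDD is never evaluated,
  // so null is enough for implicit resolution to be exercised.
  def mockRDD[T]: RDD[T] = null

  def testPairRDDImplicits(): Unit = {
    val rdd: RDD[(Int, String)] = mockRDD[(Int, String)]
    // The line below only needs to type-check; actually calling it
    // at runtime would throw a NullPointerException.
    // rdd.groupByKey()
  }
}
```

In other words, this technique checks the API surface, not behavior; for behavioral tests, fall back on a local SparkContext as described above.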