Best practice to create SparkSession object in Scala to use both in unittest and spark-submit

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). StackOverflow source: http://stackoverflow.com/questions/45407406/

scala, apache-spark, spark-submit

Asked by Joo-Won Jung

I have tried to write a transformation method from DataFrame to DataFrame, and I also want to test it with ScalaTest.

As you know, in Spark 2.x with the Scala API, you can create a SparkSession object as follows:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
     .config("spark.master", "local[2]")
     .getOrCreate()

This code works fine in unit tests. But when I run it with spark-submit, the cluster options do not take effect. For example,

spark-submit --master yarn --deploy-mode client --num-executors 10 ...

does not create any executors.

I have found that the spark-submit arguments are applied when I remove the config("spark.master", "local[2]") part of the code above. But without the master setting, the unit-test code does not work.

I tried to split the spark (SparkSession) object creation into a test part and a main part. But there are so many code blocks that need spark, for example import spark.implicits._ and spark.createDataFrame(rdd, schema).

Is there a best practice for writing code that creates the spark object both for tests and for running with spark-submit?

Answered by Rick Moritz

One way is to create a trait which provides the SparkContext/SparkSession, and use that in your test cases, like so:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SparkSession}

trait SparkTestContext {
  private val master = "local[*]"
  private val appName = "testing"
  // Only needed on Windows, where Spark looks for the winutils binaries.
  System.setProperty("hadoop.home.dir", "c:\\winutils\\")
  private val conf: SparkConf = new SparkConf()
    .setMaster(master)
    .setAppName(appName)
    .set("spark.driver.allowMultipleContexts", "false")
    .set("spark.ui.enabled", "false")

  val ss: SparkSession = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate()
  val sc: SparkContext = ss.sparkContext
  val sqlContext: SQLContext = ss.sqlContext
}

And your test class header then looks like this, for example:

class TestWithSparkTest extends BaseSpec with SparkTestContext with Matchers{
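
A test body might then use the session provided by the trait roughly like this (a minimal sketch: FunSuite stands in for the undefined BaseSpec, and the suite name, data, and assertion are made up for illustration):

import org.scalatest.{FunSuite, Matchers}

class MyTransformationSpec extends FunSuite with SparkTestContext with Matchers {
  // ss comes from SparkTestContext and is a val, so its implicits can be imported directly.
  import ss.implicits._

  test("a DataFrame-to-DataFrame transformation") {
    val input = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "label")
    val result = input.filter($"id" > 1)
    result.count() shouldBe 2
  }
}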

Answered by Karima Rafes

I made a version where Spark will close correctly after tests.

import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FunSuite, Matchers}

trait SparkTest extends FunSuite with BeforeAndAfterAll with Matchers {
  var ss: SparkSession = _
  var sc: SparkContext = _
  var sqlContext: SQLContext = _

  override def beforeAll(): Unit = {
    val master = "local[*]"
    val appName = "MyApp"
    val conf: SparkConf = new SparkConf()
      .setMaster(master)
      .setAppName(appName)
      .set("spark.driver.allowMultipleContexts", "false")
      .set("spark.ui.enabled", "false")

    ss = SparkSession.builder().config(conf).getOrCreate()

    sc = ss.sparkContext
    sqlContext = ss.sqlContext
    super.beforeAll()
  }

  override def afterAll(): Unit = {
    sc.stop()
    super.afterAll()
  }
}
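
A concrete suite extending this trait might look like the following (a minimal sketch; the suite name, schema, and data are assumptions for illustration, chosen to exercise spark.createDataFrame(rdd, schema) from the question):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

class CreateDataFrameTest extends SparkTest {
  test("builds a DataFrame from an RDD and a schema") {
    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = false),
      StructField("label", StringType, nullable = true)))
    val rdd = sc.parallelize(Seq(Row(1, "a"), Row(2, "b")))
    val df = ss.createDataFrame(rdd, schema)

    df.count() shouldBe 2
    df.columns.toSeq shouldBe Seq("id", "label")
  }
}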

Answered by wusuo li

The spark-submit command with the parameter --master yarn sets the YARN master. This will conflict with master("x") in your code, even if you use something like master("yarn").

If you want to use import sparkSession.implicits._ for things like toDF, toDS, or other functions, you can just use a local sparkSession variable created like below:

val spark = SparkSession.builder().appName("YourName").getOrCreate()

That is, do not set master("x") in the code when you run with spark-submit --master yarn rather than on a local machine.
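
Putting that together, the application entry point might look roughly like this (a minimal sketch; the object name, app name, and sample data are illustrative, and the master is expected to come from spark-submit rather than from the code):

import org.apache.spark.sql.SparkSession

object MyApp {
  def main(args: Array[String]): Unit = {
    // No .master(...) here: spark-submit supplies it (e.g. --master yarn),
    // while unit tests can configure their own local session separately.
    val spark = SparkSession.builder().appName("YourName").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "label")
    df.show()

    spark.stop()
  }
}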

My advice: do not use a global sparkSession in your code. That may cause errors or exceptions.

Hope this helps you. Good luck!

Answered by u1516331

How about defining an object in which a method creates a singleton instance of SparkSession, like MySparkSession.get(), and passing it as a parameter to each of your unit tests?
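
A sketch of such a helper might look like this (the name MySparkSession.get() comes from the suggestion above; the local master and app name are assumptions chosen for testing):

import org.apache.spark.sql.SparkSession

object MySparkSession {
  // A single SparkSession, created on first use and reused afterwards.
  private lazy val session: SparkSession = SparkSession.builder()
    .master("local[2]")
    .appName("unit-tests")
    .getOrCreate()

  def get(): SparkSession = session
}

Each unit test can then call MySparkSession.get() (or receive the returned session as a parameter) instead of building its own SparkSession.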

In your main method, you can create a separate SparkSession instance, which can have different configurations.

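One way to read that in code: the shared transformation takes the SparkSession as a parameter, the tests pass in MySparkSession.get(), and main builds its own session whose master and cluster settings come from spark-submit (a sketch; MainApp, myTransform, the column names, and the shuffle-partitions setting are all illustrative):

import org.apache.spark.sql.{DataFrame, SparkSession}

object MainApp {
  // Shared logic depends only on the SparkSession it is given, so tests and
  // main can each supply their own instance.
  def myTransform(spark: SparkSession, df: DataFrame): DataFrame = {
    import spark.implicits._
    df.filter($"id" > 0)
  }

  def main(args: Array[String]): Unit = {
    // A separate instance for the real job: the master comes from spark-submit,
    // and only app-specific configuration is set here.
    val spark = SparkSession.builder()
      .appName("MainApp")
      .config("spark.sql.shuffle.partitions", "200")
      .getOrCreate()

    import spark.implicits._
    myTransform(spark, Seq((1, "a"), (-1, "b")).toDF("id", "label")).show()
  }
}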