
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow, original question: http://stackoverflow.com/questions/39151189/


Importing spark.implicits._ in Scala

Tags: scala, apache-spark

Asked by ShinySpiderdude

I am trying to import spark.implicits._. Apparently, this is an object inside a class in Scala. When I import it in a method like so:


def f() = {
  val spark = SparkSession()....
  import spark.implicits._
}

It works fine; however, I am writing a test class and I want to make this import available for all tests. I have tried:


class SomeSpec extends FlatSpec with BeforeAndAfter {
  var spark:SparkSession = _

  //This won't compile
  import spark.implicits._

  before {
    spark = SparkSession()....
    //This won't either
    import spark.implicits._
  }

  "a test" should "run" in {
    //Even this won't compile (although it already looks bad here)
    import spark.implicits._

    //This was the only way i could make it work
    val spark = this.spark
    import spark.implicits._
  }
}

Not only does this look bad, but I also don't want to do it for every test. What is the "correct" way of doing it?


Answer by bluenote10

You can do something similar to what is done in the Spark testing suites. For example, this would work (inspired by SQLTestData):


import org.apache.spark.sql.{SQLContext, SQLImplicits, SparkSession}
import org.scalatest.{BeforeAndAfter, FlatSpec}

class SomeSpec extends FlatSpec with BeforeAndAfter { self =>

  var spark: SparkSession = _

  private object testImplicits extends SQLImplicits {
    protected override def _sqlContext: SQLContext = self.spark.sqlContext
  }
  import testImplicits._

  before {
    spark = SparkSession.builder().master("local").getOrCreate()
  }

  "a test" should "run" in {
    // implicits are working
    val df = spark.sparkContext.parallelize(List(1,2,3)).toDF()
  }
}

Alternatively, you may use something like SharedSQLContext directly, which provides a testImplicits: SQLImplicits, i.e.:


import org.apache.spark.sql.test.SharedSQLContext
import org.scalatest.FlatSpec

class SomeSpec extends FlatSpec with SharedSQLContext {
  import testImplicits._

  // ...

}
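
Note that SharedSQLContext lives in Spark's own test sources rather than the published compile-time API, so this approach needs the Spark test artifacts on your test classpath. A build.sbt sketch (assuming an sbt build, with a sparkVersion value defined elsewhere in the build):

// Sketch only: pulls Spark's published test-jars so that
// org.apache.spark.sql.test.SharedSQLContext is visible in test code.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"     % sparkVersion % Test classifier "tests",
  "org.apache.spark" %% "spark-catalyst" % sparkVersion % Test classifier "tests",
  "org.apache.spark" %% "spark-sql"      % sparkVersion % Test classifier "tests"
)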

Answer by Kehe CAI

I think the code in the SparkSession.scala file on GitHub can give you a good hint:


      /**
       * :: Experimental ::
       * (Scala-specific) Implicit methods available in Scala for converting
       * common Scala objects into [[DataFrame]]s.
       *
       * {{{
       *   val sparkSession = SparkSession.builder.getOrCreate()
       *   import sparkSession.implicits._
       * }}}
       *
       * @since 2.0.0
       */
      @Experimental
      object implicits extends SQLImplicits with Serializable {
        protected override def _sqlContext: SQLContext = SparkSession.this.sqlContext
      }

here "spark" in "spark.implicits._" is just the sparkSession object we created.

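To make that concrete, here is a minimal self-contained sketch (the master, app name, and object name are just placeholders):

import org.apache.spark.sql.SparkSession

object ImplicitsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("implicits-demo").getOrCreate()
    import spark.implicits._       // "spark" here is the value created just above
    val ds = Seq(1, 2, 3).toDS()   // toDS is provided by the imported implicits
    ds.show()
    spark.stop()
  }
}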

Here is another reference!


Answer by seufagner

I just instantiate the SparkSession and, before using it, "import implicits":


@transient lazy val spark = SparkSession
  .builder()
  .master("spark://master:7777")
  .getOrCreate()

import spark.implicits._

Answer by Azik

Thanks to @bluenote10 for the helpful answer. We can simplify it further, for example by dropping the helper object testImplicits:


private object testImplicits extends SQLImplicits {
  protected override def _sqlContext: SQLContext = self.spark.sqlContext
}

in the following way:


import org.apache.spark.sql.{SQLImplicits, SparkSession}
import org.scalatest.{BeforeAndAfterAll, Suite}

trait SharedSparkSession extends BeforeAndAfterAll { self: Suite =>

  /**
   * The SparkSession instance to use for all tests in one suite.
   */
  private var spark: SparkSession = _

  /**
   * Returns local running SparkSession instance.
   * @return SparkSession instance `spark`
   */
  protected def sparkSession: SparkSession = spark

  /**
   * A helper implicit value that allows us to import SQL implicits.
   */
  protected lazy val sqlImplicits: SQLImplicits = self.sparkSession.implicits

  /**
   * Starts a new local spark session for tests.
   */
  protected def startSparkSession(): Unit = {
    if (spark == null) {
      spark = SparkSession
        .builder()
        .master("local[2]")
        .appName("Testing Spark Session")
        .getOrCreate()
    }
  }

  /**
   * Stops existing local spark session.
   */
  protected def stopSparkSession(): Unit = {
    if (spark != null) {
      spark.stop()
      spark = null
    }
  }

  /**
   * Runs before all tests and starts spark session.
   */
  override def beforeAll(): Unit = {
    startSparkSession()
    super.beforeAll()
  }

  /**
   * Runs after all tests and stops existing spark session.
   */
  override def afterAll(): Unit = {
    super.afterAll()
    stopSparkSession()
  }
}

and finally we can use SharedSparkSession for unit tests and import sqlImplicits:


import org.scalatest.FunSuite

class SomeSuite extends FunSuite with SharedSparkSession {
  // We can import sql implicits 
  import sqlImplicits._

  // We can use method sparkSession which returns locally running spark session
  test("some test") {
    val df = sparkSession.sparkContext.parallelize(List(1,2,3)).toDF()
    //...
  }
}

Answer by ValaravausBlack

Well, I've been re-using the existing SparkSession in each called method, by creating a local val inside the method:


val spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession.active

And then


import spark.implicits._
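
For illustration, a minimal sketch of that pattern (assuming Spark 2.4+, where SparkSession.active is available; the helper method name is hypothetical):

import org.apache.spark.sql.{Dataset, SparkSession}

object ActiveSessionSketch {
  // Hypothetical helper: re-acquire the already running session as a local val,
  // so the import path inside the method refers to a stable value.
  def upperCased(names: Seq[String]): Dataset[String] = {
    val spark: SparkSession = SparkSession.active
    import spark.implicits._
    names.toDS().map(_.toUpperCase)
  }
}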

Answer by KayV

Create a SparkSession object and import spark.implicits._ just before you want to convert any RDD to a Dataset.


Like this:


val spark = SparkSession
  .builder
  .appName("SparkSQL")
  .master("local[*]")
  .getOrCreate()

import spark.implicits._
val someDataset = someRdd.toDS

Answer by hobgoblin

It has something to do with using val vs. var in Scala.


E.g. the following does not work:


var sparkSession = new SparkSession.Builder().appName("my-app").config(sparkConf).getOrCreate
import sparkSession.implicits._

But the following does:


sparkSession = new SparkSession.Builder().appName("my-app").config(sparkConf).getOrCreate
val sparkSessionConst = sparkSession
import sparkSessionConst.implicits._

I am not very familiar with Scala, so I can only guess that the reasoning is the same as why, in Java, we can only use outer variables declared final inside a closure.

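
The underlying language rule, for what it's worth: an import prefix in Scala must be a stable identifier (a val, an object, or a package), and a var is not stable, which is why capturing the session in a val first makes the import compile. A minimal sketch of the rule, independent of Spark (assuming Scala 2.x semantics):

object StableImportSketch {
  class Holder { object implicits { implicit val answer: Int = 42 } }

  var unstable: Holder = new Holder
  // import unstable.implicits._   // does not compile: "stable identifier required"

  val stable: Holder = unstable    // capture the current value in a val
  import stable.implicits._        // compiles: "stable" is a stable identifier
}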