
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow, original question: http://stackoverflow.com/questions/39151189/


Importing spark.implicits._ in Scala

Tags: scala, apache-spark

Asked by ShinySpiderdude

I am trying to import spark.implicits._. Apparently, this is an object inside a class in Scala. When I import it in a method like so:


def f() = {
  val spark = SparkSession()....
  import spark.implicits._
}

It works fine; however, I am writing a test class and I want to make this import available for all tests. I have tried:


class SomeSpec extends FlatSpec with BeforeAndAfter {
  var spark:SparkSession = _

  //This won't compile
  import spark.implicits._

  before {
    spark = SparkSession()....
    //This won't either
    import spark.implicits._
  }

  "a test" should "run" in {
    //Even this won't compile (although it already looks bad here)
    import spark.implicits._

    //This was the only way i could make it work
    val spark = this.spark
    import spark.implicits._
  }
}

Not only does this look bad, but I also don't want to do it for every test. What is the "correct" way of doing it?


Answer by bluenote10

You can do something similar to what is done in the Spark testing suites. For example, this would work (inspired by SQLTestData):


import org.apache.spark.sql.{SQLContext, SQLImplicits, SparkSession}
import org.scalatest.{BeforeAndAfter, FlatSpec}

class SomeSpec extends FlatSpec with BeforeAndAfter { self =>

  var spark: SparkSession = _

  private object testImplicits extends SQLImplicits {
    protected override def _sqlContext: SQLContext = self.spark.sqlContext
  }
  import testImplicits._

  before {
    spark = SparkSession.builder().master("local").getOrCreate()
  }

  "a test" should "run" in {
    // implicits are working
    val df = spark.sparkContext.parallelize(List(1,2,3)).toDF()
  }
}

Alternatively, you may use something like SharedSQLContext directly, which provides a testImplicits: SQLImplicits, i.e.:


import org.apache.spark.sql.test.SharedSQLContext
import org.scalatest.FlatSpec

class SomeSpec extends FlatSpec with SharedSQLContext {
  import testImplicits._

  // ...

}
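
Note that SharedSQLContext lives in Spark's own test sources rather than the published compile-time API, so this approach needs the Spark test artifacts on your test classpath. A build.sbt sketch (assuming an sbt build, with a sparkVersion value defined elsewhere in the build):

// Sketch only: pulls Spark's published test-jars so that
// org.apache.spark.sql.test.SharedSQLContext is visible in test code.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"     % sparkVersion % Test classifier "tests",
  "org.apache.spark" %% "spark-catalyst" % sparkVersion % Test classifier "tests",
  "org.apache.spark" %% "spark-sql"      % sparkVersion % Test classifier "tests"
)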

Answer by Kehe CAI

I think the code in the SparkSession.scala file on GitHub can give you a good hint:


      /**
       * :: Experimental ::
       * (Scala-specific) Implicit methods available in Scala for converting
       * common Scala objects into [[DataFrame]]s.
       *
       * {{{
       *   val sparkSession = SparkSession.builder.getOrCreate()
       *   import sparkSession.implicits._
       * }}}
       *
       * @since 2.0.0
       */
      @Experimental
      object implicits extends SQLImplicits with Serializable {
        protected override def _sqlContext: SQLContext = SparkSession.this.sqlContext
      }

here "spark" in "spark.implicits._" is just the sparkSession object we created.

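To make that concrete, here is a minimal self-contained sketch (the master, app name, and object name are just placeholders):

import org.apache.spark.sql.SparkSession

object ImplicitsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("implicits-demo").getOrCreate()
    import spark.implicits._       // "spark" here is the value created just above
    val ds = Seq(1, 2, 3).toDS()   // toDS is provided by the imported implicits
    ds.show()
    spark.stop()
  }
}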

Here is another reference!


Answer by seufagner

I just instantiate the SparkSession and, before using it, "import implicits":


@transient lazy val spark = SparkSession
  .builder()
  .master("spark://master:7777")
  .getOrCreate()

import spark.implicits._

Answer by Azik

Thanks to @bluenote10 for the helpful answer. We can simplify it further, for example by dropping the helper object testImplicits:


private object testImplicits extends SQLImplicits {
  protected override def _sqlContext: SQLContext = self.spark.sqlContext
}

in the following way:


import org.apache.spark.sql.{SQLImplicits, SparkSession}
import org.scalatest.{BeforeAndAfterAll, Suite}

trait SharedSparkSession extends BeforeAndAfterAll { self: Suite =>

  /**
   * The SparkSession instance to use for all tests in one suite.
   */
  private var spark: SparkSession = _

  /**
   * Returns local running SparkSession instance.
   * @return SparkSession instance `spark`
   */
  protected def sparkSession: SparkSession = spark

  /**
   * A helper implicit value that allows us to import SQL implicits.
   */
  protected lazy val sqlImplicits: SQLImplicits = self.sparkSession.implicits

  /**
   * Starts a new local spark session for tests.
   */
  protected def startSparkSession(): Unit = {
    if (spark == null) {
      spark = SparkSession
        .builder()
        .master("local[2]")
        .appName("Testing Spark Session")
        .getOrCreate()
    }
  }

  /**
   * Stops existing local spark session.
   */
  protected def stopSparkSession(): Unit = {
    if (spark != null) {
      spark.stop()
      spark = null
    }
  }

  /**
   * Runs before all tests and starts spark session.
   */
  override def beforeAll(): Unit = {
    startSparkSession()
    super.beforeAll()
  }

  /**
   * Runs after all tests and stops existing spark session.
   */
  override def afterAll(): Unit = {
    super.afterAll()
    stopSparkSession()
  }
}

and finally we can use SharedSparkSession for unit tests and import sqlImplicits:


import org.scalatest.FunSuite

class SomeSuite extends FunSuite with SharedSparkSession {
  // We can import sql implicits 
  import sqlImplicits._

  // We can use method sparkSession which returns locally running spark session
  test("some test") {
    val df = sparkSession.sparkContext.parallelize(List(1,2,3)).toDF()
    //...
  }
}

Answer by ValaravausBlack

Well, I've been re-using the existing SparkSession in each called method, by creating a local val inside the method:


val spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession.active

And then


import spark.implicits._
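
For illustration, a minimal sketch of that pattern (assuming Spark 2.4+, where SparkSession.active is available; the helper method name is hypothetical):

import org.apache.spark.sql.{Dataset, SparkSession}

object ActiveSessionSketch {
  // Hypothetical helper: re-acquire the already running session as a local val,
  // so the import path inside the method refers to a stable value.
  def upperCased(names: Seq[String]): Dataset[String] = {
    val spark: SparkSession = SparkSession.active
    import spark.implicits._
    names.toDS().map(_.toUpperCase)
  }
}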

Answer by KayV

Create a SparkSession object and import spark.implicits._ just before you want to convert any RDD to a Dataset.


Like this:


val spark = SparkSession
  .builder
  .appName("SparkSQL")
  .master("local[*]")
  .getOrCreate()

import spark.implicits._
val someDataset = someRdd.toDS

Answer by hobgoblin

It has something to do with using val vs. var in Scala.


E.g. the following does not work:


var sparkSession = new SparkSession.Builder().appName("my-app").config(sparkConf).getOrCreate
import sparkSession.implicits._

But the following does:


sparkSession = new SparkSession.Builder().appName("my-app").config(sparkConf).getOrCreate
val sparkSessionConst = sparkSession
import sparkSessionConst.implicits._

I am not very familiar with Scala, so I can only guess that the reasoning is the same as why, in Java, we can only use outer variables declared final inside a closure.

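
The underlying language rule, for what it's worth: an import prefix in Scala must be a stable identifier (a val, an object, or a package), and a var is not stable, which is why capturing the session in a val first makes the import compile. A minimal sketch of the rule, independent of Spark (assuming Scala 2.x semantics):

object StableImportSketch {
  class Holder { object implicits { implicit val answer: Int = 42 } }

  var unstable: Holder = new Holder
  // import unstable.implicits._   // does not compile: "stable identifier required"

  val stable: Holder = unstable    // capture the current value in a val
  import stable.implicits._        // compiles: "stable" is a stable identifier
}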