How to use a Scala class in Pyspark
Disclaimer: This page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, link back to the original post, and attribute it to the original authors (not me) on StackOverflow.
Original post: http://stackoverflow.com/questions/36023860/
How to use a Scala class inside Pyspark
Asked by Alberto Bonsanto
I've been searching for a while to see if there is any way to use a Scala class in Pyspark, and I haven't found any documentation or guide about this subject.
Let's say I create a simple class in Scala that uses some libraries of apache-spark, something like:
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.col

class SimpleClass(sqlContext: SQLContext, df: DataFrame, column: String) {
  // Select a single column from the supplied DataFrame
  def exe(): DataFrame = {
    import sqlContext.implicits._
    df.select(col(column))
  }
}
- Is there any possible way to use this class in Pyspark?
- Is it too tough?
- Do I have to create a .py file?
- Is there any guide that shows how to do that?
By the way, I also looked at the Spark code and I felt a bit lost, and I was incapable of replicating its functionality for my own purpose.
Answered by zero323
Yes, it is possible, although it can be far from trivial. Typically you want a Java (friendly) wrapper so you don't have to deal with Scala features which cannot be easily expressed using plain Java and as a result don't play well with the Py4J gateway.
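To get a rough sense of what the Py4J gateway can and cannot reach (a sketch only, assuming a running SparkContext named sc in the PySpark shell): plain Java classes and methods are directly callable, while Scala-only constructs such as implicits or default arguments have no direct representation, which is why a Java-friendly wrapper helps.

# Exploring the Py4J gateway from the driver (illustrative only)
jvm = sc._jvm                       # entry point to the JVM
names = jvm.java.util.ArrayList()   # a plain Java class works out of the box
names.add("Alberto")
print(names.size())                 # prints 1
# Scala implicits, default arguments, symbolic methods, etc. are not exposed
# this way, hence the recommendation to wrap them in a Java-friendly API.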
Assuming your class is in the package com.example and you have a Python DataFrame called df
df = ... # Python DataFrame
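For illustration only, a minimal hypothetical df with a column named v (matching the column name used in the steps below) could be built like this:

from pyspark.sql import Row
# Hypothetical sample data with a column "v" to feed into SimpleClass
df = sqlContext.createDataFrame([Row(v=1), Row(v=2), Row(v=3)])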
you'll have to:
1. Build a jar using your favorite build tool.

2. Include it in the driver classpath, for example using the --driver-class-path argument for the PySpark shell / spark-submit. Depending on the exact code you may have to pass it using --jars as well.

3. Extract the JVM instance from a Python SparkContext instance:
   jvm = sc._jvm

4. Extract the Scala SQLContext from a SQLContext instance:
   ssqlContext = sqlContext._ssql_ctx

5. Extract the Java DataFrame from the df:
   jdf = df._jdf

6. Create a new instance of SimpleClass:
   simpleObject = jvm.com.example.SimpleClass(ssqlContext, jdf, "v")

7. Call the exe method and wrap the result using a Python DataFrame:
   from pyspark.sql import DataFrame
   DataFrame(simpleObject.exe(), ssqlContext)
The result should be a valid PySpark DataFrame. You can of course combine all the steps into a single call.
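For example, assuming the jar is already on the driver classpath and reusing df and the column name "v" from the steps above, the combined call might look roughly like this:

from pyspark.sql import DataFrame

# All of the steps above collapsed into a single expression, executed on the driver
result = DataFrame(
    sc._jvm.com.example.SimpleClass(sqlContext._ssql_ctx, df._jdf, "v").exe(),
    sqlContext._ssql_ctx)
result.show()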
Important: This approach is possible only if Python code is executed solely on the driver. It cannot be used inside a Python action or transformation. See How to use Java/Scala function from an action or a transformation? for details.

