How to use a Scala class in Pyspark
Disclaimer: This page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, link back to the original post, and attribute it to the original authors (not me) on StackOverflow.
Original post: http://stackoverflow.com/questions/36023860/
How to use a Scala class inside Pyspark
Asked by Alberto Bonsanto
I've been searching for a while to see if there is any way to use a Scala class in Pyspark, and I haven't found any documentation or guide about this subject.
Let's say I create a simple class in Scala that uses some libraries of apache-spark, something like:
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.col

class SimpleClass(sqlContext: SQLContext, df: DataFrame, column: String) {
  // Select a single column from the supplied DataFrame
  def exe(): DataFrame = {
    import sqlContext.implicits._
    df.select(col(column))
  }
}
- Is there any possible way to use this class in Pyspark?
- Is it too tough?
- Do I have to create a .py file?
- Is there any guide that shows how to do that?
By the way, I also looked at the Spark code and I felt a bit lost, and I was incapable of replicating its functionality for my own purpose.
Answered by zero323
Yes, it is possible, although it can be far from trivial. Typically you want a Java (friendly) wrapper so you don't have to deal with Scala features which cannot be easily expressed using plain Java and as a result don't play well with the Py4J gateway.
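To get a rough sense of what the Py4J gateway can and cannot reach (a sketch only, assuming a running SparkContext named sc in the PySpark shell): plain Java classes and methods are directly callable, while Scala-only constructs such as implicits or default arguments have no direct representation, which is why a Java-friendly wrapper helps.

# Exploring the Py4J gateway from the driver (illustrative only)
jvm = sc._jvm                       # entry point to the JVM
names = jvm.java.util.ArrayList()   # a plain Java class works out of the box
names.add("Alberto")
print(names.size())                 # prints 1
# Scala implicits, default arguments, symbolic methods, etc. are not exposed
# this way, hence the recommendation to wrap them in a Java-friendly API.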
Assuming your class is in the package com.example and you have a Python DataFrame called df
df = ... # Python DataFrame
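For illustration only, a minimal hypothetical df with a column named v (matching the column name used in the steps below) could be built like this:

from pyspark.sql import Row
# Hypothetical sample data with a column "v" to feed into SimpleClass
df = sqlContext.createDataFrame([Row(v=1), Row(v=2), Row(v=3)])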
you'll have to:
1. Build a jar using your favorite build tool.

2. Include it in the driver classpath, for example using the --driver-class-path argument for the PySpark shell / spark-submit. Depending on the exact code you may have to pass it using --jars as well.

3. Extract the JVM instance from a Python SparkContext instance:
   jvm = sc._jvm

4. Extract the Scala SQLContext from a SQLContext instance:
   ssqlContext = sqlContext._ssql_ctx

5. Extract the Java DataFrame from the df:
   jdf = df._jdf

6. Create a new instance of SimpleClass:
   simpleObject = jvm.com.example.SimpleClass(ssqlContext, jdf, "v")

7. Call the exe method and wrap the result using a Python DataFrame:
   from pyspark.sql import DataFrame
   DataFrame(simpleObject.exe(), ssqlContext)
The result should be a valid PySpark DataFrame. You can of course combine all the steps into a single call.
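For example, assuming the jar is already on the driver classpath and reusing df and the column name "v" from the steps above, the combined call might look roughly like this:

from pyspark.sql import DataFrame

# All of the steps above collapsed into a single expression, executed on the driver
result = DataFrame(
    sc._jvm.com.example.SimpleClass(sqlContext._ssql_ctx, df._jdf, "v").exe(),
    sqlContext._ssql_ctx)
result.show()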
Important: This approach is possible only if Python code is executed solely on the driver. It cannot be used inside a Python action or transformation. See How to use Java/Scala function from an action or a transformation? for details.

