How to use both Scala and Python in the same Spark project?

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/32975636/

Tags: python, scala, apache-spark, pyspark, spark-streaming

Asked by Wilson Liao

Is it possible to pipe a Spark RDD to Python?

I need a Python library to do some calculations on my data, but my main Spark project is based on Scala. Is there a way to mix them both, or to let Python access the same Spark context?

Answered by Stephen De Gennaro

You can indeed pipe out to a regular Python script from Scala and Spark.

test.py

#!/usr/bin/python

import sys

# Prefix each line read from stdin with "hello " and echo it back.
for line in sys.stdin:
    print("hello " + line.rstrip("\n"))

spark-shell (scala)

val data = List("john", "paul", "george", "ringo")
val dataRDD = sc.makeRDD(data)

// Path to the Python script; it must exist and be executable on the machine running the tasks.
val scriptPath = "./test.py"

// Each element of dataRDD is written to test.py's stdin; each line of its stdout becomes an element.
val pipeRDD = dataRDD.pipe(scriptPath)

pipeRDD.foreach(println)

Output

hello john
hello ringo
hello george
hello paul

Answered by Ajay Gupta

You can run Python code via pipe() in Spark.

With pipe(), you can write a transformation of an RDD that reads each RDD element from standard input as a String, manipulates that String according to the script's instructions, and then writes the result as a String to standard output.
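As a minimal sketch of that contract (assuming a Unix-like OS where the rev command is available, and sc being the SparkContext that spark-shell provides), the external process does not even have to be a Python script:

// Each element of the RDD is written to rev's standard input;
// each line rev prints to standard output becomes an element of the result.
val words = sc.parallelize(Seq("spark", "scala", "python"))
val reversed = words.pipe(Seq("rev"))
reversed.collect()   // e.g. Array("kraps", "alacs", "nohtyp")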

With SparkContext.addFile(path), we can add a list of files for each worker node to download when a Spark job starts. Every worker node then has its own copy of the script, so the pipe operation runs in parallel across partitions. Any libraries and dependencies the script needs must be installed beforehand on all the worker and executor nodes.

Example:

Python file: code to convert the input data to uppercase

#!/usr/bin/python
import sys

# Read records from stdin (one per line) and write them back uppercased.
for line in sys.stdin:
    print(line.rstrip("\n").upper())

Spark code: piping the data through the script

import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

val conf = new SparkConf().setAppName("Pipe")
val sc = new SparkContext(conf)

// Ship the script from the driver to every executor when the job starts.
val distScript = "/path/on/driver/PipeScript.py"
val distScriptName = "PipeScript.py"
sc.addFile(distScript)

// Pipe each element through the executor-local copy of the script.
val ipData = sc.parallelize(List("asd", "xyz", "zxcz", "sdfsfd", "Ssdfd", "Sdfsf"))
val opData = ipData.pipe(SparkFiles.get(distScriptName))
opData.foreach(println)
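For these six input strings, the script should emit their uppercased forms (ASD, XYZ, ZXCZ, SDFSFD, SSDFD, SDFSF). Note that foreach(println) runs on the executors, so on a real cluster those lines appear in the executor logs rather than in the driver's console, and their order is not guaranteed.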

Answered by Leb

If I understand you correctly, as long as you take the data from Scala and convert it to an RDD or expose it through a SparkContext, you'll be able to use pyspark to manipulate the data with the Spark Python API.
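One way to read that suggestion in practice (a hedged sketch; the helper name and path below are hypothetical, not from the answer) is to have the Scala job write its RDD to shared storage and let a separate PySpark job pick it up from there:

import org.apache.spark.rdd.RDD

// Write the Scala-side records as plain text, one record per line,
// so that pyspark's sc.textFile() can read the same data back.
def handOffToPython(records: RDD[String], sharedPath: String): Unit =
  records.saveAsTextFile(sharedPath)

// Scala side (hypothetical path):  handOffToPython(dataRDD, "hdfs:///tmp/handoff")
// Python side (pyspark):           rdd = sc.textFile("hdfs:///tmp/handoff")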

There's also a programming guide that you can follow to make use of the different languages within Spark.