Python: Add Jar to standalone pyspark

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/35762459/

Add Jar to standalone pyspark

Tags: python, apache-spark, pyspark

Asked by Nora Olsen

I'm launching a pyspark program:

$ export SPARK_HOME=
$ export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.9-src.zip
$ python

And the py code:

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("Example").setMaster("local[2]")
sc = SparkContext(conf=conf)

How do I add jar dependencies such as the Databricks csv jar? Using the command line, I can add the package like this:

$ pyspark/spark-submit --packages com.databricks:spark-csv_2.10:1.3.0 

But I'm not using any of these. The program is part of a larger workflow that does not use spark-submit. I should be able to run my ./foo.py program and it should just work.

  • I know you can set the extraClassPath Spark properties, but then you have to copy the JAR files to each node?
  • I tried conf.set("spark.jars", "jar1,jar2"); that didn't work either, failing with a py4j ClassNotFoundException (CNF).

Answered by Briford Wylie

There are many approaches here (setting ENV vars, adding to $SPARK_HOME/conf/spark-defaults.conf, etc.), and some of the answers already cover these. I wanted to add an additional answer for those specifically using Jupyter Notebooks and creating the Spark session from within the notebook. Here's the solution that worked best for me (in my case I wanted the Kafka package loaded):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('my_awesome')\
    .config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0')\
    .getOrCreate()

Using this line of code I didn't need to do anything else (no ENVs or conf file changes).

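As a rough illustration (not part of the original answer), reading a Kafka topic once the package is loaded might look like the following; the bootstrap server and topic name are placeholders:

# Assumes `spark` is the session created by the builder above; the Kafka
# bootstrap server and topic name below are placeholder values.
stream_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "my_topic") \
    .load()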

2019-10-30 Update: The above line of code is still working great, but I wanted to note a couple of things for new people seeing this answer:

  • You'll need to change the version at the end to match your Spark version, so for Spark 2.4.4 you'll need: org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.4
  • The newest version of this jar, spark-sql-kafka-0-10_2.12, is crashing for me (Mac laptop), so if you get a crash when invoking 'readStream', revert to 2.11.

Answered by zero323

Any dependencies can be passed using the spark.jars.packages property (setting spark.jars should work as well) in $SPARK_HOME/conf/spark-defaults.conf. It should be a comma-separated list of coordinates.

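For example, a minimal spark-defaults.conf entry for the Databricks CSV package might look like this (the exact coordinates below are only an illustration):

spark.jars.packages com.databricks:spark-csv_2.11:1.2.0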

Packages and classpath properties have to be set before the JVM is started, and this happens during SparkConf initialization. That means the SparkConf.set method cannot be used here.

An alternative approach is to set the PYSPARK_SUBMIT_ARGS environment variable before the SparkConf object is initialized:

import os
from pyspark import SparkConf, SparkContext

SUBMIT_ARGS = "--packages com.databricks:spark-csv_2.11:1.2.0 pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS

conf = SparkConf()
sc = SparkContext(conf=conf)
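
As a rough follow-on sketch (not from the original answer), once the spark-csv package is loaded this way, reading a CSV file should work along these lines; the file name is a placeholder:

from pyspark.sql import SQLContext

# Assumes `sc` was created as above with the spark-csv package available.
# 'foobar.csv' is a placeholder file name.
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true') \
    .load('foobar.csv')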

Answered by ximiki

I encountered a similar issue with a different jar ("MongoDB Connector for Spark", mongo-spark-connector), but the big caveat was that I installed Spark via pyspark in conda (conda install pyspark). Therefore, much of the Spark-specific help in the other answers wasn't exactly applicable. For those of you installing with conda, here is the process that I cobbled together:

1) Find where your pyspark/jars directory is located. Mine was at this path: ~/anaconda2/pkgs/pyspark-2.3.0-py27_0/lib/python2.7/site-packages/pyspark/jars.

2) Download the jar file into the path found in step 1, from this location.

3) Now you should be able to run something like this (code taken from the MongoDB official tutorial, using Briford Wylie's answer above):

from pyspark.sql import SparkSession

my_spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1:27017/spark.test_pyspark_mbd_conn") \
    .config("spark.mongodb.output.uri", "mongodb://127.0.0.1:27017/spark.test_pyspark_mbd_conn") \
    .config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.11:2.2.2') \
    .getOrCreate()
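
As an illustrative follow-on (not part of the original answer), reading the collection configured in spark.mongodb.input.uri could then look roughly like this:

# Assumes `my_spark` is the session built above; the connector reads from the
# URI configured via spark.mongodb.input.uri.
df = my_spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.printSchema()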

Disclaimers:

1) I don't know if this answer is the right place/SO question to put this; please advise of a better place and I will move it.

2) If you think I have erred or have improvements to the process above, please comment and I will revise.

Answered by Indrajit

Finally found the answer after multiple tries. The answer is specific to using the spark-csv jar. Create a folder on your hard drive, say D:\Spark\spark_jars, and place the following jars there:

  1. spark-csv_2.10-1.4.0.jar (this is the version I am using)
  2. commons-csv-1.1.jar
  3. univocity-parsers-1.5.1.jar

Jars 2 and 3 are dependencies required by spark-csv, hence those two files need to be downloaded too. Go to the conf directory where you have downloaded Spark, and in the spark-defaults.conf file add the line:

spark.driver.extraClassPath D:/Spark/spark_jars/*

The asterisk should pick up all the jars. Now run Python and create the SparkContext and SQLContext as you normally would. You should then be able to use spark-csv like this:

sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true') \
    .load('foobar.csv')
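
For reference, the sqlContext used above comes from the usual Spark 1.x setup; a minimal sketch of that step (the master string and app name are illustrative) is:

from pyspark import SparkContext
from pyspark.sql import SQLContext

# Illustrative setup only; 'local[*]' and the app name are placeholders.
sc = SparkContext('local[*]', 'spark-csv-example')
sqlContext = SQLContext(sc)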

Answered by Thierry Barnier

import os
import sys

# Put the pyspark Python package and the bundled py4j sources on the Python path.
spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "/python")
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.10.4-src.zip'))

Here it comes....

sys.path.insert(0, <PATH TO YOUR JAR>)

Then...

import pyspark
import numpy as np

from pyspark import SparkContext

sc = SparkContext("local[1]")
.
.
.