Python: How to add third-party Java jars for use in pyspark

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me). Original source: http://stackoverflow.com/questions/27698111/

How to add third party java jars for use in pyspark

python, apache-spark, py4j

Asked by javadba

I have some third party Database client libraries in Java. I want to access them through

java_gateway.py

E.g., to make the client class (not a JDBC driver!) available to the Python client via the Java gateway:

java_import(gateway.jvm, "org.mydatabase.MyDBClient")

It is not clear where to add the third-party libraries to the JVM classpath. I tried adding them to compute-classpath.sh, but that did not seem to work: I get

 Py4jError: Trying to call a package

Also, comparing with Hive: the Hive jar files are NOT loaded via compute-classpath.sh, which makes me suspicious. There seems to be some other mechanism for setting up the JVM-side classpath.

Accepted answer by Marl

You can add external jars as arguments to pyspark:

pyspark --jars file1.jar,file2.jar
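
Once the shell has been started with --jars, the classes inside those jars are on the driver JVM's classpath and can be reached through the Py4J gateway. A minimal sketch, using the jar name and the org.mydatabase.MyDBClient class from the question as placeholders:

# Started as:  pyspark --jars /path/to/mydb-client.jar
from py4j.java_gateway import java_import

# `sc` (the SparkContext) already exists inside the pyspark shell.
java_import(sc._jvm, "org.mydatabase.MyDBClient")

# The class can now be instantiated through the gateway
# (assumes a no-arg constructor; adapt to your client's actual API).
client = sc._jvm.MyDBClient()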

Answered by Ryan Chou

You could add --jars xxx.jar when using spark-submit:

./bin/spark-submit --jars xxx.jar your_spark_script.py

or set the environment variable SPARK_CLASSPATH:

SPARK_CLASSPATH='/path/xxx.jar:/path/xx2.jar' your_spark_script.py

your_spark_script.py is written using the PySpark API.
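
For reference, a minimal sketch of what such a your_spark_script.py could look like when the jar is supplied via --jars (the class name below is a placeholder, reusing the one from the question):

# Submitted as:  ./bin/spark-submit --jars xxx.jar your_spark_script.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("third-party-jar-demo").getOrCreate()

# Classes packaged in xxx.jar are visible to the driver JVM here
# (placeholder class name; substitute your own).
client = spark.sparkContext._jvm.org.mydatabase.MyDBClient()

spark.stop()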

Answered by Umang singhal

  1. Extract the downloaded jar file.
  2. Edit the system environment variables:
    • Add a variable named SPARK_CLASSPATH and set its value to \path\to\the\extracted\jar\file.

E.g.: if you have extracted the jar file on the C drive into a folder named sparkts, its value should be: C:\sparkts

  3. Restart your cluster.
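
On other platforms, a rough equivalent is to set the same variable from Python before the SparkContext is created, for example in a script. This is only a sketch following this answer's SPARK_CLASSPATH approach; note that SPARK_CLASSPATH is deprecated in newer Spark releases in favour of spark.driver.extraClassPath:

import os
from pyspark import SparkContext

# Must be set before the JVM is launched, i.e. before the SparkContext exists.
os.environ["SPARK_CLASSPATH"] = "/path/to/the/extracted/jar/file"

sc = SparkContext(appName="classpath-demo")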

Answered by AAB

You could add the path to the jar file using the Spark configuration at runtime.

Here is an example:

from pyspark import SparkConf, SparkContext

conf = SparkConf().set("spark.jars", "/path-to-jar/spark-streaming-kafka-0-8-assembly_2.11-2.2.1.jar")
sc = SparkContext(conf=conf)
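
The same setting can also be passed through the SparkSession builder; a short sketch using the same (placeholder) jar path:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("jar-at-runtime-demo")
    .config("spark.jars", "/path-to-jar/spark-streaming-kafka-0-8-assembly_2.11-2.2.1.jar")
    .getOrCreate())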

Refer to the documentation for more information.

Answered by Nab

One more thing you can do is add the jar to the pyspark jars folder where pyspark is installed, usually /python3.6/site-packages/pyspark/jars.

Be careful if you are using a virtual environment: the jar needs to go into the pyspark installation inside the virtual environment.

This way you can use the jar without passing it on the command line or loading it in your code.

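To find the right jars folder, including inside a virtual environment, a small sketch like this can help:

import os
import pyspark

# Prints the jars directory of the pyspark installation actually in use,
# e.g. .../site-packages/pyspark/jars -- copy your jar into this folder.
print(os.path.join(os.path.dirname(pyspark.__file__), "jars"))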

Answered by Gayatri

None of the above answers worked for me.

What I had to do with pyspark was:

pyspark --py-files /path/to/jar/xxxx.jar

For Jupyter Notebook:

spark = (SparkSession
    .builder
    .appName("Spark_Test")
    .master('yarn-client')
    .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
    .config("spark.executor.cores", "4")
    .config("spark.executor.instances", "2")
    .config("spark.sql.shuffle.partitions","8")
    .enableHiveSupport()
    .getOrCreate())

# Do this 

spark.sparkContext.addPyFile("/path/to/jar/xxxx.jar")

Link to the source where I found it: https://github.com/graphframes/graphframes/issues/104

Answered by Sharvan Kumar

I've worked around this by dropping the jars into a drivers directory and then creating a spark-defaults.conf file in the conf folder. Steps to follow:

To get the conf path:

cd ${SPARK_HOME}/conf
vi spark-defaults.conf

Then add this line to spark-defaults.conf:

spark.driver.extraClassPath /Users/xxx/Documents/spark_project/drivers/*

Run your Jupyter notebook.

Answered by D Untouchable

Apart from the accepted answer, you also have the options below:

  1. If you are in a virtual environment, you can place it in

    e.g. lib/python3.7/site-packages/pyspark/jars

  2. If you want Java to discover it, you can place it where your JRE is installed, under the ext/ directory.