Python: How to add third-party Java jars for use in pyspark

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me). Original source: http://stackoverflow.com/questions/27698111/

How to add third party java jars for use in pyspark

python, apache-spark, py4j

Asked by javadba

I have some third party Database client libraries in Java. I want to access them through

java_gateway.py

E.g., to make the client class (not a JDBC driver!) available to the Python client via the Java gateway:

java_import(gateway.jvm, "org.mydatabase.MyDBClient")

It is not clear where to add the third-party libraries to the JVM classpath. I tried adding them to compute-classpath.sh, but that did not seem to work: I get

 Py4jError: Trying to call a package

Also, comparing with Hive: the Hive jar files are NOT loaded via compute-classpath.sh, which makes me suspicious. There seems to be some other mechanism for setting up the JVM-side classpath.

Accepted answer by Marl

You can add external jars as arguments to pyspark:

pyspark --jars file1.jar,file2.jar
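
Once the shell has been started with --jars, the classes inside those jars are on the driver JVM's classpath and can be reached through the Py4J gateway. A minimal sketch, using the jar name and the org.mydatabase.MyDBClient class from the question as placeholders:

# Started as:  pyspark --jars /path/to/mydb-client.jar
from py4j.java_gateway import java_import

# `sc` (the SparkContext) already exists inside the pyspark shell.
java_import(sc._jvm, "org.mydatabase.MyDBClient")

# The class can now be instantiated through the gateway
# (assumes a no-arg constructor; adapt to your client's actual API).
client = sc._jvm.MyDBClient()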

Answered by Ryan Chou

You could add --jars xxx.jar when using spark-submit:

./bin/spark-submit --jars xxx.jar your_spark_script.py

or set the environment variable SPARK_CLASSPATH:

SPARK_CLASSPATH='/path/xxx.jar:/path/xx2.jar' your_spark_script.py

your_spark_script.py is written using the PySpark API.
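
For reference, a minimal sketch of what such a your_spark_script.py could look like when the jar is supplied via --jars (the class name below is a placeholder, reusing the one from the question):

# Submitted as:  ./bin/spark-submit --jars xxx.jar your_spark_script.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("third-party-jar-demo").getOrCreate()

# Classes packaged in xxx.jar are visible to the driver JVM here
# (placeholder class name; substitute your own).
client = spark.sparkContext._jvm.org.mydatabase.MyDBClient()

spark.stop()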

Answered by Umang singhal

  1. Extract the downloaded jar file.
  2. Edit the system environment variables:
    • Add a variable named SPARK_CLASSPATH and set its value to \path\to\the\extracted\jar\file.

E.g.: if you have extracted the jar file on the C drive into a folder named sparkts, its value should be: C:\sparkts

  3. Restart your cluster.
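
On other platforms, a rough equivalent is to set the same variable from Python before the SparkContext is created, for example in a script. This is only a sketch following this answer's SPARK_CLASSPATH approach; note that SPARK_CLASSPATH is deprecated in newer Spark releases in favour of spark.driver.extraClassPath:

import os
from pyspark import SparkContext

# Must be set before the JVM is launched, i.e. before the SparkContext exists.
os.environ["SPARK_CLASSPATH"] = "/path/to/the/extracted/jar/file"

sc = SparkContext(appName="classpath-demo")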

Answered by AAB

You could add the path to the jar file using the Spark configuration at runtime.

Here is an example:

from pyspark import SparkConf, SparkContext

conf = SparkConf().set("spark.jars", "/path-to-jar/spark-streaming-kafka-0-8-assembly_2.11-2.2.1.jar")
sc = SparkContext(conf=conf)
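
The same setting can also be passed through the SparkSession builder; a short sketch using the same (placeholder) jar path:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("jar-at-runtime-demo")
    .config("spark.jars", "/path-to-jar/spark-streaming-kafka-0-8-assembly_2.11-2.2.1.jar")
    .getOrCreate())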

Refer to the documentation for more information.

Answered by Nab

One more thing you can do is add the jar to the pyspark jars folder where pyspark is installed, usually /python3.6/site-packages/pyspark/jars.

Be careful if you are using a virtual environment: the jar needs to go into the pyspark installation inside the virtual environment.

This way you can use the jar without passing it on the command line or loading it in your code.

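To find the right jars folder, including inside a virtual environment, a small sketch like this can help:

import os
import pyspark

# Prints the jars directory of the pyspark installation actually in use,
# e.g. .../site-packages/pyspark/jars -- copy your jar into this folder.
print(os.path.join(os.path.dirname(pyspark.__file__), "jars"))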

Answered by Gayatri

None of the above answers worked for me.

What I had to do with pyspark was:

pyspark --py-files /path/to/jar/xxxx.jar

For Jupyter Notebook:

spark = (SparkSession
    .builder
    .appName("Spark_Test")
    .master('yarn-client')
    .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
    .config("spark.executor.cores", "4")
    .config("spark.executor.instances", "2")
    .config("spark.sql.shuffle.partitions","8")
    .enableHiveSupport()
    .getOrCreate())

# Do this 

spark.sparkContext.addPyFile("/path/to/jar/xxxx.jar")

Link to the source where I found it: https://github.com/graphframes/graphframes/issues/104

Answered by Sharvan Kumar

I've worked around this by dropping the jars into a drivers directory and then creating a spark-defaults.conf file in the conf folder. Steps to follow:

To get the conf path:

cd ${SPARK_HOME}/conf
vi spark-defaults.conf

Then add this line to spark-defaults.conf:

spark.driver.extraClassPath /Users/xxx/Documents/spark_project/drivers/*

Run your Jupyter notebook.

Answered by D Untouchable

Apart from the accepted answer, you also have the options below:

  1. If you are in a virtual environment, you can place it in

    e.g. lib/python3.7/site-packages/pyspark/jars

  2. If you want Java to discover it, you can place it where your JRE is installed, under the ext/ directory.