postgresql Apache Spark:JDBC 连接不起作用
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 
原文地址: http://stackoverflow.com/questions/29821518/
Warning: these are provided under cc-by-sa 4.0 license.  You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Apache Spark : JDBC connection not working
提问by Soni Shashank
I have asked this question previously also but did not got any answer (Not able to connect to postgres using jdbc in pyspark shell).
我之前也问过这个问题,但没有得到任何答案(无法在 pyspark shell 中使用 jdbc 连接到 postgres)。
I have successfully installed Spark 1.3.0 on my local windows and ran sample programs to test using pyspark shell.
我已在本地 Windows 上成功安装 Spark 1.3.0 并运行示例程序以使用 pyspark shell 进行测试。
Now, I want to run Correlations from Mllib on the data that is stored in Postgresql, but I am not able to connect to postgresql.
现在,我想对存储在 Postgresql 中的数据运行 Mllib 的相关性,但我无法连接到 postgresql。
I have successfully added the required jar (tested this jar) in the classpath by running
我已经通过运行在类路径中成功添加了所需的 jar(测试了这个 jar)
pyspark --jars "C:\path\to\jar\postgresql-9.2-1002.jdbc3.jar"
I can see that jar is successfully added in environment UI.
我可以看到在环境 UI 中成功添加了 jar。
When I run the following in pyspark shell-
当我在 pyspark shell 中运行以下命令时-
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.load(source="jdbc",url="jdbc:postgresql://[host]/[dbname]", dbtable="[schema.table]")  
I get this ERROR -
我收到这个错误 -
>>> df = sqlContext.load(source="jdbc",url="jdbc:postgresql://[host]/[dbname]", dbtable="[schema.table]")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\ACERNEW3\Desktop\Spark\spark-1.3.0-bin-hadoop2.4\python\pyspark\sql\context.py", line 482, in load
    df = self._ssql_ctx.load(source, joptions)
  File "C:\Users\ACERNEW3\Desktop\Spark\spark-1.3.0-bin-hadoop2.4\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py", line 538, in __call__
  File "C:\Users\ACERNEW3\Desktop\Spark\spark-1.3.0-bin-hadoop2.4\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o20.load.
: java.sql.SQLException: No suitable driver found for     jdbc:postgresql://[host]/[dbname]
        at java.sql.DriverManager.getConnection(DriverManager.java:602)
        at java.sql.DriverManager.getConnection(DriverManager.java:207)
        at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:94)
        at org.apache.spark.sql.jdbc.JDBCRelation.<init>    (JDBCRelation.scala:125)
        at  org.apache.spark.sql.jdbc.DefaultSource.createRelation(JDBCRelation.scala:114)
        at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:290)
        at org.apache.spark.sql.SQLContext.load(SQLContext.scala:679)
        at org.apache.spark.sql.SQLContext.load(SQLContext.scala:667)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:619)
回答by 8forty
I had this exact problem with mysql/mariadb, and got BIG clue from this question
我在 mysql/mariadb 上遇到了这个确切的问题,并从这个问题中得到了很大的线索
So your pyspark command should be:
所以你的 pyspark 命令应该是:
pyspark --conf spark.executor.extraClassPath=<jdbc.jar> --driver-class-path <jdbc.jar> --jars <jdbc.jar> --master <master-URL>
Also watch for errors when pyspark start like "Warning: Local jar ... does not exist, skipping." and "ERROR SparkContext: Jar not found at ...", these probably mean you spelled the path wrong.
还要注意 pyspark 启动时的错误,例如“警告:本地 jar ... 不存在,正在跳过。” 和“ERROR SparkContext: Jar not found at ...”,这些可能意味着你拼错了路径。
回答by jake256
A slightly more elegant solution:
一个稍微优雅的解决方案:
val props = new Properties
props.put("driver", "org.postgresql.Driver")
sqlContext.read.jdbc("jdbc:postgresql://[host]/[dbname]", props)
回答by aks
As jake256 suggested
正如 jake256 建议的那样
"driver", "org.postgresql.Driver"
“驱动程序”,“org.postgresql.Driver”
key-value pair was missing. In my case, I launched pysparkas :
缺少键值对。就我而言,我将pyspark启动为:
pyspark --jars /path/to/postgresql-9.4.1210.jar
with following instructions :
使用以下说明:
  from pyspark.sql import DataFrameReader
  url = 'postgresql://192.168.2.4:5432/postgres'
  properties = {'user': 'myUser', 'password': 'myPasswd', 'driver': 'org.postgresql.Driver'}
  df = DataFrameReader(sqlContext).jdbc(
      url='jdbc:%s' % url, table='weather', properties=properties
  )
  df.show()
  +-------------+-------+-------+-----------+----------+
  |         city|temp_lo|temp_hi|       prcp|      date|
  +-------------+-------+-------+-----------+----------+
  |San Francisco|     46|     50|       0.25|1994-11-27|
  |San Francisco|     43|     57|        0.0|1994-11-29|
  |      Hayward|     54|     37|0.239999995|1994-11-29|
  +-------------+-------+-------+-----------+----------+
Tested on :
测试:
- Ubuntu 16.04 
- PostgreSQL server version 9.5. 
- Postgresql driver used is postgresql-9.4.1210.jar 
- and Spark version is spark-2.0.0-bin-hadoop2.6 
- but I am also confident that it should also work on spark-2.0.0-bin-hadoop2.7. 
- Java JDK 1.8 64bits 
- Ubuntu 16.04 
- PostgreSQL 服务器版本 9.5。 
- 使用的 Postgresql 驱动程序是 postgresql-9.4.1210.jar 
- Spark 版本是 spark-2.0.0-bin-hadoop2.6 
- 但我也相信它也应该适用于 spark-2.0.0-bin-hadoop2.7。 
- Java JDK 1.8 64 位 
other JDBC Drivers can be found on : https://www.petefreitag.com/articles/jdbc_urls/
可以在以下位置找到其他 JDBC 驱动程序:https: //www.petefreitag.com/articles/jdbc_urls/
tutorial I followed is on : https://developer.ibm.com/clouddataservices/2015/08/19/speed-your-sql-queries-with-spark-sql/
我遵循的教程是:https: //developer.ibm.com/clouddataservices/2015/08/19/speed-your-sql-queries-with-spark-sql/
similar solution was suggested also on : pyspark mysql jdbc load An error occurred while calling o23.load No suitable driver
也提出了类似的解决方案: pyspark mysql jdbc load 调用 o23.load 时发生错误 没有合适的驱动程序
回答by zhaozhi
回答by Will McGinnis
This error seems to get thrown when you use the wrong version of JDBC driver. Check https://jdbc.postgresql.org/download.htmlto make sure that you have the right one.
当您使用错误版本的 JDBC 驱动程序时,似乎会抛出此错误。检查https://jdbc.postgresql.org/download.html以确保您拥有正确的。
Note in particular:
特别注意:
JDK 1.1 - JDBC 1. Note that with the 8.0 release JDBC 1 support has been removed, so look to update your JDK when you update your server.
JDK 1.2, 1.3 - JDBC 2. JDK 1.3 + J2EE - JDBC 2 EE. This contains additional support for javax.sql classes.
JDK 1.4, 1.5 - JDBC 3. This contains support for SSL and javax.sql, but does not require J2EE as it has been added to the J2SE release. JDK 1.6 - JDBC4. Support for JDBC4 methods is not complete, but the majority of methods are implemented.
JDK 1.7, 1.8 - JDBC41. Support for JDBC4 methods is not complete, but the majority of methods are implemented.
JDK 1.1 - JDBC 1。请注意,在 8.0 版本中,JDBC 1 支持已被删除,因此请在更新服务器时更新 JDK。
JDK 1.2、1.3 - JDBC 2。JDK 1.3 + J2EE - JDBC 2 EE。这包含对 javax.sql 类的额外支持。
JDK 1.4、1.5 - JDBC 3。这包含对 SSL 和 javax.sql 的支持,但不需要 J2EE,因为它已添加到 J2SE 版本中。JDK 1.6 - JDBC4。对 JDBC4 方法的支持并不完整,但大部分方法都已实现。
JDK 1.7、1.8 - JDBC41。对 JDBC4 方法的支持并不完整,但大部分方法都已实现。
回答by Naveen Kumar
That's pretty straightforward. To connect to external database to retrieve data into Spark dataframes additional jar file is required. E.g. with MySQL the JDBC driver is required. Download the driver package and extract mysql-connector-java-x.yy.zz-bin.jar in a path that's accessible from every node in the cluster. Preferably this is a path on shared file system. E.g. with Pouta Virtual Cluster such path would be under /shared_data, here I use /shared_data/thirdparty_jars/.
这很简单。要连接到外部数据库以将数据检索到 Spark 数据帧中,需要额外的 jar 文件。例如,对于 MySQL,需要 JDBC 驱动程序。下载驱动程序包并将 mysql-connector-java-x.yy.zz-bin.jar 提取到可从集群中的每个节点访问的路径中。最好是共享文件系统上的路径。例如,对于 Pouta 虚拟集群,此类路径将位于 /shared_data 下,这里我使用 /shared_data/thirdparty_jars/。
With direct Spark job submissions from terminal one can specify –driver-class-path argument pointing to extra jars that should be provided to workers with the job. However this does not work with this approach, so we must configure these paths for front end and worker nodes in the spark-defaults.conf file, usually in /opt/spark/conf directory.
通过从终端直接提交 Spark 作业,可以指定 –driver-class-path 参数指向应提供给作业工人的额外 jars。然而,这不适用于这种方法,因此我们必须在 spark-defaults.conf 文件中为前端和工作节点配置这些路径,通常在 /opt/spark/conf 目录中。
spark.driver.extraClassPath /"your-path"/mysql-connector-java-5.1.35-bin.jar spark.executor.extraClassPath /"your-path"/mysql-connector-java-5.1.35-bin.jar
spark.driver.extraClassPath /"your-path"/mysql-connector-java-5.1.35-bin.jar spark.executor.extraClassPath /"your-path"/mysql-connector-java-5.1.35-bin.jar

