Scala - java.sql.SQLException: No suitable driver found when loading DataFrame into Spark SQL

Note: This content comes from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/29931759/

Tags: scala, jdbc, apache-spark, apache-spark-sql

Asked by Wildfire

I'm hitting a very strange problem when trying to load a JDBC DataFrame into Spark SQL.

I've tried several Spark clusters - YARN, a standalone cluster, and pseudo-distributed mode on my laptop. It's reproducible on both Spark 1.3.0 and 1.3.1. The problem occurs both in spark-shell and when executing the code with spark-submit. I've tried the MySQL and MS SQL JDBC drivers without success.

Consider the following sample:

val driver = "com.mysql.jdbc.Driver"
val url = "jdbc:mysql://localhost:3306/test"

val t1 = {
  sqlContext.load("jdbc", Map(
    "url" -> url,
    "driver" -> driver,
    "dbtable" -> "t1",
    "partitionColumn" -> "id",
    "lowerBound" -> "0",
    "upperBound" -> "100",
    "numPartitions" -> "50"
  ))
}

So far so good, the schema gets resolved properly:

t1: org.apache.spark.sql.DataFrame = [id: int, name: string]

But when I evaluate the DataFrame:

t1.take(1)

The following exception occurs:

15/04/29 01:56:44 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 192.168.1.42): java.sql.SQLException: No suitable driver found for jdbc:mysql://<hostname>:3306/test
    at java.sql.DriverManager.getConnection(DriverManager.java:689)
    at java.sql.DriverManager.getConnection(DriverManager.java:270)
    at org.apache.spark.sql.jdbc.JDBCRDD$$anonfun$getConnector.apply(JDBCRDD.scala:158)
    at org.apache.spark.sql.jdbc.JDBCRDD$$anonfun$getConnector.apply(JDBCRDD.scala:150)
    at org.apache.spark.sql.jdbc.JDBCRDD$$anon.<init>(JDBCRDD.scala:317)
    at org.apache.spark.sql.jdbc.JDBCRDD.compute(JDBCRDD.scala:309)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

But when I open a JDBC connection on an executor directly:

import java.sql.DriverManager

sc.parallelize(0 until 2, 2).map { i =>
  Class.forName(driver)
  val conn = DriverManager.getConnection(url)
  conn.close()
  i
}.collect()

it works perfectly:

res1: Array[Int] = Array(0, 1)

When I run the same code on local Spark, it works perfectly too:

scala> t1.take(1)
...
res0: Array[org.apache.spark.sql.Row] = Array([1,one])

I'm using Spark pre-built with Hadoop 2.4 support.

The easiest way to reproduce the problem is to start Spark in pseudo-distributed mode with the start-all.sh script and run the following command:

/path/to/spark-shell --master spark://<hostname>:7077 --jars /path/to/mysql-connector-java-5.1.35.jar --driver-class-path /path/to/mysql-connector-java-5.1.35.jar

Is there a way to work around this? It looks like a severe problem, so it's strange that googling doesn't help here.

Accepted answer by Wildfire

Apparently this issue has been recently reported:

https://issues.apache.org/jira/browse/SPARK-6913

The problem is in java.sql.DriverManager, which doesn't see drivers loaded by ClassLoaders other than the bootstrap ClassLoader.

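To illustrate the mechanism, here is a small standalone sketch (not Spark code; the jar path is hypothetical) showing how DriverManager refuses to hand out a driver that is only visible through a child class loader, which appears to be what happens to a connector jar shipped via --jars:

import java.net.{URL, URLClassLoader}
import java.sql.DriverManager

// Load the MySQL driver through a child class loader, roughly as Spark's --jars mechanism would.
val jarUrl = new URL("file:/path/to/mysql-connector-java-5.1.35.jar")
val childLoader = new URLClassLoader(Array(jarUrl), getClass.getClassLoader)

// Initializing the class registers the driver with DriverManager as a side effect...
Class.forName("com.mysql.jdbc.Driver", true, childLoader)

// ...but getConnection only returns drivers visible from the caller's class loader,
// so this still fails with "java.sql.SQLException: No suitable driver found".
DriverManager.getConnection("jdbc:mysql://localhost:3306/test")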

As a temporary workaround, it's possible to add the required drivers to the boot classpath of the executors.

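For example, a minimal sketch of that workaround (the jar path is hypothetical, and the jar must already exist at that path on every worker node):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // -Xbootclasspath/a appends the driver jar to the boot classpath of every executor JVM,
  // where java.sql.DriverManager can see it.
  .set("spark.executor.extraJavaOptions",
       "-Xbootclasspath/a:/path/to/mysql-connector-java-5.1.35.jar")

The same property can also go into spark-defaults.conf instead of being set programmatically.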

UPDATE: This pull request fixes the problem: https://github.com/apache/spark/pull/5782

UPDATE 2: The fix was merged into Spark 1.4.

Answer by Harish Pathak

For writing data to MySQL:

In Spark 1.4.0, you have to load from MySQL before writing to it, because the driver gets loaded by the load function but not by the write function. You have to put the jar on every worker node and set its path in the spark-defaults.conf file on each node. This issue has been fixed in Spark 1.5.0:

https://issues.apache.org/jira/browse/SPARK-10036

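For reference, a minimal write sketch on Spark 1.4+ (the table name, credentials and the df DataFrame are hypothetical; the connector jar is assumed to already be on each node's classpath as described above):

import java.util.Properties

val props = new Properties()
props.setProperty("user", "test")       // hypothetical credentials
props.setProperty("password", "secret")
props.setProperty("driver", "com.mysql.jdbc.Driver")

// Writes the rows of df into the given MySQL table.
df.write.jdbc("jdbc:mysql://localhost:3306/test", "t1_copy", props)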

Answer by user3466407

I am using Spark 1.6.1 with SQL Server and still faced the same issue. I had to add the library (sqljdbc-4.0.jar) to the lib directory on the instance and add the line below to the conf/spark-defaults.conf file:

spark.driver.extraClassPath lib/sqljdbc-4.0.jar

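With that in place, reading from SQL Server goes through the standard JDBC data source; a minimal sketch with a hypothetical host, database, table and credentials:

val df = sqlContext.read.format("jdbc").options(Map(
  "url"      -> "jdbc:sqlserver://localhost:1433;databaseName=test",
  "driver"   -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
  "dbtable"  -> "t1",
  "user"     -> "sa",        // hypothetical credentials
  "password" -> "secret"
)).load()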

Answer by Kevin Pauli

We are stuck on Spark 1.3 (Cloudera 5.4) and so I found this question and Wildfire's answer helpful since it allowed me to stop banging my head against the wall.

Thought I would share how we got the driver into the boot classpath: we simply copied it into /opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/lib/hive/lib on all the nodes.
