Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original source on StackOverflow: http://stackoverflow.com/questions/34948296/


Using pyspark to connect to PostgreSQL

Tags: postgresql, apache-spark, pyspark

Asked by Mpizos Dimitris

I am trying to connect to a database with pyspark and I am using the following code:


sqlctx = SQLContext(sc)
df = sqlctx.load(
    url = "jdbc:postgresql://[hostname]/[database]",
    dbtable = "(SELECT * FROM talent LIMIT 1000) as blah",
    password = "MichaelJordan",
    user = "ScottyPippen",
    source = "jdbc",
    driver = "org.postgresql.Driver"
)

and I am getting the following error:


[screenshot of the error message]

Any idea why this is happening?


Edit: I am trying to run the code locally on my computer.

Answered by mdh

The following worked for me with postgres on localhost:


Download the PostgreSQL JDBC Driver from https://jdbc.postgresql.org/download.html.


For the pyspark shell you use the SPARK_CLASSPATH environment variable:

$ export SPARK_CLASSPATH=/path/to/downloaded/jar
$ pyspark

For submitting a script via spark-submit, use the --driver-class-path flag:

$ spark-submit --driver-class-path /path/to/downloaded/jar script.py

In the Python script, load the table as a DataFrame as follows:

from pyspark.sql import DataFrameReader

url = 'postgresql://localhost:5432/dbname'
properties = {'user': 'username', 'password': 'password'}
df = DataFrameReader(sqlContext).jdbc(
    url='jdbc:%s' % url, table='tablename', properties=properties
)

or alternatively:


df = sqlContext.read.format('jdbc').\
    options(url='jdbc:%s' % url, dbtable='tablename').\
    load()

Note that when submitting the script via spark-submit, you need to define the sqlContext.

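For example, a minimal sketch of creating it yourself at the top of the script (the app name here is just illustrative):

from pyspark import SparkContext
from pyspark.sql import SQLContext

# unlike the pyspark shell, spark-submit does not create sc/sqlContext for you
sc = SparkContext(appName="postgres-example")
sqlContext = SQLContext(sc)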

Answered by Rafael

Download the PostgreSQL JDBC Driver from https://jdbc.postgresql.org/download.html


Then replace the database configuration values with your own.

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.jars", "/path_to_postgresDriver/postgresql-42.2.5.jar") \
    .getOrCreate()

df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/databasename") \
    .option("dbtable", "tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .option("driver", "org.postgresql.Driver") \
    .load()

df.printSchema()

More info: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html

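For completeness, writing a DataFrame back over JDBC works symmetrically; here is a minimal sketch (the output table name and save mode are illustrative):

df.write \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/databasename") \
    .option("dbtable", "output_tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .option("driver", "org.postgresql.Driver") \
    .mode("append") \
    .save()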

Answered by Chevelle

It is necessary to copy postgresql-42.1.4.jar to all nodes; in my case, I copied it to /opt/spark-2.2.0-bin-hadoop2.7/jars.

Also, I set the classpath in ~/.bashrc (export SPARK_CLASSPATH="/opt/spark-2.2.0-bin-hadoop2.7/jars"), and it works fine in the pyspark console and Jupyter.
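As a sketch, those steps look like this (same paths as above; adjust for your Spark installation):

$ cp postgresql-42.1.4.jar /opt/spark-2.2.0-bin-hadoop2.7/jars/
$ echo 'export SPARK_CLASSPATH="/opt/spark-2.2.0-bin-hadoop2.7/jars"' >> ~/.bashrc
$ source ~/.bashrc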

Answered by Derek Hill

One approach, building on the example from the quick start guide, is this blog post, which shows how to add the --packages org.postgresql:postgresql:9.4.1211 argument to the spark-submit command.

This downloads the driver into the ~/.ivy2/jars directory, in my case /Users/derekhill/.ivy2/jars/org.postgresql_postgresql-9.4.1211.jar. Passing this as the --driver-class-path option gives the full spark-submit command of:

/usr/local/Cellar/apache-spark/2.0.2/bin/spark-submit \
 --packages org.postgresql:postgresql:9.4.1211 \
 --driver-class-path /Users/derekhill/.ivy2/jars/org.postgresql_postgresql-9.4.1211.jar \
 --master local[4] main.py

And in main.py:


from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

dataframe = spark.read.format('jdbc').options(
    url="jdbc:postgresql://localhost/my_db?user=derekhill&password=''",
    database='my_db',
    dbtable='my_table'
).load()

dataframe.show()

Answered by MiguelPeralvo

You normally need one of the following:

  1. to install the Postgres driver on your cluster,
  2. to provide the Postgres driver jar from your client with the --jars option, or
  3. to provide the Maven coordinates of the Postgres driver with the --packages option (options 2 and 3 are sketched below).
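For instance, options 2 and 3 look roughly like this (the jar path, version, and script name are illustrative):

$ spark-submit --jars /path/to/postgresql-42.2.5.jar your_script.py
$ spark-submit --packages org.postgresql:postgresql:42.2.5 your_script.py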

If you detail how you are launching pyspark, we may be able to give you more details.

Some clues/ideas:


spark-cannot-find-the-postgres-jdbc-driver


Not able to connect to postgres using jdbc in pyspark shell


Answered by Shawn Guo

This exception means the JDBC driver is not on the driver classpath. You can pass JDBC jars to spark-submit with the --jars parameter, and also add them to the driver classpath using spark.driver.extraClassPath.
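A minimal sketch of doing both on one command line (the jar path and script name are illustrative):

$ spark-submit \
    --jars /path/to/postgresql.jar \
    --conf spark.driver.extraClassPath=/path/to/postgresql.jar \
    script.py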

Answered by Galuoises

To use pyspark with a Jupyter notebook: first open pyspark with

pyspark --driver-class-path /spark_drivers/postgresql-42.2.12.jar --jars /spark_drivers/postgresql-42.2.12.jar

Then, in the Jupyter notebook:

import os

# expanduser resolves the "~"; the JVM would not expand a literal "~" in the classpath
jardrv = os.path.expanduser("~/spark_drivers/postgresql-42.2.12.jar")


from pyspark.sql import SparkSession
spark = SparkSession.builder.config('spark.driver.extraClassPath', jardrv).getOrCreate()
url = 'jdbc:postgresql://127.0.0.1/dbname'
properties = {'user': 'usr', 'password': 'pswd'}
df = spark.read.jdbc(url=url, table='tablename', properties=properties)

Answered by Florian Uhlmann

I had trouble getting a connection to the Postgres DB with the jars I had on my computer. This code solved my problem with the driver:

from pyspark.sql import SparkSession
import os

# PYSPARK_SUBMIT_ARGS must be set before the SparkSession is created,
# so the postgres package is fetched when the JVM starts
sparkClassPath = os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.postgresql:postgresql:42.1.1 pyspark-shell'

spark = SparkSession \
    .builder \
    .config("spark.driver.extraClassPath", sparkClassPath) \
    .getOrCreate()

df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/yourDBname") \
    .option("driver", "org.postgresql.Driver") \
    .option("dbtable", "yourtablename") \
    .option("user", "postgres") \
    .option("password", "***") \
    .load()

df.show()