postgresql — Using pyspark to connect to PostgreSQL
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must follow the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/34948296/
Using pyspark to connect to PostgreSQL
Asked by Mpizos Dimitris
I am trying to connect to a database with pyspark and I am using the following code:
sqlctx = SQLContext(sc)
df = sqlctx.load(
    url="jdbc:postgresql://[hostname]/[database]",
    dbtable="(SELECT * FROM talent LIMIT 1000) as blah",
    password="MichaelJordan",
    user="ScottyPippen",
    source="jdbc",
    driver="org.postgresql.Driver"
)
and I am getting the following error:
Any idea why is this happening?
Edit: I am trying to run the code locally in my computer.
Answered by mdh
The following worked for me with postgres on localhost:
Download the PostgreSQL JDBC Driver from https://jdbc.postgresql.org/download.html.
For the pyspark shell you use the SPARK_CLASSPATH environment variable:
$ export SPARK_CLASSPATH=/path/to/downloaded/jar
$ pyspark
For submitting a script via spark-submit use the --driver-class-path flag:
$ spark-submit --driver-class-path /path/to/downloaded/jar script.py
In the Python script, load the tables as a DataFrame as follows:
from pyspark.sql import DataFrameReader
url = 'postgresql://localhost:5432/dbname'
properties = {'user': 'username', 'password': 'password'}
df = DataFrameReader(sqlContext).jdbc(
    url='jdbc:%s' % url, table='tablename', properties=properties
)
or alternatively:
df = sqlContext.read.format('jdbc'). \
    options(url='jdbc:%s' % url, dbtable='tablename'). \
    load()
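In both variants, the 'jdbc:%s' % url expression simply prepends the jdbc: scheme to a plain PostgreSQL URL. A tiny self-contained sketch (host, port, and database name are placeholders):

```python
# The full JDBC URL is just the plain PostgreSQL URL with a "jdbc:" scheme prefix.
host, port, dbname = "localhost", 5432, "dbname"  # placeholder values
url = "postgresql://%s:%d/%s" % (host, port, dbname)
jdbc_url = "jdbc:%s" % url
print(jdbc_url)  # jdbc:postgresql://localhost:5432/dbname
```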
Note that when submitting the script via spark-submit, you need to define the sqlContext.
Answered by Rafael
Download the PostgreSQL JDBC Driver from https://jdbc.postgresql.org/download.html
Then replace the database configuration values with your own.
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.jars", "/path_to_postgresDriver/postgresql-42.2.5.jar") \
    .getOrCreate()

df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/databasename") \
    .option("dbtable", "tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .option("driver", "org.postgresql.Driver") \
    .load()
df.printSchema()
More info: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
Answered by Chevelle
It is necessary to copy postgresql-42.1.4.jar to all nodes; in my case, I copied it into /opt/spark-2.2.0-bin-hadoop2.7/jars.
I also set the classpath in ~/.bashrc (export SPARK_CLASSPATH="/opt/spark-2.2.0-bin-hadoop2.7/jars"), and it works fine in the pyspark console and Jupyter.
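The copy step above can also be scripted. A minimal sketch (the install_driver helper and all paths are illustrative; the demonstration uses a temporary directory rather than a real Spark install):

```python
import shutil
import tempfile
from pathlib import Path

def install_driver(jar_path, spark_jars_dir):
    """Copy a JDBC driver jar into Spark's jars directory."""
    dest = Path(spark_jars_dir) / Path(jar_path).name
    shutil.copy(jar_path, dest)
    return dest

# Demonstration with temporary stand-in paths
with tempfile.TemporaryDirectory() as tmp:
    jar = Path(tmp) / "postgresql-42.1.4.jar"
    jar.write_bytes(b"placeholder")  # stand-in for the real driver jar
    jars_dir = Path(tmp) / "jars"
    jars_dir.mkdir()
    installed = install_driver(jar, jars_dir)
    print(installed.name)  # postgresql-42.1.4.jar
```

On a real cluster, the same copy has to happen on every node (e.g. via scp or your configuration management tool).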
Answered by Derek Hill
One approach, building on the example in the quick start guide, is this blog post, which shows how to add the --packages org.postgresql:postgresql:9.4.1211 argument to the spark-submit command.
This downloads the driver into the ~/.ivy2/jars directory, in my case /Users/derekhill/.ivy2/jars/org.postgresql_postgresql-9.4.1211.jar. Passing this as the --driver-class-path option gives the full spark-submit command of:
/usr/local/Cellar/apache-spark/2.0.2/bin/spark-submit \
    --packages org.postgresql:postgresql:9.4.1211 \
    --driver-class-path /Users/derekhill/.ivy2/jars/org.postgresql_postgresql-9.4.1211.jar \
    --master local[4] main.py
And in main.py:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
dataframe = spark.read.format('jdbc').options(
    url="jdbc:postgresql://localhost/my_db?user=derekhill&password=''",
    database='my_db',
    dbtable='my_table'
).load()
dataframe.show()
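As a side note, jars fetched by --packages land in ~/.ivy2/jars with names of the form org.postgresql_postgresql-&lt;version&gt;.jar, so the --driver-class-path value can be located programmatically. A small sketch (find_postgres_jar is a hypothetical helper; the cache layout is simulated in a temporary directory):

```python
import glob
import os
import tempfile

def find_postgres_jar(ivy_jars_dir):
    """Return the first PostgreSQL driver jar found in an ivy2 jars cache, or None."""
    pattern = os.path.join(ivy_jars_dir, "org.postgresql_postgresql-*.jar")
    matches = sorted(glob.glob(pattern))
    return matches[0] if matches else None

# Simulate the ~/.ivy2/jars layout in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "org.postgresql_postgresql-9.4.1211.jar"), "w").close()
    jar = find_postgres_jar(tmp)
    print(os.path.basename(jar))  # org.postgresql_postgresql-9.4.1211.jar
```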
Answered by MiguelPeralvo
You normally need one of the following:
- to install the Postgres driver on your cluster,
- to provide the Postgres driver jar from your client with the --jars option,
- or to provide the Maven coordinates of the Postgres driver with the --packages option.
If you detail how you are launching pyspark, we may be able to give you more details.
Some clues/ideas:
spark-cannot-find-the-postgres-jdbc-driver
Answered by Shawn Guo
This exception means the JDBC driver is not on the driver classpath.
You can pass the JDBC jar to spark-submit with the --jars parameter, and also add it to the driver classpath using spark.driver.extraClassPath.
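Combining the two suggestions, a spark-submit invocation might look like the following sketch (the jar path is an example; adjust it to wherever your driver lives):

```shell
# Ship the driver jar to the executors (--jars) and also put it
# on the driver's own classpath (spark.driver.extraClassPath).
spark-submit \
    --jars /path/to/postgresql-42.2.5.jar \
    --conf spark.driver.extraClassPath=/path/to/postgresql-42.2.5.jar \
    script.py
```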
Answered by Galuoises
To use pyspark with a Jupyter notebook: first open pyspark with
pyspark --driver-class-path /spark_drivers/postgresql-42.2.12.jar --jars /spark_drivers/postgresql-42.2.12.jar
Then in the Jupyter notebook:
import os
from pyspark.sql import SparkSession

# Expand "~" in Python, since the JVM will not expand it in the classpath
jardrv = os.path.expanduser("~/spark_drivers/postgresql-42.2.12.jar")
spark = SparkSession.builder.config('spark.driver.extraClassPath', jardrv).getOrCreate()
url = 'jdbc:postgresql://127.0.0.1/dbname'
properties = {'user': 'usr', 'password': 'pswd'}
df = spark.read.jdbc(url=url, table='tablename', properties=properties)
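A note on the ~ in the jar path: Java does not perform shell tilde expansion, so a literal ~ in spark.driver.extraClassPath silently fails to resolve. Expanding the path on the Python side avoids this; a minimal self-contained check (the path is an example):

```python
import os

# expanduser turns "~/..." into an absolute path the JVM can use directly
jardrv = os.path.expanduser("~/spark_drivers/postgresql-42.2.12.jar")
print(os.path.isabs(jardrv))  # True
```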
Answered by Florian Uhlmann
I had trouble getting a connection to the Postgres DB with the jars I had on my computer. This code solved my problem with the driver:
from pyspark.sql import SparkSession
import os

# Ask pyspark to fetch the driver from Maven before the JVM starts
sparkClassPath = os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.postgresql:postgresql:42.1.1 pyspark-shell'

spark = SparkSession \
    .builder \
    .config("spark.driver.extraClassPath", sparkClassPath) \
    .getOrCreate()

df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/yourDBname") \
    .option("driver", "org.postgresql.Driver") \
    .option("dbtable", "yourtablename") \
    .option("user", "postgres") \
    .option("password", "***") \
    .load()

df.show()