Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original source, and attribute it to the original authors (not me): StackOverflow.
Original post: http://stackoverflow.com/questions/36359812/
Spark can access Hive table from pyspark but not from spark-submit
Asked by Dennis
So, when running from pyspark I would type in (without specifying any contexts):
df_openings_latest = sqlContext.sql('select * from experian_int_openings_latest_orc')
... and it works fine.
However, when I run my script from spark-submit, like

spark-submit script.py

I put the following in it:
from pyspark.sql import SQLContext
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName('inc_dd_openings')
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
df_openings_latest = sqlContext.sql('select * from experian_int_openings_latest_orc')
But it gives me an error:
pyspark.sql.utils.AnalysisException: u'Table not found: experian_int_openings_latest_orc;'
So it doesn't see my table.
What am I doing wrong? Please help.
P.S. Spark version is 1.6 running on Amazon EMR
Answered by zero323
Spark 2.x
The same problem may occur in Spark 2.x if SparkSession has been created without enabling Hive support.
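As a minimal sketch (reusing the app name and table name from the question), a Hive-enabled session would be created like this:

from pyspark.sql import SparkSession

# Without enableHiveSupport() the session does not use the Hive metastore,
# and the table lookup fails.
spark = SparkSession.builder \
    .appName('inc_dd_openings') \
    .enableHiveSupport() \
    .getOrCreate()

df_openings_latest = spark.sql('select * from experian_int_openings_latest_orc')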
Spark 1.x
It is pretty simple. When you use the PySpark shell, and Spark has been built with Hive support, the default SQLContext implementation (the one available as sqlContext) is a HiveContext.
In your standalone application you use a plain SQLContext, which doesn't provide Hive capabilities.
Assuming the rest of the configuration is correct, just replace:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
with
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
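Applied to the script from the question, a minimal sketch of the Spark 1.x version would be:

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

conf = SparkConf().setAppName('inc_dd_openings')
sc = SparkContext(conf=conf)

# HiveContext talks to the Hive metastore, so the table can be resolved.
sqlContext = HiveContext(sc)

df_openings_latest = sqlContext.sql('select * from experian_int_openings_latest_orc')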
Answered by Mike Placentra
In Spark 2.x (Amazon EMR 5+) you will run into this issue with spark-submit if you don't enable Hive support like this:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("yarn").appName("my app").enableHiveSupport().getOrCreate()
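With that session in place, the query from the question should work unchanged, for example spark.sql('select * from experian_int_openings_latest_orc').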
Answered by Brian
Your problem may be related to your Hive configuration. If your configuration uses a local metastore, the metastore_db directory gets created in the directory that you started your Hive server from.
Since spark-submit is launched from a different directory, it creates a new metastore_db in that directory, which does not contain information about your previous tables.
A quick fix would be to start the Hive server from the same directory as spark-submit and re-create your tables.
A more permanent fix is referenced in this SO post.
You need to change your configuration in $HIVE_HOME/conf/hive-site.xml:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=/home/youruser/hive_metadata/metastore_db;create=true</value>
</property>
You should now be able to run hive from any location and still find your tables.