Python: Spark can access a Hive table from pyspark but not from spark-submit

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/36359812/

Date: 2020-08-19 17:46:49 | Source: igfitidea

Spark can access Hive table from pyspark but not from spark-submit

python, hadoop, apache-spark, pyspark

Asked by Dennis

So, when running from pyspark I would type in (without specifying any contexts):

df_openings_latest = sqlContext.sql('select * from experian_int_openings_latest_orc')

... and it works fine.

However, when I run my script from spark-submit, like

spark-submit script.py, I put the following in:

from pyspark.sql import SQLContext
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName('inc_dd_openings')
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

df_openings_latest = sqlContext.sql('select * from experian_int_openings_latest_orc')

But it gives me an error

pyspark.sql.utils.AnalysisException: u'Table not found: experian_int_openings_latest_orc;'

So it doesn't see my table.

What am I doing wrong? Please help

P.S. Spark version is 1.6 running on Amazon EMR

Answered by zero323

Spark 2.x

The same problem may occur in Spark 2.x if the SparkSession has been created without enabling Hive support.

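A minimal sketch of what enabling Hive support looks like (reusing the app name and table from the question; the builder calls are the standard SparkSession API):

from pyspark.sql import SparkSession

# Building the session with enableHiveSupport() lets spark.sql() resolve
# tables registered in the Hive metastore.
spark = SparkSession.builder \
    .appName('inc_dd_openings') \
    .enableHiveSupport() \
    .getOrCreate()

df_openings_latest = spark.sql('select * from experian_int_openings_latest_orc')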

Spark 1.x

It is pretty simple. When you use the PySpark shell and Spark has been built with Hive support, the default SQLContext implementation (the one available as sqlContext) is a HiveContext.

In your standalone application you use a plain SQLContext, which doesn't provide Hive capabilities.

Assuming the rest of the configuration is correct, just replace:

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

with

from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)
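
Putting the pieces together, a minimal Spark 1.x standalone script would look roughly like this (a sketch built from the question's own code, with SQLContext swapped for HiveContext):

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

conf = SparkConf().setAppName('inc_dd_openings')
sc = SparkContext(conf=conf)

# HiveContext talks to the Hive metastore, so tables defined in Hive are visible.
sqlContext = HiveContext(sc)

df_openings_latest = sqlContext.sql('select * from experian_int_openings_latest_orc')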

Answered by Mike Placentra

In Spark 2.x (Amazon EMR 5+) you will run into this issue with spark-submit if you don't enable Hive support like this:

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("yarn").appName("my app").enableHiveSupport().getOrCreate()
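
With Hive support enabled, the lookup from the question should then go through the Hive-backed catalog (a usage sketch, reusing the spark session created above):

# The table name now resolves against the Hive metastore.
df_openings_latest = spark.sql('select * from experian_int_openings_latest_orc')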

Answered by Brian

Your problem may be related to your Hive configuration. If your configuration uses a local metastore, the metastore_db directory gets created in the directory that you started your Hive server from.

Since spark-submit is launched from a different directory, it creates a new metastore_db in that directory, which does not contain information about your previous tables.

A quick fix would be to start the Hive server from the same directory as spark-submit and re-create your tables.

A more permanent fix is referenced in this SO post.

You need to change your configuration in $HIVE_HOME/conf/hive-site.xml

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=/home/youruser/hive_metadata/metastore_db;create=true</value>
</property>

You should now be able to run hive from any location and still find your tables.
