Does SparkSQL support subquery?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/33933118/

Date: 2020-09-01 04:13:32  Source: igfitidea

Does SparkSQL support subquery?

Tags: sql, apache-spark, subquery, apache-spark-sql

Asked by Rinku Buragohain

I am running this query in Spark shell but it gives me error,

sqlContext.sql(
 "select sal from samplecsv where sal < (select MAX(sal) from samplecsv)"
).collect().foreach(println)

error:

java.lang.RuntimeException: [1.47] failure: ``)'' expected but identifier MAX found

select sal from samplecsv where sal < (select MAX(sal) from samplecsv)
                                              ^
at scala.sys.package$.error(package.scala:27)

Can anybody explain this to me? Thanks.

Answered by zero323

Planned features:

  • SPARK-23945 (Column.isin() should accept a single-column DataFrame as input).
  • SPARK-18455 (general support for correlated subquery processing).

Spark 2.0+

Spark SQL should support both correlated and uncorrelated subqueries. See SubquerySuite for details. Some examples include:

select * from l where exists (select * from r where l.a = r.c)
select * from l where not exists (select * from r where l.a = r.c)

select * from l where l.a in (select c from r)
select * from l where a not in (select c from r)
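
The semantics of these four forms can be sketched with plain Scala collections (the sample data below is hypothetical, and `contains` merely stands in for the join the engine would actually perform):

```scala
// Hypothetical sample data standing in for tables l(a) and r(c)
val l = Seq(1, 2, 3, 6)
val r = Seq(2, 3, 4)

// WHERE EXISTS (select * from r where l.a = r.c): keep rows of l with a match in r
val existsRows = l.filter(a => r.contains(a))     // Seq(2, 3)

// WHERE NOT EXISTS (...): keep rows of l with no match in r
val notExistsRows = l.filter(a => !r.contains(a)) // Seq(1, 6)
```

Note that `a IN (select c from r)` behaves like the EXISTS form here, but in real SQL `NOT IN` diverges from `NOT EXISTS` as soon as `r.c` contains NULLs.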

Unfortunately, as of now (Spark 2.0) it is impossible to express the same logic using the DataFrame DSL.

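A common workaround in the DSL is to materialize the scalar aggregate into a local value first and then filter with it (in Spark terms: an `agg(max("sal"))` followed by a `filter`). The two-step idea, modeled here on plain Scala collections with hypothetical data:

```scala
// Hypothetical salaries standing in for samplecsv's sal column
val sals = Seq(1500.0, 3000.0, 4200.0, 4200.0)

// Step 1: materialize the scalar subquery result locally
// (in Spark: df.agg(max("sal")).first())
val maxSal = sals.max

// Step 2: filter against the materialized value
// (in Spark: df.filter(col("sal") < maxSal))
val belowMax = sals.filter(_ < maxSal)            // Seq(1500.0, 3000.0)
```
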
Spark < 2.0

Spark supports subqueries in the FROM clause (same as Hive <= 0.12).

SELECT col FROM (SELECT * FROM t1 WHERE bar) t2

It simply doesn't support subqueries in the WHERE clause. Generally speaking, arbitrary subqueries (in particular correlated subqueries) cannot be expressed using Spark without promoting to a Cartesian join.

Since subquery performance is usually a significant issue in a typical relational system, and every subquery can be expressed using a JOIN, there is no loss of function here.


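For the question's original query, that JOIN rewrite could look like the sketch below: the scalar subquery moves into the FROM clause (which Spark < 2.0 does accept) as a one-row derived table, and the comparison becomes an ordinary filter. Table and column names follow the question; whether the old parser accepts this exact CROSS JOIN syntax is an assumption worth verifying:

```scala
// The WHERE-clause subquery becomes a one-row derived table in FROM,
// cross-joined against every row and then filtered.
val rewritten =
  """SELECT s.sal
    |FROM samplecsv s
    |CROSS JOIN (SELECT MAX(sal) AS max_sal FROM samplecsv) m
    |WHERE s.sal < m.max_sal""".stripMargin

// In the Spark shell this would then run as:
// sqlContext.sql(rewritten).collect().foreach(println)
```
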
Answered by Tagar

https://issues.apache.org/jira/browse/SPARK-4226

There is a pull request to implement that feature; my guess is it might land in Spark 2.0.
