SQL: Does SparkSQL support subqueries?
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/33933118/
Does SparkSQL support subquery?
Asked by Rinku Buragohain
I am running this query in the Spark shell, but it gives me an error:
sqlContext.sql(
"select sal from samplecsv where sal < (select MAX(sal) from samplecsv)"
).collect().foreach(println)
error:
java.lang.RuntimeException: [1.47] failure: ``)'' expected but identifier MAX found
select sal from samplecsv where sal < (select MAX(sal) from samplecsv) ^ at scala.sys.package$.error(package.scala:27)

Can anybody explain this to me? Thanks.
Answered by zero323
Planned features:
- SPARK-23945 (Column.isin() should accept a single-column DataFrame as input).
- SPARK-18455 (General support for correlated subquery processing).
Spark 2.0+
Spark SQL should support both correlated and uncorrelated subqueries. See SubquerySuite for details. Some examples include:
select * from l where exists (select * from r where l.a = r.c)
select * from l where not exists (select * from r where l.a = r.c)
select * from l where l.a in (select c from r)
select * from l where a not in (select c from r)
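For instance, a minimal sketch of running the third example above in Spark 2.0+ (the sample data, column names, and view names l and r are made up here to match the examples; they are not part of the original answer):

// Spark 2.0+: register two small DataFrames as temp views and run an
// uncorrelated IN subquery through SQL.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("subquery-example").getOrCreate()
import spark.implicits._

Seq((1, "x"), (2, "y"), (3, "z")).toDF("a", "b").createOrReplaceTempView("l")
Seq((1, "p"), (3, "q")).toDF("c", "d").createOrReplaceTempView("r")

spark.sql("select * from l where l.a in (select c from r)").show()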
Unfortunately, as of now (Spark 2.0), it is impossible to express the same logic using the DataFrame DSL.
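A common DSL-level workaround (a sketch, not part of the original answer) is an explicit join instead of a subquery; for the IN-subquery case, a left semi join keeps the rows of l that have a match in r:

// Continues from the previous sketch (same SparkSession and implicits).
val lDF = Seq((1, "x"), (2, "y"), (3, "z")).toDF("a", "b")
val rDF = Seq((1, "p"), (3, "q")).toDF("c", "d")

// "leftsemi" keeps rows of lDF with a matching rDF("c"), mimicking
// `l.a in (select c from r)` without writing a subquery.
lDF.join(rDF, lDF("a") === rDF("c"), "leftsemi").show()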
Spark < 2.0
Spark supports subqueries in the FROM clause (same as Hive <= 0.12).
SELECT col FROM (SELECT * FROM t1 WHERE bar) t2
It simply doesn't support subqueries in the WHERE clause. Generally speaking, arbitrary subqueries (in particular correlated subqueries) cannot be expressed using Spark without promoting them to a Cartesian join.
Since subquery performance is usually a significant issue in a typical relational system, and every subquery can be expressed using a JOIN, there is no loss of function here.
Answered by Tagar
https://issues.apache.org/jira/browse/SPARK-4226
There is a pull request to implement that feature... my guess is that it might land in Spark 2.0.