Scala: What are the various join types in Spark?
Disclaimer: this page is a translation of a popular StackOverflow Q&A, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/45990633/
What are the various join types in Spark?
Asked by pathikrit
I looked at the docs and it says the following join types are supported:
Type of join to perform. Default inner. Must be one of: inner, cross, outer, full, full_outer, left, left_outer, right, right_outer, left_semi, left_anti.
I looked at the StackOverflow answer on SQL joins, and the top couple of answers do not mention some of the joins from above, e.g. left_semi and left_anti. What do they mean in Spark?
Answered by pathikrit
Here is a simple illustrative experiment:
import org.apache.spark.sql._

object SparkSandbox extends App {
  implicit val spark = SparkSession.builder().master("local[*]").getOrCreate()
  import spark.implicits._
  spark.sparkContext.setLogLevel("ERROR")

  val left = Seq((1, "A1"), (2, "A2"), (3, "A3"), (4, "A4")).toDF("id", "value")
  val right = Seq((3, "A3"), (4, "A4"), (4, "A4_1"), (5, "A5"), (6, "A6")).toDF("id", "value")

  println("LEFT")
  left.orderBy("id").show()
  println("RIGHT")
  right.orderBy("id").show()

  val joinTypes = Seq("inner", "outer", "full", "full_outer", "left", "left_outer", "right", "right_outer", "left_semi", "left_anti")

  joinTypes foreach { joinType =>
    println(s"${joinType.toUpperCase()} JOIN")
    left.join(right = right, usingColumns = Seq("id"), joinType = joinType).orderBy("id").show()
  }
}
Output
LEFT
+---+-----+
| id|value|
+---+-----+
| 1| A1|
| 2| A2|
| 3| A3|
| 4| A4|
+---+-----+
RIGHT
+---+-----+
| id|value|
+---+-----+
| 3| A3|
| 4| A4|
| 4| A4_1|
| 5| A5|
| 6| A6|
+---+-----+
INNER JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
| 3| A3| A3|
| 4| A4| A4_1|
| 4| A4| A4|
+---+-----+-----+
OUTER JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
| 1| A1| null|
| 2| A2| null|
| 3| A3| A3|
| 4| A4| A4|
| 4| A4| A4_1|
| 5| null| A5|
| 6| null| A6|
+---+-----+-----+
FULL JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
| 1| A1| null|
| 2| A2| null|
| 3| A3| A3|
| 4| A4| A4|
| 4| A4| A4_1|
| 5| null| A5|
| 6| null| A6|
+---+-----+-----+
FULL_OUTER JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
| 1| A1| null|
| 2| A2| null|
| 3| A3| A3|
| 4| A4| A4|
| 4| A4| A4_1|
| 5| null| A5|
| 6| null| A6|
+---+-----+-----+
LEFT JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
| 1| A1| null|
| 2| A2| null|
| 3| A3| A3|
| 4| A4| A4_1|
| 4| A4| A4|
+---+-----+-----+
LEFT_OUTER JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
| 1| A1| null|
| 2| A2| null|
| 3| A3| A3|
| 4| A4| A4_1|
| 4| A4| A4|
+---+-----+-----+
RIGHT JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
| 3| A3| A3|
| 4| A4| A4_1|
| 4| A4| A4|
| 5| null| A5|
| 6| null| A6|
+---+-----+-----+
RIGHT_OUTER JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
| 3| A3| A3|
| 4| A4| A4|
| 4| A4| A4_1|
| 5| null| A5|
| 6| null| A6|
+---+-----+-----+
LEFT_SEMI JOIN
+---+-----+
| id|value|
+---+-----+
| 3| A3|
| 4| A4|
+---+-----+
LEFT_ANTI JOIN
+---+-----+
| id|value|
+---+-----+
| 1| A1|
| 2| A2|
+---+-----+
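One type from the docs list that the loop above skips is cross. As a minimal sketch (not part of the original answer; assumes Spark 2.1+, where Dataset.crossJoin exists), a cross join pairs every left row with every right row and therefore takes no join key:

println("CROSS JOIN")
left.crossJoin(right).show() // Cartesian product: 4 x 5 = 20 rows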
Answered by jgp
Loved Pathikrit's example. Here is a possible translation into Java using Spark v2 and DataFrames, including the cross join.
package net.jgp.books.sparkInAction.ch12.lab940AllJoins;

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

/**
 * All joins in a single app, inspired by
 * https://stackoverflow.com/questions/45990633/what-are-the-various-join-types-in-spark.
 *
 * Used in Spark in Action 2e, http://jgp.net/sia
 *
 * @author jgp
 */
public class AllJoinsApp {

  /**
   * main() is your entry point to the application.
   *
   * @param args
   */
  public static void main(String[] args) {
    AllJoinsApp app = new AllJoinsApp();
    app.start();
  }

  /**
   * The processing code.
   */
  private void start() {
    // Creates a session on a local master
    SparkSession spark = SparkSession.builder()
        .appName("Processing of invoices")
        .master("local")
        .getOrCreate();

    StructType schema = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField("id", DataTypes.IntegerType, false),
        DataTypes.createStructField("value", DataTypes.StringType, false) });

    List<Row> rows = new ArrayList<Row>();
    rows.add(RowFactory.create(1, "A1"));
    rows.add(RowFactory.create(2, "A2"));
    rows.add(RowFactory.create(3, "A3"));
    rows.add(RowFactory.create(4, "A4"));
    Dataset<Row> dfLeft = spark.createDataFrame(rows, schema);
    dfLeft.show();

    rows = new ArrayList<Row>();
    rows.add(RowFactory.create(3, "A3"));
    rows.add(RowFactory.create(4, "A4"));
    rows.add(RowFactory.create(4, "A4_1"));
    rows.add(RowFactory.create(5, "A5"));
    rows.add(RowFactory.create(6, "A6"));
    Dataset<Row> dfRight = spark.createDataFrame(rows, schema);
    dfRight.show();

    String[] joinTypes = new String[] {
        "inner", // v2.0.0, default
        "cross", // v2.2.0
        "outer", // v2.0.0
        "full", // v2.1.1
        "full_outer", // v2.1.1
        "left", // v2.1.1
        "left_outer", // v2.0.0
        "right", // v2.1.1
        "right_outer", // v2.0.0
        "left_semi", // v2.0.0, was leftsemi before v2.1.1
        "left_anti" // v2.1.1
    };

    for (String joinType : joinTypes) {
      System.out.println(joinType.toUpperCase() + " JOIN");
      Dataset<Row> df = dfLeft.join(
          dfRight,
          dfLeft.col("id").equalTo(dfRight.col("id")),
          joinType);
      df.orderBy(dfLeft.col("id")).show();
    }
  }
}
I'll put this example in the chapter 12 repository of Spark in Action, 2e.
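One difference worth noting between this Java version and the Scala experiment above: joining on a column expression (dfLeft.col("id").equalTo(dfRight.col("id"))) keeps both id columns in the result, whereas joining with a usingColumns sequence, as the Scala answer does, emits a single merged id column. A short sketch of the two call shapes, in Scala and using the left/right DataFrames from the first answer:

left.join(right, left("id") === right("id"), "inner").show() // columns: id, value, id, value
left.join(right, Seq("id"), "inner").show()                  // columns: id, value, value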
Answered by Rajeev Rathor
Spark DataFrames support the following join types between two DataFrames.
Below is the list of join types and the join string to pass for each, along with the Scala syntax.
Method: Leftdataframe.join(Rightdataframe, join_conditions, joinStringName)
Join name : join string in Scala/Spark code
1. inner : 'inner'
2. cross : 'cross'
3. outer : 'outer'
4. full : 'full'
5. full outer : 'fullouter'
6. left : 'left'
7. left outer : 'leftouter'
8. right : 'right'
9. right outer : 'rightouter'
10. left semi : 'leftsemi'
11. left anti : 'leftanti'
Examples:
1. Left semi join:
Leftdataframe.join(Rightdataframe, join_conditions, "leftsemi");
2. Inner join:
Leftdataframe.join(Rightdataframe, join_conditions, "inner");
This is tested and works well.
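For reference, a minimal self-contained sketch of the pattern above (the DataFrame names, contents, and join condition here are illustrative, not from the original answer):

import org.apache.spark.sql.SparkSession

object JoinStringDemo extends App {
  val spark = SparkSession.builder().master("local[*]").getOrCreate()
  import spark.implicits._

  val leftDataframe = Seq((1, "A1"), (3, "A3")).toDF("id", "value")
  val rightDataframe = Seq((3, "B3"), (5, "B5")).toDF("id", "value")
  val joinConditions = leftDataframe("id") === rightDataframe("id")

  // Leftdataframe.join(Rightdataframe, join_conditions, joinStringName)
  leftDataframe.join(rightDataframe, joinConditions, "leftsemi").show() // keeps only (3, A3)
  leftDataframe.join(rightDataframe, joinConditions, "leftanti").show() // keeps only (1, A1)
}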
Answered by Burrito
Left Semi returns rows where the join key is found in both tables, but it only includes the fields from the left table.
Left Anti returns rows where the join key is found only in the left table.
Good descriptions of the different join types: https://www.cloudera.com/documentation/enterprise/latest/topics/impala_joins.html
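In SQL terms, left semi behaves like an EXISTS filter and left anti like a NOT EXISTS filter. A sketch in Spark SQL, assuming the left and right DataFrames from the first answer are registered as temp views (the view names are illustrative):

left.createOrReplaceTempView("l")
right.createOrReplaceTempView("r")
// LEFT SEMI JOIN returns the same rows as a WHERE EXISTS filter over the left table
spark.sql("SELECT * FROM l LEFT SEMI JOIN r ON l.id = r.id").show()
spark.sql("SELECT * FROM l WHERE EXISTS (SELECT 1 FROM r WHERE l.id = r.id)").show()
// LEFT ANTI JOIN returns the same rows as a WHERE NOT EXISTS filter
spark.sql("SELECT * FROM l LEFT ANTI JOIN r ON l.id = r.id").show()
spark.sql("SELECT * FROM l WHERE NOT EXISTS (SELECT 1 FROM r WHERE l.id = r.id)").show()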