
Note: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original URL: http://stackoverflow.com/questions/43033835/

Time: 2020-11-03 07:02:03  Source: igfitidea

join in a dataframe spark java

java apache-spark dataframe spark-dataframe

Asked by Alejandro Reina

First of all, thank you for taking the time to read my question.

My question is the following: in Spark with Java, I load the data from two CSV files into two dataframes.

These dataframes contain the following information.

Dataframe Airport

Id | Name    | City
-----------------------
1  | Barajas | Madrid

Dataframe airport_city_state

City   | state
----------------
Madrid | España

I want to join these two dataframes so that the result looks like this:

Dataframe result

Id | Name    | City   | state
--------------------------
1  | Barajas | Madrid | España

Where dfairport.city = dfairport_city_state.city

But I cannot work out the syntax to do the join correctly. Here is some of the code showing how I created the variables:

// Load the CSVs; you have to specify that there is a header and which delimiter is used
Dataset<Row> dfairport = Load.Csv(sqlContext, data_airport);
Dataset<Row> dfairport_city_state = Load.Csv(sqlContext, data_airport_city_state);

// Change the names of the columns in the CSV dataframes to match the columns in the database.
// Once the names match, we can insert them.
// Note: withColumnRenamed returns a new Dataset; the results are not assigned here,
// so dfairport and dfairport_city_state themselves keep their original column names.
dfairport
        .withColumnRenamed("leg_key", "id")
        .withColumnRenamed("leg_name", "name")
        .withColumnRenamed("leg_city", "city");

dfairport_city_state
        .withColumnRenamed("city", "ciudad")
        .withColumnRenamed("state", "estado");

Answered by Darshan Mehta

You can use the join method with a column name to join two dataframes, e.g.:

Dataset <Row> dfairport = Load.Csv (sqlContext, data_airport);
Dataset <Row> dfairport_city_state = Load.Csv (sqlContext,   data_airport_city_state);

Dataset <Row> joined = dfairport.join(dfairport_city_state, dfairport_city_state("City"));

There is also an overloaded version that lets you specify the join type as the third argument, e.g.:

Dataset <Row> joined = dfairport.join(dfairport_city_state, dfairport_city_state("City"), "left_outer");

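To make the effect of the join type concrete, here is a minimal plain-Java sketch of what an inner join and a "left_outer" join on the city column would return for the question's data. It is an illustration only and does not use Spark: the class name JoinTypesSketch, the helper joinOnCity, and the second airport row are all made up for this example, and it assumes each city appears at most once in the city/state table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of join semantics; this is NOT Spark code.
class JoinTypesSketch {

    // Joins airport rows {id, name, city} with a city -> state lookup.
    // leftOuter = false: airports whose city has no match are dropped (inner join).
    // leftOuter = true:  unmatched airports are kept with a null state (left outer join).
    // Assumption: each city appears at most once in stateByCity.
    static List<List<String>> joinOnCity(List<List<String>> airports,
                                         Map<String, String> stateByCity,
                                         boolean leftOuter) {
        List<List<String>> result = new ArrayList<>();
        for (List<String> airport : airports) {
            String city = airport.get(2);
            boolean matched = stateByCity.containsKey(city);
            if (matched || leftOuter) {
                List<String> row = new ArrayList<>(airport);
                row.add(matched ? stateByCity.get(city) : null);
                result.add(row);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> airports = Arrays.asList(
                Arrays.asList("1", "Barajas", "Madrid"),
                // Hypothetical row, not in the question, with no matching city
                Arrays.asList("2", "Hypothetical Field", "Nowhere"));
        Map<String, String> stateByCity = new LinkedHashMap<>();
        stateByCity.put("Madrid", "Espa\u00f1a");

        System.out.println(joinOnCity(airports, stateByCity, false)); // inner
        System.out.println(joinOnCity(airports, stateByCity, true));  // left outer
    }
}
```

With the question's single Barajas row both variants give the same result; the difference only shows up for rows on the left side that have no match on the right, such as the made-up second airport here.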

Here's more on joins.


Answered by Alejandro Reina

First, thank you very much for your response.

I have tried both of your solutions, but neither of them works. I get the following error: The method dfairport_city_state(String) is undefined for the type ETL_Airport

I cannot access a specific column of the dataframe for the join.

EDIT: I have now got the join working. I am posting the solution here in case it helps someone else ;)

Thanks for everything and best regards

// Join the tables on the city they share
Dataset<Row> joined = dfairport.join(dfairport_city_state, dfairport.col("leg_city").equalTo(dfairport_city_state.col("city")));