Select Specific Columns from Spark DataFrame

Disclaimer: this page is a translation of a popular StackOverflow Q&A, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/51689460/

scala, apache-spark, apache-spark-sql

Asked by A.HADDAD

I have loaded CSV data into a Spark DataFrame.

I need to slice this dataframe into two different dataframes, where each one contains a set of columns from the original dataframe.

How do I select a subset into a Spark dataframe, based on columns ?

Answered by puhlen

If you want to split your dataframe into two different ones, do two selects on it, each with the columns you want.

 val sourceDF = spark.read.csv(...)
 val df1 = sourceDF.select("first column", "second column", "third column")
 val df2 = sourceDF.select("fourth column", "fifth column", "sixth column")

Note that this of course means that sourceDF would be evaluated twice, so if it can fit into distributed memory and you use most of the columns across both dataframes, it might be a good idea to cache it. If it has many extra columns that you don't need, you can do a select on it first to pick only the columns you will need, so that you don't store all that extra data in memory.

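A minimal sketch of that advice, assuming placeholder column names (none of them come from the question itself):

 // Narrow the source to just the columns either child needs, then cache
 // that result so the CSV is only scanned once.
 val trimmedDF = sourceDF
   .select("first column", "second column", "third column", "fourth column")
   .cache()

 val df1 = trimmedDF.select("first column", "second column")
 val df2 = trimmedDF.select("third column", "fourth column")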

Answered by prasanna kumar

Let's say our parent Dataframe has 'n' columns.

We can create 'x' child DataFrames (let's consider 2 in our case).

The columns for a child Dataframe can be chosen as desired from any of the parent Dataframe's columns.

Consider the source has 10 columns and we want to split it into 2 DataFrames that contain columns referenced from the parent Dataframe.

The columns for the child Dataframe can be decided using the select Dataframe API.

val parentDF = spark.read.format("csv").load("/path of the CSV file")

// select returns a new DataFrame; show() returns Unit, so call it separately.
val child1_DF = parentDF.select("col1", "col2", "col3", "col9", "col10")
child1_DF.show()

val child2_DF = parentDF.select("col5", "col6", "col7", "col8", "col1", "col2")
child2_DF.show()

Notice that the column counts of the child dataframes can differ from each other, and each will be at most the parent dataframe's column count.

We can also refer to columns without mentioning their real names, using the positional default names (_c0, _c1, ...) that Spark assigns when the CSV is loaded without a header.

First import spark implicits, which enable the $-notation for accessing columns by these positional default names.

import spark.implicits._
import org.apache.spark.sql.functions._

val child3_DF = parentDF.select("_c0", "_c1", "_c2", "_c8", "_c9")
child3_DF.show()

We can also select columns based on certain conditions. Let's say we want only even-numbered columns to be selected into the child dataframe. By even we mean even-indexed columns, with the index starting from 0.

val parentColumns = parentDF.columns.toList
// parentColumns: List[String] = List(_c0, _c1, _c2, _c3, _c4, _c5, _c6, _c7, _c8, _c9)

val evenParentColumns = parentColumns.zipWithIndex.filter(_._2 % 2 == 0).map(_._1)
// evenParentColumns: List[String] = List(_c0, _c2, _c4, _c6, _c8)

Now feed these columns into a select on parentDF. Note that this select overload takes a first column name followed by varargs, so we pass the head of the list and the remaining names separately.

val child4_DF = parentDF.select(evenParentColumns.head, evenParentColumns.tail: _*)
child4_DF.show()

This will show the even-indexed columns from the parent Dataframe:

| _c0         | _c2  | _c4  | _c6 | _c8 |
| ITE00100554 | TMAX | null | E   | 1   |
| TE00100554  | TMIN | null | E   | 4   |
| GM000010962 | PRCP | null | E   | 7   |

Now we are left with only the even-indexed columns in the dataframe.

Similarly, we can also apply other operations to Dataframe columns, as shown below:

val child5_DF = parentDF.select($"_c0", $"_c8" + 1)
child5_DF.show()

So, in the many ways mentioned above, we can select columns in a Dataframe.

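For reference, a short illustrative sketch collecting the notations used above (all four lines select the same column, and spark.implicits._ is assumed to be imported for the $-notation):

import org.apache.spark.sql.functions.{col, expr}

parentDF.select("_c0")       // by plain name
parentDF.select(col("_c0"))  // via the col function
parentDF.select($"_c0")      // $-notation (requires spark.implicits._)
parentDF.select(expr("_c0")) // as a SQL expression string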

Answered by A.HADDAD

Solved: just use the select method on the dataframe to select columns:

// Backslashes in a Windows path must be escaped in a Scala string literal.
val df = spark.read.csv("C:\\Users\\Ahmed\\Desktop\\cabs_trajectories\\cabs_trajectories\\green\\2014\\green_tripdata_2014-09.csv")

val df1 = df.select("_c0")

This would subset the first column of the dataframe.

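The same pattern extends to several positional columns at once; the _c0, _c1, ... names are the defaults spark.read.csv assigns when the file has no header (the three picked here are just an example):

// Select the first three positional columns together.
val firstThree = df.select("_c0", "_c1", "_c2")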

Answered by dehasi

Just by using select you can select particular columns, give them readable names, and cast them. For example like this:

import spark.implicits._
import org.apache.spark.sql.types.{DoubleType, StringType}

// Filter first, while the original _cN names are still in scope,
// then rename and cast in the select.
spark.read.csv(path)
  .where('_c2.isNotNull && '_c3.isNotNull && '_c2 =!= 0.0 && '_c3 =!= 0.0)
  .select(
    '_c0.alias("stn").cast(StringType),
    '_c1.alias("wban").cast(StringType),
    '_c2.alias("lat").cast(DoubleType),
    '_c3.alias("lon").cast(DoubleType)
  )
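
An alternative sketch, not from the original answer: declare the schema up front instead of casting column by column, assuming the same four CSV columns as above.

import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// Hypothetical schema matching the four columns used above.
val schema = StructType(Seq(
  StructField("stn", StringType),
  StructField("wban", StringType),
  StructField("lat", DoubleType),
  StructField("lon", DoubleType)
))

val stations = spark.read.schema(schema).csv(path)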

Answered by Vivek

You can use the code below to select columns based on their index (position). You can alter the numbers in colNos to select only those columns.

import org.apache.spark.sql.functions.col

// Indexes of the columns to keep; edit this Seq to select different columns.
val colNos = Seq(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
  18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35)

// Map each index to its column name, then to a Column, and splat into select.
val Df_01 = Df.select(colNos map Df.columns map col: _*)
Df_01.show(20, false)
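
A variation on the same idea (my addition, under the same assumptions about Df): a contiguous block of columns can be expressed with a Range instead of spelling out every index.

// Keep the first ten columns, positions 0 through 9.
val firstTenDf = Df.select((0 until 10) map Df.columns map col: _*)
firstTenDf.show(20, false)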