scala 基于列索引的 Spark Dataframe 选择
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/43553803/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me):
StackOverFlow
Spark Dataframe select based on column index
提问 by Vikas J
How do I select all the columns of a dataframe that have certain indexes in Scala?
如何选择在 Scala 中具有特定索引的数据帧的所有列?
For example, if a dataframe has 100 columns and I want to extract only columns (10, 12, 13, 14, 15), how do I do that?
例如,如果一个数据框有 100 列,而我只想提取第 (10,12,13,14,15) 列,该如何实现?
The following selects all columns from dataframe df whose names appear in the Array colNames:
下面从数据框 df 中选择列名出现在数组 colNames 中的所有列:
df = df.select(colNames.head, colNames.tail: _*)
Similarly, if there is a colNos array which contains
类似地,如果有一个 colNos 数组,其内容为
colNos = Array(10,20,25,45)
how do I transform the above df.select to fetch only the columns at those specific indexes?
如何改写上述 df.select,使其只获取特定索引处的那些列?
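For reference, a minimal runnable sketch of the name-based pattern above; the SparkSession setup, DataFrame, and column names here are made up purely for illustration:
作为参考,下面是上述基于列名写法的一个最小可运行示例;其中的 SparkSession、数据框和列名都只是为演示而虚构的:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Toy DataFrame for illustration only
val df = Seq((1, "a", 2.0), (2, "b", 3.0)).toDF("id", "label", "score")

// Name-based selection, as already used above
val colNames = Array("id", "score")
val selected = df.select(colNames.head, colNames.tail: _*)
selected.show()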
回答 by zero323
You can map over columns:
您可以对 columns 做 map:
import org.apache.spark.sql.functions.col
df.select(colNos map df.columns map col: _*)
or:
或者:
df.select(colNos map (df.columns andThen col): _*)
or:
或者:
df.select(colNos map (col _ compose df.columns): _*)
All the methods shown above are equivalent and don't impose a performance penalty. The following mapping:
上面显示的所有方法都是等效的,不会造成性能损失。以下映射:
colNos map df.columns
is just a local Array access (constant-time access for each index), and choosing between the String- or Column-based variant of select doesn't affect the execution plan:
只是一次本地 Array 访问(每个索引的访问都是常数时间),并且在 select 的 String 变体和 Column 变体之间做选择不会影响执行计划:
val df = Seq((1, 2, 3, 4, 5, 6)).toDF
val colNos = Seq(0, 3, 5)
df.select(colNos map df.columns map col: _*).explain
== Physical Plan ==
LocalTableScan [_1#46, _4#49, _6#51]
df.select("_1", "_4", "_6").explain
== Physical Plan ==
LocalTableScan [_1#46, _4#49, _6#51]
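Putting this together, a small helper that selects columns by position could look like the following sketch (the name selectByIndex is invented here, not part of the Spark API):
综合以上内容,一个按位置选择列的小辅助函数大致可以写成下面这样(函数名 selectByIndex 是这里虚构的,并非 Spark API 的一部分):
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

// Hypothetical helper: select the columns of df at the given positional indexes
def selectByIndex(df: DataFrame, indexes: Seq[Int]): DataFrame = {
  val cols: Seq[Column] = indexes.map(i => col(df.columns(i)))
  df.select(cols: _*)
}

// e.g. selectByIndex(df, Seq(0, 3, 5)) produces the same plan as df.select("_1", "_4", "_6") above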
回答 by Ramesh Maharjan
@user6910411's answer above works like a charm and the number of tasks/logical plan is similar to my approach below. BUT my approach is a bit faster.
So,
I would suggest you go with the column names rather than column numbers. Column names are much safer and much lighter than using numbers. You can use the following solution:
@user6910411 上面的回答非常好用,任务数/逻辑计划与我下面的方法相似。但是我的方法要快一些。
所以,
我建议你使用列名(column names)而不是列编号(column numbers)。列名比编号更安全、更轻量。您可以使用以下解决方案:
val colNames = Seq("col1", "col2" ...... "col99", "col100")
val selectColNames = Seq("col1", "col3", .... selected column names ... )
val selectCols = selectColNames.map(name => df.col(name))
df = df.select(selectCols:_*)
If you are hesitant to write out all 100 column names, then there is a shortcut method too:
如果您不想把 100 个列名全部写出来,也有一个快捷方法:
val colNames = df.schema.fieldNames
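A rough sketch of how this shortcut can be combined with the index-based selection from the question (colNos reuses the question's example indexes and assumes the dataframe really has at least 16 columns):
下面粗略演示如何把这个快捷方法与题目中按索引选择结合起来(colNos 沿用题目里的示例索引,并假设数据框确实至少有 16 列):
// All column names, in positional order
val colNames = df.schema.fieldNames

// Pick the names at the wanted indexes, then select by name as suggested above
val colNos = Seq(10, 12, 13, 14, 15)
val selectColNames = colNos.map(colNames(_))
val selectCols = selectColNames.map(name => df.col(name))
df.select(selectCols: _*)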
回答 by the775
Example: Grab first 14 columns of Spark Dataframe by Index using Scala.
示例:使用 Scala 按索引抓取 Spark Dataframe 的前 14 列。
import org.apache.spark.sql.functions.col
// Gives array of names by index (first 14 cols for example)
val sliceCols = df.columns.slice(0, 14)
// Maps names & selects columns in dataframe
val subset_df = df.select(sliceCols.map(name=>col(name)):_*)
You cannot simply do this (as I tried and failed):
你不能简单地这样做(因为我尝试过但失败了):
// Gives array of names by index (first 14 cols for example)
val sliceCols = df.columns.slice(0, 14)
// Maps names & selects columns in dataframe
val subset_df = df.select(sliceCols)
The reason is that you have to convert your datatype of Array[String] to Array[org.apache.spark.sql.Column] in order for the slicing to work.
原因是您必须将 Array[String] 的数据类型转换为 Array[org.apache.spark.sql.Column] 才能进行切片。
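As a side note, the String varargs overload of select also accepts the names directly, so under the same setup a sketch like this should work as well:
顺带一提,select 的 String 可变参数重载也可以直接接收列名,因此在同样的前提下,类似下面的写法应该也可行:
// Gives array of names by index (first 14 cols for example)
val sliceCols = df.columns.slice(0, 14)
// Pass the names through select(String, String*) instead of converting to Column
val subset_df = df.select(sliceCols.head, sliceCols.tail: _*)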
OR Wrap it in a function using Currying (high five to my colleague for this):
或使用 Currying 将其包装在一个函数中(为此向我的同事致敬):
import org.apache.spark.sql.DataFrame

// Subsets the DataFrame to the columns in the [beg_val, end_val) index range.
def subset_frame(beg_val: Int = 0, end_val: Int)(df: DataFrame): DataFrame = {
  val sliceCols = df.columns.slice(beg_val, end_val)
  df.select(sliceCols.map(name => col(name)): _*)
}
// Get first 14 columns as subsetted dataframe
val subset_df: DataFrame = df_.transform(subset_frame(0, 14))
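Under the same assumptions, the curried pattern can also be adapted to an arbitrary set of indexes rather than a contiguous slice (the name subset_frame_at is invented for this sketch):
在同样的前提下,这个柯里化写法也可以改为接收任意的索引集合,而不只是连续切片(函数名 subset_frame_at 是本示例虚构的):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Subsets a DataFrame to the columns at the given positional indexes
def subset_frame_at(indexes: Seq[Int])(df: DataFrame): DataFrame =
  df.select(indexes.map(i => col(df.columns(i))): _*)

// e.g. keep only columns 10, 12, 13, 14 and 15 (assumes df has at least 16 columns)
val subset_at_df: DataFrame = df.transform(subset_frame_at(Seq(10, 12, 13, 14, 15)))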

