scala 基于列索引的 Spark Dataframe 选择
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/43553803/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me):
StackOverFlow
Spark Dataframe select based on column index
提问 by Vikas J
How do I select all the columns of a dataframe that have certain indexes in Scala?
如何选择在 Scala 中具有特定索引的数据帧的所有列?
For example, if a dataframe has 100 columns and I want to extract only columns (10, 12, 13, 14, 15), how do I do that?
例如,如果一个数据框有 100 列,而我只想提取第 (10,12,13,14,15) 列,该如何实现?
The following selects all columns from dataframe df whose names appear in the Array colNames:
下面从数据框 df 中选择列名出现在数组 colNames 中的所有列:
df = df.select(colNames.head, colNames.tail: _*)
Similarly, if there is a colNos array which contains
类似地,如果有一个 colNos 数组,其内容为
colNos = Array(10,20,25,45)
how do I transform the above df.select to fetch only the columns at those specific indexes?
如何改写上述 df.select,使其只获取特定索引处的那些列?
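For reference, a minimal runnable sketch of the name-based pattern above; the SparkSession setup, DataFrame, and column names here are made up purely for illustration:
作为参考,下面是上述基于列名写法的一个最小可运行示例;其中的 SparkSession、数据框和列名都只是为演示而虚构的:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Toy DataFrame for illustration only
val df = Seq((1, "a", 2.0), (2, "b", 3.0)).toDF("id", "label", "score")

// Name-based selection, as already used above
val colNames = Array("id", "score")
val selected = df.select(colNames.head, colNames.tail: _*)
selected.show()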
回答 by zero323
You can map over columns:
您可以对 columns 做 map:
import org.apache.spark.sql.functions.col
df.select(colNos map df.columns map col: _*)
or:
或者:
df.select(colNos map (df.columns andThen col): _*)
or:
或者:
df.select(colNos map (col _ compose df.columns): _*)
All the methods shown above are equivalent and don't impose a performance penalty. The following mapping:
上面显示的所有方法都是等效的,不会造成性能损失。以下映射:
colNos map df.columns
is just a local Array access (constant-time access for each index), and choosing between the String- or Column-based variant of select doesn't affect the execution plan:
只是一次本地 Array 访问(每个索引的访问都是常数时间),并且在 select 的 String 变体和 Column 变体之间做选择不会影响执行计划:
val df = Seq((1, 2, 3, 4, 5, 6)).toDF
val colNos = Seq(0, 3, 5)
df.select(colNos map df.columns map col: _*).explain
== Physical Plan ==
LocalTableScan [_1#46, _4#49, _6#51]
df.select("_1", "_4", "_6").explain
== Physical Plan ==
LocalTableScan [_1#46, _4#49, _6#51]
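Putting this together, a small helper that selects columns by position could look like the following sketch (the name selectByIndex is invented here, not part of the Spark API):
综合以上内容,一个按位置选择列的小辅助函数大致可以写成下面这样(函数名 selectByIndex 是这里虚构的,并非 Spark API 的一部分):
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

// Hypothetical helper: select the columns of df at the given positional indexes
def selectByIndex(df: DataFrame, indexes: Seq[Int]): DataFrame = {
  val cols: Seq[Column] = indexes.map(i => col(df.columns(i)))
  df.select(cols: _*)
}

// e.g. selectByIndex(df, Seq(0, 3, 5)) produces the same plan as df.select("_1", "_4", "_6") above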
回答 by Ramesh Maharjan
@user6910411's answer above works like a charm and the number of tasks/logical plan is similar to my approach below. BUT my approach is a bit faster.
So,
I would suggest you go with the column names rather than column numbers. Column names are much safer and much lighter than using numbers. You can use the following solution:
@user6910411 上面的回答非常好用,任务数/逻辑计划与我下面的方法相似。但是我的方法要快一些。
所以,
我建议你使用列名(column names)而不是列编号(column numbers)。列名比编号更安全、更轻量。您可以使用以下解决方案:
val colNames = Seq("col1", "col2" ...... "col99", "col100")
val selectColNames = Seq("col1", "col3", .... selected column names ... )
val selectCols = selectColNames.map(name => df.col(name))
df = df.select(selectCols:_*)
If you are hesitant to write out all 100 column names, then there is a shortcut method too:
如果您不想把 100 个列名全部写出来,也有一个快捷方法:
val colNames = df.schema.fieldNames
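A rough sketch of how this shortcut can be combined with the index-based selection from the question (colNos reuses the question's example indexes and assumes the dataframe really has at least 16 columns):
下面粗略演示如何把这个快捷方法与题目中按索引选择结合起来(colNos 沿用题目里的示例索引,并假设数据框确实至少有 16 列):
// All column names, in positional order
val colNames = df.schema.fieldNames

// Pick the names at the wanted indexes, then select by name as suggested above
val colNos = Seq(10, 12, 13, 14, 15)
val selectColNames = colNos.map(colNames(_))
val selectCols = selectColNames.map(name => df.col(name))
df.select(selectCols: _*)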
回答 by the775
Example: Grab first 14 columns of Spark Dataframe by Index using Scala.
示例:使用 Scala 按索引抓取 Spark Dataframe 的前 14 列。
import org.apache.spark.sql.functions.col
// Gives array of names by index (first 14 cols for example)
val sliceCols = df.columns.slice(0, 14)
// Maps names & selects columns in dataframe
val subset_df = df.select(sliceCols.map(name=>col(name)):_*)
You cannot simply do this (as I tried and failed):
你不能简单地这样做(因为我尝试过但失败了):
// Gives array of names by index (first 14 cols for example)
val sliceCols = df.columns.slice(0, 14)
// Maps names & selects columns in dataframe
val subset_df = df.select(sliceCols)
The reason is that you have to convert your datatype of Array[String] to Array[org.apache.spark.sql.Column] in order for the slicing to work.
原因是您必须将 Array[String] 的数据类型转换为 Array[org.apache.spark.sql.Column] 才能进行切片。
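As a side note, the String varargs overload of select also accepts the names directly, so under the same setup a sketch like this should work as well:
顺带一提,select 的 String 可变参数重载也可以直接接收列名,因此在同样的前提下,类似下面的写法应该也可行:
// Gives array of names by index (first 14 cols for example)
val sliceCols = df.columns.slice(0, 14)
// Pass the names through select(String, String*) instead of converting to Column
val subset_df = df.select(sliceCols.head, sliceCols.tail: _*)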
OR Wrap it in a function using Currying (high five to my colleague for this):
或使用 Currying 将其包装在一个函数中(为此向我的同事致敬):
import org.apache.spark.sql.DataFrame

// Subsets the DataFrame to the columns in the [beg_val, end_val) index range.
def subset_frame(beg_val: Int = 0, end_val: Int)(df: DataFrame): DataFrame = {
  val sliceCols = df.columns.slice(beg_val, end_val)
  df.select(sliceCols.map(name => col(name)): _*)
}
// Get first 14 columns as subsetted dataframe
val subset_df: DataFrame = df_.transform(subset_frame(0, 14))
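Under the same assumptions, the curried pattern can also be adapted to an arbitrary set of indexes rather than a contiguous slice (the name subset_frame_at is invented for this sketch):
在同样的前提下,这个柯里化写法也可以改为接收任意的索引集合,而不只是连续切片(函数名 subset_frame_at 是本示例虚构的):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Subsets a DataFrame to the columns at the given positional indexes
def subset_frame_at(indexes: Seq[Int])(df: DataFrame): DataFrame =
  df.select(indexes.map(i => col(df.columns(i))): _*)

// e.g. keep only columns 10, 12, 13, 14 and 15 (assumes df has at least 16 columns)
val subset_at_df: DataFrame = df.transform(subset_frame_at(Seq(10, 12, 13, 14, 15)))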

