Dropping multiple columns from a Spark DataFrame by iterating through a Scala List of column names
Original source: http://stackoverflow.com/questions/39786733/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
Stack Overflow
Asked by Ramesh
I have a DataFrame with around 400 columns, and I need to drop 100 of them. I have created a Scala List of the 100 column names, and I want to iterate over it in a for loop, dropping one column in each iteration.
Below is the code.
final val dropList: List[String] = List("Col1", "Col2", /* ... */ "Col100")
def drpColsfunc(inputDF: DataFrame): DataFrame = {
for (i <- 0 to dropList.length - 1) {
val returnDF = inputDF.drop(dropList(i))
}
return returnDF
}
val test_df = drpColsfunc(input_dataframe)
test_df.show(5)
Answered by Ricky McMaster
If all you need is to drop several named columns, rather than select columns by some condition, you can simply do the following:
df.drop("colA", "colB", "colC")
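For a runnable illustration of the call above, here is a minimal sketch. It assumes Spark 2.0 or later (where `drop` accepts several column names) and a local SparkSession; the sample data, the helper `dropNamed`, and the object name `DropVarargsExample` are made up for this example. Names that are not in the schema should simply be ignored rather than raise an error.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DropVarargsExample {
  // Drops several named columns in one call.
  // "noSuchColumn" is deliberately not in the schema, to show it is ignored.
  def dropNamed(df: DataFrame): DataFrame =
    df.drop("colA", "colB", "noSuchColumn")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]") // assumption: local mode, for illustration only
      .appName("drop-varargs-example")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data with four columns.
    val df = Seq((1, "x", true, 2.0)).toDF("colA", "colB", "colC", "colD")

    val trimmed = dropNamed(df)
    println(trimmed.columns.mkString(", ")) // prints the remaining names: colC, colD

    spark.stop()
  }
}
```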
Answered by Ramesh
Answer:
import org.apache.spark.sql.Column
val colsToRemove = Seq("colA", "colB", "colC") // etc.
val filteredDF = df.select(df.columns.filter(colName => !colsToRemove.contains(colName))
  .map(colName => new Column(colName)): _*)
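The same idea as a self-contained sketch, assuming a local SparkSession and made-up sample data. It uses `col` from `org.apache.spark.sql.functions` instead of `new Column(...)` and keeps the names to remove in a `Set`; both are stylistic choices of this example, not part of the answer above. The helper name `selectRemaining` is hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object SelectRemainingExample {
  // Keeps every column whose name is not in colsToRemove,
  // preserving the original column order.
  def selectRemaining(df: DataFrame, colsToRemove: Set[String]): DataFrame = {
    val kept = df.columns
      .filterNot(name => colsToRemove.contains(name))
      .map(name => col(name))
    df.select(kept: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]") // assumption: local mode, for illustration only
      .appName("select-remaining-example")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data with four columns.
    val df = Seq((1, "x", true, 2.0)).toDF("colA", "colB", "colC", "colD")

    val filteredDF = selectRemaining(df, Set("colA", "colC"))
    println(filteredDF.columns.mkString(", ")) // prints the remaining names: colB, colD

    spark.stop()
  }
}
```

Unlike `drop`, this builds the result from the columns to keep, so the same `filterNot` step can be swapped for any other condition on the column name.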
Answered by mr59
This should work fine:
// given dropList: List[String] and df: DataFrame
val test_df = df.drop(dropList: _*)
Answered by Fahad Siddiqui
You can simply do:
def dropColumns(inputDF: DataFrame, dropList: List[String]): DataFrame =
dropList.foldLeft(inputDF)((df, col) => df.drop(col))
It returns the DataFrame without the columns passed in dropList.
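A sketch of how this function might be used, assuming a local SparkSession and a small made-up DataFrame (the object name `FoldDropExample` and the sample column names are hypothetical). Because a DataFrame is immutable, each `drop` returns a new DataFrame, and `foldLeft` threads that result into the next step. A loop that always calls `drop` on the original input would only ever remove one column from it.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object FoldDropExample {
  // Starts from inputDF and drops one column per step,
  // feeding each intermediate DataFrame into the next drop.
  def dropColumns(inputDF: DataFrame, dropList: List[String]): DataFrame =
    dropList.foldLeft(inputDF)((acc, name) => acc.drop(name))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]") // assumption: local mode, for illustration only
      .appName("fold-drop-example")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for the 400-column input.
    val inputDF = Seq((1, "x", true, 2.0)).toDF("Col1", "Col2", "Col3", "Col4")
    val dropList = List("Col1", "Col3")

    val testDF = dropColumns(inputDF, dropList)
    println(testDF.columns.mkString(", ")) // prints the remaining names: Col2, Col4

    spark.stop()
  }
}
```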
To illustrate what happens behind the scenes, consider this example:
scala> val list = List(0, 1, 2, 3, 4, 5, 6, 7)
list: List[Int] = List(0, 1, 2, 3, 4, 5, 6, 7)
scala> val removeThese = List(0, 2, 3)
removeThese: List[Int] = List(0, 2, 3)
scala> removeThese.foldLeft(list)((l, r) => l.filterNot(_ == r))
res2: List[Int] = List(1, 4, 5, 6, 7)
The returned list (which corresponds to your DataFrame) is the result of the last filtering step. After each step of the fold, the latest result is passed as the first argument to the next call of the function (_, _) => _.
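To make the intermediate results of the fold above visible, one option is `scanLeft`, which applies the same function as `foldLeft` but keeps every accumulated value. This is a plain-Scala sketch with the same lists as the REPL session; the object name `FoldTraceExample` and the helper `steps` are made up for illustration.

```scala
object FoldTraceExample {
  // Returns the starting list followed by the result after each removal.
  // The last element equals what foldLeft would return.
  def steps(list: List[Int], removeThese: List[Int]): List[List[Int]] =
    removeThese.scanLeft(list)((l, r) => l.filterNot(_ == r))

  def main(args: Array[String]): Unit = {
    val list = List(0, 1, 2, 3, 4, 5, 6, 7)
    val removeThese = List(0, 2, 3)

    steps(list, removeThese).foreach(println)
    // List(0, 1, 2, 3, 4, 5, 6, 7)   start
    // List(1, 2, 3, 4, 5, 6, 7)      after removing 0
    // List(1, 3, 4, 5, 6, 7)         after removing 2
    // List(1, 4, 5, 6, 7)            after removing 3
  }
}
```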

