Dropping multiple columns from a Spark DataFrame by iterating through the columns in a Scala List of column names

Question

Asked by Ramesh

I have a DataFrame with around 400 columns, and I want to drop 100 of them. I have created a Scala List of the 100 column names, and I want to iterate over it in a for loop, dropping one column in each iteration.

Below is the code.

final val dropList: List[String] = List("Col1","Col2",...."Col100")

def drpColsfunc(inputDF: DataFrame): DataFrame = { 
    for (i <- 0 to dropList.length - 1) {
        val returnDF = inputDF.drop(dropList(i))
    }
    return returnDF
}

val test_df = drpColsfunc(input_dataframe) 

test_df.show(5)

Answer 1

Answered by Ricky McMaster

If you only need to drop several named columns, rather than select them by some condition, you can simply do the following:

df.drop("colA", "colB", "colC")

Answer 2

Answered by Ramesh

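Answer:

import org.apache.spark.sql.Column

val colsToRemove = Seq("colA", "colB", "colC") // and so on

// Keep only the columns whose names are not in colsToRemove
val filteredDF = df.select(
  df.columns
    .filter(colName => !colsToRemove.contains(colName))
    .map(colName => new Column(colName)): _*
)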
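
The name-filtering step in this answer can be tried on its own, without a Spark session. The following is only a minimal sketch: `allColumns` is a hypothetical stand-in for `df.columns`, the object name `KeepColumns` is made up for illustration, and a `Set` is swapped in for the `Seq` so that each membership check does not scan the whole collection.

```scala
object KeepColumns {
  // Hypothetical column names standing in for df.columns
  val allColumns: Array[String] = Array("id", "colA", "name", "colB", "colC")

  // Names to drop, held in a Set (an assumption; the answer uses a Seq)
  val colsToRemove: Set[String] = Set("colA", "colB", "colC")

  // Keep every name that is not in colsToRemove, preserving the original order
  val kept: Array[String] = allColumns.filterNot(colsToRemove.contains)

  def main(args: Array[String]): Unit =
    println(kept.mkString(", "))
}
```

With a real DataFrame, the resulting names would then be mapped to columns and passed to `select`, as in the answer.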

Answer 3

Answered by mr59

This should work fine:

// Given dropList: List[String] and df: DataFrame
val test_df = df.drop(dropList: _*)

Answer 4

Answered by Fahad Siddiqui

You can just do:

def dropColumns(inputDF: DataFrame, dropList: List[String]): DataFrame = 
    dropList.foldLeft(inputDF)((df, col) => df.drop(col))

It returns the DataFrame without the columns listed in dropList.

As an example of what is happening behind the scenes, consider the following:

scala> val list = List(0, 1, 2, 3, 4, 5, 6, 7)
list: List[Int] = List(0, 1, 2, 3, 4, 5, 6, 7)

scala> val removeThese = List(0, 2, 3)
removeThese: List[Int] = List(0, 2, 3)

scala> removeThese.foldLeft(list)((l, r) => l.filterNot(_ == r))
res2: List[Int] = List(1, 4, 5, 6, 7)

The returned list (the analogue of your DataFrame) is the result after the last filter. After each step of the fold, the current result is passed as the accumulator to the next call of the function (_, _) => _.
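
To see each intermediate result rather than only the final one, the same function can be passed to `scanLeft`, which keeps every accumulator value. This is a minimal sketch reusing the `list` and `removeThese` values from the REPL session; the object name `FoldTrace` and the value `steps` are made up for illustration.

```scala
object FoldTrace {
  val list = List(0, 1, 2, 3, 4, 5, 6, 7)
  val removeThese = List(0, 2, 3)

  // scanLeft applies the same function as foldLeft, but collects the
  // starting value and the accumulator after every step
  val steps: List[List[Int]] =
    removeThese.scanLeft(list)((acc, r) => acc.filterNot(_ == r))

  def main(args: Array[String]): Unit =
    steps.foreach(println)
}
```

The last element of `steps` equals the `foldLeft` result. In the DataFrame version, each step is one `df.drop(col)` call applied to the DataFrame returned by the previous step.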
