
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/35340390/

Date: 2020-10-22 08:00:39  Source: igfitidea

how to select all columns that starts with a common label

scala, apache-spark, spark-dataframe

Asked by user299791

I have a dataframe in Spark 1.6 and want to select just some columns out of it. The column names are like:

colA, colB, colC, colD, colE, colF-0, colF-1, colF-2

I know I can do like this to select specific columns:

df.select("colA", "colB", "colE")

but how do I select, say, "colA", "colB" and all the colF-* columns at once? Is there a way to do this as in Pandas?

Answered by Michael Lloyd Lee mlk

First grab the column names with df.columns, then filter down to just the names you want with .filter(_.startsWith("colF")). This gives you an Array[String], but select takes either select(String, String*) or, for Columns, select(Column*). So convert the Strings into Columns with .map(df(_)), then expand the resulting Array of Columns into varargs with : _*.

df.select(df.columns.filter(_.startsWith("colF")).map(df(_)) : _*).show
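The same pipeline can be traced on plain Scala collections, with no Spark required; here `columns` is a hypothetical stand-in for `df.columns`, and the toy `select` only mimics the varargs signature:

```scala
// Stand-in for df.columns: an ordinary Array[String] with made-up names.
val columns = Array("colA", "colB", "colC", "colF-0", "colF-1", "colF-2")

// Step 1: keep only the names that share the common prefix.
val fNames = columns.filter(_.startsWith("colF"))

// Step 2 (in Spark): map each name to a Column and expand as varargs:
//   df.select(fNames.map(df(_)) : _*)
// Toy stand-in for the varargs signature, to show how `: _*` expands an Array:
def select(cols: String*): Seq[String] = cols
val selected = select(fNames: _*)
```

The `: _*` ascription is what turns the `Array` into the variable-length argument list that `select(Column*)` expects.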

This filter could be made more complex (just as in Pandas). It is, however, a rather ugly solution (IMO):

df.select(df.columns.filter(x => (x.equals("colA") || x.startsWith("colF"))).map(df(_)) : _*).show 

If the list of other columns is fixed you could also merge a fixed array of columns names with filtered array.

df.select((Array("colA", "colB") ++ df.columns.filter(_.startsWith("colF"))).map(df(_)) : _*).show
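The merge of a fixed name list with the filtered one can likewise be checked on plain collections (again `columns` is a hypothetical stand-in for `df.columns`):

```scala
val columns = Array("colA", "colB", "colC", "colF-0", "colF-1")

// Fixed names first, then everything matching the prefix:
val wanted = Array("colA", "colB") ++ columns.filter(_.startsWith("colF"))

// In Spark this would then become: df.select(wanted.map(df(_)) : _*)
```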

Answered by kfkhalili

I wrote a function that does that. Read the comments to see how it works.

  /**
    * Given a sequence of prefixes, select suitable columns from [[DataFrame]]
    * @param columnPrefixes Sequence of prefixes
    * @param dF Incoming [[DataFrame]]
    * @return [[DataFrame]] with prefixed columns selected
    */
  def selectPrefixedColumns(columnPrefixes: Seq[String], dF: DataFrame): DataFrame = {
    // Find out if a given column name matches any of the provided prefixes
    def colNameStartsWith(colName: String): Boolean =
      columnPrefixes.exists(prefix => colName.startsWith(prefix))
    // Filter the columns list by checking against the given prefix sequence
    val columns = dF.columns.filter(colNameStartsWith)
    // Select the filtered columns (note: columns.head throws if nothing matches)
    dF.select(columns.head, columns.tail: _*)
  }
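The multi-prefix predicate at the heart of that function can be exercised without Spark; a minimal sketch with made-up column names:

```scala
val prefixes = Seq("colA", "colF")

// True if colName starts with any of the prefixes -- the same check the
// function's inner helper performs against df.columns.
def matchesAnyPrefix(colName: String): Boolean =
  prefixes.exists(prefix => colName.startsWith(prefix))

val kept = Array("colA", "colB", "colC", "colF-0").filter(matchesAnyPrefix)
```

Against a real DataFrame the call would look like `selectPrefixedColumns(Seq("colA", "colF"), df)`; since `columns.head` throws on an empty result, callers may want to guard against no column matching any prefix.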