Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/31538624/
Is it possible to alias columns programmatically in spark sql?
Asked by Prikso NAI
In spark SQL (perhaps only HiveQL) one can do:
select sex, avg(age) as avg_age
from humans
group by sex
which would result in a DataFrame with columns named "sex" and "avg_age".
How can avg(age) be aliased to "avg_age" without using textual SQL?
Edit: After zero323's answer, I need to add the constraint that:
The column-to-be-renamed's name may not be known/guaranteed or even addressable. In textual SQL, using "select EXPR as NAME" removes the requirement to have an intermediate name for EXPR. This is also the case in the example above, where "avg(age)" could get a variety of auto-generated names (which also vary among spark releases and sql-context backends).
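One way around this constraint (a sketch, not taken from the answers below) is to rename the aggregate column by position rather than by name, so the auto-generated name never needs to be known. The names `df` and `renameLast` are assumptions for illustration:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.avg

// Rename the last column of a DataFrame, whatever its generated name is.
def renameLast(df: DataFrame, newName: String): DataFrame =
  df.withColumnRenamed(df.columns.last, newName)

// After groupBy/agg the columns are ("sex", <auto-generated aggregate name>),
// so the aggregate is always last regardless of the naming strategy.
val agged = df.groupBy("sex").agg(avg("age"))
val result = renameLast(agged, "avg_age")
```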
Accepted answer by Prikso NAI
Turns out def toDF(colNames: String*): DataFrame does exactly that. Pasting from the 2.11.7 documentation:
def toDF(colNames: String*): DataFrame
Returns a new DataFrame with columns renamed. This can be quite
convenient in conversion from a RDD of tuples into a DataFrame
with meaningful names. For example:
val rdd: RDD[(Int, String)] = ...
rdd.toDF() // this implicit conversion creates a DataFrame
// with column name _1 and _2
rdd.toDF("id", "name") // this creates a DataFrame with
// column name "id" and "name"
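Applied to the question's example, this looks roughly as follows (a sketch, assuming `humans` is the DataFrame from the question; note that toDF renames all columns positionally, so every name must be supplied):

```scala
import org.apache.spark.sql.functions.avg

// toDF replaces every column name in order, sidestepping the
// auto-generated name of the aggregate column entirely.
val result = humans.groupBy("sex").agg(avg("age")).toDF("sex", "avg_age")
```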
Answered by Robert Chevallier
Let's suppose human_df is the DataFrame for humans. Since Spark 1.3:
import org.apache.spark.sql.functions.avg

human_df.groupBy("sex").agg(avg("age").alias("avg_age"))
Answered by zero323
If you prefer to rename a single column it is possible to use the withColumnRenamed method:
case class Person(name: String, age: Int)
val df = sqlContext.createDataFrame(
Person("Alice", 2) :: Person("Bob", 5) :: Nil)
df.withColumnRenamed("name", "first_name")
Alternatively you can use the alias method:
import org.apache.spark.sql.functions.avg
import sqlContext.implicits._ // for the $"..." column syntax

df.select(avg($"age").alias("average_age"))
You can take it further with a small helper:
import org.apache.spark.sql.Column

def normalizeName(c: Column) = {
  val pattern = "\\W+".r
  c.alias(pattern.replaceAllIn(c.toString, "_"))
}
df.select(normalizeName(avg($"age")))
Answered by Sim
Anonymous columns, such as the one that would be generated by avg(age) without AS avg_age, get automatically assigned names. As you point out in your question, the names are implementation-specific, generated by a naming strategy. If needed, you could write code that sniffs the environment and instantiates an appropriate discovery & renaming strategy based on the specific naming strategy. There are not many of them.
In Spark 1.4.1 with HiveContext, the format is "_cN" where N is the position of the anonymous column in the table. In your case, the name would be _c1.
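A discovery-and-rename strategy for that "_cN" convention could be sketched like this (a hedged illustration, not from the original answer; the regex and the `renameAnon` helper are assumptions tied to the HiveContext naming described above):

```scala
import org.apache.spark.sql.DataFrame

// Matches auto-generated names like "_c1" and captures the position.
val anon = "_c(\\d+)".r

// Rename any "_cN" column for which a meaningful name was supplied;
// columns that don't match the pattern are left untouched.
def renameAnon(df: DataFrame, names: Map[Int, String]): DataFrame =
  df.columns.foldLeft(df) {
    case (acc, col @ anon(idx)) =>
      names.get(idx.toInt).map(acc.withColumnRenamed(col, _)).getOrElse(acc)
    case (acc, _) => acc
  }

// e.g. renameAnon(grouped, Map(1 -> "avg_age")) renames _c1 to avg_age
```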

