Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/31538624/
Is it possible to alias columns programmatically in spark sql?
Asked by Prikso NAI
In spark SQL (perhaps only HiveQL) one can do:
select sex, avg(age) as avg_age
from humans
group by sex
which would result in a DataFrame with columns named "sex" and "avg_age".
How can avg(age) be aliased to "avg_age" without using textual SQL?
Edit: After zero323's answer, I need to add the constraint that:
The column-to-be-renamed's name may not be known/guaranteed or even addressable. In textual SQL, using "select EXPR as NAME" removes the requirement to have an intermediate name for EXPR. This is also the case in the example above, where "avg(age)" could get a variety of auto-generated names (which also vary among spark releases and sql-context backends).
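One way around this constraint (a sketch, not taken from the answers below) is to rename the aggregate column by position rather than by name, so the auto-generated name never needs to be known. The names `df` and `renameLast` are assumptions for illustration:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.avg

// Rename the last column of a DataFrame, whatever its generated name is.
def renameLast(df: DataFrame, newName: String): DataFrame =
  df.withColumnRenamed(df.columns.last, newName)

// After groupBy/agg the columns are ("sex", <auto-generated aggregate name>),
// so the aggregate is always last regardless of the naming strategy.
val agged = df.groupBy("sex").agg(avg("age"))
val result = renameLast(agged, "avg_age")
```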
Accepted answer by Prikso NAI
Turns out def toDF(colNames: String*): DataFrame does exactly that. Pasting from the 2.11.7 documentation:
def toDF(colNames: String*): DataFrame
Returns a new DataFrame with columns renamed. This can be quite
convenient in conversion from a RDD of tuples into a DataFrame
with meaningful names. For example:
val rdd: RDD[(Int, String)] = ...
rdd.toDF() // this implicit conversion creates a DataFrame
// with column name _1 and _2
rdd.toDF("id", "name") // this creates a DataFrame with
// column name "id" and "name"
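Applied to the question's example, this looks roughly as follows (a sketch, assuming `humans` is the DataFrame from the question; note that toDF renames all columns positionally, so every name must be supplied):

```scala
import org.apache.spark.sql.functions.avg

// toDF replaces every column name in order, sidestepping the
// auto-generated name of the aggregate column entirely.
val result = humans.groupBy("sex").agg(avg("age")).toDF("sex", "avg_age")
```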
Answered by Robert Chevallier
Let's suppose human_df is the DataFrame for humans. Since Spark 1.3:
import org.apache.spark.sql.functions.avg

human_df.groupBy("sex").agg(avg("age").alias("avg_age"))
Answered by zero323
If you prefer to rename a single column it is possible to use the withColumnRenamed method:
case class Person(name: String, age: Int)
val df = sqlContext.createDataFrame(
Person("Alice", 2) :: Person("Bob", 5) :: Nil)
df.withColumnRenamed("name", "first_name")
Alternatively you can use the alias method:
import org.apache.spark.sql.functions.avg
import sqlContext.implicits._ // for the $"..." column syntax

df.select(avg($"age").alias("average_age"))
You can take it further with a small helper:
import org.apache.spark.sql.Column

def normalizeName(c: Column) = {
  val pattern = "\\W+".r
  c.alias(pattern.replaceAllIn(c.toString, "_"))
}
df.select(normalizeName(avg($"age")))
Answered by Sim
Anonymous columns, such as the one that would be generated by avg(age) without AS avg_age, get automatically assigned names. As you point out in your question, the names are implementation-specific, generated by a naming strategy. If needed, you could write code that sniffs the environment and instantiates an appropriate discovery & renaming strategy based on the specific naming strategy. There are not many of them.
In Spark 1.4.1 with HiveContext, the format is "_cN" where N is the position of the anonymous column in the table. In your case, the name would be _c1.
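A discovery-and-rename strategy for that "_cN" convention could be sketched like this (a hedged illustration, not from the original answer; the regex and the `renameAnon` helper are assumptions tied to the HiveContext naming described above):

```scala
import org.apache.spark.sql.DataFrame

// Matches auto-generated names like "_c1" and captures the position.
val anon = "_c(\\d+)".r

// Rename any "_cN" column for which a meaningful name was supplied;
// columns that don't match the pattern are left untouched.
def renameAnon(df: DataFrame, names: Map[Int, String]): DataFrame =
  df.columns.foldLeft(df) {
    case (acc, col @ anon(idx)) =>
      names.get(idx.toInt).map(acc.withColumnRenamed(col, _)).getOrElse(acc)
    case (acc, _) => acc
  }

// e.g. renameAnon(grouped, Map(1 -> "avg_age")) renames _c1 to avg_age
```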

