Python: How to change dataframe column names in pyspark?

Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/34077353/

Published: 2020-08-19 14:26:41 · Source: igfitidea

How to change dataframe column names in pyspark?

Tags: python, apache-spark, pyspark, pyspark-sql

Asked by Shubhanshu Mishra

I come from a pandas background and am used to reading data from CSV files into a dataframe and then changing the column names to something useful with the simple command:

df.columns = new_column_name_list

However, the same doesn't work for pyspark dataframes created using sqlContext. The only solution I could figure out to do this easily is the following:

# Load once with an inferred schema, rename the schema fields, then reload with the updated schema
df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', inferschema='true', delimiter='\t').load("data.txt")
oldSchema = df.schema
for i, k in enumerate(oldSchema.fields):
    k.name = new_column_name_list[i]
df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', delimiter='\t').load("data.txt", schema=oldSchema)

This essentially defines the variable twice: first the schema is inferred, then the field names are renamed, and finally the dataframe is loaded again with the updated schema.

Is there a better and more efficient way to do this, like we do in pandas?

My Spark version is 1.5.0.

Accepted answer by Alberto Bonsanto

There are many ways to do that:

  • Option 1. Using selectExpr.

    data = sqlContext.createDataFrame([("Alberto", 2), ("Dakota", 2)], 
                                      ["Name", "askdaosdka"])
    data.show()
    data.printSchema()
    
    # Output
    #+-------+----------+
    #|   Name|askdaosdka|
    #+-------+----------+
    #|Alberto|         2|
    #| Dakota|         2|
    #+-------+----------+
    
    #root
    # |-- Name: string (nullable = true)
    # |-- askdaosdka: long (nullable = true)
    
    df = data.selectExpr("Name as name", "askdaosdka as age")
    df.show()
    df.printSchema()
    
    # Output
    #+-------+---+
    #|   name|age|
    #+-------+---+
    #|Alberto|  2|
    #| Dakota|  2|
    #+-------+---+
    
    #root
    # |-- name: string (nullable = true)
    # |-- age: long (nullable = true)
    
  • Option 2. Using withColumnRenamed; note that this method lets you "overwrite" the same column. For Python 3, replace xrange with range (the Python 3 form is noted in the snippet below).

    from functools import reduce
    
    oldColumns = data.schema.names
    newColumns = ["name", "age"]
    
    df = reduce(lambda data, idx: data.withColumnRenamed(oldColumns[idx], newColumns[idx]), xrange(len(oldColumns)), data)
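    # Python 3 equivalent of the line above (xrange was removed in Python 3):
    # df = reduce(lambda data, idx: data.withColumnRenamed(oldColumns[idx], newColumns[idx]),
    #             range(len(oldColumns)), data)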
    df.printSchema()
    df.show()
    
  • Option 3. Using alias; in Scala you can also use as.

    from pyspark.sql.functions import col
    
    data = data.select(col("Name").alias("name"), col("askdaosdka").alias("age"))
    data.show()
    
    # Output
    #+-------+---+
    #|   name|age|
    #+-------+---+
    #|Alberto|  2|
    #| Dakota|  2|
    #+-------+---+
    
  • Option 4. Using sqlContext.sql, which lets you run SQL queries on DataFrames registered as tables.

    sqlContext.registerDataFrameAsTable(data, "myTable")
    df2 = sqlContext.sql("SELECT Name AS name, askdaosdka as age from myTable")
    
    df2.show()
    
    # Output
    #+-------+---+
    #|   name|age|
    #+-------+---+
    #|Alberto|  2|
    #| Dakota|  2|
    #+-------+---+
    

Answer by Pankaj Kumar

df = df.withColumnRenamed("colName", "newColName") \
       .withColumnRenamed("colName2", "newColName2")

The advantage of this approach: with a long list of columns, you may want to change only a few column names, which makes this very convenient. It is also very useful when joining tables that have duplicate column names; a sketch of that scenario follows.
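
A minimal sketch of the join scenario (df1, df2, the clashing name column, and the id join key are all hypothetical):

df2 = df2.withColumnRenamed("name", "name_right")  # rename the clashing column before the join
joined = df1.join(df2, on="id")                    # "id" is the assumed join key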

Answer by Ratul Ghosh

If you want to rename a single column and keep the rest as it is:

from pyspark.sql.functions import col
new_df = old_df.select(*[col(s).alias(new_name) if s == column_to_change else s for s in old_df.columns])

Answer by user8117731

If you want to change all column names, try df.toDF(*cols); a minimal sketch follows.
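
For example (the names in cols are purely illustrative; supply one new name per existing column, in order):

cols = ['name', 'age']  # hypothetical new names
df = df.toDF(*cols)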

Answer by ganeiy

For a single column rename, you can still use toDF(). For example,

df1.selectExpr("SALARY*2").toDF("REVISED_SALARY").show()

Answer by mike

I use this one:

from pyspark.sql.functions import col
df.select(['vin',col('timeStamp').alias('Date')]).show()

Answer by pbahr

In case you would like to apply a simple transformation to all column names, this code does the trick (here I am replacing all spaces with underscores):

new_column_name_list = list(map(lambda x: x.replace(" ", "_"), df.columns))

df = df.toDF(*new_column_name_list)

Thanks to @user8117731 for the toDF trick.

Answer by scottlittle

Another way to rename just one column (using import pyspark.sql.functions as F):

df = df.select('*', F.col('count').alias('new_count')).drop('count')

Answer by Sahan Jayasumana

df.withColumnRenamed('age', 'age2')
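
Note that withColumnRenamed returns a new DataFrame rather than modifying the original in place, so assign the result if you want to keep it:

df = df.withColumnRenamed('age', 'age2')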

Answer by Grant Shannon

This is the approach that I used:

Create a PySpark session:

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('changeColNames').getOrCreate()

Create the dataframe:

df = spark.createDataFrame(data=[('Bob', 5.62, 'juice'), ('Sue', 0.85, 'milk')], schema=["Name", "Amount", "Item"])

View the df with its column names:

df.show()
+----+------+-----+
|Name|Amount| Item|
+----+------+-----+
| Bob|  5.62|juice|
| Sue|  0.85| milk|
+----+------+-----+

Create a list with the new column names:

newcolnames = ['NameNew','AmountNew','ItemNew']

Change the column names of the df:

for c, n in zip(df.columns, newcolnames):
    df = df.withColumnRenamed(c, n)

View the df with the new column names:

df.show()
+-------+---------+-------+
|NameNew|AmountNew|ItemNew|
+-------+---------+-------+
|    Bob|     5.62|  juice|
|    Sue|     0.85|   milk|
+-------+---------+-------+