Python: How to change dataframe column names in pyspark?

Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/34077353/

Published: 2020-08-19 14:26:41 · Source: igfitidea

How to change dataframe column names in pyspark?

Tags: python, apache-spark, pyspark, pyspark-sql

Asked by Shubhanshu Mishra

I come from a pandas background and am used to reading data from CSV files into a dataframe and then changing the column names to something useful with the simple command:

df.columns = new_column_name_list

However, the same doesn't work for pyspark dataframes created using sqlContext. The only solution I could figure out to do this easily is the following:

# Load once with an inferred schema, rename the schema fields, then reload with the updated schema
df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', inferschema='true', delimiter='\t').load("data.txt")
oldSchema = df.schema
for i, k in enumerate(oldSchema.fields):
    k.name = new_column_name_list[i]
df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', delimiter='\t').load("data.txt", schema=oldSchema)

This essentially defines the variable twice: first the schema is inferred, then the field names are renamed, and finally the dataframe is loaded again with the updated schema.

Is there a better and more efficient way to do this, like we do in pandas?

My Spark version is 1.5.0.

Accepted answer by Alberto Bonsanto

There are many ways to do that:

  • Option 1. Using selectExpr.

    data = sqlContext.createDataFrame([("Alberto", 2), ("Dakota", 2)], 
                                      ["Name", "askdaosdka"])
    data.show()
    data.printSchema()
    
    # Output
    #+-------+----------+
    #|   Name|askdaosdka|
    #+-------+----------+
    #|Alberto|         2|
    #| Dakota|         2|
    #+-------+----------+
    
    #root
    # |-- Name: string (nullable = true)
    # |-- askdaosdka: long (nullable = true)
    
    df = data.selectExpr("Name as name", "askdaosdka as age")
    df.show()
    df.printSchema()
    
    # Output
    #+-------+---+
    #|   name|age|
    #+-------+---+
    #|Alberto|  2|
    #| Dakota|  2|
    #+-------+---+
    
    #root
    # |-- name: string (nullable = true)
    # |-- age: long (nullable = true)
    
  • Option 2. Using withColumnRenamed; note that this method lets you "overwrite" the same column. For Python 3, replace xrange with range (the Python 3 form is noted in the snippet below).

    from functools import reduce
    
    oldColumns = data.schema.names
    newColumns = ["name", "age"]
    
    df = reduce(lambda data, idx: data.withColumnRenamed(oldColumns[idx], newColumns[idx]), xrange(len(oldColumns)), data)
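    # Python 3 equivalent of the line above (xrange was removed in Python 3):
    # df = reduce(lambda data, idx: data.withColumnRenamed(oldColumns[idx], newColumns[idx]),
    #             range(len(oldColumns)), data)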
    df.printSchema()
    df.show()
    
  • Option 3. Using alias; in Scala you can also use as.

    from pyspark.sql.functions import col
    
    data = data.select(col("Name").alias("name"), col("askdaosdka").alias("age"))
    data.show()
    
    # Output
    #+-------+---+
    #|   name|age|
    #+-------+---+
    #|Alberto|  2|
    #| Dakota|  2|
    #+-------+---+
    
  • Option 4. Using sqlContext.sql, which lets you run SQL queries on DataFrames registered as tables.

    sqlContext.registerDataFrameAsTable(data, "myTable")
    df2 = sqlContext.sql("SELECT Name AS name, askdaosdka as age from myTable")
    
    df2.show()
    
    # Output
    #+-------+---+
    #|   name|age|
    #+-------+---+
    #|Alberto|  2|
    #| Dakota|  2|
    #+-------+---+
    

Answer by Pankaj Kumar

df = df.withColumnRenamed("colName", "newColName") \
       .withColumnRenamed("colName2", "newColName2")

The advantage of this approach: with a long list of columns, you may want to change only a few column names, which makes this very convenient. It is also very useful when joining tables that have duplicate column names; a sketch of that scenario follows.
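
A minimal sketch of the join scenario (df1, df2, the clashing name column, and the id join key are all hypothetical):

df2 = df2.withColumnRenamed("name", "name_right")  # rename the clashing column before the join
joined = df1.join(df2, on="id")                    # "id" is the assumed join key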

Answer by Ratul Ghosh

If you want to rename a single column and keep the rest as it is:

from pyspark.sql.functions import col
new_df = old_df.select(*[col(s).alias(new_name) if s == column_to_change else s for s in old_df.columns])

Answer by user8117731

If you want to change all column names, try df.toDF(*cols); a minimal sketch follows.
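
For example (the names in cols are purely illustrative; supply one new name per existing column, in order):

cols = ['name', 'age']  # hypothetical new names
df = df.toDF(*cols)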

Answer by ganeiy

For a single column rename, you can still use toDF(). For example,

df1.selectExpr("SALARY*2").toDF("REVISED_SALARY").show()

Answer by mike

I use this one:

from pyspark.sql.functions import col
df.select(['vin',col('timeStamp').alias('Date')]).show()

Answer by pbahr

In case you would like to apply a simple transformation to all column names, this code does the trick (here I am replacing all spaces with underscores):

new_column_name_list = list(map(lambda x: x.replace(" ", "_"), df.columns))

df = df.toDF(*new_column_name_list)

Thanks to @user8117731 for the toDF trick.

Answer by scottlittle

Another way to rename just one column (using import pyspark.sql.functions as F):

df = df.select('*', F.col('count').alias('new_count')).drop('count')

Answer by Sahan Jayasumana

df.withColumnRenamed('age', 'age2')
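
Note that withColumnRenamed returns a new DataFrame rather than modifying the original in place, so assign the result if you want to keep it:

df = df.withColumnRenamed('age', 'age2')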

Answer by Grant Shannon

This is the approach that I used:

Create a PySpark session:

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('changeColNames').getOrCreate()

Create the dataframe:

df = spark.createDataFrame(data=[('Bob', 5.62, 'juice'), ('Sue', 0.85, 'milk')], schema=["Name", "Amount", "Item"])

View the df with its column names:

df.show()
+----+------+-----+
|Name|Amount| Item|
+----+------+-----+
| Bob|  5.62|juice|
| Sue|  0.85| milk|
+----+------+-----+

Create a list with the new column names:

newcolnames = ['NameNew','AmountNew','ItemNew']

Change the column names of the df:

for c, n in zip(df.columns, newcolnames):
    df = df.withColumnRenamed(c, n)

View the df with the new column names:

df.show()
+-------+---------+-------+
|NameNew|AmountNew|ItemNew|
+-------+---------+-------+
|    Bob|     5.62|  juice|
|    Sue|     0.85|   milk|
+-------+---------+-------+