
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/38104600/

Date: 2020-10-22 08:25:54  Source: igfitidea

How to change a column position in a spark dataframe?

scala, apache-spark, dataframe, apache-spark-sql

Asked by obiwan kenobi

I was wondering if it is possible to change the position of a column in a dataframe, actually to change the schema?


Precisely if I have got a dataframe like [field1, field2, field3], and I would like to get [field1, field3, field2].


I can't put any piece of code. Let us imagine we're working with a dataframe with one hundred columns, after some joins and transformations, some of these columns are misplaced regarding the schema of the destination table.


How to move one or several columns, i.e: how to change the schema?


Answered by Tzach Zohar

You can get the column names, reorder them however you want, and then use select on the original DataFrame to get a new one with the desired order:


val columns: Array[String] = dataFrame.columns
val reorderedColumnNames: Array[String] = ??? // do the reordering you want
val result: DataFrame = dataFrame.select(reorderedColumnNames.head, reorderedColumnNames.tail: _*)
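As a minimal sketch of the "do the reordering you want" step, here is one option in plain Scala on the name array alone (no SparkSession needed; field1/field2/field3 are the question's hypothetical columns): state the target order explicitly and verify it is a permutation of the original before selecting.

```scala
// Stand-in for dataFrame.columns
val columns: Array[String] = Array("field1", "field2", "field3")

// State the target order explicitly: move field3 before field2
val reorderedColumnNames: Array[String] = Array("field1", "field3", "field2")

// Sanity check: select would fail on any name missing from the original schema
assert(reorderedColumnNames.sorted.sameElements(columns.sorted))

// Then, on a real DataFrame:
// dataFrame.select(reorderedColumnNames.head, reorderedColumnNames.tail: _*)
```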

Answered by Powers

The spark-daria library has a reorderColumns method that makes it easy to reorder the columns in a DataFrame.


import com.github.mrpowers.spark.daria.sql.DataFrameExt._

val actualDF = sourceDF.reorderColumns(
  Seq("field1", "field3", "field2")
)

The reorderColumns method uses @Rockie Yang's solution under the hood.


If you want the column ordering of df1 to equal the column ordering of df2, something like this should work better than hardcoding all the columns:


df1.reorderColumns(df2.columns)
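Under the hood this amounts to selecting df1's columns in df2's order, which only works when every name in df2 also exists in df1. A quick pre-check can be sketched in plain Scala (df1Cols and df2Cols are hypothetical stand-ins for df1.columns and df2.columns):

```scala
// Hypothetical stand-ins for df1.columns and df2.columns
val df1Cols = Array("field2", "field1", "field3")
val df2Cols = Array("field1", "field3", "field2")

// Every name df2 expects must exist in df1, or the underlying select fails
val missing = df2Cols.filterNot(df1Cols.contains)
require(missing.isEmpty, s"df1 is missing: ${missing.mkString(", ")}")

// df1.reorderColumns(df2Cols) would then return the columns in df2's order
```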

The spark-daria library also defines a sortColumns transformation to sort the columns in ascending or descending order (if you don't want to spell out all the columns in a sequence).


import com.github.mrpowers.spark.daria.sql.transformations._

df.transform(sortColumns("asc"))

Answered by Raphaël Brugier

Like others have commented, I'm curious to know why you would do this, since the order is not relevant when you can query the columns by their names.


Anyway, using a select should give the impression that the columns have moved in the schema description:


import spark.implicits._ // required for toDF on a local Seq

val data = Seq(
  ("a", "hello", 1),
  ("b", "spark", 2)
).toDF("field1", "field2", "field3")

data.show()

data
  .select("field3", "field2", "field1")
  .show()

Answered by Rockie Yang

A slightly different version compared to @Tzach Zohar's:


val cols = df.columns.map(df(_)).reverse // Column objects in reverse order
val reversedColDF = df.select(cols: _*)
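The reversal itself can be sketched on the column names alone, without a Spark session (field1/field2/field3 are the question's hypothetical columns):

```scala
// Reversing the names only; df.select(reversedNames.map(df(_)): _*)
// would apply the same order to a real DataFrame
val names = Array("field1", "field2", "field3")
val reversedNames = names.reverse
println(reversedNames.mkString(", "))
```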

Answered by DHEERAJ

For any dynamic frame (e.g. from AWS Glue), first convert the dynamic frame to a data frame so you can use standard pyspark functions:


data_frame = dynamic_frame.toDF()

Now, rearrange the columns into a new data frame using a select operation:


data_frame_temp = data_frame.select(["col_5","col_1","col_2","col_3","col_4"])

Answered by huagang

Here's what you can do in pyspark:


As with MySQL queries, you can re-select, passing the desired column order as parameters; the columns come back in the same order you passed them in.


from pyspark.sql import SparkSession

data = [
    {'id': 1, 'sex': 1, 'name': 'foo', 'age': 13},
    {'id': 1, 'sex': 0, 'name': 'bar', 'age': 12},
]

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()

# init df
df = spark.createDataFrame(data)
df.show()

The output is as follows


+---+---+----+---+
|age| id|name|sex|
+---+---+----+---+
| 13|  1| foo|  1|
| 12|  1| bar|  0|
+---+---+----+---+

Pass the column order you want as arguments to select:


# change columns position
df = df.select(df.id, df.name, df.age, df.sex)
df.show()

The output is as follows


+---+----+---+---+
| id|name|age|sex|
+---+----+---+---+
|  1| foo| 13|  1|
|  1| bar| 12|  0|
+---+----+---+---+

I hope this helps.
