
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/38104600/

Date: 2020-10-22 08:25:54  Source: igfitidea

How to change a column position in a spark dataframe?

scala, apache-spark, dataframe, apache-spark-sql

Asked by obiwan kenobi

I was wondering if it is possible to change the position of a column in a dataframe, actually to change the schema?


Precisely if I have got a dataframe like [field1, field2, field3], and I would like to get [field1, field3, field2].


I can't put any piece of code. Let us imagine we're working with a dataframe with one hundred columns, after some joins and transformations, some of these columns are misplaced regarding the schema of the destination table.


How to move one or several columns, i.e: how to change the schema?


Answered by Tzach Zohar

You can get the column names, reorder them however you want, and then use select on the original DataFrame to get a new one with the desired order:


val columns: Array[String] = dataFrame.columns
val reorderedColumnNames: Array[String] = ??? // do the reordering you want
val result: DataFrame = dataFrame.select(reorderedColumnNames.head, reorderedColumnNames.tail: _*)
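As a minimal sketch of the "do the reordering you want" step, here is one option in plain Scala on the name array alone (no SparkSession needed; field1/field2/field3 are the question's hypothetical columns): state the target order explicitly and verify it is a permutation of the original before selecting.

```scala
// Stand-in for dataFrame.columns
val columns: Array[String] = Array("field1", "field2", "field3")

// State the target order explicitly: move field3 before field2
val reorderedColumnNames: Array[String] = Array("field1", "field3", "field2")

// Sanity check: select would fail on any name missing from the original schema
assert(reorderedColumnNames.sorted.sameElements(columns.sorted))

// Then, on a real DataFrame:
// dataFrame.select(reorderedColumnNames.head, reorderedColumnNames.tail: _*)
```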

Answered by Powers

The spark-daria library has a reorderColumns method that makes it easy to reorder the columns in a DataFrame.


import com.github.mrpowers.spark.daria.sql.DataFrameExt._

val actualDF = sourceDF.reorderColumns(
  Seq("field1", "field3", "field2")
)

The reorderColumns method uses @Rockie Yang's solution under the hood.


If you want the column ordering of df1 to equal the column ordering of df2, something like this should work better than hardcoding all the columns:


df1.reorderColumns(df2.columns)
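Under the hood this amounts to selecting df1's columns in df2's order, which only works when every name in df2 also exists in df1. A quick pre-check can be sketched in plain Scala (df1Cols and df2Cols are hypothetical stand-ins for df1.columns and df2.columns):

```scala
// Hypothetical stand-ins for df1.columns and df2.columns
val df1Cols = Array("field2", "field1", "field3")
val df2Cols = Array("field1", "field3", "field2")

// Every name df2 expects must exist in df1, or the underlying select fails
val missing = df2Cols.filterNot(df1Cols.contains)
require(missing.isEmpty, s"df1 is missing: ${missing.mkString(", ")}")

// df1.reorderColumns(df2Cols) would then return the columns in df2's order
```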

The spark-daria library also defines a sortColumns transformation to sort the columns in ascending or descending order (if you don't want to spell out all the columns in a sequence).


import com.github.mrpowers.spark.daria.sql.transformations._

df.transform(sortColumns("asc"))

Answered by Raphaël Brugier

Like others have commented, I'm curious to know why you would do this, since the order is not relevant when you can query the columns by their names.


Anyway, using a select should give the impression that the columns have moved in the schema description:


import spark.implicits._ // required for toDF on a local Seq

val data = Seq(
  ("a", "hello", 1),
  ("b", "spark", 2)
).toDF("field1", "field2", "field3")

data.show()

data
  .select("field3", "field2", "field1")
  .show()

Answered by Rockie Yang

A slightly different version compared to @Tzach Zohar's:


val cols = df.columns.map(df(_)).reverse // Column objects in reverse order
val reversedColDF = df.select(cols: _*)
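The reversal itself can be sketched on the column names alone, without a Spark session (field1/field2/field3 are the question's hypothetical columns):

```scala
// Reversing the names only; df.select(reversedNames.map(df(_)): _*)
// would apply the same order to a real DataFrame
val names = Array("field1", "field2", "field3")
val reversedNames = names.reverse
println(reversedNames.mkString(", "))
```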

Answered by DHEERAJ

For any dynamic frame (e.g. from AWS Glue), first convert the dynamic frame to a data frame so you can use standard pyspark functions:


data_frame = dynamic_frame.toDF()

Now, rearrange the columns into a new data frame using a select operation:


data_frame_temp = data_frame.select(["col_5","col_1","col_2","col_3","col_4"])

Answered by huagang

Here's what you can do in pyspark:


As with MySQL queries, you can re-select, passing the desired column order as parameters; the columns come back in the same order you passed them in.


from pyspark.sql import SparkSession

data = [
    {'id': 1, 'sex': 1, 'name': 'foo', 'age': 13},
    {'id': 1, 'sex': 0, 'name': 'bar', 'age': 12},
]

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()

# init df
df = spark.createDataFrame(data)
df.show()

The output is as follows


+---+---+----+---+
|age| id|name|sex|
+---+---+----+---+
| 13|  1| foo|  1|
| 12|  1| bar|  0|
+---+---+----+---+

Pass the column order you want as arguments to select:


# change columns position
df = df.select(df.id, df.name, df.age, df.sex)
df.show()

The output is as follows


+---+----+---+---+
| id|name|age|sex|
+---+----+---+---+
|  1| foo| 13|  1|
|  1| bar| 12|  0|
+---+----+---+---+

I hope this helps.
