Scala Spark: Convert a string column to an array

Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must likewise follow the CC BY-SA license and attribute it to the original authors (not the translator). Original question: http://stackoverflow.com/questions/44690174/


Spark: Convert column of string to an array

Tags: scala, apache-spark, pyspark

Asked by Nikhil Utane

How do I convert a column that has been read as a string into a column of arrays? I.e., convert from the schema below:

scala> test.printSchema
root
 |-- a: long (nullable = true)
 |-- b: string (nullable = true)

+---+---+
|  a|  b|
+---+---+
|  1|2,3|
|  2|4,5|
+---+---+

To:

scala> test1.printSchema
root
 |-- a: long (nullable = true)
 |-- b: array (nullable = true)
 |    |-- element: long (containsNull = true)

+---+-----+
|  a|    b|
+---+-----+
|  1|[2,3]|
|  2|[4,5]|
+---+-----+

Please share both Scala and Python implementations if possible. On a related note, how do I take care of this while reading from the file itself? I have data with ~450 columns, a few of which I want to specify in this format. Currently I am reading in pyspark as below:

df = spark.read.format('com.databricks.spark.csv').options(
    header='true', inferschema='true', delimiter='|').load(input_file)

Thanks.

Answered by ktheitroadalo

There are various methods.

The best way is to use the split function and cast to array<long>:

import org.apache.spark.sql.functions.{col, split}

data.withColumn("b", split(col("b"), ",").cast("array<long>"))
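
For context, here is a minimal self-contained sketch of that approach; the toy data and the local SparkSession are illustrative assumptions, not part of the original answer:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, split}

val spark = SparkSession.builder.master("local[*]").appName("split-demo").getOrCreate()
import spark.implicits._

// Toy stand-in for the data: a long column and a comma-separated string column
val data = Seq((1L, "2,3"), (2L, "4,5")).toDF("a", "b")

// split yields array<string>; the cast then converts each element to long
val converted = data.withColumn("b", split(col("b"), ",").cast("array<long>"))
converted.printSchema() // b is now array<long>
converted.show()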

You can also create a simple UDF to convert the values:

import org.apache.spark.sql.functions.udf

val tolong = udf((value: String) => value.split(",").map(_.toLong))

data.withColumn("newB", tolong(data("b"))).show
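
As a design note, the built-in split-and-cast version above is generally preferable to a UDF: built-in functions are visible to the Catalyst optimizer, while a UDF is an opaque function call with extra serialization overhead. The UDF will also throw on malformed input (e.g. a non-numeric token), whereas the cast typically yields null instead.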

Hope this helps!

Answered by himanshuIIITian

Using a UDF would give you exactly the required schema, like this:

import org.apache.spark.sql.functions.{col, udf}

val toArray = udf((b: String) => b.split(",").map(_.toLong))

val test1 = test.withColumn("b", toArray(col("b")))

It would give you the schema as follows:

scala> test1.printSchema
root
 |-- a: long (nullable = true)
 |-- b: array (nullable = true)
 |    |-- element: long (containsNull = true)

+---+-----+
|  a|    b|
+---+-----+
|  1|[2,3]|
|  2|[4,5]|
+---+-----+

As far as applying the schema during the file read itself is concerned, I think that is a tough task, since the CSV source cannot parse array columns directly. So, for now, you can apply the transformation after the DataFrameReader has produced the test DataFrame.
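
Since the question mentions ~450 columns of which only a few need this treatment, one way to scale this up is to fold the split-and-cast transformation over just those columns after the read. A minimal sketch, where the column list is a hypothetical placeholder:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, split}

// Hypothetical: the handful of columns that hold comma-separated longs
val arrayCols = Seq("b", "c", "d")

// Apply split + cast to each named column, leaving the other columns untouched
def toLongArrays(df: DataFrame, names: Seq[String]): DataFrame =
  names.foldLeft(df) { (acc, name) =>
    acc.withColumn(name, split(col(name), ",").cast("array<long>"))
  }

val test1 = toLongArrays(test, arrayCols)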

I hope this helps!

Answered by Ariana Bermúdez

In Python (pyspark) it would be:

from pyspark.sql.functions import col, split

# note: the question's target schema uses long; use .cast("array<long>") to match it
test = test.withColumn("b", split(col("b"), r",\s*").cast("array<int>"))
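
The ",\s*" pattern makes the split tolerant of optional whitespace after each comma (e.g. "2, 3"); writing it as a raw string (r",\s*") keeps Python from trying to interpret the backslash escape itself.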