Scala Spark: Convert a string column to an array

Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must likewise follow the CC BY-SA license and attribute it to the original authors (not the translator). Original question: http://stackoverflow.com/questions/44690174/


Spark: Convert column of string to an array

Tags: scala, apache-spark, pyspark

Asked by Nikhil Utane

How do I convert a column that has been read as a string into a column of arrays? I.e., convert from the schema below:

scala> test.printSchema
root
 |-- a: long (nullable = true)
 |-- b: string (nullable = true)

+---+---+
|  a|  b|
+---+---+
|  1|2,3|
|  2|4,5|
+---+---+

To:

scala> test1.printSchema
root
 |-- a: long (nullable = true)
 |-- b: array (nullable = true)
 |    |-- element: long (containsNull = true)

+---+-----+
|  a|    b|
+---+-----+
|  1|[2,3]|
|  2|[4,5]|
+---+-----+

Please share both Scala and Python implementations if possible. On a related note, how do I take care of this while reading from the file itself? I have data with ~450 columns, a few of which I want to specify in this format. Currently I am reading in pyspark as below:

df = spark.read.format('com.databricks.spark.csv').options(
    header='true', inferschema='true', delimiter='|').load(input_file)

Thanks.

Answered by ktheitroadalo

There are various methods.

The best way is to use the split function and cast to array<long>:

import org.apache.spark.sql.functions.{col, split}

data.withColumn("b", split(col("b"), ",").cast("array<long>"))
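
For context, here is a minimal self-contained sketch of that approach; the toy data and the local SparkSession are illustrative assumptions, not part of the original answer:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, split}

val spark = SparkSession.builder.master("local[*]").appName("split-demo").getOrCreate()
import spark.implicits._

// Toy stand-in for the data: a long column and a comma-separated string column
val data = Seq((1L, "2,3"), (2L, "4,5")).toDF("a", "b")

// split yields array<string>; the cast then converts each element to long
val converted = data.withColumn("b", split(col("b"), ",").cast("array<long>"))
converted.printSchema() // b is now array<long>
converted.show()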

You can also create a simple UDF to convert the values:

import org.apache.spark.sql.functions.udf

val tolong = udf((value: String) => value.split(",").map(_.toLong))

data.withColumn("newB", tolong(data("b"))).show
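
As a design note, the built-in split-and-cast version above is generally preferable to a UDF: built-in functions are visible to the Catalyst optimizer, while a UDF is an opaque function call with extra serialization overhead. The UDF will also throw on malformed input (e.g. a non-numeric token), whereas the cast typically yields null instead.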

Hope this helps!

Answered by himanshuIIITian

Using a UDF would give you exactly the required schema, like this:

import org.apache.spark.sql.functions.{col, udf}

val toArray = udf((b: String) => b.split(",").map(_.toLong))

val test1 = test.withColumn("b", toArray(col("b")))

It would give you the schema as follows:

scala> test1.printSchema
root
 |-- a: long (nullable = true)
 |-- b: array (nullable = true)
 |    |-- element: long (containsNull = true)

+---+-----+
|  a|    b|
+---+-----+
|  1|[2,3]|
|  2|[4,5]|
+---+-----+

As far as applying the schema during the file read itself is concerned, I think that is a tough task, since the CSV source cannot parse array columns directly. So, for now, you can apply the transformation after the DataFrameReader has produced the test DataFrame.
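
Since the question mentions ~450 columns of which only a few need this treatment, one way to scale this up is to fold the split-and-cast transformation over just those columns after the read. A minimal sketch, where the column list is a hypothetical placeholder:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, split}

// Hypothetical: the handful of columns that hold comma-separated longs
val arrayCols = Seq("b", "c", "d")

// Apply split + cast to each named column, leaving the other columns untouched
def toLongArrays(df: DataFrame, names: Seq[String]): DataFrame =
  names.foldLeft(df) { (acc, name) =>
    acc.withColumn(name, split(col(name), ",").cast("array<long>"))
  }

val test1 = toLongArrays(test, arrayCols)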

I hope this helps!

Answered by Ariana Bermúdez

In Python (pyspark) it would be:

from pyspark.sql.functions import col, split

# note: the question's target schema uses long; use .cast("array<long>") to match it
test = test.withColumn("b", split(col("b"), r",\s*").cast("array<int>"))
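
The ",\s*" pattern makes the split tolerant of optional whitespace after each comma (e.g. "2, 3"); writing it as a raw string (r",\s*") keeps Python from trying to interpret the backslash escape itself.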