Scala: How to extract a value from a Vector in a column of a Spark DataFrame
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/43731181/
How to extract a value from a Vector in a column of a Spark Dataframe
Asked by you zhenghong
When using SparkML to predict labels, the resulting DataFrame is:
scala> result.show
+-----------+--------------+
|probability|predictedLabel|
+-----------+--------------+
| [0.0,1.0]| 0.0|
| [0.0,1.0]| 0.0|
| [0.0,1.0]| 0.0|
| [0.0,1.0]| 0.0|
| [0.0,1.0]| 0.0|
| [0.1,0.9]| 0.0|
| [0.0,1.0]| 0.0|
| [0.0,1.0]| 0.0|
| [0.0,1.0]| 0.0|
| [0.0,1.0]| 0.0|
| [0.0,1.0]| 0.0|
| [0.0,1.0]| 0.0|
| [0.1,0.9]| 0.0|
| [0.6,0.4]| 1.0|
| [0.6,0.4]| 1.0|
| [1.0,0.0]| 1.0|
| [0.9,0.1]| 1.0|
| [0.9,0.1]| 1.0|
| [1.0,0.0]| 1.0|
| [1.0,0.0]| 1.0|
+-----------+--------------+
only showing top 20 rows
I want to create a new DataFrame with an additional column named prob, holding the first value of the Vector in the probability column of the original DataFrame, e.g.:
+-----------+--------------+----------+
|probability|predictedLabel| prob |
+-----------+--------------+----------+
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.1,0.9]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.1,0.9]| 0.0| 0.1|
| [0.6,0.4]| 1.0| 0.6|
| [0.6,0.4]| 1.0| 0.6|
| [1.0,0.0]| 1.0| 1.0|
| [0.9,0.1]| 1.0| 0.9|
| [0.9,0.1]| 1.0| 0.9|
| [1.0,0.0]| 1.0| 1.0|
| [1.0,0.0]| 1.0| 1.0|
+-----------+--------------+----------+
How can I extract this value into a new column?
Answered by Vidya
You can use the capabilities of Dataset and the wonderful functions library to accomplish what you need:
result.withColumn("prob", $"probability".getItem(0))
This adds a new Column called prob whose value is derived from the probability Column by taking the first item (at index 0, since we are computer scientists after all) in the array.
I would also mention that UDFs should be your last resort, because the Catalyst optimizer currently cannot optimize UDFs; always prefer the built-in functions to get the most out of Catalyst.
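Note that getItem works on array (and map) columns. If probability is an ML Vector (VectorUDT), as SparkML probability columns usually are, a minimal sketch assuming Spark 3.0+ is to convert it to an array first with vector_to_array and then index it:
import org.apache.spark.ml.functions.vector_to_array
import org.apache.spark.sql.functions.col
// vector_to_array turns the ML Vector into array<double>, which getItem can index
result.withColumn("prob", vector_to_array(col("probability")).getItem(0))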
Answered by himanshuIIITian
It is fairly simple if you use a Spark UDF, like this:
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf
val headValue = udf((v: Vector) => v(0)) // v(0) is the first element of the ML Vector
result.withColumn("prob", headValue(result("probability"))).show
It will give you the desired output:
+-----------+--------------+----------+
|probability|predictedLabel| prob |
+-----------+--------------+----------+
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.1,0.9]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.1,0.9]| 0.0| 0.1|
| [0.6,0.4]| 1.0| 0.6|
| [0.6,0.4]| 1.0| 0.6|
| [1.0,0.0]| 1.0| 1.0|
| [0.9,0.1]| 1.0| 0.9|
| [0.9,0.1]| 1.0| 0.9|
| [1.0,0.0]| 1.0| 1.0|
| [1.0,0.0]| 1.0| 1.0|
+-----------+--------------+----------+
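As a quick way to try this out, here is a minimal self-contained sketch with toy data (hypothetical values and names, meant for spark-shell where the spark session is already in scope):
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf

// Toy DataFrame mimicking the shape of the prediction output
val toy = spark.createDataFrame(Seq(
  (Vectors.dense(0.0, 1.0), 0.0),
  (Vectors.dense(0.6, 0.4), 1.0)
)).toDF("probability", "predictedLabel")

// Same idea as above: take the first element of the ML Vector
val firstProb = udf((v: Vector) => v(0))
toy.withColumn("prob", firstProb(toy("probability"))).show()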

