Scala: How to extract a value from a Vector in a column of a Spark DataFrame
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/43731181/
How to extract a value from a Vector in a column of a Spark Dataframe
Asked by you zhenghong
When using SparkML to predict labels, the resulting DataFrame is:
scala> result.show
+-----------+--------------+
|probability|predictedLabel|
+-----------+--------------+
| [0.0,1.0]| 0.0|
| [0.0,1.0]| 0.0|
| [0.0,1.0]| 0.0|
| [0.0,1.0]| 0.0|
| [0.0,1.0]| 0.0|
| [0.1,0.9]| 0.0|
| [0.0,1.0]| 0.0|
| [0.0,1.0]| 0.0|
| [0.0,1.0]| 0.0|
| [0.0,1.0]| 0.0|
| [0.0,1.0]| 0.0|
| [0.0,1.0]| 0.0|
| [0.1,0.9]| 0.0|
| [0.6,0.4]| 1.0|
| [0.6,0.4]| 1.0|
| [1.0,0.0]| 1.0|
| [0.9,0.1]| 1.0|
| [0.9,0.1]| 1.0|
| [1.0,0.0]| 1.0|
| [1.0,0.0]| 1.0|
+-----------+--------------+
only showing top 20 rows
I want to create a new DataFrame with an additional column named prob, holding the first value of the Vector in the probability column of the original DataFrame, e.g.:
+-----------+--------------+----------+
|probability|predictedLabel| prob |
+-----------+--------------+----------+
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.1,0.9]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.1,0.9]| 0.0| 0.1|
| [0.6,0.4]| 1.0| 0.6|
| [0.6,0.4]| 1.0| 0.6|
| [1.0,0.0]| 1.0| 1.0|
| [0.9,0.1]| 1.0| 0.9|
| [0.9,0.1]| 1.0| 0.9|
| [1.0,0.0]| 1.0| 1.0|
| [1.0,0.0]| 1.0| 1.0|
+-----------+--------------+----------+
How can I extract this value into a new column?
Answered by Vidya
You can use the capabilities of Dataset and the wonderful functions library to accomplish what you need:
result.withColumn("prob", $"probability".getItem(0))
This adds a new Column called prob whose value is derived from the probability Column by taking the first item (at index 0, since we are computer scientists after all) in the array.
I would also mention that UDFs should be your last resort, because the Catalyst optimizer currently cannot optimize UDFs; always prefer the built-in functions to get the most out of Catalyst.
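Note that getItem works on array (and map) columns. If probability is an ML Vector (VectorUDT), as SparkML probability columns usually are, a minimal sketch assuming Spark 3.0+ is to convert it to an array first with vector_to_array and then index it:
import org.apache.spark.ml.functions.vector_to_array
import org.apache.spark.sql.functions.col
// vector_to_array turns the ML Vector into array<double>, which getItem can index
result.withColumn("prob", vector_to_array(col("probability")).getItem(0))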
Answered by himanshuIIITian
It is fairly simple if you use a Spark UDF, like this:
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf
val headValue = udf((v: Vector) => v(0)) // v(0) is the first element of the ML Vector
result.withColumn("prob", headValue(result("probability"))).show
It will give you the desired output:
+-----------+--------------+----------+
|probability|predictedLabel| prob |
+-----------+--------------+----------+
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.1,0.9]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.0,1.0]| 0.0| 0.0|
| [0.1,0.9]| 0.0| 0.1|
| [0.6,0.4]| 1.0| 0.6|
| [0.6,0.4]| 1.0| 0.6|
| [1.0,0.0]| 1.0| 1.0|
| [0.9,0.1]| 1.0| 0.9|
| [0.9,0.1]| 1.0| 0.9|
| [1.0,0.0]| 1.0| 1.0|
| [1.0,0.0]| 1.0| 1.0|
+-----------+--------------+----------+
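As a quick way to try this out, here is a minimal self-contained sketch with toy data (hypothetical values and names, meant for spark-shell where the spark session is already in scope):
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf

// Toy DataFrame mimicking the shape of the prediction output
val toy = spark.createDataFrame(Seq(
  (Vectors.dense(0.0, 1.0), 0.0),
  (Vectors.dense(0.6, 0.4), 1.0)
)).toDF("probability", "predictedLabel")

// Same idea as above: take the first element of the ML Vector
val firstProb = udf((v: Vector) => v(0))
toy.withColumn("prob", firstProb(toy("probability"))).show()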

