Convert Spark Row to typed Array of Doubles (Scala)
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA terms and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/30354483/
Asked by user2726995
I am using Spark 1.3.1 with Hive and have a row object that is a long series of doubles to be passed to a Vectors.dense constructor. However, when I convert a Row to an array via
SparkDataFrame.map{r => r.toSeq.toArray}
all type information is lost and I get back an Array[Any]. I am unable to convert the elements to Double using
SparkDataFrame.map{r =>
val array = r.toSeq.toArray
array.map(_.toDouble)
} // Fails with value toDouble is not a member of any
as does
SparkDataFrame.map{r =>
val array = r.toSeq.toArray
array.map(_.asInstanceOf[Double])
} // Fails with java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Double
I see that the Row object has an API that supports getting specific elements as a type, via:
SparkDataFrame.map{r =>
r.getDouble(5)}
However, even this fails with java.lang.Integer cannot be cast to java.lang.Double.
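A small sketch, not from the original post, of why this getter fails: the sixth column is stored as an Int, so getDouble's internal cast throws, while reading it with getInt and converting afterwards works:

import org.apache.spark.sql.Row

// a locally constructed Row whose column 5 holds an Int, mirroring the question's data
val r = Row("a", "b", "c", "d", "e", 7)

r.getInt(5)              // 7   -- the getter matches the stored type
r.getInt(5).toDouble     // 7.0 -- convert after reading with the matching getter
// r.getDouble(5)        // throws java.lang.ClassCastException:
                         //   java.lang.Integer cannot be cast to java.lang.Double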
The only work around I have found is the following:
SparkDataFrame.map{r =>
  val doubleArray = Array(r.getInt(5).toDouble, r.getInt(6).toDouble)
  Vectors.dense(doubleArray) }
However, this is prohibitively tedious when indices 5 through 1000 need to be converted to an array of doubles.
Any way around explicitly indexing the row object?
Accepted answer by bwawok
Let's look at your code blocks one by one.
SparkDataFrame.map{r =>
val array = r.toSeq.toArray
val doubleArra = array.map(_.toDouble)
} // Fails with value toDouble is not a member of any
Map returns the value of its last statement (in Scala there is an implicit return in every function: the last expression is your return value). Here your last statement is of type Unit (like Void), because a val assignment produces no value. To fix that, take out the assignment (this has the side benefit of being less code to read).
SparkDataFrame.map{r =>
val array = r.toSeq.toArray
array.map(_.toDouble)
}
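The same pitfall can be reproduced outside of Spark; a minimal plain-Scala sketch (my illustration, not part of the original answer):

val xs = Seq(1, 2, 3)

// the block's last statement is a val definition, so the block's value is Unit
val wrong: Seq[Unit] = xs.map { x => val doubled = x * 2 }

// drop the assignment and the block's value is the mapped element itself
val right: Seq[Int] = xs.map { x => x * 2 }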
_.toDouble is not a cast... you can call it on a String, or in your case an Integer, and it creates a new value of the target type. If you call _.toDouble on an Int, it is more like doing Double.parseDouble(inputInt).
_.asInstanceOf[Double] would be a cast... which, if your data really is a Double, would change the type. But I am not sure you need to cast here; avoid casting if you can.
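To make the conversion-versus-cast distinction concrete, a small sketch (my illustration, not from the original answer):

val n: Int = 5
val cell: Any = n              // what r.toSeq hands back: statically Any, a boxed java.lang.Integer at runtime

n.toDouble                     // 5.0 -- conversion: builds a new Double from the Int
// cell.toDouble               // does not compile: value toDouble is not a member of Any
// cell.asInstanceOf[Double]   // compiles, but throws at runtime:
                               //   java.lang.Integer cannot be cast to java.lang.Double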
Update
So you changed the code to this
SparkDataFrame.map{r =>
val array = r.toSeq.toArray
array.map(_.toDouble)
} // Fails with value toDouble is not a member of any
You are calling toDouble on an element of your SparkDataFrame. Apparently it is not something that has a toDouble method, i.e. it is not an Int, a String, or a Long (more precisely, r.toSeq.toArray gives you elements statically typed as Any, and Any has no toDouble method even when the value underneath is an Int).
If this works
SparkDataFrame.map{r =>
  val doubleArray = Array(r.getInt(5).toDouble, r.getInt(6).toDouble)
  Vectors.dense(doubleArray) }
but you need to do this for columns 5 to 1000... why not do
SparkDataFrame.map{r =>
  val doubleArray = (for (i <- 5 to 1000) yield r.getInt(i).toDouble).toArray
  Vectors.dense(doubleArray)
}
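The same idea can also be written as a map over the index range, still under the answer's assumption that columns 5 through 1000 hold Ints (a sketch, not from the original answer):

SparkDataFrame.map { r =>
  Vectors.dense((5 to 1000).map(i => r.getInt(i).toDouble).toArray)
}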
Answered by Jason
You should use Double.parseDouble from Java.
import java.lang.Double

SparkDataFrame.map{r =>
  // parse each cell via its String form, so it works for Int, Long or Double columns
  val doubleArray = (for (i <- 5 to 1000) yield Double.parseDouble(r.get(i).toString)).toArray
  Vectors.dense(doubleArray)
}
Answered by Edi Bice
Had a similar, harder problem in that my features are not all Double. Here's how I was able to convert from my DataFrame (also pulled from a Hive table) to a LabeledPoint RDD:
val loaff = oaff.map(r =>
  LabeledPoint(if (r.getString(classIdx) == "NOT_FRAUD") 0 else 1,
    Vectors.dense(featIdxs.map(r.get(_) match {
      case null      => Double.NaN   // missing values become NaN
      case d: Double => d
      case l: Long   => l            // Longs are widened to Double
    }).toArray)))

