Convert Spark Row to typed Array of Doubles (Scala)
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA terms and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/30354483/
Asked by user2726995
I am using Spark 1.3.1 with Hive and have a row object that is a long series of doubles to be passed to a Vectors.dense constructor. However, when I convert a Row to an array via
SparkDataFrame.map{r => r.toSeq.toArray}
all type information is lost and I get back an Array[Any]. I am unable to convert the elements to Double using
SparkDataFrame.map{r =>
val array = r.toSeq.toArray
array.map(_.toDouble)
} // Fails with value toDouble is not a member of any
as does
SparkDataFrame.map{r =>
val array = r.toSeq.toArray
array.map(_.asInstanceOf[Double])
} // Fails with java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Double
I see that the Row object has an API that supports getting specific elements as a type, via:
SparkDataFrame.map{r =>
r.getDouble(5)}
However, even this fails with java.lang.Integer cannot be cast to java.lang.Double.
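A small sketch, not from the original post, of why this getter fails: the sixth column is stored as an Int, so getDouble's internal cast throws, while reading it with getInt and converting afterwards works:

import org.apache.spark.sql.Row

// a locally constructed Row whose column 5 holds an Int, mirroring the question's data
val r = Row("a", "b", "c", "d", "e", 7)

r.getInt(5)              // 7   -- the getter matches the stored type
r.getInt(5).toDouble     // 7.0 -- convert after reading with the matching getter
// r.getDouble(5)        // throws java.lang.ClassCastException:
                         //   java.lang.Integer cannot be cast to java.lang.Double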
The only work around I have found is the following:
SparkDataFrame.map{r =>
  val doubleArray = Array(r.getInt(5).toDouble, r.getInt(6).toDouble)
  Vectors.dense(doubleArray) }
However, this is prohibitively tedious when indices 5 through 1000 need to be converted to an array of doubles.
Any way around explicitly indexing the row object?
Accepted answer by bwawok
Let's look at your code blocks one by one.
SparkDataFrame.map{r =>
val array = r.toSeq.toArray
val doubleArra = array.map(_.toDouble)
} // Fails with value toDouble is not a member of any
Map returns the value of its last statement (in Scala there is an implicit return in every function: the last expression is your return value). Here your last statement is of type Unit (like Void), because a val assignment produces no value. To fix that, take out the assignment (this has the side benefit of being less code to read).
SparkDataFrame.map{r =>
val array = r.toSeq.toArray
array.map(_.toDouble)
}
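The same pitfall can be reproduced outside of Spark; a minimal plain-Scala sketch (my illustration, not part of the original answer):

val xs = Seq(1, 2, 3)

// the block's last statement is a val definition, so the block's value is Unit
val wrong: Seq[Unit] = xs.map { x => val doubled = x * 2 }

// drop the assignment and the block's value is the mapped element itself
val right: Seq[Int] = xs.map { x => x * 2 }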
_.toDouble is not a cast... you can call it on a String, or in your case an Integer, and it creates a new value of the target type. If you call _.toDouble on an Int, it is more like doing Double.parseDouble(inputInt).
_.asInstanceOf[Double] would be a cast... which, if your data really is a Double, would change the type. But I am not sure you need to cast here; avoid casting if you can.
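To make the conversion-versus-cast distinction concrete, a small sketch (my illustration, not from the original answer):

val n: Int = 5
val cell: Any = n              // what r.toSeq hands back: statically Any, a boxed java.lang.Integer at runtime

n.toDouble                     // 5.0 -- conversion: builds a new Double from the Int
// cell.toDouble               // does not compile: value toDouble is not a member of Any
// cell.asInstanceOf[Double]   // compiles, but throws at runtime:
                               //   java.lang.Integer cannot be cast to java.lang.Double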
Update
So you changed the code to this
SparkDataFrame.map{r =>
val array = r.toSeq.toArray
array.map(_.toDouble)
} // Fails with value toDouble is not a member of any
You are calling toDouble on an element of your SparkDataFrame. Apparently it is not something that has a toDouble method, i.e. it is not an Int, a String, or a Long (more precisely, r.toSeq.toArray gives you elements statically typed as Any, and Any has no toDouble method even when the value underneath is an Int).
If this works
SparkDataFrame.map{r =>
  val doubleArray = Array(r.getInt(5).toDouble, r.getInt(6).toDouble)
  Vectors.dense(doubleArray) }
but you need to do this for columns 5 to 1000... why not do
SparkDataFrame.map{r =>
  val doubleArray = (for (i <- 5 to 1000) yield r.getInt(i).toDouble).toArray
  Vectors.dense(doubleArray)
}
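The same idea can also be written as a map over the index range, still under the answer's assumption that columns 5 through 1000 hold Ints (a sketch, not from the original answer):

SparkDataFrame.map { r =>
  Vectors.dense((5 to 1000).map(i => r.getInt(i).toDouble).toArray)
}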
Answered by Jason
You should use Double.parseDouble from Java.
import java.lang.Double

SparkDataFrame.map{r =>
  // parse each cell via its String form, so it works for Int, Long or Double columns
  val doubleArray = (for (i <- 5 to 1000) yield Double.parseDouble(r.get(i).toString)).toArray
  Vectors.dense(doubleArray)
}
Answered by Edi Bice
Had a similar, harder problem in that my features are not all Double. Here's how I was able to convert from my DataFrame (also pulled from a Hive table) to a LabeledPoint RDD:
val loaff = oaff.map(r =>
  LabeledPoint(if (r.getString(classIdx) == "NOT_FRAUD") 0 else 1,
    Vectors.dense(featIdxs.map(r.get(_) match {
      case null      => Double.NaN   // missing values become NaN
      case d: Double => d
      case l: Long   => l            // Longs are widened to Double
    }).toArray)))

