
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/28232829/

Date: 2020-10-22 06:52:57  Source: igfitidea

Addition of two RDD[mllib.linalg.Vector]'s

scala apache-spark apache-spark-mllib

Asked by krishna

I need to add two matrices that are stored in two files.


The contents of latest1.txt and latest2.txt are as follows:


1 2 3
4 5 6
7 8 9

I am reading those files as follows:


scala> import org.apache.spark.mllib.linalg.Vectors

scala> val rows = sc.textFile("latest1.txt").map { line =>
    val values = line.split(' ').map(_.toDouble)
    Vectors.sparse(values.length, values.zipWithIndex.map(e => (e._2, e._1)).filter(_._2 != 0.0))
}
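The zipWithIndex/filter step above can be checked outside Spark. Here is a minimal plain-Scala sketch of that parsing logic (the object and method names are made up for illustration):

```scala
object SparseEntries {
  // Mirrors the parsing above: pair each value with its index,
  // then keep only the (index, value) entries whose value is non-zero
  def nonZeroEntries(values: Array[Double]): Array[(Int, Double)] =
    values.zipWithIndex.map(e => (e._2, e._1)).filter(_._2 != 0.0)

  def main(args: Array[String]): Unit = {
    val parsed = "1 0 3".split(' ').map(_.toDouble)
    println(nonZeroEntries(parsed).toList)  // List((0,1.0), (2,3.0))
  }
}
```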

scala> val r1 = rows
r1: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MappedRDD[2] at map at :14

scala> val rows = sc.textFile("latest2.txt").map { line =>
    val values = line.split(' ').map(_.toDouble)
    Vectors.sparse(values.length, values.zipWithIndex.map(e => (e._2, e._1)).filter(_._2 != 0.0))
}

scala> val r2 = rows
r2: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MappedRDD[2] at map at :14

I want to add r1 and r2. Is there any way to add these two RDD[mllib.linalg.Vector]s in Apache Spark?


Answered by javadba

This is actually a good question. I work with mllib regularly and did not realize these basic linear algebra operations are not easily accessible.


The point is that the underlying breeze vectors have all of the linear algebra manipulations you would expect - including of course the basic element-wise addition you specifically mentioned.


However the breeze implementation is hidden from the outside world via:


private[mllib]

So then, from the outside world/public API perspective, how do we access those primitives?


Some of them are already exposed, e.g. the squared distance:


/**
 * Returns the squared distance between two Vectors.
 * @param v1 first Vector.
 * @param v2 second Vector.
 * @return squared distance between two Vectors.
 */
def sqdist(v1: Vector, v2: Vector): Double = { 
  ...
}
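What sqdist computes can be sketched in plain Scala. This is a hypothetical standalone re-implementation for illustration, not Spark's actual code:

```scala
object SqDist {
  // Squared Euclidean distance: the sum of (v1(i) - v2(i))^2 over all indices
  def sqdist(v1: Array[Double], v2: Array[Double]): Double = {
    require(v1.length == v2.length, "vectors must have the same size")
    v1.zip(v2).map { case (a, b) => (a - b) * (a - b) }.sum
  }

  def main(args: Array[String]): Unit = {
    println(sqdist(Array(1.0, 2.0, 3.0), Array(4.0, 5.0, 6.0)))  // 27.0
  }
}
```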

However the selection of such available methods is limited - and in fact does not include basic operations such as element-wise addition, subtraction, multiplication, etc.


So here is the best I could see:


  • Convert the vectors to breeze
  • Perform the vector operations in breeze
  • Convert back from breeze to an mllib Vector
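The element-wise addition that breeze's + performs in step 2 amounts to the following (a plain-array sketch, requiring neither Spark nor breeze; the object name is made up):

```scala
object VectorAdd {
  // Element-wise sum of two equal-length dense vectors,
  // i.e. what breeze's `bv1 + bv2` does for dense vectors
  def addDense(a: Array[Double], b: Array[Double]): Array[Double] = {
    require(a.length == b.length, "vectors must have the same size")
    a.zip(b).map { case (x, y) => x + y }
  }

  def main(args: Array[String]): Unit = {
    println(addDense(Array(1.0, 2.0, 3.0), Array(4.0, 5.0, 6.0)).toList)  // List(5.0, 7.0, 9.0)
  }
}
```

Applied to the original question, the two RDDs can be combined pairwise with r1.zip(r2) and each pair of rows added this way (zip assumes both RDDs have the same number of partitions and the same number of elements per partition).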

Here is some sample code:


import org.apache.spark.mllib.linalg.Vectors
import breeze.linalg.DenseVector  // breeze's DenseVector supports element-wise +

val v1 = Vectors.dense(1.0, 2.0, 3.0)
val v2 = Vectors.dense(4.0, 5.0, 6.0)
val bv1 = new DenseVector(v1.toArray)
val bv2 = new DenseVector(v2.toArray)

val vectout = Vectors.dense((bv1 + bv2).toArray)
vectout: org.apache.spark.mllib.linalg.Vector = [5.0,7.0,9.0]

Answered by Jussi Kujala

The following code exposes the asBreeze and fromBreeze methods from Spark. In contrast to using vector.toArray, this solution supports SparseVector. Note that Spark may change its API in the future and has already renamed toBreeze to asBreeze.


package org.apache.spark.mllib.linalg
import breeze.linalg.{Vector => BV}
import org.apache.spark.sql.functions.udf

/** expose vector.toBreeze and Vectors.fromBreeze
  */
object VectorUtils {

  def fromBreeze(breezeVector: BV[Double]): Vector = {
    Vectors.fromBreeze( breezeVector )
  }

  def asBreeze(vector: Vector): BV[Double] = {
    // this is vector.asBreeze in Spark 2.0
    vector.toBreeze
  }

  val addVectors = udf {
    (v1: Vector, v2: Vector) => fromBreeze( asBreeze(v1) + asBreeze(v2) )
  }

}

With this you can do df.withColumn("xy", addVectors($"x", $"y")).
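For SparseVectors, the addition that asBreeze(v1) + asBreeze(v2) performs boils down to merging two index-to-value maps. A rough standalone sketch of that merge (names are illustrative, not breeze's actual implementation):

```scala
object SparseAdd {
  // Add two sparse vectors represented as index -> value maps,
  // dropping entries whose sum cancels out to zero
  def addSparse(a: Map[Int, Double], b: Map[Int, Double]): Map[Int, Double] =
    (a.keySet ++ b.keySet)
      .map(i => i -> (a.getOrElse(i, 0.0) + b.getOrElse(i, 0.0)))
      .toMap
      .filter(_._2 != 0.0)

  def main(args: Array[String]): Unit = {
    // indices 2 cancel out; only indices 0 and 5 survive
    println(addSparse(Map(0 -> 1.0, 2 -> 3.0), Map(2 -> -3.0, 5 -> 2.0)).toList.sorted)
  }
}
```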
