Spark: scala.MatchError (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Original link: http://stackoverflow.com/questions/31838539/
Asked by zork
The scala.MatchError (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema) exception happens when I try to access DataFrame row elements. The following code counts book pairs, where the count of a pair equals the number of readers who read that pair of books.
Interestingly, the exception happens only when trainPairs is created as the result of trainDf.join(...). If the same data structure is created inline as:
case class BookPair (book1:Int, book2:Int, cnt:Int, name1: String, name2: String)
val recs = Array(
  BookPair(1, 2, 3, "book1", "book2"),
  BookPair(2, 3, 1, "book2", "book3"),
  BookPair(1, 3, 2, "book1", "book3"),
  BookPair(1, 4, 5, "book1", "book4"),
  BookPair(2, 4, 7, "book2", "book4")
)
This exception does not happen at all!
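One typing difference between the two paths is worth noting (it turns out to be the key to the answer below): a case-class Int field becomes an integer column, while a column produced by Spark's count(...) aggregate is typed long. A minimal sketch of the comparison, assuming a Spark 1.x setup where sc, a SQLContext, and its implicits are in scope as in the full listing below:

import org.apache.spark.sql.functions.count

// Inline case-class data: cnt is declared as a Scala Int,
// so the resulting DataFrame column is typed integer
val inlineDf = sc.parallelize(recs).toDF()
inlineDf.printSchema()  // ... |-- cnt: integer (nullable = false) ...

// A column built with the count(...) aggregate is typed long instead
val counted = inlineDf.groupBy($"book1").agg(count($"book2") as "cnt")
counted.printSchema()   // ... |-- cnt: long (nullable = false) ...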
The complete code that produces this exception:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.log4j.Logger
import org.apache.log4j.Level
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, DataFrame}
import org.apache.spark.sql.functions._
object Scratch {
  case class Book(book: Int, reader: Int, name: String)

  val recs = Array(
    Book(book = 1, reader = 30, name = "book1"),
    Book(book = 2, reader = 10, name = "book2"),
    Book(book = 3, reader = 20, name = "book3"),
    Book(book = 1, reader = 20, name = "book1"),
    Book(book = 1, reader = 10, name = "book1"),
    Book(book = 1, reader = 40, name = "book1"),
    Book(book = 2, reader = 40, name = "book2"),
    Book(book = 1, reader = 100, name = "book1"),
    Book(book = 2, reader = 100, name = "book2"),
    Book(book = 3, reader = 100, name = "book3"),
    Book(book = 4, reader = 100, name = "book4"),
    Book(book = 5, reader = 100, name = "book5"),
    Book(book = 4, reader = 500, name = "book4"),
    Book(book = 1, reader = 510, name = "book1"),
    Book(book = 2, reader = 30, name = "book2"))

  def main(args: Array[String]) {
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)

    // set up environment
    val conf = new SparkConf()
      .setMaster("local[5]")
      .setAppName("Scratch")
      .set("spark.executor.memory", "2g")
    val sc = new SparkContext(conf)

    val data = sc.parallelize(recs)

    /**
     * Remove readers with too many books:
     * count books per reader and drop readers
     * whose book count exceeds maxBookCnt
     */
    val maxBookCnt = 4
    val readersWithLotsOfBooksRDD = data.map(r => (r.reader, 1)).reduceByKey((x, y) => x + y).filter { case (_, x) => x > maxBookCnt }
    readersWithLotsOfBooksRDD.collect()
    val readersWithBooksRDD = data.map(r => (r.reader, (r.book, r.name)))
    readersWithBooksRDD.collect()
    println("*** Records left after removing readers with book count > " + maxBookCnt)
    val data2 = readersWithBooksRDD.subtractByKey(readersWithLotsOfBooksRDD)
    data2.foreach(println)

    // *** Prepare train data
    val trainData = data2.map { case (reader, v) => Book(reader = reader, book = v._1, name = v._2) }

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._
    val trainDf = trainData.toDF()

    println("*** Creating pairs...")
    val trainPairs = trainDf.join(
      trainDf.select($"book" as "r_book", $"reader" as "r_reader", $"name" as "r_name"),
      $"reader" === $"r_reader" and $"book" < $"r_book")
      .groupBy($"book", $"r_book", $"name", $"r_name")
      .agg($"book", $"r_book", count($"reader") as "cnt", $"name", $"r_name")
    trainPairs.registerTempTable("trainPairs")

    println("*** Pairs Schema:")
    trainPairs.printSchema()

    // Order pairs by count
    val pairsSorted = sqlContext.sql("SELECT * FROM trainPairs ORDER BY cnt DESC")
    println("*** Pairs Sorted by Count")
    pairsSorted.show

    // Key pairs by book
    val keyedPairs = trainPairs.rdd.map({ case Row(book1: Int, book2: Int, count: Int, name1: String, name2: String)
      => (book1, (book2, count, name1, name2)) })
    println("*** keyedPairs:")
    keyedPairs.foreach(println)
  }
}
Any ideas?
Update
zero323 writes:
"It throws an exception because schema of trainPairs doesn't match pattern you've provided. Schema looks like this:
root
|-- book: integer (nullable = false)
|-- r_book: integer (nullable = false)
|-- name: string (nullable = true)
|-- r_name: string (nullable = true)
|-- book: integer (nullable = false)
|-- r_book: integer (nullable = false)
|-- cnt: long (nullable = false)
|-- name: string (nullable = true)
|-- r_name: string (nullable = true)
OK, but how can I find the complete schema of trainPairs? Why, then, when I print the trainPairs schema with the command:
trainPairs.printSchema()
I get only part of this schema:
root
|-- book: integer (nullable = false)
|-- r_book: integer (nullable = false)
|-- cnt: long (nullable = false)
|-- name: string (nullable = true)
|-- r_name: string (nullable = true)
How can I print / find the complete schema of trainPairs?
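For reference, the schema can also be inspected programmatically, which lists every field of trainPairs; a minimal sketch using Spark's standard DataFrame/StructType API:

// The full schema as a StructType value, one StructField per column
println(trainPairs.schema)

// The same tree that printSchema() renders, as a String
println(trainPairs.schema.treeString)

// (columnName, typeName) pairs, one per column
trainPairs.dtypes.foreach(println)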
Besides
Row(Int, Int, String, String, Int, Int, Long, String, String)
results in the same scala.MatchError!
Answered by zork
As I found out, the exception was caused by the wrong type of the count row field: it should be Long, not Int. So instead of:
// Key pairs by book
val keyedPairs = trainPairs.rdd.map({ case Row(book1: Int, book2: Int, count: Int, name1: String, name2: String)
  => (book1, (book2, count, name1, name2)) })
The correct code should be:
val keyedPairs = trainPairs.rdd.map({ case Row(book1: Int, book2: Int, count: Long, name1: String, name2: String)
  => (book1, (book2, count, name1, name2)) })
And everything works as expected. This also explains why the inline BookPair version never failed: there cnt is declared as an Int, so the Int pattern matches.
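A more defensive variant (not part of the original answer): extracting the fields by name with Row.getAs sidesteps the positional type pattern entirely. This sketch assumes Spark 1.4+, where getAs by field name is available, and that the column names in the printed schema are unique:

val keyedPairs = trainPairs.rdd.map { row =>
  // Read each field by column name; the types follow the printed schema
  val book1 = row.getAs[Int]("book")
  val book2 = row.getAs[Int]("r_book")
  val cnt   = row.getAs[Long]("cnt")    // long in the schema, so Long here
  val name1 = row.getAs[String]("name")
  val name2 = row.getAs[String]("r_name")
  (book1, (book2, cnt, name1, name2))
}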

