Spark: scala.MatchError (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Original link: http://stackoverflow.com/questions/31838539/
Asked by zork
The scala.MatchError (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema) exception happens when I try to access DataFrame row elements. The following code counts book pairs, where the count of a pair equals the number of readers who read that pair of books.
Interestingly, the exception happens only when trainPairs is created as the result of trainDf.join(...). If the same data structure is created inline as:
case class BookPair (book1:Int, book2:Int, cnt:Int, name1: String, name2: String)
val recs = Array(
  BookPair(1, 2, 3, "book1", "book2"),
  BookPair(2, 3, 1, "book2", "book3"),
  BookPair(1, 3, 2, "book1", "book3"),
  BookPair(1, 4, 5, "book1", "book4"),
  BookPair(2, 4, 7, "book2", "book4")
)
This exception does not happen at all!
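One typing difference between the two paths is worth noting (it turns out to be the key to the answer below): a case-class Int field becomes an integer column, while a column produced by Spark's count(...) aggregate is typed long. A minimal sketch of the comparison, assuming a Spark 1.x setup where sc, a SQLContext, and its implicits are in scope as in the full listing below:

import org.apache.spark.sql.functions.count

// Inline case-class data: cnt is declared as a Scala Int,
// so the resulting DataFrame column is typed integer
val inlineDf = sc.parallelize(recs).toDF()
inlineDf.printSchema()  // ... |-- cnt: integer (nullable = false) ...

// A column built with the count(...) aggregate is typed long instead
val counted = inlineDf.groupBy($"book1").agg(count($"book2") as "cnt")
counted.printSchema()   // ... |-- cnt: long (nullable = false) ...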
The complete code that produces this exception:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.log4j.Logger
import org.apache.log4j.Level
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, DataFrame}
import org.apache.spark.sql.functions._
object Scratch {
  case class Book(book: Int, reader: Int, name: String)

  val recs = Array(
    Book(book = 1, reader = 30, name = "book1"),
    Book(book = 2, reader = 10, name = "book2"),
    Book(book = 3, reader = 20, name = "book3"),
    Book(book = 1, reader = 20, name = "book1"),
    Book(book = 1, reader = 10, name = "book1"),
    Book(book = 1, reader = 40, name = "book1"),
    Book(book = 2, reader = 40, name = "book2"),
    Book(book = 1, reader = 100, name = "book1"),
    Book(book = 2, reader = 100, name = "book2"),
    Book(book = 3, reader = 100, name = "book3"),
    Book(book = 4, reader = 100, name = "book4"),
    Book(book = 5, reader = 100, name = "book5"),
    Book(book = 4, reader = 500, name = "book4"),
    Book(book = 1, reader = 510, name = "book1"),
    Book(book = 2, reader = 30, name = "book2"))

  def main(args: Array[String]) {
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)

    // set up environment
    val conf = new SparkConf()
      .setMaster("local[5]")
      .setAppName("Scratch")
      .set("spark.executor.memory", "2g")
    val sc = new SparkContext(conf)

    val data = sc.parallelize(recs)

    /**
     * Remove readers with too many books:
     * count books per reader and drop readers
     * whose book count exceeds maxBookCnt
     */
    val maxBookCnt = 4
    val readersWithLotsOfBooksRDD = data.map(r => (r.reader, 1)).reduceByKey((x, y) => x + y).filter { case (_, x) => x > maxBookCnt }
    readersWithLotsOfBooksRDD.collect()
    val readersWithBooksRDD = data.map(r => (r.reader, (r.book, r.name)))
    readersWithBooksRDD.collect()
    println("*** Records left after removing readers with book count > " + maxBookCnt)
    val data2 = readersWithBooksRDD.subtractByKey(readersWithLotsOfBooksRDD)
    data2.foreach(println)

    // *** Prepare train data
    val trainData = data2.map { case (reader, v) => Book(reader = reader, book = v._1, name = v._2) }

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._
    val trainDf = trainData.toDF()

    println("*** Creating pairs...")
    val trainPairs = trainDf.join(
      trainDf.select($"book" as "r_book", $"reader" as "r_reader", $"name" as "r_name"),
      $"reader" === $"r_reader" and $"book" < $"r_book")
      .groupBy($"book", $"r_book", $"name", $"r_name")
      .agg($"book", $"r_book", count($"reader") as "cnt", $"name", $"r_name")
    trainPairs.registerTempTable("trainPairs")

    println("*** Pairs Schema:")
    trainPairs.printSchema()

    // Order pairs by count
    val pairsSorted = sqlContext.sql("SELECT * FROM trainPairs ORDER BY cnt DESC")
    println("*** Pairs Sorted by Count")
    pairsSorted.show

    // Key pairs by book
    val keyedPairs = trainPairs.rdd.map({ case Row(book1: Int, book2: Int, count: Int, name1: String, name2: String)
      => (book1, (book2, count, name1, name2)) })
    println("*** keyedPairs:")
    keyedPairs.foreach(println)
  }
}
Any ideas?
Update
zero323 writes:
"It throws an exception because schema of trainPairs doesn't match pattern you've provided. Schema looks like this:
root
|-- book: integer (nullable = false)
|-- r_book: integer (nullable = false)
|-- name: string (nullable = true)
|-- r_name: string (nullable = true)
|-- book: integer (nullable = false)
|-- r_book: integer (nullable = false)
|-- cnt: long (nullable = false)
|-- name: string (nullable = true)
|-- r_name: string (nullable = true)
OK, but how can I find the complete schema of trainPairs? Why, then, when I print the trainPairs schema with the command:
trainPairs.printSchema()
I get only part of this schema:
root
|-- book: integer (nullable = false)
|-- r_book: integer (nullable = false)
|-- cnt: long (nullable = false)
|-- name: string (nullable = true)
|-- r_name: string (nullable = true)
How can I print / find the complete schema of trainPairs?
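For reference, the schema can also be inspected programmatically, which lists every field of trainPairs; a minimal sketch using Spark's standard DataFrame/StructType API:

// The full schema as a StructType value, one StructField per column
println(trainPairs.schema)

// The same tree that printSchema() renders, as a String
println(trainPairs.schema.treeString)

// (columnName, typeName) pairs, one per column
trainPairs.dtypes.foreach(println)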
Besides
Row(Int, Int, String, String, Int, Int, Long, String, String)
results in the same scala.MatchError!
Answered by zork
As I found out, the exception was caused by the wrong type of the count row field: it should be Long, not Int. So instead of:
// Key pairs by book
val keyedPairs = trainPairs.rdd.map({ case Row(book1: Int, book2: Int, count: Int, name1: String, name2: String)
  => (book1, (book2, count, name1, name2)) })
The correct code should be:
val keyedPairs = trainPairs.rdd.map({ case Row(book1: Int, book2: Int, count: Long, name1: String, name2: String)
  => (book1, (book2, count, name1, name2)) })
And everything works as expected. This also explains why the inline BookPair version never failed: there cnt is declared as an Int, so the Int pattern matches.
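A more defensive variant (not part of the original answer): extracting the fields by name with Row.getAs sidesteps the positional type pattern entirely. This sketch assumes Spark 1.4+, where getAs by field name is available, and that the column names in the printed schema are unique:

val keyedPairs = trainPairs.rdd.map { row =>
  // Read each field by column name; the types follow the printed schema
  val book1 = row.getAs[Int]("book")
  val book2 = row.getAs[Int]("r_book")
  val cnt   = row.getAs[Long]("cnt")    // long in the schema, so Long here
  val name1 = row.getAs[String]("name")
  val name2 = row.getAs[String]("r_name")
  (book1, (book2, cnt, name1, name2))
}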

