Scala: extracting information from an `org.apache.spark.sql.Row`
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/28035832/
Extract information from an `org.apache.spark.sql.Row`
Asked by sds
I have an `Array[org.apache.spark.sql.Row]` returned by `sqc.sql(sqlcmd).collect()`:
```
Array([10479,6,10], [8975,149,640], ...)
```
I can get the individual values:
```scala
scala> pixels(0)(0)
res34: Any = 10479
```
but they are `Any`, not `Int`.
How do I extract them as `Int`?
The most obvious solution did not work:
```scala
scala> pixels(0).getInt(0)
java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer
```
PS. I can do `pixels(0)(0).toString.toInt` or `pixels(0).getString(0).toInt`, but they feel wrong...
Accepted answer by Justin Pihony
Using `getInt` should work. Here is a contrived example as a proof of concept:
```scala
import org.apache.spark.sql._
sc.parallelize(Array(1,2,3)).map(Row(_)).collect()(0).getInt(0)
```
This returns `1`.
However,
sc.parallelize(Array("1","2","3")).map(Row(_)).collect()(0).getInt(0)
fails. So it looks like the value is coming in as a string, and you will have to convert it to an int manually:
sc.parallelize(Array("1","2","3")).map(Row(_)).collect()(0).getString(0).toInt
The documentation states that `getInt`:
> Returns the value of column i as an int. This function will throw an exception if the value at i is not an integer, or if it is null.
So it seems it will not try to cast for you.
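To make the manual conversion concrete, here is a small sketch (not from the original answer; the literal value is hypothetical) showing two ways to turn an untyped column into an `Int`:

```scala
import org.apache.spark.sql.Row

// A row whose first column holds a string, as in the failing case above
val row = Row("10479")

// getInt would throw a ClassCastException here, so convert explicitly
val viaString: Int = row.getString(0).toInt

// Alternatively, pattern match on the untyped value to handle both cases
val viaMatch: Int = row(0) match {
  case s: String => s.toInt
  case i: Int    => i
  case other     => sys.error(s"unexpected value: $other")
}
```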
Answer by tgpfeiffer
The `Row` class (also see https://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.sql.package) has methods `getInt(i: Int)`, `getDouble(i: Int)`, etc.
Also note that a `SchemaRDD` is an `RDD[Row]` plus a schema that tells you which column has which data type. If you do `.collect()`, you will only get an `Array[Row]`, which does not have that information. So unless you know for sure what your data looks like, get the schema from the `SchemaRDD`, then collect the rows, and then access each field using the correct type information.
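A minimal sketch of that approach, assuming the Spark 1.1-era `SchemaRDD` API referenced above (`sqc` and `sqlcmd` are borrowed from the question):

```scala
val schemaRDD = sqc.sql(sqlcmd)

// Inspect each column's declared type before collecting
schemaRDD.schema.fields.foreach { f =>
  println(s"${f.name}: ${f.dataType}")
}

val rows = schemaRDD.collect()
// If the schema says the first column is an IntegerType, getInt(0) is safe
val first: Int = rows(0).getInt(0)
```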
Answer by Pankaj Narang
The answer above is relevant. You don't need to use `collect`; instead you can call the methods `getInt`, `getString`, and `getAs` (the latter in case the datatype is complex):
```scala
val popularHashTags = sqlContext.sql("SELECT hashtags, usersMentioned, Url FROM tweets")
var hashTagsList = popularHashTags.flatMap(x => x.getAs[Seq[String]](0))
```
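For completeness, a hypothetical sketch (not from the original answer) of `getAs` used for simple types as well, accessed by column position:

```scala
import org.apache.spark.sql.Row

// Hypothetical row, for illustration only
val row = Row(10479, "spark")

val count: Int  = row.getAs[Int](0)    // typed access by position
val tag: String = row.getAs[String](1)
```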

