Scala: extracting information from an `org.apache.spark.sql.Row`
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/28035832/
Extract information from an `org.apache.spark.sql.Row`
Asked by sds
I have an `Array[org.apache.spark.sql.Row]` returned by `sqc.sql(sqlcmd).collect()`:
```
Array([10479,6,10], [8975,149,640], ...)
```
I can get the individual values:
```scala
scala> pixels(0)(0)
res34: Any = 10479
```
but they are `Any`, not `Int`.
How do I extract them as `Int`?
The most obvious solution did not work:
```scala
scala> pixels(0).getInt(0)
java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer
```
PS. I can do `pixels(0)(0).toString.toInt` or `pixels(0).getString(0).toInt`, but they feel wrong...
Accepted answer by Justin Pihony
Using `getInt` should work. Here is a contrived example as a proof of concept:
```scala
import org.apache.spark.sql._
sc.parallelize(Array(1,2,3)).map(Row(_)).collect()(0).getInt(0)
```
This returns `1`.
However,
sc.parallelize(Array("1","2","3")).map(Row(_)).collect()(0).getInt(0)
fails. So it looks like the value is coming in as a string, and you will have to convert it to an int manually:
sc.parallelize(Array("1","2","3")).map(Row(_)).collect()(0).getString(0).toInt
The documentation states that `getInt`:
> Returns the value of column i as an int. This function will throw an exception if the value at i is not an integer, or if it is null.
So it seems it will not try to cast for you.
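To make the manual conversion concrete, here is a small sketch (not from the original answer; the literal value is hypothetical) showing two ways to turn an untyped column into an `Int`:

```scala
import org.apache.spark.sql.Row

// A row whose first column holds a string, as in the failing case above
val row = Row("10479")

// getInt would throw a ClassCastException here, so convert explicitly
val viaString: Int = row.getString(0).toInt

// Alternatively, pattern match on the untyped value to handle both cases
val viaMatch: Int = row(0) match {
  case s: String => s.toInt
  case i: Int    => i
  case other     => sys.error(s"unexpected value: $other")
}
```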
Answer by tgpfeiffer
The `Row` class (also see https://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.sql.package) has methods `getInt(i: Int)`, `getDouble(i: Int)`, etc.
Also note that a `SchemaRDD` is an `RDD[Row]` plus a schema that tells you which column has which data type. If you do `.collect()`, you will only get an `Array[Row]`, which does not have that information. So unless you know for sure what your data looks like, get the schema from the `SchemaRDD`, then collect the rows, and then access each field using the correct type information.
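A minimal sketch of that approach, assuming the Spark 1.1-era `SchemaRDD` API referenced above (`sqc` and `sqlcmd` are borrowed from the question):

```scala
val schemaRDD = sqc.sql(sqlcmd)

// Inspect each column's declared type before collecting
schemaRDD.schema.fields.foreach { f =>
  println(s"${f.name}: ${f.dataType}")
}

val rows = schemaRDD.collect()
// If the schema says the first column is an IntegerType, getInt(0) is safe
val first: Int = rows(0).getInt(0)
```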
Answer by Pankaj Narang
The answer above is relevant. You don't need to use `collect`; instead you can call the methods `getInt`, `getString`, and `getAs` (the latter in case the datatype is complex):
```scala
val popularHashTags = sqlContext.sql("SELECT hashtags, usersMentioned, Url FROM tweets")
var hashTagsList = popularHashTags.flatMap(x => x.getAs[Seq[String]](0))
```
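For completeness, a hypothetical sketch (not from the original answer) of `getAs` used for simple types as well, accessed by column position:

```scala
import org.apache.spark.sql.Row

// Hypothetical row, for illustration only
val row = Row(10479, "spark")

val count: Int  = row.getAs[Int](0)    // typed access by position
val tag: String = row.getAs[String](1)
```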

