
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must follow the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/33007840/

Date: 2020-10-22 07:41:57  Source: igfitidea

Spark extracting values from a Row

scala apache-spark apache-spark-sql

Asked by Sam

I have the following DataFrame:

val transactions_with_counts = sqlContext.sql(
  """SELECT user_id AS user_id, category_id AS category_id,
  COUNT(category_id) FROM transactions GROUP BY user_id, category_id""")

I'm trying to convert the rows to Rating objects, but since x(0) returns Any rather than a typed value, this fails:

val ratings = transactions_with_counts
  .map(x => Rating(x(0).toInt, x(1).toInt, x(2).toInt))

error: value toInt is not a member of Any

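For context: Row's untyped apply method returns Any, which is why the compiler rejects toInt. A minimal workaround is an explicit cast, sketched here under the assumption that the columns come back in the query's order and that COUNT(...) produces a Long:

```scala
// Casts restore static types, but will fail at runtime if the actual
// column types differ from what is assumed here.
val ratings = transactions_with_counts.map(x =>
  Rating(x(0).asInstanceOf[Int], x(1).asInstanceOf[Int],
         x(2).asInstanceOf[Long].toInt)
)
```

The typed accessors shown in the answers below are generally safer than bare casts.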
Answered by zero323

Let's start with some dummy data:

val transactions = Seq((1, 2), (1, 4), (2, 3)).toDF("user_id", "category_id")

val transactions_with_counts = transactions
  .groupBy($"user_id", $"category_id")
  .count

transactions_with_counts.printSchema

// root
// |-- user_id: integer (nullable = false)
// |-- category_id: integer (nullable = false)
// |-- count: long (nullable = false)

There are a few ways to access Row values and keep the expected types:

  1. Pattern matching

    import org.apache.spark.sql.Row
    
    transactions_with_counts.map{
      case Row(user_id: Int, category_id: Int, rating: Long) =>
        Rating(user_id, category_id, rating)
    } 
    
  2. Typed get* methods like getInt and getLong:

    transactions_with_counts.map(
      r => Rating(r.getInt(0), r.getInt(1), r.getLong(2))
    )
    
  3. The getAs method, which can use both names and indices:

    transactions_with_counts.map(r => Rating(
      r.getAs[Int]("user_id"), r.getAs[Int]("category_id"), r.getAs[Long](2)
    ))
    

    It can be used to properly extract user-defined types, including mllib.linalg.Vector. Note that accessing by name requires a schema.

  4. Converting to a statically typed Dataset (Spark 1.6+ / 2.0+):

    transactions_with_counts.as[(Int, Int, Long)]
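The steps above can be combined into a short end-to-end sketch; it assumes a Rating case class like the one in the question, taking the count as its third field:

```scala
// Sketch: convert the aggregated DataFrame to a typed Dataset of tuples,
// then map each tuple into a Rating (count narrowed from Long to Int).
val ratings = transactions_with_counts
  .as[(Int, Int, Long)]
  .map { case (userId, categoryId, count) =>
    Rating(userId, categoryId, count.toInt)
  }
```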
    

Answered by user-asterix

Using Datasets, you can define Rating as follows:

case class Rating(user_id: Int, category_id:Int, count:Long)

The Rating class here uses the field name 'count' instead of 'rating' as in zero323's answer. The rating value is then assigned as follows:

val transactions_with_counts = transactions.groupBy($"user_id", $"category_id").count

val rating = transactions_with_counts.as[Rating]

This way you will not run into runtime errors, because your Rating class field name is identical to the 'count' column name that Spark generates at runtime.
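If you would rather keep a field called 'rating', a hedged alternative is to rename the generated column with the standard withColumnRenamed method before converting (the Rated class name below is hypothetical):

```scala
// Sketch: rename Spark's generated "count" column so it lines up with a
// case class field named "rating".
case class Rated(user_id: Int, category_id: Int, rating: Long)

val rated = transactions_with_counts
  .withColumnRenamed("count", "rating")
  .as[Rated]
```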

Answered by Sarath Avanavu

Another option is to collect the DataFrame to the driver with rdd.collect and iterate over the rows with a for loop.

Suppose your DataFrame looks like this:

val df = Seq(
      (1,"James"),    
      (2,"Albert"),
      (3,"Pete")).toDF("user_id","name")

Call rdd.collect on top of your DataFrame. The row variable will then hold each row of the DataFrame as a Row. To get the elements of a row, row.mkString(",") returns the row's values as one comma-separated string, and the built-in split function then lets you access each column value by index.

for (row <- df.rdd.collect)
{
    val user_id = row.mkString(",").split(",")(0)
    val name    = row.mkString(",").split(",")(1)
}

This code looks a bit more verbose than a dataframe.foreach loop, but it gives you more control over your logic.
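The same loop can also be written with Row's typed getters instead of the mkString/split round-trip, which avoids breaking when a value itself contains a comma. A sketch against the df defined above:

```scala
// Sketch: read each column directly by position with typed getters.
for (row <- df.rdd.collect) {
  val user_id = row.getInt(0)      // column "user_id"
  val name    = row.getString(1)   // column "name"
}
```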