Scala: fetching distinct values on a column using Spark DataFrame

Question

Asked by Kazhiyur

Using Spark 1.6.1, I need to fetch the distinct values of a column and then perform a specific transformation on each of them. The column contains more than 50 million records and can grow larger.
I understand that calling distinct.collect() brings the results back to the driver program. Currently I am performing this task as shown below; is there a better approach?

 import sqlContext.implicits._
 preProcessedData.persist(StorageLevel.MEMORY_AND_DISK_2)

 preProcessedData.select(ApplicationId).distinct.collect().foreach(x => {
   val applicationId = x.getAs[String](ApplicationId)
   val selectedApplicationData = preProcessedData.filter($"$ApplicationId" === applicationId)
   // DO SOME TASK PER applicationId
 })

 preProcessedData.unpersist()

Answer 1

Answered by Alberto Bonsanto

To obtain all the distinct values in a DataFrame you can use distinct. As you can see in the documentation, that method returns another DataFrame. After that, you can create a UDF to transform each record.

For example:

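import org.apache.spark.sql.functions.{col, udf}
import sqlContext.implicits._ // needed for toDF; assumes a SQLContext named sqlContext is in scope

val df = sc.parallelize(Array((1, 2), (3, 4), (1, 6))).toDF("age", "salary")

// Obtain all distinct values of "age". Calling show() on this should display only 1 and 3.
val distinctValuesDF = df.select(df("age")).distinct

// Define your UDF. This one is a simple function, but UDFs can get complicated.
// The parameter type must be given explicitly; with Int this is integer division.
val myTransformationUDF = udf((value: Int) => value / 10)

// Run that transformation "over" your DataFrame
val afterTransformationDF = distinctValuesDF.select(myTransformationUDF(col("age")))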
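
As for the loop in the question, collecting the distinct ids and then filtering once per id triggers one Spark job per distinct value. If the per-application task can be expressed as an aggregation (an assumption, since the question does not say what the task is), a single groupBy keeps the work distributed. The following is only a minimal sketch against the Spark 1.6 API; the sample rows, the "value" column, and the chosen aggregates are hypothetical stand-ins, not taken from the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{avg, count}

object PerApplicationAggregation {
  def main(args: Array[String]): Unit = {
    // Local master only so the sketch can run on its own.
    val conf = new SparkConf().setAppName("per-application-aggregation").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Hypothetical stand-in for preProcessedData from the question.
    val preProcessedData = sc.parallelize(Seq(
      ("app-1", 10.0),
      ("app-2", 20.0),
      ("app-1", 30.0)
    )).toDF("ApplicationId", "value")

    // One distributed pass: rows are grouped by ApplicationId on the executors,
    // so no distinct list is collected to the driver and no per-id filter is run.
    val perApplication = preProcessedData
      .groupBy($"ApplicationId")
      .agg(count($"value").as("rows"), avg($"value").as("meanValue"))

    perApplication.show()
    sc.stop()
  }
}
```

This only helps when the task fits an aggregate expression or a UDF; work that genuinely needs a full DataFrame per applicationId would still require some form of per-key processing.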
