Scala: how to count occurrences of each distinct value for every column in a dataframe?
Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must follow the same CC BY-SA license, link to the original, and attribute it to the original authors (not me): StackOverflow
原文地址: http://stackoverflow.com/questions/37949494/
How to count occurrences of each distinct value for every column in a dataframe?
Asked by Leothorn
edf.select("x").distinct.show() shows the distinct values that are present in the x column of the edf DataFrame.
Is there an efficient method to also show the number of times these distinct values occur in the data frame? (count for each distinct value)
Answered by zero323
countDistinct is probably the first choice:
import org.apache.spark.sql.functions.countDistinct
df.agg(countDistinct("some_column"))
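The snippet above returns the distinct count for a single column. Since the question title asks about every column, the same aggregate can be built for all columns in one job. This is a minimal sketch, not from the original answer, assuming df is an existing DataFrame:

```scala
import org.apache.spark.sql.functions.countDistinct

// One countDistinct expression per column, evaluated in a single aggregation.
val aggs = df.columns.map(c => countDistinct(c).as(s"distinct_$c"))
val distinctCounts = df.agg(aggs.head, aggs.tail: _*)
distinctCounts.show()
```

The result is a one-row DataFrame with one distinct-count column per input column.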
If speed is more important than accuracy, you may consider approx_count_distinct (approxCountDistinct in Spark 1.x):
import org.apache.spark.sql.functions.approx_count_distinct
df.agg(approx_count_distinct("some_column"))
To get values and counts:
df.groupBy("some_column").count()
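To get the value-to-count table for every column rather than just one, the groupBy can be repeated per column. A hedged sketch (one Spark job per column, so it can be heavy on wide tables), assuming df is an existing DataFrame:

```scala
import org.apache.spark.sql.functions.desc

// For each column, print every distinct value with its occurrence count,
// most frequent values first.
df.columns.foreach { c =>
  df.groupBy(c).count().orderBy(desc("count")).show()
}
```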
In SQL (spark-sql):
SELECT COUNT(DISTINCT some_column) FROM df
and
SELECT approx_count_distinct(some_column) FROM df
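For these SQL queries to work against a DataFrame from Scala, it must first be registered as a view whose name matches the FROM clause. A sketch of that setup, assuming a SparkSession named spark and a DataFrame df:

```scala
// Register df as a temporary view named "df" so the SQL above resolves.
df.createOrReplaceTempView("df")
spark.sql("SELECT COUNT(DISTINCT some_column) FROM df").show()
spark.sql("SELECT approx_count_distinct(some_column) FROM df").show()
```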
Answered by Antoni
Another option without resorting to sql functions
df.groupBy('your_column_name').count().show()
show will print the distinct values and their occurrence counts. Without show, the result is a DataFrame.
Answered by user10232195
import org.apache.spark.sql.functions.countDistinct
df.groupBy("a").agg(countDistinct("s")).collect()
Answered by shengshan zhang
df.select("some_column").distinct.count
Answered by ForeverLearner
If you are using Java, the import org.apache.spark.sql.functions.countDistinct; will give an error:
The import org.apache.spark.sql.functions.countDistinct cannot be resolved
To use countDistinct in Java, use the format below:
import org.apache.spark.sql.functions.*;
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.*;
df.agg(functions.countDistinct("some_column"));


