
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/37949494/

Date: 2020-10-22 08:24:12  Source: igfitidea

How to count occurrences of each distinct value for every column in a dataframe?

scala, apache-spark

Asked by Leothorn

edf.select("x").distinct.show() shows the distinct values present in the x column of the edf DataFrame.

Is there an efficient method to also show the number of times these distinct values occur in the data frame? (count for each distinct value)


Answered by zero323

countDistinct is probably the first choice:

import org.apache.spark.sql.functions.countDistinct

df.agg(countDistinct("some_column"))

If speed is more important than accuracy, you may consider approx_count_distinct (approxCountDistinct in Spark 1.x):

import org.apache.spark.sql.functions.approx_count_distinct

df.agg(approx_count_distinct("some_column"))
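The two aggregations above return a single number: how many distinct values the column holds. The semantics can be sketched in plain in-memory Scala (an illustration only, with made-up names; Spark computes this in a distributed way):

```scala
// Toy analogue of Spark's countDistinct over an in-memory column.
// Illustrates the semantics only; not Spark's implementation.
object CountDistinctSketch {
  def countDistinct[A](column: Seq[A]): Long =
    column.distinct.size.toLong

  def main(args: Array[String]): Unit = {
    val someColumn = Seq("a", "b", "a", "c", "b", "a")
    println(countDistinct(someColumn)) // prints 3
  }
}
```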

To get values and counts:


df.groupBy("some_column").count()
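Since the question asks for counts for every column, one option is to loop over df.columns, e.g. df.columns.foreach(c => df.groupBy(c).count().show()). What groupBy(...).count() produces for one column can be sketched with plain Scala collections (an in-memory analogue for illustration, not Spark's implementation):

```scala
// Toy analogue of df.groupBy("some_column").count(): occurrences of each
// distinct value. Plain in-memory Scala for illustration, not Spark.
object ValueCountsSketch {
  def valueCounts[A](column: Seq[A]): Map[A, Long] =
    column.groupBy(identity).map { case (v, occurrences) =>
      v -> occurrences.size.toLong
    }

  def main(args: Array[String]): Unit = {
    val someColumn = Seq("a", "b", "a", "c", "a")
    valueCounts(someColumn).foreach { case (v, n) => println(s"$v -> $n") }
  }
}
```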

In SQL (spark-sql):


SELECT COUNT(DISTINCT some_column) FROM df

and

SELECT approx_count_distinct(some_column) FROM df

Answered by Antoni

Another option, without resorting to SQL functions:

df.groupBy("your_column_name").count().show()

show will print the distinct values and their counts. Without show, the result is a DataFrame.

Answered by user10232195

import org.apache.spark.sql.functions.countDistinct

df.groupBy("a").agg(countDistinct("s")).collect()

Answered by shengshan zhang

df.select("some_column").distinct.count

Answered by ForeverLearner

If you are using Java, import org.apache.spark.sql.functions.countDistinct; will give an error: "The import org.apache.spark.sql.functions.countDistinct cannot be resolved"

To use countDistinct in Java, use the format below:

import org.apache.spark.sql.functions.*;
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.*;

df.agg(functions.countDistinct("some_column"));

Answered by Saurav Sahu

Roughly speaking, how it works:

[The original answer illustrated this with two images, which are not reproduced here.]