Scala: how to count occurrences of each distinct value for every column in a dataframe?
Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must follow the same CC BY-SA license, link to the original, and attribute it to the original authors (not me): StackOverflow
原文地址: http://stackoverflow.com/questions/37949494/
How to count occurrences of each distinct value for every column in a dataframe?
Asked by Leothorn
edf.select("x").distinct.show() shows the distinct values that are present in the x column of the edf DataFrame.
Is there an efficient method to also show the number of times these distinct values occur in the data frame? (count for each distinct value)
Answered by zero323
countDistinct is probably the first choice:
import org.apache.spark.sql.functions.countDistinct
df.agg(countDistinct("some_column"))
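The snippet above returns the distinct count for a single column. Since the question title asks about every column, the same aggregate can be built for all columns in one job. This is a minimal sketch, not from the original answer, assuming df is an existing DataFrame:

```scala
import org.apache.spark.sql.functions.countDistinct

// One countDistinct expression per column, evaluated in a single aggregation.
val aggs = df.columns.map(c => countDistinct(c).as(s"distinct_$c"))
val distinctCounts = df.agg(aggs.head, aggs.tail: _*)
distinctCounts.show()
```

The result is a one-row DataFrame with one distinct-count column per input column.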
If speed is more important than accuracy, you may consider approx_count_distinct (approxCountDistinct in Spark 1.x):
import org.apache.spark.sql.functions.approx_count_distinct
df.agg(approx_count_distinct("some_column"))
To get values and counts:
df.groupBy("some_column").count()
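To get the value-to-count table for every column rather than just one, the groupBy can be repeated per column. A hedged sketch (one Spark job per column, so it can be heavy on wide tables), assuming df is an existing DataFrame:

```scala
import org.apache.spark.sql.functions.desc

// For each column, print every distinct value with its occurrence count,
// most frequent values first.
df.columns.foreach { c =>
  df.groupBy(c).count().orderBy(desc("count")).show()
}
```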
In SQL (spark-sql):
SELECT COUNT(DISTINCT some_column) FROM df
and
SELECT approx_count_distinct(some_column) FROM df
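For these SQL queries to work against a DataFrame from Scala, it must first be registered as a view whose name matches the FROM clause. A sketch of that setup, assuming a SparkSession named spark and a DataFrame df:

```scala
// Register df as a temporary view named "df" so the SQL above resolves.
df.createOrReplaceTempView("df")
spark.sql("SELECT COUNT(DISTINCT some_column) FROM df").show()
spark.sql("SELECT approx_count_distinct(some_column) FROM df").show()
```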
Answered by Antoni
Another option without resorting to sql functions
df.groupBy('your_column_name').count().show()
show will print the distinct values and their occurrence counts. Without show, the result is a DataFrame.
Answered by user10232195
import org.apache.spark.sql.functions.countDistinct
df.groupBy("a").agg(countDistinct("s")).collect()
Answered by shengshan zhang
df.select("some_column").distinct.count
Answered by ForeverLearner
If you are using Java, the import org.apache.spark.sql.functions.countDistinct; will give an error:
The import org.apache.spark.sql.functions.countDistinct cannot be resolved
To use countDistinct in Java, use the format below:
import org.apache.spark.sql.functions.*;
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.*;
df.agg(functions.countDistinct("some_column"));


