Original URL: http://stackoverflow.com/questions/42451189/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
How to calculate the counts of each distinct value in a pyspark dataframe?
Asked by madsthaks
I have a column filled with a bunch of states' initials as strings. My goal is to count how many times each state appears in that column.
For example: (("TX": 3), ("NJ": 2)) should be the output when there are three occurrences of "TX" and two occurrences of "NJ".
I'm fairly new to pyspark so I'm stumped with this problem. Any help would be much appreciated.
Answer by eddies
I think you're looking for the DataFrame idiom of groupBy and count.
For example, given the following dataframe, one state per row:
df = sqlContext.createDataFrame([('TX',), ('NJ',), ('TX',), ('CA',), ('NJ',)], ('state',))
df.show()
+-----+
|state|
+-----+
| TX|
| NJ|
| TX|
| CA|
| NJ|
+-----+
The following yields:
df.groupBy('state').count().show()
+-----+-----+
|state|count|
+-----+-----+
| TX| 2|
| NJ| 2|
| CA| 1|
+-----+-----+
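
If you also want the result in a plain Python structure like the one in the question, you can collect the grouped counts back to the driver. A minimal sketch, assuming the df from above and that the number of distinct states is small enough to collect:

import pyspark.sql.functions as F

# Order states by frequency, most common first (optional)
counts = df.groupBy('state').count().orderBy(F.desc('count'))

# Each collected Row behaves like a (state, count) tuple,
# so this builds e.g. {'TX': 2, 'NJ': 2, 'CA': 1}
state_counts = dict(counts.collect())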
Answer by gench
import pandas as pd
import pyspark.sql.functions as F

def value_counts(spark_df, colm, order=1, n=10):
    """
    Count top n values in the given column and show in the given order

    Parameters
    ----------
    spark_df : pyspark.sql.dataframe.DataFrame
        Data
    colm : string
        Name of the column to count values in
    order : int, default=1
        1: sort the column descending by value counts and keep nulls at top
        2: sort the column ascending by values
        3: sort the column descending by values
        4: do 2 and 3 (combine top n and bottom n after sorting the column by values ascending)
    n : int, default=10
        Number of top values to display

    Returns
    ----------
    Value counts in pandas dataframe
    """
    # Aggregate once, then order according to the requested mode
    counts = spark_df.select(colm).groupBy(colm).count()
    if order == 1:
        return pd.DataFrame(counts.orderBy(F.desc_nulls_first("count")).head(n), columns=["value", "count"])
    if order == 2:
        return pd.DataFrame(counts.orderBy(F.asc(colm)).head(n), columns=["value", "count"])
    if order == 3:
        return pd.DataFrame(counts.orderBy(F.desc(colm)).head(n), columns=["value", "count"])
    if order == 4:
        return pd.concat([pd.DataFrame(counts.orderBy(F.asc(colm)).head(n), columns=["value", "count"]),
                          pd.DataFrame(counts.orderBy(F.desc(colm)).head(n), columns=["value", "count"])])
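
A hypothetical call against the df from the first answer might look like this (the column name and n are just for illustration):

# Top states by frequency, nulls (if any) first
value_counts(df, 'state', order=1, n=10)

# States sorted by value, ascending
value_counts(df, 'state', order=2, n=10)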