Python: How to calculate the counts of each distinct value in a pyspark dataframe?

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/42451189/

How to calculate the counts of each distinct value in a pyspark dataframe?

python, dataframe, pyspark

Asked by madsthaks

I have a column filled with a bunch of states' initials as strings. My goal is to count how many times each state appears in that column.

For example: (("TX":3),("NJ":2)) should be the output when "TX" occurs three times and "NJ" occurs twice.

I'm fairly new to pyspark, so I'm stumped by this problem. Any help would be much appreciated.

Answered by eddies

I think you're looking to use the DataFrame idiom of groupBy and count.

For example, given the following dataframe, one state per row:

# Create a one-column DataFrame of state abbreviations
# (sqlContext is the SQLContext that the PySpark shell provides).
df = sqlContext.createDataFrame([('TX',), ('NJ',), ('TX',), ('CA',), ('NJ',)], ('state',))
df.show()
+-----+
|state|
+-----+
|   TX|
|   NJ|
|   TX|
|   CA|
|   NJ|
+-----+

The following yields:

df.groupBy('state').count().show()
+-----+-----+
|state|count|
+-----+-----+
|   TX|    2|
|   NJ|    2|
|   CA|    1|
+-----+-----+
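
If you want the counts back on the driver as a plain Python mapping, closer to the (("TX":3),("NJ":2)) shape asked for in the question, a minimal sketch (assuming the df above, and that the set of distinct states is small enough to collect) is:

# Collect the grouped counts and build a state -> count dict.
# Only safe when the number of distinct values fits in driver memory.
state_counts = {row['state']: row['count']
                for row in df.groupBy('state').count().collect()}
print(state_counts)  # e.g. {'TX': 2, 'NJ': 2, 'CA': 1}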

Answered by gench

import pandas as pd
import pyspark.sql.functions as F

def value_counts(spark_df, colm, order=1, n=10):
    """
    Count top n values in the given column and show in the given order

    Parameters
    ----------
    spark_df : pyspark.sql.dataframe.DataFrame
        Data
    colm : string
        Name of the column to count values in
    order : int, default=1
        1: sort the column descending by value counts and keep nulls at top
        2: sort the column ascending by values
        3: sort the column descending by values
        4: do 2 and 3 (combine top n and bottom n after sorting the column by values ascending)
    n : int, default=10
        Number of top values to display

    Returns
    -------
    Value counts in a pandas DataFrame
    """
    # Aggregate once; every ordering below reuses the same grouped counts.
    counts = spark_df.groupBy(colm).count()

    if order == 1:
        # Most frequent values first, with nulls surfaced at the top
        return pd.DataFrame(counts.orderBy(F.desc_nulls_first("count")).head(n),
                            columns=["value", "count"])
    if order == 2:
        return pd.DataFrame(counts.orderBy(F.asc(colm)).head(n),
                            columns=["value", "count"])
    if order == 3:
        return pd.DataFrame(counts.orderBy(F.desc(colm)).head(n),
                            columns=["value", "count"])
    if order == 4:
        # Top n and bottom n of the values, concatenated
        return pd.concat([
            pd.DataFrame(counts.orderBy(F.asc(colm)).head(n), columns=["value", "count"]),
            pd.DataFrame(counts.orderBy(F.desc(colm)).head(n), columns=["value", "count"]),
        ])
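
A quick usage sketch, assuming the df from the first answer is in scope; rows that tie on count may come back in either order:

# Top values of the 'state' column, most frequent first.
value_counts(df, 'state', order=1, n=5)
#   value  count
# 0    TX      2
# 1    NJ      2
# 2    CA      1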