Python Pyspark: show histogram of a data frame column
Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must follow the same license, cite the original link, and attribute it to the original authors (not me): StackOverflow
Original link: http://stackoverflow.com/questions/39154325/
Pyspark: show histogram of a data frame column
Asked by Edamame
In pandas data frame, I am using the following code to plot histogram of a column:
my_df.hist(column = 'field_1')
Is there something that can achieve the same goal in pyspark data frame? (I am in Jupyter Notebook) Thanks!
Answered by Shivam Gaur
Unfortunately I don't think that there's a clean plot() or hist() function in the PySpark DataFrames API, but I'm hoping that things will eventually go in that direction.
For the time being, you could compute the histogram in Spark, and plot the computed histogram as a bar chart. Example:
import pandas as pd
import pyspark.sql as sparksql
# Let's use UCLA's college admission dataset
file_name = "https://stats.idre.ucla.edu/stat/data/binary.csv"
# Creating a pandas dataframe from Sample Data
df_pd = pd.read_csv(file_name)
sql_context = sparksql.SQLContext(sc)
# Creating a Spark DataFrame from a pandas dataframe
df_spark = sql_context.createDataFrame(df_pd)
df_spark.show(5)
This is what the data looks like:
Out[]: +-----+---+----+----+
|admit|gre| gpa|rank|
+-----+---+----+----+
| 0|380|3.61| 3|
| 1|660|3.67| 3|
| 1|800| 4.0| 1|
| 1|640|3.19| 4|
| 0|520|2.93| 4|
+-----+---+----+----+
only showing top 5 rows
# This is what we want
df_pd.hist('gre');
(Image: histogram when plotted using df_pd.hist())
# Doing the heavy lifting in Spark. We could leverage the `histogram` function from the RDD api
gre_histogram = df_spark.select('gre').rdd.flatMap(lambda x: x).histogram(11)
# Loading the Computed Histogram into a Pandas Dataframe for plotting
pd.DataFrame(
    list(zip(*gre_histogram)),
    columns=['bin', 'frequency']
).set_index(
    'bin'
).plot(kind='bar');
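For reference, histogram(11) returns a pair of lists: the bucket boundaries (one more element than the number of buckets) and the per-bucket counts, so zip(*gre_histogram) pairs each count with the left edge of its bucket and drops the final boundary. A quick sanity check:
bin_edges, counts = gre_histogram
# 12 boundaries for 11 buckets; zip(*...) above drops the last edge
assert len(bin_edges) == len(counts) + 1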
Answered by Chris van den Berg
You can now use the pyspark_dist_explore package to leverage the matplotlib hist function for Spark DataFrames:
from pyspark_dist_explore import hist
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
hist(ax, data_frame, bins = 20, color=['red'])
This library uses the rdd histogram function to calculate bin values.
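For example, with the df_spark DataFrame (the UCLA admissions data) from the first answer, plotting the gre column would look roughly like this sketch:
from pyspark_dist_explore import hist
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
hist(ax, df_spark.select('gre'), bins=20, color=['red'])
ax.set_title('gre')
plt.show()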
Answered by Andrew
The histogram method for RDDs returns the bin ranges and the bin counts. Here's a function that takes this histogram data and plots it as a histogram.
import numpy as np
import matplotlib.pyplot as mplt
import matplotlib.ticker as mtick

def plotHistogramData(data):
    binSides, binCounts = data

    N = len(binCounts)
    ind = np.arange(N)
    width = 1

    fig, ax = mplt.subplots()
    rects1 = ax.bar(ind + 0.5, binCounts, width, color='b')

    ax.set_ylabel('Frequencies')
    ax.set_title('Histogram')
    ax.set_xticks(np.arange(N + 1))
    ax.set_xticklabels(binSides)
    ax.xaxis.set_major_formatter(mtick.FormatStrFormatter('%.2e'))
    ax.yaxis.set_major_formatter(mtick.FormatStrFormatter('%.2e'))

    mplt.show()
(This code assumes that bins have equal length.)
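A usage sketch, assuming the df_spark DataFrame with its numeric gre column from the first answer:
# Compute (bin boundaries, counts) with the RDD histogram method, then plot
gre_data = df_spark.select('gre').rdd.flatMap(lambda x: x).histogram(10)
plotHistogramData(gre_data)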
Answered by Elior Malul
Another solution, which needs no extra imports and should also be efficient. First, use a window partition:
import pyspark.sql.functions as F
import pyspark.sql as SQL
win = SQL.Window.partitionBy('column_of_values')
Then all you need is the count aggregation partitioned by the window:
df.select(F.count('column_of_values').over(win).alias('histogram'))
The aggregation happens on each partition of the cluster and does not require an extra round-trip to the host.
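Note that this counts occurrences of each distinct value rather than binning a numeric range, so it suits low-cardinality columns best. A minimal sketch of collecting and plotting the result (variable names are assumptions, reusing df, F and win from above):
# Keep one (value, count) pair per distinct value; the window count repeats per row
value_counts = (
    df.select('column_of_values',
              F.count('column_of_values').over(win).alias('histogram'))
      .distinct()
      .toPandas()
)
value_counts.set_index('column_of_values')['histogram'].plot(kind='bar')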
Answered by conner.xyz
This is straightforward and works well.
df.groupby(
    '<group-index>'
).count().select(
    'count'
).rdd.flatMap(
    lambda x: x
).histogram(20)
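The call returns the same (bin boundaries, counts) pair as the RDD histogram method above, so the result can be fed to any of the plotting approaches already shown. A minimal sketch using pandas, with '<group-index>' still a placeholder for the actual grouping column:
import pandas as pd

bins, counts = (
    df.groupby('<group-index>')
      .count()
      .select('count')
      .rdd.flatMap(lambda x: x)
      .histogram(20)
)

# histogram(20) returns 21 boundaries and 20 counts; drop the last boundary
pd.DataFrame({'frequency': counts}, index=bins[:-1]).plot(kind='bar');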