Python Pyspark: show histogram of a data frame column
Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must follow the same license, cite the original link, and attribute it to the original authors (not me): StackOverflow
Original link: http://stackoverflow.com/questions/39154325/
Pyspark: show histogram of a data frame column
Asked by Edamame
In pandas data frame, I am using the following code to plot histogram of a column:
my_df.hist(column = 'field_1')
Is there something that can achieve the same goal in pyspark data frame? (I am in Jupyter Notebook) Thanks!
Answered by Shivam Gaur
Unfortunately I don't think that there's a clean plot() or hist() function in the PySpark DataFrames API, but I'm hoping that things will eventually go in that direction.
For the time being, you could compute the histogram in Spark, and plot the computed histogram as a bar chart. Example:
import pandas as pd
import pyspark.sql as sparksql
# Let's use UCLA's college admission dataset
file_name = "https://stats.idre.ucla.edu/stat/data/binary.csv"
# Creating a pandas dataframe from Sample Data
df_pd = pd.read_csv(file_name)
sql_context = sparksql.SQLContext(sc)
# Creating a Spark DataFrame from a pandas dataframe
df_spark = sql_context.createDataFrame(df_pd)
df_spark.show(5)
This is what the data looks like:
Out[]: +-----+---+----+----+
|admit|gre| gpa|rank|
+-----+---+----+----+
| 0|380|3.61| 3|
| 1|660|3.67| 3|
| 1|800| 4.0| 1|
| 1|640|3.19| 4|
| 0|520|2.93| 4|
+-----+---+----+----+
only showing top 5 rows
# This is what we want
df_pd.hist('gre');
(Image: histogram when plotted using df_pd.hist())
# Doing the heavy lifting in Spark. We could leverage the `histogram` function from the RDD api
gre_histogram = df_spark.select('gre').rdd.flatMap(lambda x: x).histogram(11)
# Loading the Computed Histogram into a Pandas Dataframe for plotting
pd.DataFrame(
    list(zip(*gre_histogram)),
    columns=['bin', 'frequency']
).set_index(
    'bin'
).plot(kind='bar');
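For reference, histogram(11) returns a pair of lists: the bucket boundaries (one more element than the number of buckets) and the per-bucket counts, so zip(*gre_histogram) pairs each count with the left edge of its bucket and drops the final boundary. A quick sanity check:
bin_edges, counts = gre_histogram
# 12 boundaries for 11 buckets; zip(*...) above drops the last edge
assert len(bin_edges) == len(counts) + 1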
Answered by Chris van den Berg
You can now use the pyspark_dist_explore package to leverage the matplotlib hist function for Spark DataFrames:
from pyspark_dist_explore import hist
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
hist(ax, data_frame, bins = 20, color=['red'])
This library uses the rdd histogram function to calculate bin values.
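For example, with the df_spark DataFrame (the UCLA admissions data) from the first answer, plotting the gre column would look roughly like this sketch:
from pyspark_dist_explore import hist
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
hist(ax, df_spark.select('gre'), bins=20, color=['red'])
ax.set_title('gre')
plt.show()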
Answered by Andrew
The histogram method for RDDs returns the bin ranges and the bin counts. Here's a function that takes this histogram data and plots it as a histogram.
import numpy as np
import matplotlib.pyplot as mplt
import matplotlib.ticker as mtick

def plotHistogramData(data):
    binSides, binCounts = data

    N = len(binCounts)
    ind = np.arange(N)
    width = 1

    fig, ax = mplt.subplots()
    rects1 = ax.bar(ind + 0.5, binCounts, width, color='b')

    ax.set_ylabel('Frequencies')
    ax.set_title('Histogram')
    ax.set_xticks(np.arange(N + 1))
    ax.set_xticklabels(binSides)
    ax.xaxis.set_major_formatter(mtick.FormatStrFormatter('%.2e'))
    ax.yaxis.set_major_formatter(mtick.FormatStrFormatter('%.2e'))

    mplt.show()
(This code assumes that bins have equal length.)
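A usage sketch, assuming the df_spark DataFrame with its numeric gre column from the first answer:
# Compute (bin boundaries, counts) with the RDD histogram method, then plot
gre_data = df_spark.select('gre').rdd.flatMap(lambda x: x).histogram(10)
plotHistogramData(gre_data)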
Answered by Elior Malul
Another solution, which needs no extra imports and should also be efficient. First, use a window partition:
import pyspark.sql.functions as F
import pyspark.sql as SQL
win = SQL.Window.partitionBy('column_of_values')
Then all you need is the count aggregation partitioned by the window:
df.select(F.count('column_of_values').over(win).alias('histogram'))
The aggregation happens on each partition of the cluster and does not require an extra round-trip to the host.
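Note that this counts occurrences of each distinct value rather than binning a numeric range, so it suits low-cardinality columns best. A minimal sketch of collecting and plotting the result (variable names are assumptions, reusing df, F and win from above):
# Keep one (value, count) pair per distinct value; the window count repeats per row
value_counts = (
    df.select('column_of_values',
              F.count('column_of_values').over(win).alias('histogram'))
      .distinct()
      .toPandas()
)
value_counts.set_index('column_of_values')['histogram'].plot(kind='bar')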
Answered by conner.xyz
This is straightforward and works well.
df.groupby(
    '<group-index>'
).count().select(
    'count'
).rdd.flatMap(
    lambda x: x
).histogram(20)
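The call returns the same (bin boundaries, counts) pair as the RDD histogram method above, so the result can be fed to any of the plotting approaches already shown. A minimal sketch using pandas, with '<group-index>' still a placeholder for the actual grouping column:
import pandas as pd

bins, counts = (
    df.groupby('<group-index>')
      .count()
      .select('count')
      .rdd.flatMap(lambda x: x)
      .histogram(20)
)

# histogram(20) returns 21 boundaries and 20 counts; drop the last boundary
pd.DataFrame({'frequency': counts}, index=bins[:-1]).plot(kind='bar');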