Python: Making a histogram with a Spark DataFrame column

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must distribute it under the same license and attribute it to the original authors (not me), citing the original: http://stackoverflow.com/questions/36043256/

Date: 2020-08-19 17:19:51  Source: igfitidea

Making a histogram with a Spark DataFrame column

Tags: python, pandas, apache-spark, pyspark, apache-spark-sql

Asked by user2857014

I am trying to make a histogram with a column from a DataFrame, which looks like

DataFrame[C0: int, C1: int, ...]

If I were to make a histogram with the column C1, what should I do?

Some things I have tried are

df.groupBy("C1").count().histogram()
df.C1.countByValue()

These do not work because of a mismatch in data types.

Answered by zero323

You can use the histogram_numeric Hive UDAF:

import random
from pyspark.sql import HiveContext  # Spark 1.x entry point for Hive UDAFs

random.seed(323)

sqlContext = HiveContext(sc)  # assumes an existing SparkContext `sc`
n = 3  # number of buckets
df = sqlContext.createDataFrame(
    sc.parallelize(enumerate(random.random() for _ in range(1000))),
    ["id", "v"]
)

hists = df.selectExpr("histogram_numeric({0}, {1})".format("v", n))

hists.show(1, False)
## +------------------------------------------------------------------------------------+
## |histogram_numeric(v,3)                                                              |
## +------------------------------------------------------------------------------------+
## |[[0.2124888140177466,415.0], [0.5918851340384337,330.0], [0.8890271451209697,255.0]]|
## +------------------------------------------------------------------------------------+
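
histogram_numeric returns an array of (x, y) structs, where x is an approximate bin center and y its height. A minimal sketch for unpacking that into plain Python lists, assuming the default Hive field names x and y:

# Collect the single result row and unpack the array of structs
points = hists.collect()[0][0]
centers = [p.x for p in points]  # approximate bin centers
heights = [p.y for p in points]  # bin heights (counts)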

You can also extract the column of interest and use the histogram method on the underlying RDD:

df.select("v").rdd.flatMap(lambda x: x).histogram(n)
## ([0.002028109534323752,
##  0.33410233677189705,
##  0.6661765640094703,
##  0.9982507912470436],
## [327, 326, 347])

Answered by lanenok

What worked for me is

df.groupBy("C1").count().rdd.values().histogram()

I had to convert to an RDD because I found the histogram method in the pyspark.RDD class, but not in the pyspark.sql module.

Answered by Briford Wylie

The pyspark_dist_explore package that @Chris van den Berg mentioned is quite nice. If you prefer not to add an additional dependency, you can use this bit of code to plot a simple histogram.

import matplotlib.pyplot as plt

# Compute the histogram of the 'C1' column in Spark (20 buckets)
bins, counts = df.select('C1').rdd.flatMap(lambda x: x).histogram(20)

# A bit awkward, but this is how matplotlib plots pre-counted buckets
plt.hist(bins[:-1], bins=bins, weights=counts)
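
The weights argument makes plt.hist treat each left bin edge as a value that occurs counts[i] times, so only the small (bins, counts) summary leaves Spark; call plt.show() (or plt.savefig(...)) to render the figure.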

Answered by Assaf Mendelson

Let's say your values in C1 are between 1 and 1000 and you want to get a histogram of 10 bins. You can do something like:

from pyspark.sql.functions import floor

# floor() turns the division into integer bucket ids (0, 1, ..., 10)
df.withColumn("bins", floor(df.C1 / 100)).groupBy("bins").count()

If your binning is more complex you can make a UDF for it (and at worst, you might need to analyze the column first, e.g. by using describe or through some other method).
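
For illustration, a hypothetical UDF-based binning could look like the sketch below; bucketize and its boundaries are made up, so choose them after inspecting the column:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Hypothetical bucketing rule; the boundaries are illustrative only
def bucketize(v):
    if v is None:
        return None
    if v < 50:
        return "low"
    elif v < 500:
        return "mid"
    return "high"

bucketize_udf = udf(bucketize, StringType())
df.withColumn("bins", bucketize_udf(df.C1)).groupBy("bins").count()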

Answered by Chris van den Berg

If you want to plot the histogram, you could use the pyspark_dist_explore package:

import matplotlib.pyplot as plt
from pyspark_dist_explore import hist

fig, ax = plt.subplots()
hist(ax, df.groupBy("C1").count().select("count"))

If you would like the data in a pandas DataFrame you could use:

from pyspark_dist_explore import pandas_histogram

pandas_df = pandas_histogram(df.groupBy("C1").count().select("count"))
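
From there, a quick sketch of rendering the result with pandas' own plotting (this assumes matplotlib is installed; pandas_histogram returns an ordinary pandas DataFrame of bucket counts):

import matplotlib.pyplot as plt

pandas_df.plot(kind='bar')  # one bar per bucket
plt.show()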

Answered by Jagannath Banerjee

One easy way could be

import pandas as pd
x = df.select('symboling').toPandas()  # symboling is the column for histogram
x.plot(kind='hist')
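
Note that toPandas() collects the whole column to the driver, so this is only comfortable when the data fits in memory. A hedged sketch of sampling first for larger tables (the 0.1 fraction is illustrative):

# Sample roughly 10% of rows in Spark before collecting to the driver
x = df.select('symboling').sample(False, 0.1).toPandas()
x.plot(kind='hist')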