Python Spark dataframe: add a new column with random data

Disclaimer: This page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute the original authors (not the translator). Original question: http://stackoverflow.com/questions/41459138/

Spark dataframe add new column with random data

python, apache-spark, pyspark, apache-spark-sql

Asked by Dilma

I want to add a new column to the dataframe with values that are either 0 or 1. I used the randint function from the random module:

from random import randint

df1 = df.withColumn('isVal',randint(0,1))

But I get the following error:

/spark/python/pyspark/sql/dataframe.py", line 1313, in withColumn assert isinstance(col, Column), "col should be Column" AssertionError: col should be Column

How can I use a custom function, or the randint function, to generate random values for the column?

Answered by Assaf Mendelson

You are using Python's built-in random module. randint(0, 1) runs once on the driver and returns a plain constant integer, not a column expression.

As the error message shows, withColumn expects a Column object, which represents an expression that is evaluated per row.

To do this, use:

from pyspark.sql.functions import rand, when

# rand() is uniform on [0, 1); values above 0.5 become 1, the rest 0
df1 = df.withColumn('isVal', when(rand() > 0.5, 1).otherwise(0))

rand() draws from a uniform distribution on [0, 1), so the resulting column is 0 or 1 with equal probability. If you need reproducible values, rand() also accepts an optional seed, e.g. rand(seed=42). See the functions documentation for more options: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions

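The question also asks about using a custom function; that works through a UDF. Here is a minimal sketch (assuming Spark 2.3+ for asNondeterministic(), which keeps the optimizer from collapsing the random call into a single reused value):

from random import randint
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# wrap the builtin so it runs once per row instead of once on the driver
rand_udf = udf(lambda: randint(0, 1), IntegerType()).asNondeterministic()
df1 = df.withColumn('isVal', rand_udf())

The built-in rand() is still preferable when it fits, since a Python UDF pays serialization overhead on every row.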

Answered by gogogod

I had a similar problem, needing integer values from 5 to 10. I used the rand() function from pyspark.sql.functions:

from pyspark.sql.functions import *

# rand()*(10-5)+5 is uniform on [5, 10); round (pyspark's, via the star import) snaps it to whole numbers
df1 = df.withColumn("random", round(rand()*(10-5)+5, 0))