Python Spark dataframe: add a new column with random data

Disclaimer: This page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute the original authors (not the translator). Original question: http://stackoverflow.com/questions/41459138/

Spark dataframe add new column with random data

python, apache-spark, pyspark, apache-spark-sql

Asked by Dilma

I want to add a new column to the dataframe with values that are either 0 or 1. I used the randint function from the random module:

from random import randint

df1 = df.withColumn('isVal',randint(0,1))

But I get the following error:

/spark/python/pyspark/sql/dataframe.py", line 1313, in withColumn assert isinstance(col, Column), "col should be Column" AssertionError: col should be Column

How can I use a custom function, or the randint function, to generate random values for the column?

Answered by Assaf Mendelson

You are using Python's built-in random module. randint(0, 1) runs once on the driver and returns a plain constant integer, not a column expression.

As the error message shows, withColumn expects a Column object, which represents an expression that is evaluated per row.

To do this, use:

from pyspark.sql.functions import rand, when

# rand() is uniform on [0, 1); values above 0.5 become 1, the rest 0
df1 = df.withColumn('isVal', when(rand() > 0.5, 1).otherwise(0))

rand() draws from a uniform distribution on [0, 1), so the resulting column is 0 or 1 with equal probability. If you need reproducible values, rand() also accepts an optional seed, e.g. rand(seed=42). See the functions documentation for more options: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions

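The question also asks about using a custom function; that works through a UDF. Here is a minimal sketch (assuming Spark 2.3+ for asNondeterministic(), which keeps the optimizer from collapsing the random call into a single reused value):

from random import randint
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# wrap the builtin so it runs once per row instead of once on the driver
rand_udf = udf(lambda: randint(0, 1), IntegerType()).asNondeterministic()
df1 = df.withColumn('isVal', rand_udf())

The built-in rand() is still preferable when it fits, since a Python UDF pays serialization overhead on every row.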

Answered by gogogod

I had a similar problem, needing integer values from 5 to 10. I used the rand() function from pyspark.sql.functions:

from pyspark.sql.functions import *

# rand()*(10-5)+5 is uniform on [5, 10); round (pyspark's, via the star import) snaps it to whole numbers
df1 = df.withColumn("random", round(rand()*(10-5)+5, 0))