Python: How to add a constant column in a Spark DataFrame?

Warning: this page is an English rendering of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/32788322/

How to add a constant column in a Spark DataFrame?

Tags: python, apache-spark, dataframe, pyspark, apache-spark-sql

Asked by Evan Zamir

I want to add a column in a DataFrame with some arbitrary value (that is the same for each row). I get an error when I use withColumn as follows:

dt.withColumn('new_column', 10).head(5)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-50-a6d0257ca2be> in <module>()
      1 dt = (messages
      2     .select(messages.fromuserid, messages.messagetype, floor(messages.datetime/(1000*60*5)).alias("dt")))
----> 3 dt.withColumn('new_column', 10).head(5)

/Users/evanzamir/spark-1.4.1/python/pyspark/sql/dataframe.pyc in withColumn(self, colName, col)
   1166         [Row(age=2, name=u'Alice', age2=4), Row(age=5, name=u'Bob', age2=7)]
   1167         """
-> 1168         return self.select('*', col.alias(colName))
   1169 
   1170     @ignore_unicode_prefix

AttributeError: 'int' object has no attribute 'alias'

It seems that I can trick the function into working as I want by adding and subtracting one of the other columns (so they add to zero) and then adding the number I want (10 in this case):

dt.withColumn('new_column', dt.messagetype - dt.messagetype + 10).head(5)
[Row(fromuserid=425, messagetype=1, dt=4809600.0, new_column=10),
 Row(fromuserid=47019141, messagetype=1, dt=4809600.0, new_column=10),
 Row(fromuserid=49746356, messagetype=1, dt=4809600.0, new_column=10),
 Row(fromuserid=93506471, messagetype=1, dt=4809600.0, new_column=10),
 Row(fromuserid=80488242, messagetype=1, dt=4809600.0, new_column=10)]

This is supremely hacky, right? I assume there is a more legit way to do this?

Accepted answer by zero323

Spark 2.2+

Spark 2.2 introduces typedLit to support Seq, Map, and Tuples (SPARK-19254), and the following calls should be supported (Scala):

import org.apache.spark.sql.functions.typedLit

df.withColumn("some_array", typedLit(Seq(1, 2, 3)))
df.withColumn("some_struct", typedLit(("foo", 1, .0.3)))
df.withColumn("some_map", typedLit(Map("key1" -> 1, "key2" -> 2)))

Spark 1.3+ (lit), 1.4+ (array, struct), 2.0+ (map):

The second argument for DataFrame.withColumn should be a Column, so you have to use a literal:

from pyspark.sql.functions import lit

df.withColumn('new_column', lit(10))
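
Applied to the DataFrame from the question, this should reproduce the rows from the workaround, without the hack:

dt.withColumn('new_column', lit(10)).head(5)
# [Row(fromuserid=425, messagetype=1, dt=4809600.0, new_column=10), ...]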

If you need complex columns, you can build them using blocks like array:

from pyspark.sql.functions import array, create_map, struct

df.withColumn("some_array", array(lit(1), lit(2), lit(3)))
df.withColumn("some_struct", struct(lit("foo"), lit(1), lit(.3)))
df.withColumn("some_map", create_map(lit("key1"), lit(1), lit("key2"), lit(2)))

Exactly the same methods can be used in Scala.

import org.apache.spark.sql.functions.{array, lit, map, struct}

df.withColumn("new_column", lit(10))
df.withColumn("map", map(lit("key1"), lit(1), lit("key2"), lit(2)))

To provide names for structs, use either alias on each field:

df.withColumn(
    "some_struct",
    struct(lit("foo").alias("x"), lit(1).alias("y"), lit(0.3).alias("z"))
 )

or cast on the whole object:

df.withColumn(
    "some_struct", 
    struct(lit("foo"), lit(1), lit(0.3)).cast("struct<x: string, y: integer, z: double>")
 )

It is also possible, although slower, to use a UDF.

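A minimal sketch of that approach (assuming df is any DataFrame; the value is produced per row by Python, hence the extra cost):

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# A zero-argument UDF that always returns 10.
ten = udf(lambda: 10, IntegerType())
df.withColumn('new_column', ten())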

Note:

The same constructs can be used to pass constant arguments to UDFs or SQL functions.

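For example, an illustrative sketch (the UDF add_n and the column some_col are hypothetical, not from the original answer):

from pyspark.sql.functions import lit, udf
from pyspark.sql.types import IntegerType

# The constant 5 is wrapped in lit() so it arrives as a Column, not a Python int.
add_n = udf(lambda x, n: x + n, IntegerType())
df.withColumn('incremented', add_n(df['some_col'], lit(5)))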

Answer by Ayush Vatsyayan

In Spark 2.2 there are two ways to add a constant value to a column in a DataFrame:

1) Using lit

2) Using typedLit.

The difference between the two is that typedLit can also handle parameterized Scala types, e.g. List, Seq, and Map.

Sample DataFrame:

val df = spark.createDataFrame(Seq((0,"a"),(1,"b"),(2,"c"))).toDF("id", "col1")

+---+----+
| id|col1|
+---+----+
|  0|   a|
|  1|   b|
|  2|   c|
+---+----+

1) Using lit: adding a constant string value in a new column named newcol:

import org.apache.spark.sql.functions.lit
val newdf = df.withColumn("newcol",lit("myval"))

Result:

+---+----+------+
| id|col1|newcol|
+---+----+------+
|  0|   a| myval|
|  1|   b| myval|
|  2|   c| myval|
+---+----+------+

2) Using typedLit:

import org.apache.spark.sql.functions.typedLit
df.withColumn("newcol", typedLit(("sample", 10, .044)))

Result:

+---+----+-----------------+
| id|col1|           newcol|
+---+----+-----------------+
|  0|   a|[sample,10,0.044]|
|  1|   b|[sample,10,0.044]|
|  2|   c|[sample,10,0.044]|
+---+----+-----------------+