Applying a Mapping Function on a DataFrame

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must comply with the same CC BY-SA license, note the original address, and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/45404644/
Asked by yahalom
I have just started using databricks/pyspark. I'm using Python / Spark 2.1. I have uploaded data to a table. The table is a single column full of strings. I wish to apply a mapping function to each element in the column. I load the table into a dataframe:
df = spark.table("mynewtable")
The only way I could find, which others suggested, was to convert it to an RDD, apply the mapping function, and then convert back to a dataframe to show the data. But this throws a job-aborted stage failure:
df2 = df.select("_c0").rdd.flatMap(lambda x: x.append("anything")).toDF()
All I want to do is apply some sort of map function to my data in the table. For example, append something to each string in the column, or perform a split on a character, and then put the result back into a dataframe so I can .show() or display it.
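In plain Python terms, the two per-element transformations described above look like this (a minimal sketch; the sample values and the suffix are made up for illustration):

```python
# Sample stand-ins for the single string column.
rows = ["10.0.0.1", "10.0.0.2"]

# Append something to each string.
appended = [s + "_suffix" for s in rows]

# Split each string on a character.
split_up = [s.split(".") for s in rows]

print(appended)  # ['10.0.0.1_suffix', '10.0.0.2_suffix']
print(split_up)  # [['10', '0', '0', '1'], ['10', '0', '0', '2']]
```

The answer below shows how to express the same per-element logic on a Spark DataFrame.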
Answered by Alper t. Turker
You cannot:

- Use flatMap, because it will flatten the Row.
- Use append, because:
  - tuple and Row have no append method.
  - append (if present on a collection) is executed for side effects and returns None.
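The second point can be checked in plain Python, since pyspark's Row is a tuple subclass:

```python
# tuple (and therefore Row) has no append method.
t = ("a", "b")
has_append = hasattr(t, "append")  # False

# list.append exists, but it mutates in place and returns None,
# which is useless inside a map/flatMap lambda.
lst = ["a", "b"]
result = lst.append("anything")

print(has_append)  # False
print(result)      # None
print(lst)         # ['a', 'b', 'anything']
```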
I would use withColumn:
from pyspark.sql.functions import lit

df.withColumn("foo", lit("anything"))
but map should work as well:
df.select("_c0").rdd.map(lambda x: x + ("anything", )).toDF()
Edit (given the comment):
You probably want a udf:
from pyspark.sql.functions import udf

def iplookup(s):
    return ...  # Some lookup logic

iplookup_udf = udf(iplookup)
df.withColumn("foo", iplookup_udf("_c0"))
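The function wrapped by the udf is ordinary Python, applied to one value at a time. For illustration only, a hypothetical dict-backed lookup (the table and names below are made up, not part of the answer):

```python
# Hypothetical lookup table; in practice this could be any Python logic.
KNOWN_IPS = {"10.0.0.1": "host-a", "10.0.0.2": "host-b"}

def iplookup(s):
    # Return the mapped name, or the input unchanged when unknown.
    return KNOWN_IPS.get(s, s)

print(iplookup("10.0.0.1"))     # host-a
print(iplookup("192.168.0.9"))  # 192.168.0.9
```

Wrapping this with udf(iplookup) lets Spark call it for every row of the column.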
The default return type is StringType, so if you want something else you should adjust it.