Applying a Mapping Function on a DataFrame

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must comply with the same CC BY-SA license, note the original address, and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/45404644/
Asked by yahalom
I have just started using databricks/pyspark. I'm using Python / Spark 2.1. I have uploaded data to a table. The table is a single column full of strings. I wish to apply a mapping function to each element in the column. I load the table into a dataframe:
df = spark.table("mynewtable")
The only way I could find, which others suggested, was to convert it to an RDD, apply the mapping function, and then convert back to a dataframe to show the data. But this throws a job-aborted stage failure:
df2 = df.select("_c0").rdd.flatMap(lambda x: x.append("anything")).toDF()
All I want to do is apply some sort of map function to my data in the table. For example, append something to each string in the column, or perform a split on a character, and then put the result back into a dataframe so I can .show() or display it.
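In plain Python terms, the two per-element transformations described above look like this (a minimal sketch; the sample values and the suffix are made up for illustration):

```python
# Sample stand-ins for the single string column.
rows = ["10.0.0.1", "10.0.0.2"]

# Append something to each string.
appended = [s + "_suffix" for s in rows]

# Split each string on a character.
split_up = [s.split(".") for s in rows]

print(appended)  # ['10.0.0.1_suffix', '10.0.0.2_suffix']
print(split_up)  # [['10', '0', '0', '1'], ['10', '0', '0', '2']]
```

The answer below shows how to express the same per-element logic on a Spark DataFrame.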
Answered by Alper t. Turker
You cannot:

- Use flatMap, because it will flatten the Row.
- Use append, because:
  - tuple and Row have no append method.
  - append (if present on a collection) is executed for side effects and returns None.
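The second point can be checked in plain Python, since pyspark's Row is a tuple subclass:

```python
# tuple (and therefore Row) has no append method.
t = ("a", "b")
has_append = hasattr(t, "append")  # False

# list.append exists, but it mutates in place and returns None,
# which is useless inside a map/flatMap lambda.
lst = ["a", "b"]
result = lst.append("anything")

print(has_append)  # False
print(result)      # None
print(lst)         # ['a', 'b', 'anything']
```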
I would use withColumn:
from pyspark.sql.functions import lit

df.withColumn("foo", lit("anything"))
but map should work as well:
df.select("_c0").rdd.map(lambda x: x + ("anything", )).toDF()
Edit (given the comment):
You probably want a udf:
from pyspark.sql.functions import udf

def iplookup(s):
    return ...  # Some lookup logic

iplookup_udf = udf(iplookup)
df.withColumn("foo", iplookup_udf("_c0"))
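The function wrapped by the udf is ordinary Python, applied to one value at a time. For illustration only, a hypothetical dict-backed lookup (the table and names below are made up, not part of the answer):

```python
# Hypothetical lookup table; in practice this could be any Python logic.
KNOWN_IPS = {"10.0.0.1": "host-a", "10.0.0.2": "host-b"}

def iplookup(s):
    # Return the mapped name, or the input unchanged when unknown.
    return KNOWN_IPS.get(s, s)

print(iplookup("10.0.0.1"))     # host-a
print(iplookup("192.168.0.9"))  # 192.168.0.9
```

Wrapping this with udf(iplookup) lets Spark call it for every row of the column.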
The default return type is StringType, so if you want something else you should adjust it.