Python PySpark: replacing strings in a Spark DataFrame column

Question

Asked by Luke

I'd like to perform some basic stemming on a Spark Dataframe column by replacing substrings. What's the quickest way to do this?

In my current use case, I have a list of addresses that I want to normalize. For example, this DataFrame:

id     address
1       2 foo lane
2       10 bar lane
3       24 pants ln

Would become

id     address
1       2 foo ln
2       10 bar ln
3       24 pants ln

Answer 1

Answered by Daniel de Paula

For Spark 1.5 or later, you can use the pyspark.sql.functions package:

from pyspark.sql.functions import regexp_replace
# Replace every match of the regex 'lane' in the address column with 'ln'.
newDf = df.withColumn('address', regexp_replace('address', 'lane', 'ln'))

Quick explanation:

The function withColumn is called to add a column to the data frame (or replace it, if the name already exists).
The function regexp_replace generates a new column by replacing all substrings that match the pattern.
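
Because regexp_replace treats its second argument as a regular expression and replaces every match, the plain pattern 'lane' would also rewrite those four letters inside a longer word. The sketch below illustrates that effect using Python's built-in re module as a stand-in for Spark, an assumption made so it runs without a Spark session. The helper name normalize and the extra sample address "7 planet lane" are hypothetical and not part of the original answer.

```python
import re

def normalize(address, pattern):
    # Replace every match of `pattern` with 'ln', mirroring what
    # regexp_replace does to each row of the column.
    return re.sub(pattern, "ln", address)

addresses = ["2 foo lane", "10 bar lane", "24 pants ln", "7 planet lane"]

# Plain substring pattern: also rewrites the 'lane' inside 'planet'.
naive = [normalize(a, r"lane") for a in addresses]

# Word-boundary pattern: only rewrites 'lane' when it is a whole word.
bounded = [normalize(a, r"\blane\b") for a in addresses]

print(naive)
print(bounded)
```

The same pattern string, r'\blane\b', can be passed as the second argument to regexp_replace, since Java regular expressions also support \b.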

Answer 2

Answered by loneStar

For Scala:

import org.apache.spark.sql.functions.{col, regexp_replace}
// Strip literal asterisks; "\\*" is the regex \* written as a Scala string.
data.withColumn("addr_new", regexp_replace(col("addr_line"), "\\*", ""))
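
The asterisk in the Scala pattern is escaped because * is a quantifier in regular expressions, so the pattern has to state that a literal asterisk is meant. A rough Python illustration of the same pattern follows, again using the built-in re module rather than Spark. The sample value is made up for the example.

```python
import re

raw = "12 *main* st"

# r"\*" matches a literal asterisk; an unescaped "*" is not a valid
# pattern on its own because it has nothing to repeat.
cleaned = re.sub(r"\*", "", raw)

# re.escape builds the escaped pattern from the literal text.
also_cleaned = re.sub(re.escape("*"), "", raw)

print(cleaned)
```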
