Scala Spark: replace a column's string where it occurs in another column (row-wise)

Question

Asked by Karol Sudol

I would like to remove strings from col1 that are present in col2:

val df = spark.createDataFrame(Seq(
("Hi I heard about Spark", "Spark"),
("I wish Java could use case classes", "Java"),
("Logistic regression models are neat", "models")
)).toDF("sentence", "label")

using regexp_replace or translate (ref: Spark functions API):

val res = df.withColumn("sentence_without_label",
  regexp_replace(col("sentence"), "(?????)", ""))

so that res looks as below:

Answer 1

Answered by ktheitroadalo

You could simply use regexp_replace


// assumes import org.apache.spark.sql.functions._ and import spark.implicits._
df5.withColumn("sentence_without_label", regexp_replace($"sentence", $"label", lit("")))

Or you can use a simple UDF, as below:

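import org.apache.spark.sql.functions.udf
import spark.implicits._ // enables the $"column" syntax

val df5 = spark.createDataFrame(Seq(
  ("Hi I heard about Spark", "Spark"),
  ("I wish Java could use case classes", "Java"),
  ("Logistic regression models are neat", "models")
)).toDF("sentence", "label")

// replaceAll treats rep as a regular expression, not as plain text
val replace = udf((data: String, rep: String) => data.replaceAll(rep, ""))

val res = df5.withColumn("sentence_without_label", replace($"sentence", $"label"))

// show(false) prints the columns without truncation
res.show(false)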

Output:


+-----------------------------------+------+------------------------------+
|sentence                           |label |sentence_without_label        |
+-----------------------------------+------+------------------------------+
|Hi I heard about Spark             |Spark |Hi I heard about              |
|I wish Java could use case classes |Java  |I wish  could use case classes|
|Logistic regression models are neat|models|Logistic regression  are neat |
+-----------------------------------+------+------------------------------+
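
One thing to watch with the UDF above: as far as I know, Spark passes null into a UDF whose parameters are of type String when the column value is null, so `data.replaceAll(rep, "")` would throw a NullPointerException on such rows. Below is a minimal sketch of a null-tolerant version of the same logic in plain Scala. The names `NullSafeRemoval` and `removeLabel` are made up for illustration and are not part of the original answer.

```scala
object NullSafeRemoval {
  // Hypothetical helper (not from the original answer): same replaceAll
  // logic as the UDF above, but returns the sentence unchanged when either
  // input is null instead of throwing a NullPointerException.
  def removeLabel(data: String, rep: String): String =
    if (data == null || rep == null) data
    else data.replaceAll(rep, "") // rep is still interpreted as a regex
}
```

It should be possible to plug this in with `udf(NullSafeRemoval.removeLabel _)` in place of the lambda. Returning the sentence untouched when the label is null is just one possible choice; returning null would be equally defensible.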

Answer 2

Answered by Alper t. Turker

If label is just a literal, it is pretty simple:

import org.apache.spark.sql.functions._

df.withColumn("sentence_without_label", 
  regexp_replace(col("sentence"), col("label"), lit(""))).show(false)

+-----------------------------------+------+------------------------------+
|sentence                           |label |sentence_without_label        |
+-----------------------------------+------+------------------------------+
|Hi I heard about Spark             |Spark |Hi I heard about              |
|I wish Java could use case classes |Java  |I wish  could use case classes|
|Logistic regression models are neat|models|Logistic regression  are neat |
+-----------------------------------+------+------------------------------+

In Spark 1.6 you can do the same with expr:


df.withColumn(
  "sentence_without_label",
  expr("regexp_replace(sentence, label, '')"))
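
The "just a literal" caveat matters because the pattern argument is interpreted as a Java regular expression, so a label containing metacharacters such as `.` or `(` will not match the way plain text would. Here is a rough sketch in plain Scala of what that looks like and how quoting the label avoids it. `QuotedLabel` and `removeQuoted` are hypothetical names used only for this example.

```scala
import java.util.regex.Pattern

object QuotedLabel {
  // Hypothetical helper: quotes `label` so it is matched as plain text
  // rather than being interpreted as a regular expression.
  def removeQuoted(sentence: String, label: String): String =
    sentence.replaceAll(Pattern.quote(label), "")
}

// Unquoted, "." matches every character, so the whole sentence is removed:
//   "Spark 2.1 is out".replaceAll(".", "")            gives ""
// Quoted, only the literal dot is removed:
//   QuotedLabel.removeQuoted("Spark 2.1 is out", ".") gives "Spark 21 is out"
```

Wrapped in a UDF, this would give the literal behaviour for arbitrary labels. For the three sample rows in the question it makes no difference, since none of the labels contains a metacharacter.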
