Scala: Spark column string replace when present in other column (row)
Original question: http://stackoverflow.com/questions/45615621/
Notice: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep it under a CC BY-SA license, cite the original URL, and attribute it to the original authors (not me): Stack Overflow
Asked by Karol Sudol
I would like to remove from col1 the strings that are present in col2 of the same row:
val df = spark.createDataFrame(Seq(
  ("Hi I heard about Spark", "Spark"),
  ("I wish Java could use case classes", "Java"),
  ("Logistic regression models are neat", "models")
)).toDF("sentence", "label")
using regexp_replace or translate (ref: Spark functions API):
val res = df.withColumn("sentence_without_label",
  regexp_replace(col("sentence"), "(?????)", ""))
so that res contains each sentence with its label removed.
Answered by ktheitroadalo
You could simply use regexp_replace:
df5.withColumn("sentence_without_label", regexp_replace($"sentence", $"label", lit("")))
Or you can use a simple UDF, as below:
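import org.apache.spark.sql.functions.udf
import spark.implicits._  // enables the $"column" syntax

val df5 = spark.createDataFrame(Seq(
  ("Hi I heard about Spark", "Spark"),
  ("I wish Java could use case classes", "Java"),
  ("Logistic regression models are neat", "models")
)).toDF("sentence", "label")
// replaceAll treats rep as a regular expression, not as a literal string
val replace = udf((data: String, rep: String) => data.replaceAll(rep, ""))
val res = df5.withColumn("sentence_without_label", replace($"sentence", $"label"))
res.show(false)  // false disables truncation of long cell values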
Output:
+-----------------------------------+------+------------------------------+
|sentence                           |label |sentence_without_label        |
+-----------------------------------+------+------------------------------+
|Hi I heard about Spark             |Spark |Hi I heard about              |
|I wish Java could use case classes |Java  |I wish  could use case classes|
|Logistic regression models are neat|models|Logistic regression  are neat |
+-----------------------------------+------+------------------------------+
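One caveat with the UDF above: String.replaceAll compiles its first argument as a regular expression, so a label containing characters such as "." or "(" may match more than intended or fail to compile. The following is a minimal plain-Scala sketch (no Spark required) contrasting regex and literal removal. The object and method names are hypothetical, and the null handling assumes that Spark hands null to a UDF for NULL cells.

```scala
import java.util.regex.Pattern

object LabelRemoval {
  // Regex semantics, as in the UDF above: `label` is compiled as a pattern.
  def removeRegex(sentence: String, label: String): String =
    sentence.replaceAll(label, "")

  // Literal semantics: String.replace matches `label` character for character.
  // A null sentence or label leaves the sentence unchanged.
  def removeLiteral(sentence: String, label: String): String =
    if (sentence == null || label == null) sentence
    else sentence.replace(label, "")

  // Literal removal for APIs that only accept a regex: quote the label first.
  def removeQuoted(sentence: String, label: String): String =
    sentence.replaceAll(Pattern.quote(label), "")
}
```

For the sentence "axb and a.b" with the label "a.b", removeRegex also deletes "axb" because "." matches any character, whereas removeLiteral and removeQuoted delete only the exact text "a.b". Either literal variant could be wrapped in udf(...) in the same way as the replace function above.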
Answered by Alper t. Turker
If label is just a literal, it is pretty simple:
import org.apache.spark.sql.functions._
df.withColumn("sentence_without_label",
regexp_replace(col("sentence"), col("label"), lit(""))).show(false)
+-----------------------------------+------+------------------------------+
|sentence                           |label |sentence_without_label        |
+-----------------------------------+------+------------------------------+
|Hi I heard about Spark             |Spark |Hi I heard about              |
|I wish Java could use case classes |Java  |I wish  could use case classes|
|Logistic regression models are neat|models|Logistic regression  are neat |
+-----------------------------------+------+------------------------------+
In Spark 1.6 you can do the same with expr:
df.withColumn(
"sentence_without_label",
expr("regexp_replace(sentence, label, '')"))
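Removing a word from the middle of a sentence leaves both of its surrounding spaces behind, and removing the last word leaves a trailing space. If that matters, the whitespace can be normalized afterwards. This is a minimal plain-Scala sketch of that cleanup step; the object and method names are hypothetical and are not part of either answer.

```scala
object Whitespace {
  // Collapse every run of whitespace to a single space, then strip both ends.
  def normalize(s: String): String =
    s.replaceAll("\\s+", " ").trim
}
```

In Spark, the same idea could probably be expressed on the new column by applying regexp_replace with the pattern "\\s+" and the replacement " ", then wrapping the result in trim. Both functions are in org.apache.spark.sql.functions.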


