Scala Spark - remove special characters from rows of a Dataframe with different column types

Disclaimer: this page is a translation of a popular StackOverflow question and answer, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same license and attribute the original authors (not me). Original: http://stackoverflow.com/questions/42839726/

Date: 2020-10-22 09:07:48  Source: igfitidea

Spark - remove special characters from rows Dataframe with different column types

Tags: regex, scala, apache-spark, dataframe, rdd

Asked by Alg_D

Assuming I have a Dataframe with many columns: some are of type string, others of type int, and others of type map.


e.g. field/columns types: stringType|intType|mapType<string,int>|...


|--------------------------------------------------------------------------
|  myString1      |myInt1|  myMap1                                              |...
|--------------------------------------------------------------------------
|"this_is_#string"| 123 |{"str11_in#map":1,"str21_in#map":2, "str31_in#map": 31}|...
|"this_is_#string"| 456 |{"str12_in#map":1,"str22_in#map":2, "str32_in#map": 32}|...
|"this_is_#string"| 789 |{"str13_in#map":1,"str23_in#map":2, "str33_in#map": 33}|...
|--------------------------------------------------------------------------
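For reference, a Dataframe with this shape could be built with an explicit schema; a minimal sketch using the example values above (the SparkSession setup is assumed):

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

// Sketch: a Dataframe matching the layout above (string | int | map<string,int>)
val spark = SparkSession.builder.appName("example").getOrCreate()
val schema = StructType(Seq(
  StructField("myString1", StringType),
  StructField("myInt1", IntegerType),
  StructField("myMap1", MapType(StringType, IntegerType))
))
val sampleRows = Seq(
  Row("this_is_#string", 123, Map("str11_in#map" -> 1, "str21_in#map" -> 2, "str31_in#map" -> 31))
)
val sampleDf = spark.createDataFrame(spark.sparkContext.parallelize(sampleRows), schema)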

I want to remove some characters like '_' and '#' from all columns of String and Map type, so the resulting Dataframe/RDD would be:


|------------------------------------------------------------------------
|  myString1   |myInt1|  myMap1                                         |...
|------------------------------------------------------------------------
|"thisisstring"| 123 |{"str11inmap":1,"str21inmap":2, "str31inmap": 31}|...
|"thisisstring"| 456 |{"str12inmap":1,"str22inmap":2, "str32inmap": 32}|...
|"thisisstring"| 789 |{"str13inmap":1,"str23inmap":2, "str33inmap": 33}|...
|-------------------------------------------------------------------------

I am not sure whether it's better to convert the Dataframe into an RDD and work with that, or to do the work directly on the Dataframe.


Also, I am not sure how best to handle the regexp across different column types (I am using Scala). I would like to perform this action on all columns of these two types (string and map), while avoiding hard-coding the column names, as in:


def cleanRows(mytabledata: DataFrame): RDD[String] = {

  // this will do the work for a specific column (myString1) of type string
  val oneColumn_clean = mytabledata.withColumn("myString1", regexp_replace(col("myString1"), "[_#]", ""))

  ...
  // return type can be RDD or Dataframe...
}
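For the string-typed columns alone, one way to avoid naming each column is to derive them from the schema and fold regexp_replace over them; a rough sketch (it does not touch the map columns, which are the part I am unsure about):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, regexp_replace}
import org.apache.spark.sql.types.StringType

// Sketch: apply the same regexp_replace to every string-typed column found via the schema
def cleanStringColumns(df: DataFrame): DataFrame = {
  val stringCols = df.schema.fields.filter(_.dataType == StringType).map(_.name)
  stringCols.foldLeft(df) { (acc, name) =>
    acc.withColumn(name, regexp_replace(col(name), "[_#]", ""))
  }
}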

Is there any simple solution to perform this? Thanks


Answered by Psidom

One option is to define two UDFs to handle the string-type column and the Map-type column separately:


import org.apache.spark.sql.functions.udf
import spark.implicits._  // assumes a SparkSession named spark (as in spark-shell); needed for toDF and $"..."
val df = Seq(("this_is#string", 3, Map("str1_in#map" -> 3))).toDF("myString", "myInt", "myMap")
df.show
+--------------+-----+--------------------+
|      myString|myInt|               myMap|
+--------------+-----+--------------------+
|this_is#string|    3|Map(str1_in#map -...|
+--------------+-----+--------------------+

1) A UDF to handle string-type columns:


def remove_string: String => String = _.replaceAll("[_#]", "")
def remove_string_udf = udf(remove_string)

2) A UDF to handle Map-type columns:


def remove_map: Map[String, Int] => Map[String, Int] = _.map{ case (k, v) => k.replaceAll("[_#]", "") -> v }
def remove_map_udf = udf(remove_map)

3) Apply the UDFs to the corresponding columns to clean them up:


df.withColumn("myString", remove_string_udf($"myString")).
   withColumn("myMap", remove_map_udf($"myMap")).show

+------------+-----+-------------------+
|    myString|myInt|              myMap|
+------------+-----+-------------------+
|thisisstring|    3|Map(str1inmap -> 3)|
+------------+-----+-------------------+
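To cover every column of these two types without listing names (as the question asks), one possible extension is to drive the choice of UDF from the schema; a sketch building on the two UDFs above, where cleanAll is just a hypothetical helper name:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, MapType, StringType}

// Sketch: pick the matching UDF by column type; other columns are left untouched
def cleanAll(df: DataFrame): DataFrame =
  df.schema.fields.foldLeft(df) { (acc, field) =>
    field.dataType match {
      case StringType                          => acc.withColumn(field.name, remove_string_udf(col(field.name)))
      case MapType(StringType, IntegerType, _) => acc.withColumn(field.name, remove_map_udf(col(field.name)))
      case _                                   => acc
    }
  }

cleanAll(df).show

For the plain string columns, the built-in regexp_replace (as in the question's own snippet) would work just as well; the UDF is mainly needed to rewrite the map keys.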