Scala Spark - remove special characters from rows of a Dataframe with different column types

Disclaimer: this page is a translation of a popular StackOverflow question and answer, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same license and attribute the original authors (not me). Original: http://stackoverflow.com/questions/42839726/

Date: 2020-10-22 09:07:48  Source: igfitidea

Spark - remove special characters from rows Dataframe with different column types

Tags: regex, scala, apache-spark, dataframe, rdd

Asked by Alg_D

Assuming I have a Dataframe with many columns: some are of type string, others of type int, and others of type map.


e.g. field/columns types: stringType|intType|mapType<string,int>|...


|--------------------------------------------------------------------------
|  myString1      |myInt1|  myMap1                                              |...
|--------------------------------------------------------------------------
|"this_is_#string"| 123 |{"str11_in#map":1,"str21_in#map":2, "str31_in#map": 31}|...
|"this_is_#string"| 456 |{"str12_in#map":1,"str22_in#map":2, "str32_in#map": 32}|...
|"this_is_#string"| 789 |{"str13_in#map":1,"str23_in#map":2, "str33_in#map": 33}|...
|--------------------------------------------------------------------------
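For reference, a Dataframe with this shape could be built with an explicit schema; a minimal sketch using the example values above (the SparkSession setup is assumed):

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

// Sketch: a Dataframe matching the layout above (string | int | map<string,int>)
val spark = SparkSession.builder.appName("example").getOrCreate()
val schema = StructType(Seq(
  StructField("myString1", StringType),
  StructField("myInt1", IntegerType),
  StructField("myMap1", MapType(StringType, IntegerType))
))
val sampleRows = Seq(
  Row("this_is_#string", 123, Map("str11_in#map" -> 1, "str21_in#map" -> 2, "str31_in#map" -> 31))
)
val sampleDf = spark.createDataFrame(spark.sparkContext.parallelize(sampleRows), schema)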

I want to remove some characters like '_' and '#' from all columns of String and Map type, so the resulting Dataframe/RDD would be:


|------------------------------------------------------------------------
|  myString1   |myInt1|  myMap1                                         |...
|------------------------------------------------------------------------
|"thisisstring"| 123 |{"str11inmap":1,"str21inmap":2, "str31inmap": 31}|...
|"thisisstring"| 456 |{"str12inmap":1,"str22inmap":2, "str32inmap": 32}|...
|"thisisstring"| 789 |{"str13inmap":1,"str23inmap":2, "str33inmap": 33}|...
|-------------------------------------------------------------------------

I am not sure whether it's better to convert the Dataframe into an RDD and work with that, or to do the work directly on the Dataframe.


Also, I am not sure how best to handle the regexp across different column types (I am using Scala). I would like to perform this action on all columns of these two types (string and map), while avoiding hard-coding the column names, as in:


def cleanRows(mytabledata: DataFrame): RDD[String] = {

  // this will do the work for a specific column (myString1) of type string
  val oneColumn_clean = mytabledata.withColumn("myString1", regexp_replace(col("myString1"), "[_#]", ""))

  ...
  // return type can be RDD or Dataframe...
}
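For the string-typed columns alone, one way to avoid naming each column is to derive them from the schema and fold regexp_replace over them; a rough sketch (it does not touch the map columns, which are the part I am unsure about):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, regexp_replace}
import org.apache.spark.sql.types.StringType

// Sketch: apply the same regexp_replace to every string-typed column found via the schema
def cleanStringColumns(df: DataFrame): DataFrame = {
  val stringCols = df.schema.fields.filter(_.dataType == StringType).map(_.name)
  stringCols.foldLeft(df) { (acc, name) =>
    acc.withColumn(name, regexp_replace(col(name), "[_#]", ""))
  }
}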

Is there any simple solution to perform this? Thanks


Answered by Psidom

One option is to define two UDFs to handle the string-type column and the Map-type column separately:


import org.apache.spark.sql.functions.udf
import spark.implicits._  // assumes a SparkSession named spark (as in spark-shell); needed for toDF and $"..."
val df = Seq(("this_is#string", 3, Map("str1_in#map" -> 3))).toDF("myString", "myInt", "myMap")
df.show
+--------------+-----+--------------------+
|      myString|myInt|               myMap|
+--------------+-----+--------------------+
|this_is#string|    3|Map(str1_in#map -...|
+--------------+-----+--------------------+

1) A UDF to handle string-type columns:


def remove_string: String => String = _.replaceAll("[_#]", "")
def remove_string_udf = udf(remove_string)

2) A UDF to handle Map-type columns:


def remove_map: Map[String, Int] => Map[String, Int] = _.map{ case (k, v) => k.replaceAll("[_#]", "") -> v }
def remove_map_udf = udf(remove_map)

3) Apply the UDFs to the corresponding columns to clean them up:


df.withColumn("myString", remove_string_udf($"myString")).
   withColumn("myMap", remove_map_udf($"myMap")).show

+------------+-----+-------------------+
|    myString|myInt|              myMap|
+------------+-----+-------------------+
|thisisstring|    3|Map(str1inmap -> 3)|
+------------+-----+-------------------+
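To cover every column of these two types without listing names (as the question asks), one possible extension is to drive the choice of UDF from the schema; a sketch building on the two UDFs above, where cleanAll is just a hypothetical helper name:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, MapType, StringType}

// Sketch: pick the matching UDF by column type; other columns are left untouched
def cleanAll(df: DataFrame): DataFrame =
  df.schema.fields.foldLeft(df) { (acc, field) =>
    field.dataType match {
      case StringType                          => acc.withColumn(field.name, remove_string_udf(col(field.name)))
      case MapType(StringType, IntegerType, _) => acc.withColumn(field.name, remove_map_udf(col(field.name)))
      case _                                   => acc
    }
  }

cleanAll(df).show

For the plain string columns, the built-in regexp_replace (as in the question's own snippet) would work just as well; the UDF is mainly needed to rewrite the map keys.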