Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/32121961/

Time: 2020-10-22 07:29:51  Source: igfitidea

Apache Spark: how to transform Data Frame column with regex to another Data Frame?

regex scala apache-spark

Asked by snowindy

I have a Spark DataFrame (DF1) with several columns: (user_uuid, url, date_visit)

I want to transform DF1 into a second DataFrame (DF2) of the form: (user_uuid, domain, date_visit)

What I wanted to use is a regular expression to detect the domain and apply it to DF1:

val regexpr = """(?i)^((https?):\/\/)?((www|www1)\.)?([\w-\.]+)""".r

Could you please help me compose code to transform DataFrames in Scala? I am completely new to Spark and Scala, and the syntax is hard. Thanks!

Answered by zero323

Spark >= 1.5:

You can use the regexp_extract function:

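import org.apache.spark.sql.functions.regexp_extract

val pattern: String = ???  // a valid Java regular expression
val groupIdx: Int = ???    // index of the capture group to extract

// $"url" requires the implicits of your SQLContext (or SparkSession) to be in scope
df.withColumn("domain", regexp_extract($"url", pattern, groupIdx))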
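
The pattern and group index are left as placeholders above. As a minimal sketch in plain Scala (no Spark needed), and assuming the regex from the question is used unchanged, counting the opening parentheses suggests that the domain sits in capture group 5. The object name DomainGroupDemo and the helper domainOf are hypothetical names introduced only for this illustration.

```scala
import scala.util.matching.Regex

object DomainGroupDemo {
  // The pattern from the question, copied unchanged.
  val regexpr: Regex = """(?i)^((https?):\/\/)?((www|www1)\.)?([\w-\.]+)""".r

  // Groups are numbered by opening parenthesis:
  // 1 = scheme plus "://", 2 = scheme, 3 = "www." or "www1.", 4 = "www" or "www1", 5 = host
  val groupIdx: Int = 5

  // Hypothetical helper: returns the host part if the pattern matches.
  def domainOf(url: String): Option[String] =
    regexpr.findFirstMatchIn(url).map(_.group(groupIdx))

  def main(args: Array[String]): Unit = {
    println(domainOf("http://www.example.com/page")) // Some(example.com)
    println(domainOf("example.org"))                 // Some(example.org)
  }
}
```

Under that assumption, passing regexpr.toString as the pattern and 5 as the group index to regexp_extract should extract the same substring, since both use Java regular expressions.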

Spark < 1.5.0

Define a UDF:

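import org.apache.spark.sql.functions.udf

val pattern: scala.util.matching.Regex = ???

// Returns a UDF that yields the first match of `pattern` in the URL, or "unknown"
def getFirst(pattern: scala.util.matching.Regex) = udf(
  (url: String) => pattern.findFirstIn(url) match {
    case Some(domain) => domain
    case None => "unknown"
  }
)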
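
Note that findFirstIn returns the whole matched substring, not a single capture group. The sketch below runs the same logic as a plain Scala function so the behaviour can be checked without Spark, and it adds a hypothetical variant (domainOnly, not part of the original answer) that tolerates null input and returns only group 5. It assumes the regex from the question is used as the pattern.

```scala
import scala.util.matching.Regex

object UdfBodyDemo {
  // Assumption: the regex from the question, copied unchanged.
  val pattern: Regex = """(?i)^((https?):\/\/)?((www|www1)\.)?([\w-\.]+)""".r

  // Same logic as the function wrapped by udf: whole match, or "unknown".
  def wholeMatch(url: String): String =
    pattern.findFirstIn(url).getOrElse("unknown")

  // Hypothetical variant: null-safe, and returns only capture group 5 (the host).
  def domainOnly(url: String): String =
    Option(url)
      .flatMap(u => pattern.findFirstMatchIn(u))
      .map(_.group(5))
      .getOrElse("unknown")

  def main(args: Array[String]): Unit = {
    println(wholeMatch("http://www.example.com/page")) // http://www.example.com
    println(domainOnly("http://www.example.com/page")) // example.com
    println(domainOnly(null))                          // unknown
  }
}
```

If only the host is wanted, the body of domainOnly could be passed to udf in place of the findFirstIn version.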

Use the defined UDF:

df.select(
  $"user_uuid",
  getFirst(pattern)($"url").alias("domain"),
  $"date_visit"
)

or register a temp table:

df.registerTempTable("df")

sqlContext.sql(s"""
  SELECT user_uuid, regexp_extract(url, '$pattern', $group_idx) AS domain, date_visit FROM df""")

Replace pattern with a valid Java regexp and group_idx with the index of the group.