Scala: extract words from a string column in a Spark dataframe

Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute the original authors (not me). Original question: http://stackoverflow.com/questions/47981699/

Extract words from a string column in spark dataframe

regex, scala, apache-spark

Asked by Sree51

I have a column in a Spark dataframe which contains text.

I want to extract all the words which start with the special character '@', and I am using regexp_extract on each row of that text column. If the text contains multiple words starting with '@', it only returns the first one.

I am looking to extract multiple words that match my pattern in Spark.

data_frame.withColumn("Names", regexp_extract($"text","(?<=^|(?<=[^a-zA-Z0-9-_\.]))@([A-Za-z]+[A-Za-z0-9_]+)",1).show

Sample input: @always_nidhi @YouTube no i dnt understand bt i loved the music nd their dance awesome all the song of this mve is rocking

Sample output: @always_nidhi,@YouTube
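
Aside (not part of the original question or answers): on Spark 3.1 and later, the built-in SQL function regexp_extract_all returns all matches at once, so a UDF can be avoided. A minimal sketch, assuming the same data_frame with a text column and Spark 3.1+:

import org.apache.spark.sql.functions.expr

// regexp_extract_all(str, regexp, idx) returns an array with every match of group idx (Spark 3.1+);
// concat_ws joins that array into the comma-separated string shown in the expected output
data_frame.withColumn(
  "Names",
  expr("""concat_ws(',', regexp_extract_all(text, '@\\w+', 0))""")
).show(false)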

Answered by Amit Kumar

You can create a UDF function in Spark as below:

import java.util.regex.Pattern
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.functions.col

// UDF that returns every match of the given regex group, joined by commas
def regexp_extractAll = udf((job: String, exp: String, groupIdx: Int) => {
  val pattern = Pattern.compile(exp)
  val m = pattern.matcher(job)
  var result = Seq[String]()
  while (m.find) {
    result = result :+ m.group(groupIdx)
  }
  result.mkString(",")
})

And then call the UDF as below:

data_frame.withColumn("Names", regexp_extractAll(new Column("text"), lit("@\w+"), lit(0))).show()

The above will give you the output below:

+--------------------+
|               Names|
+--------------------+
|@always_nidhi,@Yo...|
+--------------------+
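
Note that show() truncates long cell values to 20 characters by default, which is why the result appears as @always_nidhi,@Yo...; calling .show(false) prints the full value.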

I have used this regex as per the output you have posted in the question. You can modify it to suit your needs.

Answered by Souvik

You can use Java regex to extract those words. Below is the working code.

import org.apache.spark.{SparkConf, SparkContext}

val sparkConf = new SparkConf().setAppName("myapp").setMaster("local[*]")
val sc = new SparkContext(sparkConf)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.sql.functions.{col, udf}
import java.util.regex.Pattern

//User Defined function to extract
def toExtract(str: String) = {      
  val pattern = Pattern.compile("@\\w+")
  val tmplst = scala.collection.mutable.ListBuffer.empty[String]
  val matcher = pattern.matcher(str)
  while (matcher.find()) {
    tmplst += matcher.group()
  }
  tmplst.mkString(",")
}

val Extract = udf(toExtract _)
val values = List("@always_nidhi @YouTube no i dnt understand bt i loved the music nd their dance awesome all the song of this mve is rocking")
val df = sc.parallelize(values).toDF("words")
df.select(Extract(col("words"))).show()

Output

+--------------------+
|          UDF(words)|
+--------------------+
|@always_nidhi,@Yo...|
+--------------------+

Answered by Sree51

I took Amit Kumar's suggestion, created a UDF, and then ran it in Spark SQL:

select Words(status) as people from dataframe

"Words" is my UDF and status is my dataframe column.

“Words”是我的 UDF,状态是我的数据框列。
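
For completeness, here is a minimal sketch of how such a UDF can be registered so that the query above works from Spark SQL. The names Words, status, and the temporary view dataframe follow this answer; the SparkSession API assumed here requires Spark 2.x or later:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("myapp").master("local[*]").getOrCreate()
import spark.implicits._

// Register the extraction logic under the name "Words" used in the SQL query
spark.udf.register("Words", (text: String) => {
  val matcher = java.util.regex.Pattern.compile("@\\w+").matcher(text)
  val found = scala.collection.mutable.ListBuffer.empty[String]
  while (matcher.find()) found += matcher.group()
  found.mkString(",")
})

// Expose a DataFrame with a "status" column as the temporary view "dataframe"
val df = Seq("@always_nidhi @YouTube no i dnt understand bt i loved the music").toDF("status")
df.createOrReplaceTempView("dataframe")

spark.sql("select Words(status) as people from dataframe").show(false)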