Scala: extract words from a string column in a Spark dataframe

Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute the original authors (not me). Original question: http://stackoverflow.com/questions/47981699/

Extract words from a string column in spark dataframe

regex, scala, apache-spark

Asked by Sree51

I have a column in a Spark dataframe which contains text.

I want to extract all the words which start with the special character '@', and I am using regexp_extract on each row of that text column. If the text contains multiple words starting with '@', it only returns the first one.

I am looking to extract multiple words that match my pattern in Spark.

data_frame.withColumn("Names", regexp_extract($"text","(?<=^|(?<=[^a-zA-Z0-9-_\.]))@([A-Za-z]+[A-Za-z0-9_]+)",1).show

Sample input: @always_nidhi @YouTube no i dnt understand bt i loved the music nd their dance awesome all the song of this mve is rocking

Sample output: @always_nidhi,@YouTube
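
Aside (not part of the original question or answers): on Spark 3.1 and later, the built-in SQL function regexp_extract_all returns all matches at once, so a UDF can be avoided. A minimal sketch, assuming the same data_frame with a text column and Spark 3.1+:

import org.apache.spark.sql.functions.expr

// regexp_extract_all(str, regexp, idx) returns an array with every match of group idx (Spark 3.1+);
// concat_ws joins that array into the comma-separated string shown in the expected output
data_frame.withColumn(
  "Names",
  expr("""concat_ws(',', regexp_extract_all(text, '@\\w+', 0))""")
).show(false)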

Answered by Amit Kumar

You can create a UDF function in Spark as below:

import java.util.regex.Pattern
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.functions.col

// UDF that returns every match of the given regex group, joined by commas
def regexp_extractAll = udf((job: String, exp: String, groupIdx: Int) => {
  val pattern = Pattern.compile(exp)
  val m = pattern.matcher(job)
  var result = Seq[String]()
  while (m.find) {
    result = result :+ m.group(groupIdx)
  }
  result.mkString(",")
})

And then call the UDF as below:

data_frame.withColumn("Names", regexp_extractAll(new Column("text"), lit("@\w+"), lit(0))).show()

The above will give you the output below:

+--------------------+
|               Names|
+--------------------+
|@always_nidhi,@Yo...|
+--------------------+
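
Note that show() truncates long cell values to 20 characters by default, which is why the result appears as @always_nidhi,@Yo...; calling .show(false) prints the full value.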

I have used this regex as per the output you have posted in the question. You can modify it to suit your needs.

Answered by Souvik

You can use Java regex to extract those words. Below is the working code.

import org.apache.spark.{SparkConf, SparkContext}

val sparkConf = new SparkConf().setAppName("myapp").setMaster("local[*]")
val sc = new SparkContext(sparkConf)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.sql.functions.{col, udf}
import java.util.regex.Pattern

//User Defined function to extract
def toExtract(str: String) = {      
  val pattern = Pattern.compile("@\\w+")
  val tmplst = scala.collection.mutable.ListBuffer.empty[String]
  val matcher = pattern.matcher(str)
  while (matcher.find()) {
    tmplst += matcher.group()
  }
  tmplst.mkString(",")
}

val Extract = udf(toExtract _)
val values = List("@always_nidhi @YouTube no i dnt understand bt i loved the music nd their dance awesome all the song of this mve is rocking")
val df = sc.parallelize(values).toDF("words")
df.select(Extract(col("words"))).show()

Output

+--------------------+
|          UDF(words)|
+--------------------+
|@always_nidhi,@Yo...|
+--------------------+

Answered by Sree51

I took Amit Kumar's suggestion, created a UDF, and then ran it in Spark SQL:

select Words(status) as people from dataframe

"Words" is my UDF and status is my dataframe column.

“Words”是我的 UDF,状态是我的数据框列。
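
For completeness, here is a minimal sketch of how such a UDF can be registered so that the query above works from Spark SQL. The names Words, status, and the temporary view dataframe follow this answer; the SparkSession API assumed here requires Spark 2.x or later:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("myapp").master("local[*]").getOrCreate()
import spark.implicits._

// Register the extraction logic under the name "Words" used in the SQL query
spark.udf.register("Words", (text: String) => {
  val matcher = java.util.regex.Pattern.compile("@\\w+").matcher(text)
  val found = scala.collection.mutable.ListBuffer.empty[String]
  while (matcher.find()) found += matcher.group()
  found.mkString(",")
})

// Expose a DataFrame with a "status" column as the temporary view "dataframe"
val df = Seq("@always_nidhi @YouTube no i dnt understand bt i loved the music").toDF("status")
df.createOrReplaceTempView("dataframe")

spark.sql("select Words(status) as people from dataframe").show(false)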