
Warning: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/46345620/


How to read a text file using Scala (Spark) line by line, split it on a delimiter, and store the values in their respective columns?

scala, apache-spark

Asked by niranjan

I am new to Scala.

My requirement is to read the file line by line, split each line on a particular delimiter, and extract the values into their respective columns in a different file.

Below is my sample input data:

ABC Log

Aug 10 14:36:52 127.0.0.1 CEF:0|McAfee|ePolicy Orchestrator|IFSSLCRT0.5.0.5/epo4.0|2410|DeploymentTask|High  eventId=34 externalId=23
Aug 10 15:45:56 127.0.0.1 CEF:0|McAfee|ePolicy Orchestrator|IFSSLCRT0.5.0.5/epo4.0|2890|DeploymentTask|Medium eventId=888 externalId=7788
Aug 10 16:40:59 127.0.0.1 CEF:0|NV|ePolicy Orchestrator|IFSSLCRT0.5.0.5/epo4.0|2990|DeploymentTask|Low eventId=989 externalId=0004


XYZ Log

Aug 15 14:32:15 142.101.36.118 cef[10612]: CEF:0|fire|cc|3.5.1|FireEye Acquisition Started
Aug 16 16:45:10 142.101.36.189 cef[10612]: CEF:0|cold|dd|3.5.4|FireEye Acquisition Started
Aug 18 19:50:20 142.101.36.190 cef[10612]: CEF:0|fire|ee|3.5.6|FireEye Acquisition Started

In the data above, I need to read the first part under the 'ABC Log' heading, extract the values from each line, and put them under their respective columns. The first few column names are hardcoded; the last columns I need to extract by splitting on "=", i.e. eventId=34 externalId=23 => column eventId with value 34 and column externalId with value 23.
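A minimal sketch of that key=value step, assuming the trailing tokens have already been isolated from the rest of the line (the parseKeyValue helper below is hypothetical, not part of the original post):

// hypothetical helper: split a "key=value" token into a (column, value) pair
def parseKeyValue(token: String): (String, String) = {
  val Array(col, value) = token.split("=", 2)
  (col, value)
}

// "eventId=34 externalId=23" => Array(("eventId","34"), ("externalId","23"))
val pairs = "eventId=34 externalId=23".split("\\s+").map(parseKeyValue)
pairs.foreach { case (col, value) => println(s"col = $col, value = $value") }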

Column names:

date time ip_address col1 col2 col3 col4 col5 col6 col7

I want output like the following:

This is for the first part, 'ABC Log', which goes into one file; the same applies to the rest.

date    time      ip_address  col1   col2    col3                  col4                    col5  col6            col7
Aug 10  14:36:52  127.0.0.1   CEF:0  McAfee  ePolicy Orchestrator  IFSSLCRT0.5.0.5/epo4.0  2410  DeploymentTask  High
Aug 10  15:45:56  127.0.0.1   CEF:0  McAfee  ePolicy Orchestrator  IFSSLCRT0.5.0.5/epo4.0  2890  DeploymentTask  Medium

Below is the code I have been trying:

package AV_POC_Parsing

import org.apache.spark.{SparkConf, SparkContext}

// note: avoid naming the object `scala`, as that shadows the standard library package
object LogSplitter {

  def main(args: Array[String]): Unit = {

    // create Spark context with Spark configuration
    val sc = new SparkContext(new SparkConf().setAppName("AV_Log_Processing").setMaster("local[*]"))

    // read the text file into an RDD[String], one element per line
    val textFile = sc.textFile("input.txt")

    // split each line on spaces: RDD[Array[String]]
    val splitRdd = textFile.map(line => line.split(" "))

    // printing values
    splitRdd.foreach { x => x.foreach { y => println(y) } }

    // how to store the split values in different columns and write them to a file?
  }
}

How do I split on two delimiters in Scala?
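For reference, String.split in Scala takes a regular expression, so a single character class can handle two delimiters in one call; a small sketch splitting on both a space and a pipe:

val line = "Aug 10 14:36:52 127.0.0.1 CEF:0|McAfee|ePolicy Orchestrator"
// the character class [ |] matches either a space or a pipe
val tokens = line.split("[ |]")
tokens.foreach(println)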

Thanks

Answered by ashburshui

Maybe this helps you.

import org.apache.spark.{SparkConf, SparkContext}

object DataFilter {

  def main(args: Array[String]): Unit = {

    // create Spark context with Spark configuration
    val sc = new SparkContext(new SparkConf().setAppName("AV_Log_Processing").setMaster("local[*]"))

    // read the text file into an RDD[String], one element per line
    val textFile = sc.textFile("input.txt")

    // split each line on either delimiter (space or pipe), rebuild the
    // "Aug 10" date from the first two tokens, and keep the last ten
    // fields as one tab-separated row
    val splitRdd = textFile.map { s =>
      val a = s.split("[ |]")
      val date = Array(a(0) + " " + a(1))
      (date ++ a.takeRight(10)).mkString("\t")
    }
    // splitRdd is now an RDD[String], one tab-separated row per input line

    // printing values
    splitRdd.foreach(println)

    // writing the rows out to a file is sketched below
  }
}
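To cover the remaining part of the question, writing the columns out to a file, here is a hedged sketch: saveAsTextFile persists the tab-separated rows as-is, and the DataFrame variant gives the columns explicit names (it assumes the Spark SQL dependency is on the classpath; the column names and output paths are illustrative only):

// simplest option: persist the tab-separated rows produced above
splitRdd.saveAsTextFile("abc_log_output")

// alternative sketch: give the columns explicit names via a DataFrame
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("AV_Log_Processing").master("local[*]").getOrCreate()
import spark.implicits._

val df = splitRdd
  .map(_.split("\t"))
  .map(a => (a(0), a(1), a(2), a(3), a(4)))  // extend the tuple to as many columns as needed
  .toDF("date", "col1", "col2", "col3", "col4")

df.write.option("sep", "\t").csv("abc_log_csv")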