How to read a text file line by line using Scala (Spark), split it on a delimiter, and store the values in respective columns?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow

Original question: http://stackoverflow.com/questions/46345620/
Asked by niranjan
I am new to Scala.
My requirement is to read the file line by line, split each line on a particular delimiter, and extract the values into their respective columns in a different file.
Below is my input sample data:
ABC Log
Aug 10 14:36:52 127.0.0.1 CEF:0|McAfee|ePolicy Orchestrator|IFSSLCRT0.5.0.5/epo4.0|2410|DeploymentTask|High eventId=34 externalId=23
Aug 10 15:45:56 127.0.0.1 CEF:0|McAfee|ePolicy Orchestrator|IFSSLCRT0.5.0.5/epo4.0|2890|DeploymentTask|Medium eventId=888 externalId=7788
Aug 10 16:40:59 127.0.0.1 CEF:0|NV|ePolicy Orchestrator|IFSSLCRT0.5.0.5/epo4.0|2990|DeploymentTask|Low eventId=989 externalId=0004
XYZ Log
Aug 15 14:32:15 142.101.36.118 cef[10612]: CEF:0|fire|cc|3.5.1|FireEye Acquisition Started
Aug 16 16:45:10 142.101.36.189 cef[10612]: CEF:0|cold|dd|3.5.4|FireEye Acquisition Started
Aug 18 19:50:20 142.101.36.190 cef[10612]: CEF:0|fire|ee|3.5.6|FireEye Acquisition Started
In the above data I need to read the first part, under the 'ABC Log' heading, extract the values from each line, and put them under their respective columns. The first few column names are hardcoded, and the last columns I need to extract by splitting on "=", i.e. eventId=34 externalId=23 => col = eventId, value = 34 and col = externalId, value = 23.
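For those trailing pairs, a minimal sketch of that "=" split in plain Scala (the sample string and names here are illustrative, not from the question):

val tail = "eventId=34 externalId=23"
val pairs = tail.split(" ").map { kv =>
  val Array(key, value) = kv.split("=", 2) // split on the first "=" only
  (key, value)
}
// pairs: Array((eventId,34), (externalId,23))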
Column names:
date time ip_address col1 col2 col3 col4 col5
I want output like below:
This is for the first part, 'ABC Log', which should go into one file; the same applies for the remaining parts (see the sketch after the expected output below).
date time ip_address col1 col2 col3 col4 col5 col6 col7
Aug 10 14:36:52 127.0.0.1 CEF:0 McAfee ePolicy Orchestrator IFSSLCRT0.5.0.5/epo4.0 2410 DeploymentTask High
Aug 10 15:45:56 127.0.0.1 CEF:0 McAfee ePolicy Orchestrator IFSSLCRT0.5.0.5/epo4.0 2890 DeploymentTask Medium
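Since each heading's rows are meant to end up in their own file, here is a minimal sketch of separating the sections by their heading lines first, in plain Scala rather than Spark (the file name and the endsWith("Log") heuristic are assumptions based on the sample data):

import scala.io.Source

object SectionSplitSketch {
  def main(args: Array[String]): Unit = {
    val lines = Source.fromFile("input.txt").getLines().toList

    // Treat any line ending in "Log" (e.g. "ABC Log", "XYZ Log") as a heading
    // and group it with the lines that follow it.
    val sections = lines.foldLeft(List.empty[(String, List[String])]) {
      case (acc, line) if line.trim.endsWith("Log") => (line.trim, Nil) :: acc
      case ((h, body) :: rest, line)                => (h, line :: body) :: rest
      case (Nil, _)                                 => Nil // skip lines before the first heading
    }.map { case (h, body) => (h, body.reverse) }.reverse

    // Each (heading, body) pair can now be parsed and written to its own file
    sections.foreach { case (heading, body) =>
      println(s"$heading -> ${body.size} data lines")
    }
  }
}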
Below is the code I have been trying:
package AV_POC_Parsing

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.log4j.Logger

// For implicit conversions like converting RDDs to DataFrames
//import spark.implicits._

// note: naming an object `scala` shadows the scala package and is best avoided
object scala {

  def main(args: Array[String]) {

    // create Spark context with Spark configuration
    val sc = new SparkContext(new SparkConf().setAppName("AV_Log_Processing").setMaster("local[*]"))

    // Read the text file into a Spark RDD
    val textFile = sc.textFile("input.txt")

    // Split each line on a single space: RDD[Array[String]]
    val splitRdd = textFile.map(line => line.split(" "))

    // printing values
    splitRdd.foreach { x => x.foreach { y => println(y) } }

    // how to store the split values in separate columns and write them to a file?
  }
}
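To address that final comment, one hedged way to get named columns and write them out is the DataFrame API, assuming Spark 2.x or later. A minimal sketch covering only the first three columns (the output directory name is illustrative):

import org.apache.spark.sql.SparkSession

object ColumnSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("AV_Log_Processing")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // enables .toDF on RDDs of tuples

    // Split on space or "|" and skip the short heading lines ("ABC Log")
    val rows = spark.sparkContext.textFile("input.txt")
      .map(_.split("[ |]"))
      .filter(_.length >= 4)
      .map(a => (a(0) + " " + a(1), a(2), a(3))) // date, time, ip_address

    // Name the columns, then write tab-separated part files
    val df = rows.toDF("date", "time", "ip_address")
    df.write.option("sep", "\t").csv("abc_log_out")

    spark.stop()
  }
}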
Also, how do I split on two delimiters in Scala?
Thanks
Answered by ashburshui
Maybe this helps you.
import org.apache.spark.{SparkConf, SparkContext}

object DataFilter {

  def main(args: Array[String]): Unit = {

    // create Spark context with Spark configuration
    val sc = new SparkContext(new SparkConf().setAppName("AV_Log_Processing").setMaster("local[*]"))

    // Read the text file into a Spark RDD
    val textFile = sc.textFile("input.txt")

    val splitRdd = textFile.map { s =>
      // "[ |]" is a regex character class: split on a space OR a "|"
      val a = s.split("[ |]")
      // rejoin month and day (e.g. "Aug" + "10") into one date field
      val date = Array(a(0) + " " + a(1))
      // keep the date plus the last 10 fields, tab-separated
      (date ++ a.takeRight(10)).mkString("\t")
    } // RDD[String]

    // printing values
    splitRdd.foreach(println)
  }
}
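On the sub-question about two delimiters: split takes a regular expression, so the character class "[ |]" used above matches either a space or a pipe. A short standalone check (the sample line is taken from the question's data):

val line = "Aug 10 14:36:52 127.0.0.1 CEF:0|McAfee|ePolicy Orchestrator"
val a = line.split("[ |]")  // character class: space OR "|"
val b = line.split(" |\\|") // regex alternation with "|" escaped: same result
println(a.sameElements(b))  // true

Note that a field which itself contains a space, such as ePolicy Orchestrator, is also split into two tokens, which is why the answer rejoins the month and day into one date field.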

