RDD filter in Scala Spark
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/29750325/
Asked by Esmaeil zahedi
I have a dataset and I want to extract the review/text of the records whose review/time falls between x and y, for example 1183334400 < time < 1185926400.
Here is part of my data:
product/productId: B000278ADA
product/title: Jobst Ultrasheer 15-20 Knee-High Silky Beige Large
product/price: 46.34
review/userId: A17KXW1PCUAIIN
review/profileName: Mark Anthony "Mark"
review/helpfulness: 4/4
review/score: 5.0
review/time: 1174435200
review/summary: Jobst UltraSheer Knee High Stockings
review/text: Does a very good job of relieving fatigue.
product/productId: B000278ADB
product/title: Jobst Ultrasheer 15-20 Knee-High Silky Beige Large
product/price: 46.34
review/userId: A9Q3932GX4FX8
review/profileName: Trina Wehle
review/helpfulness: 1/1
review/score: 3.0
review/time: 1352505600
review/summary: Delivery was very long wait.....
review/text: It took almost 3 weeks to recieve the two pairs of stockings .
product/productId: B000278ADB
product/title: Jobst Ultrasheer 15-20 Knee-High Silky Beige Large
product/price: 46.34
review/userId: AUIZ1GNBTG5OB
review/profileName: dgodoy
review/helpfulness: 1/1
review/score: 2.0
review/time: 1287014400
review/summary: sizes recomended in the size chart are not real
review/text: sizes are much smaller than what is recomended in the chart. I tried to put it and sheer it!.
My Spark/Scala code:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object test1 {
  def main(args: Array[String]): Unit = {
    val conf1 = new SparkConf().setAppName("golabi1").setMaster("local")
    val sc = new SparkContext(conf1)
    val conf: Configuration = new Configuration
    conf.set("textinputformat.record.delimiter", "product/title:")
    val input1 = sc.newAPIHadoopFile("data/Electronics.txt", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
    val lines = input1.map { text => text._2 }
    val filt = lines.filter(text => (text.toString.contains(tt => tt in (startdate until enddate))))
    filt.saveAsTextFile("data/filter1")
  }
}
But my code does not work well. How can I filter these lines?
Answered by Daniel Langdon
It is much simpler than that. Try this:
import org.apache.spark.{SparkConf, SparkContext}

object test1 {
  def main(args: Array[String]): Unit = {
    val conf1 = new SparkConf().setAppName("golabi1").setMaster("local")
    val sc = new SparkContext(conf1)

    // Boundary dates, here using the example window from the question.
    val startDate = 1183334400L
    val endDate = 1185926400L

    // True only for "review/time: ..." lines whose timestamp falls
    // strictly inside the (startDate, endDate) window.
    def extractDateAndCompare(line: String): Boolean = {
      val marker = "review/time: "
      val from = line.indexOf(marker)
      if (from < 0) false
      else {
        val date = line.substring(from + marker.length).trim.toLong
        date > startDate && date < endDate
      }
    }

    sc.textFile("data/Electronics.txt")
      .filter(extractDateAndCompare)
      .saveAsTextFile("data/filter1")
  }
}
I usually find that intermediate auxiliary methods like this make things much clearer. Of course, this assumes the boundary dates are defined somewhere and that the input file contains no format issues. I kept it simple intentionally, but adding a try, returning an Option and using flatMap() can help you avoid errors if you have them.
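A minimal sketch of that safer variant, assuming the same startDate/endDate window as above (the helper name extractDate is illustrative, not from the original answer):

import scala.util.Try

// Parse the timestamp out of a "review/time: ..." line, if present and well formed.
def extractDate(line: String): Option[Long] = {
  val marker = "review/time: "
  val from = line.indexOf(marker)
  if (from < 0) None
  else Try(line.substring(from + marker.length).trim.toLong).toOption
}

sc.textFile("data/Electronics.txt")
  .flatMap(line => extractDate(line).map(date => (line, date))) // drop non-matching or malformed lines instead of crashing
  .filter { case (_, date) => date > startDate && date < endDate }
  .map { case (line, _) => line }
  .saveAsTextFile("data/filter1")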
Also, your raw text format is a little cumbersome; you might want to explore JSON, TSV files, or some other alternative, easier format.
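For instance, if each review were flattened to one TSV row, the whole filter collapses to a split and a comparison. A rough sketch, assuming a hypothetical productId/time/text column layout:

// Hypothetical layout: productId \t time \t text (an assumption, not from the answer).
sc.textFile("data/Electronics.tsv")
  .map(_.split('\t'))
  .filter { cols =>
    cols.length >= 3 && cols(1).forall(_.isDigit) && {
      val t = cols(1).toLong
      t > startDate && t < endDate
    }
  }
  .map(_.mkString("\t"))
  .saveAsTextFile("data/filter2")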

