Scala spark中的RDD过滤器

Question

提问by Esmaeil zahedi

I have a dataset and i want to extract those (review/text) which have (review/time) between x and y, for example ( 1183334400 < time < 1185926400),

我有一个数据集，我想提取那些在 x 和 y 之间具有（评论/时间）的（评论/文本），例如（1183334400 < 时间 < 1185926400），

here are part of my data:

这是我的部分数据：

product/productId: B000278ADA
product/title: Jobst Ultrasheer 15-20 Knee-High Silky Beige Large
product/price: 46.34
review/userId: A17KXW1PCUAIIN
review/profileName: Mark Anthony "Mark"
review/helpfulness: 4/4
review/score: 5.0
review/time: 1174435200
review/summary: Jobst UltraSheer Knee High Stockings
review/text: Does a very good job of relieving fatigue.

product/productId: B000278ADB
product/title: Jobst Ultrasheer 15-20 Knee-High Silky Beige Large
product/price: 46.34
review/userId: A9Q3932GX4FX8
review/profileName: Trina Wehle
review/helpfulness: 1/1
review/score: 3.0
review/time: 1352505600
review/summary: Delivery was very long wait.....
review/text: It took almost 3 weeks to recieve the two pairs of stockings .

product/productId: B000278ADB
product/title: Jobst Ultrasheer 15-20 Knee-High Silky Beige Large
product/price: 46.34
review/userId: AUIZ1GNBTG5OB
review/profileName: dgodoy
review/helpfulness: 1/1
review/score: 2.0
review/time: 1287014400
review/summary: sizes recomended in the size chart are not real
review/text: sizes are much smaller than what is recomended in the chart. I tried to put it and sheer it!.

my Spark-Scala Code :

我的 Spark-Scala 代码：

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object test1 {
  def main(args: Array[String]): Unit = {
    val conf1 = new SparkConf().setAppName("golabi1").setMaster("local")
    val sc = new SparkContext(conf1)
    val conf: Configuration = new Configuration
    conf.set("textinputformat.record.delimiter", "product/title:")
    val input1=sc.newAPIHadoopFile("data/Electronics.txt",     classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
    val lines = input1.map { text => text._2}
    val filt = lines.filter(text=>(text.toString.contains(tt => tt in (startdate until enddate))))
    filt.saveAsTextFile("data/filter1")
  }
}

but my code does not work well,

但我的代码不能正常工作，

how can i filter these lines?

我怎样才能过滤这些行？

Answer 1

回答by Daniel Langdon

Is is much simpler than that. Try this:

是比那简单得多。试试这个：

object test1 
{
  def main(args: Array[String]): Unit = 
  {
    val conf1 = new SparkConf().setAppName("golabi1").setMaster("local")
    val sc = new SparkContext(conf1)

    def extractDateAndCompare(line: String): Boolean=
    {
        val from = line.indexOf("/time: ") + 7
        val to = line.indexOf("review/text: ") -1
        val date = line.substring(from, to).toLong
        date > startDate && date < endDate
    }

    sc.textFile("data/Electronics.txt")
        .filter(extractDateAndCompare)
        .saveAsTextFile("data/filter1")
  }
}

I usually find those intermediate auxiliary methods to make things much clearer. Of course, this assumes the boundary dates are defined somewhere and that the input file contain format issues. I did this intentionally to keep this simple, but adding a try, returning an Option clause and using flatMap() can help you avoid errors if you have them.

我通常会找到那些中间辅助方法来使事情变得更加清晰。当然，这假设边界日期在某处定义并且输入文件包含格式问题。我这样做是为了保持简单，但是添加一个 try、返回一个 Option 子句并使用 flatMap() 可以帮助您避免错误（如果有错误）。

Also, your raw text is a little cumbersome, you might want to explore Json, TSV files or some other alternative, easier format.

此外，您的原始文本有点麻烦，您可能想要探索 Json、TSV 文件或其他一些更简单的替代格式。

Scala spark中的RDD过滤器

提问by Esmaeil zahedi

回答by Daniel Langdon

相关推荐

最近更新

标签

Scala spark中的RDD过滤器

提问by Esmaeil zahedi

回答by Daniel Langdon

相关推荐

scala 如何使用spark生成大量随机整数？

scala 如何初始化元组列表并在scala中添加项目

Scala/Spark 应用程序在“def main”样式应用程序中出现“No TypeTag available”错误

如何在 IDEA 中完全清理、重新解析和重建 Scala sbt 管理的项目？

相关推荐

最近更新

标签