Word count using Spark and Scala

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/30734261/
Asked by ScazzoMatto
I have to write a program in Scala, using Spark, that counts how many times a word occurs in a text. But using the RDD, my count variable always displays 0 at the end. Can you help me please? This is my code:
import scala.io.Source
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object wordcount {
  def main(args: Array[String]) {
    // set spark context
    val conf = new SparkConf().setAppName("wordcount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val distFile = sc.textFile("bible.txt")

    print("Enter word to look for in the HOLY BIBLE: ")
    val word = Console.readLine
    var count = 0
    println("You entered " + word)

    for (bib <- distFile.flatMap(_.split(" "))) {
      if (word == bib) {
        count += 1
      }
    }
    println(word + " occurs " + count + " times in the HOLY BIBLE!")
  }
}
Answered by Sathish
I suggest you use the transformations available on RDDs instead of your own loop (though it does no harm to try) to get the desired result. For example, the following code could be used to retrieve the word count.
val word = Console.readLine
println("You entered " + word)

val input = sc.textFile("bible.txt")
val splitedLines = input.flatMap(line => line.split(" "))
                        .filter(x => x.equals(word))

System.out.println(splitedLines.count())
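One caveat worth adding (my note, not part of the answer above): splitting only on single spaces leaves punctuation attached to words, so "lord," would not match "lord". A hedged sketch that splits on runs of non-word characters and ignores case:

```scala
// Sketch of a more forgiving match (assumption: case and punctuation
// should be ignored; the answer above does exact matching).
val word = Console.readLine
val count = sc.textFile("bible.txt")
  .flatMap(_.split("\\W+"))          // "lord," and "lord" both become "lord"
  .filter(_.equalsIgnoreCase(word))  // case-insensitive comparison
  .count()
println(word + " occurs " + count + " times")
```

Whether this is the right behavior depends on the text; exact matching may be what you want for a case-sensitive search.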
Please refer to this link for more information about the internals of Spark.
Answered by Justin Pihony
The problem is that you are using a mutable variable on a distributed dataset. This is hard to control in normal situations, and in Spark especially, the variable is copied separately to each worker. So each worker ends up with its own version of the count variable, and the original is never actually updated. You would need to use an accumulator, whose updates are only guaranteed inside actions. All that said, you can accomplish this without variables or accumulators:
val splitData = distFile.flatMap(_.split(" "))
val finalCount = splitData.aggregate(0)(
  (accum, w) => if (w == word) accum + 1 else accum,
  _ + _)
What this is doing is first seeding the count with 0. The first operation is then what runs on each partition: accum is the accumulated count and the second argument is the current word, compared against the search term. The second operation is simply the combiner used to add all of the partitions' counts together.
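For completeness, the accumulator route mentioned above can be sketched as follows (my sketch, not part of the answer; it assumes Spark 2.x+, where sc.longAccumulator exists — Spark 1.x used sc.accumulator(0) — and reuses sc, distFile and word from the question's code):

```scala
// Accumulator alternative: a driver-side counter that workers add to.
val occurrences = sc.longAccumulator("occurrences")

// foreach is an action, so accumulator updates made here are applied
// reliably, unlike updates made inside transformations such as map.
distFile.flatMap(_.split(" "))
        .foreach(w => if (w == word) occurrences.add(1))

println(word + " occurs " + occurrences.value + " times")
```

The aggregate version above is still preferable when the result itself is the point; accumulators are better suited to side statistics collected while doing other work.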
Answered by MomoAG
I think the iteration bib <- distFile.flatMap(_.split(" ")) won't work, because your data is in an RDD; try doing a collect, like:

for (bib <- distFile.flatMap(_.split(" ")).collect)

(this works only if your data is not huge, so that you can collect it to the driver)
Otherwise, if your dataset is huge, you can do it like this:

val distFile = sc.textFile("bible.txt")
val word = Console.readLine
val count = distFile.flatMap(_.split(" ")).filter(l => l == word).count
println(word + " occurs " + count + " times in the HOLY BIBLE!")
Answered by Chinmoy
val textFile = sc.textFile("demoscala.txt")
val counts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.saveAsTextFile("WordCountSpark")
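Since the original question asks for the count of one specific word, the counts pair RDD above can also be queried directly with lookup (my sketch, not part of the answer; "spark" is a placeholder search word):

```scala
// Reuses counts from the snippet above.
val word = "spark" // placeholder: the word to search for
// lookup returns a Seq of values for the key; empty if the word never appears
val occurrences = counts.lookup(word).headOption.getOrElse(0)
println(word + " occurs " + occurrences + " times")
```

Note that lookup scans partitions unless the RDD has a known partitioner, so for a single ad-hoc query the filter-and-count approach in the earlier answers is just as reasonable.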
If anyone is confused by the (_) placeholder syntax, the blog below gives a good explanation:
http://www.codecommit.com/blog/scala/quick-explanation-of-scalas-syntax
Answered by user11095143
val text = sc.textFile("filename.txt")
val counts = text.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.collect

