
Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/15487413/


Simplest way to count words in a file

scala

Asked by Dariusz Mydlarz

I'm trying to code, in the simplest way possible, a program to count word occurrences in a file in Scala. So far I have this piece of code:


import scala.io.Codec.string2codec
import scala.io.Source
import scala.reflect.io.File

object WordCounter {
    val SrcDestination: String = ".." + File.separator + "file.txt"
    val Word = "\\b([A-Za-z\\-])+\\b".r

    def main(args: Array[String]): Unit = {

        val counter = Source.fromFile(SrcDestination)("UTF-8")
                .getLines
                .map(l => Word.findAllIn(l.toLowerCase()).toSeq)
                .toStream
                .groupBy(identity)
                .mapValues(_.length)

        println(counter)
    }
}

Don't worry about the regular expression. I would like to know how to extract single words from the sequence retrieved in this line:


map(l => Word.findAllIn(l.toLowerCase()).toSeq)

in order to get each word occurrence counted. Currently I'm getting a map of counted word sequences.

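To illustrate the problem with a minimal sketch (made-up input standing in for the file), mapping per line and then grouping compares whole line sequences rather than individual words:

List("a b", "a b", "c")
  .map(_.split(" ").toSeq)
  .groupBy(identity)
  .mapValues(_.length)
// => Map(Seq("a", "b") -> 2, Seq("c") -> 1): identical lines are counted, not words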

Answered by Garrett Hall

You can turn the file lines into words by splitting them with the regex "\\W+" (flatMap is lazy, so it doesn't need to load the entire file into memory). To count occurrences you can fold over a Map[String, Int], updating it with each word (much more memory and time efficient than using groupBy):


scala.io.Source.fromFile("file.txt")
  .getLines
  .flatMap(_.split("\\W+"))
  .foldLeft(Map.empty[String, Int]){
     (count, word) => count + (word -> (count.getOrElse(word, 0) + 1))
  }
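For intuition, here is the same fold applied to a small in-memory list (a minimal sketch, input made up for illustration):

List("a", "b", "a").foldLeft(Map.empty[String, Int]) {
  (count, word) => count + (word -> (count.getOrElse(word, 0) + 1))
}
// => Map("a" -> 2, "b" -> 1)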

Answered by Michael Yakobi

I think the following is slightly easier to understand:


Source.fromFile("file.txt").
  getLines().
  flatMap(_.split("\\W+")).
  toList.
  groupBy((word: String) => word).
  mapValues(_.length)
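As a quick sanity check, here is the same pipeline on a small in-memory list instead of a file (a sketch with made-up input; toList is omitted because a List is already strict, whereas the file version needs it to materialize the iterator before groupBy):

List("a b", "a c")
  .flatMap(_.split("\\W+"))
  .groupBy((word: String) => word)
  .mapValues(_.length)
// => Map("a" -> 2, "b" -> 1, "c" -> 1)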

Answered by Xavier Guihot

Starting Scala 2.13, in addition to retrieving words with Source, we can use the groupMapReduce method, which is (as its name suggests) an equivalent of a groupBy followed by mapValues and a reduce step:


import scala.io.Source

Source.fromFile("file.txt")
  .getLines.to(LazyList)
  .flatMap(_.split("\\W+"))
  .groupMapReduce(identity)(_ => 1)(_ + _)

The groupMapReduce stage, similarly to Hadoop's map/reduce logic,


  • groups words by themselves (identity) (group part of groupMapReduce)

  • maps each grouped word occurrence to 1 (map part of groupMapReduce)

  • reduces values within a group of words (_ + _) by summing them (reduce part of groupMapReduce).

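For intuition, the same method applied to a small in-memory list (a minimal sketch, values made up for illustration):

List("the", "cat", "the").groupMapReduce(identity)(_ => 1)(_ + _)
// => Map("the" -> 2, "cat" -> 1)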

This is a one-pass version of what can be translated by:


seq.groupBy(identity).mapValues(_.map(_ => 1).reduce(_ + _))

Also note the conversion from Iterator to LazyList in order to use a collection which provides groupMapReduce (we don't use a Stream since, starting with Scala 2.13, LazyList is the recommended replacement for Streams).




On the same principle, one could also use a for-comprehension version:


(for {
  line <- Source.fromFile("file.txt").getLines.to(LazyList)
  word <- line.split("\\W+")
} yield word)
.groupMapReduce(identity)(_ => 1)(_ + _)

Answered by DaoWen

I'm not 100% sure what you're asking, but I think I see the problem. Try using flatMap instead of map:


flatMap(l => Word.findAllIn(l.toLowerCase()).toSeq)

This will concatenate all of your sequences together so that groupBy is done on individual words instead of at the line level.

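To see the difference on made-up input (a small sketch):

val lines = List("a b", "a c")
lines.map(_.split(" ").toSeq)     // List(Seq("a", "b"), Seq("a", "c")): one Seq per line
lines.flatMap(_.split(" ").toSeq) // List("a", "b", "a", "c"): individual words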



A note about your Regex


I know you said not to worry about your Regex, but here are a couple changes you can make to make it a little more readable. Here's what you have right now:


val Word = "\\b([A-Za-z\\-])+\\b".r

First, you can use Scala's triple-quoted strings so you don't have to escape your backslashes:


val Word = """\b([A-Za-z\-])+\b""".r

Second, if you put the - at the beginning of your character class then you don't need to escape it:


val Word = """\b([-A-Za-z])+\b""".r
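As a quick check (a hypothetical sketch, sample text made up), the cleaned-up pattern keeps hyphenated words intact:

val Word = """\b([-A-Za-z])+\b""".r
Word.findAllIn("a well-known fact").toList
// => List("a", "well-known", "fact")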

Answered by JasonG

Here is what I did. This will chop up a file. A HashMap is a good bet for high performance and will outperform any sort of sort. There is a more terse sort-and-slice function in there too that you can look at.


import java.io.FileNotFoundException

/**
 * Cohesive static method object for file handling.
 */
object WordCountFileHandler {

  val FILE_FORMAT = "utf-8"

  /**
   * Take input from file. Split on spaces.
   * @param fileLocationAndName string location of file
   * @return option of string iterator
   */
  def apply (fileLocationAndName: String) : Option[Iterator[String]] = {
    apply (fileLocationAndName, " ")
  }

  /**
   * Split on separator parameter.
   * Speculative generality :P
   * @param fileLocationAndName string location of file
   * @param wordSeperator split on this string
   * @return
   */
  def apply (fileLocationAndName: String, wordSeperator: String): Option[Iterator[String]] = {
    try{
      val words = scala.io.Source.fromFile(fileLocationAndName).getLines() //scala io.Source is a bit hackey. No need to close file.

      //Get rid of anything funky... need the double space removal for files like the README.md...
      val wordList = words.reduceLeft(_ + wordSeperator + _).replaceAll("[^a-zA-Z\\s]", "").replaceAll("  ", "").split(wordSeperator)
      //wordList.foreach(println(_))
      wordList.length match {
        case 0 => return None
        case _ => return Some(wordList.toIterator)
      }
    } catch {
      case _:FileNotFoundException => println("file not found: " + fileLocationAndName); return None
      case e:Exception => println("Unknown exception occurred during file handling: \n\n" + e.getStackTrace); return None
    }
  }
}

import collection.mutable

/**
 * Static method object.
 * Takes a processed map and spits out the needed info
 * While a small performance hit is made in not doing this during the word list analysis,
 * this does demonstrate cohesion and open/closed much better.
 * author: jason goodwin
 */
object WordMapAnalyzer {

  /**
   * get input size
   * @param input
   * @return
   */
  def getNumberOfWords(input: mutable.Map[String, Int]): Int = {
    input.size
  }

  /**
   * Should be fairly logarithmic given merge sort performance is generally about O(6n log2 n + 6n).
   * See below for more performant method.
   * @param input
   * @return
   */

  def getTopCWordsDeclarative(input: mutable.HashMap[String, Int], c: Int): Map[String, Int] = {
    val sortedInput = input.toList.sortWith(_._2 > _._2)
    sortedInput.take(c).toMap
  }

  /**
   * Imperative style is used here for much better performance relative to the above.
   * Growth can be reasoned at linear growth on random input.
   * Probably upper bounded around O(3n + nc) in worst case (ie a sorted input from small to high).
   * @param input
   * @param c
   * @return
   */
  def getTopCWordsImperative(input: mutable.Map[String, Int], c: Int): mutable.Map[String, Int] = {
    var bottomElement: (String, Int) = ("", 0)
    val topList = mutable.HashMap[String, Int]()

    for (x <- input) {
      if (x._2 >= bottomElement._2 && topList.size == c ){
        topList -= (bottomElement._1)
        topList +=((x._1, x._2))
        bottomElement = topList.toList.minBy(_._2)
      } else if (topList.size < c ){
        topList +=((x._1, x._2))
        bottomElement = topList.toList.minBy(_._2)
      }
    }
    //println("Size: " + topList.size)

    topList.asInstanceOf[mutable.Map[String, Int]]
  }
}

object WordMapCountCalculator {

  /**
   * Take a list and return a map keyed by words with a count as the value.
   * @param wordList List[String] to be analysed
   * @return HashMap[String, Int] with word as key and count as pair.
   * */

   def apply (wordList: Iterator[String]): mutable.Map[String, Int] = {
    wordList.foldLeft(new mutable.HashMap[String, Int])((word, count) => {
      word get(count) match{
        case Some(x) => word += (count -> (x+1))   //if in map already, increment count
        case None => word += (count -> 1)          //otherwise, set to 1
      }
    }).asInstanceOf[mutable.Map[String, Int]] 
  }
}
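Tying the three objects together, a hypothetical end-to-end run could look like this (the file name is assumed):

WordCountFileHandler("file.txt") match {
  case Some(words) =>
    val counts = WordMapCountCalculator(words)
    println("distinct words: " + WordMapAnalyzer.getNumberOfWords(counts))
    println("top 10: " + WordMapAnalyzer.getTopCWordsImperative(counts, 10))
  case None =>
    println("no words read")
}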