Generating a frequency map for a string in Scala

Question

Asked by nohat

Let's say I have a string, "hello", and I want to generate a character frequency map:

Map[Char,Int] = Map(h -> 1, e -> 1, o -> 1, l -> 2)

I could do this iteratively:

val str = "hello"
val counts = new scala.collection.mutable.HashMap[Char,Int]
for (i <- str) {
    if (counts.contains(i))
        counts.put(i, counts(i) + 1)
    else
        counts.put(i, 1)
}

By messing around in the REPL, I've found I can do something a bit more concise and not using a mutable collection:

> str.groupBy(_.toChar).map{ p => (p._1, p._2.length)}
scala.collection.immutable.Map[Char,Int] = Map(h -> 1, e -> 1, o -> 1, l -> 2)

But I don't know about the performance characteristics of groupBy() nor what is going on in the block passed to map (like what, exactly, p is).

How do I do this idiomatically using the functional paradigms in Scala?

For background, I'm just coming to Scala for the first time from Ruby. In Ruby, I would use inject, but I'm not sure what the parallel way to do it in Scala is:

counts = str.each_byte.inject(Hash.new(0)){ |h, c| h[c] += 1; h}

Answer 1

Answered by axel22

1) What does `p` mean?

groupBy takes a function which maps an element to a key of type K. When invoked on some collection Coll, it returns a Map[K, Coll] which contains mappings from keys K to all the elements which mapped to the same key.

So, in your case, str.groupBy(_.toChar) yields a map from a key k (which is a character) to a string with all the elements (characters) c such that k == c.toChar. You get this:

Map(e -> "e", h -> "h", l -> "ll", o -> "o")

A Map is an iterable of pairs of keys and values. In this case, each pair is a character and a string of elements. Calling the map operation on a Map involves mapping over these pairs: p is a pair where p._1 is a character and p._2 is the associated string (on which you can call length, as you did above).
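The two steps can also be looked at separately. The sketch below first materialises the intermediate grouped map and then maps over its pairs, using a pattern match to name the two halves of each pair instead of p._1 and p._2. The names grouped and counts are only illustrative.

```scala
// Step 1: group equal characters together; each value is a String of repeats.
val grouped: Map[Char, String] = "hello".groupBy(c => c)

// Step 2: map over the (key, value) pairs; `case (ch, occurrences)` names
// what the original code reaches through p._1 and p._2.
val counts: Map[Char, Int] =
  grouped.map { case (ch, occurrences) => (ch, occurrences.length) }
```

Naming the pair's components this way is purely a readability choice; it computes the same result as the p._1 / p._2 version.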

2) How to do this idiomatically

The above is how to do it idiomatically: using groupBy and map. Alternatively, you can use an immutable map and recursion on the string length to compute the frequencies, or an immutable map and a foldLeft.
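A minimal sketch of the recursive alternative, assuming recursion over the string's indices with an immutable Map as the accumulator. The names charCounts and loop are made up for illustration.

```scala
import scala.annotation.tailrec

// Counts characters by recursing over string indices; the immutable Map
// is threaded through the calls instead of being mutated.
def charCounts(str: String): Map[Char, Int] = {
  @tailrec
  def loop(i: Int, acc: Map[Char, Int]): Map[Char, Int] =
    if (i == str.length) acc
    else {
      val c = str.charAt(i)
      loop(i + 1, acc.updated(c, acc.getOrElse(c, 0) + 1))
    }

  loop(0, Map.empty)
}
```

The @tailrec annotation asks the compiler to verify that loop is compiled to a loop, so long strings do not grow the call stack.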

3) Performance characteristics

Best to benchmark to see the differences. Here are a couple of microbenchmarks for a highly repetitive string (~3GHz iMac, JDK7, Scala 2.10.0 nightly):

object Imperative extends testing.Benchmark {
  val str = "abc" * 750000

  def run() {
    // The map is mutated in place, so the reference itself can be a val.
    val counts = new scala.collection.mutable.HashMap[Char,Int]
    var i = 0
    val until = str.length
    while (i < until) {
      val c = str(i)
      if (counts.contains(c))
        counts.put(c, counts(c) + 1)
      else
        counts.put(c, 1)
      i += 1
    }

    //println(counts)
  }
}


object Combinators extends testing.Benchmark {
  val str = "abc" * 750000

  def run() {
    val f = str.groupBy(_.toChar).map(p => (p._1, p._2.length))
  }
}


object Fold extends testing.Benchmark {
  val str = "abc" * 750000

  def run() {
    val f = str.foldLeft(Map[Char, Int]() withDefaultValue 0){(h, c) => h.updated(c, h(c)+1)}
  }
}

Results:

Imperative: $ 103 57 53 58 53 53 53 53 53 53
Combinators: $ 72 51 63 56 53 52 52 54 53 53
Fold: $ 163 62 71 62 57 57 57 58 57 57

Note that changing the imperative version to use withDefaultValue:

val counts = new scala.collection.mutable.HashMap[Char,Int].withDefaultValue(0)
var i = 0
val until = str.length
while (i < until) {
  val c = str(i)
  counts.put(c, counts(c) + 1)
  i += 1
}

is apparently terribly slow due to forwarding each put call:

withDefaultValue: $ 133 87 109 106 101 100 101 100 101 101

Conclusion: the cost of boxing and unboxing characters is high enough in this case that the performance differences between these approaches are hard to observe.
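If boxing is the dominant cost, one way to sidestep it while counting is to keep the counters in a primitive Array[Int] indexed by character code and build the Map only at the end. This is a sketch of a different technique, not one of the benchmarked versions, and it has not been measured here. The name countsViaArray is made up for illustration.

```scala
// Counts into a primitive Int array (one slot per possible Char value),
// then converts the non-zero slots into an immutable Map at the end.
def countsViaArray(str: String): Map[Char, Int] = {
  val slots = new Array[Int](Char.MaxValue.toInt + 1)
  var i = 0
  while (i < str.length) {
    slots(str.charAt(i).toInt) += 1
    i += 1
  }
  slots.indices.iterator
    .filter(code => slots(code) > 0)
    .map(code => code.toChar -> slots(code))
    .toMap
}
```

The trade-off is a fixed array of 65536 counters per call, which only pays off for long inputs.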

EDIT:

Update: You may want to use ScalaMeter inline benchmarking in place of the Benchmark trait.
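If neither the Benchmark trait nor ScalaMeter is available in your setup, a rough stand-in is a hand-rolled timer around System.nanoTime. The sketch below is not ScalaMeter's API and is not the harness that produced the numbers above; the helper name timeRuns is hypothetical. Like any such loop it is sensitive to JIT warm-up, which is why several runs are recorded.

```scala
// Runs `body` `runs` times and returns the elapsed milliseconds of each run,
// together with the result of the last run. Early runs include JIT warm-up.
def timeRuns[A](runs: Int)(body: => A): (Vector[Long], A) = {
  require(runs > 0, "runs must be positive")
  var last: Option[A] = None
  val millis = Vector.fill(runs) {
    val start = System.nanoTime()
    last = Some(body)
    (System.nanoTime() - start) / 1000000L
  }
  (millis, last.get)
}

val input = "abc" * 750000
val timed = timeRuns(10) {
  input.groupBy(c => c).map { case (ch, s) => (ch, s.length) }
}
println(timed._1.mkString(" "))
```

The printed line has the same shape as the result rows above: one elapsed time per run.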

Answer 2

Answered by Nikita Volkov

Extending Axel's answer.

Your groupBy solution is already functional. There's just a tiny correction that could make it cleaner:

str.groupBy(_.toChar).mapValues(_.size)
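One caveat on newer versions: in Scala 2.13, mapValues on a Map is deprecated and returns a lazy view rather than a strict Map. A sketch of a strict alternative is transform, which receives both the key and the value. The name frequencies is illustrative.

```scala
// transform builds a new strict Map, passing each key and value to the function;
// the key is ignored here because only the group size matters.
val frequencies: Map[Char, Int] =
  "hello".groupBy(c => c).transform((_, group) => group.length)
```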

Scala's alternatives to inject are foldLeft, foldRight, reduce, and reduceOption, depending on how you use it. The way you've used inject in Ruby is not functional, since your solution is based on mutating h, and in the functional world mutability is a "no-no". Here's how you'd write a solution close to your inject version, but in functional style, in Scala:

str.foldLeft( Map[Char, Int]() ){ (m, c) => m + (c -> (m.getOrElse(c, 0) + 1)) }

Obviously groupBy looks much better.
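To see what the foldLeft version does at each step, the same accumulator function can be passed to scanLeft, which keeps every intermediate map instead of only the last one. This only illustrates the fold's mechanics; the names step, trace and viaFold are made up.

```scala
// The accumulator function from the foldLeft version, given a name.
def step(m: Map[Char, Int], c: Char): Map[Char, Int] =
  m + (c -> (m.getOrElse(c, 0) + 1))

// scanLeft yields the initial empty map plus one map per character consumed.
val trace: List[Map[Char, Int]] =
  "hello".toList.scanLeft(Map.empty[Char, Int])(step)

// The last element of the trace is what foldLeft returns.
val viaFold: Map[Char, Int] = "hello".foldLeft(Map.empty[Char, Int])(step)
```

Each step returns a new immutable map, so no intermediate result is ever modified.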

Answer 3

Answered by incrop

Your Ruby example can be translated almost directly to Scala using foldLeft and an immutable Map.

Here is one possible solution:

str.foldLeft(Map[Char, Int]() withDefaultValue 0){(h, c) => h.updated(c, h(c)+1)}

Actually, if you are OK with local mutability, you can write something like this:

def charFrequencies(str: String): collection.Map[Char, Int] = {
  val hash = collection.mutable.HashMap.empty[Char, Int] withDefaultValue 0
  str foreach { hash(_) += 1 }
  hash
}

The expression hash(_) += 1 will be desugared to c => hash(c) = hash(c) + 1 and then to c => hash.update(c, hash.apply(c) + 1).
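The three spellings can be checked against each other. The sketch below applies each form to its own map and assumes, as in the code above, a mutable map created with withDefaultValue 0. The helper name freshCounts and the value names are illustrative.

```scala
import scala.collection.mutable

// A fresh mutable map whose apply returns 0 for missing keys.
def freshCounts(): mutable.Map[Char, Int] =
  mutable.HashMap.empty[Char, Int].withDefaultValue(0)

// Placeholder form, as written in the answer.
val sugared = freshCounts()
"hello".foreach { sugared(_) += 1 }

// After expanding += into an assignment.
val assigned = freshCounts()
"hello".foreach { c => assigned(c) = assigned(c) + 1 }

// After expanding the apply/update sugar.
val explicit = freshCounts()
"hello".foreach { c => explicit.update(c, explicit.apply(c) + 1) }
```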

This solution should be more efficient than the functional ones, because it doesn't create intermediate collections. Also, because the method's return type is collection.Map[Char, Int], which exposes no mutating operations, the result will be treated as immutable (as long as no one performs an unsafe downcast on it).

Answer 4

Answered by Xavier Guihot

Starting with Scala 2.13, we can use the groupMapReduce method, which is (as its name suggests) equivalent to a groupBy followed by mapValues and a reduce step:

"hello".groupMapReduce(identity)(_ => 1)(_ + _)
// immutable.Map[Char,Int] = Map(e -> 1, h -> 1, l -> 2, o -> 1)

This:

groups the characters (the group part of groupMapReduce)
maps each occurrence in a group to 1 (the map part of groupMapReduce)
reduces the values within each group by summing them with _ + _ (the reduce part of groupMapReduce).
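For Scala versions before 2.13, the three steps can be approximated by hand with a single foldLeft. The sketch below is not the standard library's implementation; groupMapReduceCompat is a hypothetical helper written only to make the group, map and reduce parts explicit.

```scala
// Hand-written stand-in for groupMapReduce: one pass, one immutable Map.
def groupMapReduceCompat[A, K, B](xs: Iterable[A])(key: A => K)(f: A => B)(reduce: (B, B) => B): Map[K, B] =
  xs.foldLeft(Map.empty[K, B]) { (acc, x) =>
    val k = key(x) // group: pick the bucket for this element
    val v = f(x)   // map: turn the element into a value
    val merged = acc.get(k) match {
      case Some(previous) => reduce(previous, v) // reduce: combine with the bucket's current value
      case None           => v                   // first value seen for this key
    }
    acc.updated(k, merged)
  }

val helloCounts: Map[Char, Int] =
  groupMapReduceCompat("hello".toList)(c => c)(_ => 1)(_ + _)
```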

The groupMapReduce call is equivalent to the following, but runs in a single pass through the sequence of characters:

"hello".groupBy(identity).mapValues(_.map(_ => 1).reduce(_+_))
