Warning: this content is taken from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/4237674/

Scala: groupBy (identity) of List Elements

scala

Asked by sgzmd

I'm developing an application that builds pairs of words in (tokenised) text and produces the number of times each pair occurs (even when same-word pairs occur multiple times, it's OK, as it'll be evened out later in the algorithm).

When I use

elements groupBy()

I want to group by the elements' content itself, so I wrote the following:

def self(x: (String, String)) = x

/**
 * Maps a collection of words to a map where the key is a pair of words and
 * the value is the number of times this pair occurs in the passed array.
 */
def producePairs(words: Array[String]): Map[(String,String), Double] = {
  var table = List[(String, String)]()
  words.foreach(w1 =>
    words.foreach(w2 =>
      table = table ::: List((w1, w2))))

  val grouppedPairs = table.groupBy(self)
  val size = int2double(grouppedPairs.size)
  return grouppedPairs.mapValues(_.length / size)
}

Now, I fully realise that this self() trick is a dirty hack. So I thought a little and came up with:

grouppedPairs = table groupBy (x => x)

This way it produced what I want. However, I still feel that I'm clearly missing something and that there should be an easier way of doing it. Any ideas at all?

Also, if you'd help me to improve the pairs-extraction part, that would also help a lot; it looks very imperative and C++-ish right now. Many thanks in advance!

Answered by Landei

I'd suggest this:

def producePairs(words: Array[String]): Map[(String,String), Double] = {
    val table = for(w1 <- words; w2 <- words) yield (w1,w2)
    val grouppedPairs = table.groupBy(identity)
    val size = grouppedPairs.size.toDouble
    grouppedPairs.mapValues(_.length / size)
}

The for comprehension is much easier to read, and there is already a predefined function identity, which is a generalized version of your self.

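For instance, here is a quick sketch (not from the original answer) of what groupBy(identity) does on a small list of pairs; the expected results are shown as comments, roughly:

val table = List(("a", "b"), ("a", "b"), ("c", "d"))

// identity comes from Predef and is essentially: def identity[A](x: A): A = x
val grouped = table.groupBy(identity)
// Map((a,b) -> List((a,b), (a,b)), (c,d) -> List((c,d)))

val counts = grouped.mapValues(_.length)
// Map((a,b) -> 2, (c,d) -> 1)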

Answered by 0__

You are creating a list of pairs of all words against all words by iterating over words twice, whereas I guess you just want the neighbouring pairs. The easiest way is to use a sliding view instead.

def producePairs(words: Array[String]): Map[(String, String), Int] = {
  val pairs   = words.sliding(2, 1).map(arr => arr(0) -> arr(1)).toList
  val grouped = pairs.groupBy(t => t)
  grouped.mapValues(_.size)
}
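
As a small illustration (a sketch, not part of the original answer) of what the sliding view produces on a tiny input; results shown as comments, roughly:

val words = Array("the", "quick", "brown", "fox")

// sliding(2, 1) yields each consecutive window of two words:
// ("the","quick"), ("quick","brown"), ("brown","fox")
val pairs = words.sliding(2, 1).map(arr => arr(0) -> arr(1)).toList
// List((the,quick), (quick,brown), (brown,fox))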

Another approach would be to fold over the list of pairs, summing the counts up as you go. I'm not sure, though, that this is more efficient:

def producePairs(words: Array[String]): Map[(String, String), Int] = {
  val pairs = words.sliding(2, 1).map(arr => arr(0) -> arr(1))
  pairs.foldLeft(Map.empty[(String, String), Int]) { (m, p) =>
     m + (p -> (m.getOrElse(p, 0) + 1))
  }
}

I see you are returning a relative number (a Double). For simplicity I have just counted the occurrences, so you need to do the final division yourself. I think you want to divide by the total number of pairs (words.size - 1) and not by the number of unique pairs (grouped.size), so that the relative frequencies sum up to 1.0.

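A minimal sketch (not from the original answer) of that final division, assuming producePairs is the count-based version above; results shown as comments, roughly. An empty or single-word input would need a guard against dividing by zero:

val words = Array("a", "b", "a", "b")
val counts = producePairs(words)
// Map((a,b) -> 2, (b,a) -> 1)

val relative = counts.mapValues(_.toDouble / (words.size - 1))
// Map((a,b) -> 0.666..., (b,a) -> 0.333...), which sums to 1.0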

Answered by Debilski

An alternative approach, which is not of order O(num_words * num_words) but of order O(num_unique_words * num_unique_words) (or something like that):

def producePairs[T <% Traversable[String]](words: T): Map[(String,String), Double] = {
  val counts = words.groupBy(identity).map{case (w, ws) => (w -> ws.size)}
  val size = (counts.size * counts.size).toDouble
  for(w1 <- counts; w2 <- counts) yield {
      ((w1._1, w2._1) -> ((w1._2 * w2._2) / size))
  }
}
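
For clarity, a small sketch (not part of the original answer) of the word-counting step this answer builds on; results shown as comments, roughly. Counting each unique word once is what makes the later combination O(num_unique_words * num_unique_words):

val words = List("a", "b", "a", "a", "c")

val counts = words.groupBy(identity).map { case (w, ws) => w -> ws.size }
// Map(a -> 3, b -> 1, c -> 1)

// In the full all-words-against-all-words table, a pair (w1, w2) occurs
// counts(w1) * counts(w2) times, e.g. ("a","b") occurs 3 * 1 = 3 times.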