Warning: this content is taken from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/4237674/

Scala: groupBy (identity) of List Elements

scala

Asked by sgzmd

I'm developing an application that builds pairs of words in (tokenised) text and produces the number of times each pair occurs (even when same-word pairs occur multiple times, it's OK, as it'll be evened out later in the algorithm).

When I use

elements groupBy()

I want to group by the elements' content itself, so I wrote the following:

def self(x: (String, String)) = x

/**
 * Maps a collection of words to a map where the key is a pair of words and
 * the value is the number of times this pair occurs in the passed array.
 */
def producePairs(words: Array[String]): Map[(String,String), Double] = {
  var table = List[(String, String)]()
  words.foreach(w1 =>
    words.foreach(w2 =>
      table = table ::: List((w1, w2))))

  val grouppedPairs = table.groupBy(self)
  val size = int2double(grouppedPairs.size)
  return grouppedPairs.mapValues(_.length / size)
}

Now, I fully realise that this self() trick is a dirty hack. So I thought a little and came up with:

grouppedPairs = table groupBy (x => x)

This way it produced what I want. However, I still feel that I'm clearly missing something and that there should be an easier way of doing it. Any ideas at all?

Also, if you'd help me to improve the pairs-extraction part, that would also help a lot; it looks very imperative and C++-ish right now. Many thanks in advance!

Answered by Landei

I'd suggest this:

def producePairs(words: Array[String]): Map[(String,String), Double] = {
    val table = for(w1 <- words; w2 <- words) yield (w1,w2)
    val grouppedPairs = table.groupBy(identity)
    val size = grouppedPairs.size.toDouble
    grouppedPairs.mapValues(_.length / size)
}

The for comprehension is much easier to read, and there is already a predefined function identity, which is a generalized version of your self.

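For instance, here is a quick sketch (not from the original answer) of what groupBy(identity) does on a small list of pairs; the expected results are shown as comments, roughly:

val table = List(("a", "b"), ("a", "b"), ("c", "d"))

// identity comes from Predef and is essentially: def identity[A](x: A): A = x
val grouped = table.groupBy(identity)
// Map((a,b) -> List((a,b), (a,b)), (c,d) -> List((c,d)))

val counts = grouped.mapValues(_.length)
// Map((a,b) -> 2, (c,d) -> 1)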

Answered by 0__

You are creating a list of pairs of all words against all words by iterating over words twice, whereas I guess you just want the neighbouring pairs. The easiest way is to use a sliding view instead.

def producePairs(words: Array[String]): Map[(String, String), Int] = {
  val pairs   = words.sliding(2, 1).map(arr => arr(0) -> arr(1)).toList
  val grouped = pairs.groupBy(t => t)
  grouped.mapValues(_.size)
}
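
As a small illustration (a sketch, not part of the original answer) of what the sliding view produces on a tiny input; results shown as comments, roughly:

val words = Array("the", "quick", "brown", "fox")

// sliding(2, 1) yields each consecutive window of two words:
// ("the","quick"), ("quick","brown"), ("brown","fox")
val pairs = words.sliding(2, 1).map(arr => arr(0) -> arr(1)).toList
// List((the,quick), (quick,brown), (brown,fox))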

Another approach would be to fold over the list of pairs, summing the counts up as you go. I'm not sure, though, that this is more efficient:

def producePairs(words: Array[String]): Map[(String, String), Int] = {
  val pairs = words.sliding(2, 1).map(arr => arr(0) -> arr(1))
  pairs.foldLeft(Map.empty[(String, String), Int]) { (m, p) =>
     m + (p -> (m.getOrElse(p, 0) + 1))
  }
}

I see you are returning a relative number (a Double). For simplicity I have just counted the occurrences, so you need to do the final division yourself. I think you want to divide by the total number of pairs (words.size - 1) and not by the number of unique pairs (grouped.size), so that the relative frequencies sum up to 1.0.

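A minimal sketch (not from the original answer) of that final division, assuming producePairs is the count-based version above; results shown as comments, roughly. An empty or single-word input would need a guard against dividing by zero:

val words = Array("a", "b", "a", "b")
val counts = producePairs(words)
// Map((a,b) -> 2, (b,a) -> 1)

val relative = counts.mapValues(_.toDouble / (words.size - 1))
// Map((a,b) -> 0.666..., (b,a) -> 0.333...), which sums to 1.0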

Answered by Debilski

An alternative approach, which is not of order O(num_words * num_words) but of order O(num_unique_words * num_unique_words) (or something like that):

def producePairs[T <% Traversable[String]](words: T): Map[(String,String), Double] = {
  val counts = words.groupBy(identity).map{case (w, ws) => (w -> ws.size)}
  val size = (counts.size * counts.size).toDouble
  for(w1 <- counts; w2 <- counts) yield {
      ((w1._1, w2._1) -> ((w1._2 * w2._2) / size))
  }
}
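
For clarity, a small sketch (not part of the original answer) of the word-counting step this answer builds on; results shown as comments, roughly. Counting each unique word once is what makes the later combination O(num_unique_words * num_unique_words):

val words = List("a", "b", "a", "a", "c")

val counts = words.groupBy(identity).map { case (w, ws) => w -> ws.size }
// Map(a -> 3, b -> 1, c -> 1)

// In the full all-words-against-all-words table, a pair (w1, w2) occurs
// counts(w1) * counts(w2) times, e.g. ("a","b") occurs 3 * 1 = 3 times.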