Scala: groupBy (identity) of List Elements
Original source: http://stackoverflow.com/questions/4237674/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by sgzmd
I am developing an application that builds pairs of words from (tokenised) text and counts how many times each pair occurs (it is fine if same-word pairs occur multiple times, since that gets evened out later in the algorithm).
When I use
elements groupBy()
I want to group by the elements' content itself, so I wrote the following:
def self(x: (String, String)) = x

/**
 * Maps a collection of words to a map where the key is a pair of words and the
 * value is the number of times this pair occurs in the passed array
 */
def producePairs(words: Array[String]): Map[(String,String), Double] = {
  var table = List[(String, String)]()
  words.foreach(w1 =>
    words.foreach(w2 =>
      table = table ::: List((w1, w2))))
  val grouppedPairs = table.groupBy(self)
  val size = int2double(grouppedPairs.size)
  return grouppedPairs.mapValues(_.length / size)
}
Now, I fully realise that this self() trick is a dirty hack. So I thought a little and came up with:
grouppedPairs = table groupBy (x => x)
This produces what I want. However, I still feel that I am clearly missing something and that there should be an easier way of doing it. Any ideas at all, dear all?
Also, if you could help me improve the pair extraction part, that would help a lot too, as it looks very imperative and C++-ish right now. Many thanks in advance!
Answered by Landei
I'd suggest this:
def producePairs(words: Array[String]): Map[(String,String), Double] = {
  val table = for(w1 <- words; w2 <- words) yield (w1,w2)
  val grouppedPairs = table.groupBy(identity)
  val size = grouppedPairs.size.toDouble
  grouppedPairs.mapValues(_.length / size)
}
The for comprehension is much easier to read, and there is already a predefined function identity, which is a generalized version of your self.
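As a small illustration of the point about identity (not part of the original answer), the sketch below groups a list of pairs by the elements themselves and counts each group. The object and helper names are made up for this example, and it uses map instead of mapValues only because mapValues behaves differently across Scala versions.

```scala
object IdentityGroupingSketch {
  // Hypothetical helper: counts how often each element occurs
  // by grouping the elements on themselves via Predef.identity.
  def countOccurrences[A](items: Seq[A]): Map[A, Int] =
    items.groupBy(identity).map { case (item, group) => item -> group.size }

  def main(args: Array[String]): Unit = {
    val pairs = List(("a", "b"), ("b", "c"), ("a", "b"))
    val counts = countOccurrences(pairs)
    println(counts(("a", "b"))) // prints 2
    println(counts(("b", "c"))) // prints 1
  }
}
```

Here groupBy(identity) plays the same role as groupBy(self) or groupBy(x => x) in the question, without needing a type-specific helper.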
Answered by 0__
You are creating a list of pairs of all words against all words by iterating over words twice, where I guess you just want the neighbouring pairs. The easiest way is to use a sliding view instead.
def producePairs(words: Array[String]): Map[(String, String), Int] = {
  val pairs = words.sliding(2, 1).map(arr => arr(0) -> arr(1)).toList
  val grouped = pairs.groupBy(t => t)
  grouped.mapValues(_.size)
}
Another approach would be to fold over the list of pairs, summing up the counts as you go. I'm not sure, though, that this is more efficient:
def producePairs(words: Array[String]): Map[(String, String), Int] = {
  val pairs = words.sliding(2, 1).map(arr => arr(0) -> arr(1))
  pairs.foldLeft(Map.empty[(String, String), Int]) { (m, p) =>
    m + (p -> (m.getOrElse(p, 0) + 1))
  }
}
I see you are returning a relative number (Double). For simplicity I have just counted the occurrences, so you need to do the final division. I think you want to divide by the total number of pairs (words.size - 1) and not by the number of unique pairs (grouped.size), so that the relative frequencies sum up to 1.0.
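The final division described above could look roughly like the following sketch. It is not from the original answer: the object and function names are hypothetical, it takes a Seq rather than an Array, and it uses collect so that an input with fewer than two words yields an empty result instead of an index error.

```scala
object RelativePairFrequencySketch {
  // Hypothetical helper: relative frequency of each neighbouring word pair,
  // dividing by the total number of pairs rather than the number of unique pairs.
  def relativePairFrequencies(words: Seq[String]): Map[(String, String), Double] = {
    // sliding(2) yields windows of two neighbours; collect drops a
    // shorter window, which occurs when the input has a single word.
    val pairs = words.sliding(2).collect { case Seq(a, b) => (a, b) }.toList
    val total = pairs.size.toDouble
    pairs.groupBy(identity).map { case (pair, group) => pair -> (group.size / total) }
  }

  def main(args: Array[String]): Unit = {
    // Pairs are (a,b), (b,a), (a,b): three in total, two of them unique.
    val freqs = relativePairFrequencies(Seq("a", "b", "a", "b"))
    freqs.foreach { case (pair, f) => println(s"$pair -> $f") }
  }
}
```

With this choice of denominator the values add up to 1.0 (up to floating-point rounding), which dividing by the number of unique pairs would not guarantee.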
Answered by Debilski
An alternative approach which is not of order O(num_words * num_words) but of order O(num_unique_words * num_unique_words) (or something like that):
def producePairs[T <% Traversable[String]](words: T): Map[(String,String), Double] = {
  val counts = words.groupBy(identity).map{case (w, ws) => (w -> ws.size)}
  val size = (counts.size * counts.size).toDouble
  for(w1 <- counts; w2 <- counts) yield {
    ((w1._1, w2._1) -> ((w1._2 * w2._2) / size))
  }
}
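To see why multiplying per-word counts gives the same pair counts as the all-against-all loop, here is a minimal sketch that computes both and compares them. It is an illustration only, with hypothetical names, raw Int counts instead of the normalised Double values, and a plain Seq parameter instead of the view bound used above.

```scala
object CrossProductCountsSketch {
  // Brute force: materialise every (w1, w2) pair, then count each one.
  def bruteForce(words: Seq[String]): Map[(String, String), Int] = {
    val table = for (w1 <- words; w2 <- words) yield (w1, w2)
    table.groupBy(identity).map { case (pair, group) => pair -> group.size }
  }

  // Via word counts: the pair (w1, w2) occurs count(w1) * count(w2) times,
  // so only the unique words need to be crossed with each other.
  def viaWordCounts(words: Seq[String]): Map[(String, String), Int] = {
    val counts = words.groupBy(identity).map { case (w, ws) => w -> ws.size }
    for (w1 <- counts; w2 <- counts) yield (w1._1, w2._1) -> (w1._2 * w2._2)
  }

  def main(args: Array[String]): Unit = {
    val words = Seq("a", "b", "a")
    println(bruteForce(words) == viaWordCounts(words)) // prints true
  }
}
```

For the input a, b, a both functions report 4 for (a,a), 2 for (a,b), 2 for (b,a) and 1 for (b,b).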

