Word frequency count Java 8
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must likewise follow the CC BY-SA license, link the original question, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/29122394/
Asked by Mouna
How to count the frequency of words of List in Java 8?
// Lists.newArrayList comes from Guava
List<String> wordsList = Lists.newArrayList("hello", "bye", "ciao", "bye", "ciao");
The result must be:
{ciao=2, hello=1, bye=2}
Accepted answer by Mouna
I want to share the solution I found because at first I expected to use map-and-reduce methods, but it was a bit different.
Map<String, Long> collect =
    wordsList.stream().collect(groupingBy(Function.identity(), counting()));

Or for Integer values:

Map<String, Integer> collect =
    wordsList.stream().collect(groupingBy(Function.identity(), summingInt(e -> 1)));

(Both snippets assume static imports from java.util.stream.Collectors.)
EDIT
Here is how to sort the map by value:
LinkedHashMap<String, Long> countByWordSorted = collect.entrySet()
    .stream()
    .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
    .collect(Collectors.toMap(
        Map.Entry::getKey,
        Map.Entry::getValue,
        (v1, v2) -> {
            throw new IllegalStateException();
        },
        LinkedHashMap::new
    ));
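Putting the accepted answer's snippets together, a complete runnable sketch might look like the following (a minimal adaptation: Guava's Lists.newArrayList is replaced with Arrays.asList, and the snippets are wrapped in helper methods for illustration):

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class WordFrequency {
    // Group equal strings together and count each group
    static Map<String, Long> countWords(List<String> words) {
        return words.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    // Sort entries by descending count, preserving the order in a LinkedHashMap
    static LinkedHashMap<String, Long> sortByValueDesc(Map<String, Long> counts) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
                .collect(Collectors.toMap(
                        Map.Entry::getKey,
                        Map.Entry::getValue,
                        (v1, v2) -> { throw new IllegalStateException(); },
                        LinkedHashMap::new));
    }

    public static void main(String[] args) {
        List<String> wordsList = Arrays.asList("hello", "bye", "ciao", "bye", "ciao");
        Map<String, Long> counts = countWords(wordsList);
        System.out.println(counts);
        System.out.println(sortByValueDesc(counts));
    }
}
```

The merge function passed to toMap throws because duplicate keys are impossible here: the input is already a map, so each key appears exactly once.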
Answered by Marco13
(NOTE: See the edits below)
As an alternative to Mouna's answer, here is an approach that does the word count in parallel:
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ParallelWordCount
{
    public static void main(String[] args)
    {
        List<String> list = Arrays.asList(
            "hello", "bye", "ciao", "bye", "ciao");
        Map<String, Integer> counts = list.parallelStream().
            collect(Collectors.toConcurrentMap(
                w -> w, w -> 1, Integer::sum));
        System.out.println(counts);
    }
}
EDIT In response to the comment, I ran a small test with JMH, comparing the toConcurrentMap and the groupingByConcurrent approaches, with different input list sizes and random words of different lengths. This test suggested that the toConcurrentMap approach was faster. Considering how different these approaches are "under the hood", it's hard to predict something like this.

As a further extension, based on further comments, I extended the test to cover all four combinations of toMap, groupingBy, serial and parallel.

The results are still that the toMap approach is faster, but unexpectedly (at least, for me) the "concurrent" versions in both cases are slower than the serial versions:
(method) (count) (wordLength) Mode Cnt Score Error Units
toConcurrentMap 1000 2 avgt 50 146,636 ± 0,880 us/op
toConcurrentMap 1000 5 avgt 50 272,762 ± 1,232 us/op
toConcurrentMap 1000 10 avgt 50 271,121 ± 1,125 us/op
toMap 1000 2 avgt 50 44,396 ± 0,541 us/op
toMap 1000 5 avgt 50 46,938 ± 0,872 us/op
toMap 1000 10 avgt 50 46,180 ± 0,557 us/op
groupingBy 1000 2 avgt 50 46,797 ± 1,181 us/op
groupingBy 1000 5 avgt 50 68,992 ± 1,537 us/op
groupingBy 1000 10 avgt 50 68,636 ± 1,349 us/op
groupingByConcurrent 1000 2 avgt 50 231,458 ± 0,658 us/op
groupingByConcurrent 1000 5 avgt 50 438,975 ± 1,591 us/op
groupingByConcurrent 1000 10 avgt 50 437,765 ± 1,139 us/op
toConcurrentMap 10000 2 avgt 50 712,113 ± 6,340 us/op
toConcurrentMap 10000 5 avgt 50 1809,356 ± 9,344 us/op
toConcurrentMap 10000 10 avgt 50 1813,814 ± 16,190 us/op
toMap 10000 2 avgt 50 341,004 ± 16,074 us/op
toMap 10000 5 avgt 50 535,122 ± 24,674 us/op
toMap 10000 10 avgt 50 511,186 ± 3,444 us/op
groupingBy 10000 2 avgt 50 340,984 ± 6,235 us/op
groupingBy 10000 5 avgt 50 708,553 ± 6,369 us/op
groupingBy 10000 10 avgt 50 712,858 ± 10,248 us/op
groupingByConcurrent 10000 2 avgt 50 901,842 ± 8,685 us/op
groupingByConcurrent 10000 5 avgt 50 3762,478 ± 21,408 us/op
groupingByConcurrent 10000 10 avgt 50 3795,530 ± 32,096 us/op
I'm not so experienced with JMH; maybe I did something wrong here. Suggestions and corrections are welcome:
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;
import java.util.stream.Collectors;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Thread)
public class ParallelWordCount
{
    @Param({"toConcurrentMap", "toMap", "groupingBy", "groupingByConcurrent"})
    public String method;

    @Param({"2", "5", "10"})
    public int wordLength;

    @Param({"1000", "10000"})
    public int count;

    private List<String> list;

    @Setup
    public void initList()
    {
        list = createRandomStrings(count, wordLength, new Random(0));
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    public void testMethod(Blackhole bh)
    {
        if (method.equals("toMap"))
        {
            Map<String, Integer> counts =
                list.stream().collect(
                    Collectors.toMap(
                        w -> w, w -> 1, Integer::sum));
            bh.consume(counts);
        }
        else if (method.equals("toConcurrentMap"))
        {
            Map<String, Integer> counts =
                list.parallelStream().collect(
                    Collectors.toConcurrentMap(
                        w -> w, w -> 1, Integer::sum));
            bh.consume(counts);
        }
        else if (method.equals("groupingBy"))
        {
            Map<String, Long> counts =
                list.stream().collect(
                    Collectors.groupingBy(
                        Function.identity(), Collectors.<String>counting()));
            bh.consume(counts);
        }
        else if (method.equals("groupingByConcurrent"))
        {
            Map<String, Long> counts =
                list.parallelStream().collect(
                    Collectors.groupingByConcurrent(
                        Function.identity(), Collectors.<String>counting()));
            bh.consume(counts);
        }
    }

    private static String createRandomString(int length, Random random)
    {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < length; i++)
        {
            int c = random.nextInt(26);
            sb.append((char) (c + 'a'));
        }
        return sb.toString();
    }

    private static List<String> createRandomStrings(
        int count, int length, Random random)
    {
        List<String> list = new ArrayList<String>(count);
        for (int i = 0; i < count; i++)
        {
            list.add(createRandomString(length, random));
        }
        return list;
    }
}
The times are only similar for the serial case of a list with 10000 elements and 2-letter words.
It could be worthwhile to check whether, for even larger list sizes, the concurrent versions eventually outperform the serial ones, but I currently don't have the time for another detailed benchmark run with all these configurations.
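As a rough follow-up (not a substitute for JMH, since this ignores warmup and JIT effects), a quick timing sketch like the following could be used to probe larger sizes; the list size and word length below are arbitrary choices:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.stream.Collectors;

public class CrossoverProbe {
    // Generate 'count' random lowercase words of the given length
    static List<String> randomWords(int count, int length, Random random) {
        List<String> words = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < length; j++) {
                sb.append((char) ('a' + random.nextInt(26)));
            }
            words.add(sb.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        List<String> words = randomWords(1_000_000, 5, new Random(0));

        long t0 = System.nanoTime();
        Map<String, Integer> serial = words.stream()
                .collect(Collectors.toMap(w -> w, w -> 1, Integer::sum));
        long t1 = System.nanoTime();
        Map<String, Integer> parallel = words.parallelStream()
                .collect(Collectors.toConcurrentMap(w -> w, w -> 1, Integer::sum));
        long t2 = System.nanoTime();

        // Both approaches must agree on the counts
        System.out.println(serial.equals(parallel));
        System.out.printf("serial: %d ms, parallel: %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}
```

Whatever the numbers show on a given machine, the two maps must always hold identical counts; only the timing differs.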
Answered by Donald Raab
If you use Eclipse Collections, you can just convert the List to a Bag.
Bag<String> words =
Lists.mutable.with("hello", "bye", "ciao", "bye", "ciao").toBag();
Assert.assertEquals(2, words.occurrencesOf("ciao"));
Assert.assertEquals(1, words.occurrencesOf("hello"));
Assert.assertEquals(2, words.occurrencesOf("bye"));
This code will work with Java 5 - 8.
Note: I am a committer for Eclipse Collections.
Answered by Eugene
I'll present the solution I made (the one with grouping is much better :) ).
// Note: Collections.frequency rescans the whole input list for each distinct
// word, so this is O(n * distinct) - the groupingBy approach is preferable.
static private Map<String, Integer> test0(List<String> input) {
    Set<String> set = input.stream()
            .collect(Collectors.toSet());
    return set.stream()
            .collect(Collectors.toMap(Function.identity(),
                    str -> Collections.frequency(input, str)));
}
Just my 0.02$
Answered by Sym-Sym
Another 2 cents of mine, given an array:
import static java.util.stream.Collectors.*;

String[] str = {"hello", "bye", "ciao", "bye", "ciao"};
Map<String, Integer> collected
    = Arrays.stream(str)
        .collect(groupingBy(Function.identity(),
            collectingAndThen(counting(), Long::intValue)));
Answered by Piyush
Here's a way to create a frequency map using the Map merge and compute functions.
List<String> words = Stream.of("hello", "bye", "ciao", "bye", "ciao").collect(toList());
Map<String, Integer> frequencyMap = new HashMap<>();
words.forEach(word ->
frequencyMap.merge(word, 1, (v, newV) -> v + newV)
);
System.out.println(frequencyMap); // {ciao=2, hello=1, bye=2}
Or, using compute:
words.forEach(word ->
frequencyMap.compute(word, (k, v) -> v != null ? v + 1 : 1)
);
Answered by nejckorasa
Find the most frequent item in a collection, with generics:
private <V> V findMostFrequentItem(final Collection<V> items)
{
    return items.stream()
        .filter(Objects::nonNull)
        .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
        .entrySet()
        .stream()
        .max(Comparator.comparing(Map.Entry::getValue))
        .map(Map.Entry::getKey)
        .orElse(null);
}
Compute item frequencies:
private <V> Map<V, Long> findFrequencies(final Collection<V> items)
{
    return items.stream()
        .filter(Objects::nonNull)
        .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
}
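As a quick illustration of the generic helper above (made static here so the demo is self-contained; the class name FrequencyDemo is invented for this sketch), it works for any element type, not just strings, and silently drops null elements:

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Map;
import java.util.Objects;
import java.util.function.Function;
import java.util.stream.Collectors;

public class FrequencyDemo {
    // Static variant of the generic findFrequencies helper above
    static <V> Map<V, Long> findFrequencies(final Collection<V> items) {
        return items.stream()
                .filter(Objects::nonNull)
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        // null is filtered out before grouping, so it never appears as a key
        Map<Integer, Long> freq = findFrequencies(Arrays.asList(1, 2, 2, 3, 3, 3, null));
        System.out.println(freq);
    }
}
```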
Answered by Easycoder
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class Main {
    public static void main(String[] args) {
        String testString = "qqwweerrttyyaaaaaasdfasafsdfadsfadsewfywqtedywqtdfewyfdweytfdywfdyrewfdyewrefdyewdyfwhxvsahxvfwytfx";

        // Count occurrences of a single character with a stream
        long java8Case2 = testString.codePoints().filter(ch -> ch == 'a').count();
        System.out.println(java8Case2);

        // Count every character by collecting into a concurrent map
        List<Character> list = new ArrayList<>();
        for (char c : testString.toCharArray()) {
            list.add(c);
        }
        Map<Object, Integer> counts = list.parallelStream()
            .collect(Collectors.toConcurrentMap(
                w -> w, w -> 1, Integer::sum));
        System.out.println(counts);
    }
}
Answered by Saidu
You can use Java 8 streams:
// Assuming s is a String[] of words; the result can be captured in a Map:
Map<String, Long> counts = Arrays.asList(s).stream()
    .collect(Collectors.groupingBy(Function.<String>identity(),
        Collectors.<String>counting()));