Java: Fastest way to check if a List<String> contains a unique String
Notice: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/3307549/
Asked by Ben
Basically I have about 1,000,000 strings, and for each request I have to check whether a String belongs to the list or not.
I'm worried about performance, so what's the best method? ArrayList? Hash?
Accepted answer by krock
Your best bet is to use a HashSet and check whether a string exists in the set via the contains() method. HashSets are built for fast access via the Object methods hashCode() and equals(). The Javadoc for HashSet states:
This class offers constant time performance for the basic operations (add, remove, contains and size), assuming the hash function disperses the elements properly among the buckets.
HashSet stores objects in hash buckets, which is to say that the value returned by the hashCode() method determines which bucket an object is stored in. This way, the number of equality checks the HashSet has to perform via the equals() method is reduced to just the other objects in the same hash bucket.
To use HashSets and HashMaps effectively, you must conform to the equals and hashCode contract outlined in the Javadoc. In the case of java.lang.String, these methods have already been implemented to do this.
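A minimal sketch of the HashSet approach described above. The class name MembershipCheck, the helper buildIndex, and the three sample strings are illustrative assumptions standing in for the real 1,000,000 strings:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class, not part of the original answer.
class MembershipCheck {
    // Build the set once up front; every later contains() call is then
    // an expected constant-time lookup instead of a linear scan.
    static Set<String> buildIndex(List<String> strings) {
        return new HashSet<>(strings);
    }

    public static void main(String[] args) {
        // Stand-in data; in the question this would be the ~1,000,000 strings.
        List<String> known = Arrays.asList("alpha", "beta", "gamma");
        Set<String> index = buildIndex(known);

        System.out.println(index.contains("beta"));   // membership check for a stored string
        System.out.println(index.contains("delta"));  // membership check for a string never added
    }
}
```

The one-time cost is building the set; after that, each request only hashes the probe string and compares it against the few entries in one bucket.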
Answer by unbeli
I'd use a Set; in most cases HashSet is fine.
Answer by oopbase
If you have such a large number of strings, your best option is to use a database. Look into MySQL.
Answer by mdma
In general, a HashSet will give you better performance, since it does not have to look through each element and compare, as an ArrayList does. It typically compares at most a few elements, namely those whose hash codes are equal.
However, for 1M strings, the performance of a HashSet may still not be optimal. A lot of cache misses will slow down searching the set. If all strings are equally likely, then this is unavoidable. However, if some strings are requested more often than others, you can place the common strings into a small HashSet and check that first, before checking the larger set. The small HashSet should be sized to fit in cache (e.g. a few hundred K at most). Hits to the small HashSet will then be very fast, while hits to the larger HashSet proceed at a speed limited by the memory bandwidth.
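The two-level idea above could be sketched roughly as follows. TieredLookup and its field names are hypothetical, and how the "hot" strings are chosen (for example from request statistics) is left open, as it is in the answer:

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Hypothetical two-level lookup: a small set of frequently requested strings
// is checked first, and the full set is consulted only on a miss.
class TieredLookup {
    private final Set<String> hot = new HashSet<>(); // small, meant to stay cache-resident
    private final Set<String> all;                   // the complete set

    TieredLookup(Collection<String> hotStrings, Collection<String> allStrings) {
        this.all = new HashSet<>(allStrings);
        // Keep only hot strings that really are members, so a hit in the
        // small set always implies membership in the full set.
        for (String s : hotStrings) {
            if (all.contains(s)) {
                hot.add(s);
            }
        }
    }

    boolean contains(String s) {
        return hot.contains(s) || all.contains(s);
    }

    public static void main(String[] args) {
        TieredLookup lookup = new TieredLookup(
                Arrays.asList("alpha"),
                Arrays.asList("alpha", "beta", "gamma"));
        System.out.println(lookup.contains("alpha")); // answered by the small set
        System.out.println(lookup.contains("gamma")); // falls through to the full set
    }
}
```

Whether this actually helps depends on how skewed the request distribution is; it is worth measuring before adopting it.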
Answer by nd.
Before going further, please consider this: Why are you worried about performance? How often is this check called?
As for possible solutions:
- If the list is already sorted, you can use java.util.Collections.binarySearch, which offers the same performance characteristics as a java.util.TreeSet, i.e. O(log n) lookups.
- Otherwise you can use a java.util.HashSet, which has a lookup performance characteristic of O(1). Note that calculating the hash code for a string that doesn't have one calculated yet is an O(m) operation with m = string.length(). Also keep in mind that hash tables only work well until they reach a given load factor, i.e. hash tables will use more memory than plain lists. The default load factor used by HashSet is 0.75, meaning that internally a HashSet for 1e6 objects will use an array with at least 1.3e6 entries.
- If the HashSet does not work for you (e.g. because there are lots of hash collisions, because memory is tight, or because there are lots of insertions), then consider using a Trie. Lookup in a Trie has a worst-case complexity of O(m) where m = string.length(). A Trie also has some extra benefits that might be useful for you: e.g., it can give you the closest fit for a search string. But keep in mind that the best code is no code, so only roll your own Trie implementation if the benefits outweigh the costs.
- Consider using a database if you want more complex queries, e.g. matching a substring or a regular expression.
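For the sorted-list option, a small sketch using Collections.binarySearch. The class SortedLookup and its helper methods are illustrative assumptions; the list must be sorted in natural String order before searching, otherwise the result is undefined:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper showing membership checks on a sorted list.
class SortedLookup {
    // Returns a sorted copy, leaving the caller's list untouched.
    static List<String> sortedCopy(List<String> strings) {
        List<String> copy = new ArrayList<>(strings);
        Collections.sort(copy);
        return copy;
    }

    // binarySearch returns a non-negative index when the key is found
    // and a negative value (encoding the insertion point) otherwise.
    static boolean contains(List<String> sorted, String key) {
        return Collections.binarySearch(sorted, key) >= 0;
    }

    public static void main(String[] args) {
        List<String> sorted = sortedCopy(Arrays.asList("pear", "apple", "fig"));
        System.out.println(contains(sorted, "fig"));
        System.out.println(contains(sorted, "kiwi"));
    }
}
```

An ArrayList is used here because it supports random access; on a list without random access, such as a LinkedList, binarySearch still works but has to traverse links to reach each probe position.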
Answer by Truong Ha
This is not only for String: you can use a Set in any case where you need unique items.
If the item type is a primitive or wrapper type, you need not worry about this. But if it is your own class, you must override two methods:
- hashCode()
- equals()
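As an illustration of overriding both methods, here is a sketch with a made-up value class, Coordinate (not from the original answer), whose instances compare equal when their fields match:

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical value class used only to illustrate the equals/hashCode contract.
final class Coordinate {
    private final int x;
    private final int y;

    Coordinate(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Coordinate)) return false;
        Coordinate other = (Coordinate) o;
        return x == other.x && y == other.y;
    }

    // Equal objects must produce equal hash codes, so the hash is
    // derived from exactly the fields that equals() compares.
    @Override
    public int hashCode() {
        return Objects.hash(x, y);
    }

    public static void main(String[] args) {
        Set<Coordinate> seen = new HashSet<>();
        seen.add(new Coordinate(1, 2));
        // A distinct instance with the same field values is found in the set.
        System.out.println(seen.contains(new Coordinate(1, 2)));
    }
}
```

Without these overrides, the inherited Object implementations compare by identity, so the second `new Coordinate(1, 2)` would not be found.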
Answer by ILMTitan
Answer by ghostNet
Sometimes you want to check if an object is in the list/set and at the same time you want the list/set to be ordered. If you also want to retrieve objects easily without using an enumeration or iterator, you may consider using both an ArrayList<String> and a HashMap<String, Integer>. The list is backed by the map.
Example from some work I recently did:
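import java.io.Serializable;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NodeKey<K> implements Serializable, Cloneable {
    private static final long serialVersionUID = -634779076519943311L;
    private NodeKey<K> parent;
    // Children in insertion order, retrievable by index.
    private List<K> children = new ArrayList<K>();
    // Maps each child to its index in 'children' for fast membership checks.
    private Map<K, Integer> childrenToListMap = new HashMap<K, Integer>();

    public NodeKey() {}

    public NodeKey(Collection<? extends K> c) {
        List<K> childHierarchy = new ArrayList<K>(c);
        K childLevel0 = childHierarchy.remove(0);
        // Only add the child if the map does not know it yet.
        if (!childrenToListMap.containsKey(childLevel0)) {
            children.add(childLevel0);
            childrenToListMap.put(childLevel0, children.size() - 1);
        }
        ...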
In this case, the parameter K would be a String for you. The map (childrenToListMap) stores the Strings inserted into the list (children) as its keys, and the map values are their index positions in the list.
The reason for having both the list and the map is so that you can retrieve values from the list by index, without having to iterate over a HashSet<String>.
Answer by awiebe
Having run the exercise, here are my results.
private static final int TEST_CYCLES = 4000;
private static final long RAND_ELEMENT_COUNT = 1000000L;
private static final int RAND_STR_LEN = 20;
// Mean time
/*
Array list: 18.55425
Array list not contains: 17.113
Hash set: 5.0E-4
Hash set not contains: 7.5E-4
*/
I believe the numbers speak for themselves. The lookup time of the hash set is way, way faster.
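The harness that produced these numbers is not shown. As a rough sketch only, a comparison of this kind might be timed as below; LookupTiming, randomString, meanMillis, and the smaller default sizes are all assumptions, and a serious measurement would use a benchmarking framework with proper warm-up:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Hypothetical timing sketch; not the code used for the results above.
class LookupTiming {
    // Generates a random lowercase string of the given length.
    static String randomString(Random rnd, int len) {
        StringBuilder sb = new StringBuilder(len);
        for (int i = 0; i < len; i++) {
            sb.append((char) ('a' + rnd.nextInt(26)));
        }
        return sb.toString();
    }

    // Mean wall-clock time in milliseconds of one contains() call.
    static double meanMillis(Collection<String> c, List<String> probes) {
        int hits = 0;
        long start = System.nanoTime();
        for (String p : probes) {
            if (c.contains(p)) hits++;
        }
        long elapsed = System.nanoTime() - start;
        // Use 'hits' so the loop cannot be optimized away entirely.
        if (hits < 0) throw new IllegalStateException();
        return elapsed / 1e6 / probes.size();
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int elementCount = 100_000; // smaller than the 1,000,000 above to keep the run short
        int cycles = 200;           // number of lookups per measurement
        int strLen = 20;

        List<String> list = new ArrayList<>(elementCount);
        for (int i = 0; i < elementCount; i++) list.add(randomString(rnd, strLen));
        Set<String> set = new HashSet<>(list);

        List<String> present = new ArrayList<>();
        List<String> absent = new ArrayList<>();
        for (int i = 0; i < cycles; i++) {
            present.add(list.get(rnd.nextInt(elementCount)));
            absent.add(randomString(rnd, strLen + 1)); // longer, so never stored
        }

        System.out.println("Array list: " + meanMillis(list, present));
        System.out.println("Array list not contains: " + meanMillis(list, absent));
        System.out.println("Hash set: " + meanMillis(set, present));
        System.out.println("Hash set not contains: " + meanMillis(set, absent));
    }
}
```

Absolute values will vary by machine and JVM; only the relative gap between the list and the set is of interest.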
Answer by simplylizz
Perhaps this isn't required in your case, but I think it's useful to know that there are some space-efficient probabilistic algorithms, for example the Bloom filter.
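To make the Bloom filter idea concrete, here is a deliberately simple sketch. SimpleBloomFilter, its parameters, and the choice of hash functions (String.hashCode combined with an FNV-1a hash via double hashing) are all assumptions made for illustration; a production system would more likely use a tested library implementation and size the filter from the expected element count and target false-positive rate:

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Hypothetical minimal Bloom filter. It can report false positives,
// but never false negatives: a string that was added is always reported.
class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;      // number of bits in the filter
    private final int hashCount; // number of bit positions set per element

    SimpleBloomFilter(int size, int hashCount) {
        if (size <= 0 || hashCount <= 0) {
            throw new IllegalArgumentException("size and hashCount must be positive");
        }
        this.size = size;
        this.hashCount = hashCount;
        this.bits = new BitSet(size);
    }

    // 32-bit FNV-1a over the UTF-8 bytes, used as the second base hash.
    private static int fnv1a(String s) {
        int h = 0x811c9dc5;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        return h;
    }

    // Derives the i-th bit index from two base hashes (double hashing).
    private int index(int h1, int h2, int i) {
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String s) {
        int h1 = s.hashCode();
        int h2 = fnv1a(s);
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(h1, h2, i));
        }
    }

    // false means "definitely not added"; true means "possibly added".
    boolean mightContain(String s) {
        int h1 = s.hashCode();
        int h2 = fnv1a(s);
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(h1, h2, i))) {
                return false;
            }
        }
        return true;
    }
}
```

A typical use is as a cheap pre-check: when mightContain returns false, the string is certainly absent and the exact (and larger) structure need not be consulted; when it returns true, the exact structure must still confirm the answer.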