C# 大型集合的 LINQ 性能
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/682576/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
LINQ Performance for Large Collections
提问by Peter J
I have a large collection of strings (up to 1M) alphabetically sorted. I have experimented with LINQ queries against this collection using HashSet, SortedDictionary, and Dictionary. I am static caching the collection, it's up to 50MB in size, and I'm always calling the LINQ query against the cached collection. My problem is as follows:
我有大量按字母顺序排序的字符串(最多 1M)。我已经使用 HashSet、SortedDictionary 和 Dictionary 对这个集合进行了 LINQ 查询试验。我正在静态缓存集合,它的大小高达 50MB,并且我总是针对缓存的集合调用 LINQ 查询。我的问题如下:
Regardless of collection type, performance is much poorer than SQL (up to 200ms). When doing a similar query against the underlying SQL tables, performance is much quicker ( 5-10ms). I have implemented my LINQ queries as follows:
不管集合类型,性能都比SQL差很多(最多200ms)。对底层 SQL 表执行类似查询时,性能要快得多(5-10 毫秒)。我已经实现了我的 LINQ 查询如下:
public static string ReturnSomething(string query, int limit)
{
StringBuilder sb = new StringBuilder();
foreach (var stringitem in MyCollection.Where(
x => x.StartsWith(query) && x.Length > q.Length).Take(limit))
{
sb.Append(stringitem);
}
return sb.ToString();
}
It is my understanding that the HashSet, Dictionary, etc. implement lookups using binary tree search instead of the standard enumeration. What are my options for high performance LINQ queries into the advanced collection types?
我的理解是 HashSet、Dictionary 等使用二叉树搜索而不是标准枚举来实现查找。对于高级集合类型的高性能 LINQ 查询,我有哪些选择?
采纳答案by Guffa
In your current code you don't make use of any of the special features of the Dictionary
/ SortedDictionary
/ HashSet
collections, you are using them the same way that you would use a List
. That is why you don't see any difference in performance.
在当前的代码,你不利用任何的特殊功能Dictionary
/ SortedDictionary
/HashSet
收藏,您使用的是他们以同样的方式,你会使用List
。这就是为什么您看不到任何性能差异的原因。
If you use a dictionary as index where the first few characters of the string is the key and a list of strings is the value, you can from the search string pick out a small part of the entire collection of strings that has possible matches.
如果您使用字典作为索引,其中字符串的前几个字符是键,字符串列表是值,您可以从搜索字符串中挑选出可能匹配的整个字符串集合的一小部分。
I wrote the class below to test this. If I populate it with a million strings and search with an eight character string it rips through all possible matches in about 3 ms. Searching with a one character string is the worst case, but it finds the first 1000 matches in about 4 ms. Finding all matches for a one character strings takes about 25 ms.
我写了下面的类来测试这个。如果我用一百万个字符串填充它并用一个八个字符的字符串进行搜索,它会在大约 3 毫秒内遍历所有可能的匹配项。使用一个字符串进行搜索是最糟糕的情况,但它会在大约 4 毫秒内找到前 1000 个匹配项。查找一个字符串的所有匹配项大约需要 25 毫秒。
The class creates indexes for 1, 2, 4 and 8 character keys. If you look at your specific data and what you search for, you should be able to select what indexes to create to optimise it for your conditions.
该类为 1、2、4 和 8 个字符的键创建索引。如果您查看特定数据和搜索内容,您应该能够选择要创建的索引以针对您的条件对其进行优化。
public class IndexedList {
private class Index : Dictionary<string, List<string>> {
private int _indexLength;
public Index(int indexLength) {
_indexLength = indexLength;
}
public void Add(string value) {
if (value.Length >= _indexLength) {
string key = value.Substring(0, _indexLength);
List<string> list;
if (!this.TryGetValue(key, out list)) {
Add(key, list = new List<string>());
}
list.Add(value);
}
}
public IEnumerable<string> Find(string query, int limit) {
return
this[query.Substring(0, _indexLength)]
.Where(s => s.Length > query.Length && s.StartsWith(query))
.Take(limit);
}
}
private Index _index1;
private Index _index2;
private Index _index4;
private Index _index8;
public IndexedList(IEnumerable<string> values) {
_index1 = new Index(1);
_index2 = new Index(2);
_index4 = new Index(4);
_index8 = new Index(8);
foreach (string value in values) {
_index1.Add(value);
_index2.Add(value);
_index4.Add(value);
_index8.Add(value);
}
}
public IEnumerable<string> Find(string query, int limit) {
if (query.Length >= 8) return _index8.Find(query, limit);
if (query.Length >= 4) return _index4.Find(query,limit);
if (query.Length >= 2) return _index2.Find(query,limit);
return _index1.Find(query, limit);
}
}
回答by Jon Skeet
If you're doing a "starts with", you only care about ordinal comparisons, and you can have the collection sorted (again in ordinal order) then I would suggest you have the values in a list. You can then binary search to find the first value which starts with the right prefix, then go down the list linearly yielding results until the first value which doesn'tstart with the right prefix.
如果你正在做一个“开始于”,你只关心序数比较,你可以对集合进行排序(再次按序数顺序),那么我建议你将值放在一个列表中。然后,您可以进行二分搜索以找到以正确前缀开头的第一个值,然后在列表中线性下降,直到第一个不以正确前缀开头的值为止。
In fact, you could probably do another binary search for the first value which doesn't start with the prefix, so you'd have a start and an end point. Then you just need to apply the length criterion to that matching portion. (I'd hope that if it's sensible data, the prefix matching is going to get rid of most candidate values.) The way to find the first value which doesn't start with the prefix is to search for the lexicographically-first value which doesn't - e.g. with a prefix of "ABC", search for "ABD".
事实上,您可能可以对不以前缀开头的第一个值进行另一次二分搜索,因此您将有一个起点和一个终点。然后您只需要将长度标准应用于该匹配部分。(我希望如果它是合理的数据,前缀匹配将摆脱大多数候选值。)找到不以前缀开头的第一个值的方法是搜索按字典顺序排列的第一个值不 - 例如,前缀为“ABC”,搜索“ABD”。
None of this uses LINQ, and it's all very specific to your particular case, but it should work. Let me know if any of this doesn't make sense.
这些都不使用 LINQ,而且它都非常适合您的特定情况,但它应该可以工作。如果其中任何一个没有意义,请告诉我。
回答by casperOne
Just looking at your code, I would say that you should reorder the comparison to take advantage of short-circuiting when using boolean operators:
仅查看您的代码,我会说您应该重新排序比较以在使用布尔运算符时利用短路:
foreach (var stringitem in MyCollection.Where(
x => x.Length > q.Length && x.StartsWith(query)).Take(limit))
The comparison of length is always going to be an O(1) operation (as the length is being stored as part of the string, it doesn't count each character every time), whereas the call to StartsWith is going to be an O(N) operation, where N is the length of query (or the length of the string, whichever is smaller).
长度的比较总是一个 O(1) 操作(因为长度作为字符串的一部分存储,它不会每次都计算每个字符),而对 StartsWith 的调用将是一个 O (N) 操作,其中 N 是查询的长度(或字符串的长度,以较小者为准)。
By placing the comparison of length before the call to StartsWith, if that comparison fails, you save yourself some extra cycles which could add up when processing large numbers of items.
通过在调用 StartsWith 之前放置长度比较,如果比较失败,您可以节省一些额外的周期,这些周期在处理大量项目时可能会累加。
I don't think that a lookup table is going to help you here, as lookup tables are good when you are comparing the entire key, not parts of the key, like you are doing with the call to StartsWith.
我不认为查找表会在这里帮助你,因为当你比较整个键而不是键的部分时,查找表是很好的,就像你在调用 StartsWith 时所做的那样。
Rather, you might be better off using a tree structure which is split based on the letters in the words in the list.
相反,您最好使用基于列表中单词中的字母拆分的树结构。
However, at that point, you are really just recreating what SQL Server is doing (in the case of indexes) and that would just be a duplication of effort on your part.
但是,此时,您实际上只是在重新创建 SQL Server 正在执行的操作(在索引的情况下),而这只是您的重复工作。
回答by erikkallen
I bet you have an index on the column so SQL server can do the comparison in O(log(n)) operations rather than O(n). To imitate the SQL server behavior, use a sorted collection and find all strings s such that s >= query and then look at values until you find a value that does not start with s and then do an additional filter on the values. This is what is called a range scan (Oracle) or an index seek (SQL server).
我敢打赌你在列上有一个索引,所以 SQL 服务器可以在 O(log(n)) 操作而不是 O(n) 中进行比较。要模仿 SQL 服务器的行为,请使用排序集合并查找所有字符串 s 使得 s >= 查询,然后查看值,直到找到不以 s 开头的值,然后对这些值进行额外的过滤。这就是所谓的范围扫描 (Oracle) 或索引查找 (SQL Server)。
This is some example code which is very likely to go into infinite loops or have one-off errors because I didn't test it, but you should get the idea.
这是一些示例代码,很可能会进入无限循环或出现一次性错误,因为我没有对其进行测试,但您应该明白这一点。
// Note, list must be sorted before being passed to this function
IEnumerable<string> FindStringsThatStartWith(List<string> list, string query) {
int low = 0, high = list.Count - 1;
while (high > low) {
int mid = (low + high) / 2;
if (list[mid] < query)
low = mid + 1;
else
high = mid - 1;
}
while (low < list.Count && list[low].StartsWith(query) && list[low].Length > query.Length)
yield return list[low];
low++;
}
}
回答by Ben S
If you are trying to optimize looking up a list of strings with a given prefix you might want to take a look at implementing a Trie(not to be mistaken with a regular tree) data structure in C#.
如果您正在尝试优化查找具有给定前缀的字符串列表,您可能需要查看在 C#中实现Trie(不要与常规树混淆)数据结构。
Tries offer very fast prefix lookups and have a very small memory overhead compared to other data structures for this sort of operation.
与此类操作的其他数据结构相比,尝试提供非常快速的前缀查找,并且具有非常小的内存开销。
About LINQ to Objects in general. It's not unusual to have a speed reduction compared to SQL. The net is littered with articlesanalyzing its performance.
回答by MartinStettner
I think the problem is that Linq has no way to use the fact that your sequence is already sorted. Especially it cannot know, that applying the StartsWith
function retains the order.
我认为问题在于 Linq 无法利用您的序列已经排序的事实。尤其是它不能知道,应用StartsWith
函数保留顺序。
I would suggest to use the List.BinarySearch
method together with a IComparer<string>
that does only comparison of the first query chars (this might be tricky, since it's not clear, if the query string will always be the first or the second parameter to ()
).
我建议将该List.BinarySearch
方法与IComparer<string>
只比较第一个查询字符的a 一起使用(这可能很棘手,因为不清楚,如果查询字符串始终是 to 的第一个或第二个参数()
)。
You could even use the standard string comparison, since BinarySearch returns a negative number which you can complement (using ~) in order to get the index of the first element that is larger than your query.
您甚至可以使用标准字符串比较,因为 BinarySearch 返回一个负数,您可以对它进行补充(使用 ~)以获得大于查询的第一个元素的索引。
You have then to start from the returned index (in both directions!) to find all elements matching your query string.
然后,您必须从返回的索引(双向!)开始查找与您的查询字符串匹配的所有元素。