Java 比 O(n) 更好的范围交集算法?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/303591/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
A range intersection algorithm better than O(n)?
提问by Pyrolistical
Range intersection is a simple, but non-trivial problem.
范围交集是一个简单但重要的问题。
Its has been answered twice already:
它已经回答了两次:
The first solutions is O(n) and the second solution is for a database (which is less than O(n) of course).
第一个解决方案是 O(n),第二个解决方案用于数据库(当然小于 O(n))。
I have the same problem, but for a large n and I am not within a database.
我有同样的问题,但是对于一个很大的 n 并且我不在数据库中。
This problem seems to be very similar to Store 2D points for quick retrieval of those inside a rectanglebut I don't see how it maps.
这个问题似乎与Store 2D points非常相似,用于快速检索矩形内的那些点,但我看不出它是如何映射的。
So what data structure would you store the set of ranges in, such that a search on a range costs less than O(n)? (Extra credit for using libraries available for Java)
那么你会用什么数据结构来存储这组范围,使得对范围的搜索成本小于 O(n)?(使用可用于 Java 的库的额外功劳)
EDIT:
编辑:
I want to get a subset of all intersecting ranges, meaning the search range could intersect multiple ranges.
我想获得所有相交范围的子集,这意味着搜索范围可以与多个范围相交。
The method that needs to be less than O(n) in Java is:
Java中需要小于O(n)的方法是:
public class RangeSet {
....
public Set<Range> intersects(Range range);
....
}
Where Range is just a class containing a pair of int start and end.
其中 Range 只是一个包含一对 int start 和 end 的类。
This is not an impossible question, I already have the solution, I just wanted to see if there was a more standard/simpler way of doing it
这不是一个不可能的问题,我已经有了解决方案,我只是想看看是否有更标准/更简单的方法
采纳答案by Rafa? Dowgird
The standard approach is to use an interval tree.
标准方法是使用区间树。
In computer science, an interval tree is a tree data structure to hold intervals. Specifically, it allows one to efficiently find all intervals that overlap with any given interval or point. It is often used for windowing queries, for instance, to find all roads on a computerized map inside a rectangular viewport, or to find all visible elements inside a three-dimensional scene. A similar data structure is the segment tree.
The trivial solution is to visit each interval and test whether it intersects the given point or interval, which requires O(n) time, where n is the number of intervals in the collection. Since a query may return all intervals, for example if the query is a large interval intersecting all intervals in the collection, this is asymptotically optimal; however, we can do better by considering output-sensitive algorithms, where the runtime is expressed in terms of m, the number of intervals produced by the query. Interval trees have a query time of O(log n + m) and an initial creation time of O(n log n), while limiting memory consumption to O(n). After creation, interval trees may be dynamic, allowing efficient insertion and deletion of an interval in O(log n). If the endpoints of intervals are within a small integer range (e.g., in the range [1,...,O(n)]), faster data structures exist[1] with preprocessing time O(n) and query time O(1+m) for reporting m intervals containing a given query point.
在计算机科学中,区间树是用于保存区间的树数据结构。具体来说,它允许人们有效地找到与任何给定区间或点重叠的所有区间。它通常用于窗口查询,例如,在矩形视口内的计算机地图上查找所有道路,或在 3D 场景中查找所有可见元素。类似的数据结构是段树。
简单的解决方案是访问每个区间并测试它是否与给定的点或区间相交,这需要 O(n) 时间,其中 n 是集合中的区间数。由于查询可能返回所有区间,例如,如果查询是与集合中所有区间相交的大区间,则这是渐近最优的;然而,我们可以通过考虑输出敏感算法做得更好,其中运行时间用 m 表示,即查询产生的间隔数。区间树的查询时间为 O(log n + m),初始创建时间为 O(n log n),同时将内存消耗限制为 O(n)。创建后,区间树可能是动态的,允许在 O(log n) 中有效地插入和删除区间。如果区间的端点在一个小的整数范围内(例如,在 [1,...,
回答by Paul Tomblin
Just as a quad tree works for a set of 2d points, a simple binary tree should work for this case. Build a tree with your ranges.
正如四叉树适用于一组 2d 点一样,简单的二叉树也适用于这种情况。用你的范围构建一棵树。
To explain further: Each node in the tree contains two integers, the beginning and end of the range, and the two children if it's not a leaf node. To find the ranges that your input range spans, then starting at the top of the tree
进一步解释:树中的每个节点包含两个整数,范围的开始和结束,如果它不是叶节点,则包含两个子节点。要找到您的输入范围跨越的范围,然后从树的顶部开始
- if the node range intersects the input range:
- if it's a leaf node, then add the range to your result list
- if it's not a leaf node, then traverse down to the child nodes and repeat this process.
It should be O(logN)
它应该是 O(logN)
Further detail: The binary tree would be structured like a 1-d version of a quad tree. Each node would have three integers (sorry I said two above, but now I realize you need three), the lowest representing the lowest value of the lowest range that's below this node, the highest representing the highest value of the highest range that's below this node, and the pivot. The left child would span from this node's lowest to its pivot. The right child would span from this node's pivot to this node's highest. If there is only one range that goes from "lowest" to "highest", you wouldn't have a pivot and this would be a leaf. Ideally you'd pick the pivots for each node to keep the tree balanced.
进一步的细节:二叉树的结构类似于四叉树的一维版本。每个节点将有三个整数(抱歉我上面说了两个,但现在我意识到你需要三个),最低代表低于此节点的最低范围的最低值,最高代表低于此的最高范围的最高值节点和枢轴。左孩子将从这个节点的最低点跨越到它的支点。右孩子将从这个节点的枢轴跨越到这个节点的最高点。如果只有一个范围从“最低”到“最高”,那么您将没有枢轴,这将是一片叶子。理想情况下,您应该为每个节点选择枢轴以保持树平衡。
回答by Bill Michell
Sounds like you need a class that implements the SortedSet interface. TreeSet is the implementation that ships with the core API.
听起来您需要一个实现 SortedSet 接口的类。TreeSet 是核心 API 附带的实现。
Have one set holding the ranges sorted by lowest value, and one sorted by highest value.
一组保存按最低值排序的范围,一组按最高值排序。
You can then implement the equivalent of the database algorithm using the in-memory sets.
然后,您可以使用内存集实现与数据库算法等效的算法。
As for whether this is actually faster than O(n), I couldn't say.
至于这是否真的比 O(n) 快,我不能说。
回答by flolo
This depends on your exact problem, in the linked question, the ranges where distinct, no common part, and the searched ranged could span multiple ranges. If your problem is the same, its is really easy: Take an array of the ranges, sort them by their lowest values (since they do not overlap, this would be also the same order as sorted by their upper values).
这取决于您的确切问题,在链接的问题中,不同的范围、没有共同的部分以及搜索的范围可以跨越多个范围。如果您的问题是相同的,那真的很简单:取一个范围数组,按它们的最低值对它们进行排序(因为它们不重叠,这也与按它们的高值排序的顺序相同)。
Now just make a binsearch for the your target lower value (or smaller if not exact) and one for the target upper value(or bigger if not exact). The resulting indexes are the the ranges which are coverd. You have to check wheter the ranges at the indexes itself are in- or excluded, but that are just 2 checks. Overall complexity O(log n).
现在只需对您的目标下限值(如果不准确则更小)和目标上限值(或如果不准确则更大)进行一次 binsearch。结果索引是覆盖的范围。您必须检查索引本身的范围是在内部还是排除在外,但这只是 2 个检查。总体复杂度 O(log n)。
回答by Lawrence Dol
When I had this problem, I used a sorted array of ranges and a binary search to look for intersections. This is (I believe) O(log n) performance, with a little bit of overhead to deal with overlapping ranges.
当我遇到这个问题时,我使用了一个排序的范围数组和一个二分搜索来寻找交叉点。这是(我相信)O(log n) 性能,有一点开销来处理重叠范围。
The answer to your question is, I think, derivable from the code below, but stopping short of the insertion. I present the entire code to avoid confusion by the differing context - I needed to insert a range of Unicode codepoints into a list of codepoint ranges.
我认为,您的问题的答案可以从下面的代码中推导出来,但没有插入。我展示了整个代码以避免被不同的上下文混淆 - 我需要将一系列 Unicode 代码点插入到代码点范围列表中。
-- EDIT --
- 编辑 -
Adapting the code below to determine intersections of multiple ranges involves a trivial forward search from the insertion point until a range is found which no longer intersects.
修改下面的代码以确定多个范围的交集涉及从插入点开始的简单前向搜索,直到找到不再相交的范围。
-- END EDIT --
-- 结束编辑 --
The Range class contains:
Range 类包含:
final int lower; // lower end of range
final int upper; // upper end of range
public int compareTo(Object obj) {
if(obj==null) { return -1; }
Range oth=(Range)obj;
if(lower<oth.lower) { return -1; }
if(lower>oth.lower) { return 1; }
if(upper<oth.upper) { return -1; }
if(upper>oth.upper) { return 1; }
return 0;
}
Range Insertion:
范围插入:
public Builder addRange(int fir, int las) {
if(fir!=-1) { fir&=0x001FFFFF; }
if(las!=-1) { las&=0x001FFFFF; }
if(codepoints==null || codepoints.length==0) {
codepoints=new Range[]{new Range(fir,las)};
}
else {
int idx=Range.findChar(codepoints,fir);
int ins=(idx<0 ? -(idx+1) : idx);
if(idx<0) {
if (ins>0 && fir==(codepoints[ins-1].upper+1)) { idx=(ins-1); } // new range adjoins the following range (can't overlap or idx would be >=0)
else if(ins<codepoints.length && las>=(codepoints[ins ].lower-1)) { idx=ins; } // new range overlaps or adjoins the following range
}
if(idx<0) {
codepoints=(Range[])Util.arrayInsert(codepoints,ins,new Range(fir,las));
}
else {
boolean rmv=false;
for(int xa=(idx+1); xa<codepoints.length && codepoints[xa].lower<=las; xa++) {
if(las<codepoints[xa].upper) { las=codepoints[xa].upper; }
codepoints[xa]=null;
rmv=true;
}
if(codepoints[idx].lower>fir || codepoints[idx].upper<las) {
codepoints[idx]=new Range((codepoints[idx].lower < fir ? codepoints[idx].lower : fir),(codepoints[idx].upper>las ? codepoints[idx].upper : las));
}
if(rmv) { codepoints=Range.removeNulls(codepoints); }
}
}
return this;
}
Binary Search:
二分查找:
static int findChar(Range[] arr, int val) {
if(arr.length==1) {
if (val< arr[0].lower) { return -1; } // value too low
else if(val<=arr[0].upper) { return 0; } // value found
else { return -2; } // value too high
}
else {
int lowidx=0; // low index
int hghidx=(arr.length-1); // high index
int mididx; // middle index
Range midval; // middle value
while(lowidx<=hghidx) {
mididx=((lowidx+hghidx)>>>1);
midval=arr[mididx];
if (val< midval.lower) { hghidx=(mididx-1); } // value too low
else if(val<=midval.upper) { return mididx; } // value found
else { lowidx=(mididx+1); } // value too high
}
return -(lowidx+1); // value not found.
}
}
回答by Adam Tegen
Non Overlapping Ranges:
非重叠范围:
Prep O(n log n):
准备 O(n log n):
- Make a array / vector of the ranges.
- Sort the vector by the end of the range (break ties by sorting by the start of the range)
- 制作范围的数组/向量。
- 按范围的结尾对向量进行排序(通过按范围的开头排序来打破平局)
Search:
搜索:
- Use binary search to find the first range with an End value of >= TestRange.Start
Iterator starting at the binary search until you find an Start > TestRange.End:
2a. If the range if the current range is within the TestRange, add it to your result.
- 使用二分查找查找 End 值为 >= TestRange.Start 的第一个范围
迭代器从二分查找开始,直到找到 Start > TestRange.End:
2a. 如果当前范围在 TestRange 内,则将其添加到您的结果中。
回答by Adam Tegen
Edit:It sounds like this solution is more or less an Interval Tree. A more complete implementation of an Interval Tree can be found here.
编辑:听起来这个解决方案或多或少是一个 Interval Tree。可以在此处找到间隔树的更完整实现。
class TreeNode
{
public:
long pivot;
List<Range> leaves; //Any ranges that intersect the pivot
TreeNode left; //Tree nodes that fall to the left of the pivot
TreeNode right; //Tree nodes that fall to the right of the pivot
};
Prep O(n log n):
准备 O(n log n):
- Create the list of ranges
- Pick the pivot points (possibly by using a sorted list of the end dates.) ??
- Build your tree.
- 创建范围列表
- 选择轴心点(可能通过使用结束日期的排序列表。) ??
- 建造你的树。
Search:
搜索:
- Use binary search to find the first pivot that is >= TestRange.End
Traverse the tree until the pivot > TestRange.Start
2a. Add the leaves to your result.
- 使用二分查找找到 >= TestRange.End 的第一个数据透视表
遍历树直到枢轴> TestRange.Start
2a. 将叶子添加到您的结果中。
Example:
例子:
Ranges:
范围:
- 0 - 2
- 1 - 2
- 2 - 3
- 1 - 4
- 2 - 4
- 0 - 5
- 4 - 5
- 2 - 6
- 3 - 7
- 0 - 2
- 1 - 2
- 2 - 3
- 1 - 4
- 2 - 4
- 0 - 5
- 4 - 5
- 2 - 6
- 3 - 7
Tree:
树:
4
--------------+------------------
3 | 7
| 1-4 |
| 2-4 |
| 0-5 |
| 4-5 |
---------+------ --------+--------
2 | null 6 | null
-----+---- 2-3 ----+---- 3-7
null | null null | null
0-2 2-6
1-2
回答by Adam Tegen
Overlapping Ranges:
重叠范围:
Prep O(n log n):
准备 O(n log n):
- Make a array / vector of the ranges.
- Sort the vector by the end of the range (break ties by sorting by the start of the range)
Make a second vector of ints. This represents the point at which you can stop searching.
int stop[size]; stop[size-1] = Ranges[size - 1].start; for (int i = size - 2; i >= 0; i--) { stop[i] = min(Ranges[i].start, stop[i+1]); }
- 制作范围的数组/向量。
- 按范围的结尾对向量进行排序(通过按范围的开头排序来打破平局)
制作第二个整数向量。这表示您可以停止搜索的点。
int stop[size]; stop[size-1] = Ranges[size - 1].start; for (int i = size - 2; i >= 0; i--) { stop[i] = min(Ranges[i].start, stop[i+1]); }
Search:
搜索:
- Use binary search to find the first range with an End value of >= TestRange.Start
Iterator starting at the binary search until stop[i] > TestRange.End:
2a. If the range if the current range is within the TestRange, add it to your result.
- 使用二分查找查找 End 值为 >= TestRange.Start 的第一个范围
迭代器从二分查找开始,直到 stop[i] > TestRange.End:
2a. 如果当前范围在 TestRange 内,则将其添加到您的结果中。
回答by TextGeek
If ranges overlap, and one wants to retrieve allthe ranges that overlap (or contain) a given target range, most of the solutions above don't appear to work.
如果范围重叠,并且想要检索与给定目标范围重叠(或包含)的所有范围,则上述大多数解决方案似乎都不起作用。
As some have pointed out, if (worst-case) allthe ranges happen to intersect the target range (for example, if the target range is {0..MAXINT} or similar) then of course it takes O(n) to return the n ranges.
正如一些人指出的那样,如果(最坏的情况)所有范围恰好与目标范围相交(例如,如果目标范围是 {0..MAXINT} 或类似的),那么当然需要 O(n) 来返回n 个范围。
But isn't the interesting and typical/average case, where only a very small % of the n total ranges do intersect the target range? Call the number that dointersect "m" -- in that case, you might conceivably be able to do as well as O(m). And if n=10^9 and m=10, that's a make-or-break difference.
但是,在 n 个总范围中只有很小 % 的范围与目标范围相交的有趣且典型/平均情况不是吗?拨打该号码做交叉“米” -在这种情况下,你可以设想能够做的,以及O(M)。如果 n=10^9 和 m=10,那就是成败之分。
Consider the simple case of a text document that has various regions marked up for their "type" -- perhaps you want to find all the marked-up units that contain or intersect a given contiguous range of text (for example, a paragraph). In HTML, XML, or similar those can only be ancestors of the text-node(s) containing at least some characters of the target range. In typical representations with parent pointers in each node, that's O(m) -- way better than O(n), especially because m is (for short or synchronous target ranges) merely the tree nesting depth, which tends to be even lower than ln(n) because big XML documents in practice get bushier not deeper.
考虑一个文本文档的简单情况,该文档具有针对其“类型”标记的多个区域——也许您想要查找包含或与给定的连续文本范围(例如,一个段落)相交的所有标记单元。在 HTML、XML 或类似内容中,它们只能是至少包含目标范围某些字符的文本节点的祖先。在每个节点中带有父指针的典型表示中,这是 O(m) —— 比 O(n) 好得多,特别是因为 m(对于短或同步目标范围)仅仅是树嵌套深度,它往往低于ln(n) 因为在实践中大型 XML 文档变得更复杂而不是更深。
The interesting case is harder: what if your "elements" do not form a tree as in XML, but can overlap as in MECS, CLIX, LMNL, and some other systems? You still want to find all the regions/"elements" that overlap your target, but they aren't so easily organized.
有趣的情况更难:如果您的“元素”不像在 XML 中那样形成树,但可以像在 MECS、CLIX、LMNL 和其他一些系统中那样重叠怎么办?您仍然希望找到与目标重叠的所有区域/“元素”,但它们并不那么容易组织。
On the other hand, you should be able to do very well because marked-up ranges in many applications are most frequently small -- there are far more words, sentences, and paragraphs in a book, than there are chapters. So even though there may be a huge number of ranges that start before the target and a huge number that end after it, the intersection will be very small on average.
另一方面,您应该能够做得很好,因为许多应用程序中的标记范围通常很小——一本书中的单词、句子和段落比章节多得多。因此,即使可能有大量范围在目标之前开始并在目标之后结束,但平均而言交叉点将非常小。
I think that's what the original questioner was getting at, and I'm afraid i didn't see an answer that addresses that problem. If it's not what the original question was about, then I'd like to pose it as a new question.
我认为这就是原始提问者的意思,恐怕我没有看到解决该问题的答案。如果这不是原始问题的内容,那么我想将其作为一个新问题提出。