IntervalTree-class {IRanges}R Documentation

Interval Search Trees

Description

Efficiently perform overlap queries with an interval tree.

Details

A common type of query that arises when working with intervals is finding which intervals in one set overlap those in another. An efficient family of algorithms for answering such queries is known as the Interval Tree. This implementation makes use of the augmented tree algorithm from the reference below, but heavily adapts it for the use case of large, sorted query sets.

The simplest approach is to call the findOverlaps function on a Ranges or other object with range information, as described in the following section.

An IntervalTree object is a derivative of Ranges and stores its ranges as a tree that is optimized for overlap queries. Thus, for repeated queries against the same subject, it is more efficient to create an IntervalTree once for the subject using the constructor described below and then perform the queries against the IntervalTree instance.

Finding Overlaps

This main purpose of the interval tree is to optimize the search for ranges overlapping those in a query set. The interface for this operation is the findOverlaps function.

findOverlaps(query, subject = query, maxgap = 0L, minoverlap = 1L, type = c("any", "start", "end", "within", "equal"), select = c("all", "first", "last", "arbitrary"), drop = FALSE, ignoreSelf = FALSE, ignoreRedundant = FALSE):

Find the intervals in query, a Ranges, RangesList, RangedData, or integer vector (to be converted to length-one ranges), that overlap with the intervals subject, a Ranges, RangesList, or RangedData. If query is unsorted, it is sorted first, so it is usually better to sort up-front, to avoid a sort with each findOverlaps call.

If subject is omitted, query is queried against itself. In this case, and only this case, the ignoreSelf and ignoreRedundant arguments are allowed. By default, the result will contain hits for each range against itself, and if there is a hit from A to B, there is also a hit for B to A. If ignoreSelf is TRUE, all self matches are dropped. If ignoreRedundant is TRUE, only one of A->B and B->A is returned.

Intervals with a separation of maxgap or less and a minimum of minoverlap overlapping positions, allowing for maxgap, are considered to be overlapping. maxgap should be a scalar, non-negative, integer. minoverlap should be a scalar, positive integer.

When select is "all", the results are returned as a RangesMatching object. When select is "first", "last", or "arbitrary" the results are returned as an integer vector of length query containing the first, last, or arbitrary overlapping interval in subject, with NA indicating intervals that did not overlap any intervals in subject.

If query is a RangesList or RangedData, subject must be a RangesList or RangedData. If both lists have names, each element from the subject is paired with the element from the query with the matching name, if any. Otherwise, elements are paired by position. The overlap is then computed between the pairs as described above. If select is "all", a RangesMatchingList is returned. For all other select the return value depends on the drop argument. When select != "all" && !drop, an IntegerList is returned, where each element of the result corresponds to a space in query. Whenselect != "all" && drop, an integer vector is returned containing indices that are offset to align with the unlisted query.

By default, any overlap is accepted. By specifying the type parameter, one can select for specific types of overlap. The types correspond to operations in Allen's Interval Algebra (see references). If type is start or end, the intervals are required to have matching starts or ends, respectively. While this operation seems trivial, the naive implementation using outer would be much less efficient. Specifying equal as the type returns the intersection of the start and end matches. If type is within, the query interval must be wholly contained within the subject interval. Note that all matches must additionally satisfy the minoverlap constraint described above.

The maxgap parameter has special meaning with the special overlap types. For start, end, and equal, it specifies the maximum difference in the starts, ends or both, respectively. For within, it is the maximum amount by which the query may be wider than the subject.

countOverlaps(query, subject, maxgap = 0L, minoverlap = 1L, type = c("any", "start", "end", "within", "equal")): Returns the overlap hit count for each range in query using the specified findOverlaps parameters. Both query and subject should be Ranges, RangesList or RangedData objects.
subsetByOverlaps(query, subject, maxgap = 0L, minoverlap = 1L, type = c("any", "start", "end", "within", "equal")): Returns the subset of query that has an overlap hit with a range in subject using the specified findOverlaps parameters. Both query and subject should be Ranges, RangesList or RangedData objects.
x %in% table: Shortcut for finding the ranges in x that overlap any of the ranges in table. Both x and table should be Ranges, RangesList or RangedData objects. For Ranges objects, the result is a logical vector of length equal to the number of ranges in x. For RangesList and RangedData objects, the result is a LogicalList object, where each element of the result corresponds to a space in x.
match(x, table, nomatch = NA_integer_, incomparables = NULL): Returns an integer vector of length length(x), containing the index of the first overlapping range in table for each range in x. If a range in x does not overlap any ranges in table, its value is nomatch. The x and table arguments should either be both Ranges objects or both RangesList objects, in which case the indices are into the unlisted table. The incomparables argument is currently ignored.

Constructor

IntervalTree(ranges): Creates an IntervalTree from the ranges in ranges, an object coercible to IntervalTree, such as an IRanges object.

Coercion

as(from, "IRanges"): Imports the ranges in from, an IntervalTree, to an IRanges.
as(from, "IntervalTree"): Constructs an IntervalTree representing from, a Ranges object that is coercible to IRanges.

Accessors

length(x): Gets the number of ranges stored in the tree. This is a fast operation that does not bring the ranges into R.
start(x): Get the starts of the ranges.
end(x): Get the ends of the ranges.

Notes on Time Complexity

The cost of constructing an instance of the interval tree is a O(n*lg(n)), which makes it about as fast as other types of overlap query algorithms based on sorting. The good news is that the tree need only be built once per subject; this is useful in situations of frequent querying. Also, in this implementation the data is stored outside of R, avoiding needless copying. Of course, external storage is not always convenient, so it is possible to coerce the tree to an instance of IRanges (see the Coercion section).

For the query operation, the running time is based on the query size m and the average number of hits per query k. The output size is then max(mk,m), but we abbreviate this as mk. Note that when the multiple parameter is set to FALSE, k is fixed to 1 and drops out of this analysis. We also assume here that the query is sorted by start position (the findOverlaps function sorts the query if it is unsorted).

An upper bound for finding overlaps is O(min(mk*lg(n),n+mk)). The fastest interval tree algorithm known is bounded by O(min(m*lg(n),n)+mk) but is a lot more complicated and involves two auxillary trees. The lower bound is Omega(lg(n)+mk), which is almost the same as for returning the answer, Omega(mk). The average is of course somewhere in between.

This analysis informs the choice of which set of ranges to process into a tree, i.e. assigning one to be the subject and the other to be the query. Note that if m > n, then the running time is O(m), and the total operation of complexity O(n*lg(n) + m) is better than if m and n were exchanged. Thus, for once-off operations, it is often most efficient to choose the smaller set to become the tree (but k also affects this). This is reinforced by the realization that if mk is about the same in either direction, the running time depends only on n, which should be minimized. Even in cases where a tree has already been constructed for one of the sets, it can be more efficient to build a new tree when the existing tree of size n is much larger than the query set of size m, roughly when n > m*lg(n).

Author(s)

Michael Lawrence

References

Interval tree algorithm from: Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford. Introduction to Algorithms, second edition, MIT Press and McGraw-Hill. ISBN 0-262-53196-8

Allen's Interval Algebra: James F. Allen: Maintaining knowledge about temporal intervals. In: Communications of the ACM. 26/11/1983. ACM Press. S. 832-843, ISSN 0001-0782

See Also

Ranges, the parent of this class, RangesMatching, the result of an overlap query.

Examples

  query <- IRanges(c(1, 4, 9), c(5, 7, 10))
  subject <- IRanges(c(2, 2, 10), c(2, 3, 12))
  tree <- IntervalTree(subject)

  ## at most one hit per query
  findOverlaps(query, tree, select = "first")
  findOverlaps(query, tree, select = "last")
  findOverlaps(query, tree, select = "arbitrary")

  ## allow multiple hits
  findOverlaps(query, tree)

  ## overlap as long as distance <= 1
  findOverlaps(query, tree, maxgap = 1L)

  ## shortcut
  findOverlaps(query, subject)

  ## query and subject are easily interchangeable
  query <- IRanges(c(1, 4, 9), c(5, 7, 10))
  subject <- IRanges(c(2, 2), c(5, 4))
  tree <- IntervalTree(subject)
  t(findOverlaps(query, tree))
  # the same as:
  findOverlaps(subject, query)

  ## one Ranges with itself
  findOverlaps(query)

  ## single points as query
  subject <- IRanges(c(1, 6, 13), c(4, 9, 14))
  findOverlaps(c(3L, 7L, 10L), subject, select = "first")

  ## alternative overlap types
  query <- IRanges(c(1, 5, 3, 4), width=c(2, 2, 4, 6))
  subject <- IRanges(c(1, 3, 5, 6), width=c(4, 4, 5, 4))

  findOverlaps(query, tree, type = "start")
  findOverlaps(query, tree, type = "start", maxgap = 1L)
  findOverlaps(query, tree, type = "end", select = "first")
  findOverlaps(query, tree, type = "within", maxgap = 1L)

[Package IRanges version 1.6.16 Index]