IntervalTree-class {IRanges} | R Documentation |
Efficiently perform overlap queries with an interval tree.
A common type of query that arises when working with intervals is finding which intervals in one set overlap those in another. An efficient family of algorithms for answering such queries is known as the Interval Tree. This implementation makes use of the augmented tree algorithm from the reference below, but heavily adapts it for the use case of large, sorted query sets.
The simplest approach is to call the findOverlaps
function on a
Ranges
or other object with range information, as described in
the following section.
An IntervalTree
object is a derivative of
Ranges
and stores its ranges as a tree that is
optimized for overlap queries. Thus, for repeated queries against the
same subject, it is more efficient to create an IntervalTree
once for the subject using the constructor described below and then
perform the queries against the IntervalTree
instance.
This main purpose of the interval tree is to optimize the search for
ranges overlapping those in a query set. The interface for this
operation is the findOverlaps
function.
findOverlaps(query, subject = query, maxgap = 0L, minoverlap = 1L,
type = c("any", "start", "end", "within", "equal"),
select = c("all", "first", "last", "arbitrary"),
drop = FALSE, ignoreSelf = FALSE, ignoreRedundant = FALSE)
:
Find the intervals in query
, a Ranges
,
RangesList
, RangedData
, or integer
vector (to
be converted to length-one ranges), that overlap with the
intervals subject
, a Ranges
, RangesList
, or
RangedData
. If query
is unsorted, it is sorted
first, so it is usually better to sort up-front, to avoid a sort
with each findOverlaps
call.
If subject
is omitted, query
is queried against
itself. In this case, and only this case, the ignoreSelf
and ignoreRedundant
arguments are allowed. By default,
the result will contain hits for each range against itself, and if
there is a hit from A to B, there is also a hit for B to A. If
ignoreSelf
is TRUE
, all self matches are dropped. If
ignoreRedundant
is TRUE
, only one of A->B and B->A
is returned.
Intervals with a separation of maxgap
or less and a minimum
of minoverlap
overlapping positions, allowing for
maxgap
, are considered to be overlapping. maxgap
should be a scalar, non-negative, integer. minoverlap
should be a scalar, positive integer.
When select
is "all", the results are returned as a
RangesMatching
object. When select
is "first"
, "last"
, or "arbitrary"
the results
are returned as an integer vector of length query
containing
the first, last, or arbitrary overlapping interval in subject
,
with NA
indicating intervals that did not overlap any intervals
in subject
.
If query
is a RangesList
or
RangedData
, subject
must be a
RangesList
or RangedData
. If both lists have names,
each element from the subject is paired with the element from the
query with the matching name, if any. Otherwise, elements are
paired by position. The overlap is then computed between the pairs
as described above. If select
is "all"
, a
RangesMatchingList
is returned. For all other
select
the return value depends on the drop
argument.
When select != "all" && !drop
, an IntegerList
is returned, where each element of the result corresponds to a space in
query
. Whenselect != "all" && drop
, an integer vector is
returned containing indices that are offset to align with the unlisted
query
.
By default, any overlap is accepted. By specifying the type
parameter, one can select for specific types of overlap. The types
correspond to operations in Allen's Interval Algebra (see
references). If type
is start
or end
, the
intervals are required to have matching starts or ends,
respectively. While this operation seems trivial, the naive
implementation using outer
would be much less
efficient. Specifying equal
as the type returns the
intersection of the start
and end
matches. If
type
is within
, the query interval must be wholly
contained within the subject interval. Note that all matches must
additionally satisfy the minoverlap
constraint described above.
The maxgap
parameter has special meaning with the special
overlap types. For start
, end
, and equal
,
it specifies the maximum difference in the starts, ends or both,
respectively. For within
, it is the maximum amount by which
the query may be wider than the subject.
countOverlaps(query, subject, maxgap = 0L, minoverlap = 1L,
type = c("any", "start", "end", "within", "equal"))
:
Returns the overlap hit count for each range in query
using the
specified findOverlaps
parameters. Both query
and
subject
should be Ranges
, RangesList
or
RangedData
objects.
subsetByOverlaps(query, subject, maxgap = 0L, minoverlap = 1L,
type = c("any", "start", "end", "within", "equal"))
:
Returns the subset of query
that has an overlap hit with a range
in subject
using the specified findOverlaps
parameters.
Both query
and subject
should be Ranges
,
RangesList
or RangedData
objects.
x %in% table
:
Shortcut for finding the ranges in x
that overlap any of
the ranges in table
. Both x
and table
should
be Ranges
, RangesList
or RangedData
objects. For Ranges
objects, the result is a logical
vector of length equal to the number of ranges in x
. For
RangesList
and RangedData
objects, the result is a
LogicalList
object, where each element of the
result corresponds to a space in x
.
match(x, table, nomatch = NA_integer_, incomparables =
NULL)
:
Returns an integer vector of length length(x)
, containing
the index of the first overlapping range in table
for each
range in x
. If a range in x
does not overlap any
ranges in table
, its value is nomatch
. The x
and table
arguments should either be both Ranges
objects or both RangesList
objects, in which case the
indices are into the unlisted table
. The
incomparables
argument is currently ignored.
IntervalTree
from the
ranges in ranges
, an object coercible to
IntervalTree
, such as an IRanges
object.
as(from, "IRanges")
: Imports the ranges in
from
, an IntervalTree
, to an
IRanges
.as(from, "IntervalTree")
: Constructs an
IntervalTree
representing from
, a Ranges
object that is coercible to IRanges
.
length(x)
: Gets the number of ranges stored in the
tree. This is a fast operation that does not bring the ranges into
R.start(x)
: Get the starts of the ranges.end(x)
: Get the ends of the ranges.
The cost of constructing an instance of the interval tree is a
O(n*lg(n))
, which makes it about as fast as other types of
overlap query algorithms based on sorting. The good news is that the
tree need only be built once per subject; this is useful in situations
of frequent querying. Also, in this implementation the data is stored
outside of R, avoiding needless copying. Of course, external storage
is not always convenient, so it is possible to coerce the tree to an
instance of IRanges
(see the Coercion section).
For the query operation, the running time is based on the query size
m
and the average number of hits per query k
. The output
size is then max(mk,m)
, but we abbreviate this as
mk
. Note that when the multiple
parameter is set to
FALSE
, k
is fixed to 1 and drops out of this
analysis. We also assume here that the query is sorted by start
position (the findOverlaps
function sorts the query if it is unsorted).
An upper bound for finding overlaps is
O(min(mk*lg(n),n+mk))
. The fastest interval tree algorithm
known is bounded by O(min(m*lg(n),n)+mk)
but is a lot more
complicated and involves two auxillary trees. The lower bound is
Omega(lg(n)+mk)
, which is almost the same as for returning
the answer, Omega(mk)
. The average is of course somewhere in
between.
This analysis informs the choice of which set of ranges to process
into a tree, i.e. assigning one to be the subject and the other to be
the query. Note that if m > n
, then the running time is
O(m)
, and the total operation of complexity O(n*lg(n) +
m)
is better than if m
and n
were exchanged. Thus, for
once-off operations, it is often most efficient to choose the smaller
set to become the tree (but k
also affects this). This is
reinforced by the realization that if mk
is about the same in
either direction, the running time depends only on n
, which
should be minimized. Even in cases where a tree has already been
constructed for one of the sets, it can be more efficient to build a
new tree when the existing tree of size n
is much larger than
the query set of size m
, roughly when n > m*lg(n)
.
Michael Lawrence
Interval tree algorithm from: Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford. Introduction to Algorithms, second edition, MIT Press and McGraw-Hill. ISBN 0-262-53196-8
Allen's Interval Algebra: James F. Allen: Maintaining knowledge about temporal intervals. In: Communications of the ACM. 26/11/1983. ACM Press. S. 832-843, ISSN 0001-0782
Ranges
, the parent of this class,
RangesMatching
, the result of an overlap query.
query <- IRanges(c(1, 4, 9), c(5, 7, 10)) subject <- IRanges(c(2, 2, 10), c(2, 3, 12)) tree <- IntervalTree(subject) ## at most one hit per query findOverlaps(query, tree, select = "first") findOverlaps(query, tree, select = "last") findOverlaps(query, tree, select = "arbitrary") ## allow multiple hits findOverlaps(query, tree) ## overlap as long as distance <= 1 findOverlaps(query, tree, maxgap = 1L) ## shortcut findOverlaps(query, subject) ## query and subject are easily interchangeable query <- IRanges(c(1, 4, 9), c(5, 7, 10)) subject <- IRanges(c(2, 2), c(5, 4)) tree <- IntervalTree(subject) t(findOverlaps(query, tree)) # the same as: findOverlaps(subject, query) ## one Ranges with itself findOverlaps(query) ## single points as query subject <- IRanges(c(1, 6, 13), c(4, 9, 14)) findOverlaps(c(3L, 7L, 10L), subject, select = "first") ## alternative overlap types query <- IRanges(c(1, 5, 3, 4), width=c(2, 2, 4, 6)) subject <- IRanges(c(1, 3, 5, 6), width=c(4, 4, 5, 4)) findOverlaps(query, tree, type = "start") findOverlaps(query, tree, type = "start", maxgap = 1L) findOverlaps(query, tree, type = "end", select = "first") findOverlaps(query, tree, type = "within", maxgap = 1L)