GPos-class {GenomicRanges} | R Documentation |
The GPos class is a container for storing a set of genomic positions, that is, genomic ranges of width 1. Even though a GRanges object can be used for that, using a GPos object can be much more memory-efficient, especially when the object contains long runs of adjacent positions.
GPos(pos_runs) # constructor function
pos_runs |
A GRanges object (or any other GenomicRanges derivative)
where each range is interpreted as a run of adjacent genomic positions.
If |
A GPos object.
GPos objects support the same set of getters as GRanges
objects (i.e. seqnames()
, start()
, end()
,
ranges()
, strand()
, mcols()
, seqinfo()
,
etc...), plus the pos()
getter which is equivalent to
start()
or end()
. See ?GRanges
for the list of
getters supported by GRanges objects.
Note that a GPos object cannot hold names i.e. names()
always returns NULL
on it.
Like GRanges objects, GPos objects support the following setters:
The mcols()
and metadata()
setters.
The family of setters that operate on the seqinfo component of
an object:
seqlevels()
,
seqlevelsStyle()
,
seqlengths()
,
isCircular()
,
genome()
,
and seqinfo()
.
These setters are defined and documented in the GenomeInfoDb
package.
However, there is no seqnames()
, pos()
, or strand()
setter for GPos objects at the moment (although they might be
added in the future).
From GenomicRanges to GPos:
A GenomicRanges object x
in which all the ranges have a width
of 1 can be coerced to a GPos object with as(x, "GPos")
.
The names on x
are not propagated (a warning is issued if x
has names on it).
From GPos to GRanges:
A GPos object x
can be coerced to a GRanges object
with as(x, "GRanges")
. However be aware that the resulting object
can use thousands times (or more) memory than x
!
See "MEMORY USAGE" in the Examples section below.
From GPos to ordinary R objects:
Like with a GRanges object, as.character()
, as.factor()
,
and as.data.frame()
work with a GPos object x
.
Note that as.data.frame(x)
returns a data frame with a pos
column (containing pos(x)
) instead of the start
, end
,
and width
columns that one gets when x
is a GRanges
object.
A GPos object can be subsetted exactly like a GRanges object.
GPos objects can be combined (a.k.a. appended) with c()
or
append()
.
Like with a GRanges object, split()
and relist()
work
with a GPos object x
. Note that they return a
GenomicRangesList object instead of a GRangesList object.
Like for any Vector derivative, the length of a
GPos object cannot exceed .Machine$integer.max
(i.e. 2^31 on
most platforms). GPos()
will return an error if pos_runs
contains too many genomic positions.
Hervé Pagès; based on ideas borrowed from Georg Stricker georg.stricker@in.tum.de and Julien Gagneur gagneur@in.tum.de
GRanges objects.
The seqinfo
accessor and family in
the GenomeInfoDb package for accessing/modifying the seqinfo
component of an object.
GenomicRanges-comparison for comparing and ordering genomic positions.
findOverlaps-methods for finding overlapping genomic ranges and/or positions.
nearest-methods for finding the nearest genomic range/position neighbor.
The snpsBySeqname
,
snpsByOverlaps
, and
snpsById
methods for
SNPlocs objects defined in the BSgenome
package for extractors that return a GPos object.
SummarizedExperiment objects in the SummarizedExperiment package.
## --------------------------------------------------------------------- ## BASIC EXAMPLES ## --------------------------------------------------------------------- ## Example 1: gpos1 <- GPos(c("chr1:44-53", "chr1:5-10", "chr2:2-5")) gpos1 length(gpos1) seqnames(gpos1) pos(gpos1) # same as 'start(gpos1)' and 'end(gpos1)' strand(gpos1) as.character(gpos1) as.data.frame(gpos1) as(gpos1, "GRanges") as.data.frame(as(gpos1, "GRanges")) gpos1[9:17] ## Example 2: pos_runs <- GRanges("chrI", IRanges(c(1, 6, 12, 17), c(5, 10, 16, 20)), strand=c("+", "-", "-", "+")) gpos2 <- GPos(pos_runs) gpos2 strand(gpos2) ## Example 3: gpos3A <- gpos3B <- GPos(c("chrI:1-1000", "chrI:1005-2000")) npos <- length(gpos3A) mcols(gpos3A)$sample <- Rle("sA") sA_counts <- sample(10, npos, replace=TRUE) mcols(gpos3A)$counts <- sA_counts mcols(gpos3B)$sample <- Rle("sB") sB_counts <- sample(10, npos, replace=TRUE) mcols(gpos3B)$counts <- sB_counts gpos3 <- c(gpos3A, gpos3B) gpos3 ## Example 4: library(BSgenome.Scerevisiae.UCSC.sacCer2) genome <- BSgenome.Scerevisiae.UCSC.sacCer2 gpos4 <- GPos(seqinfo(genome)) gpos4 # all the positions along the genome are represented mcols(gpos4)$dna <- do.call("c", unname(as.list(genome))) gpos4 ## Note however that, like for any Vector derivative, the length of a ## GPos object cannot exceed '.Machine$integer.max' (i.e. 2^31 on most ## platforms) so the above only works with a "small" genome. ## For example it doesn't work with the Human genome: library(TxDb.Hsapiens.UCSC.hg19.knownGene) ## Not run: GPos(seqinfo(TxDb.Hsapiens.UCSC.hg19.knownGene)) # error! ## End(Not run) ## You can use isSmallGenome() to check upfront whether the genome is ## "small" or not. isSmallGenome(genome) isSmallGenome(TxDb.Hsapiens.UCSC.hg19.knownGene) ## --------------------------------------------------------------------- ## MEMORY USAGE ## --------------------------------------------------------------------- ## Coercion to GRanges works... gr4 <- as(gpos4, "GRanges") gr4 ## ... but is generally not a good idea: object.size(gpos4) object.size(gr4) # 6951 times bigger than the GPos object! ## Shuffling the order of the positions impacts memory usage: gpos4r <- rev(gpos4) object.size(gpos4r) # significantly gpos4s <- sample(gpos4) object.size(gpos4s) # even worse! ## AN IMPORTANT NOTE: In the worst situations, GPos still performs as ## good as a GRanges object. object.size(as(gpos4r, "GRanges")) # same size as 'gpos4r' object.size(as(gpos4s, "GRanges")) # same size as 'gpos4s' ## Best case scenario is when the object is strictly sorted (i.e. ## positions are in strict ascending order). ## This can be checked with: is.unsorted(gpos4, strict=TRUE) # 'gpos4' is strictly sorted ## --------------------------------------------------------------------- ## USING MEMORY-EFFICIENT METADATA COLUMNS ## --------------------------------------------------------------------- ## In order to keep memory usage as low as possible, it is recommended ## to use a memory-efficient representation of the metadata columns that ## we want to set on the object. Rle's are particularly well suited for ## this, especially if the metadata columns contain long runs of ## identical values. This is the case for example if we want to use a ## GPos object to represent the coverage of sequencing reads along a ## genome. ## Example 5: library(pasillaBamSubset) library(Rsamtools) # for the BamFile() constructor function bamfile1 <- BamFile(untreated1_chr4()) bamfile2 <- BamFile(untreated3_chr4()) gpos5 <- GPos(seqinfo(bamfile1)) library(GenomicAlignments) # for "coverage" method for BamFile objects cov1 <- unlist(coverage(bamfile1), use.names=FALSE) cov2 <- unlist(coverage(bamfile2), use.names=FALSE) mcols(gpos5) <- DataFrame(cov1, cov2) gpos5 object.size(gpos5) # lightweight ## Keep only the positions where coverage is at least 10 in one of the ## 2 samples: gpos5[mcols(gpos5)$cov1 >= 10 | mcols(gpos5)$cov2 >= 10] ## --------------------------------------------------------------------- ## USING A GPos OBJECT IN A SummarizedExperiment OBJECT ## --------------------------------------------------------------------- ## Because the GPos class extends the GenomicRanges virtual class, a GPos ## object can be used as the rowRanges component of a SummarizedExperiment ## object. ## As a 1st example, we show how the counts for samples sA and sB in ## 'gpos3' can be stored in a SummarizedExperiment object where the rows ## correspond to unique genomic positions and the columns to samples: library(SummarizedExperiment) counts <- cbind(sA=sA_counts, sB=sB_counts) mcols(gpos3A) <- NULL rse3 <- SummarizedExperiment(list(counts=counts), rowRanges=gpos3A) rse3 rowRanges(rse3) head(assay(rse3)) ## Finally we show how the coverage data from Example 5 can be easily ## stored in a lightweight SummarizedExperiment object: cov <- mcols(gpos5) mcols(gpos5) <- NULL rse5 <- SummarizedExperiment(list(cov=cov), rowRanges=gpos5) rse5 rowRanges(rse5) assay(rse5) ## Keep only the positions where coverage is at least 10 in one of the ## 2 samples: rse5[assay(rse5)$cov1 >= 10 | assay(rse5)$cov2 >= 10]