XStringSet-io {Biostrings}R Documentation

Read/write an XStringSet object from/to a file

Description

Functions to read/write an XStringSet object from/to a file.

Usage

## Read FASTA (or FASTQ) files in an XStringSet object:
read.BStringSet(filepath, format="fasta",
                nrec=-1L, skip=0L, use.names=TRUE)
read.DNAStringSet(filepath, format="fasta",
                  nrec=-1L, skip=0L, use.names=TRUE)
read.RNAStringSet(filepath, format="fasta",
                  nrec=-1L, skip=0L, use.names=TRUE)
read.AAStringSet(filepath, format="fasta",
                 nrec=-1L, skip=0L, use.names=TRUE)

## Extract basic information about FASTA (or FASTQ) files
## without loading them:
fasta.info(filepath, nrec=-1L, skip=0L, use.names=TRUE)
fastq.geometry(filepath, nrec=-1L, skip=0L)

## Write an XStringSet object to a FASTA (or FASTQ) file:
write.XStringSet(x, filepath, append=FALSE, format="fasta", ...)

## Serialize an XStringSet object:
save.XStringSet(x, objname, dirpath=".", save.dups=FALSE, verbose=TRUE)

Arguments

filepath A character vector (of arbitrary length when reading, of length 1 when writing) containing the path(s) to the file(s) to read or write. Note that special values like "" or "|cmd" (typically supported by other I/O functions in R) are not supported here. Also filepath cannot be a connection.
format Either "fasta" (the default) or "fastq". Note that write.XStringSet only supports "fasta" for now.
nrec Single integer. The maximum of number of records to read in. Negative values are ignored.
skip Single non-negative integer. The number of records of the data file(s) to skip before beginning to read in records.
use.names Should the returned vector be named? For FASTA the names are taken from the record description lines. For FASTQ they are taken from the record sequence ids. Dropping the names can help reducing memory footprint e.g. for a FASTQ file containing millions of reads.
x For write.XStringSet, the object to write to file.

For save.XStringSet, the object to serialize.

append TRUE or FALSE. If TRUE output will be appended to file; otherwise, it will overwrite the contents of file. See ?cat for the details.
... Further format-specific arguments. If format="fasta", the width argument (single integer) can be used to specify the maximum number of letters per line of sequence. If format="fastq", the qualities argument (BStringSet object) must be used to specify the qualities.
objname The name of the serialized object.
dirpath The path to the directory where to save the serialized object.
save.dups TRUE or FALSE. If TRUE then the Dups object describing how duplicated elements in x are related to each other is saved too. For advanced users only.
verbose TRUE or FALSE.

Details

Only FASTA and FASTQ files are supported for now. The qualities stored in the FASTQ records are ignored.

Reading functions read.BStringSet, read.DNAStringSet, read.RNAStringSet and read.AAStringSet load sequences from an input file (or set of input files) into an XStringSet object. When multiple input files are specified, they are read in the corresponding order and their data are stored in the returned object in that order. Note that when multiple input FASTQ files are specified, all must have the same "width" (i.e. all their sequences must have the same length).

The fasta.info utility returns an integer vector with one element per FASTA record in the input files. Each element is the length of the sequence found in the corresponding record.

The fastq.geometry utility returns an integer vector describing the "geometry" of the FASTQ files i.e. a vector of length 2 where the first element is the total number of FASTQ records in the files and the second element the common "width" of these files (this width is NA if the files contain no FASTQ records or records with different widths).

write.XStringSet writes an XStringSet object to a file.

Serializing an XStringSet object with save.XStringSet is equivalent to using the standard save mechanism. But it will try to reduce the size of x in memory first before calling save. Most of the times this leads to a much reduced size on disk.

See Also

readFASTA, writeFASTA, XStringSet-class, BString-class, DNAString-class, RNAString-class, AAString-class

Examples

  ## ---------------------------------------------------------------------
  ## A. READ/WRITE FASTA FILES
  ## ---------------------------------------------------------------------
  filepath <- system.file("extdata", "someORF.fa", package="Biostrings")
  fasta.info(filepath)
  x <- read.DNAStringSet(filepath)
  x
  out <- tempfile()
  write.XStringSet(x, out)

  ## ---------------------------------------------------------------------
  ## B. READ/WRITE FASTQ FILES
  ## ---------------------------------------------------------------------
  filepath <- system.file("extdata", "s_1_sequence.txt",
                          package="Biostrings")
  fastq.geometry(filepath)
  read.DNAStringSet(filepath, format="fastq")

  library(BSgenome.Celegans.UCSC.ce2)
  ## Create a "sliding window" on chr I:
  sw_start <- seq.int(1, length(Celegans$chrI)-50, by=50)
  sw <- Views(Celegans$chrI, start=sw_start, width=10)
  my_fake_shortreads <- as(sw, "XStringSet")
  my_fake_ids <- sprintf("ID%06d",  seq_len(length(my_fake_shortreads)))
  names(my_fake_shortreads) <- my_fake_ids
  my_fake_shortreads
  my_fake_quals <- rep.int(BStringSet("]]]]]]]]]]"),
                           length(my_fake_shortreads))
  my_fake_quals
  out <- tempfile()
  write.XStringSet(my_fake_shortreads, out, format="fastq",
                   qualities=my_fake_quals)

  ## ---------------------------------------------------------------------
  ## C. SERIALIZATION
  ## ---------------------------------------------------------------------
  save.XStringSet(my_fake_shortreads, "my_fake_shortreads", dirpath=tempdir())

[Package Biostrings version 2.18.2 Index]