Load the library.
library(refseqR)
This vignette shows a tutorial of how I have been using refseqR
to automate some common processes of my research. The package refseqR
is built on top of rentrez
, the excellent library written by David Winter to query the NCBI’s API and fetch the resulting data.
In short, refseqR
provides summary information at three different levels:
Given a gene symbol/name obtained from the ‘Gene’ database, the refseqR
library enables us to retrieve the associated mRNA/transcript and protein accessions.
<- c("LOC101512347")
GeneID <- refseq_fromGene(GeneID, sequence = "transcript")
transcript #> NCBI servers are busy. Please try again a bit later.
<- refseq_fromGene(GeneID, sequence = "protein")
protein #> NCBI servers are busy. Please try again a bit later.
The mRNA transcript identifier (id) for LOC101512347 = .
The protein identifier (id) for LOC101512347 = .
Similarly, the function is effective when utilizing gene symbols that encode for multiple transcripts.
<- c("LOC105852298")
GeneID <- refseq_fromGene(GeneID, sequence = "transcript")
transcript #> NCBI servers are busy. Please try again a bit later.
<- refseq_fromGene(GeneID, sequence = "protein")
protein #> NCBI servers are busy. Please try again a bit later.
The mRNA transcript ids. for LOC105852298 = .
The protein ids. for LOC105852298 = .
The function refseq_description
returns the sequence description corresponding to a given accession. The identifier (id) can be either a transcript, protein, or Gene identifier.
<- c("LOC101512347")
id refseq_description(id)
It is important to note that gene symbols (e.x. “LOC105852298”) are not unique, and a single gene symbol search may map to multiple sequences. To avoid inconsistencies in function, it is highly recommended to use the actual GeneID (e.x. “105852298) as the first argument.
Using the rentrez
package, we can fetch data from NCBI. Here, the first 30 lines for accession “XM_004487701” :
<- rentrez::entrez_fetch(db= 'nuccore', id = "XM_004487701", rettype = 'gp')
mrna_gb strsplit(mrna_gb, "\n")[[1]][1:30]
The refseq_mRNAfeat
function serves as a wrapper built on top of entrez_summary
from the rentrez
package, designed to extract specific features from the obtained data. Typically, my focus lies on key features like id, accession, title, update, or sequence length (bp). However, you have the flexibility to tailor the function to extract additional features of interest from the esummary_list
object.
= c("XM_004487701", "XM_004488493", "XM_004501904")
transcript = c("caption", "moltype", "sourcedb", "slen", "title")
feat refseq_mRNAfeat(transcript, feat)
Another interesting function is refseq_RNA2protein
, which retrieves the protein accession associated with the provided mRNA.
<- "XM_004487701"
transcript refseq_RNA2protein(transcript)
The CDS coordinates come in handy when we want to get the fasta sequence. We sometimes do not want the 5’UTR or 3’UTR contained in the mRNA sequence and are interested just in the CDS.
The function refseq_CDScoords
creates an IRanges
object with the CDS coordinates from an mRNA accession. The output object is the basis for refseq_CDSseq
, which fetches the NCBI data, uses that coordinates and returns a DNAString
object with the CDS nucleotide sequence.
refseq_CDScoords(transcript)
#> IRanges object with 0 ranges and 0 metadata columns:
#> start end width
#> <integer> <integer> <integer>
refseq_CDSseq(transcript)
#> DNAStringSet object of length 0
Here, the first 500 nucleotides of the mRNA ‘XM_004487701’:
<- "XM_004487701"
transcript = rentrez::entrez_fetch(db="nuccore", id=transcript, rettype="fasta")
mrna_fasta # take a look at the first 500 chars.
cat(strwrap(substr(mrna_fasta, 1, 500)), sep="\n")
#> >XM_004487701.3 PREDICTED: Cicer arietinum aldehyde dehydrogenase 22A1
#> (LOC101512347), mRNA
#> GTTACCATGTCAACAAAAACTCTCAAGTCACTTTCTATTTGAAGCCGAGAAACCTATTATCTTTATGTCA
#> TGACAATTCCAAAATACATAACCCACATCTTTGCATGAATAGCATCACAATTCCCTAATTTTTTTATAAT
#> ACCCCTTAATCCATTTGTGGTCTACATATCGAAGTAAACCACTACACCCCCACTTTCTCTATAGATCTGT
#> GAGCTCGATCGCAATTTAGTTTGATTGTTACTTTATTTATTTATTAATCTCATTTTATATGTTTTCATTT
#> TCTTCTTGGAACCGATAAAGTCGTAGTTTATTCCTTTCTCAATTTGATGAAAAGTGCAAACTTGGAAAAG
#> AAAACAGGTTCACCTTTGAACTCAAATAAACAAGTACTACAATATCAAAACCC
Here, the first 60 nucleotides of the CDS from the mRNA ‘XM_004487701’:
substr(toString(refseq_CDSseq(transcript)), 1, 60)
#> [1] "ATGGCGTTTTGGTGGTCTTTGCTCGTTCTAGCATTCGCTTTCGCTATCTGCAAGTTCCTT"
As previously said, the function refseq_description
returns the sequence description corresponding to a given accession. The identifier (id) can be either a transcript, protein, or Gene identifier.
<- "XM_004487701"
id refseq_description(id)
Similarly to nucleotide sequences, refseq_protein2RNA
, retrieves the mRNA associated with the provided protein accession.
<- "XP_020244413"
protein refseq_protein2RNA(protein)
Two specific functions prove useful for managing protein accessions: refseq_AAlen
offers the amino acid length of the sequence, while refseq_mol.wt
provides the molecular weight in Daltons.
refseq_AAlen(protein)
refseq_AAmol_wt(protein)
The refseq_AAseq
function, fetches the NCBI data, and returns a DNAString
object with the amino acid sequence.
refseq_AAseq(protein)
#> AAStringSet object of length 0
As previously mentioned, the refseq_description
function ultimately provides the sequence description associated with a given accession. The identifier (id) can take the form of either a transcript, protein, or Gene identifier.
<- "XP_020244413"
id refseq_description(id)
The package refseqR
contains a number of functions to programmatically automatize some common operations.
Functions to apply on GeneID accessions
refseq_description
refseq_fromGene
Functions to apply on transcript id. accessions
refseq_GeneID
refseq_CDScoords
refseq_CDSseq
refseq_RNA2protein
Functions to apply on protein id. accessions
refseq_GeneID
refseq_AAseq
refseq_AAmol_wt
refseq_AAlen
refseq_protein2RNA
I’d really appreciate your feedback. The whole code used in this tutorial is available from my Github repository. You can contact me by email or visit my website.
Córdoba, (Spain), 2024-10-30.