---
title: "Working With Sequences"
author: "Eric Archer"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Working with sequences}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---
```{r echo = FALSE, message = FALSE}
options(digits = 2)
library(strataG)
```
There are several functions for working with sequence data in strataG. They will either take a haploid _gtypes_ object that contains sequences or some format that can be converted to a _DNAbin_ (in the _ape_ package), or _multidna_ (in the _apex_ package) object.

Leading and trailing N's can be removed from all sequences like this:
```{r}
library(ape)
data(dolph.seqs)

i <- sample(1:10, 1)
j <- sample(1:10, 1)
x <- c(rep("n", i), dolph.seqs[[1]], rep("n", j))
x
x.trimmed <- trimNs(as.DNAbin(x))
as.character(as.list(x.trimmed))
```

Base frequencies for a sequence are calculated with the _baseFreqs_ function:
```{r}
bf <- baseFreqs(dolph.seqs)
bf$site.freqs[, 1:8]
bf$base.freqs
```

One can also identify which sites are fixed and which are variable:
```{r}
fs <- fixedSites(dolph.seqs)
fs[1:20]

vs <- variableSites(dolph.seqs)
vs
```

Both functions take an optional set of bases to consider when evaluating whether a site is fixed or variable. For _fixedSites_, the function will only count those sites that are fixed in the listed _bases_ argument. For _variableSites_ the site is considered variable if it has those bases and is not fixed for them:
```{r}
fs <- fixedSites(dolph.seqs, bases = c("c", "t"))
fs[1:20]

vs <- variableSites(dolph.seqs, bases = c("c", "t"))
vs
```

There are also functions to compare bases against IUPAC ambiguity codes. One can calculate the appropriate IUPAC code for a vector of nucleotides:
```{r}
iupacCode(c("c", "t", "t", "c", "c"))
iupacCode(c("c", "t", "a", "c", "c"))
iupacCode(c("g", "t", "a", "c", "c"))
```

One can also calculate all IUPAC codes that apply to a vector of nucleotides:
```{r}
validIupacCodes(c("c", "t", "t", "c", "c"))
validIupacCodes(c("c", "t", "a", "c", "c"))
validIupacCodes(c("g", "t", "a", "c", "c"))
```

A consensus sequence can also be easily generated:
```{r}
createConsensus(dolph.seqs)
```

Nucleotide diversity for each site is calculaed with:
```{r}
nd <- nucleotideDiversity(dolph.seqs)
head(nd)
```

For a stratified _gtypes_ object, one can calculate net nucleotide divergence (Nei's dA), and distributions of between- and within-strata divergence:
```{r}
# create gtypes
data(dolph.seqs)
data(dolph.strata)
rownames(dolph.strata) <- dolph.strata$id
dloop <- df2gtypes(dolph.strata[, c("id", "fine", "id")], ploidy = 1,
             schemes = dolph.strata[, c("fine", "broad")], sequences = dolph.seqs)
dloop <- labelHaplotypes(dloop, "Hap.")

# calculate divergence
nucleotideDivergence(dloop)
```

For stratified _gtypes_, one can also identify fixed differences between strata:
```{r}
fixedDifferences(dloop)
```

Two functions have been provided to select a subset of representative sequences. The first selects the most distant sequences in order to capture the full distribution of variation. For example:
```{r}
x <- as.DNAbin(dolph.seqs)
mostDistantSequences(x, num.seqs = 5)
```

The other function selects the most representative sequences by first clustering the sequences and selecting the sequences closest to the center of each cluster:
```{r}
mostRepresentativeSequences(x, num.seqs = 5)
```