--- title: "Genotype and sequence summaries" author: "Eric Archer" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Genotype and sequence summaries} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r echo = FALSE, message = FALSE} options(digits = 2) library(strataG) ``` There are several by-locus summary functions available for _gtypes_ objects. Given some sample microsatellite data: ```{r} data(msats.g) msats <- stratify(msats.g, "broad") msats <- msats[, getLociNames(msats)[1:4], ] ``` One can calculate the following summaries: The number of alleles at each locus: ```{r} numAlleles(msats) ``` The number of samples with missing data at each locus: ```{r} numMissing(msats) ``` which can also be expressed as a proportion of samples with missing data: ```{r} numMissing(msats, prop = TRUE) ``` The allelic richness, or the average number of alleles per sample: ```{r} allelicRichness(msats) ``` The observed and expected heterozygosity: ```{r} # observed heterozygosity(msats, type = "observed") # expected heterozygosity(msats, type = "expected") ``` The proportion of alleles that are unique (present in only one sample): ```{r} propUniqueAlleles(msats) ``` The value of theta based on heterozygosity: ```{r} theta(msats) ``` These measures are all calculated in the _summarizeLoci_ function and returned as a matrix. This function also allows you to calculate the measures for each stratum separately, which returns a list for each stratum: ```{r} summarizeLoci(msats) summarizeLoci(msats, by.strata = TRUE) ``` One can also obtain the allelic frequencies for each locus overall and by-strata by: ```{r} alleleFreqs(msats) alleleFreqs(msats, by.strata = TRUE) ``` The _dupGenotypes_ function identifies samples that have the same or nearly the same genotypes. The number (or percent) of loci that must be shared in order for it to be considered a duplicate can be set by the _num.shared_ argument. The return data.frame provides which loci the two samples show mismatches at so they can be reviewed. ```{r} # Find samples that share alleles at 2/3rds of the loci dupGenotypes(msats, num.shared = 0.66) ``` The start and end positions and number of N's and indels can be generated with the _summarizeSeqs_ function: ```{r} library(ape) data(dolph.seqs) seq.smry <- summarizeSeqs(as.DNAbin(dolph.seqs)) head(seq.smry) ``` Base frequencies can be generated with _baseFreqs_: ```{r} bf <- baseFreqs(as.DNAbin(dolph.seqs)) # nucleotide frequencies by site bf$site.freq[, 1:15] # overall nucleotide frequencies bf$base.freqs ``` Sequences can be scanned for low-frequency substitutions with _lowFreqSubs_: ```{r} lowFreqSubs(as.DNAbin(dolph.seqs), min.freq = 2) ``` Unusual sequences can be identified by plotting likelihoods based on pairwise distances: ```{r} data(dolph.haps) sequenceLikelihoods(as.DNAbin(dolph.haps)) ``` All of the above functions can be conducted at once with the _qaqc_ function. Only those functions appropriate to the data type contained (haploid or diploid) will be run. Files are written for each output that are labelled either by the _\@description_ slot of the gtypes object or the optional _label_ argument of the function.