---
title: "Package MKomics"
author: "Matthias Kohl"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette:
    toc: true
bibliography: MKomics.bib
vignette: >
  %\VignetteIndexEntry{MKomics}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{utf8}
---


## Introduction  
Package MKomics includes a collection of functions that I found and find useful 
for the analysis of omics data. It includes:

- similarity plots based on correlation and median absolute deviation (MAD) 

- adjusting colors for heatmaps 

- aggregate technical replicates 

- calculate pairwise fold-changes and log fold-changes 

- compute one- and two-way ANOVA 

- simplified interface to package 'limma' [@limma] for moderated t-test and 
one-way ANOVA 

- Hamming and the Levenshtein (edit) distance of strings as well as optimal 
alignment scores for global (Needleman-Wunsch) and local (Smith-Waterman) 
alignments with constant gap penalties [@Merkl2009].


The package requires Bioconductor package limma, which can be installed via
```{r, eval = FALSE}
## Install package BiocManager
install.packages("BiocManager")
## Use BiocManager to install limma
BiocManager::install("limma")
```

We first load the package.
```{r}
library(MKomics)
```


## Moderated Tests Based on Package limma
The functions to compute the moderated t-test and one-way ANOVA were motivated 
by the fact that my students had problems to understand and correctly adapt the 
code of the limma package [@limma].

### Moderated t-Test
We generate some artificial data and compute moderated t-tests.
```{r}
## One-sample test
X <- matrix(rnorm(10*20, mean = 1), nrow = 10, ncol = 20)
mod.t.test(X)

## Two-sample test
set.seed(123)
X <- rbind(matrix(rnorm(5*20), nrow = 5, ncol = 20),
           matrix(rnorm(5*20, mean = 1), nrow = 5, ncol = 20))
g2 <- factor(c(rep("group 1", 10), rep("group 2", 10)))
mod.t.test(X, group = g2)

## Paired two-sample test
subjID <- factor(rep(1:10, 2))
mod.t.test(X, group = g2, paired = TRUE, subject = subjID)
```


### Moderated 1-Way ANOVA
We generate some artifical data and compute moderated one-way ANOVAs.
```{r}
set.seed(123)
X <- rbind(matrix(rnorm(5*20), nrow = 5, ncol = 20),
           matrix(rnorm(5*20, mean = 1), nrow = 5, ncol = 20))
gr <- factor(c(rep("A1", 5), rep("B2", 5), rep("C3", 5), rep("D4", 5)))
mod.oneway.test(X, gr)
```

We can also perform a moderated one-way ANOVA with repeated measures.
```{r}
X <- rbind(matrix(rnorm(6*18), nrow = 6, ncol = 18),
           matrix(rnorm(6*18, mean = 1), nrow = 6, ncol = 18))
gr <- factor(c(rep("T1", 6), rep("T2", 6), rep("T3", 6)))
subjectID <- factor(c(rep(1:6, 3)))
mod.oneway.test(X, gr, repeated = TRUE, subject = subjectID)
```


### Pairwise moderated t-tests
As a optional post-hoc analysis after mod.oneway.test one can use pairwise
moderated t-tests (only unpaired). One should carefully think about the 
adjustment of p values in this context.
```{r}
pairwise.mod.t.test(X, gr)
```


## Pairwise Comparisons
Often we are in a situation that we want to compare more than two groups 
pairwise. 

### FC and logFC
In the analysis of omics data, the FC or logFC are important 
measures of effect size and are often used in combination with (adjusted) 
p values [@Shi2006].
```{r}
x <- rnorm(100) ## assumed as log-data
g <- factor(sample(1:4, 100, replace = TRUE))
levels(g) <- c("a", "b", "c", "d")
## modified FC
pairwise.fc(x, g)
## "true" FC
pairwise.fc(x, g, mod.fc = FALSE)
## without any transformation
pairwise.logfc(x, g)
```

The function returns a modified FC. That is, if the FC is smaller than 1 it is
transformed to -1/FC. One can also use other functions than the mean for the
aggregation of the data.
```{r}
pairwise.logfc(x, g, ave = median)
```


## Aggregating Technical Replicates
In case of omics experiments it is often the case that technical replicates 
are measured and hence it is part of the preprocessing of the raw data to 
aggregate these technical replicates. This is the purpose of function repMeans.
```{r}
M <- matrix(rnorm(100), ncol = 5)
FL <- matrix(rpois(100, lambda = 10), ncol = 5) # only for this example
repMeans(x = M, flags = FL, use.flags = "max", ndups = 5, spacing = 4)
```


## 1- and 2-Way ANOVA
Functions oneWayAnova and twoWayAnova return a function that can be used
to perform a 1- or 2-way ANOVA, respectively.
```{r}
af <- oneWayAnova(c(rep(1,5),rep(2,5)))
## p value
af(rnorm(10))
x <- matrix(rnorm(12*10), nrow = 10)
## 2-way ANOVA with interaction
af1 <- twoWayAnova(c(rep(1,6),rep(2,6)), rep(c(rep(1,3), rep(2,3)), 2))
## p values
apply(x, 1, af1)
## 2-way ANOVA without interaction
af2 <- twoWayAnova(c(rep(1,6),rep(2,6)), rep(c(rep(1,3), rep(2,3)), 2), 
                   interaction = FALSE)
## p values
apply(x, 1, af2)
```


## Correlation Distance Matrix
In the analysis of omics data correlation and absolute correlation distance
matrices are often used during quality control. Function corDist can compute
the Pearson, Spearman, Kendall or Cosine sample correlation and absolute correlation
as well as the minimum covariance determinant or the orthogonalized 
Gnanadesikan-Kettenring correlation and absolute correlation implemented in
package robustbase [@robustbase].
```{r}
M <- matrix(rcauchy(1000), nrow = 5)
## Pearson
corDist(M)
## Spearman
corDist(M, method = "spearman")
## Minimum Covariance Determinant
corDist(M, method = "mcd")
```


## MAD Matrix
In case of outliers the MAD (median absolute deviation = median of the absolute 
deviations from the median) may be useful as measure of scale.
```{r}
madMatrix(t(M))
```


## Similarity Matrices
First, we can plot a similarity matrix based on correlation.
```{r, fig.width=8, fig.height=7}
M <- matrix(rnorm(1000), ncol = 20)
colnames(M) <- paste("Sample", 1:20)
M.cor <- cor(M)
corPlot(M.cor, minCor = min(M.cor), labels = colnames(M))
```

There is a second implementation based on package \pkg{ComplexHeatmap} [@ComplexHeatmap].
```{r, fig.width=8, fig.height=7}
corPlot2(M.cor, minCor = min(M.cor), labels = colnames(M))
```

Next, we can use the MAD.
```{r, fig.width=8, fig.height=7}
## random data
x <- matrix(rnorm(1000), ncol = 10)
## outliers
x[1:20,5] <- x[1:20,5] + 10
madPlot(x, new = TRUE, maxMAD = 2.5, labels = TRUE,
        title = "MAD: Outlier visible")
madPlot2(x, new = TRUE, maxMAD = 2.5, labels = TRUE,
        title = "MAD: Outlier visible")
## in contrast
corPlot2(x, new = TRUE, minCor = -0.5, labels = TRUE,
        title = "Correlation: Outlier masked")
```


## Colors for Heatmaps
Nowadays there are better solutions e.g. provided by Bioconductor package 
ComplexHeatmap [@ComplexHeatmap]. This is my solution to get a better 
coloring of heatmaps.
```{r, fig.width=7, fig.height=9}
## generate some random data
data.plot <- matrix(rnorm(100*50, sd = 1), ncol = 50)
colnames(data.plot) <- paste("patient", 1:50)
rownames(data.plot) <- paste("gene", 1:100)
data.plot[1:70, 1:30] <- data.plot[1:70, 1:30] + 3
data.plot[71:100, 31:50] <- data.plot[71:100, 31:50] - 1.4
data.plot[1:70, 31:50] <- rnorm(1400, sd = 1.2)
data.plot[71:100, 1:30] <- rnorm(900, sd = 1.2)
nrcol <- 128
## Load required packages
library(RColorBrewer)
myCol <- rev(colorRampPalette(brewer.pal(10, "RdBu"))(nrcol))
heatmap(data.plot, col =  myCol, main = "standard colors")
myCol2 <- heatmapCol(data = data.plot, col = myCol, 
                     lim = min(abs(range(data.plot)))-1)
heatmap(data.plot, col = myCol2, main = "heatmapCol colors")
```


## String Alignment
### String Distances
In Bioinformatics the (pairwise and multiple) alignment of strings is an 
important topic. For this one can use the distance or similarity between 
strings. The Hamming and the the Levenshtein (edit) distance are 
implemented in function stringDist [@Merkl2009].
```{r}
x <- "GACGGATTATG"
y <- "GATCGGAATAG"
## Hamming distance
stringDist(x, y)
## Levenshtein distance
d <- stringDist(x, y)
d
```

In case of the Levenshtein (edit) distance, the respective scoring and traceback 
matrices are attached as attributes to the result.
```{r}
attr(d, "ScoringMatrix")
attr(d, "TraceBackMatrix")
```
The characters in the trace-back matrix reflect insertion of a gap in string y 
(d: deletion), match (m), mismatch (mm), and insertion of a gap in string x (i). 

### String Similarities
The function stringSim computes the optimal alignment scores for global 
(Needleman-Wunsch) and local (Smith-Waterman) alignments with constant gap 
penalties [@Merkl2009]. Scoring and trace-back matrix are again attached as 
attributes to the results. 
```{r}
## optimal global alignment score
d <- stringSim(x, y)
d
attr(d, "ScoringMatrix")
attr(d, "TraceBackMatrix")

## optimal local alignment score
d <- stringSim(x, y, global = FALSE)
d
attr(d, "ScoringMatrix")
attr(d, "TraceBackMatrix")
```

The entry stop indicates that the minimum similarity score has been reached.


### Optimal Alignment
Finally, the function traceBack computes an optimal global or local alignment 
based on a trace back matrix as provided by function stringDist or 
stringSim [@Merkl2009].
```{r}
x <- "GACGGATTATG"
y <- "GATCGGAATAG"
## Levenshtein distance
d <- stringDist(x, y)
## optimal global alignment
traceBack(d)

## Optimal global alignment score
d <- stringSim(x, y)
## optimal global alignment
traceBack(d)

## Optimal local alignment score
d <- stringSim(x, y, global = FALSE)
## optimal local alignment
traceBack(d, global = FALSE)
```


## sessionInfo
```{r}
sessionInfo()
```


## References