--- title: "Package MKomics" author: "Matthias Kohl" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true bibliography: MKomics.bib vignette: > %\VignetteIndexEntry{MKomics} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{utf8} --- ## Introduction Package MKomics includes a collection of functions that I found and find useful for the analysis of omics data. It includes: - similarity plots based on correlation and median absolute deviation (MAD) - adjusting colors for heatmaps - aggregate technical replicates - calculate pairwise fold-changes and log fold-changes - compute one- and two-way ANOVA - simplified interface to package 'limma' [@limma] for moderated t-test and one-way ANOVA - Hamming and the Levenshtein (edit) distance of strings as well as optimal alignment scores for global (Needleman-Wunsch) and local (Smith-Waterman) alignments with constant gap penalties [@Merkl2009]. The package requires Bioconductor package limma, which can be installed via ```{r, eval = FALSE} ## Install package BiocManager install.packages("BiocManager") ## Use BiocManager to install limma BiocManager::install("limma") ``` We first load the package. ```{r} library(MKomics) ``` ## Moderated Tests Based on Package limma The functions to compute the moderated t-test and one-way ANOVA were motivated by the fact that my students had problems to understand and correctly adapt the code of the limma package [@limma]. ### Moderated t-Test We generate some artificial data and compute moderated t-tests. ```{r} ## One-sample test X <- matrix(rnorm(10*20, mean = 1), nrow = 10, ncol = 20) mod.t.test(X) ## Two-sample test set.seed(123) X <- rbind(matrix(rnorm(5*20), nrow = 5, ncol = 20), matrix(rnorm(5*20, mean = 1), nrow = 5, ncol = 20)) g2 <- factor(c(rep("group 1", 10), rep("group 2", 10))) mod.t.test(X, group = g2) ## Paired two-sample test subjID <- factor(rep(1:10, 2)) mod.t.test(X, group = g2, paired = TRUE, subject = subjID) ``` ### Moderated 1-Way ANOVA We generate some artifical data and compute moderated one-way ANOVAs. ```{r} set.seed(123) X <- rbind(matrix(rnorm(5*20), nrow = 5, ncol = 20), matrix(rnorm(5*20, mean = 1), nrow = 5, ncol = 20)) gr <- factor(c(rep("A1", 5), rep("B2", 5), rep("C3", 5), rep("D4", 5))) mod.oneway.test(X, gr) ``` We can also perform a moderated one-way ANOVA with repeated measures. ```{r} X <- rbind(matrix(rnorm(6*18), nrow = 6, ncol = 18), matrix(rnorm(6*18, mean = 1), nrow = 6, ncol = 18)) gr <- factor(c(rep("T1", 6), rep("T2", 6), rep("T3", 6))) subjectID <- factor(c(rep(1:6, 3))) mod.oneway.test(X, gr, repeated = TRUE, subject = subjectID) ``` ### Pairwise moderated t-tests As a optional post-hoc analysis after mod.oneway.test one can use pairwise moderated t-tests (only unpaired). One should carefully think about the adjustment of p values in this context. ```{r} pairwise.mod.t.test(X, gr) ``` ## Pairwise Comparisons Often we are in a situation that we want to compare more than two groups pairwise. ### FC and logFC In the analysis of omics data, the FC or logFC are important measures of effect size and are often used in combination with (adjusted) p values [@Shi2006]. ```{r} x <- rnorm(100) ## assumed as log-data g <- factor(sample(1:4, 100, replace = TRUE)) levels(g) <- c("a", "b", "c", "d") ## modified FC pairwise.fc(x, g) ## "true" FC pairwise.fc(x, g, mod.fc = FALSE) ## without any transformation pairwise.logfc(x, g) ``` The function returns a modified FC. That is, if the FC is smaller than 1 it is transformed to -1/FC. One can also use other functions than the mean for the aggregation of the data. ```{r} pairwise.logfc(x, g, ave = median) ``` ## Aggregating Technical Replicates In case of omics experiments it is often the case that technical replicates are measured and hence it is part of the preprocessing of the raw data to aggregate these technical replicates. This is the purpose of function repMeans. ```{r} M <- matrix(rnorm(100), ncol = 5) FL <- matrix(rpois(100, lambda = 10), ncol = 5) # only for this example repMeans(x = M, flags = FL, use.flags = "max", ndups = 5, spacing = 4) ``` ## 1- and 2-Way ANOVA Functions oneWayAnova and twoWayAnova return a function that can be used to perform a 1- or 2-way ANOVA, respectively. ```{r} af <- oneWayAnova(c(rep(1,5),rep(2,5))) ## p value af(rnorm(10)) x <- matrix(rnorm(12*10), nrow = 10) ## 2-way ANOVA with interaction af1 <- twoWayAnova(c(rep(1,6),rep(2,6)), rep(c(rep(1,3), rep(2,3)), 2)) ## p values apply(x, 1, af1) ## 2-way ANOVA without interaction af2 <- twoWayAnova(c(rep(1,6),rep(2,6)), rep(c(rep(1,3), rep(2,3)), 2), interaction = FALSE) ## p values apply(x, 1, af2) ``` ## Correlation Distance Matrix In the analysis of omics data correlation and absolute correlation distance matrices are often used during quality control. Function corDist can compute the Pearson, Spearman, Kendall or Cosine sample correlation and absolute correlation as well as the minimum covariance determinant or the orthogonalized Gnanadesikan-Kettenring correlation and absolute correlation implemented in package robustbase [@robustbase]. ```{r} M <- matrix(rcauchy(1000), nrow = 5) ## Pearson corDist(M) ## Spearman corDist(M, method = "spearman") ## Minimum Covariance Determinant corDist(M, method = "mcd") ``` ## MAD Matrix In case of outliers the MAD (median absolute deviation = median of the absolute deviations from the median) may be useful as measure of scale. ```{r} madMatrix(t(M)) ``` ## Similarity Matrices First, we can plot a similarity matrix based on correlation. ```{r, fig.width=8, fig.height=7} M <- matrix(rnorm(1000), ncol = 20) colnames(M) <- paste("Sample", 1:20) M.cor <- cor(M) corPlot(M.cor, minCor = min(M.cor), labels = colnames(M)) ``` There is a second implementation based on package \pkg{ComplexHeatmap} [@ComplexHeatmap]. ```{r, fig.width=8, fig.height=7} corPlot2(M.cor, minCor = min(M.cor), labels = colnames(M)) ``` Next, we can use the MAD. ```{r, fig.width=8, fig.height=7} ## random data x <- matrix(rnorm(1000), ncol = 10) ## outliers x[1:20,5] <- x[1:20,5] + 10 madPlot(x, new = TRUE, maxMAD = 2.5, labels = TRUE, title = "MAD: Outlier visible") madPlot2(x, new = TRUE, maxMAD = 2.5, labels = TRUE, title = "MAD: Outlier visible") ## in contrast corPlot2(x, new = TRUE, minCor = -0.5, labels = TRUE, title = "Correlation: Outlier masked") ``` ## Colors for Heatmaps Nowadays there are better solutions e.g. provided by Bioconductor package ComplexHeatmap [@ComplexHeatmap]. This is my solution to get a better coloring of heatmaps. ```{r, fig.width=7, fig.height=9} ## generate some random data data.plot <- matrix(rnorm(100*50, sd = 1), ncol = 50) colnames(data.plot) <- paste("patient", 1:50) rownames(data.plot) <- paste("gene", 1:100) data.plot[1:70, 1:30] <- data.plot[1:70, 1:30] + 3 data.plot[71:100, 31:50] <- data.plot[71:100, 31:50] - 1.4 data.plot[1:70, 31:50] <- rnorm(1400, sd = 1.2) data.plot[71:100, 1:30] <- rnorm(900, sd = 1.2) nrcol <- 128 ## Load required packages library(RColorBrewer) myCol <- rev(colorRampPalette(brewer.pal(10, "RdBu"))(nrcol)) heatmap(data.plot, col = myCol, main = "standard colors") myCol2 <- heatmapCol(data = data.plot, col = myCol, lim = min(abs(range(data.plot)))-1) heatmap(data.plot, col = myCol2, main = "heatmapCol colors") ``` ## String Alignment ### String Distances In Bioinformatics the (pairwise and multiple) alignment of strings is an important topic. For this one can use the distance or similarity between strings. The Hamming and the the Levenshtein (edit) distance are implemented in function stringDist [@Merkl2009]. ```{r} x <- "GACGGATTATG" y <- "GATCGGAATAG" ## Hamming distance stringDist(x, y) ## Levenshtein distance d <- stringDist(x, y) d ``` In case of the Levenshtein (edit) distance, the respective scoring and traceback matrices are attached as attributes to the result. ```{r} attr(d, "ScoringMatrix") attr(d, "TraceBackMatrix") ``` The characters in the trace-back matrix reflect insertion of a gap in string y (d: deletion), match (m), mismatch (mm), and insertion of a gap in string x (i). ### String Similarities The function stringSim computes the optimal alignment scores for global (Needleman-Wunsch) and local (Smith-Waterman) alignments with constant gap penalties [@Merkl2009]. Scoring and trace-back matrix are again attached as attributes to the results. ```{r} ## optimal global alignment score d <- stringSim(x, y) d attr(d, "ScoringMatrix") attr(d, "TraceBackMatrix") ## optimal local alignment score d <- stringSim(x, y, global = FALSE) d attr(d, "ScoringMatrix") attr(d, "TraceBackMatrix") ``` The entry stop indicates that the minimum similarity score has been reached. ### Optimal Alignment Finally, the function traceBack computes an optimal global or local alignment based on a trace back matrix as provided by function stringDist or stringSim [@Merkl2009]. ```{r} x <- "GACGGATTATG" y <- "GATCGGAATAG" ## Levenshtein distance d <- stringDist(x, y) ## optimal global alignment traceBack(d) ## Optimal global alignment score d <- stringSim(x, y) ## optimal global alignment traceBack(d) ## Optimal local alignment score d <- stringSim(x, y, global = FALSE) ## optimal local alignment traceBack(d, global = FALSE) ``` ## sessionInfo ```{r} sessionInfo() ``` ## References