Title: | Miscellaneous Functions from M. Kohl |
---|---|
Description: | Contains several functions for statistical data analysis; e.g. for sample size and power calculations, computation of confidence intervals and tests, and generation of similarity matrices. |
Authors: | Matthias Kohl [aut, cre] (0000-0001-9514-8910) |
Maintainer: | Matthias Kohl <[email protected]> |
License: | LGPL-3 |
Version: | 1.9 |
Built: | 2024-11-09 03:53:01 UTC |
Source: | https://github.com/stamats/MKmisc |
Contains several functions for statistical data analysis; e.g. for sample size and power calculations, computation of confidence intervals, and generation of similarity matrices.
Package: | MKmisc |
Type: | Package |
Version: | 1.9 |
Date: | 2022-11-19 |
Depends: | R(>= 3.5.0) |
Imports: | stats, utils, graphics, grDevices, RColorBrewer, robustbase, ggplot2, scales, limma |
Suggests: | gplots, Amelia, knitr, rmarkdown, exactRankTests, foreach, parallel, doParallel |
License: | LGPL-3 |
URL: | https://github.com/stamats/MKmisc |
library(MKmisc)
Matthias Kohl https://www.stamats.de
Maintainer: Matthias Kohl [email protected]
The function computes AUC.
AUC(x, y, group, switchAUC = TRUE)
x |
numeric vector. |
y |
numeric vector. If missing, group has to be specified. |
group |
grouping vector or factor. |
switchAUC |
logical value. Switch AUC; see Details section. |
The function computes the area under the receiver operating characteristic curve (AUC under ROC curve).
If AUC < 0.5, a warning is printed and 1-AUC is returned. This behaviour can be suppressed by using switchAUC = FALSE.
The implementation uses the connection of AUC to the Wilcoxon rank sum test; see Hanley and McNeil (1982).
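This connection can be checked directly; a minimal sketch, assuming the Mann-Whitney representation of the Wilcoxon statistic (taking max(auc, 1 - auc) mimics the default switching behaviour):
set.seed(13)
x <- rnorm(100)
g <- sample(1:2, 100, replace = TRUE)
## Mann-Whitney statistic W of group 1 vs. group 2
W <- as.numeric(wilcox.test(x[g == 1], x[g == 2])$statistic)
auc <- W / (sum(g == 1) * sum(g == 2))
max(auc, 1 - auc)  ## should agree with AUC(x, group = g)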
AUC value.
Matthias Kohl [email protected]
J. A. Hanley and B. J. McNeil (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143, 29-36.
set.seed(13)
x <- rnorm(100) ## assumed as log2-data
g <- sample(1:2, 100, replace = TRUE)
AUC(x, group = g)
## avoid switching AUC
AUC(x, group = g, switchAUC = FALSE)
Performs tests for one and two AUCs.
AUC.test(pred1, lab1, pred2, lab2, conf.level = 0.95, paired = FALSE)
pred1 |
numeric vector. |
lab1 |
grouping vector or factor for pred1. |
pred2 |
numeric vector. |
lab2 |
grouping vector or factor for pred2. |
conf.level |
confidence level of the interval. |
paired |
not yet implemented. |
If pred2 and lab2 are missing, the AUC for pred1 and lab1 is tested using the Wilcoxon signed rank test; see wilcox.test.
If pred1 and lab1 as well as pred2 and lab2 are specified, the Hanley and McNeil test (cf. Hanley and McNeil (1982)) is computed.
A list with AUC, SE and confidence interval as well as the corresponding test result.
Matthias Kohl [email protected]
J. A. Hanley and B. J. McNeil (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143, 29-36.
set.seed(13)
x <- rnorm(100) ## assumed as log2-data
g <- sample(1:2, 100, replace = TRUE)
AUC.test(x, g)
y <- rnorm(100) ## assumed as log2-data
h <- sample(1:2, 100, replace = TRUE)
AUC.test(x, g, y, h)
This function can be used to compute confidence intervals for binomial proportions.
binomCI(x, n, conf.level = 0.95, method = "wilson", rand = 123)
x |
number of successes |
n |
number of trials |
conf.level |
confidence level |
method |
character string specifying which method to use; see details. |
rand |
seed for random number generator; see details. |
The Wald interval is obtained by inverting the acceptance region of the Wald large-sample normal test.
The Wilson interval, which is the default, was introduced by Wilson (1927) and is the inversion of the CLT approximation to the family of equal tail tests of p = p0. The Wilson interval is recommended by Agresti and Coull (1998) as well as by Brown et al (2001).
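For illustration, the Wilson interval can be computed from its closed form; a minimal sketch for x = 42 successes in n = 43 trials, to be compared with binomCI(x = 42, n = 43, method = "wilson"):
x <- 42; n <- 43
p.hat <- x/n
z <- qnorm(0.975)  ## 95% confidence level
center <- (p.hat + z^2/(2*n)) / (1 + z^2/n)
halfwidth <- z/(1 + z^2/n) * sqrt(p.hat*(1 - p.hat)/n + z^2/(4*n^2))
c(lower = center - halfwidth, upper = center + halfwidth)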
The Agresti-Coull interval was proposed by Agresti and Coull (1998) and is a slight modification of the Wilson interval. The Agresti-Coull intervals are never shorter than the Wilson intervals; cf. Brown et al (2001).
The Jeffreys interval is an implementation of the equal-tailed Jeffreys prior interval as given in Brown et al (2001).
The modified Wilson interval is a modification of the Wilson interval for x close to 0 or n as proposed by Brown et al (2001).
The modified Jeffreys interval is a modification of the Jeffreys interval for x == 0 | x == 1 and x == n-1 | x == n as proposed by Brown et al (2001).
The Clopper-Pearson interval is based on quantiles of corresponding beta distributions. This is sometimes also called the exact interval.
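The beta-quantile representation of the Clopper-Pearson interval can be sketched directly:
x <- 42; n <- 43; alpha <- 0.05
c(lower = qbeta(alpha/2, x, n - x + 1),
  upper = qbeta(1 - alpha/2, x + 1, n - x))
## to be compared with binomCI(x = 42, n = 43, method = "clopper-pearson")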
The arcsine interval is based on the variance stabilizing transformation for the binomial distribution.
The logit interval is obtained by inverting the Wald type interval for the log odds.
The Witting interval (cf. Beispiel 2.106 in Witting (1985)) uses randomization to obtain uniformly optimal lower and upper confidence bounds (cf. Satz 2.105 in Witting (1985)) for binomial proportions.
For more details we refer to Brown et al (2001) as well as Witting (1985).
A list with class "confint" containing the following components:
estimate |
the estimated probability of success. |
conf.int |
a confidence interval for the probability of success. |
A first version of this function appeared in R package SLmisc.
Matthias Kohl [email protected]
A. Agresti and B.A. Coull (1998). Approximate is better than "exact" for interval estimation of binomial proportions. American Statistician, 52, 119-126.
L.D. Brown, T.T. Cai and A. Dasgupta (2001). Interval estimation for a binomial proportion. Statistical Science, 16(2), 101-133.
H. Witting (1985). Mathematische Statistik I. Stuttgart: Teubner.
binomCI(x = 42, n = 43, method = "wald")
binomCI(x = 42, n = 43, method = "wilson")
binomCI(x = 42, n = 43, method = "agresti-coull")
binomCI(x = 42, n = 43, method = "jeffreys")
binomCI(x = 42, n = 43, method = "modified wilson")
binomCI(x = 42, n = 43, method = "modified jeffreys")
binomCI(x = 42, n = 43, method = "clopper-pearson")
binomCI(x = 42, n = 43, method = "arcsine")
binomCI(x = 42, n = 43, method = "logit")
binomCI(x = 42, n = 43, method = "witting")
## the confidence interval computed by binom.test
## corresponds to the Clopper-Pearson interval
binomCI(x = 42, n = 43, method = "clopper-pearson")$conf.int
binom.test(x = 42, n = 43)$conf.int
The function computes and returns the correlation and absolute correlation distance matrix, using the specified distance measure to compute the distances between the rows of a data matrix.
corDist(x, method = "pearson", diag = FALSE, upper = FALSE, abs = FALSE, use = "pairwise.complete.obs", ...)
x |
a numeric matrix or data frame |
method |
the correlation distance measure to be used. This must be one of "pearson", "spearman", "kendall", "cosine", "mcd" or "ogk"; see Details. |
diag |
logical value indicating whether the diagonal of the distance matrix should be printed by 'print.dist'. |
upper |
logical value indicating whether the upper triangle of the distance matrix should be printed by 'print.dist'. |
abs |
logical, compute absolute correlation distances |
use |
character, corresponds to argument use of function cor; see Details. |
... |
further arguments to functions covMcd and covOGK, respectively. |
The function computes the Pearson, Spearman, Kendall or Cosine sample correlation and absolute correlation; confer Section 12.2.2 of Gentleman et al (2005). For more details about the arguments we refer to functions dist and cor. Moreover, the function computes the minimum covariance determinant or the orthogonalized Gnanadesikan-Kettenring estimator. For more details we refer to functions covMcd and covOGK, respectively.
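As an illustration of the default measure, the Pearson correlation distance can be sketched directly via cor; this assumes the definition 1 - r (and 1 - |r| for abs = TRUE):
M <- matrix(rnorm(1000), ncol = 20)
## correlations between the rows of M, hence cor(t(M))
r <- cor(t(M), use = "pairwise.complete.obs")
D.cor <- as.dist(1 - r)       ## correlation distance
D.abs <- as.dist(1 - abs(r))  ## absolute correlation distance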
corDist returns an object of class "dist"; cf. dist.
A first version of this function appeared in package SLmisc.
Matthias Kohl [email protected]
Gentleman R. Ding B., Dudoit S. and Ibrahim J. (2005). Distance Measures in DNA Microarray Data Analysis. In: Gentleman R., Carey V.J., Huber W., Irizarry R.A. and Dudoit S. (editors) Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Springer.
P. J. Rousseeuw and A. M. Leroy (1987). Robust Regression and Outlier Detection. Wiley.
P. J. Rousseeuw and K. van Driessen (1999) A fast algorithm for the minimum covariance determinant estimator. Technometrics 41, 212-223.
Pison, G., Van Aelst, S., and Willems, G. (2002), Small Sample Corrections for LTS and MCD, Metrika, 55, 111-123.
Maronna, R.A. and Zamar, R.H. (2002). Robust estimates of location and dispersion of high-dimensional datasets; Technometrics 44(4), 307-317.
Gnanadesikan, R. and John R. Kettenring (1972). Robust estimates, residuals, and outlier detection with multiresponse data. Biometrics 28, 81-124.
## only a dummy example
M <- matrix(rnorm(1000), ncol = 20)
D <- corDist(M)
Plot of similarity matrix. This function is a slight modification of function plot.cor of the archived package "sma".
corPlot(x, new = FALSE, col, minCor, labels = FALSE, lab.both.axes = FALSE, labcols = "black", title = "", cex.title = 1.2, protocol = FALSE, cex.axis = 0.8, cex.axis.bar = 1, signifBar = 2, ...)
x |
data or correlation matrix, respectively |
new |
logical. If TRUE, a new graphics device is opened. |
col |
color palette for image. If missing, a default palette is used. |
minCor |
numeric value in [-1,1], used to adjust the color scale; see Details. |
labels |
vector of character strings to be placed at the tick points, labels for the columns of x. |
lab.both.axes |
logical, display labels on both axes |
labcols |
colors to be used for the labels of the columns of x. |
title |
character string, overall title for the plot. |
cex.title |
A numerical value giving the amount by which plotting text and symbols should be magnified relative to the default; cf. par. |
protocol |
logical, display color bar without numbers |
cex.axis |
The magnification to be used for axis annotation relative to the current setting of 'cex'; cf. par. |
cex.axis.bar |
The magnification to be used for axis annotation of the color bar relative to the current setting of 'cex'; cf. par. |
signifBar |
integer indicating the precision to be used for the bar. |
... |
graphical parameters may also be supplied as arguments to the function (see par). |
This function generates the so-called similarity matrix (based on correlation) for a microarray experiment.
If min(x), respectively min(cor(x)), is smaller than minCor, the colors in col are adjusted such that the minimum correlation value which is color coded is equal to minCor.
invisible()
A first version of this function appeared in package SLmisc.
Matthias Kohl [email protected]
Sandrine Dudoit, Yee Hwa (Jean) Yang, Benjamin Milo Bolstad and with
contributions from Natalie Thorne, Ingrid Loennstedt and Jessica Mar.
sma: Statistical Microarray Analysis.
http://www.stat.berkeley.edu/users/terry/zarray/Software/smacode.html
## only a dummy example
M <- matrix(rnorm(1000), ncol = 20)
colnames(M) <- paste("Sample", 1:20)
M.cor <- cor(M)
corPlot(M.cor, minCor = min(M.cor))
corPlot(M.cor, minCor = min(M.cor), lab.both.axes = TRUE)
corPlot(M.cor, minCor = min(M.cor), protocol = TRUE)
corPlot(M.cor, minCor = min(M.cor), signifBar = 1)
The functions compute the coefficient of variation (CV) as well as two robust versions of the CV.
CV(x, na.rm = FALSE)
x |
numeric vector. |
na.rm |
logical. Should missing values be removed? |
The functions compute the (classical) coefficient of variation as well as two robust variants. medCV uses the (standardized) MAD instead of the SD and the median instead of the mean. iqrCV uses the (standardized) IQR instead of the SD and the median instead of the mean.
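A minimal sketch of the three estimators under the standardizations stated above (mad uses the consistency constant 1/qnorm(0.75) by default; the standardized IQR divides by 2*qnorm(0.75)):
CV.man    <- function(x) sd(x)/mean(x)
medCV.man <- function(x) mad(x)/median(x)  ## standardized MAD / median
iqrCV.man <- function(x) (IQR(x)/(2*qnorm(0.75)))/median(x)  ## standardized IQR / median
x <- rnorm(100, mean = 10, sd = 2)
c(CV.man(x), medCV.man(x), iqrCV.man(x))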
CV value.
Matthias Kohl [email protected]
C.N.P.G. Arachchige, L.A. Prendergast and R.G. Staudte. Robust analogues to the Coefficient of Variation. https://arxiv.org/abs/1907.01110.
## 5% outliers
out <- rbinom(100, prob = 0.05, size = 1)
sum(out)
x <- (1-out)*rnorm(100, mean = 10, sd = 2) + out*25
CV(x)
medCV(x)
iqrCV(x)
This function can be used to compute confidence intervals for the (classical) coefficient of variation.
cvCI(x, conf.level = 0.95, method = "miller", na.rm = FALSE)
x |
numeric vector. |
conf.level |
confidence level |
method |
character string specifying which method to use; see details. |
na.rm |
logical. Should missing values be removed? |
For details about the confidence intervals we refer to Gulhar et al (2012) and Arachchige et al (2019).
A list with class "confint" containing the following components:
estimate |
the estimated coefficient of variation. |
conf.int |
a confidence interval for the coefficient of variation. |
Matthias Kohl [email protected]
C.N.P.G. Arachchige, L.A. Prendergast and R.G. Staudte (2019). Robust analogues to the Coefficient of Variation. https://arxiv.org/abs/1907.01110.
M. Gulhar, G. Kibria, A. Albatineh, N.U. Ahmed (2012). A comparison of some confidence intervals for estimating the population coefficient of variation: a simulation study. Sort, 36(1), 45-69.
x <- rnorm(100, mean = 10, sd = 2) # CV = 0.2
cvCI(x, method = "miller")
cvCI(x, method = "sharma")
cvCI(x, method = "curto")
cvCI(x, method = "mckay")
cvCI(x, method = "vangel")
cvCI(x, method = "panichkitkosolkul")
cvCI(x, method = "medmiller")
cvCI(x, method = "medmckay")
cvCI(x, method = "medvangel")
cvCI(x, method = "medcurto")
cvCI(x, method = "gulhar")
Function to compute five-number summaries (minimum, 1st quartile, median, 3rd quartile, maximum).
fiveNS(x, na.rm = TRUE, type = 7)
x |
numeric vector |
na.rm |
logical; remove NA values before computation. |
type |
an integer between 1 and 9 selecting one of nine quantile algorithms; for more details see quantile. |
In contrast to fivenum, the function computes the first and third quartile using function quantile.
A numeric vector of length 5 containing the summary information.
Matthias Kohl [email protected]
x <- rnorm(100)
fiveNS(x)
fiveNS(x, type = 2)
fivenum(x)
The functions compute the generalized logarithm, which is more or less identical to the area hyperbolic sine, and their inverse; see details.
glog(x, base = exp(1))
glog10(x)
glog2(x)
inv.glog(x, base = exp(1))
inv.glog10(x)
inv.glog2(x)
x |
a numeric or complex vector. |
base |
a positive or complex number: the base with respect to which logarithms are computed. Defaults to e = exp(1). |
The function computes glog(x) = log(x + sqrt(x^2 + 1)) - log(2), where the first part corresponds to the area hyperbolic sine (asinh). Subtracting log(2) makes the function asymptotically identical to the logarithm.
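A small check of this representation; the general-base form is an assumption derived from the asymptotic statement above:
glog.man <- function(x, base = exp(1))
  log(x + sqrt(x^2 + 1), base = base) - log(2, base = base)
## identical to asinh(x) - log(2) for the natural base
all.equal(glog.man(-3:5), asinh(-3:5) - log(2))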
A vector of the same length as x containing the transformed values.
Matthias Kohl [email protected]
curve(log, from = -3, to = 5)
curve(glog, from = -3, to = 5, add = TRUE, col = "orange")
legend("topleft", fill = c("black", "orange"), legend = c("log", "glog"))
curve(log10(x), from = -3, to = 5)
curve(glog10(x), from = -3, to = 5, add = TRUE, col = "orange")
legend("topleft", fill = c("black", "orange"), legend = c("log10", "glog10"))
inv.glog(glog(10))
inv.glog(glog(10, base = 3), base = 3)
inv.glog10(glog10(10))
inv.glog2(glog2(10))
This function modifies a given color vector as used for heatmaps.
heatmapCol(data, col, lim, na.rm = TRUE)
data |
matrix or data.frame; data which shall be displayed in a heatmap; ranging from negative to positive numbers. |
col |
vector of colors used for heatmap. |
lim |
constant colors are used for data below -lim and above lim; see Details. |
na.rm |
logical; remove NA values. |
Colors below and above a specified value are kept constant. In addition, the colors are symmetrized.
vector of colors
A first version of this function appeared in package SLmisc.
Matthias Kohl [email protected]
data.plot <- matrix(rnorm(100*50, sd = 1), ncol = 50)
colnames(data.plot) <- paste("patient", 1:50)
rownames(data.plot) <- paste("gene", 1:100)
data.plot[1:70, 1:30] <- data.plot[1:70, 1:30] + 3
data.plot[71:100, 31:50] <- data.plot[71:100, 31:50] - 1.4
data.plot[1:70, 31:50] <- rnorm(1400, sd = 1.2)
data.plot[71:100, 1:30] <- rnorm(900, sd = 1.2)
nrcol <- 128
require(gplots)
require(RColorBrewer)
myCol <- rev(colorRampPalette(brewer.pal(10, "RdBu"))(nrcol))
heatmap.2(data.plot, col = myCol, trace = "none", tracecol = "black")
farbe <- heatmapCol(data = data.plot, col = myCol,
                    lim = min(abs(range(data.plot)))-1)
heatmap.2(data.plot, col = farbe, trace = "none", tracecol = "black")
The function computes Hosmer-Lemeshow goodness of fit tests for C and H statistic as well as the le Cessie-van Houwelingen-Copas-Hosmer unweighted sum of squares test for global goodness of fit.
HLgof.test(fit, obs, ngr = 10, X, verbose = FALSE)
fit |
numeric vector with fitted probabilities. |
obs |
numeric vector with observed values. |
ngr |
number of groups for C and H statistic. |
X |
covariate(s) for le Cessie-van Houwelingen-Copas-Hosmer global goodness of fit test. |
verbose |
logical, print intermediate results. |
Hosmer-Lemeshow goodness of fit tests are computed; see Lemeshow and Hosmer (1982).
If X is specified, the le Cessie-van Houwelingen-Copas-Hosmer unweighted sum of squares test for global goodness of fit is additionally determined; see Hosmer et al. (1997). A more general version of this test is implemented in function residuals.lrm in package rms.
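A minimal sketch of the C statistic, assuming grouping by deciles of the fitted probabilities; the helper hl.C and the chi-square reference with ngr - 2 degrees of freedom are illustrative:
hl.C <- function(fit, obs, ngr = 10){
  ## groups defined by quantiles of the fitted probabilities
  grp <- cut(fit, breaks = quantile(fit, probs = seq(0, 1, length.out = ngr + 1)),
             include.lowest = TRUE)
  O <- tapply(obs, grp, sum)    ## observed events per group
  E <- tapply(fit, grp, sum)    ## expected events per group
  n <- tapply(fit, grp, length)
  C <- sum((O - E)^2/(E*(1 - E/n)))
  pchisq(C, df = ngr - 2, lower.tail = FALSE)  ## p-value
}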
A list of test results.
Matthias Kohl [email protected]
S. Lemeshow and D.W. Hosmer (1982). A review of goodness of fit statistics for use in the development of logistic regression models. American Journal of Epidemiology, 115(1), 92-106.
D.W. Hosmer, T. Hosmer, S. le Cessie, S. Lemeshow (1997). A comparison of goodness-of-fit tests for the logistic regression model. Statistics in Medicine, 16, 965-980.
set.seed(111)
x1 <- factor(sample(1:3, 50, replace = TRUE))
x2 <- rnorm(50)
obs <- sample(c(0,1), 50, replace = TRUE)
fit <- glm(obs ~ x1+x2, family = binomial)
HLgof.test(fit = fitted(fit), obs = obs)
HLgof.test(fit = fitted(fit), obs = obs, X = model.matrix(obs ~ x1+x2))
Performs Hsu two sample t-tests on vectors of data.
hsu.t.test(x, ...)
## Default S3 method:
hsu.t.test(x, y, alternative = c("two.sided", "less", "greater"),
           mu = 0, conf.level = 0.95, ...)
## S3 method for class 'formula':
hsu.t.test(formula, data, subset, na.action, ...)
x |
a (non-empty) numeric vector of data values. |
y |
a (non-empty) numeric vector of data values. |
alternative |
a character string specifying the alternative hypothesis, must be one of "two.sided" (default), "greater" or "less". You can specify just the initial letter. |
mu |
a number indicating the true value of the mean (or difference in means if you are performing a two sample test). |
conf.level |
confidence level of the interval. |
formula |
a formula of the form lhs ~ rhs where lhs is a numeric variable giving the data values and rhs a factor with two levels giving the corresponding groups. |
data |
an optional matrix or data frame (or similar: see model.frame) containing the variables in the formula. By default the variables are taken from environment(formula). |
subset |
an optional vector specifying a subset of observations to be used. |
na.action |
a function which indicates what should happen when the data contain NAs. Defaults to getOption("na.action"). |
... |
further arguments to be passed to or from methods. |
The function and its documentation were adapted from t.test. alternative = "greater" is the alternative that x has a larger mean than y. If the input data are effectively constant (compared to the larger of the two means) an error is generated. One should have at least six observations per group to apply the test; see Section 6.8.3 of Hedderich and Sachs (2016).
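A minimal sketch of the assumed construction, the Welch t statistic evaluated with min(n1, n2) - 1 degrees of freedom (cf. the Details of normDiffCI):
x <- 1:10; y <- c(7:20)
tstat <- (mean(x) - mean(y))/sqrt(var(x)/length(x) + var(y)/length(y))
df <- min(length(x), length(y)) - 1  ## Hsu's degrees of freedom
2*pt(-abs(tstat), df)                ## two-sided p-value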
A list with class "htest" containing the following components:
statistic |
the value of the t-statistic. |
parameter |
the degrees of freedom for the t-statistic. |
p.value |
the p-value for the test. |
conf.int |
a confidence interval for the mean appropriate to the specified alternative hypothesis. |
estimate |
the estimated means and standard deviations. |
null.value |
the specified hypothesized value of the mean or mean difference depending on whether it was a one-sample test or a two-sample test. |
stderr |
the standard error of the difference in means, used as denominator in the t-statistic formula. |
alternative |
a character string describing the alternative hypothesis. |
method |
a character string indicating what type of t-test was performed. |
data.name |
a character string giving the name(s) of the data. |
J. Hedderich, L. Sachs. Angewandte Statistik: Methodensammlung mit R. Springer 2016.
## Examples taken and adapted from function t.test
t.test(1:10, y = c(7:20))      # P = .00001855
t.test(1:10, y = c(7:20, 200)) # P = .1245 -- NOT significant anymore
hsu.t.test(1:10, y = c(7:20))
hsu.t.test(1:10, y = c(7:20, 200))
## Traditional interface
with(sleep, t.test(extra[group == 1], extra[group == 2]))
with(sleep, hsu.t.test(extra[group == 1], extra[group == 2]))
## Formula interface
t.test(extra ~ group, data = sleep)
hsu.t.test(extra ~ group, data = sleep)
The function imputes standard deviations for changes from baseline adopting the approach described in the Cochrane handbook, Section 16.1.3.2.
imputeSD(SD1, SD2, SDchange)
SD1 |
numeric vector, baseline SD. |
SD2 |
numeric vector, follow-up SD. |
SDchange |
numeric vector, SD for changes from baseline. |
The function imputes standard deviations for changes from baseline adopting the approach described in the Cochrane handbook, Section 16.1.3.2.
1) Missing SD1 are replaced by corresponding values of SD2 and vice versa.
2) Correlations for complete data (rows) are computed.
3) Minimum, mean and maximum correlation (over rows) are computed.
4) Missing values of SDchange are computed by the formula provided in the handbook (see the sketch below). The minimum, mean and maximum correlation are used, leading to maximal, mean and minimal SD values that may be used for imputation as well as a sensitivity analysis.
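The handbook formula used in step 4 can be sketched as follows; the function names are illustrative:
## SD for change from baseline, given a correlation r between
## baseline and follow-up (Cochrane handbook, Section 16.1.3.2)
sdChange <- function(SD1, SD2, r) sqrt(SD1^2 + SD2^2 - 2*r*SD1*SD2)
## conversely, the correlation recovered from complete rows (step 2)
corChange <- function(SD1, SD2, SDchange) (SD1^2 + SD2^2 - SDchange^2)/(2*SD1*SD2)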
A data.frame with possibly imputed SD1 and SD2 values as well as the given SDchange values is returned. Moreover, the computed correlations as well as possible values for the imputation of SDchange are returned.
Matthias Kohl [email protected]
Higgins JPT, Green S (editors). Cochrane Handbook for Systematic Reviews of Interventions Version 5.1.0 [updated March 2011]. The Cochrane Collaboration, 2011. Available from www.handbook.cochrane.org.
SD1 <- c(0.149, 0.022, 0.036, 0.085, 0.125, NA, 0.139, 0.124, 0.038)
SD2 <- c(NA, 0.039, 0.038, 0.087, 0.125, NA, 0.135, 0.126, 0.038)
SDchange <- c(NA, NA, NA, 0.026, 0.058, NA, NA, NA, NA)
imputeSD(SD1, SD2, SDchange)
Computes the (standardized) interquartile range of the x values.
IQrange(x, na.rm = FALSE, type = 7)
sIQR(x, na.rm = FALSE, type = 7, constant = 2*qnorm(0.75))
x |
a numeric vector. |
na.rm |
logical. Should missing values be removed? |
type |
an integer between 1 and 9 selecting one of nine quantile algorithms; for more details see quantile. |
constant |
standardizing constant; see details below. |
The function IQrange computes the interquartile range as IQR(x) = quantile(x, 3/4) - quantile(x, 1/4). It is identical to function IQR; it was added before the type argument was introduced to function IQR in 2010 (r53643, r53644).
For a normally distributed random variable X, the expected value of IQR(X) is 2*qnorm(3/4) = 1.349, i.e., for a normal-consistent estimate of the standard deviation, use IQR(x) / 1.349. This is implemented in function sIQR (standardized IQR).
Matthias Kohl [email protected]
Tukey, J. W. (1977). Exploratory Data Analysis. Reading: Addison-Wesley.
IQrange(rivers)
## identical to
IQR(rivers)
## other quantile algorithms
IQrange(rivers, type = 4)
IQrange(rivers, type = 5)
## standardized IQR
sIQR(rivers)
## right-skewed data distribution
sd(rivers)
mad(rivers)
## for normal data
x <- rnorm(100)
sd(x)
sIQR(x)
mad(x)
Compute MAD between columns of a matrix or data.frame. Can be used to create a similarity matrix for a microarray experiment.
madMatrix(x)
x |
matrix or data.frame |
This function computes the so-called similarity matrix (based on MAD) for a microarray experiment; cf. Buness et al. (2004).
matrix of MAD values between columns of x
A first version of this function appeared in package SLmisc.
Matthias Kohl [email protected]
Andreas Buness, Wolfgang Huber, Klaus Steiner, Holger Sueltmann, and Annemarie Poustka. arrayMagic: two-colour cDNA microarray quality control and preprocessing. Bioinformatics Advance Access published on September 28, 2004. doi:10.1093/bioinformatics/bti052
madPlot
## only a dummy example
madMatrix(matrix(rnorm(1000), ncol = 10))
Plot of similarity matrix based on MAD between microarrays.
madPlot(x, new = FALSE, col, maxMAD = 3, labels = FALSE, labcols = "black", title = "", protocol = FALSE, ...)
x |
data or correlation matrix, respectively |
new |
logical. If TRUE, a new graphics device is opened. |
col |
color palette for image. If missing, a default palette is used. |
maxMAD |
maximum MAD value displayed |
labels |
vector of character strings to be placed at the tick points, labels for the columns of x. |
labcols |
colors to be used for the labels of the columns of x. |
title |
character string, overall title for the plot. |
protocol |
logical, display color bar without numbers |
... |
graphical parameters may also be supplied as arguments to the function (see par). |
This function generates the so-called similarity matrix (based on MAD) for a microarray experiment; cf. Buness et al. (2004). The function is similar to corPlot.
A first version of this function appeared in package SLmisc.
Matthias Kohl [email protected]
Sandrine Dudoit, Yee Hwa (Jean) Yang, Benjamin Milo Bolstad and with
contributions from Natalie Thorne, Ingrid Loennstedt and Jessica Mar.
sma: Statistical Microarray Analysis.
http://www.stat.berkeley.edu/users/terry/zarray/Software/smacode.html
Andreas Buness, Wolfgang Huber, Klaus Steiner, Holger Sueltmann, and Annemarie Poustka. arrayMagic: two-colour cDNA microarray quality control and preprocessing. Bioinformatics Advance Access published on September 28, 2004. doi:10.1093/bioinformatics/bti052
corPlot
## only a dummy example
set.seed(13)
x <- matrix(rnorm(1000), ncol = 10)
x[1:20,5] <- x[1:20,5] + 10
madPlot(x, new = TRUE, maxMAD = 2.5)
## in contrast
corPlot(x, new = TRUE, minCor = -0.5)
Computes the (standardized) mean absolute deviation.
meanAD(x, na.rm = FALSE, constant = sqrt(pi/2))
x |
a numeric vector. |
na.rm |
logical. Should missing values be removed? |
constant |
standardizing contant; see details below. |
The mean absolute deviation, standardized by the constant sqrt(pi/2), is a consistent estimator of the standard deviation of a normal distribution. Under minor deviations from the normal distribution its asymptotic variance is smaller than that of the sample standard deviation (Tukey (1960)).
It works well under the assumption of symmetric distributions, where mean and median coincide. Under the normal distribution it is about 18% more efficient (asymptotic relative efficiency) than the median absolute deviation ((1/qnorm(0.75))/sqrt(pi/2)) and about 12% less efficient than the sample standard deviation (Tukey (1960)).
Matthias Kohl [email protected]
Tukey, J. W. (1960). A survey of sampling from contaminated distributions. In Olkin, I., editor, Contributions to Probability and Statistics. Essays in Honor of H. Hotelling, pages 448-485. Stanford University Press.
## right skewed data
## mean absolute deviation
meanAD(rivers)
## standardized IQR
sIQR(rivers)
## median absolute deviation
mad(rivers)
## sample standard deviation
sd(rivers)
## for normal data
x <- rnorm(100)
sd(x)
sIQR(x)
mad(x)
meanAD(x)
## Asymptotic relative efficiency for Tukey's symmetric gross-error model
## (1-eps)*Norm(mean, sd = sigma) + eps*Norm(mean, sd = 3*sigma)
eps <- seq(from = 0, to = 1, by = 0.001)
ARE <- function(eps){
  0.25*((3*(1+80*eps))/((1+8*eps)^2)-1)/(pi*(1+8*eps)/(2*(1+2*eps)^2)-1)
}
plot(eps, ARE(eps), type = "l", xlab = "Proportion of gross-errors",
     ylab = "Asymptotic relative efficiency",
     main = "ARE of mean absolute deviation w.r.t. sample standard deviation")
abline(h = 1.0, col = "red")
text(x = 0.5, y = 1.5, "Mean absolute deviation is better", col = "red",
     cex = 1, font = 1)
## lower bound of interval
uniroot(function(x){ ARE(x)-1 }, interval = c(0, 0.002))
## upper bound of interval
uniroot(function(x){ ARE(x)-1 }, interval = c(0.5, 0.55))
## worst case
optimize(ARE, interval = c(0,1), maximum = TRUE)
The function transforms a given data.frame from wide to long form.
melt.long(data, select, group)
data |
data.frame that shall be transformed. |
select |
optional integer vector to select a subset of the columns of data. |
group |
optional vector to include an additional grouping in the output; for more details see examples below. |
The function transforms a given data.frame from wide to long form. This is for example useful for plotting with ggplot2.
data.frame in long form.
Matthias Kohl [email protected]
library(ggplot2)
## some random data
test <- data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10))
test.long <- melt.long(test)
test.long
ggplot(test.long, aes(x = variable, y = value)) +
  geom_boxplot(aes(fill = variable))
## introducing an additional grouping variable
group <- factor(rep(c("a","b"), each = 5))
test.long.gr <- melt.long(test, select = 1:2, group = group)
test.long.gr
ggplot(test.long.gr, aes(x = variable, y = value, fill = group)) +
  geom_boxplot()
Performs one and two sample t-tests on multiple imputed datasets.
mi.t.test(miData, ...)
## Default S3 method:
mi.t.test(miData, x, y = NULL, alternative = c("two.sided", "less", "greater"),
          mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95,
          subset = NULL, ...)
miData |
list of multiple imputed datasets. |
x |
name of a variable that shall be tested. |
y |
an optional name of a variable that shall be tested (paired test) or a variable that shall be used to split into groups (unpaired test). |
alternative |
a character string specifying the alternative hypothesis, must be one of "two.sided" (default), "greater" or "less". You can specify just the initial letter. |
mu |
a number indicating the true value of the mean (or difference in means if you are performing a two sample test). |
paired |
a logical indicating whether you want a paired t-test. |
var.equal |
a logical variable indicating whether to treat the two variances as being equal. If TRUE then the pooled variance is used to estimate the variance, otherwise the Welch (or Satterthwaite) approximation to the degrees of freedom is used. |
conf.level |
confidence level of the interval. |
subset |
an optional vector specifying a subset of observations to be used. |
... |
further arguments to be passed to or from methods. |
alternative = "greater"
is the alternative that x
has a
larger mean than y
.
If paired
is TRUE
then both x
and y
must
be specified and they must be the same length. Missing values are
not allowed as they should have been imputed. If
var.equal
is TRUE
then the pooled estimate of the
variance is used. By default, if var.equal
is FALSE
then the variance is estimated separately for both groups and the
Welch modification to the degrees of freedom is used.
We use the approach of Rubin (1987) in combination with the adjustment of Barnard and Rubin (1999).
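A minimal sketch of Rubin's (1987) combination rules, without the Barnard-Rubin degrees-of-freedom adjustment; the names are illustrative:
poolRubin <- function(est, var.within){
  m <- length(est)
  qbar <- mean(est)          ## pooled estimate over the m imputations
  W <- mean(var.within)      ## within-imputation variance
  B <- var(est)              ## between-imputation variance
  Tvar <- W + (1 + 1/m)*B    ## total variance
  c(estimate = qbar, se = sqrt(Tvar))
}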
A list with class "htest" containing the following components:
statistic |
the value of the t-statistic. |
parameter |
the degrees of freedom for the t-statistic. |
p.value |
the p-value for the test. |
conf.int |
a confidence interval for the mean appropriate to the specified alternative hypothesis. |
estimate |
the estimated mean (one-sample test), difference in means (paired test), or estimated means (two-sample test) as well as the respective standard deviations. |
null.value |
the specified hypothesized value of the mean or mean difference depending on whether it was a one-sample test or a two-sample test. |
alternative |
a character string describing the alternative hypothesis. |
method |
a character string indicating what type of t-test was performed. |
data.name |
a character string giving the name(s) of the data. |
Matthias Kohl [email protected]
Rubin, D. (1987). Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons, New York.
Barnard, J. and Rubin, D. (1999). Small-Sample Degrees of Freedom with Multiple Imputation. Biometrika, 86(4), 948-955.
## Generate some data
set.seed(123)
x <- rnorm(25, mean = 1)
x[sample(1:25, 5)] <- NA
y <- rnorm(20, mean = -1)
y[sample(1:20, 4)] <- NA
pair <- c(rnorm(25, mean = 1), rnorm(20, mean = -1))
g <- factor(c(rep("yes", 25), rep("no", 20)))
D <- data.frame(ID = 1:45, variable = c(x, y), pair = pair, group = g)
## Use Amelia to impute missing values
library(Amelia)
res <- amelia(D, m = 10, p2s = 0, idvars = "ID", noms = "group")
## Per protocol analysis (Welch two-sample t-test)
t.test(variable ~ group, data = D)
## Intention to treat analysis (Multiple Imputation Welch two-sample t-test)
mi.t.test(res$imputations, x = "variable", y = "group")
## Per protocol analysis (Two-sample t-test)
t.test(variable ~ group, data = D, var.equal = TRUE)
## Intention to treat analysis (Multiple Imputation two-sample t-test)
mi.t.test(res$imputations, x = "variable", y = "group", var.equal = TRUE)
## Specifying alternatives
mi.t.test(res$imputations, x = "variable", y = "group", alternative = "less")
mi.t.test(res$imputations, x = "variable", y = "group", alternative = "greater")
## One sample test
t.test(D$variable[D$group == "yes"])
mi.t.test(res$imputations, x = "variable", subset = D$group == "yes")
mi.t.test(res$imputations, x = "variable", mu = -1, subset = D$group == "yes",
          alternative = "less")
mi.t.test(res$imputations, x = "variable", mu = -1, subset = D$group == "yes",
          alternative = "greater")
## paired test
t.test(D$variable, D$pair, paired = TRUE)
mi.t.test(res$imputations, x = "variable", y = "pair", paired = TRUE)
Performs moderated 1-Way ANOVAs based on Bioconductor package limma.
mod.oneway.test(x, group, adjust.method = "BH", sort.by = "none")
x |
a (non-empty) numeric matrix of data values. |
group |
an optional factor representing the groups. |
adjust.method |
see function topTable of package limma. |
sort.by |
see function topTable of package limma, where "logFC" corresponds to difference in means. |
The function uses Bioconductor package limma to compute moderated 1-way ANOVAs. For more details we refer to ebayes.
A data.frame with the results.
B. Phipson, S. Lee, I.J. Majewski, W.S. Alexander, G.H. Smyth (2016). Robust hyperparameter estimation protects against hypervariable genes and improves power to detect differential expression. Annals of Applied Statistics 10(2), 946-963.
oneway.test, mod.t.test
set.seed(123)
X <- rbind(matrix(rnorm(5*20), nrow = 5, ncol = 20),
           matrix(rnorm(5*20, mean = 1), nrow = 5, ncol = 20))
gr <- factor(c(rep("A1", 5), rep("B2", 5), rep("C3", 5), rep("D4", 5)))
mod.oneway.test(X, gr)
## Welch 1-Way ANOVA (not moderated)
ow.test <- function(x, g){
  res <- oneway.test(x ~ g)
  c(res$statistic, res$p.value)
}
ow.res <- t(apply(X, 1, ow.test, g = gr))
colnames(ow.res) <- c("F", "p.value")
ow.res
Performs moderated t-tests based on Bioconductor package limma.
mod.t.test(x, group = NULL, paired = FALSE, adjust.method = "BH", sort.by = "none")
x |
a (non-empty) numeric matrix of data values. |
group |
an optional factor representing the groups. |
paired |
a logical indicating whether you want a paired test. |
adjust.method |
see function topTable of package limma. |
sort.by |
see function topTable of package limma, where "logFC" corresponds to difference in means. |
The function uses Bioconductor package limma to compute moderated t-tests. For more details we refer to ebayes.
A data.frame with the results.
B. Phipson, S. Lee, I.J. Majewski, W.S. Alexander, G.H. Smyth (2016). Robust hyperparameter estimation protects against hypervariable genes and improves power to detect differential expression. Annals of Applied Statistics 10(2), 946-963.
## One-sample test
X <- matrix(rnorm(10*20, mean = 1), nrow = 10, ncol = 20)
mod.t.test(X)
## corresponds to
library(limma)
design <- matrix(1, nrow = ncol(X), ncol = 1)
colnames(design) <- "A"
fit1 <- lmFit(X, design)
fit2 <- eBayes(fit1)
topTable(fit2, coef = 1, number = Inf, confint = TRUE, sort.by = "none")[,-4]
## Two-sample test
set.seed(123)
X <- rbind(matrix(rnorm(5*20), nrow = 5, ncol = 20),
           matrix(rnorm(5*20, mean = 1), nrow = 5, ncol = 20))
g2 <- factor(c(rep("group 1", 10), rep("group 2", 10)))
mod.t.test(X, group = g2)
## corresponds to
design <- model.matrix(~ 0 + g2)
colnames(design) <- c("group1", "group2")
fit1 <- lmFit(X, design)
cont.matrix <- makeContrasts(group1vsgroup2="group1-group2", levels=design)
fit2 <- contrasts.fit(fit1, cont.matrix)
fit3 <- eBayes(fit2)
topTable(fit3, coef = 1, number = Inf, confint = TRUE, sort.by = "none")[,-4]
## Paired two-sample test
mod.t.test(X, group = g2, paired = TRUE)
This function can be used to compute confidence intervals for mean and standard deviation of a normal distribution.
normCI(x, mean = NULL, sd = NULL, conf.level = 0.95, na.rm = TRUE)
x |
vector of observations. |
mean |
mean if known, otherwise NULL. |
sd |
standard deviation if known, otherwise NULL. |
conf.level |
confidence level. |
na.rm |
a logical value indicating whether NA values should be stripped before the computation proceeds. |
The standard confidence intervals for mean and standard deviation are computed that can be found in many textbooks, e.g. Chapter 4 in Altman et al. (2000).
A list with class "confint" containing the following components:
estimate |
the estimated mean and sd. |
conf.int |
confidence interval(s) for mean and/or sd. |
Infos |
additional information. |
Matthias Kohl [email protected]
D. Altman, D. Machin, T. Bryant, M. Gardner (eds). Statistics with Confidence: Confidence Intervals and Statistical Guidelines, 2nd edition 2000.
x <- rnorm(50)
## mean and sd unknown
normCI(x)
## sd known
normCI(x, sd = 1)
## mean known
normCI(x, mean = 0)
This function can be used to compute confidence intervals for difference of means assuming normal distributions.
normDiffCI(x, y, conf.level = 0.95, paired = FALSE, method = "welch", na.rm = TRUE)
x |
numeric vector of data values of group 1. |
y |
numeric vector of data values of group 2. |
conf.level |
confidence level. |
paired |
a logical value indicating whether the two groups are paired. |
method |
a character string specifying which method to use in the unpaired case; see details. |
na.rm |
a logical value indicating whether NA values should be stripped before the computation proceeds. |
The standard confidence intervals for the difference of means are computed that can be found in many textbooks, e.g. Chapter 4 in Altman et al. (2000). The method "classical" assumes equal variances whereas methods "welch" and "hsu" allow for unequal variances. The latter two methods use different formulas for computing the degrees of freedom of the respective t-distribution providing the quantiles in the confidence interval. Instead of the Welch-Satterthwaite equation, the method of Hsu uses the minimum of the group sample sizes minus 1; see Section 6.8.3 of Hedderich and Sachs (2016).
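The two degrees-of-freedom formulas can be sketched as follows; the names are illustrative:
## Welch-Satterthwaite equation
welch.df <- function(x, y){
  sx <- var(x)/length(x); sy <- var(y)/length(y)
  (sx + sy)^2/(sx^2/(length(x) - 1) + sy^2/(length(y) - 1))
}
## Hsu: minimum of the group sample sizes minus 1
hsu.df <- function(x, y) min(length(x), length(y)) - 1
x <- rnorm(10); y <- rnorm(30, sd = 0.1)
c(welch = welch.df(x, y), hsu = hsu.df(x, y))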
A list with class "confint" containing the following components:
estimate |
point estimate (mean of differences or difference in means). |
conf.int |
confidence interval. |
Infos |
additional information. |
Matthias Kohl [email protected]
D. Altman, D. Machin, T. Bryant, M. Gardner (eds). Statistics with Confidence: Confidence Intervals and Statistical Guidelines, 2nd edition. John Wiley and Sons 2000.
J. Hedderich, L. Sachs. Angewandte Statistik: Methodensammlung mit R. Springer 2016.
x <- rnorm(20)
y <- rnorm(20, sd = 2)
## paired
normDiffCI(x, y, paired = TRUE)
## compare
normCI(x-y)
## unpaired
y <- rnorm(10, mean = 1, sd = 2)
## classical
normDiffCI(x, y, method = "classical")
## Welch (default, as in function t.test)
normDiffCI(x, y, method = "welch")
## Hsu
normDiffCI(x, y, method = "hsu")
## Monte-Carlo simulation: coverage probability
M <- 10000
CIhsu <- CIwelch <- CIclass <- matrix(NA, nrow = M, ncol = 2)
for(i in 1:M){
  x <- rnorm(10)
  y <- rnorm(30, sd = 0.1)
  CIclass[i,] <- normDiffCI(x, y, method = "classical")$conf.int
  CIwelch[i,] <- normDiffCI(x, y, method = "welch")$conf.int
  CIhsu[i,] <- normDiffCI(x, y, method = "hsu")$conf.int
}
## coverage probabilities
## classical
sum(CIclass[,1] < 0 & 0 < CIclass[,2])/M
## Welch
sum(CIwelch[,1] < 0 & 0 < CIwelch[,2])/M
## Hsu
sum(CIhsu[,1] < 0 & 0 < CIhsu[,2])/M
This function is a slight modification of function Anova of package "genefilter".
oneWayAnova(cov, na.rm = TRUE, var.equal = FALSE)
cov |
The covariate. It must have length equal to the number of columns of the array that the result of oneWayAnova will be applied to. |
na.rm |
a logical value indicating whether NA values should be removed. |
var.equal |
a logical variable indicating whether to treat the variances in the samples as equal. If TRUE, a simple F test for the equality of means in a one-way analysis of variance is performed; if FALSE, an approximate method of Welch (1951) is used. |
The function returned by oneWayAnova uses oneway.test to perform a one-way ANOVA, where x is the set of gene expressions. The F statistic for an overall effect is computed and the corresponding p-value is returned. The function Anova instead compares the computed p-value to a prespecified p-value and returns TRUE, if the computed p-value is smaller than the prespecified one.
oneWayAnova returns a function with bindings for cov that will perform a one-way ANOVA.
The covariate can be continuous, in which case the test is for a linear effect for the covariate.
A first version of this function appeared in package SLmisc.
Matthias Kohl [email protected]
R. Gentleman, V. Carey, W. Huber and F. Hahne (2006). genefilter: methods for filtering genes from microarray experiments. R package version 1.13.7.
set.seed(123)
af <- oneWayAnova(c(rep(1,5),rep(2,5)))
af(rnorm(10))
The function computes the optimal cutoff for various performance measures for binary classification.
optCutoff(pred, truth, namePos, perfMeasure = "Youden's J statistic", max = TRUE, parallel = FALSE, ncores, delta = 0.01)
pred |
numeric values that shall be used for classification; e.g. probabilities to belong to the positive group. |
truth |
true grouping vector or factor. |
namePos |
value representing the positive group. |
perfMeasure |
a performance measure computed by function perfMeasures. |
max |
logical value. Whether to maximize or minimize the performance measure. |
parallel |
logical value. If TRUE, the computations are parallelized. |
ncores |
integer value, number of cores that shall be used to parallelize the computations. |
delta |
numeric value for setting up the grid for optimization; start is the minimum of pred, end is the maximum of pred, and delta is the step size. |
The function is able to compute the optimal cutoff for various performance measures, namely all performance measures that are implemented in function perfMeasures.
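A minimal sketch of such a grid search for Youden's J statistic (sensitivity + specificity - 1); the helper youden is illustrative and assumes that larger values of pred indicate the positive group:
youden <- function(cutoff, pred, truth, namePos){
  pos <- truth == namePos
  sens <- mean(pred[pos] >= cutoff)   ## sensitivity at this cutoff
  spec <- mean(pred[!pos] < cutoff)   ## specificity at this cutoff
  sens + spec - 1
}
fit <- glm(case ~ spontaneous+induced, data = infert, family = binomial())
pred <- predict(fit, type = "response")
grid <- seq(min(pred), max(pred), by = 0.01)
J <- sapply(grid, youden, pred = pred, truth = infert$case, namePos = 1)
grid[which.max(J)]  ## compare optCutoff(pred, truth = infert$case, namePos = 1)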
Optimal cutoff and value of the optimized performance measure based on a simple grid search.
Matthias Kohl [email protected]
## example from dataset infert
fit <- glm(case ~ spontaneous+induced, data = infert, family = binomial())
pred <- predict(fit, type = "response")
optCutoff(pred, truth = infert$case, namePos = 1)
The function transforms a given odds-ratio (OR) to the respective relative risk (RR).
or2rr(or, p0, p1)
or |
numeric vector: OR (odds-ratio). |
p0 |
numeric vector of length 1: incidence of the outcome of interest in the nonexposed group. |
p1 |
numeric vector of length 1: incidence of the outcome of interest in the exposed group. |
The function transforms a given odds-ratio (OR) to the respective relative risk (RR). It can also be used to transform the limits of confidence intervals.
The formulas can be derived by combining the formulas for RR and OR; see also Zhang and Yu (1998).
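A minimal sketch of the conversion given p0, following Zhang and Yu (1998):
## RR = OR / (1 - p0 + p0 * OR)
or2rr.man <- function(or, p0) or/(1 - p0 + p0*or)
or2rr.man(14.1, 0.05)  ## compare or2rr(14.1, 0.05)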
relative risk.
Matthias Kohl [email protected]
Zhang, J. and Yu, K. F. (1998). What's the relative risk? A method of correcting the odds ratio in cohort studies of common outcomes. JAMA, 280(19):1690-1691.
## We use data from Zhang and Yu (1998)
## OR to RR using OR and p0
or2rr(14.1, 0.05)
## compute p1
or2rr(14.1, 0.05)*0.05
## OR to RR using OR and p1
or2rr(14.1, p1 = 0.426)
## OR and 95% confidence interval
or2rr(c(14.1, 7.8, 27.5), 0.05)
## Logistic OR and 95% confidence interval
logisticOR <- rbind(c(14.1, 7.8, 27.5),
                    c(8.7, 5.5, 14.3),
                    c(27.4, 17.2, 45.8),
                    c(4.5, 2.7, 7.8),
                    c(0.25, 0.17, 0.37),
                    c(0.09, 0.05, 0.14))
colnames(logisticOR) <- c("OR", "2.5%", "97.5%")
rownames(logisticOR) <- c("7.4", "4.2", "3.0", "2.0", "0.37", "0.14")
logisticOR
## p0
p0 <- c(0.05, 0.12, 0.32, 0.27, 0.40, 0.40)
## Compute corrected RR
## helper function
or2rr.mat <- function(or, p0){
  res <- matrix(NA, nrow = nrow(or), ncol = ncol(or))
  for(i in seq_len(nrow(or)))
    res[i,] <- or2rr(or[i,], p0[i])
  dimnames(res) <- dimnames(or)
  res
}
RR <- or2rr.mat(logisticOR, p0)
round(RR, 2)
## The results are not completely identical to Zhang and Yu (1998),
## which is probably caused by the fact that the logistic OR values
## provided in the table are rounded and not exact.
The function computes pairwise AUCs.
pairwise.auc(x, g)
x |
numeric vector. |
g |
grouping vector or factor |
The function computes pairwise areas under the receiver operating characteristic curves (AUC under ROC curves) using function AUC.
The implementation is in certain aspects analogous to pairwise.t.test.
Vector with pairwise AUCs.
Matthias Kohl [email protected]
set.seed(13)
x <- rnorm(100)
g <- factor(sample(1:4, 100, replace = TRUE))
levels(g) <- c("a", "b", "c", "d")
pairwise.auc(x, g)
This function computes pairwise fold changes. It also works for logarithmic data.
pairwise.fc(x, g, ave = mean, log = TRUE, base = 2, mod.fc = TRUE, ...)
x |
numeric vector. |
g |
grouping vector or factor |
ave |
function to compute the group averages. |
log |
logical. Is the data logarithmic? |
base |
If |
mod.fc |
logical. Return modified fold changes? (see details) |
... |
optional arguments to |
The function computes pairwise fold changes between groups, where the group values are aggregated using the function which is given by the argument ave.
The fold changes are returned in a slightly modified form if mod.fc = TRUE: fold changes FC which are smaller than 1 are reported as -1/FC; e.g., a fold change of 0.5 is reported as -2.
The implementation is in certain aspects analogous to pairwise.t.test.
Vector with pairwise fold changes.
Matthias Kohl [email protected]
set.seed(13)
x <- rnorm(100) ## assumed as log2-data
g <- factor(sample(1:4, 100, replace = TRUE))
levels(g) <- c("a", "b", "c", "d")
pairwise.fc(x, g)
## some small checks
res <- by(x, list(g), mean)
2^(res[[1]] - res[[2]])    # a vs. b
-1/2^(res[[1]] - res[[3]]) # a vs. c
2^(res[[1]] - res[[4]])    # a vs. d
-1/2^(res[[2]] - res[[3]]) # b vs. c
-1/2^(res[[2]] - res[[4]]) # b vs. d
2^(res[[3]] - res[[4]])    # c vs. d
The function computes pairwise values for a given function.
pairwise.fun(x, g, fun, ...)
x |
numeric vector. |
g |
grouping vector or factor |
fun |
some function where the first two arguments have to be numeric vectors for which the function computes some quantity; see example section below. |
... |
additional arguments to fun. |
The function computes pairwise values for a given function.
The implementation is in certain aspects analogous to pairwise.t.test.
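Conceptually, the function evaluates fun on the data of every pair of group levels. A minimal sketch of this idea (the combn-based helper is introduced here purely for illustration):

set.seed(13)
x <- rnorm(100)
g <- factor(sample(c("a", "b", "c"), 100, replace = TRUE))
fun <- function(u, v) t.test(u, v)$p.value
prs <- combn(levels(g), 2)  # all pairs of group levels
res <- apply(prs, 2, function(p) fun(x[g == p[1]], x[g == p[2]]))
names(res) <- apply(prs, 2, paste, collapse = " vs ")
res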
Vector with pairwise function values.
Matthias Kohl [email protected]
pairwise.t.test, pairwise.fc, pairwise.logfc, pairwise.auc
set.seed(13)
x <- rnorm(100)
g <- factor(sample(1:4, 100, replace = TRUE))
levels(g) <- c("a", "b", "c", "d")
pairwise.fun(x, g, fun = function(x, y) t.test(x,y)$p.value)
## in contrast to
pairwise.t.test(x, g, p.adjust.method = "none", pool.sd = FALSE)
The function computes pairwise log-fold changes.
pairwise.logfc(x, g, ave = mean, log = TRUE, base = 2, ...)
x |
numeric vector. |
g |
grouping vector or factor |
ave |
function to compute the group averages. |
log |
logical. Is the data logarithmic? |
base |
If |
... |
optional arguments to |
The function computes pairwise log-fold changes between groups, where the group values are aggregated using the function which is given by the argument ave.
The implementation is in certain aspects analogous to pairwise.t.test.
Vector with pairwise log-fold changes.
Matthias Kohl [email protected]
set.seed(13)
x <- rnorm(100) ## assumed as log2-data
g <- factor(sample(1:4, 100, replace = TRUE))
levels(g) <- c("a", "b", "c", "d")
pairwise.logfc(x, g)
## some small checks
res <- by(x, list(g), mean)
res[[1]] - res[[2]] # a vs. b
res[[1]] - res[[3]] # a vs. c
res[[1]] - res[[4]] # a vs. d
res[[2]] - res[[3]] # b vs. c
res[[2]] - res[[4]] # b vs. d
res[[3]] - res[[4]] # c vs. d
Performs pairwise moderated t-tests based on Bioconductor package limma.
pairwise.mod.t.test(x, group, adjust.method = "BH", sort.by = "none")
x |
a (non-empty) numeric matrix of data values. |
group |
an optional factor representing the groups. |
adjust.method |
see |
sort.by |
see |, where "logFC" corresponds to the difference in means.
The function uses Bioconductor package limma to compute pairwise moderated t-tests. For more details we refer to ebayes.
A data.frame with the results.
B. Phipson, S. Lee, I.J. Majewski, W.S. Alexander, G.H. Smyth (2016). Robust hyperparameter estimation protects against hypervariable genes and improves power to detect differential expression. Annals of Applied Statistics 10(2), 946-963.
oneway.test, mod.t.test
set.seed(123)
X <- rbind(matrix(rnorm(5*20), nrow = 5, ncol = 20),
           matrix(rnorm(5*20, mean = 1), nrow = 5, ncol = 20))
gr <- factor(c(rep("A1", 5), rep("B2", 5), rep("C3", 5), rep("D4", 5)))
mod.oneway.test(X, gr)
pairwise.mod.t.test(X, gr)
The function computes various performance measures and scores for binary classification.
perfMeasures(pred, pred.group, truth, namePos, cutoff = 0.5,
             weight = 0.5, wACC = weight, wPV = weight)
perfScores(pred, truth, namePos, weight = 0.5, wBS = weight)
pred |
numeric values that shall be used for classification; e.g. probabilities to belong to the positive group. |
pred.group |
vector or factor including the predicted group. If missing,
|
truth |
true grouping vector or factor. |
namePos |
value representing the positive group. |
cutoff |
cutoff value used for classification. |
weight |
weight used for computing weighted values. Must be in [0,1]. |
wACC |
weight used for computing the weighted accuracy. Must be in [0,1]. |
wPV |
weight used for computing the weighted predictive value. Must be in [0,1]. |
wBS |
weight used for computing the weighted Brier score. Must be in [0,1]. |
The function perfMeasures computes various performance measures. The measures are:
accuracy (ACC), probability of correct classification (PCC), probability of
misclassification (PMC), error rate, sensitivity, specificity, prevalence,
no information rate, weighted accuracy (wACC), balanced accuracy (BACC),
informedness, Youden's J statistic, positive likelihood ratio (PLR),
negative likelihood ratio (NLR), positive predictive value (PPV),
negative predictive value (NPV), markedness, weighted predictive value,
balanced predictive value, F1 score, Matthews' correlation
coefficient (MCC), proportion of positive predictions, expected accuracy,
Cohen's kappa coefficient, and detection rate.
These performance measures have in common that they require a dichotomization (discretization) of a computed continuous classification function.
The function perfScores computes various performance scores. The scores are:
area under the ROC curve (AUC), Gini index, Brier score, positive Brier score,
negative Brier score, weighted Brier score, and balanced Brier score.
If the predictions (pred) are not in the interval [0,1], the standard logistic function is applied to transform the values of pred - cutoff to [0,1].
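A minimal sketch of that transformation (pred2 holds hypothetical linear predictors chosen here for illustration):

pred2 <- c(-2.3, 0.4, 1.7)  # hypothetical values outside [0,1]
cutoff <- 0
plogis(pred2 - cutoff)      # standard logistic function maps to (0,1)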
A data.frame with the names of the performance measures, respectively scores, and their respective values.
Matthias Kohl [email protected]
G.W. Brier (1950). Verification of forecasts expressed in terms of probability. Mon. Wea. Rev. 78, 1-3.
K.H. Brodersen, C.S. Ong, K.E. Stephan, J.M. Buhmann (2010). The balanced accuracy and its posterior distribution. In Pattern Recognition (ICPR), 20th International Conference on, 3121-3124 (IEEE, 2010).
J.A. Cohen (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20, 37-46.
T. Fawcett (2006). An introduction to ROC analysis. Pattern Recognition Letters 27, 861-874.
T.A. Gerds, T. Cai, M. Schumacher (2008). The performance of risk prediction models. Biom J 50, 457-479.
D. Hand, R. Till (2001). A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning 45, 171-186.
J. Hernandez-Orallo, P.A. Flach, C. Ferri (2011). Brier curves: a new cost-based visualisation of classifier performance. In L. Getoor and T. Scheffer (eds.) Proceedings of the 28th International Conference on Machine Learning (ICML-11), 585-592 (ACM, New York, NY, USA).
J. Hernandez-Orallo, P.A. Flach, C. Ferri (2012). A unified view of performance metrics: Translating threshold choice into expected classification loss. J. Mach. Learn. Res. 13, 2813-2869.
B.W. Matthews (1975). Comparison of the predicted and observed secondary structure of t4 phage lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Structure 405, 442-451.
D.M. Powers (2011). Evaluation: From Precision, Recall and F-Factor to ROC, Informedness, Markedness and Correlation. Journal of Machine Learning Technologies 1, 37-63.
N.A. Smits (2010). A note on Youden's J and its cost ratio. BMC Medical Research Methodology 10, 89.
B. Wallace, I. Dahabreh (2012). Class probability estimates are unreliable for imbalanced data (and how to fix them). In Data Mining (ICDM), IEEE 12th International Conference on, 695-704.
J.W. Youden (1950). Index for rating diagnostic tests. Cancer 3, 32-35.
## example from dataset infert
fit <- glm(case ~ spontaneous+induced, data = infert, family = binomial())
pred <- predict(fit, type = "response")
## with group numbers
perfMeasures(pred, truth = infert$case, namePos = 1)
perfScores(pred, truth = infert$case, namePos = 1)
## with group names
my.case <- factor(infert$case, labels = c("control", "case"))
perfMeasures(pred, truth = my.case, namePos = "case")
perfScores(pred, truth = my.case, namePos = "case")
## on the scale of the linear predictors
pred2 <- predict(fit)
perfMeasures(pred2, truth = infert$case, namePos = 1, cutoff = 0)
perfScores(pred2, truth = infert$case, namePos = 1)
## using weights
perfMeasures(pred, truth = infert$case, namePos = 1, weight = 0.3)
perfScores(pred, truth = infert$case, namePos = 1, weight = 0.3)
Compute sample size, power, delta, or significance level of a diagnostic test for an expected sensitivity or specificity.
power.diagnostic.test(sens = NULL, spec = NULL, n = NULL, delta = NULL,
                      sig.level = 0.05, power = NULL, prev = NULL,
                      method = c("exact", "asymptotic"), NMAX = 1e4)
sens |
Expected sensitivity; either |
spec |
Expected specificity; either |
n |
Number of cases if |
delta |
|
sig.level |
Significance level (Type I error probability) |
power |
Power of test (1 minus Type II error probability) |
prev |
Expected prevalence, if |
method |
exact or asymptotic formula; default |
NMAX |
Maximum sample size considered in case |
Either sens or spec has to be specified, which leads to computations for either cases or controls.
Exactly one of the parameters n, delta, sig.level, and power must be passed as NULL, and that parameter is determined from the others. Notice that sig.level has a non-NULL default, so NULL must be explicitly passed if you want to compute it.
The computations are based on the formulas given in the Appendix of Flahault et al. (2005). Please be careful: in Equation (A1) the numerator should be squared; in Equations (A2) and (A3) the second exponent should be n-i and not i.
As noted in Chu and Cole (2007), power is not a monotonically increasing function of n but rather saw-toothed (see also Chernick and Liu (2002)). Hence, in our calculations we use the more conservative approach II; i.e., the minimum sample size n such that the actual power is larger than or equal to power and such that for any sample size larger than n it also holds that the actual power is larger than or equal to power.
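The saw-toothed behaviour is easy to reproduce. The following sketch assumes the exact method amounts to requiring a one-sided Clopper-Pearson lower confidence bound of at least sens - delta; this is an assumption for illustration, not necessarily the exact implementation:

pow <- function(n, sens = 0.95, delta = 0.1, sig.level = 0.05){
  x <- 0:n
  ## one-sided Clopper-Pearson lower bound for each possible count
  lcl <- ifelse(x == 0, 0, qbeta(sig.level, x, n - x + 1))
  sum(dbinom(x, n, sens)*(lcl >= sens - delta))
}
ns <- 80:110
plot(ns, sapply(ns, pow), type = "b",
     xlab = "n", ylab = "actual power")  # not monotone in n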
Object of class "power.htest", a list of the arguments (including the computed one) augmented with method and note elements.
uniroot is used to solve the power equation for unknowns, so you may see errors from it, notably about inability to bracket the root when invalid arguments are given.
Matthias Kohl [email protected]
A. Flahault, M. Cadilhac, and G. Thomas (2005). Sample size calculation should be performed for design accuracy in diagnostic test studies. Journal of Clinical Epidemiology, 58(8):859-862.
H. Chu and S.R. Cole (2007). Sample size calculation using exact methods in diagnostic test studies. Journal of Clinical Epidemiology, 60(11):1201-1202.
M.R. Chernick and C.Y. Liu (2002). The saw-toothed behavior of power versus sample size and software solutions: single binomial proportion using exact methods. Am Stat, 56:149-155.
## see n2 on page 1202 of Chu and Cole (2007)
power.diagnostic.test(sens = 0.99, delta = 0.14, power = 0.95) # 40
power.diagnostic.test(sens = 0.99, delta = 0.13, power = 0.95) # 43
power.diagnostic.test(sens = 0.99, delta = 0.12, power = 0.95) # 47
power.diagnostic.test(sens = 0.98, delta = 0.13, power = 0.95) # 50
power.diagnostic.test(sens = 0.98, delta = 0.11, power = 0.95) # 58
## see page 1201 of Chu and Cole (2007)
power.diagnostic.test(sens = 0.95, delta = 0.1, n = 93) ## 0.957
power.diagnostic.test(sens = 0.95, delta = 0.1, n = 93,
                      power = 0.95, sig.level = NULL)   ## 0.0496
power.diagnostic.test(sens = 0.95, delta = 0.1, n = 102) ## 0.968
power.diagnostic.test(sens = 0.95, delta = 0.1, n = 102,
                      power = 0.95, sig.level = NULL)    ## 0.0471
## yields 102 not 93!
power.diagnostic.test(sens = 0.95, delta = 0.1, power = 0.95)
Compute the power of the two-sample Hsu t test, or determine parameters to obtain a target power; see Section 7.4.4 in Hedderich and Sachs (2016).
power.hsu.t.test(n = NULL, delta = NULL, sd1 = 1, sd2 = 1,
                 sig.level = 0.05, power = NULL,
                 alternative = c("two.sided", "one.sided"),
                 strict = FALSE, tol = .Machine$double.eps^0.25)
n |
number of observations (per group) |
delta |
(expected) true difference in means |
sd1 |
(expected) standard deviation of group 1 |
sd2 |
(expected) standard deviation of group 2 |
sig.level |
significance level (Type I error probability) |
power |
power of test (1 minus Type II error probability) |
alternative |
one- or two-sided test. Can be abbreviated. |
strict |
use strict interpretation in two-sided case |
tol |
numerical tolerance used in root finding, the default providing (at least) four significant digits. |
Exactly one of the parameters n, delta, power, sd1, sd2 and sig.level must be passed as NULL, and that parameter is determined from the others. Notice that the last three have non-NULL defaults, so NULL must be explicitly passed if you want to compute them.
If strict = TRUE is used, the power will include the probability of rejection in the opposite direction of the true effect, in the two-sided case. Without this the power will be half the significance level if the true difference is zero.
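For illustration, an approximate power computation for the Hsu t-test can be sketched via the noncentral t distribution with min(n1, n2) - 1 = n - 1 degrees of freedom. This is a sketch under that textbook approximation, not necessarily the exact implementation:

hsu.pow <- function(n, delta, sd1 = 1, sd2 = 1, sig.level = 0.05){
  se <- sqrt(sd1^2/n + sd2^2/n)
  df <- n - 1  # Hsu: min(n1, n2) - 1 degrees of freedom
  q <- qt(1 - sig.level/2, df)
  pt(q, df, ncp = delta/se, lower.tail = FALSE) +
    pt(-q, df, ncp = delta/se)
}
hsu.pow(n = 20, delta = 1)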
Object of class "power.htest", a list of the arguments (including the computed one) augmented with method and note elements.
The function and its documentation were adapted from power.t.test, implemented by Peter Dalgaard and based on previous work by Claus Ekstroem.
uniroot is used to solve the power equation for unknowns, so you may see errors from it, notably about inability to bracket the root when invalid arguments are given.
Matthias Kohl [email protected]
J. Hedderich, L. Sachs. Angewandte Statistik: Methodensammlung mit R. Springer 2016.
power.welch.t.test, power.t.test, t.test, uniroot
## more conservative than classical or Welch t-test
power.hsu.t.test(n = 20, delta = 1)
power.hsu.t.test(power = .90, delta = 1)
power.hsu.t.test(power = .90, delta = 1, alternative = "one.sided")
## sd1 = 0.5, sd2 = 1
power.welch.t.test(delta = 0.5, sd1 = 0.5, sd2 = 1, power = 0.9)
power.hsu.t.test(delta = 0.5, sd1 = 0.5, sd2 = 1, power = 0.9)
## empirical check
M <- 10000
ps <- numeric(M)
for(i in seq_len(M)){
  x <- rnorm(55, mean = 0, sd = 0.5)
  y <- rnorm(55, mean = 0.5, sd = 1.0)
  ps[i] <- hsu.t.test(x, y)$p.value
}
## empirical power
sum(ps < 0.05)/M
Compute sample size or power for comparing two negative binomial rates.
power.nb.test(n = NULL, mu0, mu1, RR, duration = 1, theta,
              ssize.ratio = 1, sig.level = 0.05, power = NULL,
              alternative = c("two.sided", "one.sided"), approach = 3)
n |
Sample size for group 0 (control group). |
mu0 |
expected rate of events per time unit for group 0 |
mu1 |
expected rate of events per time unit for group 1 |
RR |
ratio of expected event rates: mu1/mu0 |
duration |
(average) treatment duration |
theta |
theta parameter of negative binomial distribution; see |
ssize.ratio |
ratio of sample sizes: n/n1 where n1 is sample size of group 1 |
sig.level |
Significance level (Type I error probability) |
power |
Power of test (1 minus Type II error probability) |
alternative |
one- or two-sided test |
approach |
1, 2, or 3; see Zhu and Lakkis (2014). |
Exactly one of the parameters n and power must be passed as NULL, and that parameter is determined from the other.
The computations are based on the formulas given in Zhu and Lakkis (2014). Please be careful, as we are using a slightly different parametrization (theta = 1/k).
Based on their simulation studies, Zhu and Lakkis (2014) recommend using their approach 2 or 3.
Object of class "power.htest", a list of the arguments (including the computed one) augmented with a note element.
Matthias Kohl [email protected]
H. Zhu and H. Lakkis (2014). Sample size calculation for comparing two negative binomial rates. Statistics in Medicine, 33:376-387.
## examples from Table I in Zhu and Lakkis (2014)
## theta = 1/k, RR = rr, mu0 = r0, duration = mu_t
power.nb.test(mu0 = 0.8, RR = 0.85, theta = 1/0.4, duration = 0.75, power = 0.8, approach = 1)
power.nb.test(mu0 = 0.8, RR = 0.85, theta = 1/0.4, duration = 0.75, power = 0.8, approach = 2)
power.nb.test(mu0 = 0.8, RR = 0.85, theta = 1/0.4, duration = 0.75, power = 0.8, approach = 3)
power.nb.test(mu0 = 1.4, RR = 1.15, theta = 1/1.5, duration = 0.75, power = 0.8, approach = 1)
power.nb.test(mu0 = 1.4, RR = 1.15, theta = 1/1.5, duration = 0.75, power = 0.8, approach = 2)
power.nb.test(mu0 = 1.4, RR = 1.15, theta = 1/1.5, duration = 0.75, power = 0.8, approach = 3)
## examples from Table II in Zhu and Lakkis (2014) - seem to be total sample sizes
## can reproduce the results with mu_t = 1.0 (not 0.7!)
power.nb.test(mu0 = 2.0, RR = 0.5, theta = 1, duration = 1.0, ssize.ratio = 1, power = 0.8, approach = 1)
power.nb.test(mu0 = 2.0, RR = 0.5, theta = 1, duration = 1.0, ssize.ratio = 1, power = 0.8, approach = 2)
power.nb.test(mu0 = 2.0, RR = 0.5, theta = 1, duration = 1.0, ssize.ratio = 1, power = 0.8, approach = 3)
power.nb.test(mu0 = 10.0, RR = 1.5, theta = 1/5, duration = 1.0, ssize.ratio = 3/2, power = 0.8, approach = 1)
power.nb.test(mu0 = 10.0, RR = 1.5, theta = 1/5, duration = 1.0, ssize.ratio = 3/2, power = 0.8, approach = 2)
power.nb.test(mu0 = 10.0, RR = 1.5, theta = 1/5, duration = 1.0, ssize.ratio = 3/2, power = 0.8, approach = 3)
## examples from Table III in Zhu and Lakkis (2014)
power.nb.test(mu0 = 5.0, RR = 2.0, theta = 1/0.5, duration = 1, power = 0.8, approach = 1)
power.nb.test(mu0 = 5.0, RR = 2.0, theta = 1/0.5, duration = 1, power = 0.8, approach = 2)
power.nb.test(mu0 = 5.0, RR = 2.0, theta = 1/0.5, duration = 1, power = 0.8, approach = 3)
## examples from Table IV in Zhu and Lakkis (2014)
power.nb.test(mu0 = 5.9/3, RR = 0.4, theta = 0.49, duration = 3, power = 0.9, approach = 1)
power.nb.test(mu0 = 5.9/3, RR = 0.4, theta = 0.49, duration = 3, power = 0.9, approach = 2)
power.nb.test(mu0 = 5.9/3, RR = 0.4, theta = 0.49, duration = 3, power = 0.9, approach = 3)
power.nb.test(mu0 = 13/6, RR = 0.2, theta = 0.52, duration = 6, power = 0.9, approach = 1)
power.nb.test(mu0 = 13/6, RR = 0.2, theta = 0.52, duration = 6, power = 0.9, approach = 2)
power.nb.test(mu0 = 13/6, RR = 0.2, theta = 0.52, duration = 6, power = 0.9, approach = 3)
## see Section 5 of Zhu and Lakkis (2014)
power.nb.test(mu0 = 0.66, RR = 0.8, theta = 1/0.8, duration = 0.9, power = 0.9)
Compute the power of the two-sample Welch t test, or determine parameters to obtain a target power.
power.welch.t.test(n = NULL, delta = NULL, sd1 = 1, sd2 = 1,
                   sig.level = 0.05, power = NULL,
                   alternative = c("two.sided", "one.sided"),
                   strict = FALSE, tol = .Machine$double.eps^0.25)
n |
number of observations (per group) |
delta |
(expected) true difference in means |
sd1 |
(expected) standard deviation of group 1 |
sd2 |
(expected) standard deviation of group 2 |
sig.level |
significance level (Type I error probability) |
power |
power of test (1 minus Type II error probability) |
alternative |
one- or two-sided test. Can be abbreviated. |
strict |
use strict interpretation in two-sided case |
tol |
numerical tolerance used in root finding, the default providing (at least) four significant digits. |
Exactly one of the parameters n, delta, power, sd1, sd2 and sig.level must be passed as NULL, and that parameter is determined from the others. Notice that the last three have non-NULL defaults, so NULL must be explicitly passed if you want to compute them.
If strict = TRUE is used, the power will include the probability of rejection in the opposite direction of the true effect, in the two-sided case. Without this the power will be half the significance level if the true difference is zero.
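The analogous sketch for the Welch test uses the Welch-Satterthwaite degrees of freedom; again an approximation written out for illustration, not necessarily the exact implementation:

welch.pow <- function(n, delta, sd1 = 1, sd2 = 1, sig.level = 0.05){
  v1 <- sd1^2/n; v2 <- sd2^2/n
  df <- (v1 + v2)^2/(v1^2/(n - 1) + v2^2/(n - 1))  # Welch-Satterthwaite
  q <- qt(1 - sig.level/2, df)
  pt(q, df, ncp = delta/sqrt(v1 + v2), lower.tail = FALSE) +
    pt(-q, df, ncp = delta/sqrt(v1 + v2))
}
welch.pow(n = 20, delta = 1)  # close to power.welch.t.test(n = 20, delta = 1)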
Object of class "power.htest", a list of the arguments (including the computed one) augmented with method and note elements.
The function and its documentation were adapted from power.t.test, implemented by Peter Dalgaard and based on previous work by Claus Ekstroem.
uniroot is used to solve the power equation for unknowns, so you may see errors from it, notably about inability to bracket the root when invalid arguments are given.
Matthias Kohl [email protected]
S.L. Jan and G. Shieh (2011). Optimal sample sizes for Welch's test under various allocation and cost considerations. Behav Res Methods, 43, 4:1014-22.
## identical results as power.t.test, since sd = sd1 = sd2 = 1
power.welch.t.test(n = 20, delta = 1)
power.welch.t.test(power = .90, delta = 1)
power.welch.t.test(power = .90, delta = 1, alternative = "one.sided")
## sd1 = 0.5, sd2 = 1
power.welch.t.test(delta = 1, sd1 = 0.5, sd2 = 1, power = 0.9)
## empirical check
M <- 10000
ps <- numeric(M)
for(i in seq_len(M)){
  x <- rnorm(15, mean = 0, sd = 0.5)
  y <- rnorm(15, mean = 1, sd = 1.0)
  ps[i] <- t.test(x, y)$p.value
}
## empirical power
sum(ps < 0.05)/M
The function computes the positive (PPV) and negative predictive value (NPV) given sensitivity, specificity and prevalence (pre-test probability).
predValues(sens, spec, prev)
sens |
numeric vector: sensitivities. |
spec |
numeric vector: specificities. |
prev |
numeric vector: prevalence. |
The function computes the positive (PPV) and negative predictive value (NPV) given sensitivity, specificity and prevalence (pre-test probability).
It is a simple application of Bayes' formula.
One can also specify vectors of length larger than 1 for sensitivity and specificity.
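A minimal sketch of the computation behind predValues, using the values from the ELISA example in the example section:

sens <- 0.999; spec <- 0.998; prev <- 0.001
PPV <- sens*prev/(sens*prev + (1 - spec)*(1 - prev))
NPV <- spec*(1 - prev)/((1 - sens)*prev + spec*(1 - prev))
c(PPV = PPV, NPV = NPV)  # PPV is only about 1/3 despite the excellent test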
Vector or matrix with PPV and NPV.
Matthias Kohl [email protected]
## Example: HIV test
## 1. ELISA screening test (4th generation)
predValues(sens = 0.999, spec = 0.998, prev = 0.001)
## 2. Western-Blot confirmation test
predValues(sens = 0.998, spec = 0.999996, prev = 1/3)
## Example: connection between sensitivity, specificity and PPV
sens <- seq(0.6, 0.99, by = 0.01)
spec <- seq(0.6, 0.99, by = 0.01)
ppv <- function(sens, spec, pre) predValues(sens, spec, pre)[,1]
res <- outer(sens, spec, ppv, pre = 0.1)
image(sens, spec, res, col = terrain.colors(256),
      main = "PPV for prevalence = 10%",
      xlim = c(0.59, 1), ylim = c(0.59, 1))
contour(sens, spec, res, add = TRUE)
Printing objects of class "confint" by a simple print method.
## S3 method for class 'confint'
print(x, digits = getOption("digits"), prefix = "\t", ...)
x |
object of class |
digits |
number of significant digits to be used. |
prefix |
string, passed to |
... |
further arguments to be passed to or from methods. |
A confint object is just a named list of confidence intervals and respective (point) estimates.
the argument x, invisibly, as for all print methods.
x <- rnorm(20)
(CI <- normCI(x))
print(CI, digits = 3)
Produce box-and-whisker plot(s) of the given (grouped) values. In contrast to boxplot, quartiles are used instead of hinges (which are not necessarily quartiles); the rest of the implementation is identical to boxplot.
qboxplot(x, ...)

## S3 method for class 'formula'
qboxplot(formula, data = NULL, ..., subset, na.action = NULL, type = 7)

## Default S3 method:
qboxplot(x, ..., range = 1.5, width = NULL, varwidth = FALSE,
         notch = FALSE, outline = TRUE, names, plot = TRUE,
         border = par("fg"), col = NULL, log = "",
         pars = list(boxwex = 0.8, staplewex = 0.5, outwex = 0.5),
         horizontal = FALSE, add = FALSE, at = NULL, type = 7)
formula |
a formula, such as |
data |
a data.frame (or list) from which the variables in
|
subset |
an optional vector specifying a subset of observations to be used for plotting. |
na.action |
a function which indicates what should happen
when the data contain |
x |
for specifying data from which the boxplots are to be
produced. Either a numeric vector, or a single list containing such
vectors. Additional unnamed arguments specify further data
as separate vectors (each corresponding to a component boxplot).
|
... |
For the default method, unnamed arguments are additional data vectors (unless |
range |
this determines how far the plot whiskers extend out
from the box. If |
width |
a vector giving the relative widths of the boxes making up the plot. |
varwidth |
if |
notch |
if |
outline |
if |
names |
group labels which will be printed under each boxplot. Can be a character vector or an expression (see plotmath). |
boxwex |
a scale factor to be applied to all boxes. When there are only a few groups, the appearance of the plot can be improved by making the boxes narrower. |
staplewex |
staple line width expansion, proportional to box width. |
outwex |
outlier line width expansion, proportional to box width. |
plot |
if |
border |
an optional vector of colors for the outlines of the
boxplots. The values in |
col |
if |
log |
character indicating if x or y or both coordinates should be plotted in log scale. |
pars |
a list of (potentially many) more graphical parameters,
e.g., |
horizontal |
logical indicating if the boxplots should be
horizontal; default |
add |
logical, if true add boxplot to current plot. |
at |
numeric vector giving the locations where the boxplots should
be drawn, particularly when |
type |
an integer between 1 and 9 selecting one of nine quantile
algorithms; for more details see |
The generic function qboxplot currently has a default method (qboxplot.default) and a formula interface (qboxplot.formula).
If multiple groups are supplied either as multiple arguments or via a formula, parallel boxplots will be plotted, in the order of the arguments or the order of the levels of the factor (see factor).
Missing values are ignored when forming boxplots.
List with the following components:
stats |
a matrix, each column contains the extreme of the lower whisker, the lower hinge, the median, the upper hinge and the extreme of the upper whisker for one group/plot. If all the inputs have the same class attribute, so will this component. |
n |
a vector with the number of observations in each group. |
conf |
a matrix where each column contains the lower and upper extremes of the notch. |
out |
the values of any data points which lie beyond the extremes of the whiskers. |
group |
a vector of the same length as |
names |
a vector of names for the groups. |
Matthias Kohl [email protected]
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.
Chambers, J. M., Cleveland, W. S., Kleiner, B. and Tukey, P. A. (1983) Graphical Methods for Data Analysis. Wadsworth & Brooks/Cole.
Murrell, P. (2005) R Graphics. Chapman & Hall/CRC Press.
See also boxplot.stats.
qbxp.stats which does the computation, bxp for the plotting and more examples; and stripchart for an alternative (with small data sets).
## adapted examples from boxplot
## qboxplot on a formula:
qboxplot(count ~ spray, data = InsectSprays, col = "lightgray")
# *add* notches (somewhat funny here):
qboxplot(count ~ spray, data = InsectSprays, notch = TRUE, add = TRUE,
         col = "blue")
qboxplot(decrease ~ treatment, data = OrchardSprays, log = "y",
         col = "bisque")
rb <- qboxplot(decrease ~ treatment, data = OrchardSprays, col = "bisque")
title("Comparing boxplot()s and non-robust mean +/- SD")
mn.t <- tapply(OrchardSprays$decrease, OrchardSprays$treatment, mean)
sd.t <- tapply(OrchardSprays$decrease, OrchardSprays$treatment, sd)
xi <- 0.3 + seq(rb$n)
points(xi, mn.t, col = "orange", pch = 18)
arrows(xi, mn.t - sd.t, xi, mn.t + sd.t, code = 3, col = "pink",
       angle = 75, length = .1)
## boxplot on a matrix:
mat <- cbind(Uni05 = (1:100)/21, Norm = rnorm(100),
             `5T` = rt(100, df = 5), Gam2 = rgamma(100, shape = 2))
qboxplot(as.data.frame(mat),
         main = "qboxplot(as.data.frame(mat), main = ...)")
par(las = 1) # all axis labels horizontal
qboxplot(as.data.frame(mat), main = "boxplot(*, horizontal = TRUE)",
         horizontal = TRUE)
## Using 'at = ' and adding boxplots -- example idea by Roger Bivand:
qboxplot(len ~ dose, data = ToothGrowth, boxwex = 0.25, at = 1:3 - 0.2,
         subset = supp == "VC", col = "yellow",
         main = "Guinea Pigs' Tooth Growth",
         xlab = "Vitamin C dose mg", ylab = "tooth length",
         xlim = c(0.5, 3.5), ylim = c(0, 35), yaxs = "i")
qboxplot(len ~ dose, data = ToothGrowth, add = TRUE, boxwex = 0.25,
         at = 1:3 + 0.2, subset = supp == "OJ", col = "orange")
legend(2, 9, c("Ascorbic acid", "Orange juice"),
       fill = c("yellow", "orange"))
This function works identically to boxplot.stats. It is typically called by another function to gather the statistics necessary for producing box plots, but may be invoked separately.
qbxp.stats(x, coef = 1.5, do.conf = TRUE, do.out = TRUE, type = 7)
x |
a numeric vector for which the boxplot will be constructed
( |
coef |
it determines how far the plot ‘whiskers’ extend out
from the box. If |
do.conf |
logical; if |
do.out |
logical; if |
type |
an integer between 1 and 9 selecting one of nine quantile
algorithms; for more details see |
The notches (if requested) extend to +/-1.58 IQR/sqrt(n).
This seems to be based on the same calculations as the formula with 1.57 in
Chambers et al. (1983, p. 62), given in McGill et al.
(1978, p. 16). They are based on asymptotic normality of the median
and roughly equal sample sizes for the two medians being compared, and
are said to be rather insensitive to the underlying distributions of
the samples. The idea appears to be to give roughly a 95% confidence
interval for the difference in two medians.
List with named components as follows:
stats |
a vector of length 5, containing the extreme of the lower whisker, the first quartile, the median, the third quartile and the extreme of the upper whisker. |
n |
the number of non- |
conf |
the lower and upper extremes of the ‘notch’
( |
out |
the values of any data points which lie beyond the
extremes of the whiskers ( |
Note that $stats and $conf are sorted in increasing order, unlike S, and that $n and $out include any +- Inf values.
Matthias Kohl [email protected]
Tukey, J. W. (1977) Exploratory Data Analysis. Section 2C.
McGill, R., Tukey, J. W. and Larsen, W. A. (1978) Variations of box plots. The American Statistician 32, 12–16.
Velleman, P. F. and Hoaglin, D. C. (1981) Applications, Basics and Computing of Exploratory Data Analysis. Duxbury Press.
Emerson, J. D and Strenio, J. (1983). Boxplots and batch comparison. Chapter 3 of Understanding Robust and Exploratory Data Analysis, eds. D. C. Hoaglin, F. Mosteller and J. W. Tukey. Wiley.
Chambers, J. M., Cleveland, W. S., Kleiner, B. and Tukey, P. A. (1983) Graphical Methods for Data Analysis. Wadsworth & Brooks/Cole.
## adapted example from boxplot.stats
x <- c(1:100, 1000)
(b1 <- qbxp.stats(x))
(b2 <- qbxp.stats(x, do.conf = FALSE, do.out = FALSE))
stopifnot(b1$stats == b2$stats) # do.out = FALSE is still robust
qbxp.stats(x, coef = 3, do.conf = FALSE)
## no outlier treatment:
qbxp.stats(x, coef = 0)
qbxp.stats(c(x, NA)) # slight change: n is 101
(r <- qbxp.stats(c(x, -1:1/0)))
stopifnot(r$out == c(1000, -Inf, Inf))
These functions can be used to compute confidence intervals for quantiles (including median).
quantileCI(x, prob = 0.5, conf.level = 0.95, method = "exact",
           minLength = FALSE, na.rm = FALSE)
medianCI(x, conf.level = 0.95, method = "exact",
         minLength = FALSE, na.rm = FALSE)
madCI(x, conf.level = 0.95, method = "exact",
      minLength = FALSE, na.rm = FALSE, constant = 1.4826)
x |
numeric data vector |
prob |
quantile |
conf.level |
confidence level |
method |
character string specifying which method to use; see details. |
minLength |
logical, see details |
na.rm |
logical, remove |
constant |
scale factor (see |
The exact confidence interval (method = "exact") is computed using binomial probabilities; see Section 6.8.1 in Sachs and Hedderich (2009). If the result is not unique, i.e. there is more than one interval with coverage probability closest to conf.level, then a matrix of confidence intervals is returned.
If minLength = TRUE, an exact confidence interval with minimum length is returned.
The asymptotic confidence interval (method = "asymptotic") is based on the normal approximation of the binomial distribution; see Section 6.8.1 in Sachs and Hedderich (2009).
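The exact approach rests on the fact that the number of observations below the true p-quantile is binomial. A minimal sketch for the median, with symmetric ranks chosen for simplicity (whereas quantileCI searches over all rank pairs):

set.seed(123)
x <- sort(rnorm(25))
n <- length(x)
## coverage of the interval [X_(j), X_(n-j+1)] for the median
cov <- function(j) pbinom(n - j, n, 0.5) - pbinom(j - 1, n, 0.5)
j <- max(which(sapply(1:floor(n/2), cov) >= 0.95))
c(lower = x[j], upper = x[n - j + 1])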
A list with components
estimate |
the sample quantile. |
CI |
a confidence interval for the sample quantile. |
Matthias Kohl [email protected]
L. Sachs and J. Hedderich (2009). Angewandte Statistik. Springer.
## To get a non-trivial exact confidence interval for the median
## one needs at least 6 observations
set.seed(123)
x <- rnorm(8)
## exact confidence interval not unique
medianCI(x)
madCI(x)
## minimum length exact confidence interval
medianCI(x, minLength = TRUE)
madCI(x, minLength = TRUE)
## asymptotic confidence interval
medianCI(x, method = "asymptotic")
madCI(x, method = "asymptotic")
## confidence interval for quantiles
quantileCI(x, prob = 0.4)
quantileCI(x, prob = 0.6)
Compute the mean of replicated spots, where spot flags may additionally be incorporated.
repMeans(x, flags, use.flags = NULL, ndups, spacing, method, ...)
x |
matrix or data.frame of expression values |
flags |
matrix or data.frame of spot flags; must have same dimension as |
use.flags |
should flags be included and in which way; cf. section details |
ndups |
integer, number of replicates on chip. The number of rows of
|
spacing |
the spacing between the rows of 'x' corresponding to
replicated spots, |
method |
function to aggregate the replicated spots. If missing, the mean is used. |
... |
optional arguments to |
The incorporation of spot flags is controlled via argument use.flags.
NULL: flags are not used; the minimum flag value of the replicated spots is returned.
"max": only spots with flag value equal to the maximum flag value of the replicated spots are used.
"median": only spots with flag values larger or equal to the median of the replicated spots are used.
"mean": only spots with flag values larger or equal to the mean of the replicated spots are used.
List with components
exprs |
mean of expression values |
flags |
flags for mean expression values |
A first version of this function appeared in package SLmisc.
Matthias Kohl [email protected]
## only a dummy example
M <- matrix(rnorm(1000), ncol = 10)
FL <- matrix(rpois(1000, lambda = 10), ncol = 10) # only for this example
res <- repMeans(x = M, flags = FL, use.flags = "max", ndups = 5,
                spacing = 20)
The function computes relative risk (RR), odds ratio (OR), and several other risk measures; see details.
risks(p0, p1)
p0 |
numeric vector of length 1: incidence of the outcome of interest in the nonexposed group. |
p1 |
numeric vector of length 1: incidence of the outcome of interest in the exposed group. |
The function computes relative risk (RR), odds ratio (OR), relative risk reduction (RRR) or relative risk increase (RRI), absolute risk reduction (ARR) or absolute risk increase (ARI), and number needed to treat (NNT) or number needed to harm (NNH).
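The individual measures are simple functions of p0 and p1. A minimal sketch for the risk-reduction case (p1 < p0), with values from the Wikipedia example:

p0 <- 0.4; p1 <- 0.1
RR  <- p1/p0                       # relative risk
OR  <- (p1/(1 - p1))/(p0/(1 - p0)) # odds ratio
RRR <- (p0 - p1)/p0                # relative risk reduction
ARR <- p0 - p1                     # absolute risk reduction
NNT <- 1/ARR                       # number needed to treat
c(RR = RR, OR = OR, RRR = RRR, ARR = ARR, NNT = NNT)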
Vector including several risk measures.
Matthias Kohl [email protected]
See for instance: Relative risk. (2016, November 4). In Wikipedia, The Free Encyclopedia. Retrieved 19:58, November 4, 2016, from https://en.wikipedia.org/w/index.php?title=Relative_risk&oldid=747857409
## See worked example in Wikipedia
risks(p0 = 0.4, p1 = 0.1)
risks(p0 = 0.4, p1 = 0.5)
The function computes an approximate confidence interval for the relative risk (RR).
rrCI(a, b, c, d, conf.level = 0.95)
a |
integer: events in exposed group. |
b |
integer: non-events in exposed group. |
c |
integer: events in non-exposed group. |
d |
integer: non-events in non-exposed group. |
conf.level |
numeric: confidence level |
The function computes an approximate confidence interval for the relative risk (RR) based on the normal approximation; see Jewell (2004).
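A hedged sketch of the usual normal approximation on the log scale; this is the standard textbook interval, and whether rrCI uses exactly this variance formula is an assumption here:

a <- 15; b <- 135; c <- 100; d <- 150
rr <- (a/(a + b))/(c/(c + d))
se <- sqrt(1/a - 1/(a + b) + 1/c - 1/(c + d))  # SE of log(RR)
exp(log(rr) + qnorm(c(0.025, 0.975))*se)       # approx. 95% interval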
A list with class "confint" containing the following components:
estimate |
the estimated relative risk. |
conf.int |
a confidence interval for the relative risk. |
Matthias Kohl [email protected]
Jewell, Nicholas P. (2004). Statistics for epidemiology. Chapman & Hall/CRC.
Relative risk. (2016, November 4). In Wikipedia, The Free Encyclopedia. Retrieved 19:58, November 4, 2016, from https://en.wikipedia.org/w/index.php?title=Relative_risk&oldid=747857409
## See worked example in Wikipedia
rrCI(a = 15, b = 135, c = 100, d = 150)
rrCI(a = 75, b = 75, c = 100, d = 150)
The function simulates a pair of correlated variables.
simCorVars(n, r, plot = TRUE)
n |
integer: sample size. |
r |
numeric: correlation. |
plot |
logical: generate scatter plot of the variables. |
The function is mainly for teaching purposes and simulates n observations from a pair of normally distributed variables with correlation r.
By specifying plot = TRUE a scatter plot of the data is generated.
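A common construction for such pairs, shown as a sketch (assuming, for illustration, that the simulation works like this internally):

n <- 100; r <- 0.8
z1 <- rnorm(n); z2 <- rnorm(n)
Var1 <- z1
Var2 <- r*z1 + sqrt(1 - r^2)*z2  # correlation with Var1 is r (in expectation)
cor(Var1, Var2)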
data.frame with entries Var1 and Var2.
Matthias Kohl [email protected]
res <- simCorVars(n = 100, r = 0.8)
cor(res$Var1, res$Var2)
Plot of similarity matrix.
simPlot(x, col, minVal, labels = FALSE, lab.both.axes = FALSE,
        labcols = "black", title = "", cex.title = 1.2,
        protocol = FALSE, cex.axis = 0.8, cex.axis.bar = 1,
        signifBar = 2, ...)
x |
square data matrix. |
col |
color palette for image. If missing, the |
minVal |
numeric, minimum value which is displayed by a color; used to adjust |
labels |
vector of character strings to be placed at the tickpoints,
labels for the columns of |
lab.both.axes |
logical, display labels on both axes |
labcols |
colors to be used for the labels of the columns of |
title |
character string, overall title for the plot. |
cex.title |
A numerical value giving the amount by which plotting text
and symbols should be magnified relative to the default;
cf. |
protocol |
logical, display color bar without numbers |
cex.axis |
The magnification to be used for axis annotation relative to the
current setting of 'cex'; cf. |
cex.axis.bar |
The magnification to be used for axis annotation of the color
bar relative to the current setting of 'cex'; cf.
|
signifBar |
integer indicating the precision to be used for the bar. |
... |
graphical parameters may also be supplied as arguments to the
function (see |
This function generates a so-called similarity plot. If min(x) is smaller than minVal, the colors in col are adjusted such that the minimum value which is color coded is equal to minVal.
invisible()
The function is a slight modification of function corPlot of package MKmisc.
Matthias Kohl [email protected]
Sandrine Dudoit, Yee Hwa (Jean) Yang, Benjamin Milo Bolstad and with
contributions from Natalie Thorne, Ingrid Loennstedt and Jessica Mar.
sma: Statistical Microarray Analysis.
http://www.stat.berkeley.edu/users/terry/zarray/Software/smacode.html
## only a dummy example
M <- matrix(rnorm(1000), ncol = 20)
colnames(M) <- paste("Sample", 1:20)
M.cor <- cor(M)
simPlot(M.cor, minVal = min(M.cor))
simPlot(M.cor, minVal = min(M.cor), lab.both.axes = TRUE)
simPlot(M.cor, minVal = min(M.cor), protocol = TRUE)
simPlot(M.cor, minVal = min(M.cor), signifBar = 1)
The functions compute SNR as well as two robust versions of the SNR.
SNR(x, na.rm = FALSE)
x |
numeric vector. |
na.rm |
logical. Should missing values be removed? |
The functions compute the (classical) signal-to-noise ratio (SNR) as well as two robust variants.
medSNR uses the (standardized) MAD instead of the SD and the median instead of the mean.
iqrSNR uses the (standardized) IQR instead of the SD and the median instead of the mean.
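A minimal sketch of the three estimators, assuming the standardized IQR (IQR divided by 2*qnorm(0.75)) as the robust scale for the IQR variant:

x <- rnorm(100, mean = 10, sd = 2)
mean(x)/sd(x)                      # classical SNR
median(x)/mad(x)                   # medSNR: median and standardized MAD
median(x)/(IQR(x)/(2*qnorm(0.75))) # iqrSNR: median and standardized IQR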
SNR value.
Matthias Kohl [email protected]
C.N.P.G. Arachchige, L.A. Prendergast and R.G. Staudte. Robust analogues to the Coefficient of Variation. https://arxiv.org/abs/1907.01110.
## 5% outliers
out <- rbinom(100, prob = 0.05, size = 1)
sum(out)
x <- (1-out)*rnorm(100, mean = 10, sd = 2) + out*25
SNR(x)
medSNR(x)
iqrSNR(x)
Calculate sample size for training set in developing classifiers using high dimensional data. The calculation is based on the probability of correct classification (PCC).
ssize.pcc(gamma, stdFC, prev = 0.5, nrFeatures, sigFeatures = 20,
          verbose = FALSE)
gamma |
tolerance between PCC(infty) and PCC(n). |
stdFC |
expected standardized fold-change; that is, expected fold-change divided by the within-class standard deviation. |
prev |
expected prevalence. |
nrFeatures |
number of features (variables) considered. |
sigFeatures |
number of significant features; default (20) should be sufficient for most if not all cases. |
verbose |
print intermediate results. |
The computations are based on the algorithm provided in Section 4.2 of Dobbin and Simon (2007). Prevalence is incorporated by the simple rough approach given in Section 4.4 (ibid.).
The results for prevalence equal to 50% are identical to the numbers computed by https://brb.nci.nih.gov/brb/samplesize/samplesize4GE.html. For other prevalences the numbers differ and are larger for our implementation.
Object of class "power.htest", a list of the arguments (including the computed one) augmented with method and note elements.
optimize is used to solve equation (4.3) of Dobbin and Simon (2007), so you may see errors from it.
Matthias Kohl [email protected]
K. Dobbin and R. Simon (2007). Sample size planning for developing classifiers using high-dimensional DNA microarray data. Biostatistics, 8(1):101-117.
K. Dobbin, Y. Zhao, R. Simon (2008). How Large a Training Set is Needed to Develop a Classifier for Microarray Data? Clin Cancer Res., 14(1):108-114.
## see Table 2 of Dobbin et al. (2008)
g <- 0.1
fc <- 1.6
ssize.pcc(gamma = g, stdFC = fc, nrFeatures = 22000)
## see Table 3 of Dobbin et al. (2008)
g <- 0.05
fc <- 1.1
ssize.pcc(gamma = g, stdFC = fc, nrFeatures = 22000)
The function can be used to compute distances between strings.
stringDist(x, y, method = "levenshtein", mismatch = 1, gap = 1)
x |
character vector, first string |
y |
character vector, second string |
method |
character, name of the distance method. This must be
|
mismatch |
numeric, distance value for a mismatch between symbols |
gap |
numeric, distance value for inserting a gap |
The function computes the Hamming and the Levenshtein (edit) distance of two given strings (sequences). In case of the Hamming distance the two strings must have the same length. In case of the Levenshtein (edit) distance a scoring and a trace-back matrix are computed and are saved as attributes "ScoringMatrix" and "TraceBackMatrix". The characters in the trace-back matrix reflect insertion of a gap in string y (d: deletion), match (m), mismatch (mm), and insertion of a gap in string x (i).
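With unit mismatch and gap costs, the result can be cross-checked against base R's adist, which also computes the Levenshtein distance:

adist("GACGGATTATG", "GATCGGAATAG")  # should agree with stringDist defaults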
stringDist returns an object of S3 class "stringDist" inherited from class "dist"; cf. dist.
The function is mainly for teaching purposes.
For distances between strings and string alignments see also Bioconductor package Biostrings.
Matthias Kohl [email protected]
R. Merkl and S. Waack (2009). Bioinformatik Interaktiv. Wiley.
x <- "GACGGATTATG" y <- "GATCGGAATAG" ## Levenshtein distance d <- stringDist(x, y) d attr(d, "ScoringMatrix") attr(d, "TraceBackMatrix") ## Hamming distance stringDist(x, y)
x <- "GACGGATTATG" y <- "GATCGGAATAG" ## Levenshtein distance d <- stringDist(x, y) d attr(d, "ScoringMatrix") attr(d, "TraceBackMatrix") ## Hamming distance stringDist(x, y)
The function can be used to compute similarity scores between strings.
stringSim(x, y, global = TRUE, match = 1, mismatch = -1, gap = -1, minSim = 0)
x | character vector, first string. |
y | character vector, second string. |
global | logical; global or local alignment. |
match | numeric, score for a match between symbols. |
mismatch | numeric, score for a mismatch between symbols. |
gap | numeric, penalty for inserting a gap. |
minSim | numeric, used as the required minimum score in case of local alignments. |
The function computes optimal alignment scores for global (Needleman-Wunsch) and local (Smith-Waterman) alignments with constant gap penalties.
Scoring and trace-back matrices are computed and saved in the form of attributes "ScoringMatrix" and "TraceBackMatrix". The characters in the trace-back matrix reflect insertion of a gap in string y (d: deletion), match (m), mismatch (mm), and insertion of a gap in string x (i). In addition, stop indicates that the minimum similarity score has been reached.
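As a sketch of the scoring arguments, one might reward matches more strongly and penalize gaps more heavily than the defaults; the scores below are chosen purely for illustration:

## custom scoring scheme: match +2, mismatch -1, gap -2
stringSim("GACGGATTATG", "GATCGGAATAG", match = 2, mismatch = -1, gap = -2)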
stringSim returns an object of S3 class "stringSim" inherited from class "dist"; cf. dist.
The function is mainly for teaching purposes.
For distances between strings and string alignments see also Bioconductor package Biostrings.
Matthias Kohl [email protected]
R. Merkl and S. Waack (2009). Bioinformatik Interaktiv. Wiley.
x <- "GACGGATTATG" y <- "GATCGGAATAG" ## optimal global alignment score d <- stringSim(x, y) d attr(d, "ScoringMatrix") attr(d, "TraceBackMatrix") ## optimal local alignment score d <- stringSim(x, y, global = FALSE) d attr(d, "ScoringMatrix") attr(d, "TraceBackMatrix")
x <- "GACGGATTATG" y <- "GATCGGAATAG" ## optimal global alignment score d <- stringSim(x, y) d attr(d, "ScoringMatrix") attr(d, "TraceBackMatrix") ## optimal local alignment score d <- stringSim(x, y, global = FALSE) d attr(d, "ScoringMatrix") attr(d, "TraceBackMatrix")
The function computes and plots TSH, fT3 and fT4 values with respect to the provided reference range.
thyroid(TSH, fT3, fT4, TSHref, fT3ref, fT4ref)
TSH | numeric vector of length 1: measured TSH concentration. |
fT3 | numeric vector of length 1: measured fT3 concentration. |
fT4 | numeric vector of length 1: measured fT4 concentration. |
TSHref | numeric vector of length 2: reference range for TSH. |
fT3ref | numeric vector of length 2: reference range for fT3. |
fT4ref | numeric vector of length 2: reference range for fT4. |
A simple function that computes the values of the measurements relative to the provided reference ranges and visualizes them using a barplot. Relative values between 40% and 60% are marked as O.K.
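The relative value is presumably the position of the measurement within its reference range, expressed in percent; a minimal sketch of that computation (the helper relVal is hypothetical and not part of the package):

## hypothetical helper: position of a measurement within its reference range, in percent
relVal <- function(x, ref) 100 * (x - ref[1]) / (ref[2] - ref[1])
relVal(1.5, c(0.2, 3.0))  ## TSH example: about 46%, i.e. O.K.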
Invisible data.frame with the relative values.
Matthias Kohl [email protected]
thyroid(TSH = 1.5, fT3 = 2.5, fT4 = 14,
        TSHref = c(0.2, 3.0), fT3ref = c(1.7, 4.2), fT4ref = c(7.6, 15.0))
The function computes an optimal global or local alignment based on a trace-back matrix as provided by function stringDist or stringSim.
traceBack(D, global = TRUE)
D | object of class "stringDist" or "stringSim". |
global | logical, global or local alignment. |
Computes one possible optimal global or local alignment based on the trace-back matrix saved in an object of class "stringDist" or "stringSim".
matrix: pairwise global/local alignment
The function is mainly for teaching purposes.
For distances between strings and string alignments see Bioconductor package Biostrings.
Matthias Kohl [email protected]
R. Merkl and S. Waack (2009). Bioinformatik Interaktiv. Wiley.
x <- "GACGGATTATG" y <- "GATCGGAATAG" ## Levenshtein distance d <- stringDist(x, y) ## optimal global alignment traceBack(d) ## Optimal global alignment score d <- stringSim(x, y) ## optimal global alignment traceBack(d) ## Optimal local alignment score d <- stringSim(x, y, global = FALSE) ## optimal local alignment traceBack(d, global = FALSE)
x <- "GACGGATTATG" y <- "GATCGGAATAG" ## Levenshtein distance d <- stringDist(x, y) ## optimal global alignment traceBack(d) ## Optimal global alignment score d <- stringSim(x, y) ## optimal global alignment traceBack(d) ## Optimal local alignment score d <- stringSim(x, y, global = FALSE) ## optimal local alignment traceBack(d, global = FALSE)
The functions generate new transformations for the generalized logarithm and the negative logarithm that can be used for transforming the axes in ggplot2 plots.
glog_trans(base = exp(1))
glog10_trans()
glog2_trans()
scale_y_glog(...)
scale_x_glog(...)
scale_y_glog10(...)
scale_x_glog10(...)
scale_y_glog2(...)
scale_x_glog2(...)
neglog_breaks(n = 5, base = 10)
neglog_trans(base = exp(1))
neglog10_trans()
neglog2_trans()
scale_y_neglog(...)
scale_x_neglog(...)
scale_y_neglog10(...)
scale_x_neglog10(...)
scale_y_neglog2(...)
scale_x_neglog2(...)
base | a positive or complex number: the base with respect to which generalized and negative logarithms are computed. Defaults to e = exp(1). |
... | arguments passed on to scale_(x|y)_continuous. |
n | desired number of breaks. |
The functions can be used to transform axes in ggplot2 plots. The implementation is analogous to, e.g., scale_y_log10. The negative logarithm is, for instance, of use in the case of p values (e.g., volcano plots). The functions were adapted from packages scales and ggplot2.
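As a quick sketch of the negative-logarithm scales with made-up p values (small p values appear at the top of the plot):

library(ggplot2)
## hypothetical data: 50 made-up p values
df <- data.frame(idx = 1:50, p = runif(50)^3)
ggplot(df, aes(idx, p)) + geom_point() + scale_y_neglog10()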
A transformation.
Matthias Kohl [email protected]
H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
library(ggplot2)
data(mpg)
p1 <- ggplot(mpg, aes(displ, hwy)) + geom_point()
p1
p1 + scale_x_log10()
p1 + scale_x_glog10()
p1 + scale_y_log10()
p1 + scale_y_glog10()

## A volcano plot
x <- matrix(rnorm(1000, mean = 10), nrow = 10)
g1 <- rep("control", 10)
y1 <- matrix(rnorm(500, mean = 11.25), nrow = 10)
y2 <- matrix(rnorm(500, mean = 9.75), nrow = 10)
g2 <- rep("treatment", 10)
group <- factor(c(g1, g2))
Data <- rbind(x, cbind(y1, y2))
pvals <- apply(Data, 2, function(x, group) t.test(x ~ group)$p.value,
               group = group)
## compute log-fold change
logfc <- function(x, group){
  res <- tapply(x, group, mean)
  log2(res[1]/res[2])
}
lfcs <- apply(Data, 2, logfc, group = group)
ps <- data.frame(pvals = pvals, logfc = lfcs)
ggplot(ps, aes(x = logfc, y = pvals)) + geom_point() +
  geom_hline(yintercept = 0.05) + scale_y_neglog10() +
  geom_vline(xintercept = c(-0.1, 0.1)) +
  xlab("log-fold change") + ylab("-log10(p value)") +
  ggtitle("A Volcano Plot")
This function is a slight modification of function Anova of package "genefilter".
twoWayAnova(cov1, cov2, interaction, na.rm = TRUE)
cov1 | the first covariate; it must have length equal to the number of columns of the array that the result of twoWayAnova is applied to. |
cov2 | the second covariate; it must have length equal to the number of columns of the array that the result of twoWayAnova is applied to. |
interaction | logical, should the interaction be considered. |
na.rm | a logical value indicating whether NA values should be stripped before the computation proceeds. |
The function returned by twoWayAnova uses lm to fit a linear model of the form lm(x ~ cov1*cov2), where x is the set of gene expressions. The F statistics for the main effects and the interaction are computed and the corresponding p values are returned.
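Conceptually, for a single expression vector the returned function should behave roughly like extracting the p values from a standard two-way ANOVA table; a minimal sketch of that equivalence, written from the description above rather than from the package source:

## manual two-way ANOVA p values for one expression vector x
cov1 <- factor(c(rep(1, 6), rep(2, 6)))
cov2 <- factor(rep(c(rep(1, 3), rep(2, 3)), 2))
x <- rnorm(12)
fit <- anova(lm(x ~ cov1 * cov2))
fit[["Pr(>F)"]][1:3]  ## p values for cov1, cov2, and the interaction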
twoWayAnova returns a function with bindings for cov1 and cov2 that will perform a two-way ANOVA.
A first version of this function appeared in package SLmisc.
Matthias Kohl [email protected]
R. Gentleman, V. Carey, W. Huber and F. Hahne (2006). genefilter: methods for filtering genes from microarray experiments. R package version 1.13.7.
set.seed(123)
af1 <- twoWayAnova(c(rep(1,6), rep(2,6)), rep(c(rep(1,3), rep(2,3)), 2))
af2 <- twoWayAnova(c(rep(1,6), rep(2,6)), rep(c(rep(1,3), rep(2,3)), 2),
                   interaction = FALSE)
x <- matrix(rnorm(12*10), nrow = 10)
apply(x, 1, af1)
apply(x, 1, af2)