Title: | Descriptive Statistics |
---|---|
Description: | Computation of standardized interquartile range (IQR), Huber-type skipped mean (Hampel (1985), <doi:10.2307/1268758>), robust coefficient of variation (CV) (Arachchige et al. (2019), <arXiv:1907.01110>), robust signal to noise ratio (SNR), z-score, standardized mean difference (SMD), as well as functions that support graphical visualization such as boxplots based on quartiles (not hinges), negative logarithms and generalized logarithms for 'ggplot2' (Wickham (2016), ISBN:978-3-319-24277-4). |
Authors: | Matthias Kohl [aut, cre] (0000-0001-9514-8910) |
Maintainer: | Matthias Kohl <[email protected]> |
License: | LGPL-3 |
Version: | 0.9 |
Built: | 2024-11-20 03:49:56 UTC |
Source: | https://github.com/stamats/mkdescr |
Computation of standardized interquartile range (IQR), Huber-type skipped mean (Hampel (1985), <doi:10.2307/1268758>), robust coefficient of variation (CV) (Arachchige et al. (2019), <arXiv:1907.01110>), robust signal to noise ratio (SNR), z-score, standardized mean difference (SMD), as well as functions that support graphical visualization such as boxplots based on quartiles (not hinges), negative logarithms and generalized logarithms for 'ggplot2' (Wickham (2016), ISBN:978-3-319-24277-4).
library(MKdescr)
Matthias Kohl https://www.stamats.de
Maintainer: Matthias Kohl [email protected]
The functions compute coefficient of variation (CV) as well as two robust versions of the CV.
CV(x, na.rm = FALSE) medCV(x, na.rm = FALSE, constant = 1/qnorm(0.75)) iqrCV(x, na.rm = FALSE, type = 7, constant = 2*qnorm(0.75))
CV(x, na.rm = FALSE) medCV(x, na.rm = FALSE, constant = 1/qnorm(0.75)) iqrCV(x, na.rm = FALSE, type = 7, constant = 2*qnorm(0.75))
x |
numeric vector with positive numbers. |
na.rm |
logical. Should missing values be removed? |
type |
an integer between 1 and 9 selecting one of nine quantile
algorithms; for more details see |
constant |
The functions compute the (classical) CV as well as two robust variants.
medCV
uses the (standardized) MAD instead of SD and median instead of mean.
iqrCV
uses the (standardized) IQR instead of SD and median instead of mean.
CV value.
Matthias Kohl [email protected]
C.N.P.G. Arachchige, L.A. Prendergast and R.G. Staudte. Robust analogues to the Coefficient of Variation. https://arxiv.org/abs/1907.01110.
## 5% outliers out <- rbinom(100, prob = 0.05, size = 1) sum(out) x <- (1-out)*rnorm(100, mean = 10, sd = 2) + out*25 CV(x) medCV(x) iqrCV(x)
## 5% outliers out <- rbinom(100, prob = 0.05, size = 1) sum(out) x <- (1-out)*rnorm(100, mean = 10, sd = 2) + out*25 CV(x) medCV(x) iqrCV(x)
Function to compute five-number summaries (minimum, 1st quartile, median, 3rd quartile, maximum)
fiveNS(x, na.rm = TRUE, type = 7)
fiveNS(x, na.rm = TRUE, type = 7)
x |
numeric vector |
na.rm |
logical; remove |
type |
an integer between 1 and 9 selecting one of nine quantile
algorithms; for more details see |
In contrast to fivenum
the functions computes the
first and third quartile using function quantile
.
A numeric vector of length 5 containing the summary information.
Matthias Kohl [email protected]
x <- rnorm(100) fiveNS(x) fiveNS(x, type = 2) fivenum(x)
x <- rnorm(100) fiveNS(x) fiveNS(x, type = 2) fivenum(x)
The functions compute the generalized logarithm, which is more or less identical to the area hyperbolic sine, and their inverse; see details.
glog(x, base = exp(1)) glog10(x) glog2(x) inv.glog(x, base = exp(1)) inv.glog10(x) inv.glog2(x)
glog(x, base = exp(1)) glog10(x) glog2(x) inv.glog(x, base = exp(1)) inv.glog10(x) inv.glog2(x)
x |
a numeric or complex vector. |
base |
a positive or a positive or complex number: the base with respect to which logarithms are computed. Defaults to e=exp(1). |
The function computes
where the first part corresponds to the area hyperbolic sine. Subtracting log(2) makes the function asymptotically identical to the logarithm.
A vector of the same length as x containing the transformed values.
Matthias Kohl [email protected]
Johnson N. L. (1949). Systems of Frequency Curves Generated by Methods of Translation. Biometrika, 36(1/2):149-176. doi:10.2307/2332539.
curve(log, from = -3, to = 5) curve(glog, from = -3, to = 5, add = TRUE, col = "orange") legend("topleft", fill = c("black", "orange"), legend = c("log", "glog")) curve(log10(x), from = -3, to = 5) curve(glog10(x), from = -3, to = 5, add = TRUE, col = "orange") legend("topleft", fill = c("black", "orange"), legend = c("log10", "glog10")) inv.glog(glog(10)) inv.glog(glog(10, base = 3), base = 3) inv.glog10(glog10(10)) inv.glog2(glog2(10))
curve(log, from = -3, to = 5) curve(glog, from = -3, to = 5, add = TRUE, col = "orange") legend("topleft", fill = c("black", "orange"), legend = c("log", "glog")) curve(log10(x), from = -3, to = 5) curve(glog10(x), from = -3, to = 5, add = TRUE, col = "orange") legend("topleft", fill = c("black", "orange"), legend = c("log10", "glog10")) inv.glog(glog(10)) inv.glog(glog(10, base = 3), base = 3) inv.glog10(glog10(10)) inv.glog2(glog2(10))
Function to illustrate the computation of box-and-whisker plots.
illustrate.boxplot(x)
illustrate.boxplot(x)
x |
numeric vector |
The function visualizes the computation of box-and-whisker plots.
An invisible object of class ggplot
.
Matthias Kohl [email protected]
set.seed(123) illustrate.boxplot(rt(50, df = 5)) illustrate.boxplot(rnorm(50, mean = 3, sd = 2))
set.seed(123) illustrate.boxplot(rt(50, df = 5)) illustrate.boxplot(rnorm(50, mean = 3, sd = 2))
Function to illustrate the computation of quantiles.
illustrate.quantile(x, alpha, type)
illustrate.quantile(x, alpha, type)
x |
numeric vector |
alpha |
numeric value in the interval (0,1). |
type |
integer values between 1 and 9 selecting one or several of
nine quantile algorithms; for more details see
|
The function visualizes the computation of alpha-quantiles.
An invisible object of class ggplot
.
Matthias Kohl [email protected]
x <- 1:10 illustrate.quantile(x, alpha = 0.15) illustrate.quantile(x, alpha = 0.5) illustrate.quantile(x, alpha = 0.8, type = 2) illustrate.quantile(x, alpha = 0.8, type = c(2, 7)) illustrate.quantile(x = rnorm(20), alpha = 0.95) illustrate.quantile(x = rnorm(21), alpha = 0.95)
x <- 1:10 illustrate.quantile(x, alpha = 0.15) illustrate.quantile(x, alpha = 0.5) illustrate.quantile(x, alpha = 0.8, type = 2) illustrate.quantile(x, alpha = 0.8, type = c(2, 7)) illustrate.quantile(x = rnorm(20), alpha = 0.95) illustrate.quantile(x = rnorm(21), alpha = 0.95)
Computes (standardized) interquartile range of the x
values.
IQrange(x, na.rm = FALSE, type = 7) sIQR(x, na.rm = FALSE, type = 7, constant = 2*qnorm(0.75))
IQrange(x, na.rm = FALSE, type = 7) sIQR(x, na.rm = FALSE, type = 7, constant = 2*qnorm(0.75))
x |
a numeric vector. |
na.rm |
logical. Should missing values be removed? |
type |
an integer between 1 and 9 selecting one of nine quantile
algorithms; for more details see |
constant |
standardizing contant; see details below. |
This function IQrange
computes quartiles as
IQR(x) = quantile(x,3/4) - quantile(x,1/4)
.
The function is identical to function IQR
. It was added
before the type
argument was introduced to function IQR
in 2010 (r53643, r53644).
For normally distributed
, the expected value of
IQR(X)
is 2*qnorm(3/4) = 1.3490
, i.e., for a normal-consistent
estimate of the standard deviation, use IQR(x) / 1.349
. This is implemented
in function sIQR
(standardized IQR).
Matthias Kohl [email protected]
Tukey, J. W. (1977). Exploratory Data Analysis. Reading: Addison-Wesley.
IQrange(rivers) ## identical to IQR(rivers) ## other quantile algorithms IQrange(rivers, type = 4) IQrange(rivers, type = 5) ## standardized IQR sIQR(rivers) ## right-skewed data distribution sd(rivers) mad(rivers) ## for normal data x <- rnorm(100) sd(x) sIQR(x) mad(x)
IQrange(rivers) ## identical to IQR(rivers) ## other quantile algorithms IQrange(rivers, type = 4) IQrange(rivers, type = 5) ## standardized IQR sIQR(rivers) ## right-skewed data distribution sd(rivers) mad(rivers) ## for normal data x <- rnorm(100) sd(x) sIQR(x) mad(x)
Computes (standardized) mean absolute deviation.
meanAD(x, na.rm = FALSE, constant = sqrt(pi/2))
meanAD(x, na.rm = FALSE, constant = sqrt(pi/2))
x |
a numeric vector. |
na.rm |
logical. Should missing values be removed? |
constant |
standardizing contant; see details below. |
The mean absolute deviation is a consistent estimator of
for the standard deviation of
a normal distribution. Under minor deviations of the normal distributions
its asymptotic variance is smaller than that of the sample standard
deviation (Tukey (1960)).
It works well under the assumption of symmetric, where mean and median
coincide. Under the normal distribution it's about 18% more efficient
(asymptotic relative efficiency) than the median absolute deviation
((1/qnorm(0.75))/sqrt(pi/2)
) and about 12% less efficient than the
sample standard deviation (Tukey (1960)).
Matthias Kohl [email protected]
Tukey, J. W. (1960). A survey of sampling from contaminated distribution. In Olkin, I., editor, Contributions to Probability and Statistics. Essays in Honor of H. Hotelling., pages 448-485. Stanford University Press.
## right skewed data ## mean absolute deviation meanAD(rivers) ## standardized IQR sIQR(rivers) ## median absolute deviation mad(rivers) ## sample standard deviation sd(rivers) ## for normal data x <- rnorm(100) sd(x) sIQR(x) mad(x) meanAD(x) ## Asymptotic relative efficiency for Tukey's symmetric gross-error model ## (1-eps)*Norm(mean, sd = sigma) + eps*Norm(mean, sd = 3*sigma) eps <- seq(from = 0, to = 1, by = 0.001) ARE <- function(eps){ 0.25*((3*(1+80*eps))/((1+8*eps)^2)-1)/(pi*(1+8*eps)/(2*(1+2*eps)^2)-1) } plot(eps, ARE(eps), type = "l", xlab = "Proportion of gross-errors", ylab = "Asymptotic relative efficiency", main = "ARE of mean absolute deviation w.r.t. sample standard deviation") abline(h = 1.0, col = "red") text(x = 0.5, y = 1.5, "Mean absolute deviation is better", col = "red", cex = 1, font = 1) ## lower bound of interval uniroot(function(x){ ARE(x)-1 }, interval = c(0, 0.002)) ## upper bound of interval uniroot(function(x){ ARE(x)-1 }, interval = c(0.5, 0.55)) ## worst case optimize(ARE, interval = c(0,1), maximum = TRUE)
## right skewed data ## mean absolute deviation meanAD(rivers) ## standardized IQR sIQR(rivers) ## median absolute deviation mad(rivers) ## sample standard deviation sd(rivers) ## for normal data x <- rnorm(100) sd(x) sIQR(x) mad(x) meanAD(x) ## Asymptotic relative efficiency for Tukey's symmetric gross-error model ## (1-eps)*Norm(mean, sd = sigma) + eps*Norm(mean, sd = 3*sigma) eps <- seq(from = 0, to = 1, by = 0.001) ARE <- function(eps){ 0.25*((3*(1+80*eps))/((1+8*eps)^2)-1)/(pi*(1+8*eps)/(2*(1+2*eps)^2)-1) } plot(eps, ARE(eps), type = "l", xlab = "Proportion of gross-errors", ylab = "Asymptotic relative efficiency", main = "ARE of mean absolute deviation w.r.t. sample standard deviation") abline(h = 1.0, col = "red") text(x = 0.5, y = 1.5, "Mean absolute deviation is better", col = "red", cex = 1, font = 1) ## lower bound of interval uniroot(function(x){ ARE(x)-1 }, interval = c(0, 0.002)) ## upper bound of interval uniroot(function(x){ ARE(x)-1 }, interval = c(0.5, 0.55)) ## worst case optimize(ARE, interval = c(0,1), maximum = TRUE)
The function transforms a given data.frame form wide to long form.
melt.long(data, select, group)
melt.long(data, select, group)
data |
data.frame that shall be transformed. |
select |
optional integer vector to select a subset of the columns of |
group |
optional vector to include an additional grouping in the output; for more details see examples below. |
The function transforms a given data.frame form wide to long form. This is for example useful for plotting with ggplot2.
data.frame in long form.
Matthias Kohl [email protected]
library(ggplot2) ## some random data test <- data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10)) test.long <- melt.long(test) test.long ggplot(test.long, aes(x = variable, y = value)) + geom_boxplot(aes(fill = variable)) ## introducing an additional grouping variable group <- factor(rep(c("a","b"), each = 5)) test.long.gr <- melt.long(test, select = 1:2, group = group) test.long.gr ggplot(test.long.gr, aes(x = variable, y = value, fill = group)) + geom_boxplot()
library(ggplot2) ## some random data test <- data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10)) test.long <- melt.long(test) test.long ggplot(test.long, aes(x = variable, y = value)) + geom_boxplot(aes(fill = variable)) ## introducing an additional grouping variable group <- factor(rep(c("a","b"), each = 5)) test.long.gr <- melt.long(test, select = 1:2, group = group) test.long.gr ggplot(test.long.gr, aes(x = variable, y = value, fill = group)) + geom_boxplot()
Produce box-and-whisker plot(s) of the given (grouped) values. In contrast
to boxplot
quartiles are used instead of hinges
(which are not necessarily quartiles) the rest of the implementation is
identical to boxplot
.
qboxplot(x, ...) ## S3 method for class 'formula' qboxplot(formula, data = NULL, ..., subset, na.action = NULL, type = 7) ## Default S3 method: qboxplot(x, ..., range = 1.5, width = NULL, varwidth = FALSE, notch = FALSE, outline = TRUE, names, plot = TRUE, border = par("fg"), col = NULL, log = "", pars = list(boxwex = 0.8, staplewex = 0.5, outwex = 0.5), horizontal = FALSE, add = FALSE, at = NULL, type = 7)
qboxplot(x, ...) ## S3 method for class 'formula' qboxplot(formula, data = NULL, ..., subset, na.action = NULL, type = 7) ## Default S3 method: qboxplot(x, ..., range = 1.5, width = NULL, varwidth = FALSE, notch = FALSE, outline = TRUE, names, plot = TRUE, border = par("fg"), col = NULL, log = "", pars = list(boxwex = 0.8, staplewex = 0.5, outwex = 0.5), horizontal = FALSE, add = FALSE, at = NULL, type = 7)
formula |
a formula, such as |
data |
a data.frame (or list) from which the variables in
|
subset |
an optional vector specifying a subset of observations to be used for plotting. |
na.action |
a function which indicates what should happen
when the data contain |
x |
for specifying data from which the boxplots are to be
produced. Either a numeric vector, or a single list containing such
vectors. Additional unnamed arguments specify further data
as separate vectors (each corresponding to a component boxplot).
|
... |
For the For the default method, unnamed arguments are additional data
vectors (unless |
range |
this determines how far the plot whiskers extend out
from the box. If |
width |
a vector giving the relative widths of the boxes making up the plot. |
varwidth |
if |
notch |
if |
outline |
if |
names |
group labels which will be printed under each boxplot. Can be a character vector or an expression (see plotmath). |
boxwex |
a scale factor to be applied to all boxes. When there are only a few groups, the appearance of the plot can be improved by making the boxes narrower. |
staplewex |
staple line width expansion, proportional to box width. |
outwex |
outlier line width expansion, proportional to box width. |
plot |
if |
border |
an optional vector of colors for the outlines of the
boxplots. The values in |
col |
if |
log |
character indicating if x or y or both coordinates should be plotted in log scale. |
pars |
a list of (potentially many) more graphical parameters,
e.g., |
horizontal |
logical indicating if the boxplots should be
horizontal; default |
add |
logical, if true add boxplot to current plot. |
at |
numeric vector giving the locations where the boxplots should
be drawn, particularly when |
type |
an integer between 1 and 9 selecting one of nine quantile
algorithms; for more details see |
The generic function qboxplot
currently has a default method
(qboxplot.default
) and a formula interface (qboxplot.formula
).
If multiple groups are supplied either as multiple arguments or via a
formula, parallel boxplots will be plotted, in the order of the
arguments or the order of the levels of the factor (see
factor
).
Missing values are ignored when forming boxplots.
List with the following components:
stats |
a matrix, each column contains the extreme of the lower whisker, the lower hinge, the median, the upper hinge and the extreme of the upper whisker for one group/plot. If all the inputs have the same class attribute, so will this component. |
n |
a vector with the number of observations in each group. |
conf |
a matrix where each column contains the lower and upper extremes of the notch. |
out |
the values of any data points which lie beyond the extremes of the whiskers. |
group |
a vector of the same length as |
names |
a vector of names for the groups. |
Matthias Kohl [email protected]
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.
Chambers, J. M., Cleveland, W. S., Kleiner, B. and Tukey, P. A. (1983) Graphical Methods for Data Analysis. Wadsworth & Brooks/Cole.
Murrell, P. (2005) R Graphics. Chapman & Hall/CRC Press.
See also boxplot.stats
.
qbxp.stats
which does the computation,
bxp
for the plotting and more examples;
and stripchart
for an alternative (with small data
sets).
## adapted examples from boxplot op <- par() ## qboxplot on a formula: qboxplot(count ~ spray, data = InsectSprays, col = "lightgray") # *add* notches (somewhat funny here): qboxplot(count ~ spray, data = InsectSprays, notch = TRUE, add = TRUE, col = "blue") qboxplot(decrease ~ treatment, data = OrchardSprays, log = "y", col = "bisque") rb <- qboxplot(decrease ~ treatment, data = OrchardSprays, col="bisque") title("Comparing boxplot()s and non-robust mean +/- SD") mn.t <- tapply(OrchardSprays$decrease, OrchardSprays$treatment, mean) sd.t <- tapply(OrchardSprays$decrease, OrchardSprays$treatment, sd) xi <- 0.3 + seq(rb$n) points(xi, mn.t, col = "orange", pch = 18) arrows(xi, mn.t - sd.t, xi, mn.t + sd.t, code = 3, col = "pink", angle = 75, length = .1) ## boxplot on a matrix: mat <- cbind(Uni05 = (1:100)/21, Norm = rnorm(100), `5T` = rt(100, df = 5), Gam2 = rgamma(100, shape = 2)) qboxplot(as.data.frame(mat), main = "qboxplot(as.data.frame(mat), main = ...)") par(las = 1)# all axis labels horizontal qboxplot(as.data.frame(mat), main = "boxplot(*, horizontal = TRUE)", horizontal = TRUE) ## Using 'at = ' and adding boxplots -- example idea by Roger Bivand : qboxplot(len ~ dose, data = ToothGrowth, boxwex = 0.25, at = 1:3 - 0.2, subset = supp == "VC", col = "yellow", main = "Guinea Pigs' Tooth Growth", xlab = "Vitamin C dose mg", ylab = "tooth length", xlim = c(0.5, 3.5), ylim = c(0, 35), yaxs = "i") qboxplot(len ~ dose, data = ToothGrowth, add = TRUE, boxwex = 0.25, at = 1:3 + 0.2, subset = supp == "OJ", col = "orange") legend(2, 9, c("Ascorbic acid", "Orange juice"), fill = c("yellow", "orange")) par(op)
## adapted examples from boxplot op <- par() ## qboxplot on a formula: qboxplot(count ~ spray, data = InsectSprays, col = "lightgray") # *add* notches (somewhat funny here): qboxplot(count ~ spray, data = InsectSprays, notch = TRUE, add = TRUE, col = "blue") qboxplot(decrease ~ treatment, data = OrchardSprays, log = "y", col = "bisque") rb <- qboxplot(decrease ~ treatment, data = OrchardSprays, col="bisque") title("Comparing boxplot()s and non-robust mean +/- SD") mn.t <- tapply(OrchardSprays$decrease, OrchardSprays$treatment, mean) sd.t <- tapply(OrchardSprays$decrease, OrchardSprays$treatment, sd) xi <- 0.3 + seq(rb$n) points(xi, mn.t, col = "orange", pch = 18) arrows(xi, mn.t - sd.t, xi, mn.t + sd.t, code = 3, col = "pink", angle = 75, length = .1) ## boxplot on a matrix: mat <- cbind(Uni05 = (1:100)/21, Norm = rnorm(100), `5T` = rt(100, df = 5), Gam2 = rgamma(100, shape = 2)) qboxplot(as.data.frame(mat), main = "qboxplot(as.data.frame(mat), main = ...)") par(las = 1)# all axis labels horizontal qboxplot(as.data.frame(mat), main = "boxplot(*, horizontal = TRUE)", horizontal = TRUE) ## Using 'at = ' and adding boxplots -- example idea by Roger Bivand : qboxplot(len ~ dose, data = ToothGrowth, boxwex = 0.25, at = 1:3 - 0.2, subset = supp == "VC", col = "yellow", main = "Guinea Pigs' Tooth Growth", xlab = "Vitamin C dose mg", ylab = "tooth length", xlim = c(0.5, 3.5), ylim = c(0, 35), yaxs = "i") qboxplot(len ~ dose, data = ToothGrowth, add = TRUE, boxwex = 0.25, at = 1:3 + 0.2, subset = supp == "OJ", col = "orange") legend(2, 9, c("Ascorbic acid", "Orange juice"), fill = c("yellow", "orange")) par(op)
This functions works identical to boxplot.stats
.
It is typically called by another function to gather the statistics
necessary for producing box plots, but may be invoked separately.
qbxp.stats(x, coef = 1.5, do.conf = TRUE, do.out = TRUE, type = 7)
qbxp.stats(x, coef = 1.5, do.conf = TRUE, do.out = TRUE, type = 7)
x |
a numeric vector for which the boxplot will be constructed
( |
coef |
it determines how far the plot ‘whiskers’ extend out
from the box. If |
do.conf |
logical; if |
do.out |
logical; if |
type |
an integer between 1 and 9 selecting one of nine quantile
algorithms; for more details see |
The notches (if requested) extend to +/-1.58 IQR/sqrt(n)
.
This seems to be based on the same calculations as the formula with 1.57 in
Chambers et al. (1983, p. 62), given in McGill et al.
(1978, p. 16). They are based on asymptotic normality of the median
and roughly equal sample sizes for the two medians being compared, and
are said to be rather insensitive to the underlying distributions of
the samples. The idea appears to be to give roughly a 95% confidence
interval for the difference in two medians.
List with named components as follows:
stats |
a vector of length 5, containing the extreme of the lower whisker, the first quartile, the median, the third quartile and the extreme of the upper whisker. |
n |
the number of non- |
conf |
the lower and upper extremes of the ‘notch’
( |
out |
the values of any data points which lie beyond the
extremes of the whiskers ( |
Note that $stats
and $conf
are sorted in increasing
order, unlike S, and that $n
and $out
include any
+- Inf
values.
Matthias Kohl [email protected]
Tukey, J. W. (1977) Exploratory Data Analysis. Section 2C.
McGill, R., Tukey, J. W. and Larsen, W. A. (1978) Variations of box plots. The American Statistician 32, 12–16.
Velleman, P. F. and Hoaglin, D. C. (1981) Applications, Basics and Computing of Exploratory Data Analysis. Duxbury Press.
Emerson, J. D and Strenio, J. (1983). Boxplots and batch comparison. Chapter 3 of Understanding Robust and Exploratory Data Analysis, eds. D. C. Hoaglin, F. Mosteller and J. W. Tukey. Wiley.
Chambers, J. M., Cleveland, W. S., Kleiner, B. and Tukey, P. A. (1983) Graphical Methods for Data Analysis. Wadsworth & Brooks/Cole.
## adapted example from boxplot.stats x <- c(1:100, 1000) (b1 <- qbxp.stats(x)) (b2 <- qbxp.stats(x, do.conf=FALSE, do.out=FALSE)) stopifnot(b1$stats == b2$stats) # do.out=F is still robust qbxp.stats(x, coef = 3, do.conf=FALSE) ## no outlier treatment: qbxp.stats(x, coef = 0) qbxp.stats(c(x, NA)) # slight change : n is 101 (r <- qbxp.stats(c(x, -1:1/0))) stopifnot(r$out == c(1000, -Inf, Inf))
## adapted example from boxplot.stats x <- c(1:100, 1000) (b1 <- qbxp.stats(x)) (b2 <- qbxp.stats(x, do.conf=FALSE, do.out=FALSE)) stopifnot(b1$stats == b2$stats) # do.out=F is still robust qbxp.stats(x, coef = 3, do.conf=FALSE) ## no outlier treatment: qbxp.stats(x, coef = 0) qbxp.stats(c(x, NA)) # slight change : n is 101 (r <- qbxp.stats(c(x, -1:1/0))) stopifnot(r$out == c(1000, -Inf, Inf))
Function to compute seven-number summaries (minimum, 1st octile, 1st quartile, median, 3rd quartile, 7th octile, maximum)
sevenNS(x, na.rm = TRUE, type = 7)
sevenNS(x, na.rm = TRUE, type = 7)
x |
numeric vector |
na.rm |
logical; remove |
type |
an integer between 1 and 9 selecting one of nine quantile
algorithms; for more details see |
In contrast to Tukey (1977) who proposes hinges and eighths,
we use function quantile
.
A numeric vector of length 7 containing the summary information.
Matthias Kohl [email protected]
Tukey, J.W. (1977). Exploratory Data Analysis. Section 2G.
x <- rnorm(100) sevenNS(x) sevenNS(x, type = 2)
x <- rnorm(100) sevenNS(x) sevenNS(x, type = 2)
The function simulates a pair of correlated variables.
simCorVars(n, r, mu1 = 0, mu2 = 0, sd1 = 1, sd2 = 1, plot = TRUE)
simCorVars(n, r, mu1 = 0, mu2 = 0, sd1 = 1, sd2 = 1, plot = TRUE)
n |
integer: sample size. |
r |
numeric: correlation. |
mu1 |
numeric: mean of first variable. |
mu2 |
numeric: mean of second variable. |
sd1 |
numeric: SD of first variable. |
sd2 |
numeric: SD of second variable. |
plot |
logical: generate scatter plot of the variables. |
The function is mainly for teaching purposes and simulates n
observations
from a pair of normal distributed variables with correlation r
.
By specifying plot = TRUE
a scatter plot of the data is generated.
data.frame with entries Var1
and Var2
Matthias Kohl [email protected]
set.seed(123) res <- simCorVars(n = 100, r = 0.8) cor(res$Var1, res$Var2) colMeans(res) apply(res, 2, sd) set.seed(123) res <- simCorVars(n = 100, r = 0.8, mu1 = -1, mu2 = 1, sd1 = 2, sd2 = 0.5) cor(res$Var1, res$Var2) colMeans(res) apply(res, 2, sd)
set.seed(123) res <- simCorVars(n = 100, r = 0.8) cor(res$Var1, res$Var2) colMeans(res) apply(res, 2, sd) set.seed(123) res <- simCorVars(n = 100, r = 0.8, mu1 = -1, mu2 = 1, sd1 = 2, sd2 = 0.5) cor(res$Var1, res$Var2) colMeans(res) apply(res, 2, sd)
Computes Huper-type Skipped Mean and SD.
skippedMean(x, na.rm = FALSE, constant = 3.0) skippedSD(x, na.rm = FALSE, constant = 3.0)
skippedMean(x, na.rm = FALSE, constant = 3.0) skippedSD(x, na.rm = FALSE, constant = 3.0)
x |
a numeric vector. |
na.rm |
logical. Should missing values be removed? |
constant |
multiplier for outlier identification; see details below. |
The Huber-type skipped mean and is very close to estimator X42 of Hampel (1985), which uses 3.03 x MAD. Quoting Hampel et al. (1986), p. 69, the X42 estimator is "frequently quite reasonable, according to present preliminary knowledge".
For computing the Huber-type skipped mean, one first computes median and MAD. In the next step, all observations outside the interval [median - constant x MAD, median + constant x MAD] are removed and arithmetic mean and sample standard deviation are computed on the remaining data.
Matthias Kohl [email protected]
Hampel, F.R. (1985). The breakdown points of the mean combined with some rejection rules. Technometrics, 27: 95-107.
Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J., Stahel, W.A (1986). Robust statistics. The approach based on influence functions. New York: Wiley.
## normal data x <- rnorm(100) mean(x) median(x) skippedMean(x) sd(x) mad(x) skippedSD(x) ## Tukey's gross error model ## (1-eps)*Norm(mean, sd = sigma) + eps*Norm(mean, sd = 3*sigma) ind <- rbinom(100, size = 1, prob = 0.1) x.err <- (1-ind)*x + ind*rnorm(100, sd = 3) mean(x.err) median(x.err) skippedMean(x.err) sd(x.err) mad(x.err) skippedSD(x.err)
## normal data x <- rnorm(100) mean(x) median(x) skippedMean(x) sd(x) mad(x) skippedSD(x) ## Tukey's gross error model ## (1-eps)*Norm(mean, sd = sigma) + eps*Norm(mean, sd = 3*sigma) ind <- rbinom(100, size = 1, prob = 0.1) x.err <- (1-ind)*x + ind*rnorm(100, sd = 3) mean(x.err) median(x.err) skippedMean(x.err) sd(x.err) mad(x.err) skippedSD(x.err)
The function computes the standardized mean difference, where a bias correction can be applied.
SMD(x, y, bias.cor = TRUE, var.equal = FALSE, na.rm = FALSE)
SMD(x, y, bias.cor = TRUE, var.equal = FALSE, na.rm = FALSE)
x |
numeric vector, data of group 1. |
y |
numeric vector, data of group 2. |
bias.cor |
a logical variable indicating whether a bias correction should be performed. |
var.equal |
a logical variable indicating whether to treat the two variances
as being equal. If |
na.rm |
logical. Should missing values be removed? |
The function compute the (bias-corrected) standardized mean difference.
If bias.cor = FALSE
and var.equal = TRUE
, the result corresponds
to Cohen's d (Cohen (1988)).
If bias.cor = TRUE
and var.equal = TRUE
, the result corresponds to
Hedges' g (Hedges (1981)).
If bias.cor = FALSE
and var.equal = FALSE
, the result is closely
related to the test statistic of Welch's t test (Aoki (2020)).
If bias.cor = TRUE
and var.equal = FALSE
, the result corresponds to
Aoki's e (Aoki (2020)) which incorporates a Welch-Satterthwaite approximation
in combination with a bias correction.
SMD value.
Matthias Kohl [email protected]
Aoki, S. (2020). Effect sizes of the differences between means without assuming variance equality and between a mean and a constant. Heliyon, 6(1), e03306.
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. Routledge. ISBN 978-1-134-74270-7.
Hedges, L. V. (1981). Distribution theory for Glass's estimator of effectsize and related estimators. Journal of Educational Statistics 6, 107-128.
n1 <- 200 x <- rnorm(n1) n2 <- 300 y <- rnorm(n2, mean = 3, sd = 2) ## true value (0-3)/sqrt((1 + n1/n2*2^2)/(n1/n2+1)) ## estimates ## Aoki's e SMD(x, y) ## Hedges' g SMD(x, y, var.equal = TRUE) ## standardized test statistic of Welch's t-test SMD(x, y, bias.cor = FALSE) ## Cohen's d SMD(x, y, bias.cor = FALSE, var.equal = TRUE) ## Example from Aoki (2020) SMD(0:4, c(0, 0, 1, 2, 2)) SMD(0:4, c(0, 0, 1, 2, 2), var.equal = TRUE) SMD(0:4, c(0, 0, 1, 2, 2), bias.cor = FALSE) SMD(0:4, c(0, 0, 1, 2, 2), bias.cor = FALSE, var.equal = TRUE)
n1 <- 200 x <- rnorm(n1) n2 <- 300 y <- rnorm(n2, mean = 3, sd = 2) ## true value (0-3)/sqrt((1 + n1/n2*2^2)/(n1/n2+1)) ## estimates ## Aoki's e SMD(x, y) ## Hedges' g SMD(x, y, var.equal = TRUE) ## standardized test statistic of Welch's t-test SMD(x, y, bias.cor = FALSE) ## Cohen's d SMD(x, y, bias.cor = FALSE, var.equal = TRUE) ## Example from Aoki (2020) SMD(0:4, c(0, 0, 1, 2, 2)) SMD(0:4, c(0, 0, 1, 2, 2), var.equal = TRUE) SMD(0:4, c(0, 0, 1, 2, 2), bias.cor = FALSE) SMD(0:4, c(0, 0, 1, 2, 2), bias.cor = FALSE, var.equal = TRUE)
The functions compute the signal to noise ration (SNR) as well as two robust versions of the SNR.
SNR(x, na.rm = FALSE) medSNR(x, na.rm = FALSE, constant = 1/qnorm(0.75)) iqrSNR(x, na.rm = FALSE, type = 7, constant = 2*qnorm(0.75))
SNR(x, na.rm = FALSE) medSNR(x, na.rm = FALSE, constant = 1/qnorm(0.75)) iqrSNR(x, na.rm = FALSE, type = 7, constant = 2*qnorm(0.75))
x |
numeric vector. |
na.rm |
logical. Should missing values be removed? |
type |
an integer between 1 and 9 selecting one of nine quantile
algorithms; for more details see |
constant |
The functions compute the (classical) SNRas well as two robust variants.
medSNR
uses the (standardized) MAD instead of SD and median instead of mean.
iqrSNR
uses the (standardized) IQR instead of SD and median instead of mean.
SNR value.
Matthias Kohl [email protected]
C.N.P.G. Arachchige, L.A. Prendergast and R.G. Staudte. Robust analogues to the Coefficient of Variation. https://arxiv.org/abs/1907.01110.
## 5% outliers out <- rbinom(100, prob = 0.05, size = 1) sum(out) x <- (1-out)*rnorm(100, mean = 10, sd = 2) + out*25 SNR(x) medSNR(x) iqrSNR(x)
## 5% outliers out <- rbinom(100, prob = 0.05, size = 1) sum(out) x <- (1-out)*rnorm(100, mean = 10, sd = 2) + out*25 SNR(x) medSNR(x) iqrSNR(x)
The function computes and plots TSH, fT3 and fT4 values with respect to the provided reference range.
thyroid(TSH, fT3, fT4, TSHref, fT3ref, fT4ref)
thyroid(TSH, fT3, fT4, TSHref, fT3ref, fT4ref)
TSH |
numeric vector of length 1: measured TSH concentration. |
fT3 |
numeric vector of length 1: measured fT3 concentration. |
fT4 |
numeric vector of length 1: measured fT4 concentration. |
TSHref |
numeric vector of length 2: reference range TSH. |
fT3ref |
numeric vector of length 2: reference range fT3. |
fT4ref |
numeric vector of length 2: reference range fT4. |
A simple function that computes the relative values of the measured values with respect to the provided reference range and visualizes the values using a barplot. Relative values between 40% and 60% are marked as O.K..
Invisible data.frame
with the relative values.
Matthias Kohl [email protected]
thyroid(TSH = 1.5, fT3 = 2.5, fT4 = 14, TSHref = c(0.2, 3.0), fT3ref = c(1.7, 4.2), fT4ref = c(7.6, 15.0))
thyroid(TSH = 1.5, fT3 = 2.5, fT4 = 14, TSHref = c(0.2, 3.0), fT3ref = c(1.7, 4.2), fT4ref = c(7.6, 15.0))
The functions generate new transformations for the generalized logarithm and the negative logarithm that can be used for transforming the axes in ggplot2 plots.
glog_trans(base = exp(1)) glog10_trans() glog2_trans() scale_y_glog(...) scale_x_glog(...) scale_y_glog10(...) scale_x_glog10(...) scale_y_glog2(...) scale_x_glog2(...) scale_y_log(...) scale_x_log(...) scale_y_log2(...) scale_x_log2(...) neglog_breaks(n = 5, base = 10) neglog_trans(base = exp(1)) neglog10_trans() neglog2_trans() scale_y_neglog(...) scale_x_neglog(...) scale_y_neglog10(...) scale_x_neglog10(...) scale_y_neglog2(...) scale_x_neglog2(...)
glog_trans(base = exp(1)) glog10_trans() glog2_trans() scale_y_glog(...) scale_x_glog(...) scale_y_glog10(...) scale_x_glog10(...) scale_y_glog2(...) scale_x_glog2(...) scale_y_log(...) scale_x_log(...) scale_y_log2(...) scale_x_log2(...) neglog_breaks(n = 5, base = 10) neglog_trans(base = exp(1)) neglog10_trans() neglog2_trans() scale_y_neglog(...) scale_x_neglog(...) scale_y_neglog10(...) scale_x_neglog10(...) scale_y_neglog2(...) scale_x_neglog2(...)
base |
a positive or a positive or complex number: the base with respect to which generalized and negative logarithms are computed. Defaults to e=exp(1). |
... |
Arguments passed on to scale_(x|y)_continuous. |
n |
desired number of breaks. |
The functions can be used to transform axes in ggplot2 plots. The implementation
is analogous to e.g. scale_y_log10
.
The negative logarithm is for instance of use in case of p values (e.g. volcano plots),
The functions were adapted from packages scales and ggplot2.
A transformation.
Matthias Kohl [email protected]
H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
library(ggplot2) data(mpg) p1 <- ggplot(mpg, aes(displ, hwy)) + geom_point() p1 p1 + scale_x_log10() p1 + scale_x_glog10() p1 + scale_y_log10() p1 + scale_y_glog10() ## A volcano plot x <- matrix(rnorm(1000, mean = 10), nrow = 10) g1 <- rep("control", 10) y1 <- matrix(rnorm(500, mean = 11.25), nrow = 10) y2 <- matrix(rnorm(500, mean = 9.75), nrow = 10) g2 <- rep("treatment", 10) group <- factor(c(g1, g2)) Data <- rbind(x, cbind(y1, y2)) pvals <- apply(Data, 2, function(x, group) t.test(x ~ group)$p.value, group = group) ## compute log-fold change logfc <- function(x, group){ res <- tapply(x, group, mean) log2(res[1]/res[2]) } lfcs <- apply(Data, 2, logfc, group = group) ps <- data.frame(pvals = pvals, logfc = lfcs) ggplot(ps, aes(x = logfc, y = pvals)) + geom_point() + geom_hline(yintercept = 0.05) + scale_y_neglog10() + geom_vline(xintercept = c(-0.1, 0.1)) + xlab("log-fold change") + ylab("-log10(p value)") + ggtitle("A Volcano Plot")
library(ggplot2) data(mpg) p1 <- ggplot(mpg, aes(displ, hwy)) + geom_point() p1 p1 + scale_x_log10() p1 + scale_x_glog10() p1 + scale_y_log10() p1 + scale_y_glog10() ## A volcano plot x <- matrix(rnorm(1000, mean = 10), nrow = 10) g1 <- rep("control", 10) y1 <- matrix(rnorm(500, mean = 11.25), nrow = 10) y2 <- matrix(rnorm(500, mean = 9.75), nrow = 10) g2 <- rep("treatment", 10) group <- factor(c(g1, g2)) Data <- rbind(x, cbind(y1, y2)) pvals <- apply(Data, 2, function(x, group) t.test(x ~ group)$p.value, group = group) ## compute log-fold change logfc <- function(x, group){ res <- tapply(x, group, mean) log2(res[1]/res[2]) } lfcs <- apply(Data, 2, logfc, group = group) ps <- data.frame(pvals = pvals, logfc = lfcs) ggplot(ps, aes(x = logfc, y = pvals)) + geom_point() + geom_hline(yintercept = 0.05) + scale_y_neglog10() + geom_vline(xintercept = c(-0.1, 0.1)) + xlab("log-fold change") + ylab("-log10(p value)") + ggtitle("A Volcano Plot")
The functions compute the classical z-score as well as two robust versions of z-scores.
zscore(x, na.rm = FALSE) medZscore(x, na.rm = FALSE, constant = 1/qnorm(0.75)) iqrZscore(x, na.rm = FALSE, type = 7, constant = 2*qnorm(0.75))
zscore(x, na.rm = FALSE) medZscore(x, na.rm = FALSE, constant = 1/qnorm(0.75)) iqrZscore(x, na.rm = FALSE, type = 7, constant = 2*qnorm(0.75))
x |
numeric vector with positive numbers. |
na.rm |
logical. Should missing values be removed? |
type |
an integer between 1 and 9 selecting one of nine quantile
algorithms; for more details see |
constant |
The functions compute the (classical) zscore as well as two robust variants.
medZscore
uses the (standardized) MAD instead of SD and median instead of mean.
iqrZscore
uses the (standardized) IQR instead of SD and median instead of mean.
z-score.
Matthias Kohl [email protected]
## 10% outliers out <- rbinom(100, prob = 0.1, size = 1) sum(out) x <- (1-out)*rnorm(100, mean = 10, sd = 2) + out*25 z <- zscore(x) z.med <- medZscore(x) z.iqr <- iqrZscore(x) ## mean without outliers (should by close to 0) mean(z[!out]) mean(z.med[!out]) mean(z.iqr[!out]) ## sd without outliers (should by close to 1) sd(z[!out]) sd(z.med[!out]) sd(z.iqr[!out])
## 10% outliers out <- rbinom(100, prob = 0.1, size = 1) sum(out) x <- (1-out)*rnorm(100, mean = 10, sd = 2) + out*25 z <- zscore(x) z.med <- medZscore(x) z.iqr <- iqrZscore(x) ## mean without outliers (should by close to 0) mean(z[!out]) mean(z.med[!out]) mean(z.iqr[!out]) ## sd without outliers (should by close to 1) sd(z[!out]) sd(z.med[!out]) sd(z.iqr[!out])