STATO (https://stato-ontology.org/) is a general-purpose STATistics Ontology.
Name: differential expression analysis data transformation
Definition: A differential expression analysis data transformation is a data transformation that has objective differential expression analysis.
Name: error correction data transformation
Definition: An error correction data transformation is a data transformation that has the objective of error correction, where the aim is to remove (correct for) erroneous contributions from the input to the data transformation.
Name: multiple testing correction method
Definition: A multiple testing correction method is a hypothesis test performed simultaneously on M > 1 hypotheses. Multiple testing procedures produce a set of rejected hypotheses that is an estimate for the set of false null hypotheses while controlling for a suitably defined Type I error rate.
Name: family wise error rate correction method
Definition: A family wise error rate correction method is a multiple testing procedure that controls the probability of at least one false positive.
Name: Holm-Bonferroni family-wise error rate correction method
Definition: a data transformation that performs more than one hypothesis test simultaneously; a closed-test procedure that controls the familywise error rate for all k hypotheses at level α in the strong sense. Objective: multiple testing correction
Name: false discovery rate correction method
Definition: The false discovery rate correction method is a data transformation used in multiple hypothesis testing to correct for multiple comparisons. It controls the expected proportion of incorrectly rejected null hypotheses (type I errors) in a list of rejected hypotheses. It is a less conservative comparison procedure with greater power than familywise error rate (FWER) control, at the cost of increasing the likelihood of obtaining type I errors.
Name: Benjamini and Hochberg false discovery rate correction method
Definition: A data transformation process in which the Benjamini and Hochberg method sequential p-value procedure is applied with the aim of correcting false discovery rate
Name: Benjamini and Yekutieli false discovery rate correction method
Definition: A data transformation in which the Benjamini and Yekutieli method is applied with the aim of correcting false discovery rate
Name: Holm false discovery rate correction
Definition: A data transformation process in which the Holm p-value procedure is applied with the aim of correcting false discovery rate
Name: Hommel false discovery rate correction
Definition: A data transformation process in which the Hommel p-value procedure is applied with the aim of correcting false discovery rate
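To illustrate how these correction procedures behave, the minimal sketch below adjusts one made-up vector of p-values with the Holm, Hommel, Benjamini-Hochberg, and Benjamini-Yekutieli methods; the method identifiers follow statsmodels' multipletests API, and the p-values are invented for the example.

```python
# Minimal sketch: the same raw p-values adjusted by several of the
# correction methods defined above (statsmodels method identifiers).
from statsmodels.stats.multitest import multipletests

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
for method in ("holm", "hommel", "fdr_bh", "fdr_by"):
    reject, adjusted, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(method, adjusted.round(3), reject)
```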
Name: Dunn’s multiple comparison test
Definition: Dunn’s Multiple Comparison Test is a post hoc non parametric test (a “distribution free” test that doesn’t assume your data come from a particular distribution), run after an ANOVA. It is one of the least powerful of the multiple comparisons tests and can be a very conservative test, especially for larger numbers of comparisons. Dunn’s test is an alternative to the Tukey test when you only want to test for differences in a small subset of all possible pairs; for larger numbers of pairwise comparisons, use Tukey’s instead. Use Dunn’s when you choose to test a specific number of comparisons before you run the ANOVA and when you are not comparing to controls. If you are comparing to a control group, use the Dunnett test instead.
Name: Conover-Iman test of multiple comparisons using rank sums
Definition: Conover-Iman test for stochastic dominance is a statistical test for multiple group comparisons and reports the results among multiple pairwise comparisons after a Kruskal-Wallis test for stochastic dominance among k groups (Kruskal and Wallis, 1952). The interpretation of stochastic dominance requires an assumption that the CDF of one group does not cross the CDF of the other. The null hypothesis for each pairwise comparison is that the probability of observing a randomly selected value from the first group that is larger than a randomly selected value from the second group equals one half; this null hypothesis corresponds to that of the Wilcoxon-Mann-Whitney rank-sum test. Like the rank-sum test, if the data can be assumed to be continuous, and the distributions are assumed identical except for a difference in location, the Conover-Iman test may be understood as a test for median difference. conover.test accounts for tied ranks. The Conover-Iman test is strictly valid if and only if the corresponding Kruskal-Wallis null hypothesis is rejected.
Name: simultaneous multiple testing correction
Definition: simultaneous multiple testing method is a multiple testing correction method which...
Name: sequential multiple testing correction method
Definition: sequential multiple testing method is a multiple testing correction method which...
Name: alpha debt
Definition: a sequential multiple testing correction procedure which does not maintain a constant false positive rate but allows it to grow controllably.
Name: alpha investing procedure
Definition: a type of sequential multiple testing correction method
Name: alpha spending procedure
Definition: a type of sequential multiple testing correction
Name: statistical hypothesis test
Definition: A statistical hypothesis test data transformation is a data transformation that has objective statistical hypothesis test.
Name: Student's t-test
Definition: Student's t-test is a data transformation with the objective of a statistical hypothesis test in which the test statistic has a Student's t distribution if the null hypothesis is true. It is applied when the population is assumed to be normally distributed but the sample sizes are small enough that the statistic on which inference is based is not normally distributed, because it relies on an uncertain estimate of standard deviation rather than on a precisely known value.
Name: paired t-test
Definition: paired t-test is a statistical test which is specifically designed to analyze differences between paired observations in studies using a repeated measures design with only 2 repeated measurements per subject (before and after treatment, for example).
Name: one sample t-test
Definition: one sample t-test is a kind of Student's t-test which evaluates if a given sample can be reasonably assumed to be taken from the population. The test compares the sample statistic (m) to the population parameter (M). The one sample t-test is the small sample analog of the z test, which is suitable for large samples.
Name: two sample t-test with equal variance
Definition: two sample t-test is a null hypothesis statistical test which is used to reject or accept the hypothesis of absence of difference between the means of 2 randomly sampled populations. It uses a t-distribution for the test and assumes that the variables in the populations are normally distributed with equal variances.
Name: two sample t-test with unequal variance
Definition: Welch t-test is a two sample t-test used when the variances of the 2 populations/samples are thought to be unequal (homoskedasticity hypothesis not verified). In this version of the two-sample t-test, the denominator used to form the t-statistic does not rely on a 'pooled variance' estimate.
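The t-test variants above map directly onto SciPy calls; the minimal sketch below runs each of them on simulated data (the values and seed are arbitrary).

```python
# Minimal sketch of the t-test variants defined above (SciPy; simulated data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(5.0, 1.0, 20)
b = rng.normal(5.5, 2.0, 20)

print(stats.ttest_1samp(a, popmean=5.0))       # one sample t-test
print(stats.ttest_rel(a, b))                   # paired t-test
print(stats.ttest_ind(a, b, equal_var=True))   # two sample t-test, equal variance
print(stats.ttest_ind(a, b, equal_var=False))  # Welch t-test, unequal variance
```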
Name: chi square test
Definition: The chi-square test is a data transformation with the objective of statistical hypothesis testing, in which the sampling distribution of the test statistic is a chi-square distribution when the null hypothesis is true, or any in which this is asymptotically true, meaning that the sampling distribution (if the null hypothesis is true) can be made to approximate a chi-square distribution as closely as desired by making the sample size large enough.
Name: Pearson's Chi square test of independence between categorical variables
Definition: Pearson's Chi-Squared test is a statistical null hypothesis test which is used either to evaluate the goodness of fit of an observed frequency distribution to a theoretical distribution, or to test the independence of 2 categorical variables (i.e. absence of association between those variables).
Name: Yates' corrected Chi-Squared test
Definition: Yates' corrected Chi-Squared test is a statistical test which is used to test the association/linkage/independence of 2 dichotomous variables while introducing a correction for using the continuous Chi-squared distribution for the test. To reduce the error in approximation, Frank Yates, an English statistician, suggested a correction for continuity that adjusts the formula for Pearson's chi-squared test by subtracting 0.5 from the difference between each observed value and its expected value in a 2 × 2 contingency table. This reduces the chi-squared value obtained and thus increases its p-value.
Name: Pearson's Chi square test of goodness of fit
Definition: Pearson's Chi-Squared test for goodness of fit is a statistical null hypothesis test which is used to evaluate the goodness of fit of an observed frequency distribution to a theoretical expected distribution.
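As an illustration of the three chi-square variants above, the sketch below runs Pearson's test of independence on a 2x2 table with and without Yates' continuity correction, then a goodness-of-fit test; all counts are invented.

```python
# Minimal sketch: chi-square tests on invented counts (SciPy).
from scipy import stats

table = [[10, 20], [30, 25]]
chi2, p, dof, expected = stats.chi2_contingency(table, correction=False)  # Pearson
chi2_y, p_y, _, _ = stats.chi2_contingency(table, correction=True)        # Yates
print(p, p_y)

observed = [18, 22, 20, 40]                               # counts in 4 categories
print(stats.chisquare(observed, f_exp=[25, 25, 25, 25]))  # goodness of fit
```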
Name: Wald test
Definition: the Wald test is a statistical test which computes a Wald chi-squared test for 1 or more coefficients, given their variance-covariance matrix. The Wald test (also called the Wald Chi-Squared Test) is a way to find out if explanatory variables in a model are significant. “Significant” means that they add something to the model; variables that add nothing can be deleted without affecting the model in any meaningful way.
Name: test of association between categorical variables
Definition: a test of association between 2 categorical variables is a statistical test which evaluates whether there is an association between a predictor variable assuming discrete values and a response variable also assuming discrete values.
Name: Fisher's exact test
Definition: Fisher's exact test is a statistical test used to determine if there are nonrandom associations between two categorical variables.
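A minimal sketch of Fisher's exact test on a small 2x2 table (SciPy; the counts are invented):

```python
# Fisher's exact test on a 2x2 contingency table of invented counts.
from scipy import stats

oddsratio, p = stats.fisher_exact([[8, 2], [1, 5]])
print(oddsratio, p)
```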
Name: Cochran-Mantel-Haenszel test for repeated tests of independence
Definition: Cochran-Mantel-Haenszel test for repeated tests of independence is a statistical test which allows the comparison of two groups on a dichotomous/categorical response. It is used when the effect of the explanatory variable on the response variable is influenced by covariates that can be controlled. It is often used in observational studies where random assignment of subjects to different treatments cannot be controlled, but influencing covariates can. The null hypothesis is that the two nominal variables that are tested within each repetition are independent of each other. There are thus 3 variables to consider: two categorical variables to be tested for independence of each other, and a third variable that identifies the repeats.
Name: Pearson's Chi square test of independence between categorical variables
Definition: Pearson's Chi-Squared test is a statistical null hypothesis test which is used to either evaluate goodness of fit of dataset to a Chi-Squared distribution or used to test independence of 2 categorical variables (ie absence of association between those variables).Name: Yate's corrected Chi-Squared test
Definition: Yate's corrected Chi-Squared test is a statistical test which is used to test the association/linkage/independence of 2 dichotomous variables while introducing a correction for using the continous Chi-squared distribution for the test. To reduce the error in approximation, Frank Yates, an English statistician, suggested a correction for continuity that adjusts the formula for Pearson's chi-squared test by subtracting 0.5 from the difference between each observed value and its expected value in a 2 × 2 contingency table. This reduces the chi-squared value obtained and thus increases its p-value.
Name: Cochran-Armitage test for trend
Definition: The Cochran-Armitage test is a statistical test used in categorical data analysis when the aim is to assess for the presence of an association between a dichotomous variable (variable with two categories) and a polychotomous variable (a variable with k categories). The two-level variable represents the response, and the other represents an explanatory variable with ordered levels. The null hypothesis is the hypothesis of no trend, which means that the binomial proportion is the same for all levels of the explanatory variable. For example, doses of a treatment can be ordered as 'low', 'medium', and 'high', and we may suspect that the treatment benefit cannot become smaller as the dose increases. The trend test is often used as a genotype-based test for case-control genetic association studies.
Name: transmission disequilibrium test
Definition: The transmission disequilibrium test is a statistical test for genetic linkage between a genetic marker and a trait in families. The test is robust to population structure.
Name: Barnard's test
Definition: Barnard's test is an exact statistical test used to determine if there are non-random associations between two categorical variables. It was developed in 1949 by Barnard and is a test which is, most times, more powerful than Fisher's exact test.
Name: Cochran's q test for heterogeneity
Definition: Cochran's Q test is a statistical test used for unreplicated randomized block design experiments with a binary response variable and paired data. In the analysis of two-way randomized block designs where the response variable can take only two possible outcomes (coded as 0 and 1), Cochran's Q test is a non-parametric statistical test to verify whether k treatments have identical effects.
Name: sphericity test
Definition: a sphericity test is a null hypothesis statistical testing procedure which posits a null hypothesis of equality of the variances of the differences between levels of the repeated measures factor.
Name: Mauchly's test for sphericity
Definition: Mauchly's test for sphericity is a statistical test which evaluates whether the variances of the differences between all combinations of the groups are equal, a property known as 'sphericity' in the context of repeated measures. It is used for instance prior to repeated measures ANOVA. The test works by assessing if a Wishart-distributed covariance matrix (or transformation thereof) is proportional to a given matrix.
Name: homoskedasticity test
Definition: a homoskedasticity test is a statistical test aiming to evaluate whether the variances from several random samples are similar.
Name: Levene's test
Definition: Levene's test is a null hypothesis statistical test which evaluates the null hypothesis of equality of variance in several populations.
Name: Bartlett's test
Definition: Bartlett's test (see Snedecor and Cochran, 1989) is used to test if k samples are from populations with equal variances. Equal variances across samples is called homoscedasticity or homogeneity of variances. Some statistical tests, for example the analysis of variance, assume that variances are equal across groups or samples. The Bartlett test can be used to verify that assumption. Bartlett's test is sensitive to departures from normality. That is, if the samples come from non-normal distributions, then Bartlett's test may simply be testing for non-normality. Levene's test and the Brown–Forsythe test are alternatives to the Bartlett test that are less sensitive to departures from normality.
Name: Brown-Forsythe test
Definition: the Brown-Forsythe test is a statistical test which evaluates whether the variances of different groups are equal. It relies on computing the median rather than the mean, as used in Levene's test for homoscedasticity. This test may be used, for instance, to ensure that the conditions of application of ANOVA are met.
Name: Breusch-Pagan test
Definition: Breusch-Pagan test is a statistical test which computes a score test of the hypothesis of constant error variance against the alternative that the error variance changes with the level of the response (fitted values), or with a linear combination of predictors.
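The sketch below contrasts the homoskedasticity tests above on simulated groups with unequal spread; note that in SciPy, levene with center='median' is the Brown-Forsythe variant, while center='mean' gives the original Levene statistic.

```python
# Minimal sketch: variance-homogeneity tests on simulated groups (SciPy).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
g1, g2, g3 = rng.normal(0, 1, 30), rng.normal(0, 1.5, 30), rng.normal(0, 2, 30)

print(stats.levene(g1, g2, g3, center="mean"))    # Levene's test (mean-based)
print(stats.levene(g1, g2, g3, center="median"))  # Brown-Forsythe test
print(stats.bartlett(g1, g2, g3))                 # Bartlett's test
```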
Name: goodness of fit statistical test
Definition: a goodness of fit statistical test is a statistical test which aims to evaluate whether a sample distribution can be considered equivalent to a theoretical distribution used as input.
Name: Likelihood-ratio test
Definition: the likelihood-ratio test is a data transformation which tests whether there is evidence of the need to move from a simple model to a more complicated one (where the simple model is nested within the complicated one); it tests the relative goodness of fit of the two models.
Name: Anderson-Darling test
Definition: The Anderson–Darling test is a statistical test of whether a given sample of data is drawn from a given probability distribution.
Name: Shapiro-Wilk test
Definition: Shapiro-Wilk test is a goodness of fit test which evaluates the null hypothesis that the sample is drawn from a population following a normal distribution
Name: Levene's test
Definition: Levene's test is a null hypothesis statistical test which evaluates the null hypothesis of equality of variance in several populations.
Name: Barlett's test
Definition: Bartlett's test (see Snedecor and Cochran, 1989) is used to test if k samples are from populations with equal variances. Equal variances across samples is called homoscedasticity or homogeneity of variances. Some statistical tests, for example the analysis of variance, assume that variances are equal across groups or samples. The Bartlett test can be used to verify that assumption. Bartlett's test is sensitive to departures from normality. That is, if the samples come from non-normal distributions, then Bartlett's test may simply be testing for non-normality. Levene's test and the Brown–Forsythe test are alternatives to the Bartlett test that are less sensitive to departures from normality.
Name: Kolmogorov-Smirnov test
Definition: Kolmogorov-Smirnov test is a goodness of fit test which evaluates the null hypothesis that a sample is drawn from a population that follows a specific continuous probability distribution.
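The three distributional goodness-of-fit tests above can be run on the same sample; the sketch below checks a simulated sample against a normal distribution (SciPy function names; data and parameters are arbitrary).

```python
# Minimal sketch: goodness-of-fit tests against a normal distribution (SciPy).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(10, 2, 50)

print(stats.shapiro(x))                       # Shapiro-Wilk normality test
print(stats.kstest(x, "norm", args=(10, 2)))  # Kolmogorov-Smirnov vs N(10, 2)
print(stats.anderson(x, dist="norm"))         # Anderson-Darling
```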
Name: F-test
Definition: an F-test is a statistical test in which the test statistic follows an F-distribution under the null hypothesis. The F-test is sensitive to departures from normality. F-tests arise when decomposing the variability in a data set in terms of sums of squares.
Name: one sample Hotelling T2 test
Definition: The one-sample Hotelling’s T2 is the multivariate extension of the common one-sample or paired Student’s t-test. In a one-sample t-test, the mean response is compared against a specific value. Hotelling’s one-sample T2 is used when the number of response variables is two or more, although it can be used when there is only one response variable. T2 makes the usual assumption that the data are approximately multivariate normal. Randomization versions of the test exist that do not rely on this assumption and should be used whenever exact results free of it are desired.
Name: Hardy-Weinberg equilibrium testing
Definition: Hardy-Weinberg equilibrium test is a statistical test which aims to evaluate whether a population's allele proportions are stable or not. It is used as a means of quality control to evaluate the possibility of genotyping error or population structure.
Name: hypergeometric test
Definition: hypergeometric test is a null hypothesis test which evaluates if a random variable follows a hypergeometric distribution. It is a test of goodness of fit to that distribution. The test is suited to situations involving sampling from a finite set without replacement, for instance testing for enrichment or depletion of elements (e.g. GO categories, genes).
Name: exact binomial test
Definition: a binomial test is a statistical hypothesis test which evaluates the statistical significance of deviations from a theoretically expected distribution (the binomial distribution) of observations falling into 2 categories, as produced by a Bernoulli experiment. It is a goodness of fit test.
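A sketch of the two tests above: a hypergeometric enrichment p-value of the GO-category kind mentioned earlier, and an exact binomial test (SciPy; stats.binomtest requires a recent SciPy, and all counts are invented).

```python
# Minimal sketch: hypergeometric enrichment and exact binomial test (SciPy).
from scipy import stats

# 40 of 1000 genes belong to a GO category; a 100-gene hit list contains 12.
# P(X >= 12) when drawing 100 genes without replacement:
p_enrich = stats.hypergeom.sf(12 - 1, 1000, 40, 100)
print(p_enrich)

# Exact binomial test: 14 successes in 20 Bernoulli trials vs p = 0.5
print(stats.binomtest(14, n=20, p=0.5))
```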
Name: Pearson's Chi square test of goodness of fit
Definition: Pearson's Chi-Squared test for goodnes of fit is a statistical null hypothesis test which is used to either evaluate goodness of fit of dataset to a Chi-Squared distribution
Name: Hosmer-Lemeshow goodness-of-fit test
Definition: The Hosmer-Lemeshow test is a statistical test used to assess the goodness of fit of a logistic regression model. It evaluates how well the predicted probabilities from the model match the observed outcomes in the data. The test helps determine whether the logistic regression model adequately captures the relationship between the predictor variables and the binary outcome variable. The test statistic follows a chi-square distribution with degrees of freedom equal to the number of groups minus the number of parameters estimated in the logistic regression model.
Name: non-parametric test
Definition: a statistical test which makes no assumption about the underlying data distribution.
Name: Mann-Whitney U-test
Definition: The Mann-Whitney U-test is a null hypothesis statistical testing procedure which allows two groups (or conditions or treatments) to be compared without making the assumption that values are normally distributed. The Mann-Whitney test is the non-parametric equivalent of the t-test for independent samples
Name: Kruskal Wallis test
Definition: The Kruskal–Wallis test is a null hypothesis statistical testing procedure which allows multiple (n>=2) groups (or conditions or treatments) to be compared without making the assumption that values are normally distributed. The Kruskal–Wallis test is the non-parametric equivalent of the independent samples ANOVA. It is most commonly used when there is one nominal variable and one measurement variable, and the measurement variable does not meet the normality assumption of an ANOVA.
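The sketch below applies both rank-based tests to skewed simulated data, where a t-test's normality assumption would be doubtful (SciPy; the exponential samples are arbitrary).

```python
# Minimal sketch: Mann-Whitney U and Kruskal-Wallis on skewed data (SciPy).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
g1 = rng.exponential(1.0, 25)
g2 = rng.exponential(1.5, 25)
g3 = rng.exponential(2.0, 25)

print(stats.mannwhitneyu(g1, g2))   # two independent groups
print(stats.kruskal(g1, g2, g3))    # k >= 2 independent groups
```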
Name: within subject comparison statistical test
Definition: within subject comparison statistical test is a kind of statistical test which evaluates whether a change occurs within one experimental unit over time following a treatment or an event.
Name: Wilcoxon signed rank test
Definition: The Wilcoxon signed rank test is a statistical test which tests the null hypothesis that the median difference between pairs of observations is zero. This is the non-parametric analogue to the paired t-test, and should be used if the distribution of differences between pairs may be non-normally distributed. The procedure involves a ranking, hence the name. The absolute value of the differences between observations are ranked from smallest to largest, with the smallest difference getting a rank of 1, then next larger difference getting a rank of 2, etc. Ties are given average ranks. The ranks of all differences in one direction are summed, and the ranks of all differences in the other direction are summed. The smaller of these two sums is the test statistic, W (sometimes symbolized Ts). Unlike most test statistics, smaller values of W are less likely under the null hypothesis.
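A minimal sketch of the Wilcoxon signed rank test on simulated paired before/after measurements (SciPy; the data and effect size are invented):

```python
# Wilcoxon signed rank test on paired within-subject values (SciPy).
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
before = rng.normal(100, 10, 15)
after = before - rng.exponential(3, 15)   # simulated treatment effect

print(stats.wilcoxon(before, after))      # tests median difference = 0
```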
Name: paired t-test
Definition: paired t-test is a statistical test which is specifically designed to analyze differences between paired observations in studies using a repeated measures design with only 2 repeated measurements per subject (before and after treatment, for example).
Name: repeated measure ANOVA
Definition: repeated measure ANOVA is a kind of ANOVA specifically developed for non-independent observations, as found when repeated measurements are made on the same experimental unit. Repeated measure ANOVA is sensitive to departures from normality (evaluated using Bartlett's test), more so in the case of unbalanced groups (i.e. different sizes of sample populations). Departure from sphericity (evaluated using Mauchly's test) used to be an issue which is now handled robustly by modern tools such as R's lme4 or nlme, which accommodate dependence assumptions other than sphericity.
Name: odds ratio homogeneity test
Definition: odds ratio homogeneity test is a statistical test which aims to evaluate whether the null hypothesis of consistency of the odds ratio across different strata of a population is true or not.
Name: Breslow-Day test for homogeneity of odds ratio
Definition: the Breslow-Day test is a statistical test which evaluates whether the odds ratios are homogeneous across N 2x2 contingency tables, for instance several 2x2 contingency tables associated with different strata of a stratified population when evaluating the relationship between exposure and outcome, or associated with the different samples coming from several centres in a multicentric study in a clinical trial context.
Name: Tarone's test for homogeneity of odds ratio
Definition: Tarone's test for homogeneity of odds ratio is a statistical test which evaluates the null hypothesis that odds ratios are homogeneous.
Name: Woolf's test
Definition: Woolf's test is a statistical test which evaluates the null hypothesis that odds ratios are the same across all strata of the population under investigation.
Name: Chi-square test for homogeneity
Definition: a statistical test which tests for homogeneity of proportions and is used when comparing proportions observed across multiple groups. It relies on frequencies calculated in contingency tables and determines whether proportions are consistent.
Name: between group comparison statistical test
Definition: between group comparison statistical test is a statistical test which aims to detect differences between the means computed for each of the study group populations.
Name: ANOVA
Definition: ANOVA or analysis of variance is a data transformation which performs a statistical test of whether the means of several groups are all equal.
Name: one-way ANOVA
Definition: one-way ANOVA is an analysis of variance where the different groups being compared are associated with the factor levels of only one independent variable. The null hypothesis is an absence of difference between the means calculated for each of the groups. The test assumes normality and homoscedasticity of the data.
Name: two-way ANOVA
Definition: two-way ANOVA is an analysis of variance where the different groups being compared are associated with the factor levels of exactly 2 independent variables. The null hypothesis is an absence of difference between the means calculated for each of the groups. The test assumes normality and homoscedasticity of the data.
Name: multiway ANOVA
Definition: multi-way ANOVA is an analysis of variance where the different groups being compared are associated with the factor levels of more than 2 independent variables. The null hypothesis is an absence of difference between the means calculated for each of the groups. The test assumes normality and homoscedasticity of the data.
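The sketch below runs a one-way ANOVA via SciPy and a two-way ANOVA with interaction via statsmodels' formula API on a simulated data frame; the column names (y, dose, sex) are invented for the example.

```python
# Minimal sketch: one-way and two-way ANOVA on simulated data.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(5)
df = pd.DataFrame({
    "y": rng.normal(0, 1, 60),
    "dose": np.tile(["low", "medium", "high"], 20),
    "sex": np.repeat(["f", "m"], 30),
})

groups = [g["y"].to_numpy() for _, g in df.groupby("dose")]
print(stats.f_oneway(*groups))                 # one-way ANOVA

model = smf.ols("y ~ C(dose) * C(sex)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))         # two-way ANOVA table
```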
Name: repeated measure ANOVA
Definition: repeated measure ANOVA is a kind of ANOVA specifically developed for non-independent observations, as found when repeated measurements are made on the same experimental unit. Repeated measure ANOVA is sensitive to departures from normality (evaluated using Bartlett's test), more so in the case of unbalanced groups (i.e. different sizes of sample populations). Departure from sphericity (evaluated using Mauchly's test) used to be an issue which is now handled robustly by modern tools such as R's lme4 or nlme, which accommodate dependence assumptions other than sphericity.
Name: multivariate analysis of variance
Definition: "The multivariate analysis of variance, or MANOVA, is a procedure for comparing multivariate sample means. As a multivariate procedure, it is used when there are two or more dependent variables, and is typically followed by significance tests involving individual dependent variables separately. It helps to answer: 1. Do changes in the independent variable(s) have significant effects on the dependent variables? 2. What are the relationships among the dependent variables? 3. What are the relationships among the independent variables?"
Name: Z-test
Definition: Z-test is a statistical test which evaluates the null hypothesis that the means of 2 populations are equal and returns a p-value.
Name: two sample Hotelling T2 test
Definition: Hotelling's T2 test is a statistical test which is a generalization of Student's t-test to assess whether the means of a set of variables are equal across 2 populations. It is a type of multivariate analysis.
Name: post-hoc analysis
Definition: A post-hoc analysis is a statistical test carried out following an analysis of variance which ruled out the null hypothesis of absence of difference between groups, and which allows identifying which groups differ.
Name: Scheffe test
Definition: the Scheffe test is a data transformation which evaluates all possible contrasts while adjusting the significance level to account for multiple comparisons. The test is therefore conservative. Confidence intervals can be constructed for the corresponding contrasts. It was developed by the American statistician Henry Scheffé in 1959.
Name: Least significant difference test
Definition: the LSD test is a statistical test for multiple comparisons of treatments by means of least significant difference following an ANOVA analysis
Name: Tukey HSD for Post-Hoc Analysis
Definition: Tukey Honestly Significant Difference (HSD) test is a statistical test used following an ANOVA test yielding a statistically significant p-value in order to determine which means are different, to a given level of significance. The Tukey HSD test relies on the q-distribution. The procedure is conservative, meaning that if sample sizes (the sizes of different study groups) are equal, the risk of a Type I error is exactly α, and if sample sizes are unequal it’s less than α.
Name: Newman-Keuls test post-hoc analysis
Definition: The Newman–Keuls or Student–Newman–Keuls (SNK) method is a stepwise multiple comparisons procedure used to identify sample means that are significantly different from each other. It was named after Student (1927), D. Newman, and M. Keuls. This procedure is often used as a post-hoc test whenever a significant difference between three or more sample means has been revealed by an analysis of variance (ANOVA). The Newman–Keuls method is similar to Tukey's range test as both procedures use Studentized range statistics. Compared to Tukey's range test, the Newman–Keuls method is more powerful but less conservative.
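Of the post-hoc procedures above, Tukey HSD is directly available in statsmodels; the sketch below applies it to three simulated groups after a (presumed significant) ANOVA. The group means and labels are invented.

```python
# Minimal sketch: Tukey HSD post-hoc pairwise comparisons (statsmodels).
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(6)
values = np.concatenate([rng.normal(m, 1, 20) for m in (0.0, 0.5, 1.5)])
labels = np.repeat(["A", "B", "C"], 20)

print(pairwise_tukeyhsd(values, labels, alpha=0.05))
```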
Name: ANCOVA
Definition: ANCOVA or analysis of covariance is a data transformation which evaluates if population means of a dependent variable are equal across levels of a categorical independent variable while controlling for the effects of other continuous variables, known as covariates. Therefore, when performing ANCOVA, we are adjusting the dependent variable means to what they would be if all groups were equal on the covariates. It augments the ANOVA model with one or more additional quantitative variables, called covariates, which are related to the response variable. The covariates are included to reduce the variance in the error terms and provide more precise measurement of the treatment effects. ANCOVA is used to test the main and interaction effects of the factors, while controlling for the effects of the covariates.
Name: non-parametric test
Definition: a statistical test which makes no assumption about the underlying data distribution.
Name: Mann-Whitney U-test
Definition: The Mann-Whitney U-test is a null hypothesis statistical testing procedure which allows two groups (or conditions or treatments) to be compared without making the assumption that values are normally distributed. The Mann-Whitney test is the non-parametric equivalent of the t-test for independent samples
Name: Kruskal Wallis test
Definition: The Kruskal–Wallis test is a null hypothesis statistical testing procedure which allows multiple (n>=2) groups (or conditions or treatments) to be compared without making the assumption that values are normally distributed. The Kruskal–Wallis test is the non-parametric equivalent of the independent samples ANOVA. It is most commonly used when there is one nominal variable and one measurement variable, and the measurement variable does not meet the normality assumption of an ANOVA.
Name: one sample t-test
Definition: one sample t-test is a kind of Student's t-test which evaluates if a given sample can be reasonably assumed to be taken from the population. The test compares the sample statistic (m) to the population parameter (M). The one sample t-test is the small sample analog of the z test, which is suitable for large samples.
Name: two sample t-test with equal variance
Definition: two sample t-test is a null hypothesis statistical test which is used to reject or accept the hypothesis of absence of difference between the means of 2 randomly sampled populations. It uses a t-distribution for the test and assumes that the variables in the populations are normally distributed with equal variances.
Name: two sample t-test with unequal variance
Definition: Welch t-test is a two sample t-test used when the variances of the 2 populations/samples are thought to be unequal (homoskedasticity hypothesis not verified). In this version of the two-sample t-test, the denominator used to form the t-statistic does not rely on a 'pooled variance' estimate.
Name: Yuen t-Test with trimmed means
Definition: Yuen's t-test is a two sample t-test for populations of unequal variance which provides a more robust t-test procedure under normal and long-tailed distributions. The test computes a t statistic using 'trimmed means' rather than arithmetic means, together with winsorized variances.
Name: Dunn’s multiple comparison test
Definition: Dunn’s Multiple Comparison Test is a post hoc non parametric test (a “distribution free” test that doesn’t assume your data come from a particular distribution), run after an ANOVA. It is one of the least powerful of the multiple comparisons tests and can be a very conservative test, especially for larger numbers of comparisons. Dunn’s test is an alternative to the Tukey test when you only want to test for differences in a small subset of all possible pairs; for larger numbers of pairwise comparisons, use Tukey’s instead. Use Dunn’s when you choose to test a specific number of comparisons before you run the ANOVA and when you are not comparing to controls. If you are comparing to a control group, use the Dunnett test instead.
Name: Conover-Iman test of multiple comparisons using rank sums
Definition: Conover-Iman test for stochastic dominance is a statistical test for multiple group comparisons and reports the results among multiple pairwise comparisons after a Kruskal-Wallis test for stochastic dominance among k groups (Kruskal and Wallis, 1952). The interpretation of stochastic dominance requires an assumption that the CDF of one group does not cross the CDF of the other. The null hypothesis for each pairwise comparison is that the probability of observing a randomly selected value from the first group that is larger than a randomly selected value from the second group equals one half; this null hypothesis corresponds to that of the Wilcoxon-Mann-Whitney rank-sum test. Like the rank-sum test, if the data can be assumed to be continuous, and the distributions are assumed identical except for a difference in location, the Conover-Iman test may be understood as a test for median difference. conover.test accounts for tied ranks. The Conover-Iman test is strictly valid if and only if the corresponding Kruskal-Wallis null hypothesis is rejected.
Name: Friedman test
Definition: The Friedman test is a non-parametric statistical test used to determine whether there are statistically significant differences among multiple related groups. It is an extension of the Wilcoxon signed-rank test for more than two related samples.
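A minimal sketch of the Friedman test on three related measurements per subject (SciPy; the simulated repeated measures are arbitrary):

```python
# Friedman test on simulated repeated measures (SciPy).
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
c1 = rng.normal(0.0, 1, 12)           # 12 subjects, condition 1
c2 = c1 + rng.normal(0.3, 0.5, 12)    # same subjects, condition 2
c3 = c1 + rng.normal(0.6, 0.5, 12)    # same subjects, condition 3

print(stats.friedmanchisquare(c1, c2, c3))
```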
Name: one tailed test
Definition: a one-tailed test is a statistical test which, assuming an unskewed probability distribution, allocates all of the significance level to evaluate only one hypothesis to explain a difference. The one-tailed test provides more power to detect an effect in one direction by not testing the effect in the other direction. A one-tailed test should be preceded by a two-tailed test in order to avoid missing an alternate effect explaining an observed difference.
Name: two tailed test
Definition: a two tailed test is a statistical test which assesses the null hypothesis of absence of difference, assuming a symmetric (not skewed) underlying probability distribution, by allocating half of the selected significance level to each of the directions of change which could explain a difference (for example, a difference can be an excess or a loss).
Name: hypergeometric test
Definition: hypergeometric test is a null hypothesis test which evaluates if a random variable follows a hypergeometric distribution. It is a test of goodness of fit to that distribution. The test is suited to situations involving sampling from a finite set without replacement, for instance testing for enrichment or depletion of elements (e.g. GO categories, genes).
Name: McNemar test
Definition: McNemar's test is a statistical test used on paired nominal data. It is applied to 2 × 2 contingency tables with a dichotomous trait, with matched pairs of subjects, to determine whether the row and column marginal frequencies are equal (that is, whether there is "marginal homogeneity"). It is named after Quinn McNemar, who introduced it in 1947. An application of the test in genetics is the transmission disequilibrium test for detecting linkage disequilibrium.
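A minimal sketch of McNemar's test on a 2x2 table of paired dichotomous outcomes (statsmodels; the counts are invented):

```python
# McNemar's test on paired nominal data (statsmodels).
from statsmodels.stats.contingency_tables import mcnemar

# rows: outcome before treatment; columns: outcome after treatment
table = [[59, 6], [16, 80]]
res = mcnemar(table, exact=True)   # exact binomial version of the test
print(res.statistic, res.pvalue)
```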
Name: generalized extreme studentized deviate test
Definition: The Extreme Studentized Deviate Test is a statistical test used to detect outliers in a univariate data set assumed to come from a normally distributed population. The ESD Test differs from the Grubbs' test and the Tietjen-Moore test in the sense that it contains a built-in correction for multiple testing.
Name: Dixon Q test
Definition: Dixon test is a statistical test used to detect outliers in a univariate data set assumed to come from a normally distributed population.
Name: Grubbs' test
Definition: Grubbs' test is a statistical test used to detect one outlier in a univariate data set assumed to come from a normally distributed population.
Name: Tietjen-Moore test for outliers
Definition: Tietjen-Moore test for outliers is a statistical test used to detect outliers and corresponds to a generalization of Grubbs' test, thus allowing detection of more than one outlier in a univariate data set assumed to come from a normally distributed population. If testing for a single outlier, the Tietjen-Moore test is equivalent to Grubbs' test.
Name: log-rank test
Definition: The logrank test is a statistical hypothesis test used to compare the survival distributions of two or more groups. It is commonly employed in survival analysis, where the primary interest lies in comparing the survival experiences of different groups over time.
Name: sign test
Definition: The sign test is a non-parametric hypothesis test used to assess whether the median of a single population is equal to a specified value, typically referred to as the null hypothesis. The sign test is particularly useful when the data are not normally distributed or when the assumptions required for parametric tests are not met. Note that the 'sign test' is related to but different from the 'Wilcoxon signed rank test'. Sign test: it does not assume any specific distribution for the data. It only requires paired data and makes no assumptions about the shape of the underlying distribution. Wilcoxon signed-rank test: it assumes that the differences between paired observations come from a symmetric distribution. It is also more powerful than the sign test when the distribution is continuous and symmetric.
Name: homogeneity test
Definition: a homogeneity test is a statistical test aiming to evaluate whether a statistical measure from several random samples is similar.
Name: Breslow-Day test for homogeneity of odds ratio
Definition: the Breslow-Day test is a statistical test which evaluates whether the odds ratios are homogeneous across N 2x2 contingency tables, for instance several 2x2 contingency tables associated with different strata of a stratified population when evaluating the relationship between exposure and outcome, or associated with the different samples coming from several centres in a multicentric study in a clinical trial context.
Name: Tarone's test for homogeneity of odds ratio
Definition: Tarone's test for homogeneity of odds ratio is a statistical test which evaluates the null hypothesis that odds ratios are homogeneous.
Name: Woolf's test
Definition: Woolf's test is a statistical test which evaluates the null hypothesis that odds ratios are the same across all strata of the population under investigation.
Name: Chi-square test for homogeneity
Definition: a statistical test which tests for homogeneity of proportions and is used when comparing proportions observed across multiple groups. It relies on frequencies calculated in contingency tables and determines whether proportions are consistent.
Name: Likelihood-ratio test
Definition: the likelihood-ratio test is a data transformation which tests whether there is evidence of the need to move from a simple model to a more complicated one (where the simple model is nested within the complicated one); it tests the relative goodness of fit of the two models.
Name: regression analysis method
Definition: Regression analysis is a descriptive statistics technique that examines the relation of a dependent variable (response variable) to specified independent variables (explanatory variables). Regression analysis can be used as a descriptive method of data analysis (such as curve fitting) without relying on any assumptions about underlying processes generating the data.
Name: principal component regression
Definition: The Principal Component Regression method is a regression analysis method that combines the Principal Component Analysis (PCA) spectral decomposition with an Inverse Least Squares (ILS) regression method to create a quantitative model for complex samples. Unlike quantitation methods based directly on Beer's Law, which attempt to calculate the absorptivity coefficients for the constituents of interest from a direct regression of the constituent concentrations onto the spectroscopic responses, the PCR method regresses the concentrations on the PCA scores.
Name: multinomial probit regression for analysis of polychotomous dependent variable
Definition: multinomial probit regression model is a model which attempts to explain the data distribution associated with a polychotomous response/dependent variable in terms of the values assumed by the independent variable(s), using a function of the predictor/independent variable(s); the function used in this instance of regression modeling is the probit function.
Name: probit regression for analysis of polychotomous dependent variable
Definition: probit regression model is a model which attempts to explain the data distribution associated with a dichotomous response/dependent variable in terms of the values assumed by the independent variable(s), using a function of the predictor/independent variable(s); the function used in this instance of regression modeling is the probit function, a.k.a. the quantile function, i.e. the inverse cumulative distribution function (CDF) of the standard normal distribution.
Name: linear regression for analysis of continuous dependent variable
Definition: linear regression model is a model which attempts to explain the data distribution associated with a response/dependent variable in terms of the values assumed by the independent variable(s), using a linear function or linear combination of the regression parameters and the predictor/independent variable(s). Linear regression modeling makes a number of assumptions, which include homoskedasticity (constancy of variance).
Name: multinomial logistic regression for analysis of dichotomous dependent variable
Definition: multinomial logistic regression model is a model which attempts to explain the data distribution associated with a polychotomous response/dependent variable in terms of the values assumed by the independent variable(s), using a function of the predictor/independent variable(s); the function used in this instance of regression modeling is the logistic function.
Name: binomial logistic regression for analysis of dichotomous dependent variable
Definition: binomial logistic regression model is a model which attempts to explain the data distribution associated with a dichotomous response/dependent variable in terms of the values assumed by the independent variable(s), using a function of the predictor/independent variable(s); the function used in this instance of regression modeling is the logistic function.
Name: ordered probit regression for analysis of ordinal dependent variable
Definition: ordered probit regression model is a model which attempts to explain the data distribution associated with an ordinal response/dependent variable in terms of the values assumed by the independent variable(s), using a function of the predictor/independent variable(s); the function used in this instance of regression modeling is the ordered probit function.
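To make the link-function distinction concrete, the sketch below fits two of the models above, binomial logistic and probit, to the same simulated dichotomous outcome with statsmodels; the true coefficients are arbitrary.

```python
# Minimal sketch: logistic vs probit regression on simulated data (statsmodels).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
X = sm.add_constant(rng.normal(size=(200, 2)))
p = 1 / (1 + np.exp(-(0.5 + 1.2 * X[:, 1] - 0.8 * X[:, 2])))
y = rng.binomial(1, p)                          # dichotomous response

print(sm.Logit(y, X).fit(disp=0).params)    # binomial logistic regression
print(sm.Probit(y, X).fit(disp=0).params)   # probit regression
```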
Name: normalization data transformation
Definition: A normalization data transformation is a data transformation that has objective normalization.
Name: logarithmic transformation
Definition: A logarithmic transformation is a data transformation consisting in the application of the logarithm function with a given base a (where a>0 and a is not equal to 1) to a (one dimensional) positive real number input. The logarithm function with base a can be defined as the inverse of the exponential function with the same base. See e.g. http://en.wikipedia.org/wiki/Logarithm.
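A minimal sketch of a base-10 logarithmic transformation applied to positive inputs:

```python
# Base-10 logarithmic transformation of positive real inputs.
import numpy as np

x = np.array([1.0, 10.0, 100.0])
print(np.log10(x))   # -> [0. 1. 2.]
```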
Name: averaging data transformation
Definition: An averaging data transformation is a data transformation that has objective averaging.
Name: partitioning data transformation
Definition: A partitioning data transformation is a data transformation that has objective partitioning.
Name: k-nearest neighbors
Definition: A k-nearest neighbors is a data transformation which achieves a class discovery or partitioning objective, in which an input data object with vector y is assigned the class label most common among the k closest training data set points to y.
Name: k-means clustering
Definition: A k-means clustering is a data transformation which achieves a class discovery or partitioning objective, which takes as input a collection of objects (represented as points in multidimensional space) and which partitions them into a specified number k of clusters. The algorithm attempts to find the centers of natural clusters in the data. The most common form of the algorithm starts by partitioning the input points into k initial sets, either at random or using some heuristic. It then calculates the mean point, or centroid, of each set. It constructs a new partition by associating each point with the closest centroid. Then the centroids are recalculated for the new clusters, and the algorithm is repeated by alternating these two steps until convergence, which is obtained when the points no longer switch clusters (or alternatively the centroids no longer change).
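The sketch below exercises both partitioning transformations above on simulated 2-D points with two obvious clusters (scikit-learn; the data and labels are invented).

```python
# Minimal sketch: k-means clustering and k-NN classification (scikit-learn).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(9)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(3, 0.5, (30, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)                     # centroids found by k-means

y = np.repeat([0, 1], 30)                      # known labels for k-NN training
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict([[1.5, 1.5]]))               # majority vote of 5 neighbors
```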
Name: class discovery data transformation
Definition: A class discovery data transformation (sometimes called unsupervised classification) is a data transformation that has objective class discovery.
Name: k-nearest neighbors
Definition: A k-nearest neighbors is a data transformation which achieves a class discovery or partitioning objective, in which an input data object with vector y is assigned the class label most common among the k closest training data set points to y.
Name: k-means clustering
Definition: A k-means clustering is a data transformation which achieves a class discovery or partitioning objective, which takes as input a collection of objects (represented as points in multidimensional space) and which partitions them into a specified number k of clusters. The algorithm attempts to find the centers of natural clusters in the data. The most common form of the algorithm starts by partitioning the input points into k initial sets, either at random or using some heuristic. It then calculates the mean point, or centroid, of each set. It constructs a new partition by associating each point with the closest centroid. Then the centroids are recalculated for the new clusters, and the algorithm is repeated by alternating these two steps until convergence, which is obtained when the points no longer switch clusters (or alternatively the centroids no longer change).
Name: hierarchical clustering
Definition: A hierarchical clustering is a data transformation which achieves a class discovery objective, which takes as input data items and builds a hierarchy of clusters. The traditional representation of this hierarchy is a tree (visualized by a dendrogram), with the individual input objects at one end (leaves) and a single cluster containing every object at the other (root).
Name: agglomerative hierarchical clustering
Definition: An agglomerative hierarchical clustering is a hierarchical clustering which starts with separate clusters and then successively combines these clusters until there is only one cluster remaining.
Name: average linkage hierarchical clustering
Definition: An average linkage hierarchical clustering is an agglomerative hierarchical clustering which generates successive clusters based on a distance measure, where the distance between two clusters is calculated as the average distance between objects from the first cluster and objects from the second cluster.
Name: complete linkage hierarchical clustering
Definition: A complete linkage hierarchical clustering is an agglomerative hierarchical clustering which generates successive clusters based on a distance measure, where the distance between two clusters is calculated as the maximum distance between objects from the first cluster and objects from the second cluster.
Name: single linkage hierarchical clustering
Definition: A single linkage hierarchical clustering is an agglomerative hierarchical clustering which generates successive clusters based on a distance measure, where the distance between two clusters is calculated as the minimum distance between objects from the first cluster and objects from the second cluster.
Name: divisive hierarchical clustering
Definition: A divisive hierarchical clustering is a hierarchical clustering which starts with a single cluster and then successively splits resulting clusters until only clusters of individual objects remain.
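The sketch below runs agglomerative clustering with the single, complete, and average linkage criteria defined above on simulated points with two obvious clusters (SciPy).

```python
# Minimal sketch: agglomerative clustering with three linkage criteria (SciPy).
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(10)
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(2, 0.3, (10, 2))])

d = pdist(X)                              # condensed pairwise distance matrix
for method in ("single", "complete", "average"):
    Z = linkage(d, method=method)         # build the cluster hierarchy
    print(method, fcluster(Z, t=2, criterion="maxclust"))
```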
Name: dimensionality reduction
Definition: A dimensionality reduction is a data partitioning which transforms each input m-dimensional vector (x_1, x_2, ..., x_m) into an output n-dimensional vector (y_1, y_2, ..., y_n), where n is smaller than m.
Name: principal components analysis dimensionality reduction
Definition: A principal components analysis dimensionality reduction is a dimensionality reduction achieved by applying principal components analysis and by keeping low-order principal components and excluding higher-order ones.
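A minimal sketch of PCA-based dimensionality reduction, taking m = 6 input dimensions down to n = 2 principal components (scikit-learn; data simulated with correlated columns):

```python
# PCA dimensionality reduction: keep the 2 lowest-order components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(11)
X = rng.normal(size=(100, 6))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)   # induce correlation

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)              # variance kept per component
print(pca.transform(X).shape)                     # (100, 2): m=6 -> n=2
```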
Name: factor analysis
Definition: Factor analysis is a dimension reduction data transformation that is used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors. Factor analysis is related to principal component analysis (PCA), but the two are not identical. Both PCA and factor analysis aim to reduce the dimensionality of a set of data, but the approaches taken to do so are different for the two techniques. Factor analysis is clearly designed with the objective to identify certain unobservable factors from the observed variables, whereas PCA does not directly address this objective; at best, PCA provides an approximation to the required factors.
Name: principal component regression
Definition: The Principal Component Regression method is a regression analysis method that combines the Principal Component Analysis (PCA) spectral decomposition with an Inverse Least Squares (ILS) regression method to create a quantitative model for complex samples. Unlike quantitation methods based directly on Beer's Law, which attempt to calculate the absorptivity coefficients for the constituents of interest from a direct regression of the constituent concentrations onto the spectroscopic responses, the PCR method regresses the concentrations on the PCA scores.
Name: random forest procedure
Definition: random forest procedure is a type of data transformation used in classification and statistical learning using regression. The random forest procedure is a meta estimator that fits a number of classifying decision trees on various sub-samples of the dataset (it operates by constructing a multitude of decision trees at training time) and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size, but the samples are drawn with replacement if bootstrap=True (the default in scikit-learn). The random forest procedure outputs the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
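The sketch below fits a random forest classifier with bootstrapped sub-samples, as described above, on synthetic classification data (scikit-learn):

```python
# Minimal sketch: random forest classification with bootstrapping (scikit-learn).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=0)
rf.fit(Xtr, ytr)
print(rf.score(Xte, yte))   # mean accuracy on held-out data
```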
Name: Partial Least Square regression
Definition: Partial least squares regression (PLS regression) is a data transformation that bears some relation to principal components regression; instead of finding hyperplanes of maximum variance between the response and independent variables, it finds a linear regression model by projecting the predicted variables and the observable variables to a new space. Because both the X and Y data are projected to new spaces, the PLS family of methods are known as bilinear factor models. Partial least squares Discriminant Analysis (PLS-DA) is a variant used when the Y is categorical. PLS is used to find the fundamental relations between two matrices (X and Y), i.e. a latent variable approach to modeling the covariance structures in these two spaces. A PLS model will try to find the multidimensional direction in the X space that explains the maximum multidimensional variance direction in the Y space. PLS regression is particularly suited when the matrix of predictors has more variables than observations, and when there is multicollinearity among X values. By contrast, standard regression will fail in these cases (unless it is regularized). Partial least squares was introduced by the Swedish statistician Herman O. A. Wold, who then developed it with his son, Svante Wold. An alternative term for PLS (and more correct according to Svante Wold[1]) is projection to latent structures, but the term partial least squares is still dominant in many areas. Although the original applications were in the social sciences, PLS regression is today most widely used in chemometrics and related areas. It is also used in bioinformatics, sensometrics, neuroscience and anthropology.
Name: PLS1
Definition: a partial least square regression applied when there is only one variable in Y (the matrix of response variables), or it is desirable to model and optimize separately the performance of each of the variables in Y. This case is usually referred to as PLS1 regression (J = 1).
Name: PLS2
Definition: a partial least square regression applied to a multivariate response variable.
Name: Partial Least Square Discriminant Analysis
Definition: a version of PLS used for classification, where the input Y-block consists of group labels (a categorical variable) rather than a continuous variable.
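The sketch below shows the PLS1 case, a single response with far more predictor variables than observations, exactly the regime where standard regression fails (scikit-learn; simulated X and y).

```python
# Minimal sketch: PLS1 regression with more variables than observations.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(12)
X = rng.normal(size=(40, 100))                 # 40 observations, 100 variables
y = X[:, :3].sum(axis=1) + 0.1 * rng.normal(size=40)

pls = PLSRegression(n_components=2).fit(X, y)  # project to 2 latent variables
print(pls.score(X, y))                         # R^2 of the fitted model
```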
Name: center calculation data transformation
Definition: A center calculation data transformation is a data transformation that has objective center calculation.
Name: mode calculation
Definition: A mode calculation is a descriptive statistics calculation in which the mode is calculated which is the most common value in a data set. It is most often used as a measure of center for discrete data.
Name: median calculation
Definition: A median calculation is a descriptive statistics calculation in which the midpoint of the data set (the 0.5 quantile) is calculated. First, the observations are sorted in increasing order. For an odd number of observations, the median is the middle value of the sorted data. For an even number of observations, the median is the average of the two middle values.
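A minimal sketch of the two center calculations above using the Python standard library:

```python
# Mode and median calculation (Python standard library).
import statistics

print(statistics.median([2, 3, 3, 5, 7, 9]))  # even n: mean of middle two -> 4.0
print(statistics.mode([1, 2, 2, 3]))          # most common value -> 2
```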
Name: descriptive statistical calculation data transformation
Definition: A descriptive statistical calculation data transformation is a data transformation that has objective descriptive statistical calculation and which concerns any calculation intended to describe a feature of a data set, for example, its center or its variability.
Name: mode calculation
Definition: A mode calculation is a descriptive statistics calculation in which the mode is calculated which is the most common value in a data set. It is most often used as a measure of center for discrete data.
Name: median calculation
Definition: A median calculation is a descriptive statistics calculation in which the midpoint of the data set (the 0.5 quantile) is calculated. First, the observations are sorted in increasing order. For an odd number of observations, the median is the middle value of the sorted data. For an even number of observations, the median is the average of the two middle values.
Name: survival analysis data transformation
Definition: A data transformation which has the objective of performing survival analysis.
Name: log-rank test
Definition: The logrank test is a statistical hypothesis test used to compare the survival distributions of two or more groups. It is commonly employed in survival analysis, where the primary interest lies in comparing the survival experiences of different groups over time.
Name: ANOVA
Definition: ANOVA or analysis of variance is a data transformation which performs a statistical test of whether the means of several groups are all equal.
Name: one-way ANOVA
Definition: one-way anova is an analysis of variance where the different groups being compared are associated with the factor levels of only one independent variable. The null hypothesis is an absence of difference between the means calculated for each of the groups. The test assumes normality and equivariance of the data.
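An illustrative sketch in Python (assuming SciPy is available; the three groups are made-up measurements at three factor levels):

    from scipy.stats import f_oneway

    group_a = [4.1, 3.9, 4.3, 4.0]
    group_b = [4.8, 5.1, 4.9, 5.0]
    group_c = [4.2, 4.4, 4.1, 4.3]
    stat, p = f_oneway(group_a, group_b, group_c)
    print(stat, p)  # a small p-value argues against the null of equal means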
Name: two-way ANOVA
Definition: two-way anova is an analysis of variance where the different groups being compared are associated with the factor levels of exactly 2 independent variables. The null hypothesis is an absence of difference between the means calculated for each of the groups. The test assumes normality and equivariance of the data.
Name: multiway ANOVA
Definition: Multi-way anova is an analysis of variance where the different groups being compared are associated with the factor levels of more than 2 independent variables. The null hypothesis is an absence of difference between the means calculated for each of the groups. The test assumes normality and equivariance of the data.
Name: repeated measure ANOVA
Definition: repeated measure ANOVA is a kind of ANOVA specifically developed for non-independent observations, as found when repeated measurements are made on the same experimental unit. Repeated measure ANOVA is sensitive to departure from normality (evaluated using Bartlett's test), more so in the case of unbalanced groups (i.e. different sizes of sample populations). Departure from sphericity (evaluated using Mauchly's test) used to be an issue which is now handled robustly by modern tools such as R's lme4 or nlme, which accommodate dependence assumptions other than sphericity.
Name: multivariate analysis of variance
Definition: "The multivariate analysis of variance, or MANOVA, is a procedure for comparing multivariate sample means. As a multivariate procedure, it is used when there are two or more dependent variables, and is typically followed by significance tests involving individual dependent variables separately. It helps to answer: 1. Do changes in the independent variable(s) have significant effects on the dependent variables? 2. What are the relationships among the dependent variables? 3. What are the relationships among the independent variables?"
Name: binary classification
Definition: binary classification (or binomial classification) is a data transformation which aims to cast members of a set into 2 disjoint groups depending on whether the elements have a given property/feature or not.
Name: Anderson-Darling test
Definition: The Anderson–Darling test is a statistical test of whether a given sample of data is drawn from a given probability distribution.
Name: one-way ANOVA
Definition: one-way anova is an analysis of variance where the different groups being compared are associated with the factor levels of only one independent variable. The null hypothesis is an absence of difference between the means calculated for each of the groups. The test assumes normality and equivariance of the data.
Name: two-way ANOVA
Definition: two-way anova is an analysis of variance where the different groups being compared are associated with the factor levels of exactly 2 independent variables. The null hypothesis is an absence of difference between the means calculated for each of the groups. The test assumes normality and equivariance of the data.
Name: multiway ANOVA
Definition: Multi-way anova is an analysis of variance where the different groups being compared are associated with the factor levels of more than 2 independent variables. The null hypothesis is an absence of difference between the means calculated for each of the groups. The test assumes normality and equivariance of the data.
Name: Z-test
Definition: Z-test is a statistical test which evaluates the null hypothesis that the means of 2 populations are equal and returns a p-value.
Name: Fisher's exact test
Definition: Fisher's exact test is a statistical test used to determine if there are nonrandom associations between two categorical variables.
Name: Cochran-Mantel-Haenszel test for repeated tests of independence
Definition: the Cochran-Mantel-Haenszel test for repeated tests of independence is a statistical test which allows the comparison of two groups on a dichotomous/categorical response. It is used when the effect of the explanatory variable on the response variable is influenced by covariates that can be controlled. It is often used in observational studies where random assignment of subjects to different treatments cannot be controlled, but influencing covariates can. The null hypothesis is that the two nominal variables that are tested within each repetition are independent of each other. So there are 3 variables to consider: two categorical variables to be tested for independence of each other, and a third variable which identifies the repeats.
Name: Mann-Whitney U-test
Definition: The Mann-Whitney U-test is a null hypothesis statistical testing procedure which allows two groups (or conditions or treatments) to be compared without making the assumption that values are normally distributed. The Mann-Whitney test is the non-parametric equivalent of the t-test for independent samples
Name: Shapiro-Wilk test
Definition: Shapiro-Wilk test is a goodness of fit test which evaluates the null hypothesis that the sample is drawn from a population following a normal distribution
Name: Levene's test
Definition: Levene's test is a null hypothesis statistical test which evaluates the null hypothesis of equality of variance in several populations.
Name: Bartlett's test
Definition: Bartlett's test (see Snedecor and Cochran, 1989) is used to test if k samples are from populations with equal variances. Equal variances across samples is called homoscedasticity or homogeneity of variances. Some statistical tests, for example the analysis of variance, assume that variances are equal across groups or samples. The Bartlett test can be used to verify that assumption. Bartlett's test is sensitive to departures from normality. That is, if the samples come from non-normal distributions, then Bartlett's test may simply be testing for non-normality. Levene's test and the Brown–Forsythe test are alternatives to the Bartlett test that are less sensitive to departures from normality.
Name: Brown Forsythe test
Definition: the Brown Forsythe test is a statistical test which evaluates whether the variances of different groups are equal. It relies on computing the median rather than the mean, as used in Levene's test for homoscedasticity. This test may be used, for instance, to ensure that the conditions of application of ANOVA are met.
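A sketch with SciPy, whose levene function exposes both variants: center='median' gives the Brown-Forsythe test described above, while center='mean' gives the original Levene test (toy data):

    from scipy.stats import levene

    a = [24.1, 25.3, 26.2, 24.8]
    b = [22.9, 30.1, 27.5, 21.3]
    stat, p = levene(a, b, center='median')  # Brown-Forsythe variant
    print(stat, p)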
Name: Kolmogorov-Smirnov test
Definition: Kolmogorov-Smirnov test is a goodness of fit test which evaluates the null hypothesis that a sample is drawn from a population that follows a specific continuous probability distribution.
Name: F-test
Definition: an F-test is a statistical test which evaluates whether the computed test statistic follows an F-distribution under the null hypothesis. The F-test is sensitive to departure from normality. F-tests arise when decomposing the variability in a data set in terms of sums of squares.
Name: Wilcoxon signed rank test
Definition: The Wilcoxon signed rank test is a statistical test which tests the null hypothesis that the median difference between pairs of observations is zero. This is the non-parametric analogue to the paired t-test, and should be used if the distribution of differences between pairs may be non-normally distributed. The procedure involves a ranking, hence the name. The absolute values of the differences between observations are ranked from smallest to largest, with the smallest difference getting a rank of 1, the next larger difference getting a rank of 2, etc. Ties are given average ranks. The ranks of all differences in one direction are summed, and the ranks of all differences in the other direction are summed. The smaller of these two sums is the test statistic, W (sometimes symbolized Ts). Unlike most test statistics, smaller values of W are less likely under the null hypothesis.
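A minimal sketch of the W statistic exactly as described above (assuming no zero differences; scipy.stats.wilcoxon provides a full implementation including the p-value):

    from scipy.stats import rankdata

    def wilcoxon_w(x, y):
        diffs = [a - b for a, b in zip(x, y)]
        ranks = rankdata([abs(d) for d in diffs])  # ties receive average ranks
        w_pos = sum(r for r, d in zip(ranks, diffs) if d > 0)
        w_neg = sum(r for r, d in zip(ranks, diffs) if d < 0)
        return min(w_pos, w_neg)  # the smaller sum is the test statistic W

    print(wilcoxon_w([125, 115, 130, 140], [110, 122, 125, 120]))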
Name: Kruskal Wallis test
Definition: The Kruskal–Wallis test is a null hypothesis statistical testing procedure which allows multiple (n >= 2) groups (or conditions or treatments) to be compared without making the assumption that values are normally distributed. The Kruskal–Wallis test is the non-parametric equivalent of the independent samples ANOVA. It is most commonly used when there is one nominal variable and one measurement variable, and the measurement variable does not meet the normality assumption of an ANOVA.
Name: paired t-test
Definition: paired t-test is a statistical test specifically designed to analyse differences between paired observations in studies using a repeated measures design with only 2 repeated measurements per subject (before and after treatment, for example)
Name: statistical test power analysis
Definition: A statistical test power analysis is a data transformation which aims to determine the size of a statistical sample required to reach a desired significance level given a particular statistical test
Name: two sample Hotelling T2 test
Definition: Hotelling's T2 test is a statistical test which is a generalization of Student's t-test to assess whether the means of a set of variables remain unchanged when comparing 2 populations. It is a type of multivariate analysis
Name: ranking
Definition: ranking is a data transformation which turns a non-ordinal variable into an ordinal variable by sorting the values of the input variable and replacing each value by its position in the sorting result
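A minimal sketch of the transformation (plain Python; ties are ignored here, whereas statistical applications usually assign average ranks to tied values):

    def to_ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        ranks = [0] * len(values)
        for position, idx in enumerate(order, start=1):
            ranks[idx] = position  # replace each value by its position in the sort
        return ranks

    assert to_ranks([0.3, 1.2, 0.7]) == [1, 3, 2]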
Name: model parameter estimation
Definition: model parameter estimation is a data transformation that finds parameter values (the model parameter estimates) most compatible with the data as judged by the model.
Name: best linear unbiased predictor
Definition: best linear unbiased prediction is a data transformation which makes predictions under the assumption that the variable(s) under consideration have a random effect
Name: ordinary least squares estimation
Definition: the ordinary least squares estimation is a model parameter estimation for a linear regression model when the errors are uncorrelated and equal in variance. It is the best linear unbiased estimation (BLUE) method under these assumptions, and the uniformly minimum-variance unbiased estimator (UMVUE) with the addition of a Gaussian assumption.
Name: weighted least squares estimation
Definition: the weighted least squares estimation is a model parameter estimation for a linear regression model with errors that are independent but have heterogeneous variance. It is difficult to use in practice, as weights must be set based on the variance, which is usually unknown. If the true variance is known, it is the best linear unbiased estimation (BLUE) method under these assumptions, and the uniformly minimum-variance unbiased estimator (UMVUE) with the addition of a Gaussian assumption.
Name: generalized least squares estimation
Definition: the generalized least squares estimation is a model parameter estimation for a linear regression model with errors that are dependent and (possibly) have heterogeneous variance. It is difficult to use in practice, as the covariance matrix of the errors must be known in order to "whiten" the data and model. If the true covariance is known, it is the best linear unbiased estimation (BLUE) method under these assumptions, and the uniformly minimum-variance unbiased estimator (UMVUE) with the addition of a Gaussian assumption.
Name: feasible generalized least squares estimation
Definition: the feasible generalized least squares estimation is a model parameter estimation which is a practical implementation of Generalised Least Squares, where the covariance of the errors is estimated from the residuals of the regression model, providing the information needed to whiten the data and model. Each successive estimate of the whitening matrix improves the estimation of the regression parameters, which in turn are used to compute residuals and update the whitening matrix.
Name: iteratively reweighted least squares estimation
Definition: the iteratively reweighted least squares estimation is a model parameter estimation which is a practical implementation of Weighted Least Squares, where the heterogeneous variances of the errors are estimated from the residuals of the regression model, providing an estimate for the weights. Each successive estimate of the weights improves the estimation of the regression parameters, which in turn are used to compute residuals and update the weights
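A minimal numpy sketch of the iterative scheme just described, under the simplifying assumption that each error variance is estimated by the squared residual (floored to avoid division by zero); real implementations typically fit a smoother variance function:

    import numpy as np

    def irls(X, y, n_iter=10, eps=1e-6):
        w = np.ones(len(y))                                   # start from OLS weights
        for _ in range(n_iter):
            W = np.diag(w)
            beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)  # weighted LS step
            resid = y - X @ beta
            w = 1.0 / np.maximum(resid ** 2, eps)             # update weights from residuals
        return beta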
Name: maximum likelihood estimation
Definition: maximum likelihood estimation (MLE) is a method of estimating the parameters of a statistical model, given observations. MLE attempts to find the parameter values that maximize the likelihood function, given the observations. The method is based on the likelihood function L(θ; x): given a statistical model, i.e. a family of distributions { f(·; θ) : θ ∈ Θ }, where θ denotes the (possibly multi-dimensional) parameter of the model, the method of maximum likelihood finds the values of θ that maximize L(θ; x).
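A minimal sketch of MLE by numerical optimization, for a normal model with made-up data (assuming SciPy is available):

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import norm

    x = np.array([4.2, 5.1, 4.8, 5.5, 4.9])

    def neg_log_likelihood(params):
        mu, log_sigma = params                # sigma is log-parametrized to stay positive
        return -norm.logpdf(x, mu, np.exp(log_sigma)).sum()

    res = minimize(neg_log_likelihood, x0=[0.0, 0.0])
    print(res.x[0], np.exp(res.x[1]))         # close to the sample mean and (biased) sample sd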
Name: restricted maximum likelihood estimation
Definition: restricted maximum likelihood estimation (REML) is a kind of maximum likelihood estimation data transformation which estimates the variance components of random effects in univariate and multivariate meta-analysis. In contrast to maximum likelihood estimation, REML can produce unbiased estimates of variance and covariance parameters.
Name: ridge regression best linear unbiased predictor
Definition: RR-BLUP is a data transformation used in the context of estimating breeding value using a Bayesian ridge regression. It can be obtained from the Bayes B procedure by setting the π parameter to zero (π = 0) and assuming that all the markers have the same variance.
Name: genomic best linear unbiased prediction
Definition: a data transformation which calculates predictions of breeding values using an animal model and a relationship matrix calculated from the genomic/genetic markers (G matrix), in contrast to using pedigree information as in BLUP, also known as ABLUP
Name: trait-specific relationship matrix best linear unbiased prediction
Definition: a data transformation which calculates estimates of genomic estimated breeding values (GEBVs) in an animal or plant model utilizing trait-specific marker information.
Name: Bayes A
Definition: Bayes A is a data transformation used in the context of estimating breeding value, which relies on a Bayesian model and sets the prior probability π that a SNP has zero effect to zero (i.e. π = 0, so every SNP is assumed to have an effect)
Name: Bayes B
Definition: Bayes B is a data transformation used in the context of estimating breeding value, which relies on a Bayesian model, fixes the prior probability π that a SNP has zero effect at a set value (i.e. π > 0) and uses a mixture distribution.
Name: Bayesian least absolute shrinkage and selection operator
Definition: Bayesian LASSO is a data transformation where the regression parameters have independent Laplace (i.e. double-exponential) priors, which allows the LASSO estimates of linear regression parameters to be interpreted as Bayesian posterior mode estimates.
Name: Bayes C pi
Definition: Bayes C pi is a data transformation used to compute estimated breeding values using a Bayesian model and which assesses the SNP effects using Markov chain Monte Carlo methods. Bayes C pi treats the prior probability π that a SNP has zero effect as unknown. The method was devised to address shortcomings of the Bayes A and Bayes B approaches.
Name: reproducing kernel Hilbert space procedure
Definition: A data transformation that produces a reproducing kernel Hilbert space (or RKHS), which is a Hilbert space of functions in which point evaluation is a continuous linear functional.
Name: Bayes R
Definition: Bayes R is a data transformation used in the context of estimating breeding value, which relies on a Bayesian model to compute 'genomic estimated breeding values'. In contrast to Bayes B methods, Bayes R assumes that the true SNP effects are derived from a series of normal distributions, the first with zero variance, up to one with a variance of approximately 1% of the genetic variance.
Name: best linear unbiased estimator
Definition: the best linear unbiased estimator (BLUE) of a parameter is the linear unbiased estimator with the smallest variance among all linear unbiased estimators; under the Gauss-Markov assumptions, ordinary least squares yields the BLUE of the coefficients of a linear regression model.
Name: Breslow-Day test for homogeneity of odds ratio
Definition: the Breslow-Day test is a statistical test which evaluates whether the odds ratios are homogeneous across N 2x2 contingency tables, for instance several 2x2 contingency tables associated with different strata of a stratified population when evaluating the relationship between exposure and outcome, or associated with samples coming from several centres in a multicentre clinical trial.
Name: Tarone's test for homogeneity of odds ratio
Definition: Tarone's test for homogeneity of odds ratio is a statistical test which evaluates the null hypothesis that odds ratios are homogeneous
Name: Cochran-Armitage test for trend
Definition: The Cochran-Armitage test is a statistical test used in categorical data analysis when the aim is to assess the presence of an association between a dichotomous variable (a variable with two categories) and a polychotomous variable (a variable with k categories). The two-level variable represents the response, and the other represents an explanatory variable with ordered levels. The null hypothesis is the hypothesis of no trend, which means that the binomial proportion is the same for all levels of the explanatory variable. For example, doses of a treatment can be ordered as 'low', 'medium', and 'high', and we may suspect that the treatment benefit cannot become smaller as the dose increases. The trend test is often used as a genotype-based test for case-control genetic association studies.
Name: one sample Hotelling T2 test
Definition: The one-sample Hotelling's T2 is the multivariate extension of the common one-sample or paired Student's t-test. In a one-sample t-test, the mean response is compared against a specific value. Hotelling's one-sample T2 is used when the number of response variables is two or more, although it can be used when there is only one response variable. T2 makes the usual assumption that the data are approximately multivariate normal. Randomization versions of the test exist that do not rely on this assumption and should be used whenever exact results free of it are required.
Name: meta analysis
Definition: meta-analysis is a data transformation which uses the effect size estimates from several independent quantitative scientific studies addressing the same question in order to assess the consistency of findings.
Name: meta analysis by Hartung-Knapp-Sidik-Jonkman method
Definition: a random effect meta analysis procedure defined by Hartung and Knapp and by Sidik and Jonkman which performs better than DerSimonian and Laird approach, especially when there is heterogeneity and the number of studies in the meta-analysis is small.
Name: meta analysis by DerSimonian and Laird method
Definition: a meta analysis which relies on the computation of the DerSimonian and Laird estimator as a measure of heterogeneity over a set of studies.
Name: meta analysis by Hunter-Schmidt method
Definition: a meta analysis which relies on the computation of the Hunter and Schmidt estimator as a measure of heterogeneity over a set of studies by considering the weighted mean of the raw correlation coefficients. Hunter and Schmidt developed what is commonly termed validity generalization procedures (Schmidt and Hunter, 1977), which involve correcting the effect sizes in the meta-analysis for sampling error, measurement error and range restriction.
Name: Scheffe test
Definition: the Scheffe test is a data transformation which evaluates all possible contrasts, adjusting the significance level to account for multiple comparisons. The test is therefore conservative. Confidence intervals can be constructed for the corresponding contrasts. It was developed by the American statistician Henry Scheffe in 1959.
Name: Least significant difference test
Definition: the LSD test is a statistical test for multiple comparisons of treatments by means of the least significant difference, following an ANOVA analysis
Name: confidence interval calculation
Definition: confidence interval calculation is a data transformation which determines a confidence interval for a given statistical parameter
Name: ANCOVA
Definition: ANCOVA or analysis of covariance is a data transformation which evaluates whether population means of a dependent variable are equal across levels of a categorical independent variable while controlling for the effects of other continuous variables, known as covariates. Therefore, when performing ANCOVA, we are adjusting the dependent variable means to what they would be if all groups were equal on the covariates. It augments the ANOVA model with one or more additional quantitative variables, called covariates, which are related to the response variable. The covariates are included to reduce the variance in the error terms and provide more precise measurement of the treatment effects. ANCOVA is used to test the main and interaction effects of the factors, while controlling for the effects of the covariates.
Name: Hardy-Weinberg equilibrium testing
Definition: the Hardy-Weinberg equilibrium test is a statistical test which aims to evaluate whether a population's allele proportions are stable or not. It is used as a means of quality control to evaluate the possibility of genotyping error or population structure.
Name: Tukey HSD for Post-Hoc Analysis
Definition: the Tukey Honestly Significant Difference (HSD) test is a statistical test used following an ANOVA test yielding a statistically significant p-value in order to determine which means are different, to a given level of significance. The Tukey HSD test relies on the q-distribution. The procedure is conservative, meaning that if sample sizes (the sizes of the different study groups) are equal, the risk of a Type I error is exactly α, and if sample sizes are unequal it is less than α.
Name: cartesian product
Definition: a cartesian product is a data transformation which operates on n sets to produce the set of all possible ordered n-tuples in which the i-th element of each tuple comes from the i-th set
Name: cartesian product 2 sets
Definition: a cartesian product applied to exactly 2 sets, producing the set of all ordered pairs (a, b) in which a comes from the first set and b from the second.
Name: Mauchly's test for sphericity
Definition: Mauchly's test for sphericity is a statistical test which evaluates whether the variances of the differences between all combinations of the groups are equal, a property known as 'sphericity' in the context of repeated measures. It is used, for instance, prior to repeated measure ANOVA. The test works by assessing whether a Wishart-distributed covariance matrix (or a transformation thereof) is proportional to a given matrix.
Name: continuous variable discretization
Definition: discretization is a data transformation converting a continuous variable into a polychotomous variable by applying a set of discretization rules
Name: model fitting
Definition: Model fitting is a data transformation process which evaluates whether a model appropriately represents a dataset. A model fitting process tests the goodness of fit of the model to the data.
Name: Woolf's test
Definition: Woolf's test is a statistical test which evaluates the null hypothesis that odds ratios are the same across all strata of the population under investigation
Name: repeated measure ANOVA
Definition: repeated measure ANOVA is a kind of ANOVA specifically developed for non-independent observations, as found when repeated measurements are made on the same experimental unit. Repeated measure ANOVA is sensitive to departure from normality (evaluated using Bartlett's test), more so in the case of unbalanced groups (i.e. different sizes of sample populations). Departure from sphericity (evaluated using Mauchly's test) used to be an issue which is now handled robustly by modern tools such as R's lme4 or nlme, which accommodate dependence assumptions other than sphericity.
Name: Newman-Keuls test post-hoc analysis
Definition: The Newman–Keuls or Student–Newman–Keuls (SNK) method is a stepwise multiple comparisons procedure used to identify sample means that are significantly different from each other. It was named after Student (1927), D. Newman, and M. Keuls. This procedure is often used as a post-hoc test whenever a significant difference between three or more sample means has been revealed by an analysis of variance (ANOVA). The Newman–Keuls method is similar to Tukey's range test as both procedures use Studentized range statistics. Compared to Tukey's range test, the Newman–Keuls method is more powerful but less conservative.
Name: permutation numbering
Definition: permutation numbering is a data transformation which counts the number of possible permutations of the elements of a set of size n, each element occurring exactly once. This number is n factorial (n!).
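A one-line check of the count with the Python standard library:

    from itertools import permutations
    from math import factorial

    n = 4
    assert len(list(permutations(range(n)))) == factorial(n) == 24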
Name: transmission disequilibrium test
Definition: The transmission disequilibrium test is a statistical test for genetic linkage between genetic marker and a trait in families. The test is robust to population structure.
Name: Breusch-Pagan test
Definition: Breusch-Pagan test is a statistical test which computes a score test of the hypothesis of constant error variance against the alternative that the error variance changes with the level of the response (fitted values), or with a linear combination of predictors.
Name: hypergeometric test
Definition: hypergeometric test is a null hypothesis test which evaluates whether a random variable follows a hypergeometric distribution. It is a test of goodness of fit to that distribution. The test is suited to situations involving sampling from a finite set without replacement, for instance testing for enrichment or depletion of elements (e.g. GO categories, genes).
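An enrichment sketch with SciPy (all numbers are made up: a population of 1000 genes, 100 of them in a GO category, 50 genes drawn, 12 observed in the category):

    from scipy.stats import hypergeom

    M, n, N, k = 1000, 100, 50, 12
    p_enrich = hypergeom.sf(k - 1, M, n, N)  # P(X >= k) under the hypergeometric
    print(p_enrich)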
Name: one tailed test
Definition: a one-tailed test is a statistical test which, assuming an unskewed probability distribution, allocates all of the significance level to evaluating only one direction of a difference. The one-tailed test provides more power to detect an effect in one direction by not testing the effect in the other direction. A one-tailed test should be preceded by a two-tailed test in order to avoid missing an alternative effect that could explain an observed difference.
Name: two tailed test
Definition: a two tailed test is a statistical test which assesses the null hypothesis of absence of difference, assuming a symmetric (not skewed) underlying probability distribution, by allocating half of the selected significance level to each of the directions of change which could explain a difference (for example, a difference can be an excess or a loss).
Name: hypergeometric test
Definition: hypergeometric test is a null hypothesis test which evaluates whether a random variable follows a hypergeometric distribution. It is a test of goodness of fit to that distribution. The test is suited to situations involving sampling from a finite set without replacement, for instance testing for enrichment or depletion of elements (e.g. GO categories, genes).
Name: exact binomial test
Definition: a binomial test is a statistical hypothesis test which evaluates observations from a Bernoulli experiment; that is, it tests the statistical significance of deviations of observations falling into 2 categories from a theoretically expected distribution (the binomial distribution). It is a goodness of fit test.
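A minimal sketch with SciPy (binomtest requires SciPy >= 1.7; the counts are made up):

    from scipy.stats import binomtest

    result = binomtest(14, n=20, p=0.5)  # 14 successes in 20 trials vs. expected p = 0.5
    print(result.pvalue)                 # two-sided by default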
Name: one sample t-test
Definition: one sample t-test is a kind of Student's t-test which evaluates if a given sample can be reasonably assumed to be taken from the population. The test compares the sample statistic (m) to the population parameter (M). The one sample t-test is the small sample analog of the z test, which is suitable for large samples.
Name: two sample t-test with equal variance
Definition: two sample t-test is a null hypothesis statistical test which is used to reject or accept the hypothesis of absence of difference between the means of 2 randomly sampled populations. It uses a t-distribution for the test and assumes that the variables in the populations are normally distributed and have equal variances.
Name: two sample t-test with unequal variance
Definition: Welch t-test is a two sample t-test used when the variances of the 2 populations/samples are thought to be unequal (the homoskedasticity hypothesis is not verified). In this version of the two-sample t-test, the denominator used to form the t-statistic does not rely on a 'pooled variance' estimate.
Name: Barnard's test
Definition: Barnard's test is an exact statistical test used to determine if there are non-random associations between two categorical variables. It was developed in 1949 by Barnard and is, in most cases, more powerful than the Fisher exact test.
Name: statistical model selection
Definition: A statistical model selection is a data transformation which is based on computing a relative quality value in order to evaluate and select which model best explains data.
Name: Bayesian model selection
Definition: A Bayesian model selection is a data transformation which is based on Bayesian statistics to compute Bayes factor in order to evaluate which model best explains data.
Name: best linear unbiased predictor
Definition: best linear unbiased prediction is a data transformation which makes predictions under the assumption that the variable(s) under consideration have a random effect
Name: breeding value estimation
Definition: breeding value estimation is a data transformation process aiming at computing breeding value estimates of an organism given a set of genomic (SNP) observations, pedigree information and/or phenotypic observations.
Name: best linear unbiased predictor
Definition: best linear unbiased prediction is a data transformation which makes predictions under the assumption that the variable(s) under consideration have a random effect
Name: breeding value estimation using genotype data
Definition: breeding value estimation is a data transformation process aiming at computing breeding value estimates of an organism given a set of genomic (SNP) observations.
Name: breeding value estimation using pedigree data
Definition: breeding value estimation is a data transformation process aiming at computing breeding value estimates of an organism given a set of pedigree information.
Name: breeding value estimation using phenotypic data
Definition: breeding value estimation is a data transformation process aiming at computing breeding value estimates of an organism given a set of phenotypic observations.
Name: ridge regression best linear unbiased predictor
Definition: RR-BLUP is a data transformation used in the context of estimating breeding value using a Bayesian ridge regression. It can be obtained from the Bayes B procedure by setting the π parameter to zero (π = 0) and assuming that all the markers have the same variance.
Name: genomic best linear unbiased prediction
Definition: a data transformation which calculates predictions of breeding values using an animal model and a relationship matrix calculated from the genomic/genetic markers (G matrix), in contrast to using pedigree information as in BLUP, also known as ABLUP
Name: trait-specific relationship matrix best linear unbiased prediction
Definition: a data transformation which calculates estimates of genomic estimated breeding values (GEBVs) in an animal or plant model utilizing trait-specific marker information.
Name: Bayes A
Definition: Bayes A is a data transformation used in the context of estimating breeding value, which relies on a Bayesian model and sets the prior probability π that a SNP has zero effect to zero (i.e. π = 0, so every SNP is assumed to have an effect)
Name: Bayes B
Definition: Bayes B is a data transformation used in the context of estimating breeding value, which relies on a Bayesian model, fixes the prior probability π that a SNP has zero effect at a set value (i.e. π > 0) and uses a mixture distribution.
Name: Bayesian least absolute shrinkage and selection operator
Definition: Bayesian LASSO is a data transformation where the regression parameters have independent Laplace (i.e. double-exponential) priors, which allows the LASSO estimates of linear regression parameters to be interpreted as Bayesian posterior mode estimates.
Name: Bayes C pi
Definition: Bayes C pi is a data transformation used to compute estimated breeding values using a Bayesian model and which assesses the SNP effects using Markov chain Monte Carlo methods. Bayes C pi treats the prior probability π that a SNP has zero effect as unknown. The method was devised to address shortcomings of the Bayes A and Bayes B approaches.
Name: reproducing kernel Hilbert space procedure
Definition: A data transformation that produces a reproducing kernel Hilbert space (or RKHS), which is a Hilbert space of functions in which point evaluation is a continuous linear functional.
Name: Bayes R
Definition: Bayes R is a data transformation used in the context of estimating breeding value, which relies on a Bayesian model to compute 'genomic estimated breeding values'. In contrast to Bayes B methods, Bayes R assumes that the true SNP effects are derived from a series of normal distributions, the first with zero variance, up to one with a variance of approximately 1% of the genetic variance.
Name: best linear unbiased estimator
Definition: the best linear unbiased estimator (BLUE) of a parameter is the linear unbiased estimator with the smallest variance among all linear unbiased estimators; under the Gauss-Markov assumptions, ordinary least squares yields the BLUE of the coefficients of a linear regression model.
Name: repeated measure analysis
Definition: repeated measure analysis is a kind of data transformation which deals with signals measured in the same experimental units at different times and, possibly, under different conditions over a period of time. Data produced by longitudinal studies qualify for such analysis. Since measurements are made on the same experimental units a number of times, they are likely to be correlated. Repeated measure analysis usually takes into consideration the possibility of correlation with time. It does so by specifying covariance structure in the analysis
Name: contrast estimation
Definition: a data transformation that finds a contrast value (the contrast estimate) by computing the weighted sum of model parameter estimates using a set of contrast weights.
Name: Yuen t-Test with trimmed means
Definition: Yuen's t-test is a two sample t-test for populations of unequal variance which provides a more robust procedure under normal and long-tailed distributions. The test computes the t statistic using 'trimmed means' rather than arithmetic means, together with winsorized variances.
Name: maximum likelihood estimation
Definition: maximum likelihood estimation (MLE) is a method of estimating the parameters of a statistical model, given observations. MLE attempts to find the parameter values that maximize the likelihood function, given the observations. The method is based on the likelihood function L(θ; x): given a statistical model, i.e. a family of distributions { f(·; θ) : θ ∈ Θ }, where θ denotes the (possibly multi-dimensional) parameter of the model, the method of maximum likelihood finds the values of θ that maximize L(θ; x).
Name: restricted maximum likelihood estimation
Definition: restricted maximum likelihood estimation (REML) is a kind of maximum likelihood estimation data transformation which estimates the variance components of random effects in univariate and multivariate meta-analysis. In contrast to maximum likelihood estimation, REML can produce unbiased estimates of variance and covariance parameters.
Name: McNemar test
Definition: McNemar's test is a statistical test used on paired nominal data. It is applied to 2 × 2 contingency tables with a dichotomous trait, with matched pairs of subjects, to determine whether the row and column marginal frequencies are equal (that is, whether there is "marginal homogeneity"). It is named after Quinn McNemar, who introduced it in 1947. An application of the test in genetics is the transmission disequilibrium test for detecting linkage disequilibrium
Name: Cochran's q test for heterogeneity
Definition: Cochran's Q test is a statistical test used for unreplicated randomized block design experiments with a binary response variable and paired data. In the analysis of two-way randomized block designs where the response variable can take only two possible outcomes (coded as 0 and 1), Cochran's Q test is a non-parametric statistical test to verify whether k treatments have identical effects.
Name: Dixon Q test
Definition: Dixon test is a statistical test used to detect outliers in a univariate data set assumed to come from a normally distributed population.
Name: Grubbs' test
Definition: Grubbs' test is a statistical test used to detect one outlier in a univariate data set assumed to come from a normally distributed population.
Name: Tietjen-Moore test for outliers
Definition: the Tietjen-Moore test for outliers is a statistical test used to detect outliers and corresponds to a generalization of Grubbs' test, thus allowing detection of more than one outlier in a univariate data set assumed to come from a normally distributed population. If testing for a single outlier, the Tietjen-Moore test is equivalent to Grubbs' test.
Name: generalized extreme studentized deviate test
Definition: The Extreme Studentized Deviate Test is a statistical test used to detect outliers in a univariate data set assumed to come from a normally distributed population. The ESD Test differs from the Grubbs' test and the Tietjen-Moore test in the sense that it contains built-in correction for multiple testing.
Name: Dixon Q test
Definition: Dixon test is a statistical test used to detect outliers in a univariate data set assumed to come from a normally distributed population.
Name: Grubbs' test
Definition: Grubbs' test is a statistical test used to detect one outlier in a univariate data set assumed to come from a normally distributed population.
Name: Tietjen-Moore test for outliers
Definition: the Tietjen-Moore test for outliers is a statistical test used to detect outliers and corresponds to a generalization of Grubbs' test, thus allowing detection of more than one outlier in a univariate data set assumed to come from a normally distributed population. If testing for a single outlier, the Tietjen-Moore test is equivalent to Grubbs' test.
Name: multivariate analysis of variance
Definition: "The multivariate analysis of variance, or MANOVA, is a procedure for comparing multivariate sample means. As a multivariate procedure, it is used when there are two or more dependent variables, and is typically followed by significance tests involving individual dependent variables separately. It helps to answer: 1. Do changes in the independent variable(s) have significant effects on the dependent variables? 2. What are the relationships among the dependent variables? 3. What are the relationships among the independent variables?"
Name: interim analysis
Definition: interim analysis is a data transformation used to analyse studies implementing a group-sequential design, in order to evaluate and interpret the accumulating information during a clinical trial; that is, analysis of the data is conducted before full data collection has been completed. Clinical trials are unusual in that enrollment of patients is a continual process staggered in time. This means that if a treatment is particularly beneficial or harmful compared to the concurrent placebo group while the study is on-going, the investigators are ethically obliged to assess that difference using the data at hand and to make a deliberate consideration of terminating the study earlier than planned.
Name: O'Brien-Fleming boundary analysis
Definition: the O'Brien-Fleming boundary analysis is a kind of interim analysis method devised by O'Brien and Fleming in which the threshold for stopping is very stringent at early interim analyses and becomes less stringent at later ones. As with all frequentist methods of this type, it focuses on controlling the type I error rate, since repeated hypothesis testing of accumulating data inflates the type I error rate of a clinical trial.
Name: Pocock boundary analysis
Definition: The Pocock boundary analysis gives a p-value threshold for each interim analysis which guides the data monitoring committee on whether to stop the trial. The boundary used depends on the number of interim analyses. The Pocock boundary is simple to use in that the p-value threshold is the same at each interim analysis. The disadvantages are that the number of interim analyses must be fixed at the start and it is not possible under this scheme to add analyses after the trial has started. Another disadvantage is that investigators and readers frequently do not understand how the p-values are reported: for example, if there are five interim analyses planned, but the trial is stopped after the third interim analysis because the p-value was 0.01, then the overall p-value for the trial is still reported as <0.05 and not as 0.01. As with all frequentist methods of this type, it focuses on controlling the type I error rate, since repeated hypothesis testing of accumulating data inflates the type I error rate of a clinical trial.
Name: Haybittle-Peto boundary analysis
Definition: The Haybittle–Peto boundary analysis is an interim analysis where a rule for deciding when to stop a clinical trial prematurely is defined. It is named for John Haybittle and Richard Peto. The Haybittle–Peto boundary is one such stopping rule, and it states that if an interim analysis shows a probability of 0.001 or less that a difference as extreme or more extreme between the treatments is found, given that the null hypothesis is true, then the trial should be stopped early. The final analysis is still evaluated at the normal level of significance (usually 0.05). The main advantage of the Haybittle–Peto boundary is that the same threshold is used at every interim analysis, unlike the O'Brien–Fleming boundary, which changes at every analysis. Also, using the Haybittle–Peto boundary means that the final analysis is performed using a 0.05 level of significance as normal, which makes it easier for investigators and readers to understand. The main argument against the Haybittle–Peto boundary is that some investigators believe it is too conservative and makes it too difficult to stop a trial. As with all frequentist methods of this type, it focuses on controlling the type I error rate, since repeated hypothesis testing of accumulating data inflates the type I error rate of a clinical trial.
Name: degree of freedom approximation
Definition: An estimate of the number of degrees of freedom.
Name: Kenward-Roger degree of freedom approximation
Definition: The Kenward-Roger method's fundamental idea is to calculate the approximate mean and variance of their statistic and then match moments with an F distribution to obtain the denominator degrees of freedom.
Name: Satterthwaite degree of freedom approximation
Definition: Satterthwaite degree of freedom approximation is a type of degree of freedom approximation which is used to estimate an “effective degrees of freedom” for a probability distribution formed from several independent normal distributions where only estimates of the variance are known. It was originally developed by statistician Franklin E. Satterthwaite.
Name: between-within denominator degrees of freedom approximation
Definition: a data transformation used to determine the number of degrees of freedom
Name: ridge regression best linear unbiased predictor
Definition: RR-BLUP is a data transformation used in the context of estimating breeding value using a Bayesian ridge regression. It can be obtained from the Bayes B procedure by setting the π parameter to zero (π = 0) and assuming that all the markers have the same variance.
Name: genomic best linear unbiased prediction
Definition: a data transformation which calculates predictions of breeding values using an animal model and a relationship matrix calculated from the genomic/genetic markers (G matrix), in contrast to using pedigree information as in BLUP, also known as ABLUP
Name: trait-specific relationship matrix best linear unbiased prediction
Definition: a data transformation which calculates estimates of genomic estimated breeding values (GEBVs) in an animal or plant model utilizing trait-specific marker information.
Name: Bayes A
Definition: Bayes A is a data transformation used in the context of estimating breeding value, which relies on a Bayesian model and sets the prior probability π that a SNP has zero effect to zero (i.e. π = 0, so every SNP is assumed to have an effect)
Name: Bayes B
Definition: Bayes B is a data transformation used in the context of estimating breeding value, which relies on a Bayesian model, fixes the prior probability π that a SNP has zero effect at a set value (i.e. π > 0) and uses a mixture distribution.
Name: Dunn’s multiple comparison test
Definition: Dunn’s Multiple Comparison Test is a post hoc (i.e. it’s run after an ANOVA) non parametric test (a “distribution free” test that doesn’t assume your data comes from a particular distribution). It is one of the least powerful of the multiple comparisons tests and can be a very conservative test–especially for larger numbers of comparisons. The Dunn is an alternative to the Tukey test when you only want to test for differences in a small subset of all possible pairs; For larger numbers of pairwise comparisons, use Tukey’s instead. Use Dunn’s when you choose to test a specific number of comparisons before you run the ANOVA and when you are not comparing to controls. If you are comparing to a control group, use the Dunnett test instead.
Name: Conover-Iman test of multiple comparisons using rank sums
Definition: the Conover-Iman test for stochastic dominance is a statistical test for multiple group comparisons which reports the results among multiple pairwise comparisons after a Kruskal-Wallis test for stochastic dominance among k groups (Kruskal and Wallis, 1952). The interpretation of stochastic dominance requires an assumption that the CDF of one group does not cross the CDF of the other. The null hypothesis for each pairwise comparison is that the probability of observing a randomly selected value from the first group that is larger than a randomly selected value from the second group equals one half; this null hypothesis corresponds to that of the Wilcoxon-Mann-Whitney rank-sum test. Like the rank-sum test, if the data can be assumed to be continuous, and the distributions are assumed identical except for a difference in location, the Conover-Iman test may be understood as a test for median difference. Implementations such as the R package conover.test account for tied ranks. The Conover-Iman test is strictly valid if and only if the corresponding Kruskal-Wallis null hypothesis is rejected.
Name: Bayesian least absolute shrinkage and selection operator
Definition: Bayesian LASSO is a data transformation where the regression parameters have independent Laplace (i.e. double-exponential) priors, which allows the LASSO estimates of linear regression parameters to be interpreted as Bayesian posterior mode estimates.
Name: data imputation
Definition: Data imputation is a data transformation process whereby missing data is replaced with an estimated value for the missing element. The substituted values are intended to create a data record that does not fail edits. Various methods may be used to produce these substituted values.
Name: last observation carried forward data imputation
Definition: last observation carried forward data imputation is a type of data imputation which uses a very simple method for substituting a missing value for an observation: the last observed value of the variable is carried forward to fill the gap. It should be noted that this method gives a biased estimate of the treatment effect and underestimates the variability of the estimated result, and it should therefore be used cautiously.
Name: regression data imputation
Definition: regression data imputation is a type of data imputation where missing values are replaced with values predicted by a regression function fitted to the observed data.
Name: substitution by the mean data imputation
Definition: substitution by the mean data imputation is a type of data imputation where missing values are replaced with the mean of the variable.
Name: multivariate imputation with chained equations
Definition: multivariate imputation with chained equations (MICE) is a type of data imputation which uses an algorithm devised by Stef van Buuren and Karin Groothuis-Oudshoorn
Name: k-nearest neighbour data imputation
Definition: k-nearest neighbour imputation is a data imputation which uses the k-nearest neighbour algorithm to compute a substitution value for the missing values. For every observation to be imputed, it identifies the 'k' closest observations based on Euclidean distance and computes the weighted average (weighted by distance) of these 'k' observations.
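A sketch with scikit-learn's KNNImputer, whose weights="distance" option matches the distance-weighted average described above (toy data):

    import numpy as np
    from sklearn.impute import KNNImputer

    X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [4.0, 8.0]])
    imputer = KNNImputer(n_neighbors=2, weights="distance")
    print(imputer.fit_transform(X))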
Name: Bayes C pi
Definition: Bayes C pi is a data transformation used to compute estimated breeding values using a Bayesian model and which assesses the SNP effects using Markov chain Monte Carlo methods. Bayes C pi treats the prior probability π that a SNP has zero effect as unknown. The method was devised to address shortcomings of the Bayes A and Bayes B approaches.
Name: sampling from a probability distribution
Definition: sampling from a probability distribution is a data transformation which aims at obtaining a sequence of random samples from a probability distribution for which direct sampling is difficult.
Name: Gibbs sampling
Definition: Gibbs sampling or a Gibbs sampler is a Markov chain Monte Carlo (MCMC) algorithm for obtaining a sequence of observations which are approximated from a specified multivariate probability distribution, when direct sampling is difficult.
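A minimal Gibbs sampler sketch for a bivariate standard normal with correlation rho, alternating draws from the known conditionals x|y ~ N(rho*y, 1 - rho^2) and y|x ~ N(rho*x, 1 - rho^2) (illustrative only):

    import numpy as np

    rng = np.random.default_rng(0)
    rho, n_samples = 0.8, 5000
    x = y = 0.0
    samples = []
    for _ in range(n_samples):
        x = rng.normal(rho * y, np.sqrt(1 - rho ** 2))  # draw x given y
        y = rng.normal(rho * x, np.sqrt(1 - rho ** 2))  # draw y given x
        samples.append((x, y))
    print(np.corrcoef(np.array(samples).T)[0, 1])       # should approach rho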
Name: Metropolis–Hastings sampling
Definition: the Metropolis–Hastings algorithm is a Markov chain Monte Carlo (MCMC) method for obtaining a sequence of random samples from a probability distribution for which direct sampling is difficult.
Name: sampling distribution estimation by bootstrapping
Definition: Bootstrapping is the practice of estimating properties of an estimator (such as its variance) by measuring those properties when sampling from an approximating distribution.
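A minimal sketch: bootstrapping the variance of the sample mean (toy data, numpy only):

    import numpy as np

    rng = np.random.default_rng(0)
    data = np.array([2.3, 1.9, 3.1, 2.8, 2.5, 3.4, 2.0])
    boot_means = [rng.choice(data, size=len(data), replace=True).mean()
                  for _ in range(2000)]
    print(np.var(boot_means))  # bootstrap estimate of the estimator's variance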
Name: reproducing kernel Hilbert space procedure
Definition: A data transformation that produces a reproducing kernel Hilbert space (or RKHS), which is a Hilbert space of functions in which point evaluation is a continuous linear functional.
Name: Bayes R
Definition: Bayes R is a data transformation used in the context of estimating breeding value, which relies on a Bayesian model to compute 'genomic estimated breeding values'. In contrast to Bayes B methods, Bayes R assumes that the true SNP effects are derived from a series of normal distributions, the first with zero variance, up to one with a variance of approximately 1% of the genetic variance.
Name: random forest procedure
Definition: random forest procedure is a type of data transformation used in classification and in statistical learning by regression. The random forest procedure is a meta-estimator that fits a number of decision trees on various sub-samples of the dataset (it operates by constructing a multitude of decision trees at training time) and uses averaging to improve the predictive accuracy and to control over-fitting. The sub-sample size is the same as the original input sample size, but the samples are drawn with replacement when bootstrapping is enabled (e.g. bootstrap=True, the default in scikit-learn). The random forest procedure outputs the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
Name: statistical model term testing
Definition: A statistical model term testing is a data transformation that accounts for the evaluation of a component of a statistical model or model term.
Name: Partial Least Square regression
Definition: Partial least squares regression (PLS regression) is a data transformation that bears some relation to principal components regression; instead of finding hyperplanes of maximum variance between the response and independent variables, it finds a linear regression model by projecting the predicted variables and the observable variables to a new space. Because both the X and Y data are projected to new spaces, the PLS family of methods are known as bilinear factor models. Partial least squares discriminant analysis (PLS-DA) is a variant used when the Y is categorical. PLS is used to find the fundamental relations between two matrices (X and Y), i.e. a latent variable approach to modeling the covariance structures in these two spaces. A PLS model will try to find the multidimensional direction in the X space that explains the maximum multidimensional variance direction in the Y space. PLS regression is particularly suited when the matrix of predictors has more variables than observations, and when there is multicollinearity among X values. By contrast, standard regression will fail in these cases (unless it is regularized). Partial least squares was introduced by the Swedish statistician Herman O. A. Wold, who then developed it with his son, Svante Wold. An alternative term for PLS (and, according to Svante Wold, a more correct one) is projection to latent structures, but the term partial least squares is still dominant in many areas. Although the original applications were in the social sciences, PLS regression is today most widely used in chemometrics and related areas. It is also used in bioinformatics, sensometrics, neuroscience and anthropology.
Name: PLS1
Definition: a partial least square regression applied when there is only one variable in Y (the matrix of response variables), or it is desirable to model and optimize separately the performance of each of the variables in Y. This case is usually referred to as PLS1 regression (J = 1).
Name: PLS2
Definition: a partial least square regression applied to a multivariate response variable.
Name: Partial Least Square Discriminant Analysis
Definition: a version of PLS used for classification, where the input y-block consists of group labels (a categorical variable) rather than a continuous variable
Name: non-iterative Partial Least Squares
Definition: a data transformation which finds principal components by applying the non-linear iterative partial least squares (NIPALS) algorithm
Name: SIMPLS
Definition: SIMPLS is an algorithm for partial least squares (PLS) regression which calculates the PLS factors directly as linear combinations of the original variables. The PLS factors are determined so as to maximize a covariance criterion, while obeying certain orthogonality and normalization restrictions. This approach follows that of other traditional multivariate methods. The construction of deflated data matrices, as in the nonlinear iterative partial least squares (NIPALS) PLS algorithm, is avoided. For univariate y, SIMPLS is equivalent to PLS1 and closely related to existing bidiagonalization algorithms; this follows from an analysis of PLS1 regression in terms of Krylov sequences. For multivariate Y there is a slight difference between the SIMPLS approach and NIPALS-PLS2. In practice the SIMPLS algorithm is fast and easy to interpret, as it does not involve a breakdown of the data sets. The acronym SIMPLS comes from 'straightforward implementation of a statistically inspired modification of the PLS method'.
Name: improved Kernel PLS
Definition: improved kernel PLS is a data transformation which implements a very fast kernel algorithm for updating PLS models in a recursive manner and for exponentially discounting past data.
Name: singular value decomposition
Definition: a data transformation which computes the singular-value decomposition of a rectangular matrix. The singular-value decomposition is very general in the sense that it can be applied to any m × n matrix, whereas eigenvalue decomposition can only be applied to certain classes of square matrices.
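A minimal sketch with numpy on a rectangular (non-square) matrix:

    import numpy as np

    A = np.array([[1.0, 0.0, 2.0],
                  [0.0, 3.0, 0.0]])                      # a 2 x 3 matrix
    U, s, Vt = np.linalg.svd(A)
    print(np.allclose(U @ np.diag(s) @ Vt[:len(s)], A))  # reconstruction check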
Name: best linear unbiased estimator
Definition: the best linear unbiased estimator (BLUE) of a parameter is the linear unbiased estimator with the smallest variance among all linear unbiased estimators; under the Gauss-Markov assumptions, ordinary least squares yields the BLUE of the coefficients of a linear regression model.
Name: degree of freedom calculation
Definition: degree of freedom calculation is a data transformation which is part of a statistical test and which aims to determine or estimate the number of degrees of freedom in a system.
Name: log-rank test
Definition: The logrank test is a statistical hypothesis test used to compare the survival distributions of two or more groups. It is commonly employed in survival analysis, where the primary interest lies in comparing the survival experiences of different groups over time.
Name: Friedman test
Definition: The Friedman test is a non-parametric statistical test used to determine whether there are statistically significant differences among multiple related groups. It is an extension of the Wilcoxon signed-rank test for more than two related samples.
Name: sign test
Definition: The sign test is a non-parametric hypothesis test used to assess whether the median of a single population is equal to a specified value, typically referred to as the null hypothesis. The sign test is particularly useful when the data are not normally distributed or when the assumptions required for parametric tests are not met. Note that the 'sign test' is related to but different from the 'Wilcoxon signed rank test'. Sign test: it does not assume any specific distribution for the data; it only requires paired data and makes no assumptions about the shape of the underlying distribution. Wilcoxon signed-rank test: it assumes that the differences between paired observations come from a symmetric distribution; it is also more powerful than the sign test when the distribution is continuous and symmetric.
Name: calibration
Definition: calibration in statistics refers to the process of ensuring that the predicted probabilities or scores from a statistical model accurately reflect the true probabilities or outcomes observed in the data. It is an essential aspect of predictive modeling to ensure the reliability and interpretability of model predictions, where the goal is to estimate the likelihood of certain events or outcomes.
Name: Hosmer-Lemeshow goodness-of-fit test
Definition: The Hosmer-Lemeshow test is a statistical test used to assess the goodness of fit of a logistic regression model. It evaluates how well the predicted probabilities from the model match the observed outcomes in the data. The test helps determine whether the logistic regression model adequately captures the relationship between the predictor variables and the binary outcome variable. The test statistic follows a chi-square distribution with degrees of freedom equal to the number of groups minus the number of parameters estimated in the logistic regression model.
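A minimal sketch of the statistic as described above, grouping observations into g = 10 deciles of predicted probability (y_true and y_prob are assumed to be numpy arrays of observed outcomes and model-predicted probabilities):

    import numpy as np
    from scipy.stats import chi2

    def hosmer_lemeshow(y_true, y_prob, g=10):
        order = np.argsort(y_prob)
        groups = np.array_split(order, g)       # deciles of predicted risk
        h = 0.0
        for idx in groups:
            obs = y_true[idx].sum()             # observed events in the group
            exp = y_prob[idx].sum()             # expected events in the group
            n = len(idx)
            pi = exp / n
            h += (obs - exp) ** 2 / (n * pi * (1 - pi))
        return h, chi2.sf(h, g - 2)             # p-value with g - 2 degrees of freedom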