## Definitions

| Term | Definition |
|---|---|
| Treatment Group | In an experiment, the group that receives the manipulation or treatment being studied. |
| Control Group | In an experiment, the baseline group that does not receive the treatment but may receive a placebo or fake treatment. |
| Summary Statistic | A single number that summarizes a set of data. |
| Case | A single observation. |
| Variable | Something that can be measured, described, or manipulated. |
| Quantitative Variable | Also known as a numerical variable. A variable that can be measured on a numerical scale. It is either continuous or discrete. |
| Continuous Variable | A quantitative variable whose measurements can take on any value within a range. |
| Discrete Variable | A quantitative variable whose values can be counted, typically with whole numbers. |
| Qualitative Variable | Also known as a categorical variable. A variable whose values fall into a finite number of groups. It is either nominal or ordinal. |
| Nominal Variable | A qualitative variable whose categories have no natural order. |
| Ordinal Variable | A qualitative variable whose categories have a natural order. |
| Independent Variable | Usually denoted by "x". A variable whose value does not depend on the other variables in the study. |
| Dependent Variable | Usually denoted by "y". A variable whose value changes depending on the value of another variable. |
| Population | The entire group of interest. |
| Sample | A subset of the population. |
| Bias | An inclination toward a certain outcome. In statistics, a systematic tendency for a sample or method to misrepresent the population. |
| Simple Random Sample | Referred to as SRS. A sampling technique in which each case has the same probability of being chosen. |
| Non-response Bias | A type of bias that usually shows up in surveys. It occurs when a large share of the people sampled do not respond, so the respondents may not represent the population. |
| Convenience Sample | A sample made up of the cases that are easiest to reach. |
| Explanatory Variable | Also referred to as the predictor variable or independent variable. The variable thought to explain changes in the response variable. In an experiment it is manipulated by the researcher. |
| Response Variable | Also referred to as the dependent variable or outcome variable. The outcome that is measured in a study. |
| Observational Study | A study in which researchers record data without controlling or manipulating any variables. |
| Experiment | A study in which researchers control or manipulate one or more variables. |
| Placebo | A fake treatment or drug. |
| Confounding Variable | Also referred to as a lurking variable. A variable that is correlated with both the dependent and the independent variable. |
| Prospective Study | An observational study in which data are collected as events unfold. |
| Retrospective Study | An observational study in which data are collected after the events have taken place. |
| Stratified Sampling | A sampling technique in which the population is divided into groups called strata, and then a simple random sample (SRS) is taken within each stratum. |
| Strata | The groups into which a population is divided when using stratified sampling. |
| Cluster Sample | A sampling technique in which the population is divided into many groups (clusters) and a certain number of groups are randomly chosen. Everybody in the chosen groups is then sampled. |
| Multistage Sample | Similar to cluster sampling, but instead of sampling everybody in each chosen cluster, a random sample of people is taken within each cluster. |
| Blocks | In experimental design, groups of similar individuals formed to reduce variability. |
| Blind | An experiment is called blind when the individuals do not know which treatment they are receiving, but the researchers do. |
| Double Blind | An experiment is called double blind when neither the individuals nor the researchers know which treatment the individuals are receiving. |
| Scatterplot | A way to visualize the relationship between two quantitative variables, where the variables are plotted on an x-y graph. |
| Dot Plot | A way to visualize one quantitative variable on a number line. |
| Mean | Also referred to as the average or arithmetic mean. The sum of the values divided by the number of values. It is a way to measure the center of the data. |
| Distribution | The shape of the data, showing how often each value occurs. |
| Weighted Mean | A way to compute the average that gives more importance (weight) to certain values. |
| Histogram | A method to visualize the distribution of the data that uses rectangles to show the frequency of values. |
| Right Skewed | A way to describe a distribution with some extreme high values. The mean is typically higher than the median. In a histogram, there is a long right tail. |
| Left Skewed | A way to describe a distribution with some extreme low values. The mean is typically lower than the median. In a histogram, there is a long left tail. |
| Symmetrical | A way to describe a distribution in which the frequencies of values below the mean mirror the frequencies of values above the mean. |
| Mode | The value that appears most often in a data set. |
| Unimodal | A distribution with one peak or one unique mode. |
| Bimodal | A distribution with two peaks or two unique modes. |
| Multimodal | A distribution with more than two peaks or more than two modes. |
| Deviation | How far a value is from the mean. |
| Variance | A measurement of how spread out a data set is, expressed in squared units. |
| Standard Deviation | A measurement of the spread of a data set, equal to the square root of the variance. It is often preferred over the variance because it is reported in the original units of the data. |
| Box Plot | A method to visualize the five-number summary of a data set. |
| Median | Also called the second quartile. The midpoint of a data set; 50% of the data is below this value. |
| Interquartile Range | A measure of the spread of the middle 50% of a data set: the third quartile minus the first quartile. |
| First Quartile | The 25th percentile. 25% of the data is below this value. |
| Third Quartile | The 75th percentile. 75% of the data is below this value. |
| Outliers | Extreme values relative to the rest of the data. |
| Robust Estimates | Statistics that change very little in the presence of extreme values. |
| Contingency Table | A table that shows the frequencies for each combination of two categorical variables. |
| Frequency Table | A table that shows the frequency for one categorical variable. |
| Relative Frequency Table | Similar to a frequency table, but shows the percentage or proportion rather than the count. |
| Bar Plot | A method to visualize the frequency of one categorical variable. |
| Segmented Bar Plot | A method to visualize a contingency table. |
| Mosaic Plot | A method to visualize the frequency of either one or two categorical variables. |
| Pie Chart | A method to visualize the frequency of one categorical variable. |
| Law of Large Numbers | The sample mean tends to get closer to the population mean as the number of observations increases. |
| Mutually Exclusive | Also referred to as disjoint. Two outcomes are called disjoint when they cannot happen at the same time. |
| Venn Diagrams | A way to visualize which outcomes are shared or distinct among events, usually two or three. |
| Sample Space | The set of all possible outcomes of a random process. |
| Complement | The outcomes in the sample space that are not in the event of interest. |
| Marginal Probabilities | Probabilities based on a single variable, without regard to any other variable. |
| Joint Probabilities | Probabilities based on the outcomes of two or more variables together. |
| Conditional Probability | The probability of an outcome given that another event has occurred. |
| Tree Diagrams | A technique to visualize outcomes and their probabilities as a sequence of branches. |
| Random Variable | Usually represented by a capital letter. A random process with a numerical outcome. |
| Expected Value | The long-run average outcome of a random variable, computed by weighting each outcome by its probability. |
| Probability Density Function | A PDF describes the distribution of a continuous variable. The area under the curve over an interval gives the probability of falling in that interval. |
| Normal Distribution | A distribution that is symmetric, unimodal, and bell shaped. It is centered around the mean and tapers off on both ends. |
| Standard Normal Distribution | A normal distribution with a mean of 0 and a standard deviation of 1. |
| Parameters | Quantities that describe a population or distribution. |
| Z-Score | The number of standard deviations a value is above or below the mean. |
| Percentile | The percentage of data that falls below a given value. |
| Bernoulli Random Variable | A discrete random variable with exactly two possible outcomes, often labeled "success" and "failure". |
| Binomial Distribution | Describes the probability of having exactly k "successes" in n independent Bernoulli trials. |
| Negative Binomial Distribution | Describes the probability of observing the kth "success" on the nth trial. |
| Poisson Distribution | Describes the probability of a given number of events occurring over a fixed period of time. |
| Population Mean | A parameter that describes the average of the population of interest. |
| Sample Mean | A statistic that describes the average of the sample. |
| Point Estimate | A single value considered a "best guess" for an unknown population parameter. |
| Sampling Variation | The variation in estimates from one random sample to another. |
| Sampling Distribution | The distribution of a statistic over repeated random samples. |
| Standard Error | The standard deviation of the sampling distribution. |
| Confidence Interval | A plausible range of values for a population parameter. |
| Margin of Error | The amount added to and subtracted from the point estimate to form a confidence interval. |
| Null Hypothesis | The default claim, often a statement of no effect or no difference, that is assumed true until the data provide evidence against it. |
| Alternative Hypothesis | The claim being tested against the null hypothesis. |
| Type 1 Error | Also known as a "false positive". We reject the null hypothesis when the null hypothesis is true. |
| Type 2 Error | Also known as a "false negative". We fail to reject the null hypothesis when the alternative hypothesis is true. |
| Significance Level | The probability of a Type 1 Error. It is also the threshold that the p-value is compared against. |
| P-Value | The probability of observing data at least as extreme as the data actually observed, if the null hypothesis is true. A small p-value indicates strong evidence against the null hypothesis and in favor of the alternative. The null hypothesis is rejected when the p-value is below the significance level. |
| Central Limit Theorem | As the sample size increases, the sampling distribution of the mean becomes approximately normal. |
| Test Statistic | A summary statistic used to evaluate a hypothesis test and to calculate the p-value. |
| Statistically Significant | When the null hypothesis is rejected, results are deemed statistically significant, but they might not always be practically significant. |
| Degrees of Freedom | A parameter that determines the shape of the t-distribution. |
| Pooled Standard Deviation | A statistic calculated from two samples to better estimate the standard deviation. |
| Rejection Regions | Ranges of test statistic values for which the null hypothesis is rejected. |
| Power | The probability of rejecting the null hypothesis when it is false. |
| Effect Size | A quantity that measures the magnitude of a difference or relationship, used to judge what is practically significant. |
| Analysis of Variance | A test used to analyze whether the means of multiple groups are equal. |
| Mean Square Between Group | Also known as Mean Square Treatment. A measure of the variability between groups. |
| Mean Square Error | A measure of the variability within each group. |
| Bonferroni Correction | An adjustment to the significance level used when making multiple comparisons, such as comparing the means of several groups. It keeps the probability of a Type 1 Error from increasing. |
| Pooled Proportion | A statistic calculated from two samples to estimate a common proportion. |
| Predictor | Also known as the explanatory or independent variable. A variable used to estimate the outcome of another variable. |
| Residuals | The differences between the observed values and the values estimated by the model. |
| Correlation | The strength and direction of the linear relationship between two variables. It takes values between -1 and 1. |
| Extrapolation | Estimating values beyond the range of the observed data. |
| R-Squared | Measures how much of the variability in the response can be explained by the model. It is also used to evaluate how well a linear model fits. |
| Indicator Variable | A variable used in a linear model to include a categorical variable as a predictor. |
| High Leverage | Describes a data point whose predictor value is extreme relative to the other points. |
| Influential Point | A data point that has a major influence on the slope of the fitted line. |
| Collinear | Describes two predictor variables that are correlated with each other. |
| Adjusted R-Squared | An R-squared measurement that accounts for the number of predictor variables in the model. |
| Diagnostic Plots | Various graphs used to check the assumptions required for regression. |
| Logistic Regression | A way to model the response of a categorical variable with two categories. |
| Logit Transformation | A function that maps a probability p to the log of its odds, log(p / (1 - p)). Its inverse maps log-odds back to a probability. |
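
Several of the descriptive terms defined above (mean, deviation, variance, standard deviation, z-score, quartiles, interquartile range, weighted mean) can be computed in a few lines. The sketch below is only an illustration: the data values, scores, and weights are made up, the variance uses the population form (dividing by n), and the quartiles use the "inclusive" convention of Python's `statistics.quantiles`, which is one of several conventions in use.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values, already sorted for readability

# Mean: sum of the values divided by the number of values.
mean = sum(data) / len(data)

# Deviation: how far each value is from the mean.
deviations = [x - mean for x in data]

# Variance (population form, dividing by n); the sample form divides by n - 1.
variance = sum(d ** 2 for d in deviations) / len(data)

# Standard deviation: square root of the variance, in the data's original units.
std_dev = math.sqrt(variance)

# Z-score: number of standard deviations the value 9 lies from the mean.
z_score_of_9 = (9 - mean) / std_dev

# First quartile, median, third quartile, and the interquartile range.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Weighted mean: each value counts in proportion to its weight.
scores = [80, 90]    # hypothetical scores
weights = [1, 3]     # hypothetical weights
weighted_mean = sum(s * w for s, w in zip(scores, weights)) / sum(weights)

print(mean, variance, std_dev, z_score_of_9)
print(q1, median, q3, iqr)
print(weighted_mean)
```

With these values the mean is 5, the variance is 4, the standard deviation is 2, and the value 9 has a z-score of 2. The weighted mean of 87.5 sits closer to 90 because that score carries three times the weight.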
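
The binomial, negative binomial, and Poisson definitions above each correspond to a standard probability formula. The following is a minimal sketch of those formulas using only the standard library. The function names and the rate parameter `lam` are chosen here for illustration and do not come from the glossary.

```python
import math


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent Bernoulli trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def negative_binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability that the kth success occurs on the nth trial.

    The first n - 1 trials must contain exactly k - 1 successes,
    and the nth trial must itself be a success.
    """
    return math.comb(n - 1, k - 1) * p ** k * (1 - p) ** (n - k)


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of k events in a fixed period, given an average rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


# Example: two heads in four fair coin flips.
print(binomial_pmf(2, 4, 0.5))
# Example: the second head arrives on the third flip.
print(negative_binomial_pmf(2, 3, 0.5))
# Example: no events in a period that averages two events.
print(poisson_pmf(0, 2.0))
```

Summing `binomial_pmf(k, n, p)` over every k from 0 to n gives 1, which is a quick sanity check that the formula describes a full probability distribution.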
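
The terms point estimate, standard error, margin of error, and confidence interval fit together in one short calculation. The sketch below assumes a 95% confidence level and a critical value taken from the standard normal distribution; with a sample this small, a t-distribution critical value would usually be used instead. The measurements are made up.

```python
import math
from statistics import NormalDist, mean, stdev

sample = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]  # made-up measurements

# Point estimate: the sample mean is the "best guess" for the population mean.
point_estimate = mean(sample)

# Standard error of the mean: sample standard deviation divided by sqrt(n).
standard_error = stdev(sample) / math.sqrt(len(sample))

# Critical value for 95% confidence (assumption: normal, not t, critical value).
z_star = NormalDist().inv_cdf(0.975)

# Margin of error: the amount added to and subtracted from the point estimate.
margin_of_error = z_star * standard_error

# Confidence interval: a plausible range of values for the population mean.
confidence_interval = (point_estimate - margin_of_error,
                       point_estimate + margin_of_error)

print(point_estimate, standard_error, margin_of_error)
print(confidence_interval)
```

A larger sample shrinks the standard error, which in turn narrows the interval around the same point estimate.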