| Treatment Group | In an experiment, this group receives the treatment or manipulation under study. |
| Control Group | In an experiment, this group serves as the baseline; its members do not receive the treatment, though they may receive a placebo or sham treatment. |
| Summary Statistic | A number that summarizes the data. |
| Case | A single observation. |
| Variable | Something that can be measured, described, or manipulated. |
| Quantitative Variable | Also known as a numerical variable. It is a variable that can be measured on a numerical scale. The variable is either continuous or discrete. |
| Continuous Variable | A quantitative variable where the measurements can take on any value. |
| Discrete Variable | A quantitative variable that can be counted with whole numbers. |
| Qualitative Variable | Also known as a categorical variable. It is a variable whose values fall into a finite set of categories or groups. The variable is either nominal or ordinal. |
| Nominal Variable | A qualitative variable where its categories do not have a certain order to them. |
| Ordinal Variable | A qualitative variable where its categories have a special ordering. |
| Independent Variable | Usually denoted by "x". It is a variable that does not change based on other variables. |
| Dependent Variable | Usually denoted by "y". It is a variable that does change depending on the value of another variable. |
| Population | The entire group of interest. |
| Sample | A subset of the population or part of the group of interest. |
| Bias | A systematic tendency for a sample or estimate to deviate from the true value in a particular direction. |
| Simple Random Sample | Referred to as an SRS, it is a sampling technique where each case has the same probability of being chosen. |
| Non-response Bias | A type of bias that usually shows up in surveys. It occurs when a large proportion of sampled individuals do not respond and the respondents differ from the nonrespondents. |
| Convenience Sample | A sample that is taken with cases that are easier to reach. |
| Explanatory Variable | Also referred to as the predictor variable or independent variable. It is used to explain or predict the response variable; in an experiment, it is the variable the researcher manipulates. |
| Response Variable | Also referred to as the dependent variable or outcome variable. This variable is the outcome that is being measured in an experiment. |
| Observational Study | A study where the researcher observes and records data without controlling or manipulating any variables. |
| Experiment | A study where variables are being controlled or manipulated. |
| Placebo | A fake treatment or drug. |
| Confounding Variable | Also referred to as a lurking variable. It is a variable that is associated with both the explanatory and response variables, which can distort their apparent relationship. |
| Prospective Study | An observational study where data is collected going forward in time, as events unfold. |
| Retrospective Study | An observational study where data is taken after events have taken place. |
| Stratified Sampling | A sampling technique where the population is divided into groups called strata and then SRS is used. |
| Strata | Divided groups of a population that are formed when using stratified sampling. |
| Cluster Sample | A sampling technique where the population is divided into many groups and a certain number of groups are randomly chosen. Then everybody in those groups is sampled. |
| Multistage Sample | Similar to cluster sampling, but instead of sampling everybody in the cluster, we randomly sample a certain number of people. |
| Blocks | In experimental design, similar individuals are put into groups called blocks to reduce variability. |
| Blind | An experiment is called blind when the individuals do not know which treatment they are receiving, but the researchers do. |
| Double Blind | An experiment is called double blind when neither the individuals nor the researchers know what treatment the individuals are taking. |
| Scatterplot | A way to visualize the relationship between two quantitative variables, where the variables are plotted as points on an x-y graph. |
| Dot Plot | A way to visualize one quantitative variable on a number line. |
| Mean | Also referred to as the average or arithmetic mean. It is a way to measure the center of your data. |
| Distribution | The shape of the data that shows how often values occur. |
| Weighted Mean | A different way to compute the average that gives more importance to certain values. |
| Histogram | A method to visualize the distribution of your data that uses rectangles to show the frequency of values. |
| Right Skewed | A way to describe a distribution where there are some extreme high values. The mean will be higher than the median. In a histogram, there will be a long right tail. |
| Left Skewed | A way to describe a distribution where there are some extreme low values. The mean will be lower than the median. In a histogram, there will be a long left tail. |
| Symmetrical | A way to describe a distribution where the frequency of values lower than the mean will mirror the frequency of higher values than the mean. |
| Mode | The value that appears most often in a data set. |
| Unimodal | A distribution with one peak or one unique mode. |
| Bimodal | A distribution with two peaks or two unique modes. |
| Multimodal | A distribution with more than two peaks or more than two modes. |
| Deviation | How far away a value is from the mean. |
| Variance | A measurement of how spread out a data set is, reported in squared units. |
| Standard Deviation | A measurement of the spread of a data set. It is often chosen over the variance because of its mathematical properties and it is reported with the original units of the data. |
| Box Plot | A method to visualize the five number summary of a data set. |
| Median | Also called the second quartile. The midpoint of a data set where 50% of the data is below this value. |
| Interquartile Range | A measure of the spread of the middle 50% of a data set. |
| First Quartile | The 25th percentile. 25% of the data is below this value. |
| Third Quartile | The 75th percentile. 75% of the data is below this value. |
| Outliers | Extreme values relative to the rest of the data. |
| Robust Estimates | Statistics that change very little in the presence of high variability or extreme values. |
| Contingency Table | A table that shows the frequency between two categorical variables. |
| Frequency Table | A table that shows the frequency for one categorical variable. |
| Relative Frequency Table | Similar to a frequency or contingency table, but shows the percentage or proportion rather than the count. |
| Bar Plot | A method to visualize the frequency of one categorical variable. |
| Segmented Bar Plot | A method to visualize a contingency table. |
| Mosaic Plot | A method to visualize the frequency of either one or two categorical variables. |
| Pie Chart | A method to visualize the frequency of one categorical variable. |
| Law of Large Numbers | The sample mean will converge to the population mean as the number of observations increases. |
| Mutually Exclusive | Also referred to as disjoint. Two outcomes are called disjoint when they cannot happen at the same time. |
| Venn Diagrams | A way to visualize what outcomes are different or similar between usually two or three events. |
| Sample Space | The set of all possible outcomes of a random process. |
| Complement | The set of outcomes that are not in the event of interest. |
| Marginal Probabilities | The probability of an event based on a single variable. |
| Joint Probabilities | The probability of an event based on two or more variables. |
| Conditional Probability | The probability of an outcome based on the outcome of another event. |
| Tree Diagrams | A technique to visualize the outcomes of an event and their probabilities. |
| Random Variable | Usually represented as a capital letter, it is a random process with a numerical outcome. |
| Expected Value | The long-run average value of a random variable. |
| Probability Density Function | A PDF describes the relative likelihood of the outcomes of a continuous random variable; probabilities correspond to areas under the curve. |
| Normal Distribution | A distribution that is symmetric, unimodal, and bell shaped. It is centered around the mean and tapers off on both ends. |
| Standard Normal Distribution | A normal distribution with a mean of 0 and a standard deviation of 1. |
| Parameters | A quantity that describes the population or distribution. |
| Z-Score | The number of standard deviations a value is from the mean. |
| Percentile | The percentage of data that is below some value. |
| Bernoulli Random Variable | A discrete random variable that has only two outcomes, often labeled "success" and "failure". |
| Binomial Distribution | Describes the probability of having exactly k "successes" in n independent Bernoulli trials. |
| Negative Binomial Distribution | Describes the probability of observing the kth "success" on the nth trial. |
| Poisson Distribution | Describes the probability of a given number of events occurring over a fixed period of time. |
| Population Mean | A parameter that describes the average of the population of interest. |
| Sample Mean | A statistic that describes the average of the sample. |
| Point Estimate | Considered a "best guess" to estimate an unknown population parameter. |
| Sampling Variation | The variation in estimates that arises from multiple random samples. |
| Sampling Distribution | The distribution from the repeated random sampling of a statistic. |
| Standard Error | The standard deviation of the sampling distribution. |
| Confidence Interval | A plausible range of values for a population parameter. |
| Margin of Error | Half the width of a confidence interval; the maximum expected difference between the point estimate and the population parameter. |
| Null Hypothesis | The default claim of no effect or no difference; the commonly accepted position. |
| Alternative Hypothesis | The claim to be tested. |
| Type 1 Error | Also known as a "false positive". We reject the null hypothesis when the null hypothesis is true. |
| Type 2 Error | Also known as a "false negative". We fail to reject the null hypothesis when the alternative hypothesis is true. |
| Significance Level | The probability of a Type 1 error. It is also the value that is compared with the p-value. |
| P-Value | The probability of observing data as extreme as, or more extreme than, the observed data, assuming the null hypothesis is true. A small p-value indicates strong evidence for the alternative hypothesis, which means rejecting the null hypothesis. |
| Central Limit Theorem | As the sample size increases, the sampling distribution of the mean approaches a normal distribution, regardless of the population's distribution. |
| Test Statistic | A summary statistic used to evaluate a hypothesis test and to calculate the p-value. |
| Statistically Significant | When the null hypothesis is rejected, results are deemed statistically significant, but they might not always be practically significant. |
| Degrees of Freedom | A parameter that describes the shape of the t-distribution; loosely, the number of values that are free to vary. |
| Pooled Standard Deviation | A statistic calculated from two samples to better estimate a common standard deviation. |
| Rejection Regions | The range of values of the test statistic for which the null hypothesis is rejected. |
| Power | The probability of rejecting the null hypothesis when it is false. |
| Effect Size | A quantity used to determine what is practically significant. |
| Analysis of Variance | A test used to analyze whether the means of multiple groups are equal. |
| Mean Square Between Groups | Also known as Mean Square Treatment. It is the amount of variability between groups. |
| Mean Square Error | The amount of variability within each group. |
| Bonferroni Correction | A technique used when comparing the means of multiple groups. It prevents the probability of a Type 1 error from increasing as more comparisons are made. |
| Pooled Proportion | A statistic calculated from two samples to estimate a common proportion. |
| Predictor | Also known as the explanatory or independent variable. It is a variable used to estimate the outcome of another variable. |
| Residuals | The difference between the observed value and the estimated value. |
| Correlation | The strength of a linear relationship between two variables. Measured between -1 and 1. |
| Extrapolation | Estimating values that are beyond the range of the observed data. |
| R-Squared | Measures how much of the variability in the response can be explained by the model. It is also used to evaluate how well a linear model fits. |
| Indicator Variable | A variable used in a linear model to include a categorical variable as a predictor. |
| High Leverage | A data point that is extreme within the predictor variable. |
| Influential Point | A data point that has a major influence on the slope of the fitted line. |
| Collinear | Two predictor variables that are correlated. |
| Adjusted R-Squared | An R-Squared measurement that adjusts for the number of predictor variables in the model. |
| Diagnostic Plots | Various graphs to analyze the assumptions required for regression. |
| Logistic Regression | A way to predict the response of a categorical variable with two categories. |
| Logit Transformation | A function that maps a probability to its log odds, linking probabilities to the linear predictor in logistic regression. |
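
Several of the numerical summaries defined above (mean, median, variance, standard deviation, quartiles, interquartile range, and z-score) can be illustrated with a short sketch using Python's standard `statistics` module. The data values below are invented purely for illustration:

```python
import statistics

# Hypothetical sample of ten exam scores (invented data for illustration)
data = [62, 70, 71, 74, 75, 78, 80, 83, 85, 98]

mean = statistics.mean(data)          # measure of center (the average)
median = statistics.median(data)      # second quartile: 50% of the data lies below it
variance = statistics.variance(data)  # spread, in squared units
stdev = statistics.stdev(data)        # spread, in the original units (sqrt of variance)

# Quartiles: cut points that split the sorted data into four parts
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                         # spread of the middle 50% of the data

# Z-score: how many standard deviations a value sits from the mean
z = (98 - mean) / stdev

print(f"mean={mean}, median={median}, sd={stdev:.2f}, IQR={iqr:.1f}")
print(f"z-score of 98: {z:.2f}")
```

Note that the mean (77.6) sits above the median (76.5) here, a small example of the right-skew pattern described in the table: the extreme high value 98 pulls the mean upward.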