Treatment Group | In an experiment, this is the group that receives the manipulation or treatment being studied. |
Control Group | In an experiment, this group serves as the baseline: it does not receive the treatment, but may receive a placebo or fake treatment. |
Summary Statistic | A number that summarizes the data. |
Case | A single observation. |
Variable | Something that can be measured, described, or manipulated. |
Quantitative Variable | Also known as a numerical variable. It is a variable that can be measured on a numerical scale. The variable is either continuous or discrete. |
Continuous Variable | A quantitative variable where the measurements can take on any value. |
Discrete Variable | A quantitative variable that can be counted with whole numbers. |
Qualitative Variable | Also known as a categorical variable. It is a variable that takes a finite set of values, each of which falls into a particular group or category. The variable is either nominal or ordinal. |
Nominal Variable | A qualitative variable whose categories have no natural ordering. |
Ordinal Variable | A qualitative variable whose categories have a natural ordering. |
Independent Variable | Usually denoted by "x". It is a variable that does not change based on other variables. |
Dependent Variable | Usually denoted by "y". It is a variable that does change depending on the value of another variable. |
Population | The entire group of interest. |
Sample | A subset of the population or part of the group of interest. |
Bias | A systematic inclination or tendency that favors certain outcomes or cases, causing results to misrepresent the group of interest. |
Simple Random Sample | Referred to as an SRS, it is a sampling technique where each case has the same probability of being chosen. |
Non-response Bias | A type of bias that usually shows up in surveys. It occurs when a large proportion of those sampled do not respond, and the respondents may not be representative. |
Convenience Sample | A sample that is taken with cases that are easier to reach. |
Explanatory Variable | Also referred to as the predictor variable or independent variable. This variable is manipulated by the researcher and tries to explain the response variable. |
Response Variable | Also referred to as the dependent variable or outcome variable. This variable is the outcome that is being measured in an experiment. |
Observational Study | A study where nothing is being controlled or changed. |
Experiment | A study where variables are being controlled or manipulated. |
Placebo | A fake treatment or drug. |
Confounding Variable | Also referred to as a lurking variable. It is a variable that correlates with the dependent and independent variable. |
Prospective Study | An observational study where data is taken as time goes on. |
Retrospective Study | An observational study where data is taken after events have taken place. |
Stratified Sampling | A sampling technique where the population is divided into groups called strata and then SRS is used. |
Strata | Divided groups of a population that are formed when using stratified sampling. |
Cluster Sample | A sampling technique where the population is divided into many groups and a certain number of groups are randomly chosen. Then everyone in the chosen groups is sampled. |
Multistage Sample | Similar to cluster sampling, but instead of sampling everyone in each chosen cluster, we randomly sample a certain number of people within it. |
Blocks | In experimental design, similar individuals are put into groups called blocks to reduce variability. |
Blind | An experiment is called blind when the individuals do not know which treatment they are receiving, but the researchers do. |
Double Blind | An experiment is called double blind when neither the individuals nor the researchers know what treatment the individuals are taking. |
Scatterplot | A way to visualize the relationship between two quantitative variables, where the variables are plotted on an x-y graph. |
Dot Plot | A way to visualize one quantitative variable on a number line. |
Mean | Also referred to as the average or arithmetic mean. It is a way to measure the center of your data. |
Distribution | The shape of the data that shows how often values occur. |
Weighted Mean | A different way to compute the average that gives more importance to certain values. |
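As a reference for the mean and weighted mean entries above, one standard way to write them, using $x_1,\dots,x_n$ for the data values and $w_1,\dots,w_n$ for the weights (notation introduced here, not taken from the glossary):
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$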
Histogram | A method to visualize the distribution of your data that uses rectangles to show the frequency of values. |
Right Skewed | A way to describe a distribution where there are some extreme high values. The mean will be higher than the median. In a histogram, there will be a long right tail. |
Left Skewed | A way to describe a distribution where there are some extreme low values. The mean will be lower than the median. In a histogram, there will be a long left tail. |
Symmetrical | A way to describe a distribution where the frequency of values below the mean mirrors the frequency of values above the mean. |
Mode | The value that appears most often in a data set. |
Unimodal | A distribution with one peak or one unique mode. |
Bimodal | A distribution with two peaks or two unique modes. |
Multimodal | A distribution with more than two peaks or more than two modes. |
Deviation | How far away a value is from the mean. |
Variance | A measurement of how spread out a data set is, expressed in squared units. |
Standard Deviation | A measurement of the spread of a data set. It is often chosen over the variance because of its mathematical properties and it is reported with the original units of the data. |
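For reference, the usual sample versions of the variance and standard deviation, with $\bar{x}$ the sample mean and $n$ the sample size (a sketch in my own notation):
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2, \qquad s = \sqrt{s^2}$$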
Box Plot | A method to visualize the five number summary of a data set. |
Median | Also called the second quartile. The midpoint of a data set, where 50% of the data is below this value. |
Interquartile Range | A measure of the spread of the middle 50% of a data set. |
First Quartile | The 25th percentile. 25% of the data is below this value. |
Third Quartile | The 75th percentile. 75% of the data is below this value. |
Outliers | Values that are extreme relative to the rest of the data. |
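One common way to connect the quartile, IQR, and outlier entries above: the interquartile range is the distance between the quartiles, and a frequently used (but not universal) rule flags values more than 1.5 IQRs beyond the quartiles as potential outliers:
$$\mathrm{IQR} = Q_3 - Q_1, \qquad \text{potential outlier if } x < Q_1 - 1.5\,\mathrm{IQR} \ \text{ or } \ x > Q_3 + 1.5\,\mathrm{IQR}$$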
Robust Estimates | Statistics that change very little in the presence of high variability or extreme values. |
Contingency Table | A table that shows the frequency between two categorical variables. |
Frequency Table | A table that shows the frequency for one categorical variable. |
Relative Frequency Table | Similar to a frequency or contingency table, but instead shows the percentage or proportion rather than the count. |
Bar Plot | A method to visualize the frequency of one categorical variable. |
Segmented Bar Plot | A method to visualize a contingency table. |
Mosaic Plot | A method to visualize the frequency of either one or two categorical variables. |
Pie Chart | A method to visualize the frequency of one categorical variable. |
Law of Large Numbers | The sample mean will approach the population mean as the number of observations increases. |
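Stated compactly, with $\bar{x}_n$ the sample mean of $n$ observations and $\mu$ the population mean (symbols introduced here):
$$\bar{x}_n \to \mu \quad \text{as } n \to \infty$$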
Mutually Exclusive | Also referred to as disjoint. Two outcomes are called disjoint when they cannot happen at the same time. |
Venn Diagrams | A way to visualize which outcomes are shared and which are distinct among events, usually two or three. |
Sample Space | The set of all possible outcomes of a random process. |
Complement | The set of outcomes that are not in the event of interest. |
Marginal Probabilities | The probability of an event based on a single variable. |
Joint Probabilities | The probability of an event based on two or more variables. |
Conditional Probability | The probability of an outcome based on the outcome of another event. |
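In symbols, for events $A$ and $B$ with $P(B) > 0$ (a standard form, not taken from the glossary itself):
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$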
Tree Diagrams | A technique to visualize the outcomes of events and their probabilities. |
Random Variable | Usually represented as a capital letter, it is a random process with a numerical outcome. |
Expected Value | The long-run average value of a random variable. |
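For a discrete random variable $X$, one standard way to write the expected value (notation mine):
$$E[X] = \sum_{x} x \, P(X = x)$$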
Probability Density Function | A PDF describes the possible outcomes and their relative likelihood for a continuous random variable. |
Normal Distribution | A distribution that is symmetric, unimodal, and bell shaped. It is centered around the mean and tapers off on both ends. |
Standard Normal Distribution | A normal distribution with a mean of 0 and a standard deviation of 1. |
Parameters | A quantity that describes the population or distribution. |
Z-Score | The number of standard deviations a value is from the mean. |
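In symbols, with $\mu$ the mean and $\sigma$ the standard deviation (notation introduced here):
$$z = \frac{x - \mu}{\sigma}$$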
Percentile | The percentage of data that is below some value. |
Bernoulli Random Variable | A discrete random variable that has only two outcomes. |
Binomial Distribution | Describes the probability of having exactly k "successes" in n independent Bernoulli trials. |
Negative Binomial Distribution | Describes the probability of observing the kth "success" on the nth trial. |
Poisson Distribution | Describes the probability of a given number of events occurring over a fixed period of time. |
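For reference, the usual probability mass functions for the three distributions above, with $p$ the success probability, $n$ the number of trials, $k$ the number of successes, and $\lambda$ the average rate (symbols introduced here, not in the glossary):
$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k} \quad \text{(binomial)}$$
$$P(\text{$k$th success on trial } n) = \binom{n-1}{k-1} p^k (1-p)^{n-k} \quad \text{(negative binomial)}$$
$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!} \quad \text{(Poisson)}$$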
Population Mean | A parameter that describes the average of the population of interest. |
Sample Mean | A statistic that describes the average of the sample. |
Point Estimate | Considered a "best guess" to estimate an unknown population parameter. |
Sampling Variation | The variation in estimates that arises from multiple random samples. |
Sampling Distribution | The distribution of a statistic over repeated random samples. |
Standard Error | The standard deviation of the sampling distribution. |
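For the sample mean, the standard error is commonly estimated as follows, with $s$ the sample standard deviation and $n$ the sample size (notation mine):
$$SE_{\bar{x}} = \frac{s}{\sqrt{n}}$$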
Confidence Interval | A plausible range of values for a population parameter. |
Margin of Error | The amount added to and subtracted from a point estimate to form a confidence interval. |
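A common template connecting the two entries above, using $z^\star$ (or $t^\star$) for the critical value that matches the chosen confidence level (symbols introduced here):
$$\text{confidence interval} = \text{point estimate} \pm z^\star \times SE, \qquad \text{margin of error} = z^\star \times SE$$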
Null Hypothesis | The default claim, often a skeptical position of "no effect" or "no difference". |
Alternative Hypothesis | The claim to be tested. |
Type 1 Error | Also known as a "false positive". We reject the null hypothesis when the null hypothesis is true. |
Type 2 Error | Also known as a "false negative". We fail to reject the null hypothesis when the alternative hypothesis is true. |
Significance Level | The probability of a Type 1 error. It is also the value that is compared with the P-value. |
P-Value | The probability of observing data as extreme as, or more extreme than, what was observed, if the null hypothesis is true. A small p-value indicates strong evidence against the null hypothesis, which means rejecting it in favor of the alternative. |
Central Limit Theorem | As the sample size increases, the sampling distribution of the mean becomes approximately normal. |
Test Statistic | A summary statistic used to evaluate a hypothesis test and to calculate the p-value. |
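As one concrete example (not the only form a test statistic can take), the one-sample t statistic for a hypothesized mean $\mu_0$ can be written as:
$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$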
Statistically Significant | When the null hypothesis is rejected, results are deemed statistically significant, but they might not always be practically significant. |
Degrees of Freedom | Describes the shape of the t-distribution. |
Pooled Standard Deviation | A statistic calculated from two samples to better estimate a shared standard deviation. |
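A standard formula, with $s_1, s_2$ the sample standard deviations and $n_1, n_2$ the sample sizes (notation mine):
$$s_{pooled} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$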
Rejection Regions | A range of values that determine when to reject the null hypothesis, given the test statistic. |
Power | The probability of rejecting the null hypothesis when it is false. |
Effect Size | A quantity used to determine what is practically significant. |
Analysis of Variance | A test that is used to analyze whether the means of multiple groups are equal. |
Mean Square Between Groups | Also known as Mean Square Treatment. It is the amount of variability between groups. |
Mean Square Error | The amount of variability within each group. |
Bonferroni Correction | A technique used when comparing the means of multiple groups. It keeps the overall probability of a Type 1 error from increasing as more comparisons are made. |
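For reference on the ANOVA-related entries above: the F test statistic compares the two mean squares, and the Bonferroni correction divides the significance level by the number of comparisons $K$ (symbols introduced here):
$$F = \frac{MSG}{MSE}, \qquad \alpha^\star = \frac{\alpha}{K}$$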
Pooled Proportion | A statistic calculated from two samples to estimate a shared proportion. |
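One common way to write it, with $\hat{p}_1, \hat{p}_2$ the sample proportions from samples of size $n_1, n_2$ (notation mine):
$$\hat{p}_{pooled} = \frac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2}$$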
Predictor | Also known as the explanatory or independent variable. It is a variable used to estimate the outcome of another variable. |
Residuals | The difference between the observed value and the estimated value. |
Correlation | The strength of a linear relationship between two variables. Measured between -1 and 1. |
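One standard way to write the (Pearson) correlation for paired data $(x_i, y_i)$, with $s_x, s_y$ the sample standard deviations (notation introduced here):
$$r = \frac{1}{n-1} \sum_{i=1}^{n} \frac{(x_i - \bar{x})(y_i - \bar{y})}{s_x \, s_y}$$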
Extrapolation | Estimating values that are beyond the range of the observed data. |
R-Squared | Measures how much of the variability in the response can be explained by the model. It is also used to evaluate how well a linear model fits. |
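In terms of sums of squares, with $\hat{y}_i$ the fitted values (a sketch, not the only equivalent form):
$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$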
Indicator Variable | A variable used in a linear model to include a categorical variable as a predictor. |
High Leverage | A data point that is extreme within the predictor variable. |
Influential Point | A data point that has a major influence on the slope of the fitted line. |
Collinear | Two predictor variables that are correlated. |
Adjusted R-Squared | An R-squared measurement that adjusts for the number of predictor variables in the model. |
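A common form, with $n$ observations and $k$ predictors (symbols introduced here):
$$R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$$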
Diagnostic Plots | Various graphs to analyze the assumptions required for regression. |
Logistic Regression | A way to model and predict a categorical response variable with two categories. |
Logit Transformation | A function that maps a probability to the log-odds scale; its inverse maps any value back to a probability. |
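In symbols, the logit of a probability $p$ and its inverse (the logistic function), which maps any real value back to a probability (notation mine):
$$\mathrm{logit}(p) = \ln\frac{p}{1 - p}, \qquad p = \frac{e^{x}}{1 + e^{x}}$$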