Appendix A: Formula Sheet

Important Notations

  • Sample size: [latex]N[/latex] (population); [latex]n[/latex] (sample)
  • Mean: [latex]\mu[/latex] (population); [latex]\bar{x}[/latex] (sample)
  • Standard deviation: [latex]\sigma[/latex] (population); [latex]s[/latex] (sample)
  • Proportion: [latex]p[/latex] (population); [latex]\hat{p}[/latex] (sample)
  • Slope: [latex]\beta_1[/latex] (population); [latex]b_1[/latex] (sample)

Descriptive Measures

  • Five-number summary: minimum, [latex]Q_1[/latex], [latex]Q_2[/latex], [latex]Q_3[/latex], and maximum
  • Outliers: [latex]\text{lowerlimit}=Q_1-1.5 \times IQR;[/latex]  [latex]\text{upperlimit}=Q_3+1.5 \times IQR;[/latex] [latex]IQR=Q_3-Q_1[/latex]
  • Sample mean: [latex]\frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum x_i}{n}[/latex]
  • Sample standard deviation: [latex]s = \sqrt{ \frac{\sum (x_i - \bar{x})^2 }{n-1} } = \sqrt{ \frac{ \left( \sum x_i^2 \right) - \frac{ \left( \sum x_i \right)^2 }{n} }{n-1} }[/latex]
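As a quick numeric check, the descriptive measures above can be computed with Python's standard library. The data below are hypothetical, and note that `statistics.quantiles` uses the "exclusive" quartile convention, so hand computations with another convention may differ slightly:

```python
import statistics

# Hypothetical sample used only for illustration.
data = [4, 7, 8, 2, 9, 5, 7, 3, 6, 10]
n = len(data)

# Sample mean: sum of observations divided by n.
mean = sum(data) / n

# Sample standard deviation (n - 1 in the denominator).
s = (sum((x - mean) ** 2 for x in data) / (n - 1)) ** 0.5

# Quartiles and the outlier limits from the IQR rule.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
```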

Probability Concepts

  • Equally likely outcome model: the probability of event [latex]E[/latex] is

    [latex]P(E)= \frac{\text{# of sample points in event E}}{\text{# of sample points in sample space S} } = \frac{\text{# of ways event E can occur}}{\text{# of possible outcomes}}= \frac{f}{N}[/latex]

  • Complement rule: [latex]P(not \: E)=1-P(E)[/latex]
  • Special addition rule: [latex]P(A \: or \: B)=P(A)+P(B)[/latex] if events [latex]A[/latex] and [latex]B[/latex] are mutually exclusive. More generally, if events [latex]A,B,C, \dots[/latex] are mutually exclusive, then [latex]P(A \: or \: B \: or \: C \: or \: \dots)=P(A)+P(B)+P(C) + \dots[/latex]
  • General addition rule: [latex]P(A \:or \: B)=P(A)+P(B)-P(A \: \& \: B)[/latex]
  • Conditional probability of [latex]A[/latex] given [latex]B[/latex]: [latex]P(A|B)=\frac{P(A \: \& \: B)}{P(B)}[/latex] for [latex]P(B) \: \gt \: 0[/latex]
  • General multiplication rule: [latex]P(A \: \& \:B)=P(B)P(A|B)=P(A)P(B|A)[/latex]
  • Special multiplication rule: [latex]P(A \: \& \: B)=P(A)P(B)[/latex] if events [latex]A[/latex] and [latex]B[/latex] are independent. More generally, if events [latex]A,B,C, \dots[/latex] are independent, [latex]P(A \: \& \: B \: \& \: C \: \& \: \dots) =P(A) \times P(B) \times P(C) \times \dots[/latex]
  • Two events [latex]A[/latex] and [latex]B[/latex] are independent if ANY of the following is true:

    [latex]P(A|B)=P(A)[/latex] OR [latex]P(B|A)=P(B)[/latex] OR [latex]P(A \: \& \: B)=P(A) \times P(B)[/latex]

  • Permutation: [latex]nP_r = \frac{n!}{(n-r)!}[/latex]
  • Combination: [latex]nC_r = \frac{n!}{r!(n-r)!}[/latex]
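The counting formulas are available directly in Python's standard library; the values of [latex]n[/latex] and [latex]r[/latex] below are purely illustrative:

```python
import math

# Counting rules with illustrative values n = 10, r = 3.
n, r = 10, 3
nPr = math.perm(n, r)   # ordered arrangements: 10!/(10-3)! = 720
nCr = math.comb(n, r)   # unordered selections: 10!/(3!(10-3)!) = 120

# Complement rule for a hypothetical event with P(E) = 0.3.
p_not_e = 1 - 0.3
```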

Discrete Random Variables

  • The mean (expected value) of a discrete random variable [latex]X: \mu= \sum xP(X=x)[/latex]
  • Standard deviation of [latex]X: \sigma= \sqrt{\sum (x- \mu)^2 P(X=x)} = \sqrt{\sum x^2 P(X=x)-\mu^2}[/latex]
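Both forms of the standard deviation formula give the same answer, which a short sketch with a hypothetical distribution can confirm:

```python
# Hypothetical probability distribution of a discrete random variable X.
values = [0, 1, 2, 3]
probs = [0.1, 0.2, 0.3, 0.4]   # probabilities must sum to 1

# Mean (expected value): sum of x * P(X = x).
mu = sum(x * p for x, p in zip(values, probs))

# Standard deviation, by the definition and by the computational shortcut.
sigma_def = sum((x - mu) ** 2 * p for x, p in zip(values, probs)) ** 0.5
sigma_short = (sum(x ** 2 * p for x, p in zip(values, probs)) - mu ** 2) ** 0.5
```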

Binomial Distribution

Among [latex]n[/latex] independent Bernoulli trials with probability of success [latex]p[/latex], let [latex]X[/latex] be the number of successes. The probability of observing [latex]x[/latex] successes is:

[latex]P(X=x)={n \choose x} p^x (1-p)^{n-x}=nC_x p^x(1-p)^{n-x}, x=0,1, \dots , n[/latex]

The mean and standard deviation of a Binomial distribution are [latex]\mu = np[/latex], [latex]\sigma = \sqrt{np(1-p)}[/latex], respectively.
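A minimal sketch of the Binomial pmf, using illustrative parameters, confirms that the probabilities sum to 1 and that the shortcut mean [latex]np[/latex] agrees with the expected value computed from the pmf:

```python
import math

# Illustrative Binomial parameters: n = 10 trials, success probability p = 0.3.
n, p = 10, 0.3

def binom_pmf(x, n, p):
    """P(X = x) = nCx * p^x * (1 - p)^(n - x)."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

total = sum(binom_pmf(x, n, p) for x in range(n + 1))   # pmf sums to 1
mean = n * p                                            # mu = np
sd = math.sqrt(n * p * (1 - p))                         # sigma = sqrt(np(1-p))
# The shortcut mean agrees with the expected value computed from the pmf.
mean_from_pmf = sum(x * binom_pmf(x, n, p) for x in range(n + 1))
```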

Normal Distribution

  • If random variable [latex]X \sim N(\mu, \sigma)[/latex], then the standardized variable [latex]Z=\frac{X-\mu}{\sigma} \sim N(0,1)[/latex].
  • Given an [latex]x[/latex] value, its z-score is [latex]z=\frac{x-\mu}{\sigma}[/latex]
  • Given the z-score, find the [latex]x[/latex] value: [latex]x=\mu+z \times \sigma[/latex].
  • If the sample mean [latex]\bar{X} \sim N(\mu_{\bar{X}}=\mu, \sigma_{\bar{X}}= \frac{\sigma}{\sqrt{n}})[/latex], then the standardized variable [latex]Z=\frac{\bar{X}-\mu_{\bar{X}}}{\sigma_{\bar{X}}}= \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \sim N(0,1)[/latex].
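The z-score conversions can be sketched with `statistics.NormalDist` from the standard library; the model [latex]N(100, 15)[/latex] below is hypothetical. Standardization preserves probabilities, so [latex]P(X \le x) = P(Z \le z)[/latex]:

```python
from statistics import NormalDist

# Illustrative normal model: X ~ N(mu = 100, sigma = 15).
mu, sigma = 100, 15

x = 130
z = (x - mu) / sigma        # z-score of x
x_back = mu + z * sigma     # converting a z-score back to an x value

# Standardization preserves probabilities: P(X <= x) = P(Z <= z).
p_x = NormalDist(mu, sigma).cdf(x)
p_z = NormalDist().cdf(z)
```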

Sampling Distributions

  • Mean and standard deviation of the sample mean [latex]\bar{X}: \mu_{\scriptsize \bar{X}}= \mu, \sigma_{\scriptsize \bar{X}} = \frac{\sigma}{\sqrt{n}}[/latex]
  • Mean and standard deviation of a sample proportion [latex]\hat{p}:  \mu_{\scriptsize \hat{p}} = p; \sigma_{\scriptsize \hat{p}} =\sqrt{ \frac{p(1-p)}{n}}[/latex]
  • Mean and standard deviation of [latex]\bar{X}_1 - \bar{X}_2: \mu_{\scriptsize \bar{X}_1 - \bar{X}_2} = \mu_1 - \mu_2; \sigma_{\scriptsize \bar{X}_1 - \bar{X}_2} = \sqrt{ \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}[/latex]
  • Mean and standard deviation of [latex]\hat{p}_1  - \hat{p}_2: \mu_{\scriptsize \hat{p}_1  - \hat{p}_2} = p_1 - p_2; \sigma_{\scriptsize \hat{p}_1  - \hat{p}_2} = \sqrt{ \frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}[/latex]
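The formulas for the sampling distribution of [latex]\bar{X}[/latex] can be verified exactly on a toy population by enumerating every possible sample (a sketch, not a proof; the population below is hypothetical):

```python
from itertools import product
import math

# Toy population, small enough that every sample of size n = 2 can be listed.
pop = [1, 2, 3]
mu = sum(pop) / len(pop)
sigma = math.sqrt(sum((x - mu) ** 2 for x in pop) / len(pop))

# Enumerate all ordered samples of size 2 drawn with replacement.
n = 2
means = [sum(s) / n for s in product(pop, repeat=n)]
mu_xbar = sum(means) / len(means)
sigma_xbar = math.sqrt(sum((m - mu_xbar) ** 2 for m in means) / len(means))
# mu_xbar equals mu, and sigma_xbar equals sigma / sqrt(n).
```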

Confidence Intervals and Hypothesis Tests

For each parameter, the point estimate, the test statistic, and the [latex](1 - \alpha) \times 100 \%[/latex] confidence interval are:

  • Mean [latex]\mu[/latex]: estimate [latex]\bar{x}[/latex]; test statistic [latex]t_o = \frac{\bar{x} - \mu_0}{\left( \frac{s}{\sqrt{n}} \right)}[/latex]; confidence interval [latex]\bar{x} \pm t_{\alpha / 2} \frac{s}{\sqrt{n}}[/latex] with [latex]df=n-1[/latex]
  • Proportion [latex]p[/latex]: estimate [latex]\hat{p} = \frac{x}{n}[/latex]; test statistic [latex]z_o = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}}[/latex]; confidence interval [latex]\hat{p} \pm z_{\alpha / 2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}[/latex]
  • Difference in means [latex]\mu_1 - \mu_2[/latex] (independent samples, non-pooled): estimate [latex]\bar{x}_1 - \bar{x}_2[/latex]; test statistic [latex]t_o = \frac{(\bar{x}_1 - \bar{x}_2) - \Delta_0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}[/latex]; confidence interval [latex](\bar{x}_1 - \bar{x}_2) \pm t_{\alpha / 2} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}[/latex] with

    [latex]df = \frac{\left( \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} \right)^2}{ \left( \frac{1}{n_1 - 1} \right) \left( \frac{s_1^2}{n_1} \right)^2 + \left( \frac{1}{n_2 - 1} \right) \left( \frac{s_2^2}{n_2} \right)^2}[/latex]

  • Difference in means [latex]\mu_1 - \mu_2[/latex] (independent samples, pooled): estimate [latex]\bar{x}_1 - \bar{x}_2[/latex]; test statistic [latex]t_o = \frac{(\bar{x}_1 - \bar{x}_2) - \Delta_0}{ s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}[/latex]; confidence interval [latex](\bar{x}_1 - \bar{x}_2) \pm t_{\alpha / 2} s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}[/latex] with [latex]df=n_1+n_2-2[/latex], where [latex]s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}[/latex]
  • Difference in means [latex]\mu_1 - \mu_2[/latex] (paired samples): estimate [latex]\bar{d}[/latex]; test statistic [latex]t_o = \frac{\bar{d} - \delta_0}{\left( \frac{s_d}{\sqrt{n}} \right)}[/latex]; confidence interval [latex]\bar{d} \pm t_{\alpha / 2} \frac{s_d}{\sqrt{n}}[/latex] with [latex]df=n-1[/latex], where [latex]n[/latex] is the number of pairs and

    [latex]\bar{d} = \frac{\sum d_i}{n}, s_d = \sqrt{\frac{\left( \sum d_i^2 \right) - \frac{\left( \sum d_i \right)^2}{n} }{n-1}}[/latex]

  • Difference in proportions [latex]p_1-p_2[/latex]: estimate [latex]\hat{p}_1 - \hat{p}_2[/latex]; test statistic [latex]z_o = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}_p (1 - \hat{p}_p)} \sqrt{ \frac{1}{n_1} + \frac{1}{n_2} }}[/latex]; confidence interval [latex](\hat{p}_1 - \hat{p}_2) \pm z_{\alpha / 2} \sqrt{\frac{\hat{p}_1 (1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2 (1 - \hat{p}_2)}{n_2}}[/latex], where

    [latex]\hat{p}_p = \frac{x_1 + x_2}{n_1 + n_2}, \hat{p}_1 = \frac{x_1}{n_1}, \hat{p}_2 = \frac{x_2}{n_2}[/latex]

  • Slope [latex]\beta_1[/latex]: estimate [latex]b_1[/latex]; test statistic [latex]t_o = \frac{b_1}{\left( \frac{s_e}{\sqrt{S_{xx}}} \right)}[/latex]; confidence interval [latex]b_1 \pm t_{\alpha / 2} \frac{s_e}{\sqrt{S_{xx}}}[/latex] with [latex]df=n-2[/latex]
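As one worked instance, the one-proportion [latex]z[/latex] interval and test can be sketched with the standard library; the counts below are hypothetical:

```python
from statistics import NormalDist
import math

# Hypothetical data: x = 120 successes in n = 200 trials.
x, n = 120, 200
p_hat = x / n

# 95% confidence interval for p.
alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)          # z_{alpha/2} ~ 1.96
moe = z_crit * math.sqrt(p_hat * (1 - p_hat) / n)     # margin of error
ci = (p_hat - moe, p_hat + moe)

# Two-sided test of H0: p = 0.5.
p0 = 0.5
z0 = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)      # test statistic
p_value = 2 * (1 - NormalDist().cdf(abs(z0)))         # two-sided p-value
```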

Margin of Error and Sample Size Calculation

  • Margin of error for the estimate of [latex]\mu: E = z_{\alpha / 2} \frac{\sigma}{\sqrt{n}}[/latex]
  • Sample size calculation for [latex]\mu : n = \left( \frac{\sigma \times z_{\alpha / 2}}{E} \right)^2[/latex], rounded up to the nearest integer
  • Margin of error for the estimate of [latex]p: E = z_{\alpha / 2} \sqrt{ \frac{\hat{p}(1 - \hat{p})}{n} }[/latex]
  • Sample size calculation for [latex]p[/latex] without guessing [latex]\hat{p}: n = 0.5 (1 - 0.5) \left( \frac{z_{\alpha / 2}}{E} \right)^2 = 0.25 \left( \frac{z_{\alpha / 2}}{E} \right)^2[/latex], rounded up to the nearest integer
  • Sample size calculation for [latex]p[/latex] with a guessed value [latex]p_g[/latex] for [latex]\hat{p}: n = p_g (1 - p_g) \left( \frac{z_{\alpha / 2}}{E} \right)^2[/latex], rounded up to the nearest integer
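The sample-size formulas for [latex]p[/latex] can be sketched as follows, with an illustrative 95% confidence level, margin of error [latex]E = 0.03[/latex], and hypothetical guess [latex]p_g = 0.2[/latex]:

```python
from statistics import NormalDist
import math

# Illustrative targets: 95% confidence, margin of error E = 0.03.
alpha, E = 0.05, 0.03
z = NormalDist().inv_cdf(1 - alpha / 2)

# Without a guess for p-hat (conservative choice p = 0.5), round up.
n_conservative = math.ceil(0.25 * (z / E) ** 2)

# With a prior guess p_g = 0.2, round up.
p_g = 0.2
n_guess = math.ceil(p_g * (1 - p_g) * (z / E) ** 2)
```

A prior guess farther from 0.5 always shrinks the required sample size, which is why 0.5 is the conservative choice.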

Chi-Square Test

  • Chi-square goodness-of-fit test for one categorical/discrete variable:
    • Expected frequency: [latex]E=np[/latex]
    • Test statistic: [latex]\chi_o^2 = \sum_{\text{all cells}} \frac{(O - E)^2}{E}[/latex] with [latex]df=k-1[/latex], where [latex]k[/latex] is the number of possible values of the variable
  • Chi-square independence (or homogeneity) test of two variables:
    • Expected frequency: [latex]E=\frac{\text{(rth row total)} \times \text{(cth column total)}}{n}[/latex]
    • Test statistic: [latex]\chi_o^2 = \sum_{\text{all cells}} \frac{(O - E)^2}{E}[/latex] with [latex]df=(r-1) \times (c-1)[/latex], where [latex]r[/latex] is the number of rows and [latex]c[/latex] is the number of columns of the table
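A minimal sketch of the goodness-of-fit computation, using hypothetical counts from 60 rolls of a die tested against a fair-die model:

```python
# Hypothetical observed counts from 60 rolls of a die; the fair-die model
# assigns probability 1/6 to each face.
observed = [12, 8, 10, 9, 11, 10]
n = sum(observed)
expected = [n / 6] * 6            # E = n * p = 10 for every face

# Chi-square test statistic summed over all cells.
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1            # k - 1 = 5
```

The statistic would then be compared against a chi-square critical value (or converted to a p-value) with 5 degrees of freedom.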

Regression Analysis

  • Sums of squares

    [latex]S_{xy}= \sum x_i y_i - \frac{\left( \sum x_i \right) \left( \sum y_i \right) }{n}; \quad S_{xx} = \sum x_i^2 - \frac{\left( \sum x_i \right)^2}{n}; \quad S_{yy} = \sum y_i^2 - \frac{\left( \sum y_i \right)^2}{n}[/latex]

  • The least-squares straight line: [latex]\hat{y} = b_0 + b_1 x[/latex] , where [latex]b_1 = \frac{S_{xy}}{S_{xx}}[/latex] and [latex]b_0 = \bar{y} - b_1 \bar{x} = \frac{\sum y_i}{n}  - b_1 \frac{\sum x_i}{n}[/latex]
  • Total sum of squares: [latex]SST = \sum (y_i - \bar{y})^2 = S_{yy}[/latex]
  • Regression sum of squares: [latex]SSR = \sum (\hat{y} - \bar{y})^2 = r^2 S_{yy} = \frac{S_{xy}^2}{S_{xx}}[/latex]
  • Error sum of squares: [latex]SSE = \sum (y_i - \hat{y}_i)^2 = \sum e_i^2 = SST - SSR = S_{yy} - \frac{S_{xy}^2}{S_{xx}}[/latex]
  • Regression identity: [latex]SST=SSE+SSR[/latex]
  • Residual: [latex]e_i  = y_i -\hat{y}_i = y_i - (b_0 + b_1 x_i)[/latex]
  • Correlation coefficient: [latex]r = \frac{S_{xy}}{\sqrt{S_{xx} \times S_{yy}}}[/latex]
  • Coefficient of determination: [latex]R^2=r^2= \frac{S_{xy}^2}{S_{xx} \times S_{yy}}=\frac{SSR}{SST}[/latex]
  • Standard error of the estimate: [latex]s_e = \sqrt{ \frac{\sum (e_i - \bar{e})^2}{n-2} } = \sqrt{ \frac{\sum e_i^2}{n-2}} = \sqrt{\frac{SSE}{n-2}}[/latex]
  • Test statistic for [latex]\beta_1: t_o = \frac{b_1}{\left( \frac{s_e}{\sqrt{S_{xx}}} \right)}[/latex] with [latex]df=n-2[/latex]
  • A [latex](1 - \alpha) \times 100 \%[/latex] confidence interval for [latex]\beta_1: b_1 \pm t_{\alpha / 2} \frac{s_e}{\sqrt{S_{xx}}}[/latex] with [latex]df = n-2[/latex]
  • A [latex](1 - \alpha) \times 100 \%[/latex] confidence interval for the conditional mean [latex]\mu_p[/latex] is

    [latex]\hat{\mu}_p \pm t_{\alpha / 2} \times SE(\hat{\mu}_p) = (b_0 + b_1 x_p) \pm t_{\alpha / 2} \times s_e \sqrt{\frac{(x_p - \bar{x})^2}{S_{xx}} + \frac{1}{n}}[/latex] with [latex]df=n-2[/latex]

  • A [latex](1 - \alpha) \times 100 \%[/latex] confidence interval for a single response [latex]y_p[/latex] is

    [latex]\hat{y}_p \pm t_{\alpha / 2} \times SE(\hat{y}_p) = (b_0 + b_1 x_p) \pm t_{\alpha / 2} \times s_e \sqrt{\frac{(x_p - \bar{x})^2}{S_{xx}} + \frac{1}{n} + 1}[/latex] with [latex]df=n-2[/latex]
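The sums-of-squares route to the least-squares line can be sketched end to end on a small hypothetical data set:

```python
import math

# Hypothetical paired (x, y) data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

# Sums of squares S_xy, S_xx, S_yy (computational forms).
sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
syy = sum(y * y for y in ys) - sum(ys) ** 2 / n

b1 = sxy / sxx                       # slope
b0 = sum(ys) / n - b1 * sum(xs) / n  # intercept
r = sxy / math.sqrt(sxx * syy)       # correlation coefficient
sse = syy - sxy ** 2 / sxx           # error sum of squares
s_e = math.sqrt(sse / (n - 2))       # standard error of the estimate
```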

Analysis of Variance (One-Way ANOVA F Test)

Compare [latex]k[/latex] population means: [latex]\mu_1, \mu_2 , \dots , \mu_k[/latex]. Denote sample sizes as [latex]n_1,n_2, \dots ,n_k[/latex], sample means as [latex]\bar{x}_1, \bar{x}_2, \dots , \bar{x}_k[/latex] and sample standard deviations as [latex]s_1,s_2, \dots ,s_k[/latex]. Let [latex]n=n_1 + n_2 + \dots + n_k[/latex] and [latex]\bar{x} = \frac{\sum x_{ij}}{n}[/latex], where [latex]x_{ij}[/latex] is the [latex]j[/latex]th observation of sample [latex]i[/latex].

  • Test statistic: [latex]F_o = \frac{SSTR / (k-1)}{SSE / (n-k)} = \frac{MSTR}{MSE}[/latex] with [latex]df_n=k-1[/latex] and [latex]df_d=n-k[/latex]
  • Total sum of squares: [latex]SST = \sum (x_{ij} - \bar{x})^2 = \sum x_{ij}^2 - \frac{\left( \sum x_{ij} \right)^2}{n}[/latex]
  • Treatment sum of squares: [latex]SSTR = \sum_{i=1}^k n_i (\bar{x}_i - \bar{x})^2[/latex]
  • Error sum of squares: [latex]SSE = \sum (x_{ij} - \bar{x}_i)^2 = \sum_{i=1}^k (n_i -1)s_i^2 = SST - SSTR[/latex]
  • ANOVA identity: [latex]SST=SSE+SSTR[/latex]
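The sums of squares and the [latex]F[/latex] statistic can be sketched on hypothetical data from three groups, which also verifies the ANOVA identity numerically:

```python
# Hypothetical samples from k = 3 groups.
groups = [[1, 2, 3], [2, 3, 4], [4, 5, 6]]
k = len(groups)
n = sum(len(g) for g in groups)
grand_mean = sum(sum(g) for g in groups) / n

# Treatment, error, and total sums of squares.
sstr = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
sse = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
sst = sum((x - grand_mean) ** 2 for g in groups for x in g)

# F statistic with df_n = k - 1 and df_d = n - k.
F = (sstr / (k - 1)) / (sse / (n - k))
```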

License


Introduction to Applied Statistics Copyright © 2024 by Wanhua Su is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.