# Statistics, Correlation and Regression

• ### What is Statistics?

Statistics is a branch of mathematics that involves the collection, analysis, interpretation, presentation, and organization of data (dictionary definition).

Descriptive Statistics summarizes the data with the help of a few indices, such as the mean (a measure of central tendency) and the standard deviation (a measure of dispersion). Inferential Statistics draws conclusions from data that are subject to random variation; it relies on probability theory.

• ### Classification of Data (Numerical)

If the data values are arranged in ascending or descending order, the minimum and maximum values are revealed. The quantity represented by the numerical data is termed a variate or a variable. Often, values occur repeatedly in the dataset. The number of times a value occurs is termed its frequency. The representation using frequencies is the frequency distribution.

If the range of the data is wide, instead of listing individual values, they are grouped into class intervals. In general, in a class interval ${(a-b)}$, all values ${\ge a}$ and ${< b}$ are included.

Sometimes, a cumulative frequency distribution table is prepared.

• ### Representation of Data

Histogram, Frequency Polygon and Ogive are commonly used to represent the data.

• ### Descriptive Statistics : Central Tendency

As mentioned earlier, in descriptive statistics, the central tendency and the dispersion are studied. The indices of central tendency are as follows :

Arithmetic Mean :

${\mu_x \ or \ \bar x = \frac {x_1+x_2 + \cdots + x_n}{n} =\frac 1 n \sum \limits_{i=1}^{n} x_i}$

If the data are in frequency distribution format,

${\mu_x \ or \ \bar x = \frac {\sum \limits_{i=1}^{n} x_i f_i}{\sum \limits_{i=1}^{n} f_i}}$

If the data are in grouped frequency distribution format,

${\mu_x \ or \ \bar x = A + h \times \frac {\sum \limits_{i=1}^{n} f_i u_i}{\sum \limits_{i=1}^{n} f_i},}$

where ${A}$ is the assumed mean, generally a middle value of the dataset, ${u}$ is given by ${\frac {x-A}{h}}$, ${h}$ is the width of the class interval, and ${x}$ is the mid-value of the class interval.

The joint mean of 2 distributions with ${n_1}$ and ${n_2}$ values and means ${\mu_1}$ and ${\mu_2}$ is given by

${\frac {\mu_1 n_1 + \mu_2 n_2}{n_1 + n_2}}$
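As a quick illustration, here is a minimal Python sketch of the three arithmetic-mean formulas above; the data values are made up for this example:

```python
def mean(values):
    # plain arithmetic mean of raw values
    return sum(values) / len(values)

def mean_freq(xs, fs):
    # mean of a frequency distribution: sum(x_i * f_i) / sum(f_i)
    return sum(x * f for x, f in zip(xs, fs)) / sum(fs)

def joint_mean(mu1, n1, mu2, n2):
    # joint mean of two groups, weighted by group sizes
    return (mu1 * n1 + mu2 * n2) / (n1 + n2)

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(mean(data))                                    # 5.0
print(mean_freq([2, 4, 5, 7, 9], [1, 3, 2, 1, 1]))  # 5.0, same data as frequencies
print(joint_mean(5.0, 8, 10.0, 2))                   # 6.0
```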

Geometric Mean :

The geometric mean is given by

${\Big( \prod \limits_{i=1}^{n} x_i^{f_i} \Big)^{1/N},}$

where ${N = \sum \limits_{i=1}^{n} f_i}$

Harmonic Mean :

The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals of the given values.
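A small Python sketch of both means, on illustrative values (the geometric mean is computed via logarithms, which is numerically safer than multiplying the products directly):

```python
import math

def geometric_mean(xs, fs):
    # GM = (prod of x_i^f_i)^(1/N), where N = sum of frequencies
    N = sum(fs)
    return math.exp(sum(f * math.log(x) for x, f in zip(xs, fs)) / N)

def harmonic_mean(xs):
    # reciprocal of the arithmetic mean of the reciprocals
    return len(xs) / sum(1 / x for x in xs)

print(geometric_mean([2, 8], [1, 1]))  # ≈ 4.0, since sqrt(2 * 8) = 4
print(harmonic_mean([1, 4, 4]))        # 2.0
```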

Median/ Positional Average :

The median is that value of the variate which divides the dataset into 2 equal parts. Thus, an equal number of values lies on either side of the median. If the number of values is even, the arithmetic mean of the two middle terms is the median.

In a cumulative frequency distribution, the median is that value of the variate whose cumulative frequency is equal to or just greater than ${N/2}$, where ${N}$ is the total number of values.
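The cumulative-frequency rule above can be sketched in Python (the values below are made up); it returns the variate whose cumulative frequency first reaches ${N/2}$:

```python
def median_from_freq(xs, fs):
    # xs must be in ascending order; fs are the matching frequencies
    N = sum(fs)
    cum = 0
    for x, f in zip(xs, fs):
        cum += f
        if cum >= N / 2:      # cumulative frequency equal to or just above N/2
            return x

# dataset: 1,1,2,2,2,3,3,3,3,4  ->  N = 10, N/2 = 5
print(median_from_freq([1, 2, 3, 4], [2, 3, 4, 1]))  # 2
```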

Mode :

It is the most frequently occurring value in the data-set.

• ### Descriptive Statistics : Dispersion

These indices measure the spread of the values. Two datasets can have the same mean, but the values may be spread over a wider range in one of them.

Mean Deviation :

It is given by

${\frac {1}{N} \sum \limits_{i=1}^{n} f_i |x_i - \mu|}$

Standard Deviation :

${\sigma = \sqrt {\frac 1 N \sum \limits_{i=1}^{n} f_i (x_i - \mu)^2}}$

The square of the standard deviation is known as the variance.

Root Mean Square Deviation :

${S = \sqrt {\frac 1 N \sum \limits_{i=1}^{n} f_i (x_i - A)^2},}$

where ${A}$ is any arbitrary number.

${S}$ and ${\sigma}$ are related by the following relation :

${S^2 = \sigma^2 + (\mu - A)^2}$
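This relation between ${S}$ and ${\sigma}$ can be checked numerically; the sketch below uses arbitrary ungrouped values and ${A = 2}$:

```python
def variance_about(xs, c):
    # mean squared deviation of the values about the point c
    return sum((x - c) ** 2 for x in xs) / len(xs)

xs = [1, 2, 3, 4, 5]
mu = sum(xs) / len(xs)            # 3.0
sigma2 = variance_about(xs, mu)   # 2.0, the variance
S2 = variance_about(xs, 2)        # 3.0, mean square deviation about A = 2
print(S2 == sigma2 + (mu - 2) ** 2)  # True: S^2 = sigma^2 + (mu - A)^2
```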

In grouped distributions, the standard deviation is calculated as follows :

${\sigma = h \sqrt {\frac 1 N \sum \limits_{i=1}^n f_i u_i^2 - \left (\frac {\sum \limits_{i=1}^{n} f_i u_i}{N} \right)^2}}$

Here, ${h}$ is the width of class interval, ${u = \frac {x-A}{h}}$. ${A}$ is the assumed mean, ${x}$ is the mid-value of the interval. ${N}$ is the sum of all frequencies.
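A sketch of this step-deviation computation, using a hypothetical grouped distribution with class mid-values 5, 15, 25 and width ${h = 10}$:

```python
import math

def grouped_sigma(mids, fs, A, h):
    # u_i = (x_i - A) / h; sigma = h * sqrt(sum(f u^2)/N - (sum(f u)/N)^2)
    N = sum(fs)
    us = [(x - A) / h for x in mids]
    s1 = sum(f * u for f, u in zip(fs, us)) / N
    s2 = sum(f * u * u for f, u in zip(fs, us)) / N
    return h * math.sqrt(s2 - s1 ** 2)

# same answer as computing sigma directly on the values 5, 15, 15, 25
print(grouped_sigma([5, 15, 25], [1, 2, 1], A=15, h=10))  # ≈ 7.071 = sqrt(50)
```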

• ### Moments

${r}$th moment of a distribution about the mean ${\mu}$ is denoted by ${\mu_r}$ and is given by

${\mu_r = \frac {1}{N} \sum \limits_{i=1}^{n} f_i \big (x_i - \mu \big )^r}$

This is similar to the moment of a force about a point, where the moment is defined as the force ${\times}$ the perpendicular distance.

${r}$th moment of a distribution about any arbitrary number ${A}$ is denoted by ${\mu'_r}$ and is given by

${\mu'_r = \frac {1}{N} \sum \limits_{i=1}^{n} f_i \big (x_i - A \big )^r}$

• ### Relation between ${r}$th moment about the mean (${\mu_r}$) and ${r}$th moment about any number ${A}$, (${\mu'_r}$)

It can be shown that ${\mu_r}$ and ${\mu'_r}$ are related by the following equation :

${\mu_r = \mu'_r - \ ^rC_1 \mu'_{r-1} \mu'_1 + \ ^rC_2 \mu'_{r-2} (\mu'_1)^2 - \cdots + (-1)^r (\mu'_1)^r}$
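The identity can be verified numerically. The sketch below (arbitrary values, ${A = 0}$) checks the ${r = 3}$ case, which expands to ${\mu_3 = \mu'_3 - 3\mu'_2\mu'_1 + 2(\mu'_1)^3}$:

```python
def moment_about(xs, fs, c, r):
    # r-th moment of the frequency distribution about the point c
    N = sum(fs)
    return sum(f * (x - c) ** r for x, f in zip(xs, fs)) / N

xs, fs, A = [1, 2, 3, 4], [1, 2, 2, 1], 0
mu = moment_about(xs, fs, 0, 1)                 # the mean (first moment about 0)
m1, m2, m3 = (moment_about(xs, fs, A, r) for r in (1, 2, 3))

lhs = moment_about(xs, fs, mu, 3)               # mu_3, taken about the mean
rhs = m3 - 3 * m2 * m1 + 2 * m1 ** 3            # expansion in terms of mu'_r
print(abs(lhs - rhs) < 1e-9)  # True
```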

• ### Skewness

It tells how much the frequency distribution curve deviates from symmetry.

I) Frequency curve stretches towards right : Mean to the right of mode – Right/Positively skewed

II) Frequency curve stretches towards left : Mean to the left of mode – Left/Negatively skewed

Skewness is measured by

${\frac {\mu - m}{\sigma},}$

where ${m}$ is the median.

The coefficient of skewness is given by

${\beta_1 = \frac {\mu_3^2}{\mu_2^3}}$

• ### Kurtosis

In Greek, ${kurtos}$ means ${bulging}$. The coefficient of kurtosis, ${\beta_2}$, is given by

${\beta_2 = \frac {\mu_4}{\mu_2^2}}$

It measures how peaked the frequency distribution curve is.

A curve that is neither flat nor peaked (${\beta_2 = 3}$) is ${mesokurtic}$. If ${\beta_2 > 3}$, the curve is ${leptokurtic}$, or peaked. If ${\beta_2 < 3}$, the curve is ${platykurtic}$, or flat.
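Both coefficients depend only on the central moments; here is a small Python sketch on a made-up symmetric dataset:

```python
def central_moment(xs, r):
    # r-th moment about the mean, equal weights
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** r for x in xs) / len(xs)

def beta1(xs):
    # coefficient of skewness: mu_3^2 / mu_2^3
    return central_moment(xs, 3) ** 2 / central_moment(xs, 2) ** 3

def beta2(xs):
    # coefficient of kurtosis: mu_4 / mu_2^2
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2

sym = [1, 2, 3, 4, 5]
print(beta1(sym))  # 0.0 -- symmetric data, no skew
print(beta2(sym))  # 1.7 -- less than 3, so platykurtic
```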

• ### Bivariate Distributions

When the dataset contains 2 variables, each data point is an ordered pair, say ${(x,y)}$. Such distributions are known as bivariate distributions. We may wish to test the relationship, if any, between the 2 variables.

Examples : Amount of time spent by students on social media vs. marks obtained, rainfall in a year and the crop production in the following year.

We may be tempted to draw a conclusion without looking at the actual values. The variables in the first example may seem to be correlated, but we cannot be sure unless we have the dataset.

In the second example, we can provide a good reasoning for the correlation.

• ### Correlation and Karl Pearson’s Coefficient

When a change in one variable leads to a corresponding change in the other, we say that the variables are correlated. If one increases and the other decreases, it is a negative correlation. If both increase, it is a positive correlation.

If the ratio of the change in one variable to the corresponding change in the other is constant, the correlation is said to be linear. According to Karl Pearson, the strength of the linear relationship between the variables is given by the correlation coefficient :

${r = \frac {cov (x,y)}{\sigma_x \sigma_y} = \frac {\frac 1 n \sum \limits_{i=1}^n (x_i - \mu_x)(y_i - \mu_y)}{\sigma_x \sigma_y }}$

${cov (x,y)}$ is the covariance. Dividing the covariance by the product of the standard deviations gives the correlation coefficient.

Note that the value of ${r}$ always lies between ${-1}$ and ${1}$.
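A direct Python translation of the formula, applied to made-up pairs that happen to lie exactly on a line:

```python
import math

def pearson_r(xs, ys):
    # r = cov(x, y) / (sigma_x * sigma_y), population form
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

print(pearson_r([1, 2, 3], [2, 4, 6]))  # ≈ 1.0, perfect positive correlation
print(pearson_r([1, 2, 3], [6, 4, 2]))  # ≈ -1.0, perfect negative correlation
```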

• ### Regression

If the variables are known to be correlated, the value of one variable can be estimated when the value of the other is known. This is known as regression. To do so, a relationship between the variables, say ${x}$ and ${y}$, is developed in the form of an equation. Further, if the relationship is known to be linear, the equation will be of the form

${y=mx+c}$

I) Line of Regression of ${y}$ on ${x}$

${y - \mu_y = r \frac {\sigma_y}{\sigma_x} (x- \mu_x)}$

II) Line of Regression of ${x}$ on ${y}$

${x - \mu_x = r \frac {\sigma_x}{\sigma_y} (y- \mu_y)}$

• ### Regression Coefficients

I) of ${y}$ on ${x}$, ${b_{yx} = \frac {cov (x,y)}{\sigma_x^2}}$

II) of ${x}$ on ${y}$, ${b_{xy} = \frac {cov (x,y)}{\sigma_y^2}}$
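A sketch of both regression coefficients on illustrative data; a useful check is the standard identity ${b_{yx} \times b_{xy} = r^2}$:

```python
def regression_coeffs(xs, ys):
    # returns (b_yx, b_xy) = (cov/sigma_x^2, cov/sigma_y^2)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n   # sigma_x^2
    vy = sum((y - my) ** 2 for y in ys) / n   # sigma_y^2
    return cov / vx, cov / vy

xs, ys = [1, 2, 3, 4], [2, 3, 5, 6]
b_yx, b_xy = regression_coeffs(xs, ys)
print(b_yx, b_xy)   # 1.4 0.7
print(b_yx * b_xy)  # ≈ 0.98, which equals r^2
```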

• ### FOR MCQs

Almost all MCQs are formula-based, so make sure that you know all the formulas and the terms that form them.