Statistics is a branch of mathematics that deals with the collection, organization, analysis, interpretation, and presentation of data (dictionary definition).
Descriptive statistics summarizes the data with the help of a few indices, such as the mean (the central tendency) and the standard deviation (the dispersion). Inferential statistics draws conclusions from data that are subject to random variation, and it relies on probability theory.
Classification of Data (Numerical)
If the data values are arranged in ascending or descending order, the minimum and maximum values are revealed. The quantity represented by the numerical data is termed as a variate or a variable. Often, data values repeatedly occur in the dataset. The number of times a value occurs in the dataset is termed as the frequency. The representation using frequencies is the frequency distribution.
If the range of the data is wide, the values are grouped into class intervals instead of being listed individually. In general, in a class interval of $a$ to $b$, all values $\geq a$ and $< b$ are included.
Sometimes, a cumulative frequency distribution table is prepared.
Histogram, Frequency Polygon and Ogive are commonly used to represent the data.
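The tabulation described above can be sketched in Python. This is only an illustration: the `marks` data and the helper name `grouped_frequencies` are hypothetical, and the class intervals are assumed to include the lower limit and exclude the upper limit.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical sample data, chosen only for illustration.
marks = [4, 7, 7, 2, 9, 4, 7, 2, 4, 7]

# Frequency distribution: each distinct value with its count, in ascending order.
freq = sorted(Counter(marks).items())
counts = [f for _, f in freq]

# Cumulative frequency: running total of the frequencies.
cumulative = list(accumulate(counts))


def grouped_frequencies(data, start, width, n_classes):
    """Count values per class interval [a, b): a is included, b is excluded (assumed convention)."""
    table = []
    for k in range(n_classes):
        a = start + k * width
        b = a + width
        table.append(((a, b), sum(1 for x in data if a <= x < b)))
    return table


grouped = grouped_frequencies(marks, start=0, width=5, n_classes=2)

print(freq)        # value / frequency pairs
print(cumulative)  # cumulative frequencies
print(grouped)     # class interval / frequency pairs
```

The cumulative frequencies are what an ogive plots, while the grouped counts are what a histogram plots.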
Descriptive Statistics : Central Tendency
As mentioned earlier, in descriptive statistics, the central tendency and the dispersion are studied. The indices of central tendency are as follows :
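Arithmetic Mean :
For $n$ values $x_1, x_2, \ldots, x_n$,
$$\bar{x} = \frac{\sum x_i}{n}$$
If the data are in frequency distribution format,
$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$$
If the data are in grouped frequency distribution format,
$$\bar{x} = A + h \, \frac{\sum f_i d_i}{\sum f_i}$$
where $A$ is the assumed mean, generally the middle value in the dataset. $d_i$ is given by $d_i = \frac{x_i - A}{h}$. $h$ is the width of the class interval and $x_i$ is the mid-value of the class interval.
Joint Mean of 2 distributions with $n_1$ and $n_2$ values and means $\bar{x}_1$ and $\bar{x}_2$ is given by
$$\bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}$$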
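As a rough check of the arithmetic mean calculations, the sketch below compares the direct frequency-weighted mean with the assumed-mean (step-deviation) shortcut, taken here to be mean = A + h · Σ(f·d) / Σf with d = (x − A) / h. The function names and the sample table are hypothetical.

```python
def mean_from_frequencies(values, freqs):
    """Direct mean of a frequency distribution: sum(f * x) / sum(f)."""
    return sum(f * x for x, f in zip(values, freqs)) / sum(freqs)


def mean_step_deviation(midpoints, freqs, assumed_mean, width):
    """Step-deviation shortcut (assumed form): A + h * sum(f * d) / sum(f)."""
    d = [(x - assumed_mean) / width for x in midpoints]
    return assumed_mean + width * sum(f * di for f, di in zip(freqs, d)) / sum(freqs)


def joint_mean(n1, mean1, n2, mean2):
    """Combined mean of two groups with n1 and n2 values."""
    return (n1 * mean1 + n2 * mean2) / (n1 + n2)


# Hypothetical grouped data: class mid-values and their frequencies.
midpoints = [5, 15, 25, 35, 45]
freqs = [1, 4, 5, 3, 2]

direct = mean_from_frequencies(midpoints, freqs)
shortcut = mean_step_deviation(midpoints, freqs, assumed_mean=25, width=10)
combined = joint_mean(10, 20.0, 30, 40.0)

print(direct, shortcut, combined)
```

Both routes should give the same mean whatever assumed mean is chosen; the shortcut only reduces the size of the numbers handled by hand.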
Geometric Mean :
The geometric mean of $n$ positive values $x_1, x_2, \ldots, x_n$ is given by
$$G = (x_1 x_2 \cdots x_n)^{1/n}$$
Harmonic Mean :
The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals of the given values.
$$H = \frac{n}{\sum \frac{1}{x_i}}$$
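A small Python sketch of the geometric and harmonic means, assuming all values are positive. The function names and sample data are hypothetical; the geometric mean is computed through logarithms, which is equivalent to taking the $n$th root of the product.

```python
import math


def geometric_mean(values):
    """n-th root of the product of n positive values, computed via logarithms."""
    return math.exp(sum(math.log(x) for x in values) / len(values))


def harmonic_mean(values):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    return len(values) / sum(1 / x for x in values)


data = [2, 4, 8]  # hypothetical sample
am = sum(data) / len(data)
gm = geometric_mean(data)
hm = harmonic_mean(data)

print(am, gm, hm)
```

For positive values that are not all equal, the three means are ordered as harmonic < geometric < arithmetic, which gives a quick sanity check on hand calculations.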
Median/ Positional Average :
Median is that value of the variate which divides the dataset into 2 equal parts. Thus, an equal number of values exists on either side of the median. If the number of values is odd, the middle value (after sorting) is the median. If the number of values is even, the arithmetic mean of the two middle terms is the median.
In a cumulative frequency distribution, the median is that value of variate, whose cumulative frequency is equal to or just greater than $N/2$, where $N$ is the total number of values.
Mode :
The mode is the most frequently occurring value in the dataset.
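The positional averages can be sketched as follows. The sketch assumes the usual convention for the median (the middle value for an odd count, the mean of the two middle values for an even count) and an ungrouped frequency table for the cumulative-frequency rule. All function names are hypothetical.

```python
from collections import Counter


def median(values):
    """Middle value of the sorted data; mean of the two middle values if the count is even."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2


def median_from_cumulative(values, freqs):
    """First value (values in ascending order) whose cumulative frequency reaches N/2."""
    total = sum(freqs)
    running = 0
    for x, f in zip(values, freqs):
        running += f
        if running >= total / 2:
            return x


def mode(values):
    """Most frequent value; with ties, the value that appears first in the data wins."""
    return Counter(values).most_common(1)[0][0]


print(median([3, 1, 2]))
print(median([4, 1, 3, 2]))
print(median_from_cumulative([2, 4, 7, 9], [2, 3, 4, 1]))
print(mode([4, 7, 7, 2, 9, 4, 7]))
```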
Descriptive Statistics : Dispersion
These indices measure the spread of the values. Two datasets can have the same mean, yet the values of one may be spread over a wider range than those of the other.
Mean Deviation :
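When the deviations are measured from the mean $\bar{x}$, it is given by
$$\text{Mean deviation} = \frac{\sum f_i \, |x_i - \bar{x}|}{\sum f_i}$$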
Standard Deviation :
$$\sigma = \sqrt{\frac{\sum f_i (x_i - \bar{x})^2}{N}}$$
where $N = \sum f_i$. The square of the standard deviation, $\sigma^2$, is known as the variance.
Root Mean Square Deviation :
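$$s = \sqrt{\frac{\sum f_i (x_i - A)^2}{N}}$$
where $A$ is any arbitrary number and $N = \sum f_i$.
The root mean square deviation $s$ and the standard deviation $\sigma$ are related by the following relation :
$$s^2 = \sigma^2 + d^2, \quad d = \bar{x} - A$$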
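In grouped distributions, the standard deviation is calculated as follows :
$$\sigma = h \sqrt{\frac{\sum f_i d_i^2}{N} - \left(\frac{\sum f_i d_i}{N}\right)^2}$$
Here, $h$ is the width of the class interval, $d_i = \frac{x_i - A}{h}$. $A$ is the assumed mean, $x_i$ is the mid-value of the interval. $N$ is the sum of all frequencies.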
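The dispersion measures can be checked numerically with the sketch below. It assumes population-style formulas (division by the total frequency $N$), deviations taken from the mean for the mean deviation, and the step-deviation form $h\sqrt{\sum f d^2 / N - (\sum f d / N)^2}$ for grouped data. The function names and the sample table are hypothetical.

```python
import math


def weighted_mean(values, freqs):
    return sum(f * x for x, f in zip(values, freqs)) / sum(freqs)


def mean_deviation(values, freqs):
    """Average absolute deviation from the mean (assumed reference point)."""
    m = weighted_mean(values, freqs)
    return sum(f * abs(x - m) for x, f in zip(values, freqs)) / sum(freqs)


def rms_deviation(values, freqs, about):
    """Root mean square deviation about an arbitrary number."""
    n = sum(freqs)
    return math.sqrt(sum(f * (x - about) ** 2 for x, f in zip(values, freqs)) / n)


def standard_deviation(values, freqs):
    """RMS deviation about the mean (divides by N, not N - 1)."""
    return rms_deviation(values, freqs, about=weighted_mean(values, freqs))


def standard_deviation_step(midpoints, freqs, assumed_mean, width):
    """Step-deviation shortcut for grouped data (assumed form)."""
    n = sum(freqs)
    d = [(x - assumed_mean) / width for x in midpoints]
    s1 = sum(f * di for f, di in zip(freqs, d)) / n
    s2 = sum(f * di * di for f, di in zip(freqs, d)) / n
    return width * math.sqrt(s2 - s1 ** 2)


# Hypothetical grouped data: class mid-values and their frequencies.
midpoints = [5, 15, 25, 35, 45]
freqs = [1, 4, 5, 3, 2]

sigma = standard_deviation(midpoints, freqs)
sigma_step = standard_deviation_step(midpoints, freqs, assumed_mean=25, width=10)

A = 20  # arbitrary reference number
s = rms_deviation(midpoints, freqs, about=A)
d = weighted_mean(midpoints, freqs) - A

print(sigma, sigma_step)          # the two should agree
print(s ** 2, sigma ** 2 + d ** 2)  # s^2 should equal sigma^2 + d^2
```

The second printed pair illustrates why the standard deviation is the smallest possible root mean square deviation: any reference number other than the mean adds $d^2$ to the variance.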
The $r$th moment of a distribution about the mean is denoted by $\mu_r$ and is given by
$$\mu_r = \frac{\sum f_i (x_i - \bar{x})^r}{N}, \quad N = \sum f_i$$
This is similar to the moment of a force about a point, where we define the moment as force $\times$ the perpendicular distance.
The $r$th moment of a distribution about any arbitrary number $A$ is denoted by $\mu_r'$ and is given by
$$\mu_r' = \frac{\sum f_i (x_i - A)^r}{N}, \quad N = \sum f_i$$
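Relation between the $r$th moment about the mean ($\mu_r$) and the $r$th moment about any number $A$ ($\mu_r'$)
It can be shown that $\mu_r$ and $\mu_r'$ are related by the following equation :
$$\mu_r = \sum_{k=0}^{r} (-1)^k \binom{r}{k} \, \mu_{r-k}' \, (\mu_1')^k, \quad \mu_0' = 1$$
In particular,
$$\mu_2 = \mu_2' - (\mu_1')^2$$
$$\mu_3 = \mu_3' - 3\mu_2'\mu_1' + 2(\mu_1')^3$$
$$\mu_4 = \mu_4' - 4\mu_3'\mu_1' + 6\mu_2'(\mu_1')^2 - 3(\mu_1')^4$$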
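The link between the two kinds of moments can be verified numerically. The sketch below assumes the binomial expansion of $(x - \bar{x})^r$ around an arbitrary number $A$, namely $\mu_r = \sum_k (-1)^k \binom{r}{k} \mu_{r-k}' (\mu_1')^k$ with $\mu_0' = 1$. The function names and the data are hypothetical.

```python
from math import comb


def moment_about(values, freqs, r, about):
    """r-th moment of a frequency distribution about the number `about`."""
    n = sum(freqs)
    return sum(f * (x - about) ** r for x, f in zip(values, freqs)) / n


def central_from_raw(raw, r):
    """r-th moment about the mean from moments about A, where raw[k] is the k-th
    moment about A and raw[0] == 1 (binomial expansion, assumed form)."""
    return sum(comb(r, k) * (-1) ** k * raw[r - k] * raw[1] ** k for k in range(r + 1))


# Hypothetical data with unit frequencies.
values = [1, 2, 3, 4, 10]
freqs = [1, 1, 1, 1, 1]
A = 2  # arbitrary reference number

mean = moment_about(values, freqs, 1, about=0)
raw = [moment_about(values, freqs, k, about=A) for k in range(5)]

for r in (2, 3, 4):
    direct = moment_about(values, freqs, r, about=mean)
    converted = central_from_raw(raw, r)
    print(r, direct, converted)  # the two columns should agree
```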
Skewness tells how far the frequency distribution curve departs from symmetry.
I) Frequency curve stretches towards right : Mean to the right of mode – Right/Positively skewed
II) Frequency curve stretches towards left : Mean to the left of mode – Left/Negatively skewed
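Skewness is measured by
$$\text{Skewness} = \frac{3(\bar{x} - M)}{\sigma}$$
where $\bar{x}$ is the mean, $M$ is the median and $\sigma$ is the standard deviation.
The coefficient of skewness is given by
$$\beta_1 = \frac{\mu_3^2}{\mu_2^3}$$
where $\mu_2$ and $\mu_3$ are the second and third moments about the mean.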
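In Greek, kurtos means curved or bulging. Kurtosis measures how peaked the frequency distribution curve is. The coefficient of kurtosis, $\beta_2$, is given by
$$\beta_2 = \frac{\mu_4}{\mu_2^2}$$
where $\mu_2$ and $\mu_4$ are the second and fourth moments about the mean.
The curve which is neither flat nor peaked is mesokurtic ($\beta_2 = 3$). If $\beta_2 > 3$, the curve is leptokurtic or peaked. If $\beta_2 < 3$, the curve is platykurtic or flat-topped.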
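A short sketch of the moment-based shape measures, assuming the usual definitions $\beta_1 = \mu_3^2 / \mu_2^3$ and $\beta_2 = \mu_4 / \mu_2^2$ with moments taken about the mean, and the median-based measure $3(\text{mean} - \text{median}) / \sigma$. The function names, the tolerance and the sample data are hypothetical.

```python
import math


def central_moment(values, r):
    """r-th moment about the mean for ungrouped data."""
    m = sum(values) / len(values)
    return sum((x - m) ** r for x in values) / len(values)


def beta1(values):
    """Moment coefficient of skewness: mu3^2 / mu2^3."""
    return central_moment(values, 3) ** 2 / central_moment(values, 2) ** 3


def beta2(values):
    """Coefficient of kurtosis: mu4 / mu2^2."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


def median_skewness(values):
    """3 * (mean - median) / standard deviation (assumed median-based measure)."""
    s = sorted(values)
    n = len(s)
    med = s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
    mean = sum(s) / n
    return 3 * (mean - med) / math.sqrt(central_moment(s, 2))


def classify_kurtosis(b2, tol=1e-9):
    if abs(b2 - 3) <= tol:
        return "mesokurtic"
    return "leptokurtic" if b2 > 3 else "platykurtic"


skewed = [1, 2, 3, 4, 10]    # long tail to the right
symmetric = [1, 2, 3, 4, 5]

print(beta1(skewed), median_skewness(skewed))
print(beta1(symmetric), median_skewness(symmetric))
print(beta2(skewed), classify_kurtosis(beta2(skewed)))
```

Note that $\beta_1$ is never negative, so the direction of the skew has to be read from the sign of $\mu_3$ or of the median-based measure.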
When the dataset contains 2 variables, each data point is an ordered pair, say $(x, y)$. Such distributions are known as bivariate distributions. We may wish to test the relationship between the 2 variables, if any.
Examples : Amount of time spent by students on social media vs. marks obtained, rainfall in a year and the crop production in the following year.
We may be tempted to draw a conclusion without looking at the actual values. The variables in the first example may seem to be correlated, but we cannot be sure unless we examine the dataset.
In the second example, we can provide a good reasoning for the correlation.
Correlation and Karl Pearson’s Coefficient
When a change in one variable leads to a corresponding change in the other, we say that the variables are correlated. If one increases and the other decreases, it is a negative correlation. If both increase, it is a positive correlation.
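If the ratio of the corresponding changes in the two variables is constant, the correlation is said to be linear. According to Karl Pearson, the strength of the linear relationship between the variables $x$ and $y$ is given by the correlation coefficient :
$$r = \frac{\frac{1}{n}\sum (x_i - \bar{x})(y_i - \bar{y})}{\sigma_x \sigma_y}$$
$\frac{1}{n}\sum (x_i - \bar{x})(y_i - \bar{y})$ is the covariance. When the covariance is divided by the product of the standard deviations $\sigma_x$ and $\sigma_y$, we get the correlation coefficient.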
Note that the value of the correlation coefficient $r$ always lies between $-1$ and $+1$.
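The correlation coefficient can be computed directly from its definition as covariance divided by the product of the standard deviations. The sketch below assumes population-style formulas (division by $n$); the function name and the sample pairs are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)


xs = [1, 2, 3, 4, 5]

perfect_positive = pearson_r(xs, [2, 4, 6, 8, 10])  # y increases in step with x
perfect_negative = pearson_r(xs, [10, 8, 6, 4, 2])  # y decreases in step with x
partial = pearson_r(xs, [2, 4, 5, 4, 5])            # scattered around a rising line

print(perfect_positive, perfect_negative, partial)
```

The function fails with a division by zero if either variable is constant, since its standard deviation is then 0 and the coefficient is undefined.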
If the variables are known to be correlated, the value of one variable can be estimated when the value of the other variable is known. This is known as regression. In order to do that, a relationship between the variables, say $x$ and $y$, is developed in the form of an equation. Further, if it is known that the relationship is linear, the equation will be of the form
$$y = a + bx$$
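I) Line of Regression of $y$ on $x$
$$y - \bar{y} = r \, \frac{\sigma_y}{\sigma_x} \, (x - \bar{x})$$
II) Line of Regression of $x$ on $y$
$$x - \bar{x} = r \, \frac{\sigma_x}{\sigma_y} \, (y - \bar{y})$$
The regression coefficients are as follows :
I) of $y$ on $x$, $b_{yx} = r \, \frac{\sigma_y}{\sigma_x}$
II) of $x$ on $y$, $b_{xy} = r \, \frac{\sigma_x}{\sigma_y}$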
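A minimal sketch of the two regression lines, assuming the standard least-squares forms in which the regression coefficient of $y$ on $x$ equals the covariance divided by the variance of $x$ (equivalently $r \sigma_y / \sigma_x$), and symmetrically for $x$ on $y$. The function names and the data are hypothetical.

```python
def regression_coefficients(xs, ys):
    """Return (b_yx, b_xy, mean_x, mean_y) using population-style sums."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mx) ** 2 for x in xs) / n
    var_y = sum((y - my) ** 2 for y in ys) / n
    return cov / var_x, cov / var_y, mx, my


xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b_yx, b_xy, mx, my = regression_coefficients(xs, ys)


def predict_y(x):
    """Line of regression of y on x: y - mean_y = b_yx * (x - mean_x)."""
    return my + b_yx * (x - mx)


def predict_x(y):
    """Line of regression of x on y: x - mean_x = b_xy * (y - mean_y)."""
    return mx + b_xy * (y - my)


print(b_yx, b_xy)
print(predict_y(6), predict_x(6))
print(b_yx * b_xy)  # product of the regression coefficients equals r squared
```

Both lines pass through the point of means, and the product of the two regression coefficients gives $r^2$, which is a handy check in formula-based questions.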
Almost all MCQs are formula-based. So, make sure that you know all formulas with the terms forming them.