A Definitive Guide to Descriptive Statistics
Published on August 20th, 2021, Revised on February 8, 2023
Descriptive statistics is the summarising and organising of the characteristics of a dataset. A dataset is a collection of responses or observations from a sample or an entire population (Mishra et al., 2019).
In quantitative research, the first step is the collection of data. Once the data has been collected, the researcher can proceed to analyse it.
Data analysis describes the characteristics of the responses, for instance the average of a single variable (e.g. age) or the relationship between two variables (e.g. age and creativity).
The next step in the process is inferential statistics. These are the tools used to decide whether the data support or refute your hypothesis and whether the findings can be generalised to a larger population.
Types of Descriptive Statistics
The three main types of descriptive statistics are as follows:
- The distribution, which concerns the frequency of each value.
- The central tendency, which concerns the average of the values.
- The variability or dispersion, which concerns how far the values are spread out.
These can be applied to one variable at a time in univariate analysis. Two or more variables can also be compared in bivariate and multivariate analysis (Kaliyadan and Kulkarni, 2019).
As an example, consider a study of the popularity of different leisure activities by gender. A survey is distributed among participants, who are asked how often they took part in each activity in the past year.
The responses to this survey form a dataset. Descriptive statistics can then be used to find the overall frequency of each activity, which is referred to as its distribution.
Next, the average for each activity is calculated, which is known as the central tendency. The last step is to measure the spread of responses for each activity, which is referred to as variability.
A dataset consists of a distribution of values or scores. The frequency of every value of a variable can be summarised as counts or percentages, in the form of tables or graphs.
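As a rough illustration, a frequency table can be built in a few lines of Python. The sketch below assumes the six library-visit responses used in the worked examples of this guide (15, 3, 12, 0, 24, 3); the variable names are illustrative only.

```python
from collections import Counter

# Library visits reported by six respondents (values from the worked examples).
visits = [15, 3, 12, 0, 24, 3]

# Count how often each distinct value occurs.
counts = Counter(visits)
n = len(visits)

# Build (value, count, percentage) rows, sorted by value.
frequency_table = [
    (value, count, round(100 * count / n, 1))
    for value, count in sorted(counts.items())
]

print("value count percent")
for value, count, percent in frequency_table:
    print(f"{value:>5} {count:>5} {percent:>6}%")
```

Each row gives a value, how many respondents reported it, and that count as a percentage of all responses.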
Quantification of Central Tendency
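The central tendency estimates the centre, or average, of a dataset. There are three different ways of finding an average: the mean, the median, and the mode.
The following demonstrates how to calculate the mean, median, and mode using the first six responses of the survey. For the library visits 15, 3, 12, 0, 24, 3, the mean is 57 / 6 = 9.5, the median of the ordered values 0, 3, 3, 12, 15, 24 is (3 + 12) / 2 = 7.5, and the mode is 3, the only value that occurs twice.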
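The three averages for the six library-visit responses can be computed with Python's standard `statistics` module. This is a minimal sketch; the variable names are illustrative.

```python
import statistics

# Library visits reported by the first six respondents.
visits = [15, 3, 12, 0, 24, 3]

# Mean: the sum of all values divided by the number of values.
mean_visits = statistics.mean(visits)

# Median: the middle of the ordered values (the average of the two
# middle values when the number of values is even).
median_visits = statistics.median(visits)

# Mode: the most frequently occurring value.
mode_visits = statistics.mode(visits)

print(mean_visits, median_visits, mode_visits)  # 9.5 7.5 3
```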
Measures of Variability
The measures of variability describe how far the response values are spread out. Three measures capture different aspects of spread: the range, the standard deviation, and the variance.
The range gives an idea of how far apart the most extreme responses are. It is calculated by subtracting the lowest value from the highest value.
Range of library visits in the past year
Ordered dataset: 0, 3, 3, 12, 15, 24
Range: 24 - 0 = 24
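The same calculation as a short Python sketch, using the six responses above:

```python
# Library visits in the past year for six respondents.
visits = [15, 3, 12, 0, 24, 3]

# Ordering the data makes the extremes easy to see.
ordered = sorted(visits)

# Range: highest value minus lowest value.
value_range = max(visits) - min(visits)

print(ordered)      # [0, 3, 3, 12, 15, 24]
print(value_range)  # 24
```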
The standard deviation (s) is the average amount of variability in a dataset and indicates how far, on average, each score lies from the mean (Leys et al., 2013). The larger the standard deviation, the more variable the dataset. To find the standard deviation, the following six steps are followed.
- Each score is listed, and the mean of the scores is calculated.
- The mean is then subtracted from each score to get the deviation from the mean.
- Each deviation is then squared.
- The sum of all squared deviations is taken.
- The sum is then divided by N - 1, where N is the number of scores.
- The last step is to take the square root of that result.
Standard deviation of library visits in the past year
Steps 1 to 4:
|Raw data||Deviation from mean||Squared deviation|
|15||15 - 9.5 = 5.5||30.25|
|3||3 - 9.5 = -6.5||42.25|
|12||12 - 9.5 = 2.5||6.25|
|0||0 - 9.5 = -9.5||90.25|
|24||24 - 9.5 = 14.5||210.25|
|3||3 - 9.5 = -6.5||42.25|
|M = 9.5||Sum = 0||Sum of squares = 421.5|
Step 5: 421.5 / 5 = 84.3
Step 6: the square root of 84.3 is approximately 9.18
This result means that, on average, each response deviates from the mean by about 9.18 points.
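The six steps can be traced in code. The following Python sketch mirrors the worked example for the library-visit data, assuming the sample formula with N - 1 in the denominator as described in the steps; the variable names are illustrative.

```python
import math

# Library visits in the past year for six respondents.
visits = [15, 3, 12, 0, 24, 3]
n = len(visits)

# Step 1: list the scores and calculate their mean.
mean = sum(visits) / n

# Step 2: subtract the mean from each score.
deviations = [x - mean for x in visits]

# Step 3: square each deviation.
squared_deviations = [d ** 2 for d in deviations]

# Step 4: sum the squared deviations.
sum_of_squares = sum(squared_deviations)

# Step 5: divide the sum by N - 1.
variance = sum_of_squares / (n - 1)

# Step 6: take the square root.
std_dev = math.sqrt(variance)

print(mean, sum_of_squares, round(variance, 1), round(std_dev, 2))
# 9.5 421.5 84.3 9.18
```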
The variance is the average of the squared deviations from the mean. The more spread out the data in the dataset, the larger the variance relative to the mean. The variance can be calculated simply by squaring the standard deviation, and it is denoted by the symbol s².
Variance of library visits in the past year
Dataset: 15, 3, 12, 0, 24, 3
s = 9.18
s² = 84.3
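As a quick cross-check, Python's standard `statistics` module provides `stdev` and `variance`, both of which use the sample (N - 1) formula. The sketch below confirms that the variance equals the square of the standard deviation for this dataset.

```python
import statistics

# Library visits in the past year for six respondents.
visits = [15, 3, 12, 0, 24, 3]

# Sample standard deviation and sample variance (both divide by N - 1).
s = statistics.stdev(visits)
s_squared = statistics.variance(visits)

print(round(s, 2))          # 9.18
print(round(s_squared, 1))  # 84.3

# Squaring the standard deviation recovers the variance
# (up to floating-point rounding).
print(abs(s ** 2 - s_squared) < 1e-9)  # True
```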
Univariate Descriptive Statistics
This type of descriptive statistics focuses on a single variable at a time. It is important to examine each variable separately using multiple measures: distribution, central tendency, and spread. Programs such as Excel or SPSS can be used to calculate all of them (Park, 2015).
|Visits to a library|
If the mean is the only measure of central tendency considered, the impression of the middle of the dataset can be skewed by outliers, unlike with the median or the mode.
Similarly, the range alone is not enough, because it is sensitive to extreme values. The standard deviation and variance also need to be considered to obtain comparable measures of spread.
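A small experiment illustrates this sensitivity. The sketch below adds one hypothetical extreme respondent (200 visits, an invented value that is not part of the survey) to the six library-visit responses and compares the summaries before and after. The `summarize` helper is illustrative only.

```python
import statistics

def summarize(values):
    """Return the mean, median, and range of a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "range": max(values) - min(values),
    }

# Library visits in the past year for six respondents.
visits = [15, 3, 12, 0, 24, 3]

# Hypothetical outlier added for illustration; not from the survey.
with_outlier = visits + [200]

before = summarize(visits)
after = summarize(with_outlier)

for name in ("mean", "median", "range"):
    print(f"{name:>6}: {before[name]:.1f} -> {after[name]:.1f}")
```

Under this assumption the mean rises from 9.5 to about 36.7 and the range from 24 to 200, while the median only moves from 7.5 to 12.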
Bivariate Descriptive Statistics
If data is collected on more than one variable, bivariate or multivariate descriptive statistics can be used to find out whether a relationship exists among the variables and what type it is. In bivariate analysis, the frequency and variability of two variables are studied together to find out whether they vary together. The central tendency of the two variables can also be compared before carrying out further statistical tests. The only difference between bivariate and multivariate analysis is that more than two variables are considered in multivariate analysis (Zhang, 2016).
In a contingency table, each cell represents the intersection of two variables. The independent variable (e.g., gender) is usually placed along the vertical axis, and the dependent variable (e.g., activities) appears along the horizontal axis.
A contingency table is easier to interpret when the data is expressed as percentages rather than raw counts. Percentages make the rows easy to compare, because each group is treated as if it contained 100 participants or observations. When a percentage-based contingency table is created, the N for each group of the independent variable is added.
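As a minimal sketch of the idea, the Python snippet below converts raw (group, category) responses into row percentages and keeps the N for each group. The responses and the `row_percentages` helper are hypothetical and only illustrate the layout; they are not the survey's data.

```python
from collections import Counter

# Hypothetical (gender, library-visit band) responses; not the survey's data.
responses = [
    ("Men", "0-4"), ("Men", "5-8"), ("Men", "5-8"), ("Men", "17+"),
    ("Women", "0-4"), ("Women", "13-16"), ("Women", "13-16"), ("Women", "17+"),
]

def row_percentages(pairs):
    """Return {group: {"N": group size, "percent": {category: row %}}}."""
    cell_counts = Counter(pairs)                         # count per table cell
    group_sizes = Counter(group for group, _ in pairs)   # N per row
    table = {}
    for (group, category), count in cell_counts.items():
        row = table.setdefault(group, {"N": group_sizes[group], "percent": {}})
        row["percent"][category] = 100 * count / group_sizes[group]
    return table

table = row_percentages(responses)

for group, row in table.items():
    cells = ", ".join(f"{cat}: {pct:.0f}%" for cat, pct in row["percent"].items())
    print(f"{group} (N = {row['N']}): {cells}")
```

Because each row sums to 100%, groups of different sizes can be compared directly.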
The table makes it clearer that an equal number of men and women went to the library more than 17 times in the last year. Moreover, men most commonly went to the library between 5 and 8 times, while women most commonly went between 13 and 16 times.
A scatter plot is a chart that depicts the relationship between two or three variables and gives a visual representation of the strength of that relationship. In a scatter plot, one variable is plotted along the x-axis and the other along the y-axis (Sedlmair et al., 2013). Each point in the chart represents one data point.
Suppose we investigate whether people who visit the library regularly tend to watch movies at the theatre less often. The number of times the participants went to watch a movie at the theatre is plotted along the x-axis and their number of library visits along the y-axis.
The scatter plot shows that the frequency of library visits increases as the number of movies seen at the theatre decreases. This linear relationship is represented visually, and on this basis, further tests of correlation and regression can be performed.
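One common follow-up is the Pearson correlation coefficient, which summarises the strength and direction of a linear relationship as a number between -1 and 1. The sketch below computes it from first principles; the `pearson_r` helper and the movie counts are hypothetical and chosen only to mimic the negative relationship described above.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of paired deviations from the means.
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_deviation / math.sqrt(ss_x * ss_y)

# Hypothetical paired observations; not taken from the survey.
movies_at_theatre = [0, 2, 4, 6, 8, 10]
library_visits = [24, 15, 12, 3, 3, 0]

r = pearson_r(movies_at_theatre, library_visits)
print(round(r, 2))  # a value close to -1 indicates a strong negative relationship
```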
- Adamson, K.A., and Prion, S., 2013. Making sense of methods and measurement: measures of central tendency. Clinical Simulation in Nursing, 9(12), pp.e617-e618.
- Kaliyadan, F., and Kulkarni, V., 2019. Types of variables, descriptive statistics, and sample size. Indian Dermatology Online Journal, 10(1), p.82.
- Leys, C., Ley, C., Klein, O., Bernard, P., and Licata, L., 2013. Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median. Journal of Experimental Social Psychology, 49(4), pp.764-766.
- Liu, W., Wang, L., and Yi, M., 2014. Simple-random-sampling-based multiclass text classification algorithm. The Scientific World Journal, 2014.
- Mishra, P., Pandey, C.M., Singh, U., Gupta, A., Sahu, C. and Keshri, A., 2019. Descriptive statistics and normality tests for statistical data. Annals of cardiac anesthesia, 22(1), p.67.
- Park, H.M., 2015. Univariate analysis and normality test using SAS, Stata, and SPSS.
- Sedlmair, M., Munzner, T., and Tory, M., 2013. Empirical guidance on scatterplot and dimension reduction technique choices. IEEE transactions on visualization and computer graphics, 19(12), pp.2634-2643.
- Verma, S.P., Díaz-González, L., Pérez-Garza, J.A. and Rosales-Rivera, M., 2016. Quality control in geochemistry from a comparison of four central tendency and five dispersion estimators and example of a geochemical reference material. Arabian Journal of Geosciences, 9(20), p.740.