5-Number Summary with Python
What is the five-number summary? It is a way to describe the distribution of a data sample without assuming a specific distribution. The mean and standard deviation, by contrast, are best suited to summarizing a Gaussian (normal) distribution. The five-number summary can describe a data sample with any distribution.
Nonparametric Data Summarization
Data is usually summarized with a few key measurements. The most common approach is to calculate the mean and standard deviation, which works well for a normal distribution: with these two parameters you can understand and even re-create the distribution of the sample data.
The issue is that the mean and standard deviation are not very informative for non-Gaussian data. Yes, they can still be calculated, but they will not summarize the shape of the distribution and can sometimes be misleading. The five-number summary is used to describe the distribution of data that does NOT have a Gaussian distribution.
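To illustrate how the mean and standard deviation can mislead on skewed data, here is a minimal sketch using Python's built-in statistics module. The numbers are made up for illustration and are not from any real dataset.

```python
import statistics

# Hypothetical, deliberately skewed sample: four small values and one large one
data = [1, 2, 3, 4, 100]

mean = statistics.mean(data)      # pulled upward by the single large value
stdev = statistics.stdev(data)    # sample standard deviation, also inflated
median = statistics.median(data)  # middle value, unaffected by the outlier

print("Mean:", mean)
print("Standard deviation:", round(stdev, 1))
print("Median:", median)
```

The mean comes out at 22 even though four of the five values are 4 or less, while the median of 3 sits where most of the data actually is.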
The five statistical quantities are:
- Median: The middle value of the sample, also called the 50th percentile or 2nd quartile
- 1st Quartile: The 25th percentile
- 3rd Quartile: The 75th percentile
- Minimum: Smallest observation in the sample
- Maximum: Largest observation in the sample
Quartiles are the three values that split the sorted data into four equally sized parts. The 2nd quartile splits the data into two halves, and the 1st and 3rd quartiles split each half into quarters.
Quartiles are often expressed as percentiles. Both are examples of rank statistics, which can be calculated on data with any distribution. They show how much of the data lies below or above an observed value. Box and whisker plots are a graphical method used to summarize the distribution of data.
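As a small worked example of the quartile definition, the sketch below uses NumPy on the integers 1 through 9. The data is chosen purely for illustration, and the result relies on NumPy's default linear interpolation between observations.

```python
import numpy as np

# Illustrative data: the integers 1..9, already sorted
data = np.arange(1, 10)

# The three quartiles are the 25th, 50th and 75th percentiles
q1, q2, q3 = np.percentile(data, [25, 50, 75])
print(q1, q2, q3)  # 3.0 5.0 7.0

# The 2nd quartile (median) leaves the same number of values on each side
below = int((data < q2).sum())
above = int((data > q2).sum())
print(below, above)  # 4 4
```

Other quartile conventions exist and can give slightly different values on small samples, so treat the exact numbers as specific to this method.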
Using Python to Calculate the Five-Number Summary
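A minimal sketch of the calculation with NumPy, assuming a synthetic sample of 1,000 values drawn uniformly between 0 and 1. The sample size, the seed and the variable names are assumptions for illustration, so the exact numbers you get may differ slightly from those quoted in this post.

```python
import numpy as np

# Assumption: a synthetic sample of 1,000 uniform random values in [0, 1)
rng = np.random.default_rng(1)  # the seed is arbitrary
data = rng.random(1000)

# Quartiles are the 25th, 50th and 75th percentiles of the sample
q1, median, q3 = np.percentile(data, [25, 50, 75])

# Minimum and maximum complete the five-number summary
data_min, data_max = data.min(), data.max()

print(f"Min:    {data_min:.3f}")
print(f"Q1:     {q1:.3f}")
print(f"Median: {median:.3f}")
print(f"Q3:     {q3:.3f}")
print(f"Max:    {data_max:.3f}")
```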
The result shows values very close to their percentile levels: 0.24 for the 25th percentile, 0.50 for the 50th percentile and 0.75 for the 75th percentile.
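Pandas .describe()
With real datasets you will probably be working in Pandas, which offers an easier way to retrieve the five-number summary: .describe(). Its output includes the minimum, the three quartiles and the maximum, alongside the count, mean and standard deviation.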
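Here is a rough sketch of what that could look like. The DataFrame and its column name "value" are hypothetical stand-ins for a real dataset, filled with random numbers for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: one numeric column of 1,000 uniform random values
rng = np.random.default_rng(1)  # the seed is arbitrary
df = pd.DataFrame({"value": rng.random(1000)})

# describe() returns count, mean, std, min, 25%, 50%, 75% and max
summary = df["value"].describe()
print(summary)

# Keep only the five-number summary entries
five_number = summary[["min", "25%", "50%", "75%", "max"]]
print(five_number)
```

Calling describe() on the whole DataFrame instead of a single column produces the same statistics for every numeric column at once.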
In summary,
- The mean and standard deviation are meaningful for a normal (Gaussian) distribution
- The five-number summary can be used to describe any data distribution
- Pandas makes it easy to find the five-number summary
Thanks for reading this … and a shout-out to Machine Learning Mastery for providing in-depth explanations of statistics and machine learning concepts.