Statistics: Describe your data using Python!

Photo by Pritesh Sudra on Unsplash

This is the first of a series of articles that I will write to give a gentle introduction to statistics. In this article we will introduce some basic statistical concepts and learn how to use basic statistics to help you describe your data. We will cover the following topics in this article:

  • The difference between a population and a sample
  • The difference between Descriptive and Inferential statistics
  • Different types of variables
  • Types of descriptive statistics
  • Normal or Gaussian distribution

The difference between a population and a sample:

  • Population denotes a large group consisting of elements having at least one common feature; it is the complete set of observations
  • Sample is a finite subset of the population; it is a subset of observations from a population. We get a sample from the population in either of the following ways
  • Representative sampling (using simple random sample) — here the sample’s characteristics are similar to the population characteristics
  • Convenience sampling — here we collect sample from section of population that is easily available

The difference between Descriptive and Inferential statistics:

Descriptive statistics — its all about organizing, describing and summarizing data

  • Exploratory data analysis (EDA)
  • measures of location — such as Mean, Median, Mode
  • measures of variability or dispersion — such as Variance, Standard deviation, Range, Inter quartile range (IQR)

Inferential statistics — its all about drawing conclusions about a population from analysis of a random sample drawn from the populaiton

  • Exploratory modelling — how is x related to y?
  • Predictive modelling — if you know x, can you predict y?

Different types of variables:

Quantitative

  • Discrete: a variable whose value is obtained by counting. Example, number of students in a class
  • Continuous: a variable whose value is obtained by measuring. Example, height of all students in a class
  • Interval: this is scale of measurement where continuous data is rank ordered
  • Ratio: this is scale of measurement where continuous data is rank ordered + has meaningful spacing

Qualitative or Categorical

  • Nominal: example gender — female or male
  • Ordinal: example size — small, medium, or large

Types of descriptive statistics:

Measures of location: mainly measures of central tendency

  • Mean: sum of all values divided by the number of values
import seaborn as sns
tips = sns.load_dataset('tips')
tips.mean() # shows mean of all numeric variables
  • Median: middle value in a given sequence of values ordered by rank
tips.median() # shows median of all numeric variables
  • Mode: most frequent value in a set of values
tips.mode() # shows mode of all variables

Measures of variability, spread or dispersion

  • Range: Maximum value — Minimum value
range = tips.total_bill.max() - tips.total_bill.min() # range
  • IQR (Inter quartile range): 75th percentile — 25th percentile
tips.total_bill.quantile(.75) - tips.total_bill.quantile(.25) # IQR
  • Variance: Measure of variability of data around the mean
tips.total_bill.var() # variance of total_bill variable
  • Standard deviation: how spread out the data is, i.e. how much variance there is from the mean
tips.total_bill.std() # standard deviation of total_bill variable
  • Coefficient of variance (C.V.): measure of standard deviation expressed as a percentage of the mean
cv = lambda x: x.std() / x.mean() * 100
cv(tips.total_bill)

Measures of symmetry and peakedness: Skewness measures symmetry and Kurtosis measures peakedness

Normal or Gaussian distribution

  • This is one of the most common statistical distribution. The curve of this distribution is shaped like a bell.
  • The shape of the bell depends on mean and standard deviation of the data
  • Larger the standard deviation, wider the distribution
  • A tip to quickly assess normality is to see if mean and median are nearly equal

Skewness and Kurtosis

Skewness measures tendency of data to be spread out on one side of the mean than the other. Skewness value indicates

  • Negative value indicates the data is left skewed
  • Positive value indicates the data is right skewed
  • Closer to zero for the data to be normally distributed
import scipy.stats as s
s.skew(tips.total_bill, bias=False) #calculate sample skewness

Kurtosis measures tendency of data to be concentrated around the center or tails. Kurtosis value indicates

  • Platykurtic: Negative value indicates lower than normal peakedness
  • Leptokurtic: Positive value indicates higher than normal peakedness
  • Mesokurtic: Closer to zero for the data to be normally distributed
import scipy.stats as s
s.kurtosis(tips.total_bill, bias=False) #calculate sample kurtosis

Comments welcome!

Experienced in synthesizing data to identify trends, deliver insights and recommendations. Focused on customer lifecycle, cross-sell, and employee performance.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store