Outlier Detection in Snowflake

Outlier Detection in Snowflake

Outlier detection is a key step to any analysis. It allows you to quickly spot errors in data collection, or which points you may need to remove before you do any statistical modeling.
There are a number of ways to find outliers, but this notebook goes through the most common ways:

The Data

For this example, we’ll be looking at some data that shows the total number of daily streams for the Top 200 Tracks on Spotify every day since Jan 1 2017:

Method 1: MAD (Median Absolute Deviation)

MAD describes how far away a point is from the median. This method can be preferable to other methods, like the z-score method because you don’t have to assume the data is normally distributed.

Screenshot 2021-05-27 at 2.53.24 pm

“A critical value for the Median Absolute Deviation of 5 has been suggested by workers in Nonparametric statistics, as a Median Absolute Deviation value of just under 1.5 is equivalent to one standard deviation, so MAD = 5 is approximately equal to 3 standard deviations.”

select diff_to_med.*, 
  percentile_cont(0.5) within group (order by difference) over() median_of_diff, 
  abs(difference/percentile_cont(0.5) within group(order by difference) over()) mad, 
  case 
    when abs(difference/percentile_cont(0.5) within group(order by difference)over()) > <THRESHOLD> then 'outlier' 
    else 'not outlier' 
  end label
from (select <COLS>, med.median ,abs(<OUTLIER_COLUMN> - med.median) difference
from <DATA>
cross join (select distinct percentile_cont(.5) within group (order by <COL>) median from DATA) MED) DIFF_TO_MED
order by mad desc

where:

  • <DATA> is the table where you’re looking for outliers
  • <COLS> are the other columns you want to return from your table
  • <OUTLIER_COLUMN> is the column you want to ID the outliers from
  • <THRESHOLD> is the outlier threshold (suggested: 5)

Example:

Method 2: Standard Deviations

The most common way outliers are detected is by finding points outside a certain number of sigmas (or standard deviations) from the mean.
If the data is normally distributed, then the following rules apply:

Image from anomoly.io
So, identifying points 2 or 3 sigmas away from the mean should lead us to our outliers.

select 
  data.*, 
  case 
    when <OUTLIER_COL> between mean - <THRESHOLD> * stdev and mean + <THRESHOLD> * stdev then 'not outlier' 
    else 'outlier' 
  end label,
  mean - <THRESHOLD> * stdev lower_bound,
  mean + <THRESHOLD>  * stdev upper_bound
from <DATA>
cross join (select 
  avg(<OUTLIER_COL>) mean, 
  stddev_samp(<OUTLIER_COL>) stdev
from <DATA>) MEAN_SD
order by label desc

where:

  • <DATA> is the table where you’re looking for outliers
  • <COLS> are the other columns you want to return from your table
  • <OUTLIER_COLUMN> is the column you want to ID the outliers from
  • <THRESHOLD> is the outlier threshold (suggested: 3)

Example:

Method 3: Z-Score

Similar to Method 2, the z-score method tells us the probability of a point occurring in a normal distribution.

Image from simplypsychology.org
To use z-scores to identify outliers, we can set an alpha (significance level) that will determine how ‘likely’ each point will be in order for it to be classified as an outlier.
This method also assumes the data is normally distributed.

select find_z.*,
case when abs(z_score) >= <THRESHOLD> then 'outlier' else ' ' end label
from (select 
  <COLS>,
  <OUTLIER_COLUMN>, 
  (<OUTLIER_COLUMN> - avg(<OUTLIER_COLUMN>) over ())
     / (stddev(<OUTLIER_COLUMN>) over ()) z_score
from <DATA>)FIND_Z
order by abs(z_score) desc

where:

  • <DATA> is the table where you’re looking for outliers
  • <COLS> are the other columns you want to return from your table
  • <OUTLIER_COLUMN> is the column you want to ID the outliers from
  • <THRESHOLD> is the outlier threshold (suggested: see below)

For a two-tailed z-test:
alpha of 0.1 = ± 1.645,
alpha of 0.05 = ± 1.960,
and alpha of 0.01 = ± 2.576

Example:

1 Like