Why do we use the 1.5 IQR rule to find outliers?

Why 1.5 times IQR? Why not 1 or 2 or any other number?

The 1.5 IQR rule is a way to detect outliers using the interquartile range (IQR):
IQR = Q3 - Q1
To detect outliers with this method, we define a new range, let’s call it the decision range. Any data point lying outside this range is considered an outlier and is dealt with accordingly. The range is given below:

Lower Bound: **(Q1 - 1.5 * IQR)**
Upper Bound: **(Q3 + 1.5 * IQR)**

Any data point less than the Lower Bound or more than the Upper Bound is considered as an outlier.
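The decision rule above can be sketched in a few lines of Python. This is only an illustration: the function names `iqr_bounds` and `find_outliers` and the sample data are made up for this example, and the quartiles are computed with the standard library's "inclusive" method, which is one of several common quartile conventions (other conventions give slightly different bounds).

```python
import statistics

def iqr_bounds(values, scale=1.5):
    """Return the (lower, upper) decision bounds of the IQR rule."""
    # quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - scale * iqr, q3 + scale * iqr

def find_outliers(values, scale=1.5):
    """Return the values that fall outside the decision range."""
    lower, upper = iqr_bounds(values, scale)
    return [v for v in values if v < lower or v > upper]

# Made-up sample: nine ordinary readings and one suspicious value.
sample = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
print(iqr_bounds(sample))
print(find_outliers(sample))  # only 102 lies outside the range
```

Passing a different `scale` (for example 1 or 2) lets you see directly how the multiplier changes which points get flagged.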

But the question is: Why only 1.5 times the IQR? Why not any other number like 1 or 2?

Well, as you might have guessed, the number (here 1.5, hereinafter the scale) controls the sensitivity of the range and hence of the decision rule. A bigger scale (e.g. 2) would cause some outliers to be treated as ordinary data points, while a smaller one (e.g. 1) would cause some ordinary data points to be perceived as outliers. Neither of these cases is desirable.

To understand why 1.5 is a sensible choice that keeps both of these errors in check, let us go through the following discussion.

If we assume the data follow a Gaussian (normal) distribution, then the following things hold true. (Strictly speaking, the central limit theorem says that averages of repeated samples tend towards a Gaussian, not the raw data themselves, so for clearly non-Gaussian data the σ values below are only a rough guide.)

  • About 68.27% of the whole data lies within one standard deviation (<1σ) of the mean (μ), taking both sides into account, the pink region in the figure.
  • About 95.45% of the whole data lies within two standard deviations (<2σ) of the mean (μ), taking both sides into account, the pink+blue region in the figure.
  • About 99.73% of the whole data lies within three standard deviations (<3σ) of the mean (μ), taking both sides into account, the pink+blue+green region in the figure.
  • The remaining 0.27% of the whole data lies outside three standard deviations (>3σ) of the mean (μ), taking both sides into account, the little red region in the figure. This part of the data is considered as outliers.
  • The first and the third quartiles, Q1 and Q3, lie at about -0.675σ and +0.675σ from the mean, respectively. (This is an important piece of information that we will use later on, so please remember it.)
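If you want to check these figures yourself, here is a small sketch using Python's standard-library `statistics.NormalDist`. The helper name `share_within` is just something made up for this example. The printed values should agree with the percentages above up to rounding.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sigma: {share_within(k):.2%}")

# Q3 is the 75th percentile; by symmetry Q1 is its negative.
q3 = std_normal.inv_cdf(0.75)
print(f"Q1 = {-q3:.4f} sigma, Q3 = +{q3:.4f} sigma")
```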

Now we will calculate the lower and upper bounds for outlier detection in terms of σ, measured from the mean (μ), using three different scale values: 1, 2 and 1.5.

Scale = 1

Lower Bound:
= Q1 - 1 * IQR
= Q1 - 1 * (Q3 - Q1)
= -0.675σ - 1 * (0.675 - [-0.675])σ
= -0.675σ - 1 * 1.35σ
= -2.025σ
Upper Bound:
= Q3 + 1 * IQR
= Q3 + 1 * (Q3 - Q1)
= 0.675σ + 1 * (0.675 - [-0.675])σ
= 0.675σ + 1 * 1.35σ
= 2.025σ

So, when the scale is taken as 1, then according to the IQR method any data which lies beyond 2.025σ from the mean (μ), on either side, shall be considered an outlier. But as we know, up to 3σ on either side of μ the data is useful. So we cannot take scale = 1, because this makes the decision range too exclusive, meaning it results in too many outliers. In other words, the decision range gets so small (compared to 3σ) that it considers some genuine data points as outliers, which is not desirable.

Scale = 2

Lower Bound:
= Q1 - 2 * IQR
= Q1 - 2 * (Q3 - Q1)
= -0.675σ - 2 * (0.675 - [-0.675])σ
= -0.675σ - 2 * 1.35σ
= -3.375σ
Upper Bound:
= Q3 + 2 * IQR
= Q3 + 2 * (Q3 - Q1)
= 0.675σ + 2 * (0.675 - [-0.675])σ
= 0.675σ + 2 * 1.35σ
= 3.375σ

So, when the scale is taken as 2, then according to the IQR method any data which lies beyond 3.375σ from the mean (μ), on either side, shall be considered an outlier. But as we know, up to 3σ on either side of μ the data is useful. So we cannot take scale = 2, because this makes the decision range too inclusive, meaning it results in too few outliers. In other words, the decision range gets so big (compared to 3σ) that it considers some outliers as data points, which is not desirable either.

Scale = 1.5

Lower Bound:
= Q1 - 1.5 * IQR
= Q1 - 1.5 * (Q3 - Q1)
= -0.675σ - 1.5 * (0.675 - [-0.675])σ
= -0.675σ - 1.5 * 1.35σ
= -2.7σ
Upper Bound:
= Q3 + 1.5 * IQR
= Q3 + 1.5 * (Q3 - Q1)
= 0.675σ + 1.5 * (0.675 - [-0.675])σ
= 0.675σ + 1.5 * 1.35σ
= 2.7σ

When the scale is taken as 1.5, then according to the IQR method any data which lies beyond 2.7σ from the mean (μ), on either side, shall be considered an outlier. Of the three scales tried, this decision range is the closest to what the Gaussian distribution tells us, i.e., 3σ. In other words, this makes the decision rule closest to what the Gaussian distribution considers for outlier detection, and this is exactly what we wanted.
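The three hand calculations can be repeated in code. The following is a rough sketch that assumes perfectly Gaussian data; the names `upper_bound_in_sigma` and `share_flagged` are invented for this example. It uses the unrounded quartile (about 0.6745σ) instead of 0.675σ, so the bounds differ from the ones above in the third decimal place.

```python
from statistics import NormalDist

std_normal = NormalDist()        # mean 0, standard deviation 1
Q3 = std_normal.inv_cdf(0.75)    # about 0.6745; Q1 = -Q3 by symmetry
IQR = 2 * Q3                     # Q3 - Q1, about 1.349

def upper_bound_in_sigma(scale):
    """Upper bound Q3 + scale * IQR, in units of sigma from the mean."""
    return Q3 + scale * IQR

def share_flagged(scale):
    """Fraction of Gaussian data lying outside the lower or upper bound."""
    bound = upper_bound_in_sigma(scale)
    # The lower bound is -bound, so both tails have the same area.
    return 2 * (1 - std_normal.cdf(bound))

for scale in (1, 1.5, 2):
    bound = upper_bound_in_sigma(scale)
    print(f"scale {scale}: bounds at +/-{bound:.3f} sigma, "
          f"{share_flagged(scale):.3%} of data flagged")
```

For scale 1.5 this flags roughly 0.7% of Gaussian data, compared with the 0.27% that lies beyond 3σ.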

If we took the scale to be about 1.7, the cutoff would land almost exactly at 3σ (0.675σ + 1.7 * 1.35σ ≈ 2.97σ), but 1.5 is chosen because it is a convenient number and it strikes a balance between the extreme scale values of 1 and 2.
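To see where that number comes from, you can solve Q3 + scale * IQR = cutoff for the scale. A minimal sketch, again assuming Gaussian data (the function name `scale_for_cutoff` is made up for this example):

```python
from statistics import NormalDist

def scale_for_cutoff(cutoff_sigma):
    """Scale k such that Q3 + k * IQR equals cutoff_sigma for Gaussian data."""
    q3 = NormalDist().inv_cdf(0.75)   # about 0.6745 sigma
    iqr = 2 * q3                      # Q3 - Q1, with Q1 = -Q3
    return (cutoff_sigma - q3) / iqr

print(round(scale_for_cutoff(3.0), 2))  # scale needed for a 3-sigma cutoff
```

With the unrounded quartile this comes out at roughly 1.72, which is why 1.7 is usually quoted as the "exact" scale and 1.5 as the convenient one.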
