Comparison (green is better) | Median | Mean |
---|---|---|
Sensitivity to outliers | The median is much more robust to outliers than the mean (see [1], [2]). | |
Artificiality | The median usually yields a "realistic" value, i.e. one that is present at least once in the dataset (with an even number of values, the conventional median averages the two middle values and may not be). | The mean can be completely artificial: with two groups of values it may fall in the gap between them, and with integer values it is often a non-integer. |
Representativeness | The median often gives a better intuition of the underlying data, especially for exponential and long-tailed distributions. It is the best choice when one has no idea of the underlying distribution of the data. | The mean is mostly appropriate for symmetric distributions such as the normal (Gaussian) distribution. |
Online computation | Computing the exact median requires storing all the data. | Can be computed online cheaply, with only the running sum and the number of observed values. |
Precision | | For certain distributions, the mean has a lower variance, i.e. a better statistical "efficiency" (see [2], [3]). |
Partial information | In certain cases, the theoretical median can be computed by knowing only half of the whole distribution: the median is the value n such that P(k<n) is very close to 0.5. It is exactly computable even if the right tail is infinite or completely irregular. | |
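
Several rows of the comparison can be checked with a few lines of Python. The following is a minimal sketch using only the standard `statistics` module; the sample values and the `RunningMean` class are hypothetical and chosen purely for illustration. It shows the effect of a single outlier, the even-count case where the conventional median is not an element of the data, and an online mean that keeps only a sum and a count.

```python
import statistics

# Hypothetical sample: nine typical values plus one extreme outlier.
typical = [10, 11, 12, 12, 13, 13, 14, 15, 16]
with_outlier = typical + [1000]

# The outlier drags the mean far away; the median stays where it was.
mean_typical = statistics.mean(typical)
mean_outlier = statistics.mean(with_outlier)
median_typical = statistics.median(typical)
median_outlier = statistics.median(with_outlier)

# With an even count, the conventional median averages the two middle
# values, so it need not be an element of the data; median_low always is.
even = [1, 2, 3, 4]
median_even = statistics.median(even)
median_low_even = statistics.median_low(even)


class RunningMean:
    """Online mean that stores only the running sum and the count."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def add(self, x):
        """Fold one more observation into the running state."""
        self.total += x
        self.count += 1

    def value(self):
        """Return the mean of everything seen so far."""
        if self.count == 0:
            raise ValueError("no values observed yet")
        return self.total / self.count


running = RunningMean()
for v in with_outlier:
    running.add(v)

print(mean_typical, mean_outlier)      # the mean jumps once the outlier is added
print(median_typical, median_outlier)  # the median does not move
print(median_even, median_low_even)
print(running.value())
```

An exact median has no equally simple running form: the sketch above computes it from the full list, which matches the "Online computation" row.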
These characteristics have an impact on algorithms that rely on the median or the mean. For instance, with respect to clustering, k-medoids (using the median) is less susceptible to local minima than standard k-means [4].
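
To make the difference between the two kinds of cluster centre concrete, here is a small sketch under simplifying assumptions: one-dimensional data, absolute difference as the distance, and a single hypothetical cluster. The `medoid` helper is illustrative and not taken from any library. It does not demonstrate the local-minima behaviour itself, only how the centre of one cluster reacts to an outlier in each approach.

```python
import statistics


def medoid(points):
    """Return the element of `points` with the smallest total distance to all points."""
    return min(points, key=lambda c: sum(abs(c - p) for p in points))


# Hypothetical one-dimensional cluster containing one outlier.
cluster = [1.0, 2.0, 3.0, 4.0, 100.0]

mean_center = statistics.mean(cluster)  # k-means style centre: pulled toward the outlier
medoid_center = medoid(cluster)         # k-medoids style centre: always an actual data point

print(mean_center, medoid_center)
```

In one dimension with this distance, the medoid coincides with a median element of the cluster, which is why the k-medoids centre inherits the median's robustness.
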
[1] Chapter "Mean versus Median", in "Statistics for the Life Sciences" (Myra L. Samuels, Jeffrey A. Witmer)
[2] Comp Basic: Mean vs. Median (Eliza Polly)
[3] Example of efficiency for mean vs. median (John D. Cook)
[4] A survey of outlier detection methodologies (Hodge, V., & Austin, J.), Artificial Intelligence Review, 2004.