On the on-line data mean/variance computation
April 08, 2021
Sometimes it is not possible to compute the mean and variance on the whole dataset at once; in machine learning, for instance, you often need to normalize data that does not fit in RAM. To compute the mean and standard deviation on chunked data, we need formulas that combine per-chunk statistics.
The problem
Our dataset is composed of chunks \(c_k,\ k=1,\dots\); each chunk is made up of elements \(a_i^{(k)},\ i=1,\dots,n^{(k)}\) (note that we do not assume all chunks have the same length). Our goal is to compute the mean and variance of all the elements \(a_i^{(k)},\ i=1,\dots,n^{(k)};k=1,\dots,T\), where \(T\) is the index of the last chunk received. However, we cannot compute the mean and variance directly on the elements \(a_i^{(k)}\), as they cannot all reside in memory at the same time.
Basic formulas
Given an ordered set \(a_1,\dots,a_n\) we define:
- the sample mean, \(\bar{a} = \frac{\sum_i^n a_i}{n}\)
- the sample variance, \(\hat{a} = \frac{\sum_i^n (a_i-\bar{a})^2}{n-1}\)
We note that: \(\hat{a} = \frac{\sum_i^n a_i^2 -\sum_i^n2a_i\bar{a}+\sum_i^n\bar{a}^2}{n-1} = \frac{\sum_i^n a_i^2 -2n\bar{a}^2+n\bar{a}^2}{n-1} = \frac{\sum_i^n a_i^2}{n-1} -\frac{n\bar{a}^2}{n-1}\)
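As a quick sanity check, the expanded form can be compared numerically with the textbook definition. The snippet below is only a sketch: the helper name `variance_from_sums` and the sample data are made up for illustration, and `statistics.variance` from the Python standard library serves as the reference for the \(n-1\) sample variance.

```python
import statistics


def variance_from_sums(values):
    """Sample variance via sum(a_i^2)/(n-1) - n*mean^2/(n-1).

    Hypothetical helper; assumes at least two values.
    """
    n = len(values)
    mean = sum(values) / n
    sum_sq = sum(x * x for x in values)
    return sum_sq / (n - 1) - n * mean ** 2 / (n - 1)


data = [2.0, 4.0, 4.0, 5.0, 7.0]  # illustrative data only
direct = statistics.variance(data)   # definition with the n-1 denominator
expanded = variance_from_sums(data)  # expanded form derived above
print(direct, expanded)
```

Both values should agree up to floating-point rounding. Note that the expanded form subtracts two potentially large numbers, so it can lose precision when the mean is large compared with the spread of the data.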
Statistics composition
Given two ordered sets, \(a_1,\dots,a_n\) with statistics \(\bar{a},\hat{a}\), and \(b_1,\dots,b_m\) with statistics \(\bar{b},\hat{b}\) we have that:
- the combined set has mean \(\bar{c} = \frac{\sum_i^n a_i + \sum_j^m b_j}{n+m}\)
- the combined set has variance \(\hat{c} = \frac{\sum_i^n a_i^2 + \sum_j^m b_j^2}{n+m-1} - \frac{(n+m)\bar{c}^2}{n+m-1}\)
It follows that:
\[\bar{c} = \frac{n\bar{a} + m\bar{b}}{n+m}\]
Since \(\sum_i^n a_i^2 = (n-1)\hat{a} + n\bar{a}^2\), and likewise \(\sum_j^m b_j^2 = (m-1)\hat{b} + m\bar{b}^2\),
\[\hat{c} = \frac{(n-1)\hat{a}+n\bar{a}^2+(m-1)\hat{b}+m\bar{b}^2}{n+m-1} - \frac{(n+m)\bar{c}^2}{n+m-1}\]
The solution
Given our chunks \(c_k\) arriving in order, we can apply the formulas from the previous section to maintain an up-to-date mean \(\mu\) and variance \(v\) of the entire dataset seen so far. We also need to keep the running count of elements seen so far, since the composition formulas use the sizes \(n\) and \(m\) of the two sets.
First we compute \(\mu,v\) on \(c_1\). When we receive/load \(c_2\) we compute its mean and variance, \(\bar{c}_2,\hat{c}_2\), and use the composition formulas to combine \(\mu,v\) with \(\bar{c}_2,\hat{c}_2\), saving the results back in \(\mu,v\). When we receive/load \(c_3\) we repeat the same step, and so on for every later chunk.
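The procedure can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names `chunk_stats`, `combine` and `running_stats` are hypothetical, chunks are assumed to be non-empty lists of floats, and a single-element chunk is given variance 0 (an assumption that is harmless here, because its variance is multiplied by \(m-1=0\) in the composition formula).

```python
import statistics


def chunk_stats(chunk):
    """Return (size, mean, sample variance) of one non-empty chunk."""
    n = len(chunk)
    mean = sum(chunk) / n
    # Assumption: a one-element chunk gets variance 0.0; it is weighted by n-1 = 0.
    var = statistics.variance(chunk) if n > 1 else 0.0
    return n, mean, var


def combine(n, mean_a, var_a, m, mean_b, var_b):
    """Apply the composition formulas to two sets of sizes n and m."""
    total = n + m
    mean_c = (n * mean_a + m * mean_b) / total
    # Recover the sums of squares from each set's variance and mean.
    sum_sq = (n - 1) * var_a + n * mean_a ** 2 + (m - 1) * var_b + m * mean_b ** 2
    var_c = (sum_sq - total * mean_c ** 2) / (total - 1)
    return total, mean_c, var_c


def running_stats(chunks):
    """Fold the chunks one at a time, keeping only (count, mu, v) in memory."""
    it = iter(chunks)
    count, mu, v = chunk_stats(next(it))  # statistics of the first chunk
    for chunk in it:
        m, mean_b, var_b = chunk_stats(chunk)
        count, mu, v = combine(count, mu, v, m, mean_b, var_b)
    return count, mu, v


# Illustrative chunks of different lengths.
count, mu, v = running_stats([[2.0, 4.0], [4.0, 5.0, 7.0]])
print(count, mu, v)
```

Because `running_stats` accepts any iterable, the chunks can come from a generator that loads them from disk one at a time, so only the current chunk and three running numbers are held in memory.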