## On the on-line data mean/variance computation

April 08, 2021

Sometimes it is not possible to compute the mean and variance on the whole dataset at once; in machine learning you often need to normalize data that cannot fit in RAM. To compute the mean and variance (and hence the standard deviation) on chunked data, we need formulas that combine the statistics of separate chunks.

### The problem

Our dataset is composed of chunks \(c_k,\ k=1,\dots\); each chunk is made up of elements \(a_i^{(k)},\ i=1,\dots,n^{(k)}\) (note that we do not assume all chunks have the same length). Our goal is to compute the mean and variance of all the elements \(a_i^{(k)},\ i=1,\dots,n^{(k)};\ k=1,\dots,T\), where \(T\) is the index of the last chunk received. However, we cannot compute the mean and variance directly on the elements \(a_i^{(k)}\), as they cannot all reside in memory at the same time.

### Basic formulas

Given an ordered set \(a_1,\dots,a_n\) we define:

- the sample mean, \(\bar{a} = \frac{\sum_i^n a_i}{n}\)
- the sample variance, \(\hat{a} = \frac{\sum_i^n (a_i-\bar{a})^2}{n-1}\)

We note that: \(\hat{a} = \frac{\sum_i^n a_i^2 -\sum_i^n2a_i\bar{a}+\sum_i^n\bar{a}^2}{n-1} = \frac{\sum_i^n a_i^2 -2n\bar{a}^2+n\bar{a}^2}{n-1} = \frac{\sum_i^n a_i^2}{n-1} -\frac{n\bar{a}^2}{n-1}\)
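As a quick sanity check of the identity above, here is a minimal Python sketch that evaluates the right-hand side and compares it with the standard library's sample variance. The function name `variance_from_sums` and the sample values are made up for illustration.

```python
import statistics


def variance_from_sums(values):
    """Sample variance computed as sum(a_i^2)/(n-1) - n*mean^2/(n-1)."""
    n = len(values)
    mean = sum(values) / n
    sum_sq = sum(x * x for x in values)
    return sum_sq / (n - 1) - n * mean ** 2 / (n - 1)


# Arbitrary illustrative data (not from the original post).
data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]

# Both values should agree up to floating-point rounding.
print(variance_from_sums(data), statistics.variance(data))
```

Note that this form subtracts two potentially large numbers, so it may lose precision when the mean is large compared to the spread of the data.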

### Statistics composition

Given two ordered sets, \(a_1,\dots,a_n\) with statistics \(\bar{a},\hat{a}\), and \(b_1,\dots,b_m\) with statistics \(\bar{b},\hat{b}\) we have that:

- the combined set has mean \(\bar{c} = \frac{\sum_i^n a_i + \sum_j^m b_j}{n+m}\)
- the combined set has variance \(\hat{c} = \frac{\sum_i^n a_i^2 + \sum_j^m b_j^2}{n+m-1} - \frac{(n+m)\bar{c}^2}{n+m-1}\)

It follows that:

\[\bar{c} = \frac{n\bar{a} + m\bar{b}}{n+m}\]

Since \(\sum_i^n a_i^2 = (n-1)\hat{a} + n\bar{a}^2\) (and likewise \(\sum_j^m b_j^2 = (m-1)\hat{b} + m\bar{b}^2\)),

\[\hat{c} = \frac{(n-1)\hat{a}+n\bar{a}^2+(m-1)\hat{b}+m\bar{b}^2}{n+m-1} - \frac{(n+m)\bar{c}^2}{n+m-1}\]

### The solution

Given our chunks \(c_k\) coming in order, we can apply the formulas from the previous section to keep up-to-date mean \(\mu\) and variance \(v\) of our entire dataset.

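First we compute \(\mu,v\) on \(c_1\) and record its number of elements \(N=n^{(1)}\), since the composition formulas need the sizes of both sets. When we receive/load \(c_2\) we compute its mean and variance, \(\bar{c}_2,\hat{c}_2\), and combine them with \(\mu,v\) using the composition formulas (with \(n=N\) and \(m=n^{(2)}\)), saving the results back in \(\mu,v\) and updating \(N\) to \(N+n^{(2)}\). When we receive/load \(c_3\) we repeat the same step, and so on.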
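The procedure above can be sketched in Python as follows. This is only an illustration under some assumptions: chunks are plain lists of floats, the helper names `combine`, `chunk_stats` and `running_stats` are hypothetical, empty chunks are skipped, and a single-element chunk is given variance 0 (its variance term is multiplied by \(n-1=0\) in the composition formula, so the value does not matter).

```python
def combine(n, mean_a, var_a, m, mean_b, var_b):
    """Merge the (count, mean, sample variance) of two sets."""
    total = n + m
    mean_c = (n * mean_a + m * mean_b) / total
    # Recover the sums of squares from each set's mean and variance.
    sum_sq = ((n - 1) * var_a + n * mean_a ** 2
              + (m - 1) * var_b + m * mean_b ** 2)
    var_c = (sum_sq - total * mean_c ** 2) / (total - 1)
    return total, mean_c, var_c


def chunk_stats(chunk):
    """Count, mean and sample variance of one in-memory chunk."""
    n = len(chunk)
    mean = sum(chunk) / n
    # Assumption: use 0.0 for a single-element chunk (see lead-in).
    var = sum((x - mean) ** 2 for x in chunk) / (n - 1) if n > 1 else 0.0
    return n, mean, var


def running_stats(chunks):
    """Consume chunks one at a time, keeping only count, mu and v."""
    count, mu, v = 0, 0.0, 0.0
    for chunk in chunks:
        if len(chunk) == 0:
            continue  # nothing to merge
        stats = chunk_stats(chunk)
        if count == 0:
            count, mu, v = stats  # first non-empty chunk
        else:
            count, mu, v = combine(count, mu, v, *stats)
    return count, mu, v


# Illustrative chunks of different lengths (made-up data).
chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]
count, mu, v = running_stats(chunks)
print(count, mu, v)
```

For these chunks, the result agrees (up to floating-point rounding) with the mean 5 and sample variance 7.5 obtained by treating the nine values as a single set. In practice `chunks` could be any iterable or generator that loads one chunk at a time, so only the current chunk and three numbers are held in memory.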