# The Stone-Weierstrass Approximation Theorem

With enough polynomials, we can approximate any continuous function \(f(x)\) on a compact set arbitrarily well. Better still, the approximation is *uniform*: for any \(\epsilon>0\), we can ensure that the pointwise approximation error is less than \(\epsilon\) simultaneously for all \(x\). It might seem like proving this fact ought to be painful, but there’s a simple constructive proof due to Bernstein, in which we explicitly build and evaluate the polynomial approximation using probabilistic methods. The proof is elegant and ingenious, illustrating that probabilities can help us understand even things that are not random at all.

Yes, this is on Wikipedia. But I don’t like their exposition, and I wanted to tinker with the proof to see how it works.

The approximation method is as follows. We approximate a function \(f\) on the domain \([0,1]\) (any closed interval of \(\mathbb R\) can be rescaled to \([0,1]\)) using a linear combination of polynomials of degree \(n\), as follows:

\[ \hat{f}_n(x)= \sum_{j=0}^n f\left(\frac{j}{n}\right) b_{j,n}(x) \] where \(b_{j,n}(x):={n \choose j} x^j (1-x)^{n-j}\) are the Bernstein basis polynomials.
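To make the scheme concrete, here is a minimal Python sketch (the function names `bernstein_basis` and `bernstein_approx` are my own):

```python
from math import comb

def bernstein_basis(j, n, x):
    """Bernstein basis polynomial b_{j,n}(x) = C(n,j) x^j (1-x)^(n-j)."""
    return comb(n, j) * x**j * (1 - x)**(n - j)

def bernstein_approx(f, n, x):
    """Degree-n Bernstein approximation: sum_j f(j/n) b_{j,n}(x)."""
    return sum(f(j / n) * bernstein_basis(j, n, x) for j in range(n + 1))

# For f(t) = t^2 the approximation has the closed form x^2 + x(1-x)/n,
# so at x = 0.5 with n = 10 it should equal 0.25 + 0.025 = 0.275.
print(bernstein_approx(lambda t: t * t, 10, 0.5))  # → 0.275
```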

Under the assumption that \(f\) is continuous, we will find a uniform upper bound on the pointwise approximation error \(\left | \hat{f}_n(x) -f(x) \right |\); that is, we find a bound that applies for all \(x\). We then show that this uniform bound converges to zero as \(n\) is taken to infinity.

## Proof

Notice that the basis functions are probabilities \(b_{j,n}(x) = \Pr(Z_n(x)=j|n,x)\) corresponding to a binomially distributed random variable \(Z_n(x)\sim \mathrm{Binom}(n,x)\). This is the key: we can study the properties of the basis function approximation using basic properties of the Binomial distribution.
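As a quick numerical sanity check (my own sketch), the basis functions at any fixed \(x\) behave exactly like a probability mass function: they are nonnegative, sum to one, and have mean \(nx\):

```python
from math import comb, isclose

n, x = 12, 0.3
# b_{j,n}(x) = C(n,j) x^j (1-x)^(n-j), the Binom(n, x) pmf at j
pmf = [comb(n, j) * x**j * (1 - x)**(n - j) for j in range(n + 1)]

assert all(p >= 0 for p in pmf)                               # nonnegative
assert isclose(sum(pmf), 1.0)                                 # sums to 1
assert isclose(sum(j * p for j, p in enumerate(pmf)), n * x)  # E[Z] = n x
```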

We may write \(\hat{f}_n(x) = \mathbb E\left [ f(\frac{Z_n(x)}{n})\right]\) where \(Z_n(x) \sim \mathrm{Binom}(n,x)\), and by Jensen’s Inequality

\[\left | \mathbb E\left [ f(\frac{Z_n(x)}{n})\right] -f(x) \right | \leq \mathbb E\left[\left | f(\frac{Z_n(x)}{n}) -f(x) \right |\right]\]

For concision, let \(\Delta_n(x)\) be the random variable \(\left | f(\frac{Z_n(x)}{n}) -f(x) \right |\). Fix an arbitrarily small \(\epsilon>0\) and split the above expectation into two parts:

\[\begin{align} \mathbb E[\Delta_n(x)] = & \mathbb E[\Delta_n(x)|\Delta_n(x) \leq \epsilon]\mathbb{P}(\Delta_n(x) \leq \epsilon) \\ &+ \mathbb E[\Delta_n(x)|\Delta_n(x) > \epsilon]\mathbb{P}(\Delta_n(x) > \epsilon) \end{align}\]

The first term in the sum is no greater than \(\epsilon\). So we will focus our efforts on the second term.

First consider \(\mathbb E[\Delta_n(x)|\Delta_n(x) > \epsilon]\). Since \(f\) is continuous and \([0,1]\) is closed and bounded, \(f\) is bounded; that is, there exists a bound \(M<\infty\) such that \(|f(x)|<M\) for all \(x\). It follows by the triangle inequality that \(| f(\frac{Z_n(x)}{n}) -f(x)| < 2M\). Hence

\[\mathbb E[\Delta_n(x)|\Delta_n(x) > \epsilon] \leq 2M\]

Now consider \(\mathbb{P}(\Delta_n(x) > \epsilon)\). Since \(f\) is continuous on \([0,1]\), which is closed and bounded, \(f\) is *uniformly* continuous on \([0,1]\). That is, for any \(\epsilon > 0\) there exists a \(\delta >0\) such that \(|a-b| < \delta \implies |f(a) - f(b)| < \epsilon\) for all \(a,b \in [0,1]\). Contrapositively, for our \(\epsilon>0\) there exists a \(\delta>0\) such that \(|f(\frac{Z_n(x)}{n}) -f(x)| \geq \epsilon \implies |\frac{Z_n(x)}{n} - x| \geq \delta\). So

\[\mathbb{P}(\Delta_n(x) > \epsilon)\leq\mathbb{P}\left(\left|\frac{Z_n(x)}{n}-x\right|\geq \delta\right)\]

By Chebyshev’s Inequality, since \(\mathrm{Var}(\frac{Z_n(x)}{n}) = \frac{x(1-x)}{n}\),

\[ \mathbb{P}\left(\left|\frac{Z_n(x)}{n}-x\right|\geq \delta\right) \leq \frac{x(1-x)}{n\delta^2}\leq\frac{1}{4n\delta^2} \]

where the final inequality uses \(x(1-x) \leq \frac{1}{4}\) for \(x\in[0,1]\).
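We can verify this tail bound numerically; here is a sketch of mine (the function name `tail_prob` is my own) that computes the exact binomial tail probability and compares it to \(\frac{1}{4n\delta^2}\):

```python
from math import comb

def tail_prob(n, x, delta):
    """Exact P(|Z_n/n - x| >= delta) for Z_n ~ Binom(n, x)."""
    return sum(comb(n, j) * x**j * (1 - x)**(n - j)
               for j in range(n + 1) if abs(j / n - x) >= delta)

n, delta = 50, 0.1
bound = 1 / (4 * n * delta**2)  # = 0.5
# The bound holds uniformly in x, as the proof requires.
for x in (0.1, 0.3, 0.5, 0.9):
    assert tail_prob(n, x, delta) <= bound
```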

Putting it all together,

\[ \left | \hat{f}_n(x) -f(x) \right | \leq \epsilon + \frac{2M}{4n\delta^2} \]

So we find that the approximation error can be made arbitrarily small, uniformly in \(x\): given any target error, first fix \(\epsilon\) below half the target (which fixes \(\delta\)), then take \(n\) large enough that \(\frac{2M}{4n\delta^2}\) falls below the other half.
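To see the uniform convergence numerically, here is a small sketch of mine using \(f(x)=\sin(\pi x)\) as an arbitrary smooth test function; the maximum error over a grid of points shrinks as \(n\) grows:

```python
from math import comb, sin, pi

def bernstein_approx(f, n, x):
    """Degree-n Bernstein approximation of f at x."""
    return sum(f(j / n) * comb(n, j) * x**j * (1 - x)**(n - j)
               for j in range(n + 1))

def max_error(f, n, grid=101):
    """Maximum pointwise error over an evenly spaced grid on [0, 1]."""
    xs = [k / (grid - 1) for k in range(grid)]
    return max(abs(bernstein_approx(f, n, x) - f(x)) for x in xs)

f = lambda x: sin(pi * x)
errors = [max_error(f, n) for n in (5, 20, 80)]
# The uniform error decreases monotonically in n for this test function.
assert errors[0] > errors[1] > errors[2]
```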

# Exploration

Using our intuition for probabilities, we can see why the Bernstein polynomial approximation is not ideal for functions that aren’t very smooth. The scheme approximates \(f\) at any given point \(x\) by taking the expected value of the following process: 1) flip \(n\) independent coins, each with probability \(x\) of heads; 2) calculate the proportion \(Z_n(x)/n\) of the \(n\) coins that landed heads; and 3) plug this proportion into \(f\). Although the proportion \(Z_n(x)/n\) is increasingly likely (as \(n\) increases) to be near its expected value \(x\), it can still take values distant from \(x\), so the approximation \(\hat{f}_n(x)\) at any point \(x\) always depends on the values of \(f\) at faraway points, albeit less and less as \(n\) increases. If \(f\) is “wiggly” or changes rapidly with small changes in \(x\), then points farther from \(x\) will tend to contribute bias to the approximation \(\hat{f}_n(x)\), and \(n\) may have to grow very large before this bias is eliminated.
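We can watch this happen numerically (my own sketch): at the kink of \(f(x)=|x-\tfrac12|\), the true value is \(0\), so the error of \(\hat f_n\) there is exactly the mean absolute deviation \(\mathbb E|Z_n/n - \tfrac12|\), which decays roughly like \(1/\sqrt n\) rather than the \(\sim 1/n\) rate enjoyed by smooth functions:

```python
from math import comb

def bernstein_approx(f, n, x):
    """Degree-n Bernstein approximation of f at x."""
    return sum(f(j / n) * comb(n, j) * x**j * (1 - x)**(n - j)
               for j in range(n + 1))

# Error at the kink x = 1/2 of f(x) = |x - 1/2|: since f(1/2) = 0, the
# error equals E|Z_n/n - 1/2| for Z_n ~ Binom(n, 1/2).
kink = lambda t: abs(t - 0.5)
errs = {n: bernstein_approx(kink, n, 0.5) for n in (10, 40, 160)}
# Quadrupling n only roughly halves the error (a ~1/sqrt(n) rate).
```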

However, we could use the same approximation scheme \(\hat{f}_n(x) = \mathbb E\left [ f(\frac{Z_n(x)}{n})\right]\) with \(Z_n(x)\) following distributions other than the binomial.
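For instance, taking \(Z_n(x) \sim \mathrm{Poisson}(nx)\) gives the Szász–Mirakyan operator, which plays the analogous role on \([0,\infty)\). A sketch of mine (truncating the infinite sum at a fixed number of terms, and building the Poisson pmf incrementally to avoid overflow):

```python
from math import exp

def szasz_approx(f, n, x, terms=200):
    """Szász–Mirakyan operator: E[f(Z/n)] with Z ~ Poisson(n x),
    truncating the infinite sum after `terms` terms."""
    lam = n * x
    p = exp(-lam)            # P(Z = 0)
    total = f(0.0) * p
    for k in range(1, terms):
        p *= lam / k         # P(Z = k) from P(Z = k - 1)
        total += f(k / n) * p
    return total

# Since E[Z/n] = x, approximating f(t) = t at x should return x itself.
assert abs(szasz_approx(lambda t: t, 20, 0.3) - 0.3) < 1e-9
```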