Statistical Learning Theory

  • Learning with a finite dictionary

Recall from the end of last lecture our setup: we are working with a finite dictionary
H = {h1, . . . , hM} of estimators, and we would like to understand the scaling of this problem
with respect to M and the sample size n. Given H, one idea is to simply try to minimize
the empirical risk based on the samples, and so we define the empirical risk minimizer, ĥ^erm:

ĥ^erm ∈ argmin_{h ∈ H} R̂_n(h).

In what follows, we will simply write ĥ instead of ĥ^erm when possible. Also recall the
definition of the oracle, h̄, which (somehow) minimizes the true risk and is defined by

h̄ ∈ argmin_{h ∈ H} R(h).
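
Because H is finite, ĥ^erm can be computed by brute force: evaluate R̂_n on each of the
M elements of H and keep a minimizer. Here is a minimal sketch in Python under
illustrative assumptions (the 0-1 loss, the threshold dictionary, and the synthetic data
are stand-ins of mine, not part of the lecture):

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative data (an assumption, not from the notes): n = 200 samples
    # of a 1-d feature with a noisy threshold label.
    n = 200
    X = rng.uniform(0.0, 1.0, size=n)
    y = (X > 0.35).astype(int)
    y[rng.random(n) < 0.1] ^= 1  # corrupt 10% of the labels

    # A finite dictionary H = {h_1, ..., h_M} of M = 21 threshold classifiers.
    thresholds = np.linspace(0.0, 1.0, 21)
    H = [lambda x, t=t: (x > t).astype(int) for t in thresholds]

    def empirical_risk(h, X, y):
        """R_n(h): average 0-1 loss of h over the n samples."""
        return np.mean(h(X) != y)

    # ERM by brute force: evaluate R_n on each of the M hypotheses,
    # keep a minimizer.
    risks = [empirical_risk(h, X, y) for h in H]
    h_hat = H[int(np.argmin(risks))]
    print(f"selected threshold: {thresholds[int(np.argmin(risks))]:.2f}")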

The following theorem shows that, although ĥ cannot hope to do better than h̄ in
general, the difference should not be too large as long as the sample size n is not too small
compared to M.
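
To preview the shape of such a guarantee: for a loss taking values in [0, 1], Hoeffding's
inequality combined with a union bound over the M elements of H yields, with probability
at least 1 − δ,

R(ĥ^erm) ≤ min_{h ∈ H} R(h) + √(2 log(2M/δ) / n),

so the excess risk scales like √(log M / n). These are the standard constants for this
argument; the theorem below may state them slightly differently.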

