Complementary Proofs

We begin by proving a simple lemma which shows that the class of nearest neighbor classifiers is a subset of the class of $1$-Lipschitz functions. Let $\textbf{nn}_{F}^{S}\left(\cdot\right)$ be a function such that the sign of $\textbf{nn}_{F}^{S}\left(\mathbf{x}\right)$ is the label that the nearest neighbor rule assigns to $\mathbf{x}$, while the magnitude is the sample-margin, i.e. the distance between $\mathbf{x}$ and the decision boundary.

Lemma 1  

Let $F$ be a set of features and let $S$ be a labeled sample. Then for any $\mathbf{x}_{1},\mathbf{x}_{2}\in\mathcal{R}^{N}$:

\begin{eqnarray*}
\left\vert\textbf{nn}_{F}^{S}\left(\mathbf{x}_{1}\right)-\text...
...ft(\mathbf{x}_{1}\right)-F\left(\mathbf{x}_{2}\right)\right\Vert \end{eqnarray*}


where $F\left(\mathbf{x}\right)$ is the projection of $\mathbf{x}$ on the features in $F$.

Proof:Let $\mathbf{x}_{1},\mathbf{x}_{2}\in\mathcal{X}$. We split our argument into two cases. First assume that $\textbf{nn}_{F}^{S}\left(\mathbf{x}_{1}\right)$ and $\textbf{nn}_{F}^{S}\left(\mathbf{x}_{2}\right)$ have the same sign. Let $\mathbf{z}_{1},\mathbf{z}_{2}\in\mathcal{R}^{\left\vert F\right\vert}$ be the points on the decision boundary of the 1-NN rule which are closest to $F\left(\mathbf{x}_{1}\right)$ and $F\left(\mathbf{x}_{2}\right)$ respectively. From the definition of $\mathbf{z}_{1,2}$ it follows that $\textbf{nn}_{F}^{S}\left(\mathbf{x}_{1}\right)=\left\Vert F\left(\mathbf{x}_{1}\right)-\mathbf{z}_{1}\right\Vert $ and $\textbf{nn}_{F}^{S}\left(\mathbf{x}_{2}\right)=\left\Vert F\left(\mathbf{x}_{2}\right)-\mathbf{z}_{2}\right\Vert $ and thus
$\displaystyle \textbf{nn}_{F}^{S}\left(\mathbf{x}_{2}\right)$ $\textstyle \leq$ $\displaystyle \left\Vert F\left(\mathbf{x}_{2}\right)-\mathbf{z}_{1}\right\Vert$  
  $\textstyle \leq$ $\displaystyle \left\Vert F\left(\mathbf{x}_{2}\right)-F\left(\mathbf{x}_{1}\right)\right\Vert +\left\Vert F\left(\mathbf{x}_{1}\right)-\mathbf{z}_{1}\right\Vert$  
  $\textstyle =$ $\displaystyle \left\Vert F\left(\mathbf{x}_{2}\right)-F\left(\mathbf{x}_{1}\right)\right\Vert +\textbf{nn}_{F}^{S}\left(\mathbf{x}_{1}\right)$ (4)

By repeating the above argument while reversing the roles of $\mathbf{x}_{1}$and $\mathbf{x}_{2}$ we get
$\displaystyle \textbf{nn}_{F}^{S}\left(\mathbf{x}_{1}\right)$ $\textstyle \leq$ $\displaystyle \left\Vert F\left(\mathbf{x}_{2}\right)-F\left(\mathbf{x}_{1}\right)\right\Vert +\textbf{nn}_{F}^{S}\left(\mathbf{x}_{2}\right)$ (5)

Combining (4) and (5) we obtain

\begin{eqnarray*}
\left\vert\textbf{nn}_{F}^{S}\left(\mathbf{x}_{2}\right)-\text...
...ft(\mathbf{x}_{2}\right)-F\left(\mathbf{x}_{1}\right)\right\Vert \end{eqnarray*}


The second case is when $\textbf{nn}_{F}^{S}\left(\mathbf{x}_{1}\right)$ and $\textbf{nn}_{F}^{S}\left(\mathbf{x}_{2}\right)$ have alternating signs. Since $\textbf{nn}_{F}^{S}\left(\cdot\right)$ is continuous, there is a point $\mathbf{z}$ on the line connecting $F\left(\mathbf{x}{1}\right)$ and $F\left(\mathbf{x}_{2}\right)$ such that $\mathbf{z}$ is on the decision boundary. Hence,

\begin{eqnarray*}
\left\vert\textbf{nn}_{F}^{S}\left(\mathbf{x}_{1}\right)\right...
... & \left\Vert F\left(\mathbf{x}_{2}\right)-\mathbf{z}\right\Vert \end{eqnarray*}


and so we obtain

\begin{eqnarray*}
\left\vert\textbf{nn}_{F}^{S}\left(\mathbf{x}_{2}\right)-\text...
...ft(\mathbf{x}_{2}\right)-F\left(\mathbf{x}_{1}\right)\right\Vert \end{eqnarray*}


\framebox[7pt]{\rule{0pt}{7pt}}The main tool for proving theorem 1 is the following:

Theorem 2   [Bar98] Let $\mathcal{H}$ be a class of real valued functions. Let $S$ be a sample of size $m$ generated i.i.d. from a distribution $\mathcal{D}$ over $\mathcal{X}\times\left\{ \pm1\right\} $ then with probability $1-\delta$ over the choices of $S$, every $h\in\mathcal{H}$ and every $\gamma\in\left(0,1\right]$ let $d=\textrm{fat}_{\mathcal{H}}\left(\gamma/32\right)$:

\begin{eqnarray*}
\textrm{er}_{\mathcal{D}}\left(h\right)&\leq&\hat{\textrm{er}}...
...\left(578m\right)+\ln\left(\frac{8}{\gamma\delta}\right)\right)}
\end{eqnarray*}


We now turn to prove theorem 1:

Proof(of theorem 1): Let $F$ be a set of features such that $\left\vert F\right\vert=n$ and let $\gamma>0$. In order to use theorem 2 we need to compute the fat-shattering dimension of the class of nearest neighbor classification rules which use the set of features $F$. As we saw in lemma 1 this class is a subset of the class of $1$-Lipschitz functions on these features. Hence we can bound the fat-shattering dimension of the class of NN rules by the dimension of Lipschitz functions.

Since $\mathcal{D}$ is supported in a ball of radius $R$ and $\left\Vert \mathbf{x}\right\Vert \geq\left\Vert F\left(\mathbf{x}\right)\right\Vert $, we need to calculate the fat-shattering dimension of Lipschitz functions acting on points in $\mathcal{R}^{n}$ with norm bounded by $R$. The fat$_{\gamma}$-dimension of the 1-NN functions on the features $F$ is thus bounded by the largest $\gamma$ packing of a ball in $\mathcal{R}^{n}$ with radius $R$, which in turn is bounded by $\left(2R/\gamma\right)^{\left\vert F\right\vert}$.

Therefore, for a fixed set of features $F$ we can apply to theorem 2 and use the bound on the fat-shattering dimension just calculated. Let $\delta_{F}>0$ and we have according to theorem 2 with probability $1-\delta_{F}$ over sample $S$ of size $m$ that for any $\gamma\in\left(0,1\right]$

$\displaystyle \textrm{er}_{\mathcal{D}}\left(\textrm{nearest-neighbor}\right)$ $\textstyle \leq$ $\displaystyle \hat{\textrm{er}}_{S}^{\gamma}\left(\textrm{nearest-neighbor}\right)+$ (6)
    $\displaystyle \sqrt{\frac{2}{m}\left(d\ln\left(\frac{34em}{d}\right)\log_{2}\left(578m\right)+\ln\left(\frac{8}{\gamma\delta_{F}}\right)\right)}$  

for $d=\left(64R/\gamma\right)^{\left\vert F\right\vert}$. By choosing \(
\delta_{F}=\delta/\left(N{N \choose \left\vert F\right\vert}\right)\) we have that $\sum_{F\subseteq\left[1\ldots N\right]}\delta_{F}=\delta$ and so we can apply the union bound to (6) and obtain the stated result. \framebox[7pt]{\rule{0pt}{7pt}}

Ran Gilad-Bachrach 2004-12-07