First, note that the smallest L2-norm vector that can fit the training data for the core model is $$\theta^{\text{-s}}=[2,0,0]$$. On the other hand, in the presence of the spurious feature, the full model can fit the training data perfectly with a smaller norm by assigning weight $$1$$ to the feature $$s$$ ($$\|\theta^{\text{-s}}\|_2^2 = 4$$ while $$\|\theta^{\text{+s}}\|_2^2 + w^2 = 2 < 4$$).

Generally, in the overparameterized regime, since the number of training examples is less than the number of features, there are some directions of data variation that are not observed in the training data. In this example, we do not observe any information about the second and third features. However, the non-zero weight on the spurious feature leads to a different assumption about the unseen directions. In particular, the full model does not assign weight $$0$$ to the unseen directions. Indeed, by substituting $$s$$ with $${\beta^\star}^\top z$$, we can view the full model as not using $$s$$ but implicitly assigning weight $$\beta^\star_2=2$$ to the second feature and $$\beta^\star_3=-2$$ to the third feature (directions unseen at training time).

By analogy, removing $$s$$ decreases the error on a test distribution with large deviations from zero in the second feature, while removing $$s$$ increases the error on a test distribution with large deviations from zero in the third feature.
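The norm comparison above can be checked numerically. The sketch below assumes a concrete setup consistent with the numbers in the text: a single training point $$z = (1, 0, 0)$$ with label $$y = 2$$ (so the second and third features are unseen), and $$\beta^\star = (1, 2, -2)$$, which reproduces $$w = 1$$, $$\beta^\star_2 = 2$$, and $$\beta^\star_3 = -2$$. These specific values are an illustrative assumption, not given in the text.

```python
import numpy as np

# Assumed setup (hypothetical, chosen to match the stated numbers):
# one training point z = (1, 0, 0) with label y = 2, and a spurious
# feature s = beta_star . z with beta_star = (1, 2, -2).
beta_star = np.array([1.0, 2.0, -2.0])
Z = np.array([[1.0, 0.0, 0.0]])   # training inputs; features 2 and 3 are unseen
y = np.array([2.0])
s = Z @ beta_star                  # spurious feature at training time: s = 1

# Core model: minimum-L2-norm interpolator using z only.
theta_core = np.linalg.pinv(Z) @ y            # [2, 0, 0], squared norm 4

# Full model: minimum-norm interpolator using (z, s).
X_full = np.hstack([Z, s[:, None]])
theta_full = np.linalg.pinv(X_full) @ y       # [1, 0, 0, 1], squared norm 2
theta_z, w = theta_full[:3], theta_full[3]

print(theta_core, np.sum(theta_core ** 2))    # -> [2. 0. 0.] 4.0
print(theta_full, np.sum(theta_full ** 2))    # -> [1. 0. 0. 1.] 2.0

# Substituting s = beta_star . z exposes the implicit weights the full
# model places on the unseen second and third features:
implicit = theta_z + w * beta_star
print(implicit)                               # -> [ 2.  2. -2.]
```

Both models fit the training point exactly, but the full model achieves the smaller squared norm ($$2 < 4$$) by routing weight through $$s$$, at the cost of implicit non-zero weights on the unseen directions.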