Data Science Interview Challenge
Welcome to today's data science interview challenge! Today's questions are inspired by the 2022 Hollander Distinguished Lecture, featuring Stanford Professor Trevor Hastie. Here they are:
Question 1: Can you describe how cross-validation works?
Question 2: What error is cross-validation really estimating?

Here are some tips for readers' reference:
Question 1:
In K-fold cross-validation, a K-fold partition of the data is created: the original sample is randomly partitioned into K equal-sized (or nearly equal-sized) subsamples. Of the K subsamples, a single subsample is retained as the test set for estimating the prediction error, and the remaining K-1 subsamples are used as training data. The process is then repeated K times (the folds), with each of the K subsamples used exactly once as the test set. The K error estimates from the folds are then averaged to produce a single estimate. The advantage of this method is that all observations are used for both training and validation, and each observation is used for validation exactly once.
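To make the mechanics concrete, here is a minimal sketch of K-fold cross-validation in Python. The dataset (scikit-learn's diabetes data), the ridge model, and K = 5 are illustrative assumptions, not part of the lecture.

```python
# A minimal sketch of K-fold cross-validation
# (dataset, model, and K = 5 are illustrative assumptions).
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

X, y = load_diabetes(return_X_y=True)

kf = KFold(n_splits=5, shuffle=True, random_state=0)  # K-fold partition of the sample
fold_errors = []

for train_idx, test_idx in kf.split(X):
    # K-1 folds for training, the remaining fold as the test set
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    preds = model.predict(X[test_idx])
    fold_errors.append(mean_squared_error(y[test_idx], preds))

# Average the K fold errors to produce a single estimate of prediction error
cv_estimate = np.mean(fold_errors)
print(f"5-fold CV estimate of MSE: {cv_estimate:.2f}")
```

This explicit loop is essentially what scikit-learn's `cross_val_score` helper performs for you: split, fit on K-1 folds, score on the held-out fold, and collect the K scores.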
A common choice for K is 10. With a large number of folds (large K), the bias of the estimator of the true error rate is small, but the variance will be large; the computational time can also become substantial, depending on the complexity of the models under consideration. With a small number of folds, the variance of the estimator will be small, but the bias will be large: because each model is trained on a smaller fraction of the data, the estimate may be larger than the true error rate.
In practice, the choice of the number of folds depends on the size of the data set. For large data sets, a smaller K (e.g., 3) may yield quite accurate results. For sparse data sets, leave-one-out cross-validation (LOO or LOOCV) may be needed.
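As a rough illustration of how the choice of K plays out, the sketch below compares 3-fold, 10-fold, and leave-one-out estimates on the same data. The ridge model and the diabetes dataset are again assumptions made purely for illustration.

```python
# A sketch comparing different choices of K, including leave-one-out,
# on one dataset (model and dataset are illustrative assumptions).
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = load_diabetes(return_X_y=True)
model = Ridge(alpha=1.0)

for cv, label in [(KFold(3, shuffle=True, random_state=0), "K=3"),
                  (KFold(10, shuffle=True, random_state=0), "K=10"),
                  (LeaveOneOut(), "LOOCV")]:
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    print(f"{label:>6}: estimated MSE = {-scores.mean():.2f}")
```

Note that LOOCV refits the model once per observation, which is why it is usually reserved for small or sparse data sets.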
Let’s hear how Professor Hastie explains this in the lecture. Remember to tune in a little longer to hear the full story!
Greetings, dear readers! I hope this message finds you well. Before we move on to the next question, I want to offer my sincerest apologies for the unexpected delay in posting this past weekend. As some of you may know, I've been traveling internationally for the past several weeks. Amidst a packed schedule and an unexpected illness, I regrettably couldn't find a moment to schedule the usual Sunday post. Nevertheless, I am back and eager to continue this learning experience with all of you. Thank you for your understanding! 🤓
Question 2:
This is a tough question. Cross-validation is a widely used technique for estimating prediction error, but its behavior is complex and not fully understood.
Ideally, one would like to think that cross-validation estimates the prediction error for the model at hand, fit to the training data.
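One way to make the question concrete is a small simulation: compute the cross-validation estimate on a single training sample, then approximate the true prediction error of the model fit to that same sample using a very large independent test set from the same distribution. The sketch below does exactly that; the linear data-generating process, the sample sizes, and the use of 10-fold CV are assumptions chosen purely for illustration.

```python
# A minimal simulation sketch: compare the CV estimate computed on one
# training sample with the "true" error of the model fit to that same
# sample, approximated on a large independent test set.
# (The data-generating process and sample sizes are illustrative assumptions.)
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def simulate(n, p=10, sigma=1.0):
    # Simple linear model y = X @ beta + noise, used only for illustration
    X = rng.normal(size=(n, p))
    beta = np.ones(p)
    y = X @ beta + sigma * rng.normal(size=n)
    return X, y

# One training sample of size 50
X_train, y_train = simulate(50)

# 10-fold CV estimate of prediction error on this training sample
cv_mse = -cross_val_score(LinearRegression(), X_train, y_train, cv=10,
                          scoring="neg_mean_squared_error").mean()

# "True" prediction error of the model fit to this particular sample,
# approximated with a very large test set from the same distribution
model = LinearRegression().fit(X_train, y_train)
X_test, y_test = simulate(100_000)
true_mse = mean_squared_error(y_test, model.predict(X_test))

print(f"CV estimate: {cv_mse:.3f}   true error of this fit: {true_mse:.3f}")
```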
In Trevor’s 2021 paper with Stephen Bates and Robert Tibshirani, “Cross-Validation: What Does It Estimate and How Well Does It Estimate It?”, this intuition is shown to be incorrect.