The MLnotes Newsletter

Data Science Interview Challenge

Angelina Yang
Aug 31, 2023

Welcome to today's data science interview challenge! Today’s challenge is inspired by a talk that Professor Trevor Hastie of Stanford gave at the University of Bristol. Here it goes:

Question 1: Let’s say you work for a digital marketing company. You are given a dataset of five-minute internet sessions totaling 64 million (64MM) rows. Each row has 7 million features of session info (such as web page indicators, descriptors and so on).

Our goal is to predict whether a given five-minute session was watched by a family with children or a family without children. With this prediction, advertisers can be careful about which ads they display.

How would you go about building a model to solve this?

Question 2: Can you explain the following graph from the training result? Are we overfitting here?


Here are some tips for readers' reference:

Question 1:

There are many ways you can solve this problem. For instance, we can use a binary classification model whose target looks like the following:

  1. FamilyWithChildren = 1: This indicates that the five-minute internet session was watched by a family with children.

  2. FamilyWithChildren = 0: This indicates that the five-minute internet session was not watched by a family with children.

We have 7 million features. We can reduce the dimensionality with Principal Component Analysis (PCA), or simply remove sparse features. In this case, Trevor removed all features with fewer than 3 non-zero values, which reduced the number of features to one million.
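The sparse-feature filter described above can be sketched in a few lines. This is a minimal illustration, not Trevor's actual code: it assumes the data is stored as a SciPy sparse matrix, and the helper name `drop_rare_features` and the toy matrix are made up for the example.

```python
import numpy as np
from scipy import sparse


def drop_rare_features(X, min_nonzero=3):
    """Keep only the columns that have at least `min_nonzero` non-zero entries.

    `drop_rare_features` is a hypothetical helper written for this example.
    """
    # Work on a CSC copy so per-column counts are cheap and the input is untouched.
    X = sparse.csc_matrix(X, copy=True)
    X.eliminate_zeros()  # make sure explicitly stored zeros are not counted
    nonzero_per_column = np.diff(X.indptr)
    keep = nonzero_per_column >= min_nonzero
    return X[:, np.flatnonzero(keep)], keep


# Toy stand-in for the session-by-feature matrix (4 sessions, 4 features).
X_toy = sparse.csr_matrix(np.array([
    [1, 0, 2, 0],
    [3, 0, 0, 0],
    [4, 5, 0, 0],
    [6, 0, 7, 0],
]))

X_reduced, kept_mask = drop_rare_features(X_toy, min_nonzero=3)
print(X_reduced.shape)     # columns with fewer than 3 non-zero values are gone
print(kept_mask.tolist())  # which original columns survived
```

Counting non-zeros from the CSC `indptr` array avoids ever building a dense matrix, which matters when there are millions of columns.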

Then split the data into training, validation and test sets for model development. Suitable tools include glmnet if you are using R, or scikit-learn or statsmodels if you are using Python.
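Here is one way the split-and-fit step could look in Python with scikit-learn. It is only a sketch under several assumptions: the data is synthetic (the real session data is not available here), the 60/20/20 split and the grid of `C` values are arbitrary choices, and the L1-penalized logistic regression is used as a rough analogue of a glmnet lasso fit rather than as the method from the talk.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the sessions: sparse binary indicators (assumption).
rng = np.random.default_rng(0)
n_sessions, n_features = 2000, 500
X = sparse.random(n_sessions, n_features, density=0.02, format="csr", random_state=0)
X.data[:] = 1.0  # turn the random values into 0/1 indicators

# Only the first ten features carry signal in this toy setup.
true_coef = np.zeros(n_features)
true_coef[:10] = 3.0
logits = X @ true_coef - 0.3
y = (rng.random(n_sessions) < 1.0 / (1.0 + np.exp(-logits))).astype(int)  # FamilyWithChildren

# 60% training, 20% validation, 20% test.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=0, stratify=y_tmp)

# Choose the regularization strength on the validation set.
candidate_C = [0.01, 0.1, 1.0, 10.0]
best_C, best_val_acc = None, -1.0
for C in candidate_C:
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    model.fit(X_train, y_train)
    val_acc = model.score(X_val, y_val)
    if val_acc > best_val_acc:
        best_C, best_val_acc = C, val_acc

# Refit with the chosen C and report accuracy on the held-out test set.
final_model = LogisticRegression(penalty="l1", solver="liblinear", C=best_C)
final_model.fit(X_train, y_train)
test_acc = final_model.score(X_test, y_test)
n_selected = int(np.count_nonzero(final_model.coef_))
print(f"best C={best_C}, test accuracy={test_acc:.3f}, non-zero coefficients={n_selected}")
```

The validation set is used only to pick `C`, so the test accuracy stays an honest estimate. With an L1 penalty many coefficients end up exactly zero, which is convenient when the feature count is very large.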

Question 2:

Check how Professor Trevor Hastie explains this:
