Data Science Interview Challenge
Welcome to today's data science interview challenge! Today’s challenge is inspired by a talk by Professor Trevor Hastie from Stanford at the University of Bristol. Here it goes:
Question 1: Let’s say you work for a digital marketing company. You are given a dataset of five-minute internet sessions totaling 64 million rows, with 7 million features of session info (such as web page indicators, descriptors, and so on).
The goal is to predict whether any given five-minute session is being watched by a family with children. With this, advertisers can be careful about which ads they display.
How would you go about building a model to solve this?
Question 2: Can you explain the following graph from the training result? Are we overfitting here?
Here are some tips for readers' reference:
Question 1:
There are many ways to solve this problem. For instance, we can build a binary classification model with a target that looks like the following:
FamilyWithChildren = 1: This indicates that the five-minute internet session was watched by a family with children.
FamilyWithChildren = 0: This indicates that the five-minute internet session was not watched by a family with children.
We have 7 million features. We can reduce dimensionality using Principal Component Analysis (PCA), or simply remove sparse features. In this case, Trevor removed all features with fewer than 3 non-zero values, which reduced the number of features to about one million.
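The sparse-feature filter above is cheap to implement. Here is a minimal sketch, assuming the session data lives in a SciPy sparse matrix (the matrix sizes and density here are small placeholders, not the real 64-million-row dataset):

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-in for the session-by-feature matrix;
# the real data would be far larger.
X = sparse.random(1000, 500, density=0.005, format="csc", random_state=0)

# Count non-zero entries per column (cheap on a CSC matrix).
nnz_per_feature = np.diff(X.indptr)

# Keep only features with at least 3 non-zero values,
# mirroring the threshold mentioned in the talk.
keep = nnz_per_feature >= 3
X_reduced = X[:, keep]

print(X.shape, "->", X_reduced.shape)
```

At this density most columns have fewer than 3 non-zero entries, so the filter removes the bulk of the features, just as it did in the talk.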
Then split the data into training, validation, and test sets for model development. A model that can be used here is glmnet if you are using R, or scikit-learn or statsmodels if you are using Python.
Question 2:
Check how Professor Trevor Hastie explains this: