The MLnotes Newsletter

Data Science Interview Challenge

Angelina Yang
Aug 10, 2023

Welcome to today's data science interview challenge! Here it goes:

Question 1: Should we use a proprietary or an open-source model when building an LLM application for production?

Question 2: How do we assess the performance of our LLMs / LLM applications?

[Image: chatbot UI]


Here are some tips for readers' reference:

Question 1:

Hint hint hint…. 🤓

The answer is mostly covered in this post:

How to Choose Base Model for Your LLM Application 🧐?
Angelina Yang · August 7, 2023

The field of Large Language Models (LLMs) is flourishing with numerous models, continuously evolving day by day. If you want to develop an LLM application for production, which model on the market should you choose? Should we prioritize the best-performing model? GPT-4 undoubtedly stands out as a top contender.

Read full story

Question 2:

Benchmark tasks and metrics are commonly used for this purpose. Some example metrics are as follows (a toy computation sketch follows the list):

Quantitative Metrics:

  • Perplexity: Perplexity measures how well a language model predicts a sample of text. Lower perplexity indicates better performance.

  • BLEU Score: Commonly used for machine translation, BLEU measures the similarity between model-generated text and human reference text.

  • ROUGE Score: ROUGE evaluates text summarization and measures overlap between model-generated and reference summaries.

  • F1 Score: For specific tasks like sentiment analysis or named entity recognition, the F1 score combines the model's precision and recall into a single measure.

  • Accuracy and Precision: For classification tasks, accuracy and precision metrics indicate how well the model classifies input data.
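
As a rough illustration, here is a minimal sketch of how a few of these metrics could be computed on toy data. It assumes the nltk, rouge-score, and scikit-learn packages are available; all inputs and numbers are made up for the example, not taken from any real evaluation.

```python
# Minimal sketch: computing a few of the metrics above on toy data.
# Assumes nltk, rouge-score, and scikit-learn are installed; all inputs are illustrative.
import math

from nltk.translate.bleu_score import sentence_bleu   # BLEU for generated vs. reference text
from rouge_score import rouge_scorer                   # ROUGE overlap for summarization
from sklearn.metrics import accuracy_score, f1_score  # classification-style metrics

# Perplexity from per-token log-probabilities returned by the model (hypothetical values).
token_logprobs = [-0.8, -1.2, -0.3, -2.1, -0.5]
perplexity = math.exp(-sum(token_logprobs) / len(token_logprobs))

# BLEU: compare a tokenized candidate against one or more tokenized references.
# Only 1- and 2-gram precision here, since the toy sentences are very short.
reference = [["the", "cat", "sat", "on", "the", "mat"]]
candidate = ["the", "cat", "is", "on", "the", "mat"]
bleu = sentence_bleu(reference, candidate, weights=(0.5, 0.5))

# ROUGE: compare a model-generated summary against a reference summary.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score("the cat sat on the mat", "the cat is on the mat")

# Accuracy / F1 for a classification-style task (e.g., binary sentiment labels).
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(f"perplexity={perplexity:.2f}  bleu={bleu:.2f}  accuracy={acc:.2f}  f1={f1:.2f}")
print(rouge)
```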

However, those may not apply to your specific LLM application. The general guidance is:

If you know what the right answer is, you can define metrics like those above for your LLM;

If you don't know what the right answer is (for example, when the correct answer is subjective), then the main technique in the toolkit is to define a prompt that asks another model whether a given response is a good answer to the question or not.
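
Here is a minimal sketch of that "ask another model to judge" pattern. The `call_llm` helper is hypothetical and stands in for whichever LLM client you use; the prompt wording and the GOOD/BAD protocol are just one illustrative choice, not a fixed recipe.

```python
# Minimal sketch of using a second model to judge a subjective answer.
# `call_llm` is a hypothetical placeholder for your actual LLM client call.

JUDGE_PROMPT = """You are grading an answer produced by an AI assistant.

Question: {question}
Answer: {answer}

Is this a good, correct, and helpful answer to the question?
Reply with a single word: GOOD or BAD."""


def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to your judge model and return its text reply."""
    raise NotImplementedError("wire this up to the LLM provider of your choice")


def judge_answer(question: str, answer: str) -> bool:
    """Return True if the judge model considers the answer acceptable."""
    verdict = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    return verdict.strip().upper().startswith("GOOD")


# Usage idea: run the judge over an evaluation set and report the pass rate, e.g.
# pass_rate = sum(judge_answer(q, a) for q, a in eval_pairs) / len(eval_pairs)
```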

A quick visual to explain what this means:

Keep reading with a 7-day free trial

Subscribe to The MLnotes Newsletter to keep reading this post and get 7 days of free access to the full post archives.
