The MLnotes Newsletter

Data Science Interview Challenge

Angelina Yang
Aug 10, 2023

Welcome to today's data science interview challenge! Here it goes:

Question 1: Should we use proprietary or open-source models when building an LLM application for production?

Question 2: How do we assess the performance of our LLMs / LLM applications?

[Image: chatbot UI]


Here are some tips for readers' reference:

Question 1:

Hint hint hint…. 🤓

The answer is mostly covered in this post:

How to Choose Base Model for Your LLM Application 🧐?

Angelina Yang · August 7, 2023

The field of Large Language Models (LLMs) is flourishing, with numerous models evolving day by day. If you want to develop an LLM application for production, which model should you choose? Should you prioritize the best-performing model on the market? GPT-4 undoubtedly stands out as a top contender.


Question 2:

Benchmark tasks and metrics are the well-known tools for this purpose. Some example metrics are as follows (a quick code sketch for a few of them appears after the list):

Quantitative Metrics:

  • Perplexity: Perplexity measures how well a language model predicts a sample of text. Lower perplexity indicates better performance.

  • BLEU Score: Commonly used for machine translation, BLEU measures the similarity between model-generated text and human reference text.

  • ROUGE Score: ROUGE evaluates text summarization and measures overlap between model-generated and reference summaries.

  • F1 Score: For specific tasks like sentiment analysis or named entity recognition, F1 score assesses the model's precision and recall.

  • Accuracy and Precision: For classification tasks, accuracy and precision metrics indicate how well the model classifies input data.

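To make these more concrete, here is a minimal sketch of computing a few of them in Python. It assumes the `nltk`, `rouge-score`, and `scikit-learn` packages are installed, and the example inputs are made up.

```python
# Minimal sketch of a few common LLM evaluation metrics (example data is made up).
import math
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from sklearn.metrics import accuracy_score, f1_score, precision_score

# Perplexity: exp of the average negative log-likelihood per token.
token_log_probs = [-0.3, -1.2, -0.7, -0.1]   # per-token log-probs from your model
perplexity = math.exp(-sum(token_log_probs) / len(token_log_probs))

# BLEU: n-gram overlap between a candidate and a reference text.
reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()
bleu = sentence_bleu([reference], candidate,
                     smoothing_function=SmoothingFunction().method1)

# ROUGE: overlap between a generated summary and a reference summary.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score("the cat sat on the mat", "the cat is on the mat")

# Accuracy / precision / F1 for a classification-style task (e.g., sentiment).
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]

print(f"perplexity={perplexity:.2f}  bleu={bleu:.2f}  rouge1={rouge['rouge1'].fmeasure:.2f}")
print(f"accuracy={accuracy_score(y_true, y_pred):.2f}  "
      f"precision={precision_score(y_true, y_pred):.2f}  f1={f1_score(y_true, y_pred):.2f}")
```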
However, those may not apply to your specific LLM application. The general guidance is:

If you know what the right answer is, you can define metrics like some of the above for your LLM;

If you don't know what the right answer is (for example, because the correct answer is subjective), then the main technique in the toolkit is to define a prompt that asks another model whether a given response is a good answer to the question or not.
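To make that concrete, here's a minimal LLM-as-a-judge sketch using the OpenAI Python client. The model name, rubric wording, and 1-5 scale are illustrative assumptions, not a prescribed setup.

```python
# Minimal LLM-as-a-judge sketch (prompt wording, scale, and model are assumptions).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading an answer produced by another model.

Question: {question}
Candidate answer: {answer}

Rate how good the candidate answer is on a 1-5 scale (5 = excellent),
then briefly justify the score. Reply as: "score: <1-5> | reason: <text>"."""

def judge_answer(question: str, answer: str, model: str = "gpt-4o-mini") -> str:
    """Ask a judge model whether the candidate answer is good for the question."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,  # keep the judgment as deterministic as possible
    )
    return response.choices[0].message.content

# Example usage:
# print(judge_answer("What is RAG?", "RAG retrieves documents and feeds them to the LLM as context."))
```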

A quick visual to explain what this means:
