Data Science Interview Challenge
Welcome to today's data science interview challenge! Here it goes:
Question 1: Should we use proprietary or open-source models when building an LLM application for production?
Question 2: How do we assess the performance of our LLMs / LLM applications?
Here are some tips for readers' reference:
Question 1:
Hint hint hint…. 🤓
The answer is mostly covered in this post:
Question 2:
Benchmark tasks and metrics are well-known for this purpose. Some example metrics are as follows:
Quantitative Metrics:
Perplexity: Perplexity measures how well a language model predicts a sample of text. Lower perplexity indicates better performance.
BLEU Score: Commonly used for machine translation, BLEU measures the similarity between model-generated text and human reference text.
ROUGE Score: ROUGE evaluates text summarization and measures overlap between model-generated and reference summaries.
F1 Score: For specific tasks like sentiment analysis or named entity recognition, F1 score assesses the model's precision and recall.
Accuracy and Precision: For classification tasks, accuracy and precision metrics indicate how well the model classifies input data.
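To make two of these metrics concrete, here is a minimal pure-Python sketch: perplexity computed from per-token log-probabilities (the exponential of the average negative log-probability), and F1 built from precision and recall for a binary classification task. The function names and inputs are illustrative, not from any particular library.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the average negative log-probability per token.
    Lower is better: the model was less 'surprised' by the text."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def f1_score(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# If every token gets probability 0.5, perplexity is exactly 2:
print(perplexity([math.log(0.5)] * 4))  # → 2.0
print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))  # → 0.5
```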
However, these may not apply to your specific LLM application. The general guidance is:
If you know what the right answer is, you can define such metrics for the LLM (like some of the above);
If you don't know what the right answer is (e.g., the correct answer is subjective), then the main technique we have in the toolkit is to write a prompt that asks another model to judge whether a given answer is good for the question.
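A minimal sketch of this "LLM-as-judge" technique: build a grading prompt around the question and the candidate answer, send it to a second model, and parse the verdict. Here `call_llm` is a hypothetical stand-in for whatever client your application actually uses (OpenAI, Anthropic, a local model, etc.), and the template wording is just one plausible choice.

```python
# Template asking a second model to grade an answer. Constraining the reply
# format (GOOD/BAD first) makes the verdict easy to parse programmatically.
JUDGE_TEMPLATE = """You are grading an answer for quality.

Question: {question}
Answer: {answer}

Is this a good answer to the question? Reply with a single word,
GOOD or BAD, followed by a one-sentence justification."""

def build_judge_prompt(question: str, answer: str) -> str:
    return JUDGE_TEMPLATE.format(question=question, answer=answer)

def judge(question: str, answer: str, call_llm) -> bool:
    """Return True if the judge model rates the answer as good.
    `call_llm` is any callable that maps a prompt string to a reply string."""
    reply = call_llm(build_judge_prompt(question, answer))
    return reply.strip().upper().startswith("GOOD")

# Usage with a stub in place of a real model client:
fake_llm = lambda prompt: "GOOD - the answer directly addresses the question."
print(judge("What is 2+2?", "4", fake_llm))  # → True
```

In practice you would also want to randomize answer order when comparing candidates and average over multiple judge calls, since judge models have their own biases.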
A quick visual to explain what this means: