The MLnotes Newsletter

Data Science Interview Challenge

Angelina Yang
Oct 05, 2023
Welcome to today's data science interview challenge! Today’s challenge is inspired by the Stanford CS25 seminar on Transformers featuring Andrej Karpathy!

Here you go:

Question 1: What’s stopping current language models from generating a novel?

Question 2: How do self-attention and cross-attention differ?

Image: "This Book Is Completely AI Generated" by Tony Zhu and ChatGPT, OpenAI (Amazon book listing)
Source

Here are some tips for readers' reference:

Question 1:

The token length limitation of current large language models (LLMs) such as GPT-3 is one of the main constraints that prevents them from generating complete novels or other very long texts. Token length refers to the number of tokens (words, subwords, or characters) in a text sequence, and each model has a predefined maximum it can process at once, a fundamental limit imposed by computational and memory constraints. The context window for GPT-3-era models was typically a few thousand tokens (roughly 2,048–4,096). As of August 2023, GPT-4 offers a 32K-token context window, accommodating inputs, files, and follow-ups four times longer than before.
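To make the constraint concrete, here is a minimal sketch of checking whether a piece of text fits in a context window. It assumes the tiktoken tokenizer library, and the 4,096-token limit is just an illustrative value, not any particular model's actual window:

```python
# Minimal sketch: count tokens and compare against an illustrative context limit.
# Assumes the `tiktoken` library; the 4,096-token limit is an example value only.
import tiktoken

CONTEXT_LIMIT = 4096  # hypothetical maximum number of tokens the model can attend to

def fits_in_context(text: str, limit: int = CONTEXT_LIMIT) -> bool:
    """Return True if `text` tokenizes to at most `limit` tokens."""
    enc = tiktoken.get_encoding("cl100k_base")  # a common BPE encoding
    n_tokens = len(enc.encode(text))
    print(f"{n_tokens:,} tokens (limit {limit:,})")
    return n_tokens <= limit

fits_in_context("Call me Ishmael. " * 10)      # a short snippet easily fits
fits_in_context("Call me Ishmael. " * 50_000)  # novel-length text blows past the limit
```

A typical novel runs on the order of 100,000 words, which is well over 100K tokens, so it simply cannot fit in a single 4K, or even 32K, context window.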

Here's why token length limitation is a significant factor:

  1. Memory and Computation: LLMs process text in chunks or tokens. Longer texts require more memory and computational resources to generate, making it challenging to handle very long sequences within the model's constraints.

  2. Coherence and Context: Longer texts often require maintaining context and coherence over extended passages. Current models may struggle to maintain consistency and coherence over a large number of tokens.

  3. Resource Consumption: Very long texts can consume a substantial number of tokens, potentially exceeding the model's token limit. This can make it impractical or computationally expensive to generate and store such texts.

  4. Training Data: Language models are trained on large corpora of text data, which may not include extremely long texts like novels. Models are more effective at generating content that aligns with the length and structure of the data they were trained on.

To address this limitation and generate longer texts, developers often need to chunk or split the text into smaller segments that fit within the model's token limit. However, doing this can introduce challenges related to maintaining context and coherence across segments.
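As a rough illustration of that chunking workaround, here is a minimal sketch (again assuming tiktoken; the chunk size and overlap values are arbitrary) that splits a long text into overlapping segments that each fit within a token budget:

```python
# Minimal sketch: split a long text into overlapping chunks that each fit a
# token budget. Assumes `tiktoken`; max_tokens and overlap are illustrative.
import tiktoken

def chunk_text(text: str, max_tokens: int = 4096, overlap: int = 256) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = start + max_tokens
        chunks.append(enc.decode(tokens[start:end]))
        if end >= len(tokens):
            break
        start = end - overlap  # re-include the tail of the previous chunk
    return chunks
```

The overlap carries a little context from one segment into the next, but the model still never sees the whole text at once, which is exactly where the coherence problems across segments come from.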

Let’s see how the lecturer explains it:

Question 2:

Andrej gave some really intuitive explanations of Transformers that are worth listening to.

Let’s check out how Andrej explains it:
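The full discussion is in the lecture, but as a quick structural reminder: in self-attention the queries, keys, and values all come from the same sequence, while in cross-attention the queries come from one sequence (e.g., the decoder) and the keys and values come from another (e.g., the encoder output). Here is a minimal sketch, assuming PyTorch's nn.MultiheadAttention with made-up shapes, not how the lecture presents it:

```python
# Minimal sketch contrasting self-attention and cross-attention using PyTorch's
# nn.MultiheadAttention. Shapes and module choice are illustrative assumptions.
import torch
import torch.nn as nn

d_model, n_heads = 64, 4
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads, batch_first=True)

decoder_states = torch.randn(1, 10, d_model)  # e.g., tokens being generated
encoder_states = torch.randn(1, 20, d_model)  # e.g., an encoded source sequence

# Self-attention: queries, keys, and values all come from the same sequence.
self_out, _ = attn(decoder_states, decoder_states, decoder_states)

# Cross-attention: queries come from one sequence (the decoder), while keys
# and values come from another (the encoder output).
cross_out, _ = attn(decoder_states, encoder_states, encoder_states)

print(self_out.shape, cross_out.shape)  # both (1, 10, 64): one output per query
```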
