Lesson 01 / 8·8 minFree

What Is a Language Model

From autocomplete to reasoning — how predicting the next token gives rise to intelligence

Written by the RadarTrek editorial team · June 2026

A large language model is, at its core, a very sophisticated next-word predictor. Feed it text; it outputs the most likely continuation. That sounds underwhelming until you realise that to predict the next word well across all of human writing, a model must implicitly learn grammar, facts, reasoning patterns, coding conventions, emotional tone, and cultural context. The intelligence is a side effect of prediction at scale.

The transformer architecture — what changed everything

2017: attention is all you need — Google researchers published a paper with that title. The "transformer" architecture it introduced replaced the recurrent networks that had dominated NLP. Transformers process all tokens in parallel and use "attention" to weigh which tokens matter most for predicting each next token.
Why it scaled — Earlier architectures hit walls — they could not be made bigger without breaking. Transformers could. Add more parameters, add more data, add more compute — and performance kept improving. This was not obvious before 2017. It became the central insight that launched the LLM era.
GPT-1 to GPT-4, BERT to Claude 3 — Every major LLM today is a transformer variant. The differences are in training data, alignment techniques, context length, and fine-tuning — not the fundamental architecture.

What "large" actually means

Parameters — A model's parameters are the learned weights — numbers that get adjusted during training until the model predicts text accurately. GPT-3 has 175 billion parameters. Claude 3 Opus is estimated at a similar scale. More parameters = more capacity to store patterns, but also more compute required.
Training data — LLMs are trained on text scraped from the internet, books, code repositories, and curated datasets — hundreds of billions to trillions of tokens. The model never "sees" this data again after training; it is compressed into the weights.
Compute — Training a frontier model costs tens to hundreds of millions of dollars in GPU compute. Inference (running the model for a user query) is much cheaper — milliseconds of compute per request.

💡

An LLM is like a very well-read person who has never experienced the world

Imagine someone who has read every book, article, and forum post ever written — but has never left a room. They know everything written down, including contradictions, misinformation, and fictional "facts". They can reason about problems they have never encountered, but their knowledge has a hard cutoff at when they stopped reading, and they cannot verify anything against reality.

Why this matters for builders

Models do not look things up — Everything an LLM knows is baked into its weights at training time. It cannot Google something mid-response unless you explicitly give it a search tool. This is why knowledge cutoffs exist and why RAG (giving the model documents to read) is so powerful.
Models do not reason like computers — A computer evaluates an expression and returns an exact answer. An LLM predicts what a plausible answer looks like based on patterns in training data. For most tasks, these converge. For edge cases, they diverge — sometimes confidently.
Context is everything — The model sees only what you send it in the current request. It has no memory of previous conversations unless you include that history in the prompt. This is why context window size matters so much for complex workflows.

🎯

Try this

Ask Claude (or any LLM): "What is today's date?" Then ask: "What was the most recent news event you know about?" Then ask: "If you had to guess what happened in the world after your training cutoff, what would you say?" Observe how the model reasons about its own limitations — and where that reasoning breaks down.

Tokens and Context Windows