# Glossary

Course file: `resources/glossary.md`
**Activation function**: A function applied after a weighted sum that adds nonlinearity. Without it, a neural network would behave like a simple linear model, no matter how many layers it has.
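A minimal sketch of a single neuron with a ReLU activation (the input values and weights below are made up for illustration):

```python
def relu(x):
    # ReLU keeps positive values and zeroes out negatives, adding nonlinearity.
    return max(0.0, x)

def neuron(inputs, weights, bias):
    # Weighted sum of the inputs, followed by the activation.
    z = sum(i * w for i, w in zip(inputs, weights)) + bias
    return relu(z)

print(neuron([1.0, -2.0], [0.5, 0.25], 0.1))  # weighted sum is 0.1, so output is 0.1
```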
**Attention**: A mechanism that lets a model decide which earlier pieces of information matter most for the current step.
**Backpropagation**: The process of computing how much each parameter contributed to the final error, so the model can update itself.
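A sketch of the idea on the smallest possible model, `y = w * x` with a squared-error loss (the numbers are toy values chosen for illustration):

```python
def forward(w, x):
    # Forward pass: the model's prediction.
    return w * x

def loss(y, target):
    # Squared error: how wrong the prediction is.
    return (y - target) ** 2

def backward(w, x, target):
    # Chain rule: dL/dw = dL/dy * dy/dw = 2 * (y - target) * x
    y = forward(w, x)
    return 2 * (y - target) * x

print(backward(3.0, 2.0, 10.0))  # y = 6, so dL/dw = 2 * (6 - 10) * 2 = -16.0
```

The negative gradient says the parameter should increase to reduce the loss.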
**Context window**: The amount of recent input a language model can look at when predicting the next token.
**Epoch**: One full pass through a dataset during training.
**Gradient**: The direction and size of change that tells a parameter how to reduce the loss.
**Inference**: Using a trained model to make predictions, generate text, or answer a question.
**Learning rate**: The step size used when updating parameters during training.
**Loss**: A number that measures how wrong the model currently is.
**Overfitting**: When a model memorizes training data so strongly that it performs worse on new examples.
**Parameter**: A learned value inside the model, such as a weight or a bias.
**Prompt**: The input instruction or context given to a language model.
**Query, Key, Value**: Three vector roles used in attention: the query describes what the current position is looking for, each key describes what a position offers for matching, and each value carries the information that is blended into the output once queries and keys have been scored.
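A minimal sketch of these three roles with scaled dot-product attention, using made-up 2-dimensional vectors:

```python
import math

def softmax(xs):
    # Turn raw scores into weights that sum to 1.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query, keys, values):
    d = len(query)
    # Score each key against the query (dot product, scaled by sqrt of dimension).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Blend the values according to the attention weights.
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]

out = attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[10.0, 0.0], [0.0, 10.0]])
print(out)  # the first value gets more weight because its key matches the query
```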
**Retrieval-augmented generation (RAG)**: A pattern where a system retrieves relevant source material first and then uses it to answer more accurately.
**Token**: A chunk of text used by language models. A token can be a word, part of a word, punctuation, or even whitespace.
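For illustration only, a naive regex splitter that separates words, punctuation, and whitespace; real tokenizers (such as BPE) learn subword pieces from data and behave differently:

```python
import re

def toy_tokenize(text):
    # Split into word runs, single punctuation marks, and whitespace runs.
    return re.findall(r"\w+|[^\w\s]|\s+", text)

print(toy_tokenize("Hello, world!"))  # ['Hello', ',', ' ', 'world', '!']
```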
**Training loop**: The repeated cycle of forward pass, loss calculation, gradient calculation, and parameter update.
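The full cycle can be sketched on a toy model `y = w * x` fitted to made-up data where the true weight is 2 (all values below are assumptions for illustration):

```python
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (input, target) pairs with target = 2 * input
w = 0.0      # the parameter, starting from scratch
lr = 0.05    # the learning rate

for epoch in range(100):           # one epoch = one full pass through the dataset
    for x, target in data:
        y = w * x                  # forward pass
        # loss would be (y - target) ** 2; its gradient with respect to w is:
        grad = 2 * (y - target) * x    # gradient calculation
        w -= lr * grad                 # parameter update

print(round(w, 3))  # converges toward 2.0
```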