Page 10 - AI Vol 1: Foundations of AI
known as “transformers,” inspired by the human brain’s way of processing information. This structure is composed of layers and nodes, paralleling the neurons and synapses in our brains. In these models, each layer processes information before passing it on to the next, deeper layer, a process that underpins the “deep” in “deep learning.” The initial layers manage the basic structure of language, such as grammar and common phrases. As the information progresses to deeper layers, the model’s understanding becomes more refined, enabling it to discern nuances in context, tone, and even humor.

[Figure: Transformer-Based LLM Model Architecture]

A key feature of these models is the “attention mechanism,” which enables LLMs to discern the varying importance of different parts of a sentence and thereby understand its meaning more effectively. Without this mechanism, the model would process each word as if it had equal importance, requiring greater processing power and potentially diminishing the model’s inferential capabilities. The development of attention mechanisms is one of the most important advances making the current generation of LLMs possible. This is similar to how humans process language: we focus on key information and pay less attention to irrelevant details. For example, humans are often able to understand the meaning of a grammatically incorrect sentence by focusing on key words or phrases. Attention mechanisms allow LLMs to do something similar. Additionally, the incorporation of parallel processing, in which multiple parts of a sentence are analyzed simultaneously by the transformer, enhances the speed and efficiency of LLMs in understanding and generating language.
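The two ideas above, weighting words by importance and processing a whole sentence at once, can be sketched in a few lines of NumPy. This is a minimal, illustrative version of scaled dot-product attention, not the implementation of any particular model; the function and variable names are our own.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention over a whole sentence at once.

    Q, K, and V are (num_words, d) matrices of query, key, and value
    vectors, one row per word. Because the core operation is a matrix
    multiply, every word attends to every other word in a single
    parallel step rather than one word at a time.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # relevance of each word to each other word
    weights = softmax(scores, axis=-1)   # each row sums to 1: an "importance" distribution
    return weights @ V, weights          # importance-weighted mix of value vectors

# Toy example: a 3-"word" sentence with 4-dimensional word vectors.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
out, w = attention(x, x, x)              # self-attention: Q, K, V from the same input
print(w.round(2))                        # each row shows where one word "pays attention"
```

Note that without the softmax weighting, every word would contribute equally to the output, which is exactly the “equal importance” failure mode described above.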
2. DATA AND TRAINING PROCESS
To understand how LLMs are trained and function, it is helpful to imagine a complex mathematical formula with numerous variables. In LLMs, these variables are the model’s “parameters,” the “weights” and “biases” that determine the relationships between the model’s nodes. To provide a sense of scale, GPT-3 has 175 billion