Like the mysteries of black holes, generative AI hides its brilliance in a swirling storm of math, tokens, and unseen forces. But is it really magic?
Introduction
In this article, we’ll take a stargazer’s lens to our first encounter with Generative AI — often seen as a mysterious “black box.” But unlike actual black holes, which swallow information, Gen AI thrives on it. At its core, it's not magic or science fiction — it’s just a highly trained system that learns patterns and predicts what comes next, one token at a time.
Imagine a telescope that doesn’t just observe the universe, but learns from it — and starts sketching galaxies before they even form. That’s what Gen AI feels like: a machine that, with enough data and training, begins to generate intelligent outputs that appear almost supernatural. But behind that illusion is a carefully engineered system of math, layers, and logic — not wizardry.
Let’s break down the cosmic fundamentals of how it all begins.
🌐 AI vs ML vs GenAI — A Quick Cosmic Context
Artificial Intelligence (AI)
The broadest galaxy — AI refers to any machine or system designed to mimic human intelligence: reasoning, decision-making, problem-solving, even perception.
🔍 Focus: Philosophical + Research-driven + Applications (robotics, vision, NLP)
Machine Learning (ML)
A star cluster within AI, ML gives machines the ability to learn from data instead of being explicitly programmed. It includes techniques like regression, decision trees, and neural networks.
🔍 Focus: Mostly academic + research + analytics-heavy industry work (e.g., data science, fraud detection)
Generative AI (GenAI)
A supernova in the AI universe, GenAI is a subfield of ML that enables machines to generate new content — text, images, music, code — using deep learning (especially Transformer models like GPT).
🔍 Focus: Product & business innovation (chatbots, content creation, coding assistants, design tools)
🤔 How Did It All Start?
In the early days of machine translation, systems were handcrafted with strict rules. Think of this as trying to decode alien languages with a dictionary and a laser pointer, hoping you’re pointing at the right star.
But meaning in language doesn’t live in isolated stars — it lives in patterns, in context, in the gravity between words.
Then came the real supernova. In 2017, Google Research released a landmark paper titled:
“Attention is All You Need”
by Vaswani et al.
This was the Big Bang moment of modern AI.
Instead of scanning one word at a time like a comet on a narrow path, this new model — the Transformer — used attention mechanisms to scan the whole galaxy of words at once. It asked:
“Which other words in this sentence pull this one into orbit?”
This ability to look around, to attend to the whole sentence (and eventually paragraphs, pages, even books), gave rise to understanding context — something no earlier system could grasp well.
This is the same architecture that fixed Google Translate. Instead of blindly translating word-by-word, now it could sense the gravitational pull of meaning.
Fast-forward to OpenAI. In 2018, they launched the first GPT (Generative Pretrained Transformer) — a model that didn’t just understand language, it could generate it.
While the Transformer model was like a galaxy mapper, GPT became a cosmic writer. Trained on massive amounts of text from books, websites, articles — it learned to predict the next token (word or sub-word) given everything before it.
“The stars were…”
GPT might say: “…glimmering softly above the quiet Earth.”
All it's doing is statistically guessing the next token — but it feels like it’s reading your mind. This was not magic, just math orbiting in very high-dimensional space.
🌌 Foundations of GenAI: Unpacking the Black Box (Token by Token)
In GenAI, we don’t feed full sentences to models — we break them into smaller pieces called tokens. Think of tokens as the stars or particles that make up a galaxy of meaning.
Example:
Sentence: “AI is awesome!”
Tokens: [AI] [ is ] [ awe] [some] [!]
(Depending on the tokenizer, words may be split further)
🔧 Tool: tiktoken — OpenAI's tokenizer used in GPT models.
Note: Tokenization may differ based on the model — some tokenize by word, others by sub-word, or even characters.
A sequence is just an ordered list of tokens — like planets aligned in a solar system. The model reads the sequence and tries to understand what comes next.
Example: [AI] → [is] → [awe] → [some] → [!]
In GPT-style models, you might see a limit like sequence length = 2048 tokens (as in GPT-3), 8192 in the original GPT-4, or 128k in newer models like GPT-4 Turbo.
Tokenization is the process of converting raw text into tokens (numerical data). It’s like turning light from a star into measurable spectrums — now the model can analyze it.
Based on its vocabulary, the model converts each token into a unique numerical ID that the machine understands.
For a quick demonstration, you can use the tiktoken library in Python:
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # the tokenizer used by GPT-3.5 / GPT-4
tokens = enc.encode("Hello, world!")
print(tokens)               # a short list of integer token IDs
print(enc.decode(tokens))   # "Hello, world!"  (decoding reverses the mapping)
Each token maps to a unique ID in a vocabulary — the model’s language dictionary.
A model has a fixed set of known tokens, known as its vocabulary. Think of it as the observable universe for the AI.
GPT-2 / GPT-3 (r50k_base / p50k_base) vocabulary size: ~50,000 tokens
GPT-3.5 / GPT-4 (cl100k_base): ~100,000 tokens
Unseen words are broken into smaller known parts!
The bigger the vocabulary, the better the model handles different languages, slang, and technical words — but also the more complex the model becomes.
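Here is a small sketch of how you can peek at a tokenizer's vocabulary with tiktoken. The word "astroquasarification" below is invented purely to force a sub-word split; the exact pieces you get depend on the tokenizer.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
print(enc.n_vocab)   # size of the vocabulary (roughly 100k entries)

# A rare / made-up word gets broken into smaller known sub-word pieces
tokens = enc.encode("astroquasarification")
print(tokens)                              # several token IDs, not just one
print([enc.decode([t]) for t in tokens])   # the sub-word pieces the model actually sees
```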
Once we have tokens, we convert them into vectors — numerical representations in high-dimensional space. These are called embeddings.
Imagine each token is a planet, and embeddings are its coordinates in the AI galaxy. The closer two planets are, the more similar their meanings.
🧠 Intuition with a graph:
If we plot the word vectors for ["king", "queen", "man", "woman"], you'll see that the shift (same direction and magnitude) that takes you from "man" to "king" is roughly the same shift that takes you from "woman" to "queen". The embedding space preserves semantic relationships as directions, in far more dimensions, of course, than the 3D space we live in.
king - man ≈ queen - woman
This shows that the model "understands" relationships in a mathematical sense.
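To make that arithmetic concrete, here is a tiny toy sketch. The four vectors below are invented for illustration (just two hand-picked dimensions, roughly "royalty" and "masculinity") rather than real embeddings, but they show why the relationship holds when meaning is stored as directions.

```python
import numpy as np

# Toy 2-D "embeddings" (made up for illustration; real ones have hundreds of dimensions)
king  = np.array([0.9, 0.9])   # [royalty, masculinity]
queen = np.array([0.9, 0.1])
man   = np.array([0.1, 0.9])
woman = np.array([0.1, 0.1])

print(king - man)             # roughly [0.8, 0.0]
print(queen - woman)          # roughly [0.8, 0.0], the same direction and magnitude
print(king - man + woman)     # lands (almost) exactly on queen: [0.9, 0.1]
```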
Another example, with a bit more explanation. Here's how we can use OpenAI's text-embedding-3-small model to generate an embedding for a phrase like "dog chases cat":
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()   # loads your OpenAI API key from a .env file
client = OpenAI()

text = "dog chases cat"
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=text,
)

print("Vector Embedding:", response.data[0].embedding)
print("Vector Length:", len(response.data[0].embedding))
💡 What’s happening here?
We're turning a sentence into a vector in hyperspace — a long list of numbers (typically 1536 dimensions) that capture the meaning of the phrase. You can imagine these as coordinates of a spaceship in a galaxy of meaning. Similar meanings will land in nearby zones.
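To see "nearby zones" in action, we can embed a few phrases and compare them with cosine similarity. A minimal sketch building on the snippet above (it assumes the same client and API key; the phrases are arbitrary and the exact similarity numbers will vary):

```python
import numpy as np
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()
client = OpenAI()

phrases = ["dog chases cat", "a puppy runs after a kitten", "quarterly revenue report"]
response = client.embeddings.create(model="text-embedding-3-small", input=phrases)
vectors = [np.array(d.embedding) for d in response.data]

def cosine(a, b):
    # 1.0 means "pointing the same way in meaning-space", 0 means unrelated
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors[0], vectors[1]))   # high: similar meaning, nearby in space
print(cosine(vectors[0], vectors[2]))   # lower: unrelated meaning, farther away
```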
Since transformers don’t read in order like humans, we need to tell them where each token is in the sequence — this is called positional encoding.
It's like giving each token a time-stamp or coordinates in orbit.
Without this, the model wouldn’t know the difference between:
“The rocket launched after the signal.”
“The signal launched after the rocket.”
Positional encodings are usually sinusoidal functions or learned vectors added to the embeddings.
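Here is a minimal sketch of the sinusoidal version from the original Transformer paper: each position gets a unique pattern of sine and cosine values, which is simply added to the token embeddings. The sizes are illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # (1, d_model/2)
    angles = positions / np.power(10000, dims / d_model)     # (seq_len, d_model/2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=6, d_model=8)
print(pe.shape)   # (6, 8): one row per token position, added to that token's embedding
```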
In self-attention, each word in a sentence looks at every other word (including itself) to understand context.
Example:
For the sentence “The moon orbits Earth”, the model looks at:
What does “moon” mean in this context?
How does it relate to “orbits” and “Earth”?
It creates attention scores, like gravitational forces between celestial bodies, to weigh their relationships.
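Under the hood this is scaled dot-product attention: every token produces a query, key, and value vector, and the scores are softmax(QKᵀ / √d). A tiny numpy sketch, with random vectors standing in for the four tokens of "The moon orbits Earth" and random matrices standing in for the learned projections:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                  # tiny embedding size, for illustration only
X = rng.normal(size=(4, d))            # 4 token embeddings: "The", "moon", "orbits", "Earth"

# In a real model W_q, W_k, W_v are learned; here they are random stand-ins
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T / np.sqrt(d)          # how strongly each token "pulls" on each other one
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax per row
output = weights @ V                   # each token becomes a context-aware mix of values

print(weights.shape)   # (4, 4): one attention score per token pair
print(output.shape)    # (4, 8): contextualised representation of each token
```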
Instead of using just one lens, the model uses multiple "heads" — each capturing a different pattern or perspective. One head may focus on grammar, another on sentiment, another on topic.
It’s like using different telescopes to observe the same constellation in radio, infrared, and visible light!
"The satellite observes Mars."
Let’s assume we tokenize this as:
["The", "satellite", "observes", "Mars"]
Each word is first embedded into a vector, and multi-head attention allows the model to look at the sentence from multiple perspectives at once.
🧠 Head 1: Focus on Subject-Verb Grammar
This head tries to understand grammatical structure:
“satellite” attends strongly to “observes” (subject–verb relationship)
“Mars” has weak attention here
🌌 Head 2: Focus on Semantic Roles (Who’s Watching Whom)
This head focuses on the meaning:
“observes” attends to both “satellite” (agent) and “Mars” (object)
“The” gets low attention
Each attention head gives a different weighted representation of the same input. These are then concatenated and linearly combined into one final representation.
This is like observing a planet from two different telescopes, each tuned to different wavelengths (grammar vs meaning), and then combining the images to form a clear, complete picture.
Instead of relying on a single interpretation, the model learns richer, deeper representations, capturing multiple types of relationships between tokens — all in parallel.
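A compact sketch of the multi-head idea: split the model dimension into several smaller heads, run the same attention computation inside each head, then concatenate the results and project them back together. All names and sizes here are illustrative, and the weights are random stand-ins for learned parameters.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 16, 2   # 4 tokens: "The", "satellite", "observes", "Mars"
d_head = d_model // n_heads
X = rng.normal(size=(seq_len, d_model))

# One set of (random stand-in) projection matrices per head; each head is its own telescope
heads = []
for h in range(n_heads):
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(attention(X @ W_q, X @ W_k, X @ W_v))

W_o = rng.normal(size=(d_model, d_model))
output = np.concatenate(heads, axis=-1) @ W_o   # concatenate the heads, then mix them

print(output.shape)   # (4, 16): same shape as the input, but context-enriched
```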
The encoder takes the input tokens + positional encodings and passes them through multiple stacked layers of:
Self-Attention
Feed Forward Network
Add & Norm (residual connections + layer normalization)
This is where the model learns to encode meaning, structure, and relationships.
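Putting those pieces together, one encoder layer is roughly: self-attention, add the residual and normalize, a small feed-forward network, add the residual and normalize again. A minimal, illustrative sketch (single head, random untrained weights):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def encoder_layer(X, W_q, W_k, W_v, W1, W2):
    # 1. Self-attention, then residual connection + layer norm
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V
    X = layer_norm(X + attn)
    # 2. Feed-forward network, then residual connection + layer norm
    ffn = np.maximum(0, X @ W1) @ W2      # ReLU in the hidden layer
    return layer_norm(X + ffn)

rng = np.random.default_rng(0)
d, d_ff = 8, 32
X = rng.normal(size=(4, d))               # 4 token embeddings (with positional encodings added)
weights = [rng.normal(size=s) for s in [(d, d), (d, d), (d, d), (d, d_ff), (d_ff, d)]]

print(encoder_layer(X, *weights).shape)   # (4, 8): ready to feed into the next layer
```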
The decoder takes the previously generated outputs (shifted right, so it can't peek at the answer!) and runs them through masked self-attention → cross-attention → feed-forward → output.
This is the prediction phase, where the model guesses the next word.
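The "no peeking" part is implemented with a causal mask: before the softmax, every score that would let a token attend to a later token is set to negative infinity, so its attention weight becomes zero. A tiny illustrative sketch with random stand-in scores:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len = 4
scores = rng.normal(size=(seq_len, seq_len))    # raw attention scores (random stand-ins)

# Causal mask: position i may only attend to positions 0..i
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

print(np.round(weights, 2))   # upper triangle is all zeros: no token sees the future
```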
Training is where the backpropagation magic happens.
Analogy: Imagine the model tries to predict “black hole” but says “green apple” instead.
It compares the predicted vector to the actual correct output.
Computes the error (loss).
Then adjusts internal weights — like recalibrating satellites after a failed mission — via backpropagation:
Activations flow forward (like signal transmission).
Errors flow backward to update each weight (gravitational recalibration).
Over time, the model learns from its mistakes and improves prediction.
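Here is that loop shrunk down to a toy example: a single linear layer predicting the next token out of a tiny made-up vocabulary. It shows the forward pass, the cross-entropy loss, the gradient flowing backward, and the weight update; everything about the sizes and data is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 5, 8                        # tiny made-up vocabulary and embedding size
W = rng.normal(size=(d, vocab_size)) * 0.1  # the weights we will train

x = rng.normal(size=d)                      # context vector (stands in for the transformer output)
target = 3                                  # index of the "correct" next token

for step in range(100):
    # Forward pass: scores -> probabilities -> loss
    logits = x @ W
    probs = np.exp(logits - logits.max()); probs /= probs.sum()
    loss = -np.log(probs[target])           # cross-entropy: how wrong was the guess?

    # Backward pass: gradient of the loss with respect to W (errors flow backward)
    grad_logits = probs.copy(); grad_logits[target] -= 1.0
    grad_W = np.outer(x, grad_logits)

    W -= 0.1 * grad_W                       # recalibrate the weights a little

print(round(loss, 4))                       # the loss shrinks as the model learns from its mistakes
```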
Now the model has been trained. We give it a prompt:
“A galaxy is…”
It processes the input, adds positional encoding, passes it through the encoder-decoder stack, and predicts: “...a massive system of stars and planets.”
This is the inference phase — where the model becomes your astro-oracle, predicting meaningful next words based on your prompt.
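Conceptually, inference is just a loop: feed the prompt in, pick the most likely next token, append it, repeat. The sketch below fakes the model with a tiny hard-coded lookup table (invented for illustration; a real GPT returns logits over its whole vocabulary at every step), just to show the shape of greedy decoding:

```python
# A stand-in "model": maps the last token to scores for a handful of candidate next tokens.
FAKE_NEXT_TOKEN_SCORES = {
    "A":       {"galaxy": 0.9, "planet": 0.1},
    "galaxy":  {"is": 0.8, "has": 0.2},
    "is":      {"a": 0.7, "the": 0.3},
    "a":       {"massive": 0.6, "small": 0.4},
    "massive": {"system": 0.9, "cloud": 0.1},
    "system":  {"of": 0.9, ".": 0.1},
    "of":      {"stars": 0.95, "dust": 0.05},
    "stars":   {".": 1.0},
}

def generate(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        scores = FAKE_NEXT_TOKEN_SCORES.get(tokens[-1], {})
        if not scores:
            break
        next_token = max(scores, key=scores.get)   # greedy: always take the highest score
        tokens.append(next_token)
    return " ".join(tokens)

print(generate(["A", "galaxy", "is"]))   # A galaxy is a massive system of stars .
```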
| Component | Role in the Universe |
| --- | --- |
| Tokenization | Breaks text into stars (tokens) |
| Embedding | Places stars in vector space |
| Positional Encoding | Maps them in orbit |
| Self-Attention | Calculates gravitational pull between them |
| Multi-Head Attention | Views from multiple telescopes |
| Feed Forward | Internal transformation unit (like a fusion core) |
| Backpropagation | Learns from navigation errors |
| Inference | Generates new coordinates (words) from trained space |
Conclusion
And just like that, we’ve launched our journey into the cosmos of Generative AI! 🚀
We’ve looked at how GenAI isn’t just some mysterious black box floating in a neural nebula, but a brilliant system powered by math, training, and attention – kind of like a space probe learning to navigate by starlight.
From tokens and embeddings to multi-head attention and the Transformer architecture, we’ve skimmed the surface of what makes these models tick – or should I say, orbit? 🪐
But this is just the beginning of the voyage. There’s a whole galaxy of concepts yet to explore — from decoding, fine-tuning, and hallucinations, to building your own AI co-pilot.