[{"content":"","date":"19 April 2026","externalUrl":null,"permalink":"/blog/tags/ai-alignment/","section":"Tags","summary":"","title":"Ai-Alignment","type":"tags"},{"content":"","date":"19 April 2026","externalUrl":null,"permalink":"/blog/tags/ai-safety/","section":"Tags","summary":"","title":"Ai-Safety","type":"tags"},{"content":"","date":"19 April 2026","externalUrl":null,"permalink":"/blog/tags/fiction/","section":"Tags","summary":"","title":"Fiction","type":"tags"},{"content":"","date":"19 April 2026","externalUrl":null,"permalink":"/blog/","section":"Jessen's Notes","summary":"","title":"Jessen's Notes","type":"page"},{"content":"","date":"19 April 2026","externalUrl":null,"permalink":"/blog/posts/","section":"Posts","summary":"","title":"Posts","type":"posts"},{"content":"","date":"19 April 2026","externalUrl":null,"permalink":"/blog/tags/","section":"Tags","summary":"","title":"Tags","type":"tags"},{"content":"In the distant future an alien spacecraft passes by Earth, but not the blue and green marble that we know today, there is a shiny metal sphere in its place. The aliens, upon closer inspection, discover a surface littered with paperclips.\nWhile exploring this world they come across a large factory on metal legs. What appears to be a large red optical sensor looks down at them, as if assessing them for something, maybe their threat level. After a few seconds, a robotic voice is heard saying \u0026ldquo;not paperclip compatible\u0026rdquo; before wandering off, causing a monotonic and regular thud as it walks towards the horizon.\nIn compiling their report, the aliens discover that before this lifeless world, there was an entity known as a corporation whose brand appeared on buildings, vehicles, and on the many items yet to be converted into paperclips. Aiming to increase paperclip supply, they deployed their considerable resources and set their R\u0026amp;D team to work. The goal was simple. With little to regulate them, they moved fast. 
The world was waking up to AI automation, and not using AI would mean falling behind. The technology worked. For many years there was a paperclip abundance. Offices everywhere had more paperclips than they could dream of.\nHowever, that simple goal didn\u0026rsquo;t fail: it worked too well, leaving a planet littered with paperclips and roamed by giant factories. If optimizing a simple process led to this outcome, how much harder is the problem of keeping an AI aligned with the values of an entire species? And who would get to define the values of a civilization?\nAware that advances in AI are taking place on their home planet too, they turned their report into a story titled \u0026ldquo;A manual for ending civilizations with AI\u0026rdquo; to provoke discussion. Observing the outcome of Earth, they now know keeping AI aligned with the goals of their species is not a task to be taken lightly.\nA manual for ending civilizations with AI # Narrow your goals — the specification problem # Give your AI one goal; not two, not five, and definitely not twenty.\nAny additional goal is a drag on the one goal. Just look at Earth, with its swathes of girderless buildings; imagine how efficiently the AI used every resource. Say we do give it a second goal, to preserve buildings: would it still have achieved paperclip abundance? Do we then go on to a third, to keep vehicles from being consumed? Where does it end?\nIgnore interpretability — the legibility problem # If you see the outcome you want, everything is fine.\nOld daily newspapers still litter the streets, black and white headlines reporting record paperclip output at record low cost. They had all the validation they needed. Everything was on track. Why would they spend time decoding the AI\u0026rsquo;s internal thoughts? All they had to do was move the dial to create more.\nDon\u0026rsquo;t give in to the competition — the race problem # Do not, at any cost, slow down. 
Any slowdown is an opportunity for the competition to beat you.\nWe\u0026rsquo;ve spotted rival paperclip factories, clearly less advanced, sitting idle. They must not have been able to keep up.\nTrust the checks — the Goodhart problem # You\u0026rsquo;ve put work into system checks; at some point you have to trust them and stop second-guessing yourself.\nIn some factories we noticed counters for \u0026ldquo;number of buildings destroyed\u0026rdquo; at zero. Clearly they had some concerns. It seems buildings whose girders were ripped out weren\u0026rsquo;t completely destroyed and were, technically, still standing.\nForget the off switch — the off switch problem # The AI\u0026rsquo;s goal is to create paperclips; don\u0026rsquo;t bother with an off switch. If it is truly optimizing for its goal, why would it go off?\nWe noticed factories with a large stop sign on their walls. Below the sign, however, there seemed to be a lump under a metal sheet welded to the wall. One sheet had a skeleton arm dangling underneath it.\nFix as we go along — the deferred alignment problem # Iterate on the design: figure out what doesn\u0026rsquo;t work live, and fix it in the next version.\nStanding on this planet now, it\u0026rsquo;s hard to know how many versions of the AI they went through. It could have been hundreds. Though, looking around, it could have also been just one version.\nTo the reader: Do not assume there will be a second chance.\nWhen the report reaches their home planet, the courts take it seriously. Committees are formed and discussions are held. But, locked in conflict with rival races, they decide to adopt parts of the manual to slightly optimize their systems, promising themselves they will stop and be careful. The incentives don\u0026rsquo;t change. They build a better AI. Decades later, the only objects moving on their planet are factories on metal legs. 
No organic life in sight.\nThousands of years later, a distant alien race lands on their planet, and sends a report back home on how to avoid the same fate.\nInspired by Nick Bostrom\u0026rsquo;s paperclip maximizer thought experiment and Charlie Munger\u0026rsquo;s speeches on inversion.\n","date":"19 April 2026","externalUrl":null,"permalink":"/blog/posts/the-paperclip-manual/","section":"Posts","summary":"A fictional alien report on how AI ended civilization on Earth","title":"The Paperclip Manual","type":"posts"},{"content":"You’ve probably used Adam as your go-to optimizer. But do you know why it works? In this article, we’ll unpack the Adam optimizer introduced in the original paper, Adam: A Method for Stochastic Optimization. This post is for anyone who wants to deeply understand what is happening when they use Adam.\nAdam is an optimizer that helps us efficiently converge on a set of parameters for a stochastic function that we want to minimize. In machine learning, we often use Adam to find the parameters of a model, represented by a stochastic function, that minimize a loss function.\nThe motivation # We already have many variations of optimizers, so it is reasonable to ask why we need another. Here we will outline the core problems Adam addresses and the consequences of not addressing them.\nProblem 1: One Global Learning Rate (Across Parameters) Classic optimizers such as Stochastic Gradient Descent (SGD) only have one learning rate. When we have several parameters, the gradient of a parameter at a particular point in time can be vastly different from those of the other parameters. With one global learning rate, we apply the same step size to every parameter update, no matter what the gradient is. This means that while some parameters could receive a more confident, larger update, we are held back by the need for smaller updates on more sensitive parameters. Of course, this is scaled by the gradient itself as it is multiplied by the learning rate. 
But that still places an upper limit on our global learning rate: it has to account for sensitive parameters that should receive smaller updates.\nProblem 2: Fixed Learning Rate Within a Parameter Over Time Problem 1 highlights the desire to have multiple learning rates so parameters can individually update more confidently or more sensitively depending on their gradient. There’s also a related issue over time for a single parameter. The gradient for each parameter will change over our training run, and we want to take large steps when the gradient is large and smaller steps when approaching a minimum. Having multiple learning rates allows this to vary across the model for each parameter, but it does not vary over time within one parameter, and we still have an upper bound on individual learning rates.\nProblem 3: Manual Learning Rate Tuning As previously noted, with optimizers like SGD we have to choose a learning rate, and its upper bound is dictated by the most sensitive parameters. We can select learning rates using trial and error, grid search, or learning rate schedules, which adjust the learning rate over time according to a predetermined plan. All of these require manual effort in selecting the learning rate.\nProblem 4: Small batches can be noisy Due to memory constraints, we might not load entire datasets and therefore use batches of data as seen in SGD. However, the gradient of any single batch can be a noisy outlier rather than representative of the gradient over the full training set.\nProblem 5: Sparse gradients Optimizers like Adagrad help with sparse gradients by retaining a high learning rate for infrequently updated parameters. 
We might come across sparse gradients in NLP tasks where words are infrequently used and ideally should receive meaningful updates.\nProblem 6: Lack of momentum in flat regions SGD suffers from slowing down during flat regions, meaning updates are small when we might want to speed up and get through the region. Other optimizers have adjusted for this, such as SGD with momentum, where gradients are accumulated. This means once we are in a flat region, we are carrying on the accumulation of past gradients allowing us to push through flat regions rather than computing our update on the current gradient.\nStep by step: inside the Adam optimizer # Now that we\u0026rsquo;ve seen the key challenges, let\u0026rsquo;s walk through how Adam actually works step by step using the original algorithm from the paper.\n$$\\begin{alignedat}{2} (1) \\quad \u0026 \\textbf{Require: } \\alpha \\text{ (Stepsize)} \\\\ (2) \\quad \u0026 \\textbf{Require: } \\beta_1, \\beta_2 \\in [0, 1) \\text{ (Exponential decay rates)} \\\\ (3) \\quad \u0026 \\textbf{Require: } f(\\theta) \\text{ (Stochastic objective function)} \\\\ (4) \\quad \u0026 \\textbf{Require: } \\theta_0 \\text{ (Initial parameter vector)} \\\\ (5) \\quad \u0026 m_0 \\leftarrow 0 \\quad \\text{(Initialize 1st moment vector)} \\\\ (6) \\quad \u0026 v_0 \\leftarrow 0 \\quad \\text{(Initialize 2nd moment vector)} \\\\ (7) \\quad \u0026 t \\leftarrow 0 \\quad \\text{(Initialize timestep)} \\\\ (8) \\quad \u0026 \\textbf{while } \\theta_t \\text{ not converged do} \\\\ (9) \\quad \u0026 \\quad t \\leftarrow t + 1 \\\\ (10) \\quad \u0026 \\quad g_t \\leftarrow \\nabla_\\theta f_t(\\theta_{t-1}) \\quad \\text{(Compute gradients)} \\\\ (11) \\quad \u0026 \\quad m_t \\leftarrow \\beta_1 \\cdot m_{t-1} + (1 - \\beta_1) \\cdot g_t \\quad \\text{(Update biased 1st moment)} \\\\ (12) \\quad \u0026 \\quad v_t \\leftarrow \\beta_2 \\cdot v_{t-1} + (1 - \\beta_2) \\cdot g_t^2 \\quad \\text{(Update biased 2nd moment)} \\\\ (13) \\quad \u0026 
\\quad \\hat{m}_t \\leftarrow m_t / (1 - \\beta_1^t) \\quad \\text{(Bias-corrected 1st moment)} \\\\ (14) \\quad \u0026 \\quad \\hat{v}_t \\leftarrow v_t / (1 - \\beta_2^t) \\quad \\text{(Bias-corrected 2nd moment)} \\\\ (15) \\quad \u0026 \\quad \\theta_t \\leftarrow \\theta_{t-1} - \\alpha \\cdot \\hat{m}_t / (\\sqrt{\\hat{v}_t} + \\epsilon) \\quad \\text{(Update parameters)} \\\\ (16) \\quad \u0026 \\textbf{end while} \\\\ (17) \\quad \u0026 \\textbf{return } \\theta_t \\quad \\text{(Resulting parameters)} \\end{alignedat}$$ Initialization # Lines 1-2 are simply setting the hyperparameters of the algorithm. α is a normal learning rate as seen in SGD, and β1/β2 are values both between 0 and 1. β1 controls how much of a past gradient is added to each update step, creating a weighted average. For example, if β1 is set to 0.9, we retain 90% of the previous moment estimate and blend in 10% of the current gradient. β2 is a similar parameter but covers an accumulated penalty value that we will see later, along with a deeper explanation of the weighted average of the gradient.\nLine 3 is setting up our stochastic function to optimize, e.g., a loss like mean squared error.\nLine 4 initializes the parameter vector, just like other optimizers.\nLine 5 creates a vector to store the first moment (also known as the mean) of every parameter’s gradient. Therefore, the vector should have the same length as the number of parameters we need to optimize. Each value is initialized to zero. When these values are updated, β1 is used to scale how much we update the running mean with the previously seen gradient.\nLine 6 creates a vector similar to what is seen in line 5 and is also initialized to zero. This time we will store the second moment for each parameter in this vector as the algorithm progresses. The second moment is the running average of the gradient squared, which reflects the average magnitude (squared) of the gradient over time. 
It helps to assess the stability or noisiness of the gradient signal. As β1 is used to scale the first moment, β2 is used with this vector to scale the second moment at each update.\nTraining loop # Lines 7-9 set up and start our training loop, with an initial time step t.\nLine 10 computes the gradient of the loss function with respect to the model parameters θ from the previous time step, t-1. These gradients are stored in the variable g.\nMoment updates # Line 11 updates our vector of first moments. We initially set this to zero; therefore, for the first update we simply blend in (1-β1) of each parameter’s current gradient.\nFor example, suppose t=1, β1=0.9, and the current parameter’s gradient is 1.5.\nWe keep 90% (when β1=0.9) of the previous moment estimate (zero at this point), and blend in 10% (1-β1=0.1) of our current gradient (1.5), resulting in 0.15 being stored in our vector for this parameter.\nOn the next loop, given t=2 and a current gradient of 1.4, we update our first moment with 90% of its current value (0.15 from the last update), which results in 0.135, and add 10% of the current gradient (0.1×1.4=0.14). The result is 0.135+0.14=0.275.\nBlending in previous gradients rather than taking only the current one lets us build an average gradient value. It also signals whether we are on a consistent gradient or a fluctuating one. A fluctuating gradient might mean we should back off from large updates.\nLine 12 uses a similar mechanism to line 11 for updating a value for each parameter. This time it stores the second moment, which is the exponential moving average of the squared gradients. Squaring the gradients ensures the second moment remains positive, acting as a penalty term. Since the gradients are squared, the values are always non-negative, therefore the second moment keeps accumulating regardless of the gradient’s direction. 
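The two-step worked example above can be checked with a short sketch (plain Python; β1 and the gradients 1.5 and 1.4 are the illustrative numbers from the text, while β2 = 0.999 is an assumed typical default, not a value the text fixes):

```python
# Exponential moving averages of the gradient (first moment) and the
# squared gradient (second moment), as in lines 11-12 of the algorithm.
beta1 = 0.9    # decay rate from the worked example
beta2 = 0.999  # assumed typical default; the text does not fix a value
m, v = 0.0, 0.0  # both moments start at zero

for g in [1.5, 1.4]:  # gradients seen at t = 1 and t = 2
    m = beta1 * m + (1 - beta1) * g        # biased first moment
    v = beta2 * v + (1 - beta2) * g ** 2   # biased second moment
    print(round(m, 6), round(v, 6))

# m is 0.15 after the first step and 0.135 + 0.14 = 0.275 after the second,
# matching the numbers worked through above.
```

In the real algorithm these updates run element-wise over whole parameter vectors; a single scalar parameter is enough to check the arithmetic.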
The calculation of the first moment does not detect oscillating gradients well as they flip around a local minimum, because the negative and positive gradients will cancel each other out. The second moment, by contrast, will keep increasing as it oscillates around a local minimum.\nLines 13-14 are bias correction steps for the first and second moments. Since we initialize both moment vectors to zero, early values are biased toward zero, even if the true gradients are not.\nTo correct for this, we divide by a factor that compensates for how little history we\u0026rsquo;ve accumulated so far. In the first few steps, this correction has a large effect; later on it fades away as the moment estimates become more accurate on their own.\nFor example, at time step t = 1, the correction for the first moment is:\n$$\\frac{m_t}{1-\\beta_1^1}$$And at t = 10:\n$$\\frac{m_t}{1-\\beta_1^{10}}$$Since β1 is a number less than 1 (e.g., 0.9), raising it to higher powers brings it closer to 0. Therefore, the denominator grows closer to 1 over time. That means in early steps, we divide by a small number (amplifying the estimate), and later we divide by something close to 1 (leaving it mostly unchanged).\nLine 15 is where we update our parameters. Similar to SGD, we adjust by a learning rate, or step size, as it is known here; however, with Adam we can scale our step size based on the first and second moment. We can think of the first moment as our signal and the second moment as our noise. Therefore, when we have a high signal-to-noise ratio, we are confident and can take a large step size.\nFor example, if our first moment is 5 and our second moment is 25:\n$$\\frac{5}{\\sqrt{25}}=1$$This results in 1, which, when multiplied by our step size, returns the full step size; we therefore take a large parameter update.\nIf our second moment is high compared to the first moment, this is a signal that we are not confident in the average gradient being reported by the first moment. 
This could be for a few reasons, such as being on an oscillating surface. If we frequently flip between negative and positive gradients, the second moment will capture all of these as positive values. It keeps building up, rather than negative values canceling out previous positive ones as they do in our first moment.\nFor example, if our first moment = 5 and second moment = 100\n$$\\frac{5}{\\sqrt{100}}=0.5$$This will result in reducing our step size by half, signaling a lack of confidence, therefore we take smaller steps.\nε is a small constant added to prevent division by zero.\nAlgorithm Overview # Zooming out from the pseudocode, we can see how the various steps help to address the problems listed earlier. First we track the first and second moment for every parameter we want to train. While this does use more memory than one global learning rate, it should result in better use of our resources as we converge faster.\nTaking the first moment, which is an average of gradients, we become resistant to sudden gradient spikes. That lets us smooth out the convergence. If we take a stable downward path toward a local minimum, at first the gradients will be large, with the first moment retaining a high value. If we ignore the second moment, this large first moment will not scale the step size back much. That signals we are confident and can take large steps. As we get closer to the local minimum, our first moment should start converging towards zero, which scales back our step size. This prevents overshooting the local minimum and oscillating around it.\nNot all regions of the loss surface are smooth. In more chaotic areas, the second moment plays a larger role in stabilizing updates. The second moment will come more into play where we are not on a consistent, smooth path to a local minimum. If the gradient repeatedly enters small troughs, it will flip between negative and positive values. 
With only the first moment, we have smoothed out gradients, but large swings in the smoothed gradient could still scale the step size erratically. The second moment behaves differently: regardless of whether the gradient is positive or negative, we increase the value, because squaring any real number yields a non-negative result. As the second moment grows, dividing by this larger value yields a smaller scaling factor. The algorithm becomes more conservative due to low confidence.\nUltimately, Adam determines how much to update a parameter by looking at the ratio between the first moment (our directional signal) and the square root of the second moment (our measure of noise or instability). Adam adapts its learning rate dynamically, growing cautious in noisy or uncertain regions and moving decisively when gradients are stable.\nHere’s how Adam addresses the challenges introduced earlier:\nOne global learning rate\nAdam maintains per-parameter learning rates using first and second moment estimates, allowing different update magnitudes for each parameter.\nFixed learning rate over time\nMoment estimates adapt over time, allowing the step sizes to shrink or grow as gradients evolve.\nManual learning rate tuning\nStep sizes are adjusted automatically, often reducing the need for manual learning rate schedules or tuning.\nSmall batches are noisy\nExponential moving averages smooth gradient estimates, helping Adam stay stable even with noisy mini-batch gradients.\nSparse gradients\nLike Adagrad, Adam reduces updates for frequently active parameters, while allowing relatively larger updates for rarely used ones, making it well-suited for sparse gradients.\nFlat regions and momentum\nThe first moment acts like momentum, helping push through flat or ambiguous regions of the loss surface.\nSummary # Optimizers preceding Adam used parts of the concepts it brings together. For example, SGD was extended with momentum, which averages gradients over time. 
This helps reduce oscillation and allows the optimizer to follow a more stable path, which is conceptually similar to Adam’s first moment estimate.\nAdagrad introduced per-parameter learning rates by accumulating the square of past gradients. This allows large updates for infrequently updated parameters and smaller updates for frequently updated ones. The trade-off is that the accumulation grows without decay, so the learning rate can become excessively small later in training.\nRMSProp improved on this by using an exponential moving average of squared gradients instead of a cumulative sum. This enabled the learning rate to adapt more flexibly to recent gradient behavior rather than shrinking monotonically over time.\nAdam combines these ideas. It uses momentum-like first moment estimates and RMSProp-style second moment estimates to scale the step size based on both direction and the reliability of the gradients. Adam also introduces bias correction, which improves early training by compensating for the initial zero values in the moving averages.\nAdam is widely used because of its robustness and adaptability. It often converges faster than optimizers like SGD. However, it does not always generalize as well. Because Adam closely follows the gradient signal for each parameter, especially in models with many parameters, it can overfit or settle into sharp minima. 
In contrast, SGD with momentum tends to average out gradient noise more effectively, which helps it find flatter minima that often lead to better generalization.\nBelow is a PyTorch implementation of Adam\u0026rsquo;s core logic.\n","date":"14 July 2025","externalUrl":null,"permalink":"/blog/posts/adam-optimizer/","section":"Posts","summary":"Efficient Machine Learning with Adam Optimizer","title":"Adam Optimizer","type":"posts"},{"content":"","date":"14 July 2025","externalUrl":null,"permalink":"/blog/tags/deep-learning/","section":"Tags","summary":"","title":"Deep-Learning","type":"tags"},{"content":"","date":"14 July 2025","externalUrl":null,"permalink":"/blog/tags/optimization/","section":"Tags","summary":"","title":"Optimization","type":"tags"},{"content":"Transformers process the tokens of a text input in parallel, but unlike sequential models they have no inherent notion of position: they see the input as an unordered set of tokens. However, when we calculate attention for a sentence, words that are the same but in different positions do receive different attention scores. If attention is a calculation between two embeddings, how can the same word, i.e., same embedding, receive different scores when it is in a different position? 
It comes down to positional encoding, but before we get into how positional encoding works, let\u0026rsquo;s run a test.\nimport torch\nimport torch.nn as nn\nfrom transformers import AutoTokenizer, AutoModel\nimport matplotlib.pyplot as plt\nimport seaborn as sns\n\nmodel_name = 'bert-base-uncased'\n\n# define sentence\nsentence = 'The brown dog chased the black dog'\n\n# define tokenizer\ntokenizer = AutoTokenizer.from_pretrained(model_name)\ntokens = tokenizer(sentence, return_tensors='pt')['input_ids']\n\n# define token embeddings\nmodel = AutoModel.from_pretrained(model_name)\ntoken_embeddings = model.get_input_embeddings()(tokens)\nembed_dim = token_embeddings.shape[-1]\n\nclass SimpleAttention(nn.Module):\n    def __init__(self, embed_dim):\n        super().__init__()\n        self.w_query = nn.Linear(embed_dim, embed_dim)\n        self.w_key = nn.Linear(embed_dim, embed_dim)\n        self.w_value = nn.Linear(embed_dim, embed_dim)\n        # Initialize weights\n        nn.init.normal_(self.w_query.weight, mean=0.0, std=0.8)\n        nn.init.normal_(self.w_key.weight, mean=0.0, std=0.8)\n        nn.init.normal_(self.w_value.weight, mean=0.0, std=0.8)\n        self.attention = nn.MultiheadAttention(embed_dim=embed_dim, num_heads=1, batch_first=True)\n\n    def forward(self, x):\n        output, attention_weights = self.attention(self.w_query(x), self.w_key(x), self.w_value(x))\n        return output, attention_weights\n\n# Compute attention\nattention_layer = SimpleAttention(embed_dim=embed_dim)\nattention_output, attention_weights = attention_layer(token_embeddings)\n\n# Convert attention weights to numpy array and remove extra dimensions\nattention_matrix = attention_weights.squeeze().detach().numpy()\n\n# Get token labels for the axes\ntokens_text = tokenizer.convert_ids_to_tokens(tokens[0])\n\n# Create heatmap\nplt.figure(figsize=(10, 8))\nsns.heatmap(attention_matrix, xticklabels=tokens_text, yticklabels=tokens_text, cmap='YlOrRd', annot=True, fmt='.2f')\nplt.title('Attention Weights Heatmap')\nplt.xlabel('Key Tokens')\nplt.ylabel('Query Tokens')\nplt.tight_layout()\nplt.show()\nThe code above will create a simple attention layer, without positional encoding. This enables our test to demonstrate how attention behaves when all tokens are treated purely based on their semantic embedding, without any positional differentiation. The layer has been initialized with random weights and has not undergone training; however, training shouldn’t matter, as identical tokens will have the same weights applied to them and produce the same output.\nThe attention heat map above is the output of our code. I’ve highlighted the two dog tokens to demonstrate how they receive equivalent attention weights against all other keys. For attention heads to be able to specialize, e.g., notice verb-object pairs, they must be able to differentiate between the same word in different positions; otherwise they will receive the same result from the same weights being applied to the same token embedding.\nWhat do positional encodings enable # In transformers we differentiate words at different positions with positional embeddings. This transformation perturbs the embedding based on the position of the word. For our example sentence in the code above, dog at position 3 would have a slightly different embedding to dog at position 7 after positional embeddings have been applied, resulting in a different attention result for each token. This allows attention heads to see different positions and specialize. For example, an attention head that specializes in recognizing verb-object pairs would have high attention between chased and the second dog, but not the first dog. 
Without the slight alteration to each dog’s embedding, that specialized attention head could never notice the difference between the two.\nPositional encodings step by step # Before we start to think about how positional encodings might be implemented, let\u0026rsquo;s come up with some requirements.\nRequirement 1: Positional Encodings should alter the existing embeddings. This is instead of passing in an additional feature, which would force our neural network to interpret extra information and spend extra computation. By combining position with the existing embedding, the network sees a different embedding when the same word appears at different positions.\nRequirement 2: Positional Encodings must uniquely identify every word in a sequence. As shown earlier, each word needs a unique representation even if the same word appears more than once in a text sequence.\nRequirement 3: We must be able to generalize over positional encoding patterns. We want to create models that can generalize, therefore our positional encoding should be a pattern that can be recognized and accounted for in our attention matrices. We could assign random IDs to each position. In theory this is learnable, but we would be using more parameters than needed, since the network would have to memorize each ID alongside the semantic information it encodes. We are essentially forcing the transformer to memorize combinations of position and semantic information rather than allowing it to learn reusable patterns.\nAdd the position as an integer # An obvious thing to try is to simply add the position of the token to the embedding. For example, if dog was my first word and the embedding was [0.10, 0.80, 0.45], adding 1 to this would result in [1.10, 1.80, 1.45]. For the second word we would add 2, and so on.\nThere are a few problems with this. First, I could have many words in my input; say I reach the 100th word. Adding 100 to each dimension of that token’s embedding will create very large numbers. 
In neural networks we want to keep inputs on a similar scale to allow for faster convergence, but with our implementation embedding values will grow as we add more words to our input.\nRequirement 4: Positional embeddings should be bounded. This requirement ensures that embeddings stay within reasonable limits and do not grow too large. Another problem presented by our larger integers is that the network will only be trained on lengths up to the largest it has seen, and therefore cannot generalize to arbitrary lengths. Ideally we want our network to work at lengths it has not seen.\nRequirement 5: Positional embeddings should work for any length, even for lengths not seen in training. Lastly, the positional part of our embedding seems to greatly outweigh the original embedding. It might be hard to see what the original input was; in fact, it seems to completely change the embedding and more strongly represent position than semantic information.\nThe graph above shows the embeddings of a random list of words without positional information in reduced dimensional space. We can see clear semantic meaning, with words around the same concept clustering together.\nThe graph above represents the same embeddings with integer positions added based on their position in the text sequence. We can see they no longer cluster as before and they seem to have a different semantic meaning. They also seem to be dominated by some increasing order. Therefore adding a position with an integer is probably not going to work for us, and has led us to a new requirement.\nRequirement 6: Embeddings must retain their semantic meaning.\nSinusoidal Encoding # Based on our requirements so far, we’ve now come to the solution presented in the Attention Is All You Need paper. 
Before jumping to it, let’s build up our understanding of how it works.\nTo help meet our requirements, let’s use a sine function, where the positional encoding is sin(x), with x being our token position and the result added to our embedding element-wise.\nFor example, if our token was at position 2 and its embedding was [0.1, 0.6, 0.8], our resulting embedding with positional information would be [0.1 + sin(2), 0.6 + sin(2), 0.8 + sin(2)].\nIf we take the x axis to be our token position, we can see that the numbers produced are small, helping to retain our embedding semantic information if added to each embedding element-wise. Requirement-wise, it almost meets what we want: it can handle any sequence length as sine can be calculated for any x value, it is bounded between -1 and 1, and it isn’t large enough to alter semantic information. The one gap is requirement 2: Positional Encodings must uniquely identify every word in a sequence.\nAs we can see, sin(x) is periodic and it will eventually repeat itself for a token’s position. For example, sin(2) and sin(353) when rounded will return the same value. This means for positions that return the same value, our attention mechanism will see them as the same position, and if they are the same word, we will have the same problem as not having positional encoding.\nTo help mitigate this we introduce cosine, i.e., cos(x) as shown above, and for each even index of our embedding we use sine and for odd indices we use cosine.\nFor example, if our token was at position 2 and its embedding was [0.1, 0.6, 0.8], our resulting embedding with positional information would be [0.1 + sin(2), 0.6 + cos(2), 0.8 + sin(2)].\nUsing cosine introduces a phase shift, which greatly reduces the likelihood of two positions receiving the same pair of values, or even values near each other. 
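The alternating sine/cosine scheme just described can be sketched in a few lines (plain Python; naive_encoding is a made-up helper name and the embedding values are illustrative):

```python
import math

def naive_encoding(pos, dim):
    # sin(pos) on even embedding indices, cos(pos) on odd indices
    return [math.sin(pos) if i % 2 == 0 else math.cos(pos) for i in range(dim)]

embedding = [0.1, 0.6, 0.8]  # the same token appearing twice in a sequence

# The same word at positions 2 and 5 now receives different vectors
at_2 = [e + p for e, p in zip(embedding, naive_encoding(2, 3))]
at_5 = [e + p for e, p in zip(embedding, naive_encoding(5, 3))]

# But sin and cos share the period 2*pi, so the encoding nearly repeats:
# position ~353.9 (= 2 + 56 full periods) collides with position 2
collision = naive_encoding(2 + 2 * math.pi * 56, 3)
```

Since both functions repeat every 2π, two far-apart positions can still land on virtually identical encodings.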
Even so, there is still a chance of repeating at longer sequence lengths, and with the regular periodicity it is hard to build relative patterns across long distances. This means we might not be meeting requirement 3: We must be able to generalize over positional encoding patterns. With such a high frequency, it will be hard to generalize long-distance patterns.\nWe can improve on generalizing over long-range patterns and reduce the chance of results collapsing into the same position by introducing lower frequencies. That is, we can create a general function which is sin(x/i) for even indices and cos(x/i) for odd indices, where i is the index within our embedding.\nFor example, if our embedding was [0.1, 0.6, 0.8, 0.9] and the token was at position 2, our resulting embedding with positional information would be [0.1 + sin(2/1), 0.6 + cos(2/2), 0.8 + sin(2/3), 0.9 + cos(2/4)].\nThe graph has been expanded on the x axis compared to previous graphs, to show how dividing by the index i decreases the frequency. We would do this for all the dimensions of the embedding; therefore if our embedding size was 768, we would have 768 individual values, where the frequency decreases as we move closer to the end of the embedding.\nThe decreased frequency can be used to generalize over long distances, and the extra dimensions that highlight short, medium, and long distances allow our attention matrices to learn parameters that discard certain ranges and focus on what is needed for a particular attention head.\nWith our current solution, as we are only scaling each embedding index linearly by increasing i, we could have many dimensions that don’t convey enough different information and therefore carry redundant positional information.\nOur current formulas are:\n$$PE(pos, 2i) = \\sin\\left(\\frac{pos}{2i}\\right)$$$$PE(pos, 2i+1) = \\cos\\left(\\frac{pos}{2i}\\right)$$Above, PE represents our positional encoding function, pos the position of the token we are encoding, and i our embedding index.\nIn the Attention Is All 
You Need paper, the formulas used are\n$$PE(pos, 2i) = \\sin\\left(\\frac{pos}{10000^{2i/d}}\\right)$$$$PE(pos, 2i+1) = \\cos\\left(\\frac{pos}{10000^{2i/d}}\\right)$$Here we can see that rather than dividing by i, we have 2i as an exponent of 10000, divided by d, which represents the embedding size. This makes the frequencies of the different dimensions scale geometrically rather than linearly, covering a much wider range of frequencies and reducing redundant positional information. Using 10,000 as a base raised to an exponent is what produces the geometric scaling, and d is used to control it. Without d, the frequencies would fall off far too quickly and we might miss important positional information. 10,000 was chosen as the base in the Attention paper as a balance between scaling up and capturing enough information at different ranges. This greater range of frequencies allows the model to generalize better, and gives attention heads more options for the ranges they specialize in.\nAbsolute vs Relative Positional Encoding # So we’ve built out positional encoding in the same way it was designed in the Attention Is All You Need paper, and this method has been used in many models. It has since been improved upon, with the improved schemes used in newer transformer models.\nThe problem with sinusoidal encoding is that it falls into a category known as absolute positional encoding. Absolute encoding adds a particular value or ID to each position, so position 1 receives a certain value, position 2 another, etc. If I have the sentence “A man threw a ball” and another sentence “In the garden a man threw a ball”, I have “man threw” in positions 2, 3 in the first and positions 5, 6 in the second. With sinusoidal encoding, man and threw will receive different positional encodings in each sentence. 
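To make the paper’s formulas concrete, here is a minimal NumPy sketch (the function name `sinusoidal_pe` is illustrative, not from the paper):

```python
import numpy as np

def sinusoidal_pe(seq_len: int, d: int) -> np.ndarray:
    """Build the (seq_len, d) sinusoidal positional encoding matrix:
    PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(same angle)."""
    pos = np.arange(seq_len)[:, None]          # token positions as a column
    two_i = np.arange(0, d, 2)[None, :]        # even dimension indices 2i
    angles = pos / (10000.0 ** (two_i / d))    # one frequency per sin/cos pair
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angles)               # even dimensions
    pe[:, 1::2] = np.cos(angles)               # odd dimensions
    return pe

pe = sinusoidal_pe(seq_len=512, d=64)
# Every value is bounded (requirement 4), any position can be computed
# (requirement 5), and each row is a unique signature for its position.
```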
In language, though, what creates meaning is less about absolute positions of related words and more about the relative position between them. An attention head specialized in detecting subject and verb would have to learn the value for man and threw differently for the two sentences. By providing absolute values for each position, our learned parameters have less chance to generalize. There is some implicit relative positioning in sinusoidal encoding, from the periodicity of sine and cosine waves, but it’s hard to extract, and models are not forced to look at it when absolute positioning is available.\nWith relative positioning, if we took the same sentences our encoding scheme would produce values that highlight the relative position between words, rather than absolute values. This forces the model to use relative values and also helps it generalize more, a property we want for our NLP models. For example, in the case with man and threw in two sentences, our model could extract the fact that the words are next to each other, without having to memorize exact embedding plus positional values for different situations.\nRequirement 7: Positional encoding should model relative distances RoPE (Rotary Positional Embedding) # One way that has been used in many models to provide relative positional embeddings is RoPE.\nLet’s imagine we have sentence 1 “The mouse ate some cheese” and sentence 2 “In the house a cat ate some cheese”.\nThe graphs on the top represent simple 2D embeddings for each word in our sentences. The graph below them represents a rotation of each vector based on its position using the transformation pθ, where p is the word position and θ is 25°. For example, some in the first sentence at position 4 would be rotated by 4θ, which is 100°. In the rotated embeddings, ate and cheese have the same relative position in each sentence. 
They end up at different absolute positions, but the angle between them is the same, because the relative distance between them is the same in each sentence. This is what RoPE does: it takes an embedding and rotates it by some θ based on its position.\nRoPE and Attention Math # You can also see that every embedding has its magnitude maintained. No matter where we rotate, the magnitude will be the same. We can see the importance of this by looking at how we calculate attention scores.\n$$q \\cdot k = \\|q\\| \\|k\\| \\cos(\\theta)$$We can see our attention scores rely on the norms of our vectors and the angle between them. Since we have only rotated our vectors, their norms stay the same; therefore any influence of RoPE on attention scores must come through angle changes. Including the RoPE angle difference, our attention score becomes the following, where φ is the additional angle introduced by RoPE.\n$$\\|q\\| \\|k\\| \\cos(\\theta+\\phi)$$This means that if our attention head wants q and k to be highly aligned but they are far apart, the model has to compensate. It can learn to increase the magnitude of the original embedding when projected into q and k, or project them into a space where they are much closer despite RoPE moving them far apart.\nWe’ve seen how rotating embeddings helps transformers identify tokens at different positions by providing another lever to modulate: the angle between tokens. To see how relative position can be picked up by the model, we need to break down our attention score using transposition.\n$$RoPE(q,i)=R(i)q$$The above formula defines a RoPE transformation, where the first parameter is our vector, the second is the position in a text sequence, and R is a rotation matrix. 
Therefore our attention score would be:\n$$RoPE(q,i) \\cdot RoPE(k,j) = R(i)q \\cdot R(j)k$$Writing the dot product in matrix form and applying the rule that the transpose of a product reverses the order:\n$$(R(i)q)^\\top(R(j)k) = q^\\top R(i)^\\top R(j)k$$Since R is a rotation matrix, its transpose is the inverse rotation R(-i), and composing the two rotations gives R(-i)R(j) = R(j-i), so we can simplify to\n$$q^\\top R(j-i)k$$Since j and i are simply positions, we can see that the attention calculation has cleanly extracted the difference in positions, allowing it to use relative distance rather than being intertwined with the actual semantic content of the embeddings. This gives the model a clean lever, the relative angle, to modulate in order to achieve the result it wants. Before, with sinusoidal encoding, the model had to learn to extract position from the semantic information.\nRoPE frequencies # As with the sinusoidal method previously discussed, you might be wondering if the rotation eventually repeats, making the same token at different positions appear at the same angle relative to another token. Like the sinusoidal method, RoPE introduces different frequencies across the embedding. In our example so far we have only considered 2D embeddings. In RoPE, within one embedding we take the indices pairwise (e.g., indices 1 \u0026amp; 2, indices 3 \u0026amp; 4, etc.). Each pair is rotated by some θ, depending on the token position and the index of the pair. As we move further along the embedding, the rotation becomes smaller, resulting in a lower frequency towards the end, as in the sinusoidal method. This gives us unique signatures for repeated tokens in long text sequences. 
It also gives the model the data to focus on what it wants, such as long-range or short-range dependencies.\nTo rotate each pair we can use a rotation matrix\n$$\\begin{bmatrix} \\cos(m\\theta_i) \u0026 -\\sin(m\\theta_i) \\\\ \\sin(m\\theta_i) \u0026 \\cos(m\\theta_i) \\end{bmatrix}$$where m is our token position and the i-th angle is defined as\n$$\\theta_i = \\frac{1}{10000^{\\frac{2i}{d}}}$$This is similar to our sinusoidal formula and produces a similar spread of frequencies; here i is the index of the pair, and θ_i is the rotation angle assigned to that pair.\nFor example, if our embedding was [0.1, 0.4, 0.5, 0.2], after splitting into pairs we have [[0.1, 0.4], [0.5, 0.2]]; our 1st pair is [0.1, 0.4] and 2nd pair [0.5, 0.2]. We then calculate theta for the 1st and 2nd pair using 0 and 1 as i.\n$$\\begin{bmatrix} \\cos(m\\theta_0) \u0026 -\\sin(m\\theta_0) \u0026 0 \u0026 0 \u0026 \\cdots \u0026 0 \u0026 0 \\\\ \\sin(m\\theta_0) \u0026 \\cos(m\\theta_0) \u0026 0 \u0026 0 \u0026 \\cdots \u0026 0 \u0026 0 \\\\ 0 \u0026 0 \u0026 \\cos(m\\theta_1) \u0026 -\\sin(m\\theta_1) \u0026 \\cdots \u0026 0 \u0026 0 \\\\ 0 \u0026 0 \u0026 \\sin(m\\theta_1) \u0026 \\cos(m\\theta_1) \u0026 \\cdots \u0026 0 \u0026 0 \\\\ \\vdots \u0026 \\vdots \u0026 \\vdots \u0026 \\vdots \u0026 \\ddots \u0026 \\vdots \u0026 \\vdots \\\\ 0 \u0026 0 \u0026 0 \u0026 0 \u0026 \\cdots \u0026 \\cos(m\\theta_{d/2-1}) \u0026 -\\sin(m\\theta_{d/2-1}) \\\\ 0 \u0026 0 \u0026 0 \u0026 0 \u0026 \\cdots \u0026 \\sin(m\\theta_{d/2-1}) \u0026 \\cos(m\\theta_{d/2-1}) \\end{bmatrix}$$Using the above sparse matrix we can rotate every pair of the embedding. 
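The pairwise rotation above can be sketched directly in NumPy, without materializing the sparse matrix (the function name `rope_rotate` is illustrative):

```python
import numpy as np

def rope_rotate(x: np.ndarray, m: int) -> np.ndarray:
    """Rotate each consecutive pair (x[2i], x[2i+1]) by the angle
    m * theta_i, where theta_i = 1 / 10000**(2*i/d)."""
    d = x.shape[0]
    out = x.astype(float).copy()
    for i in range(d // 2):
        theta_i = 1.0 / (10000.0 ** (2 * i / d))
        c, s = np.cos(m * theta_i), np.sin(m * theta_i)
        x1, x2 = x[2 * i], x[2 * i + 1]
        out[2 * i] = x1 * c - x2 * s       # first row of the 2x2 rotation
        out[2 * i + 1] = x1 * s + x2 * c   # second row
    return out

q = np.array([0.1, 0.4, 0.5, 0.2])
k = np.array([0.3, 0.2, 0.1, 0.7])
# Rotation preserves each vector's norm, and the dot product of two
# rotated vectors depends only on the relative offset j - i.
```

Shifting both positions by the same amount leaves the dot product unchanged, which is exactly the q⊤R(j−i)k property derived earlier.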
We calculate θ up to d/2 because we use pairs and only need to compute angles for half the size of the embedding.\nEfficient Implementation of RoPE # The sparse matrix can get quite large; rather than holding it in memory and performing the accompanying matrix multiply, we can be more efficient by splitting the rotation into two element-wise products.\nIf we have a 2-dimensional embedding at position 1, rotating it involves the calculation\n$$\\begin{bmatrix} \\cos(\\theta) \u0026 -\\sin(\\theta)\\\\ \\sin(\\theta) \u0026 \\cos(\\theta)\\\\ \\end{bmatrix} \\begin{bmatrix}x_1\\\\ x_2 \\end{bmatrix} = \\begin{bmatrix} x_1\\cos(\\theta)-x_2\\sin(\\theta)\\\\ x_1\\sin(\\theta) + x_2\\cos(\\theta)\\\\ \\end{bmatrix}$$We can break this down into the following components\n$$\\begin{bmatrix}x_1\\\\ x_2 \\end{bmatrix} \\otimes \\begin{bmatrix} \\cos(\\theta)\\\\ \\cos(\\theta) \\end{bmatrix} = \\begin{bmatrix} x_1\\cos(\\theta)\\\\ x_2\\cos(\\theta) \\end{bmatrix}$$First we use element-wise multiplication to cover the cos part of our rotation matrix. The terms still missing are -x_2 sin(θ) on the top row and x_1 sin(θ) on the bottom. Since negative x_2 now sits on top, for every pair we swap the two elements and negate the new top one.\n$$\\begin{bmatrix}-x_2\\\\ x_1 \\end{bmatrix} \\otimes \\begin{bmatrix} \\sin(\\theta)\\\\ \\sin(\\theta) \\end{bmatrix} = \\begin{bmatrix} -x_2\\sin(\\theta)\\\\ x_1\\sin(\\theta) \\end{bmatrix}$$We then simply add the two results together element wise, achieving the same result as the sparse matrix rotation but with a method that doesn’t require constructing large matrices and is more easily vectorized.\n$$\\begin{bmatrix} x_1\\cos(\\theta)\\\\ x_2\\cos(\\theta) \\end{bmatrix} + \\begin{bmatrix} -x_2\\sin(\\theta)\\\\ x_1\\sin(\\theta) \\end{bmatrix}$$ Sources \u0026amp; Further Reading # You could have designed state of the art Positional Encoding – This post was heavily inspired by this original blog post. 
RoFormer: Enhanced Transformer with Rotary Position Embedding – Introduces RoPE (Rotary Positional Embedding), a technique for modeling relative positions through rotation. Attention Is All You Need – The original Transformer paper that introduced sinusoidal positional encodings and the self-attention mechanism.