ML Jargon Glossary Pt. 1

My list of ML jargon that I forget more often than I probably should...

It's not that deep

Preface

At the moment, I learn best when I'm reading and have a writing utensil at hand. I underline/highlight/box stuff I don't understand, and often ask the question: "What is ___?"

The margins of my reading material are filled with these questions, and what's annoying – and even discouraging – is a margin crammed with a huge list of "What is this?"s and "What does this mean?"s.

I'm currently working through a book called Build a Large Language Model From Scratch by Sebastian Raschka, as I'm trying to rebuild the foundations to break back into the NLP field, and I'm noticing that I get tripped up by a handful of jargon whenever I read a paragraph.

I attribute this to my dear arrogance, who rightfully thought she could just "dive right back in." So the conscious and realistic me had to give her a wakeup call to throw her ego out the door.

Anyway, back to the point.

This is a non-exhaustive list of jargon that I ran into -- and you may too -- while rebuilding my ML foundations. I might have a part 2 (or 3 or 4) depending on how long the list is, but as far as this post goes, I'll focus on the terms I ran into while reading about implementing a typical training loop in PyTorch.

Hope this helps,
Ael

Logits

The numerical outputs from a neural network's final linear layer, before they are converted into probabilities

The Simple Version

Logits are the raw numerical scores that neural networks output when making predictions. They are essentially just real numbers that range from -∞ to ∞.

For example, if you're building a classifier that classifies an image as a cat, bear, or a lion, the network might output:

  • Cat: 2.5
  • Bear: -0.3
  • Lion: 1.2

These numbers (2.5, -0.3, and 1.2) are the logits.

Looking at these scores, you may ask: "How do you know whether the model predicted a cat, a bear, or a lion?"

...And that's exactly why we have functions like softmax or sigmoid, which squash the logits from the range (-∞, ∞) (or the Real Numbers) to the range [0, 1] (probabilities). I'll have a whole section on these two functions, so scroll down if you're curious.
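To make that concrete, here's a quick sketch of what that "squashing" looks like in PyTorch, using the cat/bear/lion logits from above (the decimals in the comment are approximate):

import torch

logits = torch.tensor([2.5, -0.3, 1.2])  # cat, bear, lion
probs = torch.softmax(logits, dim=0)     # squash scores in (-∞, ∞) into probabilities
print(probs)                             # roughly tensor([0.7500, 0.0456, 0.2044])

The highest logit (2.5 for "cat") becomes the highest probability, so the model's prediction is "cat."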

Some Etymology For Those Who Are Curious

Exercise your free will and skip over this section if you want 😃

The name comes from the mathematical concept of "log-odds."

When you convert a logit into a probability with the sigmoid function (again, more on this below), and then take the log of the odds of that probability – log(p / (1 − p)) – you get back your original logit. In other words, logits are the logarithm of the odds.

A Quick Primer on Odds VS. Probability

If you're already familiar with the difference between odds and probability, feel free to skip over this introductory section.

Difference between probability & odds in a nutshell

Odds is a concept that's often used interchangeably with probability, and while the two are related, they're not the same.

Here are two questions to get started:

What is the probability of getting a 6 when you throw a 6-sided die?
The probability of getting a 6 is 1/6
What are the odds of getting a 6 when you throw a 6-sided die?
The odds of getting a 6 are 1:5
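If it helps to see the conversion spelled out, here's a tiny worked snippet in plain Python going from the probability 1/6 to the odds 1:5:

p = 1 / 6            # probability of rolling a 6
odds = p / (1 - p)   # odds = P(event) / P(not the event)
print(odds)          # ≈ 0.2, i.e. odds of 1:5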

The Logit Function

Now, in math, there's a logit function that converts probabilities to log-odds:

logit(p) = ln(p / (1 − p))

This function takes probabilities (range [0, 1]) and "projects" them out to the set of real numbers (range (-∞, ∞)). Here's a graphical representation:

Logit Function

I like to think that neural networks borrowed this term because their raw outputs (the logits) have the same property – they can be any real number and represent log-odds of the predicted probabilities. But importantly, we don't actually use the logit function directly in neural networks; we just call the raw outputs "logits" because of this parallel.
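To see that "log-odds" connection in code, here's a small sketch of the binary (sigmoid) picture, where the round trip from probability back to log-odds really does hand back the original numbers (torch.logit is PyTorch's built-in log-odds function, and the z values here are just made up):

import torch

z = torch.tensor([2.5, -0.3, 1.2])  # some made-up logits
p = torch.sigmoid(z)                # probabilities in (0, 1)
print(torch.logit(p))               # log-odds: back to roughly tensor([2.5000, -0.3000, 1.2000])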

Softmax & Sigmoid

These functions are two examples of non-linear activation functions, which deserve a post or posts of their own. As far as this post goes, I'll focus on defining softmax and sigmoid.

Softmax

A math function that takes in a list of raw numbers (in this context, logits) and converts them into probabilities – usually used for multi-class classification (number of classes > 2)

Sigmoid

A math function that takes a real number and converts it into a probability – usually used for binary classification (number of classes = 2)

If that doesn't cut it for you, here's some math 😅:

I'll start with Softmax, since it's essentially a more generic form of the Sigmoid function.

Softmax Function Equation:

softmax(z)ᵢ = e^(zᵢ) / Σⱼ e^(zⱼ)

z is the list of real numbers, or if you want to get fancy, a vector; the i-th output is the probability for class i

Our friends at PyTorch did us a great favor and saved us from having to do a bunch of exponential calculations. In code, the function is pretty simple:

import torch

a = torch.tensor([3.2, -1.0, 0.7])
torch.nn.functional.softmax(a, dim=0)

Same list of numbers going through softmax

And the result:

tensor([0.9115, 0.0137, 0.0748])

...but of course, me being me, I had to prove it out myself:

The example calculates the softmax output for 3.2, which results in 0.9115
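If you want to run the same sanity check yourself, here's roughly what it looks like written out in plain Python:

import math

logits = [3.2, -1.0, 0.7]
denominator = sum(math.exp(z) for z in logits)  # e^3.2 + e^-1.0 + e^0.7
print(math.exp(3.2) / denominator)              # ≈ 0.9115, matching PyTorch's first entry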

If you look at the input to the Softmax equation, you may have noticed that it is a list of numbers. I like to connect this observation to the common practice of using Softmax for multi-class classification (e.g., taking handwritten digits and classifying them 0 through 9).

Speaking of the number of outputs, I like to think of Sigmoid as a "special" version of Softmax, where we only have 2 classes – a.k.a. binary classification.

Sigmoid Function

Here's what it looks like:

sigmoid(z) = 1 / (1 + e^(−z))
Sigmoid Function

As I mentioned, you can think of Sigmoid as a version of Softmax with the input being a list of 2 numbers (2 raw scores). Since a binary classification is a simple yes/no decision, you essentially only need one probability. For example, if the probability of an email being spam is 0.8, then the probability of it being non-spam is 1 - 0.8 = 0.2.

How Sigmoid is a "special version" of Softmax

Since Softmax with 2 outputs only depends on the difference between the two scores, it would be redundant to compute a separate probability for each: the probability of the second class is just 1 minus the probability of the first. That's why it's more efficient to use Sigmoid when working with binary classification.
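Here's a quick sketch that makes the "special version" claim concrete: softmax over two raw scores gives the same probability as sigmoid applied to their difference (the two scores below are just made up):

import torch

z = torch.tensor([2.0, 0.5])       # two raw scores: [class A, class B]
print(torch.softmax(z, dim=0)[0])  # P(class A) via 2-class softmax
print(torch.sigmoid(z[0] - z[1]))  # same probability via sigmoid of the difference

Both lines print roughly 0.8176, which is why binary classifiers can get away with a single sigmoid output.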

Loss

The discrepancy between a neural network's predictions and the actual labels

As much as I hate being wrong or making mistakes, it will always happen, and despite the pain, getting the wrongness corrected is the moment when I learn & grow the most.

Now, I'm not here to give neural networks a persona, but the same goes for them. If they produce a prediction, you have to give them feedback on whether they were wrong (in the case of classifiers) and how much they were wrong by – i.e., the extent of their "wrongness."

The loss is a value that tells you how wrong a model's prediction is, and depending on the type of task you're working with (i.e., regression or classification), the way of measuring the loss differs. Nonetheless, the general idea here is that the loss value ultimately contributes to making models "better." This too deserves a post of its own IMO.
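As a tiny (and very much simplified) sketch of how the measurement differs by task: regression losses typically compare predicted values to target values directly, e.g. with mean squared error, while classification losses like cross-entropy (next section) compare predicted probabilities against the correct class. The tensors here are just made-up numbers:

import torch
import torch.nn.functional as F

# Regression: compare predicted values to target values
pred = torch.tensor([2.5, 0.0, 1.8])
target = torch.tensor([3.0, -0.5, 2.0])
print(F.mse_loss(pred, target))            # mean squared error, ≈ 0.18

# Classification: compare raw logits to the index of the correct class
logits = torch.tensor([[2.5, -0.3, 1.2]])  # one example, three classes (cat, bear, lion)
label = torch.tensor([0])                  # correct class: index 0 ("cat")
print(F.cross_entropy(logits, label))      # cross-entropy, ≈ 0.2877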

Cross Entropy

Cross-entropy is a type of "loss," usually used for classification tasks.

A value that represents how different two probability distributions are: the predicted probabilities from the model vs. the true answers.
(Or, how "surprised" the model is when it compares its outcomes with the true labels)

When I came across this term after my ML hiatus, the question that immediately hit me was:

What's so "crossed" (cross) about how chaotic something is (entropy)?

As someone who was pretty passionate about physics at one point in my life, I couldn't get away from the way I had framed entropy: how chaotic, or disordered, the particles of a system are. However, I wouldn't say entropy in the ML context is too far off. It's more about how "surprised" a model would be after it compares its predictions to the true labels. Here, the intuition of “more disorder = more uncertainty” carries over.

Example (skip if not needed)

Let's say we're dealing with an image classification model, where, given an image, the model predicts whether the image is a cat, bear, or lion.

In this example, the model's prediction looks pretty accurate: it assigns a 0.75 probability that the image is of a cat, and the true label confirms that the image is indeed a cat.

So, in terms of how "surprised" the model is, it shouldn't be too surprised, given that it predicted a higher chance of the image being a cat. You can think of the numerical extent of "surprise" as the cross-entropy value.

Speaking of surprises, cross-entropy also has a mathematical equation:

H(p, q) = −Σᵢ pᵢ log(qᵢ)
Cross-entropy Equation, where p is the true label distribution and q is the model's predicted probabilities

Of course, the log is a natural log (log base e).

Using the above equation, I'll go over 3 cases to show how the cross-entropy value changes with varying "surprise levels."

Case 1: Confident Correct Prediction
A confident correct prediction yields a small cross entropy value

This is the case of the probability distribution where the model predicts:

  • 75% chance of the image being a cat
  • 15% chance of the image being a bear
  • 10% chance of the image being a lion

Since it places the highest probability on the correct answer, the cross-entropy value will be low: there isn't much surprise when you confidently predict the correct outcome and turn out to be right.

Case 2: Uncertain Prediction
An uncertain prediction like a random guess has a higher cross entropy value

In case of an uncertain prediction, the cross-entropy value is going to be greater than the value in the first scenario.

In the example above, I have probabilities that resemble a random guess – roughly 1/3 bet on each option. Because the model isn't as "confident" in its predictions, there is a little more surprise when they are compared to the actual labels.

Case 3: Confident Wrong Prediction
A confidently wrong prediction yields the highest cross entropy value

In this last example, we can see that the model's predictions are a lot more off than what we saw in case #1 – the model thinks there is the highest chance of the image being a lion (70%) and the lowest chance of it being a cat (10%), which is the actual label.

In this case, we can say the model is "confidently wrong," because it's assigning a large probability to the wrong answer. It's like when you just know you chose the correct answer, only to find out later that you were completely wrong, leading to "more surprise."
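Putting numbers on all three cases: with a one-hot true label (i.e., 100% on the correct class), the cross-entropy equation above collapses to just -log(probability the model assigned to the correct class), so a quick script (assuming the cat probabilities from the three cases above) shows the "surprise" climbing:

import math

# Probability the model assigned to the correct class ("cat") in each case
cases = {
    "confident correct": 0.75,
    "uncertain (random guess)": 1 / 3,
    "confident wrong": 0.10,
}
for name, p_cat in cases.items():
    print(name, -math.log(p_cat))  # cross-entropy with a one-hot true label

This prints roughly 0.29, 1.10, and 2.30 – low, medium, and high surprise, in the same order as the three cases.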

The bottom line of cross-entropy is that it comes from the idea of entropy (the uncertainty or surprise in a distribution), and it’s called cross-entropy because it measures the surprise of the model's prediction against the true labels.

Closing

When I was first planning this post, I didn't realize I would hit an almost-2,000 word count. So I plan to write up another post on the handful of other jargon I want to clarify. I was probably right about this becoming its own series lol 😆

Anyway...

I'm still on my learning journey with all this, so if you find anything inaccurate, please let me know -- I'd really appreciate it.

...And thanks if you made it this far!

Until the next post of this series,
Ael