
You've probably been taught AI as a ritual:

Pick a model, write a loss, run gradient descent.

If that sounds like you, you're missing the big picture.

There's a more intuitive way of seeing it that will make things click so much better:

Every AI model can be looked at probabilistically.

In this blog, I'm going to show you exactly how.

We'll explore the probabilistic version of the simplest ML model, linear regression, and you'll realize that the "standard" way of learning it hides the real idea.

What even is linear regression?

Here's a quick refresher.

Linear regression is all about finding a line that passes as close as possible to a set of datapoints.

To do this, we normally minimise the error between our predicted values and the actual values.

The typical error function we use is the Mean Squared Error (MSE), which looks like this:

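$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Here yᵢ is the actual value, ŷᵢ is the predicted value, and n is the number of datapoints.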
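If you'd rather see MSE as code, here's a minimal NumPy sketch. The function name and the numbers are made up purely for illustration; they aren't from Harry's data.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Made-up numbers, purely for illustration.
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

# Squared errors are 1, 0 and 4, so the mean is 5/3.
mse_example = mean_squared_error(y_true, y_pred)
print(mse_example)
```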

But why are we squaring it? Where did this expression really come from?

When we look at linear regression through a probabilistic lens, it'll be super clear why the error function takes this specific form.

But for now, let's forget all this error function stuff, and take a step back.

What does it mean to look at linear regression probabilistically?

Let's set up a simple problem

Meet Harry.


Harry wants to predict the points scored by Gryffindor in the Quidditch match tomorrow.

According to him, the score depends on two things:

  • practice hours
  • team energy level (rated 0–10 on match day)

Harry assumes the score (y) is some weighted combination of these two inputs (x₁ and x₂).

$$y = w_1 x_1 + w_2 x_2$$

In other words, he's thinking of a linear regression model.


But here's the thing:

Harry needs to figure out the optimal weights (w₁ and w₂) so he can predict how many points Gryffindor will score tomorrow based on the practice hours and energy level.

Harry's got some data

Let's say Harry has collected data from previous matches, and it looks like this:

[Table of past matches: practice hours (x₁), energy level (x₂), and points scored (y). Image by author]

Now, we're assuming this data came out of some linear regression model.

If we can figure out the parameters of that model, we can figure out what the expected y is even for unseen inputs.

But nothing in life is simple

It can't be that simple. Assuming that, in the real world, y will always be some precise linear combination of x₁ and x₂ would be absurd.

There's always some error involved.

Let's say the observed score is the output of the model, a linear combination of x₁ and x₂, plus some error.

And let's assume this error follows a Gaussian (normal) distribution with mean 0.

That makes y a random variable!

It's centred at the linear combination (w₁x₁ + w₂x₂), but it can also deviate from it in either direction.

$$y = w_1 x_1 + w_2 x_2 + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2)$$

So for each datapoint i, we can write:

$$y_i \sim \mathcal{N}(x_i^\top w, \sigma^2)$$

Here, yᵢ is just a random variable sampled from a normal distribution centred at xᵢᵀw and with variance σ².
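To see what this assumption means in practice, here's a small sketch that generates fake match data exactly the way the model says it is generated. The weights, the noise level, and the input ranges are all hypothetical values I picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" parameters, chosen only for illustration.
w_true = np.array([4.0, 10.0])  # weights for practice hours and energy level
sigma = 5.0                     # standard deviation of the Gaussian error

n = 200
X = np.column_stack([
    rng.uniform(0, 20, size=n),  # practice hours (made-up range)
    rng.uniform(0, 10, size=n),  # energy level, rated 0-10
])

mean = X @ w_true                      # x_i^T w for every datapoint
y = rng.normal(loc=mean, scale=sigma)  # y_i ~ N(x_i^T w, sigma^2)

# The deviations from the line should average out near 0,
# with a spread close to sigma.
residuals = y - mean
print(residuals.mean(), residuals.std())
```

Every run with a different seed gives different scores, but they always scatter around the same underlying line.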

This probabilistic framing changes everything.

Enter the Maximum Likelihood Estimator

There's something called the Maximum Likelihood Estimator (MLE): it picks the parameters under which the data we observed is most likely.

The point of it is simple:

Given the parameters w and inputs x, what is the probability (or likelihood) of obtaining these specific values of y?

How do we actually quantify this?

Well, remember the Gaussian distribution? It has a mathematical form that looks like this:

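$$\mathcal{N}(y \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - \mu)^2}{2\sigma^2}\right)$$

This is the Gaussian distribution's functional form, with mean μ and variance σ².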

Since there's a mathematical form to it, we can write down the probability density function of any specific random variable yᵢ in the data.


The probability density of yᵢ given xᵢ and w is:

$$p(y_i \mid x_i, w) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - x_i^\top w)^2}{2\sigma^2}\right)$$

This essentially tells us: how likely is it that we observed this particular value of yᵢ, given our input xᵢ and our model parameters w?
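Here's a rough sketch of that idea for a single datapoint. The input, the weights, and the noise level are hypothetical numbers; the point is only that a score near xᵢᵀw gets a much higher density than a score far from it.

```python
import numpy as np

def gaussian_pdf(y, mean, sigma):
    """Density of a normal distribution with the given mean and standard deviation."""
    return np.exp(-((y - mean) ** 2) / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

# Hypothetical values, for illustration only.
x_i = np.array([10.0, 7.0])  # 10 practice hours, energy level 7
w = np.array([4.0, 10.0])    # candidate weights
sigma = 5.0                  # standard deviation of the error

centre = x_i @ w  # 4*10 + 10*7 = 110

p_at_centre = gaussian_pdf(110.0, centre, sigma)  # score right on the line
p_far = gaussian_pdf(130.0, centre, sigma)        # score 20 points off

print(p_at_centre, p_far)
```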

Now let's multiply them all together

Assuming the datapoints are independent of each other, the likelihood of obtaining ALL of these yᵢ's is simply P(y₁) × P(y₂) × P(y₃) × …

We're just multiplying them together.

So the full likelihood looks like this:

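$$L(w) = \prod_{i=1}^{n} p(y_i \mid x_i, w) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - x_i^\top w)^2}{2\sigma^2}\right)$$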

Now, if we say that the best w is the one where this likelihood is maximised, we frame it mathematically like this:

$$\hat{w} = \arg\max_{w} \prod_{i=1}^{n} p(y_i \mid x_i, w)$$

The magic happens when we simplify

To make things easier, we take the log of the likelihood (since log is a monotonic function, maximising the log-likelihood is the same as maximising the likelihood).

When we do this, the product becomes a sum, and the exponentials simplify nicely.

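$$\log \prod_{i=1}^{n} p(y_i \mid x_i, w) = \sum_{i=1}^{n} \log p(y_i \mid x_i, w) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - x_i^\top w)^2$$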
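If you want to convince yourself that the log really turns the product into this tidy sum, here's a quick numerical check. The three datapoints, the weights, and sigma are made up for illustration, and `log_likelihood` is just my name for the helper.

```python
import numpy as np

def log_likelihood(w, X, y, sigma):
    """Gaussian log-likelihood: a constant term minus the scaled sum of squared errors."""
    n = len(y)
    residuals = y - X @ w
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - np.sum(residuals**2) / (2 * sigma**2)

# Tiny made-up dataset, for illustration only.
X = np.array([[10.0, 7.0], [12.0, 5.0], [8.0, 9.0]])
y = np.array([112.0, 95.0, 125.0])
w = np.array([4.0, 10.0])
sigma = 5.0

# Route 1: multiply the individual Gaussian densities, then take the log.
densities = np.exp(-((y - X @ w) ** 2) / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
log_of_product = np.log(np.prod(densities))

# Route 2: use the simplified sum directly.
closed_form = log_likelihood(w, X, y, sigma)

print(log_of_product, closed_form)
```

With many datapoints the raw product can underflow to zero, which is one practical reason to work with the log version.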

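The only part of this log-likelihood that depends on w is the sum of squared errors, and it enters with a negative sign. So maximising the log-likelihood is the same as minimising the term: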

Σᵢ (yᵢ − xᵢᵀw)²

$$\hat{w} = \arg\min_{w} \sum_{i=1}^{n} (y_i - x_i^\top w)^2$$

And guess what? Up to a constant factor of 1/n, which doesn't change where the minimum is, this is exactly the Mean Squared Error (MSE) that we minimise in standard linear regression!

So essentially, the probabilistic view gives us the same MSE minimisation that we've been using all along.
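Here's a small experiment you can run to check this yourself. It's a sketch on synthetic data with hypothetical weights and noise level: we fit ordinary least squares with NumPy's `np.linalg.lstsq`, then see whether any nearby choice of w gets a higher Gaussian log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data from hypothetical parameters (illustration only).
w_true = np.array([4.0, 10.0])
sigma = 5.0
X = np.column_stack([rng.uniform(0, 20, 100), rng.uniform(0, 10, 100)])
y = X @ w_true + rng.normal(0.0, sigma, size=100)

def log_likelihood(w):
    """Gaussian log-likelihood of the data for weights w."""
    residuals = y - X @ w
    return -0.5 * len(y) * np.log(2 * np.pi * sigma**2) - residuals @ residuals / (2 * sigma**2)

# Ordinary least squares: the w that minimises the sum of squared errors.
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Try 1000 randomly perturbed weight vectors around the least-squares fit.
perturbed = w_ls + rng.normal(0.0, 0.5, size=(1000, 2))
best_perturbed = max(log_likelihood(w) for w in perturbed)

print(w_ls)
print(log_likelihood(w_ls) >= best_perturbed)
```

Changing `sigma` changes the value of the log-likelihood but not which w comes out on top, which is why we never had to know the noise level to fit the line.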

What does this mean?

Here's the beautiful part:

When you minimise MSE in linear regression, you're not just arbitrarily choosing an error function.

You're actually finding the parameters that make your observed data most likely, assuming Gaussian errors.

I find this perspective way more satisfying than just saying "let's minimise error".


Wrapping up

In this blog, we looked at linear regression from a completely different angle: the probabilistic angle.

Every model has a probabilistic interpretation, and understanding these interpretations will undoubtedly make you a better ML practitioner, whatever kind you may be.