I have been exploring Machine Learning for about a year, but only recently, while on vacation with my laptop, did I have a chance to catch up with my Stanford ML class. In my experience, unlike other new things I have learned, ML isn't something one can pick up by watching a couple of Pluralsight videos or reading a book. Machine Learning does, however, truly open the door to solving a new class of problems that would be impossible or very hard to tackle with traditional algorithmic approaches.

To start off with a trivial problem that can, as a matter of fact, be solved without ML but gives a sense of what it's all about, suppose that I have a number of data points collected over a period of time. In my example, those were code coverage numbers (1-100%) collected daily. Sure, I can plot these points on an X-Y scatter plot, but what I want is a trend: is my code coverage getting better or worse? I might also want to predict what my code coverage will be in the future based on the given data. So what I need is to come up with what is called a **Regression Line**, which is a line close enough to all my data points:

There are two ways that I know of to come up with a regression line. The first one is to simply compute it from the given point coordinates. Khan Academy has a very fun series of videos on that subject. Knowing the line parameters (m and b in y = mx + b) gives us the trend, which is "m", and lets us compute "y" for any given "x". So it's pretty simple, no hairy math.
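As a rough sketch of that first, direct approach, the slope and intercept can be computed with the standard least-squares formulas. This is Python rather than anything taken from the videos, and both the `fit_line` helper and the coverage numbers are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (m, b) for the least-squares line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # The least-squares line always passes through the point of means.
    b = mean_y - m * mean_x
    return m, b


# Hypothetical data: day number and code coverage in percent.
days = [1, 2, 3, 4, 5]
coverage = [52.0, 54.0, 56.0, 58.0, 60.0]

m, b = fit_line(days, coverage)
predicted_day_10 = m * 10 + b  # extrapolate the trend to a future day
```

A positive `m` means coverage is trending up, a negative one means it is getting worse.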

So how can Machine Learning be used to solve this toy problem? Here's a little bit of theory. First of all, there is what is called a hypothesis. In this example the **hypothesis** is the line equation y = mx + b, where the parameters are called theta zero and theta one:
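Written out in the theta notation, with theta zero playing the role of b and theta one the role of m, the hypothesis would look something like this:

$$
h_\theta(x) = \theta_0 + \theta_1 x
$$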

The main idea is to choose the parameters theta so that the hypothesis h is close to the "y's" in our training examples. To measure how close it is, there is a **cost function**, which in this case is the *squared error cost function*. This formula basically says: for each example in the training set, take the difference between the hypothesis and the actual "y", square it, and add the results up:
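Assuming the usual squared-error form with the conventional 1/(2m) scaling, the cost function can be written as follows. Note that m here is the number of training examples, not the slope from y = mx + b, and the superscript (i) indexes the i-th example.

$$
J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2
$$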

This finally takes us to the **Gradient Descent** algorithm, which tries to minimize the given cost function. With each step of Gradient Descent, the parameters theta come closer to the optimal values that achieve the lowest cost J. To illustrate with a picture from Andrew Ng's class: we look around and try to go downhill from where we stand. Once we reach the minimum of the cost function, we grab the parameters theta, which define our regression line.
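To make the "go downhill" idea concrete, here is a minimal Python sketch of batch gradient descent for a line. It is not the Octave code from the lectures; the function names, the learning rate `alpha`, the iteration count, and the data are all assumptions chosen for illustration.

```python
def cost(theta0, theta1, xs, ys):
    """Squared-error cost J with the conventional 1/(2m) scaling."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradient_descent(xs, ys, alpha=0.05, iterations=5000):
    """Start at theta = (0, 0) and repeatedly step against the gradient of J."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # Prediction error for every training example at the current thetas.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of J with respect to theta0 and theta1.
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Update both parameters simultaneously.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1


# Hypothetical training data, not the coverage numbers from this post.
xs = [1, 2, 3, 4, 5]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

start_cost = cost(0.0, 0.0, xs, ys)
theta0, theta1 = gradient_descent(xs, ys)
final_cost = cost(theta0, theta1, xs, ys)
```

If `alpha` is too large the steps overshoot and the cost grows instead of shrinking, so comparing `start_cost` with `final_cost` is a quick sanity check.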

To see how it actually works in practice, I adapted a bit of Octave code from the ML lectures (Octave is a high-level language for numerical computations, similar to MATLAB). Here's the source.

The sample has very few moving parts. First, there is the cost function:

Which is plugged into the Gradient Descent algorithm:

It produces this regression line:

What I like about Octave is its built-in surface plot. When Gradient Descent started, the initial values of theta zero and theta one were zero, which produced a cost of J = 292. When the algorithm converged, J had dropped to 15.4, with corresponding theta values of 0.5 and 1.14. So essentially the regression line equation is h = 0.5 + 1.14x.

You can see that these aren't the only possible values, but they are close. Isn't it a cool picture?