## What is machine learning?

Machine learning is a statistical science whose goal is to find the statistical regularities in an environment and to build a system that performs in that environment as well as a physical system would, or even better.

Just as any intelligent living being needs to be aware of its environment in order to learn, a machine learning (ML) system needs information about its environment to learn these regularities. We provide this information to the ML system as a set of vectors called the input pattern vectors. Input pattern vectors are a subset of the feature space, a vector space that contains all the events in the environment in a transformed representation. This kind of feature transformation is important because in many cases it lets us reduce the dimensionality of the original vector space.

We call the output, or action, of the ML system on the environment the output pattern vector, and the output we expect from the ML system the desired output vector or desired output response.

OK, now we have the inputs to the system and we know what to expect from it. So how do we know whether the system is working as expected? One good way is to measure the difference between the output of our system and the desired output. In a static system this is simple: you literally do what the previous sentence says. In a dynamic system things are a bit different.

So for the time being, let’s treat our ML system as a black box. Let’s also assume that the system’s action is determined by a parameter vector $\theta$. Then, given an input pattern vector $\mathbf{s}$, we can write our ML system’s output response as $\hat{y}(\mathbf{s},\theta)$.
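To make the notation concrete, here is a minimal sketch of an output response function. The linear form and the name `predict` are assumptions chosen for illustration; the black box described above could be any function of the input pattern and the parameters.

```python
def predict(s, theta):
    """Output response y_hat(s, theta) of a hypothetical linear model.

    s     -- input pattern vector (list of floats)
    theta -- parameter vector of the same length (list of floats)
    """
    # Weighted sum of the input components; the parameters alone
    # decide how the system responds to a given pattern.
    return sum(w * x for w, x in zip(theta, s))


s = [1.0, 2.0]          # an example input pattern vector
theta = [0.5, -0.25]    # an example parameter vector
y_hat = predict(s, theta)
```

Changing `theta` changes the response to the same input `s`, which is exactly the handle the learning procedure will use.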

Here comes the idea of the empirical risk minimization framework. Given the input and output vectors, we can define a loss function that measures the error between the predicted response and the desired response. The expected value of this loss over the whole population of data is called the true risk. But in real-world scenarios we never have access to the whole population. So we assume that the data we have at hand follows the same distribution as the population and use the average loss over that data as an approximation of the true risk, hence the term empirical. We then look for the parameter values that minimize this risk (error) between the output response and the desired response. This process is called empirical risk minimization.
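As a sketch in symbols, assume the patterns are drawn from a probability density $p_e(\mathbf{s})$ over the environment (the names $p_e$ and $l$ are introduced here for illustration only) and let $\mathbf{c}(\mathbf{s},\mathbf{\theta})$ denote the loss on a single pattern. The true risk is then the expected loss

$\displaystyle l(\theta) = \int \mathbf{c}(\mathbf{s},\mathbf{\theta})\, p_e(\mathbf{s})\, d\mathbf{s}$

which cannot be computed because $p_e$ is unknown. Averaging the loss over the observed patterns gives a computable estimate of it.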

So if $\mathbf{c}(\mathbf{s},\mathbf{\theta})$ is the loss function that computes the error between the predicted response and the desired response for a pattern $\mathbf{s}$, we can define our empirical risk function $\hat{l}_n (\theta)$ over $n$ observed patterns $\mathbf{s}_1, \dots, \mathbf{s}_n$ as

$\displaystyle \hat{l}_n (\theta) = \frac{1}{n} \sum_{i=1}^n \mathbf{c}(\mathbf{s}_i,\mathbf{\theta})$
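The average above can be sketched in a few lines. The squared-error loss, the linear model, and the names `predict`, `loss`, and `empirical_risk` are assumptions made for this example. Each training sample here bundles an input pattern with its desired response.

```python
def predict(s, theta):
    """Hypothetical linear model: output response for input pattern s."""
    return sum(w * x for w, x in zip(theta, s))


def loss(s, y, theta):
    """Squared error between the predicted and the desired response."""
    return (predict(s, theta) - y) ** 2


def empirical_risk(samples, theta):
    """Average loss over the n observed (input, desired response) pairs."""
    return sum(loss(s, y, theta) for s, y in samples) / len(samples)


# Toy data generated by the rule y = 1*s[0] + 2*s[1] (illustrative only).
samples = [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0), ([1.0, 1.0], 3.0)]

risk_bad = empirical_risk(samples, [0.0, 0.0])   # parameters far from the rule
risk_good = empirical_risk(samples, [1.0, 2.0])  # parameters matching the rule
```

A lower empirical risk means the parameters fit the observed data better: here `risk_good` is zero because the parameters reproduce every desired response exactly, while `risk_bad` is not.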

Our objective here is to find a $\theta$ that minimizes the above function. To start, we give $\theta$ a random value, call it $\theta_{0}$, and compute the empirical risk. By monitoring the empirical risk we can see whether we are getting closer to the optimal $\theta$. The rate of change of the empirical risk with respect to $\theta$ is given by its derivative, i.e. $\displaystyle \frac{d\hat{l}_n(\theta)}{d\theta}$ (the gradient, when $\theta$ is a vector).

So given an initial parameter $\theta_{0}$ (remember, we choose this value), we can compute $\theta$ at iteration $t+1$ as

$\displaystyle \theta_{t+1} = \theta_{t} - \gamma_{t} \left. \frac{d\hat{l}_n(\theta)}{d\theta} \right|_{\theta = \theta_{t}}$

where $\gamma_t$ is called the learning rate and $t$ counts the iterations (the $n$ in $\hat{l}_n$ is still the number of observed patterns).

This idea is called the method of gradient descent and is the essence of a huge number of practical machine learning algorithms.
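The update rule can be sketched for the simplest possible case: a single scalar parameter, a model $\hat{y} = \theta s$, and a squared-error loss. These choices, the constant learning rate, and the function names are all assumptions made for illustration. In general the learning rate may change from one iteration to the next.

```python
def risk(theta, data):
    """Empirical risk: mean squared error of the model y_hat = theta * s."""
    return sum((theta * s - y) ** 2 for s, y in data) / len(data)


def risk_derivative(theta, data):
    """Derivative of the empirical risk with respect to theta."""
    return sum(2.0 * (theta * s - y) * s for s, y in data) / len(data)


def gradient_descent(data, theta0, learning_rate, steps):
    """Repeatedly step against the derivative, starting from theta0."""
    theta = theta0
    for _ in range(steps):
        # New theta = old theta - learning rate * derivative at old theta.
        theta = theta - learning_rate * risk_derivative(theta, data)
    return theta


# Toy data following y = 2 * s, so the minimizing theta is 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
theta_hat = gradient_descent(data, theta0=0.0, learning_rate=0.05, steps=200)
```

With this data and learning rate the iterates approach 2, and the empirical risk at `theta_hat` is lower than at the starting point. A learning rate that is too large would make the iterates overshoot and diverge instead.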