Problem explanation
This assignment is about the problems of overfitting and underfitting at various polynomial degrees, and about ways to solve them.
Let's first understand a few terms.
Linear regression:-
It is a linear approach to modelling the relationship between scalar variables.
The most basic linear equation is Y = M*X + C,
where C is the intercept and M is the slope.
This is how we can represent a linear relationship between X and Y.
But what should we do if we are not able to predict using this line?
The answer is that we can use polynomial regression.
Polynomial regression is nothing but linear regression on higher powers of X.
For example, y = w0 + w1*x + w2*x^2.
This is a degree-2 equation.
This is one of the ways we can represent the relationship between X and Y.
Which degree to use really depends on the nature of the data.
Here our data comes from a sine function, so a higher degree such as 3 may give a better prediction; such a fit is neither underfitting nor overfitting the data.
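As a small illustration (not the assignment code), the sketch below shows that fitting a degree-2 polynomial is just ordinary least squares on the expanded features 1, x, x^2; the data values here are only illustrative.

import numpy as np

# Illustrative points only; the assignment generates its own data in Step 1.
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x)

# Degree-2 design matrix with columns 1, x, x^2
X_poly = np.vstack([x**0, x**1, x**2]).T

# Least-squares weights w0, w1, w2
w, *_ = np.linalg.lstsq(X_poly, y, rcond=None)

y_pred = X_poly @ w  # predictions of the degree-2 model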
Overfitting means the model fits the training data almost perfectly, passing through nearly every training point, but it then generalizes poorly to new data.
Underfitting means the model fits only a small part of the data; the prediction line produced by the model will never predict well (with low error) on either the training or the test data.
Ways to solve this problem
1) Regularization using a penalty - add a penalty term to the error function to reduce overfitting.
We can use L1 or L2 regularization. Here we use L2 regularization, because the penalty is the sum of the squared weights.
2) Adding more data (changing the size from 10 to 100) - here I increased the number of data points from 10 to 100, so the model fits the data properly and overfitting is avoided in this case.
3) Add a validation data set - in this case we divide the input data set into 3 parts:
training, validation, and testing.
The training set is for trying out different hyperparameters, such as the number of epochs and the degree of the model (2, 3, 4, ..., 9); we can also increase the number of data points.
The validation set is for choosing the best hyperparameters, i.e. which degree and how many epochs give the best result.
The test set is for measuring the accuracy of the model chosen with the validation set. A small sketch of such a split follows this list.
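A minimal sketch of such a three-way split, assuming a 60/20/20 division over numpy arrays (the proportions and the seed are assumptions, not the assignment's values):

import numpy as np

rng = np.random.default_rng(0)  # assumed seed, only for reproducibility
x = rng.uniform(0, 1, 100)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(100)

# Shuffle the indices, then cut into 60% train, 20% validation, 20% test.
idx = rng.permutation(len(x))
train_idx, val_idx, test_idx = idx[:60], idx[60:80], idx[80:]

x_train, y_train = x[train_idx], y[train_idx]  # fit weights for each candidate degree/epoch count
x_val, y_val = x[val_idx], y[val_idx]          # pick the degree/epochs with the lowest validation error
x_test, y_test = x[test_idx], y[test_idx]      # report error once, on the chosen model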
Task
Step 1
First, we have to generate 20 data points with the following criteria (a small code sketch follows the list):
1) Generate X from a uniform distribution
2) Generate N from a standard normal (Gaussian) distribution
3) Generate Y as y = sin(2*pi*X) + 0.1 * N
4) Generate linspace data to draw the prediction line over the test data
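A minimal numpy sketch of this data generation; the uniform range [0, 1) and the seed are assumptions, the rest follows the criteria above.

import numpy as np

rng = np.random.default_rng(42)          # assumed seed, only for reproducibility

N_POINTS = 20
x = rng.uniform(0, 1, N_POINTS)          # 1) X from a uniform distribution (range assumed to be [0, 1))
noise = rng.standard_normal(N_POINTS)    # 2) N from a standard normal (Gaussian) distribution
y = np.sin(2 * np.pi * x) + 0.1 * noise  # 3) Y = sin(2*pi*X) + 0.1 * N

x_line = np.linspace(0, 1, 200)          # 4) linspace grid for drawing the prediction line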
Step 2
Generate a Model for different degrees.
I am using autograd and torch.
I have created a class with forward, loss, train, and test functions.
The forward function is for making a prediction.
The loss function generates the loss/error: (actual - prediction) squared.
The train function trains the model on the training data set for a given number of epochs.
The task of the train function is to compute the loss with the loss function for each data point in each epoch,
then find the gradient and reduce each weight by gradient * learning rate.
This process is repeated for the specified number of epochs.
So the goal of this function is to obtain weights that predict accurately and reduce the error.
We also record the training error in this function to show it in the graph.
The test function predicts values for the test data and computes the loss over those predictions.
The drawPredictionLine function is used to draw the prediction line and the test data in a graph.
Matplotlib is used for the same.
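A condensed sketch of how such a class could look; this is not the submitted code. The method names follow the description above, the weights are updated with plain gradient descent, the inputs are assumed to be float torch tensors, and the full-batch loss per epoch is a simplification of the per-point update described above.

import torch
import matplotlib.pyplot as plt

class PolyModel:
    """Polynomial regression of a given degree, trained with plain gradient descent."""

    def __init__(self, degree, lr=0.1):
        self.degree = degree
        self.lr = lr
        # one weight per power of x (w0 .. w_degree), tracked by autograd
        self.w = torch.zeros(degree + 1, requires_grad=True)

    def forward(self, x):
        # prediction: w0 + w1*x + w2*x^2 + ... + w_d*x^d
        powers = torch.stack([x ** i for i in range(self.degree + 1)], dim=1)
        return powers @ self.w

    def loss(self, y_pred, y):
        # mean squared error: (actual - prediction)^2 averaged over the points
        return torch.mean((y - y_pred) ** 2)

    def train(self, x, y, epochs=1000):
        train_errors = []
        for _ in range(epochs):
            l = self.loss(self.forward(x), y)
            l.backward()                          # autograd computes dL/dw
            with torch.no_grad():
                self.w -= self.lr * self.w.grad   # weight <- weight - learning rate * gradient
                self.w.grad.zero_()
            train_errors.append(l.item())         # kept so the training error can be plotted later
        return train_errors

    def test(self, x, y):
        with torch.no_grad():
            return self.loss(self.forward(x), y).item()

    def drawPredictionLine(self, x_line, x_test, y_test):
        with torch.no_grad():
            y_line = self.forward(x_line)
        plt.plot(x_line.numpy(), y_line.numpy(), label=f"degree {self.degree}")
        plt.scatter(x_test.numpy(), y_test.numpy(), label="test data")
        plt.legend()
        plt.show()

Usage would be along the lines of converting the numpy arrays from Step 1 with torch.from_numpy(...).float(), constructing PolyModel(degree=3), calling train on the training tensors, and then test and drawPredictionLine on the test tensors.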
Weights for different degrees
Charts for degree 0,1,3,9
Degree 0
This is a straight line because the prediction function is y = w0, i.e. y equals some constant.
This is an example of underfitting; the line does not actually pass through any of the training or testing data.
To fix this behaviour we have to increase the degree of the polynomial. Because our data is generated from a sine function, it follows a sine curve, and to fit such data we have to increase the degree of the equation.
Degree 1
It is also an example of underfitting, because the line passes through only one or two data points.
Degree 3
This gives a curved graph because the prediction function is y = w0 + w1*x + w2*x^2 + w3*x^3.
Degree 9
This will overfit the data because the prediction function is y = w0 + w1*x + w2*x^2 + w3*x^3 + w4*x^4 + ... + w9*x^9.
It will actually pass through each and every training point.
This is what overfitting means: if you try to predict a new value with this model you will not get an accurate result; it will be very different from the actual value, so the error increases. In an overfit model the training error is very small but the testing/real-world error is very high.
Generate 100 More data points
This is one of the ways to reduce overfitting. As the number of data points increases, the model is forced to fit the overall trend of the data rather than individual points, so it fits properly instead of overfitting.
Train and Test Error for degree 0,1,3,9
Regularization
It is another way to solve the overfitting issue.
In this approach we penalize the error function with the sum of the squared weights.
Here we are using L2 regularization.
For example, lambda = 1.
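A minimal sketch of such an L2-penalized loss in torch, with lambda = 1 as in the example (the function and variable names are illustrative, not the assignment's):

import torch

def regularized_loss(y_pred, y, w, lam=1.0):
    # squared-error loss plus an L2 penalty: lambda * sum of squared weights
    mse = torch.mean((y - y_pred) ** 2)
    penalty = lam * torch.sum(w ** 2)
    return mse + penalty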
Train VS test Error after applying Regularization
Here I am getting very small values, so each point of the test error and the training error looks like it is in the same place.
My Contribution:
Arranging code into a class with different methods (Forward, Loss, Train data, Test data, draw chart).
The base of the code was taken from Dr. Park's PPT.
Created a loss function for regularization, i.e. adding a penalty based on the weights.
Tried different epochs for different degrees to see effects.
Tried plotting the training data with a prediction line drawn from the test data, but it was not giving a proper result.
Generated different graphs for visualizing different data.
Created different models for various degrees such as 2, 4, 6, 7, 8.
Graph for Degree 2 (Experiment)
Graph for Degree 4 (Experiment)
Challenge
The first challenge was to manage large code in a way that is easy to understand.
I considered implementing it with loops, but that might increase complexity and make errors difficult to trace, so I avoided that and created a class with different functions.
The most difficult part was getting a properly curved graph.
I tried plotting the training data set and the test data set and increasing the epochs, but none of these gave a proper graph.
After that I tried linspace, which was helpful in getting proper curved charts.
References
Dr. Deugan Park's PDF
https://uta.instructure.com/courses/88045/files/15438578?module_item_id=3762685
https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html
https://numpy.org/doc/stable/reference/random/generated/numpy.random.uniform.html
https://www.w3schools.com/python/ref_math_pi.asp
https://www.geeksforgeeks.org/python-math-sin-function/
https://www.geeksforgeeks.org/python-implementation-of-polynomial-regression/