Titanic - Machine Learning from Disaster

fat8348
Sep 28, 2021

Problem Explanation

The setting for this problem is the collision of the Titanic with an iceberg. We are given a dataset of passengers and asked to predict whether each passenger survived the sinking. There were far too few lifeboats, so there was no chance of everyone surviving. The passenger dataset contains many details such as gender, age, passenger class, ticket, and fare. From these we have to find a pattern, for example whether all women survived, only women above a certain age survived, or all children survived. We are given two datasets. The first is the training dataset, which we use to train our machine learning model. The second is the test dataset, for which we make predictions and test the accuracy of our prediction model.

Solution

Note:- This solution is based on the Kaggle explanation.

Step1:- Import Library

Here I am using pandas to read the CSV files. pandas is the library we use for tabular data.

The training dataset has 891 rows and 12 columns.

The math library is used to find the mean of the Age column.

Another library is scikit-learn. We use RandomForestClassifier from scikit-learn.

RandomForestClassifier is an estimator (a classifier). It builds a number of decision trees on the columns passed from the dataset.

The predictions of these decision trees are combined to predict a value for each row of the test dataset.
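To make the idea of "many trees, one prediction" concrete, here is a minimal sketch. The table is a tiny made-up example (not the real Titanic data), and the variable names toy, forest and n_trees are only for illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Tiny made-up table standing in for a few Titanic columns (not real data).
toy = pd.DataFrame({
    "Pclass":   [1, 3, 2, 3, 1, 2],
    "Age":      [29.0, 22.0, 35.0, 4.0, 54.0, 27.0],
    "Survived": [1, 0, 1, 1, 0, 0],
})

# Ask for a forest of 10 trees; random_state keeps the run repeatable.
forest = RandomForestClassifier(n_estimators=10, random_state=1)
forest.fit(toy[["Pclass", "Age"]], toy["Survived"])

# The fitted forest holds one decision tree per estimator.
n_trees = len(forest.estimators_)

# predict() combines the trees into a single answer per row.
predictions = forest.predict(toy[["Pclass", "Age"]])
print(n_trees, predictions)
```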

Step2:- Read CSV

Now we will use the read_csv function of pandas to read the CSV files.

The head method returns the first n rows, where n is the number of rows the user wants to see. By default, n is 5.
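A small sketch of read_csv and head follows. So that it runs without the Kaggle files, it reads a few made-up rows in the train.csv column layout from an in-memory string. With the real data you would pass the file path instead, for example pd.read_csv("train.csv"), where the path is an assumption about where the file is stored.

```python
import io
import pandas as pd

# Made-up rows in the train.csv column layout (not real passengers).
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    "1,0,3,Passenger A,male,22,1,0,T1,7.25,,S\n"
    "2,1,1,Passenger B,female,38,1,0,T2,71.28,C1,C\n"
    "3,1,3,Passenger C,female,26,0,0,T3,7.93,,S\n"
    "4,1,1,Passenger D,female,35,1,0,T4,53.10,C2,S\n"
    "5,0,3,Passenger E,male,35,0,0,T5,8.05,,S\n"
    "6,0,3,Passenger F,male,,0,0,T6,8.46,,Q\n"
)

train_data = pd.read_csv(sample_csv)

first_rows = train_data.head()   # no argument: the first 5 rows
first_two = train_data.head(2)   # explicit n: the first 2 rows

print(train_data.shape)          # (rows, columns)
print(first_two)
```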

Contribution

Step3:- Check for NULL values

This step is important because we have to check that every column we are considering for prediction has proper values and does not contain nulls.

The Age column contains nulls. To remove them, we replace each null value in the Age column with the mean of the Age column.
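The check and the replacement can be sketched as below. This version uses the pandas methods isnull, mean and fillna, which is one common way to do it and may differ from the exact code behind this post. The rows are made up.

```python
import pandas as pd

# Made-up rows; one Age value is missing.
train_data = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Age":    [22.0, None, 26.0, 36.0],
})

# Count the null values in every column.
null_counts = train_data.isnull().sum()

# mean() skips null values, so this is the mean of the known ages.
mean_age = train_data["Age"].mean()

# Replace the nulls in Age with that mean.
train_data["Age"] = train_data["Age"].fillna(mean_age)

print(null_counts)
print(train_data)
```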

Step4:- Arrange Data in train and test data set

Here we pass the list of features on which we want to base the predictions. The feature list is Pclass, Sex, SibSp, Parch, Age.

Pclass- ticket class

Sex- gender of the passenger

SibSp- number of siblings/spouses aboard the Titanic

Parch- number of parents/children aboard the Titanic

Contribution

-----

Age- age of the passenger. In this problem there were too few lifeboats, so while saving people, age might have been considered as one of the factors.

updateNullAge is a function applied to x_train and x_test to replace the null values in the Age column.

------

x_train is created from train.csv, using the features listed above.

y_train is created from the Survived column, because this is the output we want from the model.

x_train and y_train are used to train the model. They are the input and the output for training.

x_test is created from test.csv, using the same features.
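The arrangement described above could look roughly like this. The post names updateNullAge but does not show it, so the function body here is an assumption. Turning the Sex column into numeric columns with pd.get_dummies is also an assumption, added because the classifier needs numeric input. The two small tables are made-up stand-ins for train.csv and test.csv.

```python
import pandas as pd

features = ["Pclass", "Sex", "SibSp", "Parch", "Age"]

def updateNullAge(frame):
    """Return a copy with null Age values replaced by the column mean.

    Assumed implementation: the post only names this function.
    """
    filled = frame.copy()
    filled["Age"] = filled["Age"].fillna(filled["Age"].mean())
    return filled

# Made-up stand-ins for train.csv and test.csv.
train_data = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Pclass":   [3, 1, 3, 2, 1],
    "Sex":      ["male", "female", "female", "male", "female"],
    "SibSp":    [1, 1, 0, 0, 1],
    "Parch":    [0, 0, 0, 2, 0],
    "Age":      [22.0, 38.0, None, 35.0, 27.0],
})
test_data = pd.DataFrame({
    "Pclass": [3, 2, 1],
    "Sex":    ["male", "female", "male"],
    "SibSp":  [0, 1, 0],
    "Parch":  [0, 0, 1],
    "Age":    [34.0, None, 62.0],
})

# Output column the model should learn.
y_train = train_data["Survived"]

# Input columns: fill Age, then encode Sex as Sex_female / Sex_male.
x_train = pd.get_dummies(updateNullAge(train_data[features]))
x_test = pd.get_dummies(updateNullAge(test_data[features]))

print(x_train)
```

Both tables must end up with the same columns in the same order, otherwise the model trained on x_train cannot be applied to x_test.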

Step5:- Build Model

Using RandomForestClassifier we create the model for prediction. The inputs to it are:

n_estimators- number of trees built for the model

max_depth- maximum depth (number of levels) of each tree

random_state- controls the randomness in the code, so that the same value gives the same result on every run
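A minimal sketch of building and using the model with these three parameters is shown here. The values 100, 5 and 1 are only illustrative, and the training and test tables are made up so that the snippet runs on its own.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up, already numeric training and test tables.
x_train = pd.DataFrame({
    "Pclass": [3, 1, 3, 2, 1, 3, 2, 1],
    "SibSp":  [1, 1, 0, 0, 1, 0, 0, 0],
    "Parch":  [0, 0, 0, 2, 0, 0, 1, 0],
    "Age":    [22.0, 38.0, 26.0, 35.0, 27.0, 4.0, 54.0, 30.0],
})
y_train = pd.Series([0, 1, 1, 0, 1, 1, 0, 0])
x_test = pd.DataFrame({
    "Pclass": [3, 1],
    "SibSp":  [0, 1],
    "Parch":  [0, 0],
    "Age":    [34.0, 47.0],
})

model = RandomForestClassifier(
    n_estimators=100,  # number of trees
    max_depth=5,       # no tree may grow deeper than 5 levels
    random_state=1,    # fixed seed, so reruns give the same model
)
model.fit(x_train, y_train)

# One prediction (0 or 1) per row of the test table.
predictions = model.predict(x_test)
print(predictions)
```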

Contribution

Two new parameters were added to the random forest classifier.

bootstrap- by default, each decision tree is built from a random sample of the training dataset. Passing False for this parameter makes every tree use the whole dataset instead of a sample.

max_features- the number of features the model considers when looking for the best split while building a decision tree.
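The two extra parameters can be passed as in the sketch below. The value 3 for max_features, like the other parameter values, is only an illustrative choice, and the data is made up.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up, already numeric training and test tables.
x_train = pd.DataFrame({
    "Pclass": [3, 1, 3, 2, 1, 3, 2, 1],
    "SibSp":  [1, 1, 0, 0, 1, 0, 0, 0],
    "Parch":  [0, 0, 0, 2, 0, 0, 1, 0],
    "Age":    [22.0, 38.0, 26.0, 35.0, 27.0, 4.0, 54.0, 30.0],
})
y_train = pd.Series([0, 1, 1, 0, 1, 1, 0, 0])
x_test = pd.DataFrame({
    "Pclass": [3, 1],
    "SibSp":  [0, 1],
    "Parch":  [0, 0],
    "Age":    [34.0, 47.0],
})

model = RandomForestClassifier(
    n_estimators=100,
    max_depth=5,
    random_state=1,
    bootstrap=False,   # every tree is trained on the whole training set
    max_features=3,    # each split looks at 3 randomly chosen features
)
model.fit(x_train, y_train)

predictions = model.predict(x_test)
print(predictions)
```

With bootstrap=False the trees no longer differ because of row sampling, so the random choice of features at each split (max_features smaller than the number of columns) is what keeps the trees from being identical.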