Problem Explanation
This problem is based on the collision of the Titanic with an iceberg. We are given a data set of passengers and asked to predict whether each passenger survived the sinking. There were too few lifeboats, so not everyone could survive. The passenger data set contains details such as gender, age, passenger class, ticket, and fare. From these we have to find a pattern, for example whether all women survived, only women above a certain age survived, or all children survived. We are given 2 data sets. One is the training data set, which we will use to train our machine learning model. The second is the test data set, for which we will make predictions and test the accuracy of our prediction model.
Solution
Note:- This solution is based on the Kaggle explanation.
Step1:- Import Library
Here I am using pandas to read the CSV files. pandas is the library we use for tabular (matrix-like) data.
The training data set has 891 rows and 12 columns.
The math library is used to find the mean of the Age column.
Another library is scikit-learn. We are using RandomForestClassifier from scikit-learn.
RandomForestClassifier is an estimator (a classifier). It builds a number of decision trees on the columns passed from the data set.
The combined vote of these decision trees is used to predict a value for each row of the test data set.
Step2:- Read CSV
Now we will use the read_csv method of pandas to read the CSV files.
The head method returns the first n rows, where n is the number of rows the user wants to see. By default, n is 5.
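The reading step can be sketched as below. To keep the example self-contained it reads a tiny in-memory CSV with made-up rows instead of the real file; with the actual data the call would be something like pd.read_csv("train.csv"), where the path is an assumption.

```python
import io
import pandas as pd

# Tiny stand-in for train.csv: made-up rows for illustration, not the real file.
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch\n"
    "1,0,3,male,22,1,0\n"
    "2,1,1,female,38,1,0\n"
    "3,1,3,female,,0,0\n"
    "4,1,1,female,35,1,0\n"
    "5,0,3,male,35,0,0\n"
    "6,0,3,male,,0,0\n"
)

# read_csv accepts a file path or a file-like object.
train_data = pd.read_csv(sample_csv)

print(train_data.shape)    # (number of rows, number of columns)
print(train_data.head())   # first 5 rows by default
print(train_data.head(2))  # first 2 rows
```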
Contribution
Step3:- Check for NULL values
This step is important because every column we use for prediction must have proper values and must not contain nulls.
The Age column contains nulls. So, to remove them, we replace each null value in the Age column with the mean of the Age column.
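A minimal sketch of the null check and the mean replacement, on a small made-up DataFrame. It uses pandas' own isnull, mean, and fillna methods, which is an assumption; the original solution may compute the mean differently.

```python
import pandas as pd

# Made-up rows for illustration; two Age values are missing.
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Age": [22.0, float("nan"), 26.0, float("nan")],
})

# Count the null values in each column.
null_counts = df.isnull().sum()
print(null_counts)

# mean() skips missing values, so this is the mean of the known ages.
age_mean = df["Age"].mean()

# Replace every missing age with that mean.
df["Age"] = df["Age"].fillna(age_mean)
print(df)
```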
Step4:- Arrange Data in train and test data set
Here we pass the list of features on which we want to base predictions. The feature list is Pclass, Sex, SibSp, Parch, Age.
Pclass - ticket class
Sex - gender of the passenger
SibSp - number of siblings/spouses aboard the Titanic
Parch - number of parents/children aboard the Titanic
Contribution
-----
Age - In this problem there were too few lifeboats.
So, while saving people, age might have been considered as one of the factors.
updateNullAge is a function applied to x_train and x_test to replace null values in the Age column.
------
x_train is created from train.csv using the above features.
y_train is created from the Survived column, because this is the output we want from the model.
The model is trained on x_train and y_train, which are the input and output for training.
x_test is created from test.csv using the same features.
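The arrangement above could look roughly like this. The body of updateNullAge is an assumed implementation (fill with the column mean of whichever frame it receives), the rows are made up, and encoding the text column Sex with pd.get_dummies is an assumption that the original text does not state.

```python
import pandas as pd

features = ["Pclass", "Sex", "SibSp", "Parch", "Age"]

def updateNullAge(df):
    """Assumed implementation: fill missing Age values with the column mean."""
    out = df.copy()
    out["Age"] = out["Age"].fillna(out["Age"].mean())
    return out

# Made-up stand-ins for train.csv and test.csv.
train_data = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 0, 1],
    "Age": [22.0, 38.0, float("nan"), 30.0],
})
test_data = pd.DataFrame({
    "PassengerId": [892, 893],
    "Pclass": [3, 2],
    "Sex": ["male", "female"],
    "SibSp": [0, 1],
    "Parch": [0, 0],
    "Age": [float("nan"), 47.0],
})

# Output column for training.
y_train = train_data["Survived"]

# Input columns: select the features, fill Age, and one-hot encode Sex.
x_train = pd.get_dummies(updateNullAge(train_data[features]))
x_test = pd.get_dummies(updateNullAge(test_data[features]))

print(x_train)
print(x_test)
```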
Step5:- Build Model
Using RandomForestClassifier we create a model for prediction. Inputs to the function:
n_estimators = number of trees in the forest
max_depth = maximum depth (number of levels) of each tree
random_state = seed that controls the randomness, so results are reproducible
Contribution
Two new parameters were added to the RandomForestClassifier.
bootstrap = by default each decision tree is built from a random sample of the data set. Passing False for this parameter makes every tree use the whole data set instead of a sample.
max_features = number of features the model considers when looking for the best split while building a decision tree.
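A sketch of building the model with the parameters described above. The parameter values and the small training table are illustrative assumptions, not the values used in the original solution.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up, already numeric training data for illustration.
x_train = pd.DataFrame({
    "Pclass":     [3, 1, 3, 1, 2, 3, 1, 2],
    "Sex_female": [0, 1, 1, 1, 0, 0, 0, 1],
    "SibSp":      [1, 1, 0, 1, 0, 0, 0, 1],
    "Parch":      [0, 0, 0, 0, 1, 0, 0, 2],
    "Age":        [22.0, 38.0, 26.0, 35.0, 30.0, 40.0, 54.0, 27.0],
})
y_train = pd.Series([0, 1, 1, 1, 0, 0, 0, 1], name="Survived")

model = RandomForestClassifier(
    n_estimators=100,  # number of trees in the forest (illustrative value)
    max_depth=5,       # maximum depth of each tree (illustrative value)
    random_state=1,    # fixed seed so repeated runs give the same model
    bootstrap=False,   # every tree is built from the whole training set
    max_features=3,    # features considered at each split (illustrative value)
)
model.fit(x_train, y_train)

print(len(model.estimators_))  # one fitted decision tree per estimator
```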
Step6:- Make a prediction and generate an output file
Make predictions using the predict method of the trained model.
We pass x_test, the data for which we have to make predictions, to this method.
The next task is to save the predictions in a CSV file.
The CSV file should contain the passenger ID and the survival status.
So, save the passenger ID and survival status in a CSV file and upload this file to the Kaggle competition.
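The prediction and export steps might be sketched as follows. The data is made up, the reduced feature set is only for brevity, and the file name submission.csv is an assumption; the column names PassengerId and Survived follow the submission format of the Kaggle Titanic competition.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up, already prepared data for illustration.
x_train = pd.DataFrame({
    "Pclass":     [3, 1, 3, 1, 2, 3],
    "Sex_female": [0, 1, 1, 1, 0, 0],
    "Age":        [22.0, 38.0, 26.0, 35.0, 30.0, 40.0],
})
y_train = pd.Series([0, 1, 1, 1, 0, 0], name="Survived")

test_ids = pd.Series([892, 893, 894], name="PassengerId")
x_test = pd.DataFrame({
    "Pclass":     [3, 3, 2],
    "Sex_female": [0, 1, 0],
    "Age":        [34.5, 47.0, 62.0],
})

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(x_train, y_train)

# One prediction (0 or 1) per row of x_test.
predictions = model.predict(x_test)

# Pair each passenger ID with its predicted survival status and save it.
output = pd.DataFrame({"PassengerId": test_ids, "Survived": predictions})
output.to_csv("submission.csv", index=False)  # file name is an assumption
print(output)
```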
Check the accuracy score you get in the Kaggle competition and try to improve it.
ACCURACY SCORE