Post on using Machine Learning to classify the Survival of the Titanic Passengers.
In this blog, I will create a machine learning model which will predict the survival of the people in the Titanic accident. I will use titanic survival dataset and use the knn algorithm to find the survival of the people in the dataset.
RMS Titanic was a British passenger liner that sank in the North Atlantic Ocean in the early hours of 15 April 1912, after colliding with an iceberg during its maiden voyage from Southampton to New York City. The data is taken from data.world. I have also done exploratory analysis in my previous blog on this data I will not work on EDA in this blog on the same data.
Lets start with loading the packages and dataset.
library(dplyr)
library(caret)
library(MASS)
## importing data
titanic_df <- read.csv("titanic.csv")
The titanic datasets has 1309 rows and 14 columns. Lets know about the features or column of this datasets:
head(titanic_df)
pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | cabin_1 | embarked_1 | boat_1 | body | home_dest |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | TRUE | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | B5 | S | 2 | NA | St Louis, MO |
1 | TRUE | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | C22 C26 | S | 11 | NA | Montreal, PQ / Chesterville, ON |
1 | FALSE | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NA | C22 C26 | S | NA | NA | Montreal, PQ / Chesterville, ON |
1 | FALSE | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NA | C22 C26 | S | NA | 135 | Montreal, PQ / Chesterville, ON |
1 | FALSE | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NA | C22 C26 | S | NA | NA | Montreal, PQ / Chesterville, ON |
1 | TRUE | Anderson, Mr. Harry | male | 48.0000 | 0 | 0 | 19952 | 26.5500 | E12 | S | 3 | E12 | S | 3 | NA | New York, NY |
I will clean the data before training the model with it. Let’s start with finding any missing values.
pclass survived name sex age sibsp parch ticket
0 0 0 0 263 0 0 0
fare cabin embarked boat body home_dest
1 0 0 0 1188 0
We can see that the body column has 1188 missing value whereas age has 263 and fare has 1 missing value. I will remove body column from the data frame as it has around 90% missing value.
I use the median value of the age column to fill up the missing value and remove the row where the fare is missing.
Lets check again is there any missing values.
pclass survived name sex age sibsp parch ticket 0 0 0 0 0 0 0 0
fare cabin embarked boat body home_dest
0 0 0 0 0 0
The name column holds the name of each person which is a string and it cant be used in KNN. The titanic_df data frame has the home_dest column which is the destination of the people and it’s in the string. The boat column has the boat number if the people were alive otherwise blank. The cabin column has a cabin of the people which was provided for the rich people only and other cabin is unknown. The ticket column gives the ticket number of each person which is different for each person. I will drop these five columns from the data frame.
The embarked column gives Port of Embarkation:
library(dummies)
titanic_df <- cbind(titanic_df,dummy(titanic_df$embarked))
head(titanic_df)
pclass | survived | sex | age | sibsp | parch | fare | embarked | fare_1 | embarked_1 | titanic_df | titanic_dfC | titanic_dfQ | titanic_dfS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | TRUE | female | 29.0000 | 0 | 0 | 211.3375 | S | 211.3375 | S | 0 | 0 | 0 | 1 |
1 | TRUE | male | 0.9167 | 1 | 2 | 151.5500 | S | 151.5500 | S | 0 | 0 | 0 | 1 |
1 | FALSE | female | 2.0000 | 1 | 2 | 151.5500 | S | 151.5500 | S | 0 | 0 | 0 | 1 |
1 | FALSE | male | 30.0000 | 1 | 2 | 151.5500 | S | 151.5500 | S | 0 | 0 | 0 | 1 |
1 | FALSE | female | 25.0000 | 1 | 2 | 151.5500 | S | 151.5500 | S | 0 | 0 | 0 | 1 |
1 | TRUE | male | 48.0000 | 0 | 0 | 26.5500 | S | 26.5500 | S | 0 | 0 | 0 | 1 |
titanic_df <- dplyr::select(titanic_df,-embarked)
titanic_df <- dplyr::select(titanic_df,-titanic_df)
titanic_df <- dplyr::select(titanic_df,-sex)
The sex column provides the gender of the person I will denote 1 as female and 0 as male which makes it easy to implement in the model.
After cleaning and some manipulation, we have a dataset that is applicable for modelling or classification purpose. Let’s look at the structure of the data.
pclass | survived | age | sibsp | parch | fare | embark_C | embark_Q | embark_S | gender |
---|---|---|---|---|---|---|---|---|---|
1 | TRUE | 29.0000 | 0 | 0 | 211.3375 | 0 | 0 | 1 | 1 |
1 | TRUE | 0.9167 | 1 | 2 | 151.5500 | 0 | 0 | 1 | 0 |
1 | FALSE | 2.0000 | 1 | 2 | 151.5500 | 0 | 0 | 1 | 1 |
1 | FALSE | 30.0000 | 1 | 2 | 151.5500 | 0 | 0 | 1 | 0 |
1 | FALSE | 25.0000 | 1 | 2 | 151.5500 | 0 | 0 | 1 | 1 |
1 | TRUE | 48.0000 | 0 | 0 | 26.5500 | 0 | 0 | 1 | 0 |
Usually, in Machine Learning Data is divided into two parts, training and testing data. I will divide the data into 70% training and 30% testing data.
set.seed(123)
data <- titanic_df[base::sample(nrow(titanic_df)),] # suffling the data
bound <- floor(0.7 * nrow(data))
df_train <- data[1:bound,]
df_test <- data[(bound+1): nrow(data),]
cat("number of training and test samples are",nrow(df_train), nrow(df_test))
number of training and test samples are 914 392
I will use K- Nearest Neighbour (knn) algorithm to train our classification model. I will start with k = 1.
y_test
knn.pred1 false true
false 210 60
true 41 81
Accuracy: 0.7423469
knn.pred3<-knn(X_train,X_test,y_train,k =3)
table(knn.pred3, y_test)
y_test
knn.pred3 false true
false 212 57
true 39 84
cat("Accuracy:",mean(y_test==knn.pred3))
Accuracy: 0.755102
knn.pred5 <- knn(X_train,X_test,y_train,k = 5)
table(knn.pred5, y_test)
y_test
knn.pred5 false true
false 204 56
true 47 85
cat("Accuracy:",mean(y_test==knn.pred5))
Accuracy: 0.7372449
knn.pred7 <- knn(X_train,X_test,y_train,k = 20)
table(knn.pred7, y_test)
y_test
knn.pred7 false true
false 222 57
true 29 84
cat("Accuracy:",mean(y_test==knn.pred7))
Accuracy: 0.7806122
I kept on increasing the value of k and the best result I found was with K = 20. Further increase in K didn’t improve the performance of the model. So, I got 78% accuracy using the K nearest Neighbour algorithm with k = 20. You can use other algorithms on this problem to get better accuracy.