Data Stories: Classification of Survival in Titanic

In this post, I will build a machine learning model that predicts whether a passenger survived the Titanic disaster. I will use the Titanic survival dataset and the k-nearest neighbours (KNN) algorithm to classify each person in the dataset.

RMS Titanic was a British passenger liner that sank in the North Atlantic Ocean in the early hours of 15 April 1912, after colliding with an iceberg during its maiden voyage from Southampton to New York City. The data is taken from data.world. I covered exploratory data analysis (EDA) of this dataset in my previous post, so I will not repeat it here.

Let’s start by loading the packages and the dataset.


library(dplyr)
library(caret)
library(MASS)
library(class) # provides knn(), which is used later in this post
## importing data
titanic_df <- read.csv("titanic.csv")

Dataset

The Titanic dataset has 1309 rows and 14 columns. Let’s look at its features (columns):

survived - Survival (0 = No; 1 = Yes)
pclass - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name - Name
sex - Sex
age - Age
sibsp - Number of Siblings/Spouses Aboard
parch - Number of Parents/Children Aboard
ticket - Ticket Number
fare - Passenger Fare
cabin - Cabin
embarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
boat - Lifeboat (if survived)
body - Body number (if did not survive and body was recovered)
home_dest - Home/Destination


head(titanic_df)

pclass	survived	name	sex	age	sibsp	parch	ticket	fare	cabin	embarked	boat	body	home_dest
1	TRUE	Allen, Miss. Elisabeth Walton	female	29.0000	0	0	24160	211.3375	B5	S	2	NA	St Louis, MO
1	TRUE	Allison, Master. Hudson Trevor	male	0.9167	1	2	113781	151.5500	C22 C26	S	11	NA	Montreal, PQ / Chesterville, ON
1	FALSE	Allison, Miss. Helen Loraine	female	2.0000	1	2	113781	151.5500	C22 C26	S	NA	NA	Montreal, PQ / Chesterville, ON
1	FALSE	Allison, Mr. Hudson Joshua Creighton	male	30.0000	1	2	113781	151.5500	C22 C26	S	NA	135	Montreal, PQ / Chesterville, ON
1	FALSE	Allison, Mrs. Hudson J C (Bessie Waldo Daniels)	female	25.0000	1	2	113781	151.5500	C22 C26	S	NA	NA	Montreal, PQ / Chesterville, ON
1	TRUE	Anderson, Mr. Harry	male	48.0000	0	0	19952	26.5500	E12	S	3	NA	New York, NY

Cleaning Data

I will clean the data before training the model with it. Let’s start with finding any missing values.
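Per-column counts of missing values like the ones that follow can be produced with base R alone. This is a minimal sketch on a small made-up data frame (toy_df is a stand-in introduced here for illustration, not the real data); on the real data the same call would be colSums(is.na(titanic_df)).

```r
# toy_df is a hypothetical stand-in for titanic_df
toy_df <- data.frame(
  age  = c(29, NA, 2, NA),
  fare = c(211.34, 151.55, NA, 26.55),
  sex  = c("female", "male", "female", "male")
)

# is.na() flags every missing cell; colSums() adds the flags up per column
missing_counts <- colSums(is.na(toy_df))
print(missing_counts)
```

Note that read.csv() reads a blank cell in a text column as an empty string rather than NA, so blank values in text columns such as cabin would not be counted this way.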


pclass  survived  name  sex  age  sibsp  parch  ticket
     0         0     0    0  263      0      0       0
fare  cabin  embarked  boat  body  home_dest
   1      0         0     0  1188          0

We can see that the body column has 1188 missing values, whereas age has 263 and fare has 1. I will remove the body column from the data frame, as around 90% of its values are missing.

I fill the missing values in the age column with the median age and remove the row where the fare is missing.
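The cleaning steps described above can be sketched as follows. The post does not show the original code, so this is only one plausible way to do it, run on a made-up data frame (toy_df and age_median are names introduced for illustration).

```r
# toy_df is a hypothetical stand-in for titanic_df
toy_df <- data.frame(
  age  = c(29, NA, 2, 30, 25),
  fare = c(211.34, 151.55, NA, 151.55, 26.55),
  body = c(NA, NA, NA, 135, NA)
)

# drop the mostly empty body column
toy_df$body <- NULL

# replace missing ages with the median of the observed ages
age_median <- median(toy_df$age, na.rm = TRUE)
toy_df$age[is.na(toy_df$age)] <- age_median

# keep only the rows where fare is present
toy_df <- toy_df[!is.na(toy_df$fare), ]
print(toy_df)
```

The median is used rather than the mean because it is less sensitive to a few very old or very young passengers.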

Let’s check again whether any missing values remain.


pclass  survived  name  sex  age  sibsp  parch  ticket
     0         0     0    0    0      0      0       0
fare  cabin  embarked  boat  body  home_dest
   0      0         0     0     0          0

The name column holds each person’s name, which is a string and can’t be used in KNN. The home_dest column holds each person’s destination, also as a string. The boat column holds the lifeboat number if the person survived and is blank otherwise. The cabin column holds the person’s cabin, which is known only for the richer passengers and unknown for the rest. The ticket column holds each person’s ticket number, which is just an identifier. I will drop these five columns from the data frame.
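Dropping the five columns can be sketched like this. The original code is not shown in the post, so the snippet is an assumption about how it could be done with dplyr, demonstrated on a made-up data frame (toy_df and drop_cols are names introduced for illustration).

```r
library(dplyr)

# toy_df is a hypothetical stand-in for titanic_df
toy_df <- data.frame(
  pclass    = c(1, 3),
  name      = c("A", "B"),
  ticket    = c("24160", "113781"),
  cabin     = c("B5", ""),
  boat      = c("2", ""),
  home_dest = c("St Louis, MO", ""),
  fare      = c(211.34, 7.25)
)

# the five text columns that are not used by the model
drop_cols <- c("name", "home_dest", "boat", "cabin", "ticket")

# dplyr::select is spelled out because MASS also exports a select()
toy_df <- dplyr::select(toy_df, -dplyr::all_of(drop_cols))
print(names(toy_df))
```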

The embarked column gives Port of Embarkation:

C = Cherbourg
Q = Queenstown
S = Southampton

I will create three new features from the embarked data.


library(dummies)
# dummy() builds one 0/1 indicator column per value of embarked
titanic_df <- cbind(titanic_df, dummy(titanic_df$embarked))
head(titanic_df)

pclass	survived	sex	age	sibsp	parch	fare	embarked	titanic_dfS
1	TRUE	female	29.0000	0	0	211.3375	S	1
1	TRUE	male	0.9167	1	2	151.5500	S	1
1	FALSE	female	2.0000	1	2	151.5500	S	1
1	FALSE	male	30.0000	1	2	151.5500	S	1
1	FALSE	female	25.0000	1	2	151.5500	S	1
1	TRUE	male	48.0000	0	0	26.5500	S	1


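titanic_df$gender <- ifelse(titanic_df$sex == "female", 1, 0) # 1 = female, 0 = male; must run before sex is dropped
titanic_df <- dplyr::select(titanic_df, -embarked)
titanic_df <- dplyr::select(titanic_df, -titanic_df) # presumably a dummy column that dummy() named "titanic_df"
titanic_df <- dplyr::select(titanic_df, -sex)
titanic_df <- dplyr::rename(titanic_df, embark_S = titanic_dfS) # matches the column name shown in the table below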

The sex column gives the gender of each person. I encode it as a numeric gender column, with 1 for female and 0 for male, which makes it easy to use in the model.

After cleaning and some manipulation, we have a dataset that is suitable for classification. Let’s look at the first few rows of the data.

pclass	survived	age	sibsp	parch	fare	embark_S	gender
1	TRUE	29.0000	0	0	211.3375	1	1
1	TRUE	0.9167	1	2	151.5500	1	0
1	FALSE	2.0000	1	2	151.5500	1	1
1	FALSE	30.0000	1	2	151.5500	1	0
1	FALSE	25.0000	1	2	151.5500	1	1
1	TRUE	48.0000	0	0	26.5500	1	0

Split the Data

In machine learning, the data is usually divided into two parts: training data and testing data. I will use 70% of the data for training and 30% for testing.


set.seed(123)
data <- titanic_df[base::sample(nrow(titanic_df)), ] # shuffling the rows
bound <- floor(0.7 * nrow(data))
df_train <- data[1:bound, ]
df_test <- data[(bound + 1):nrow(data), ]
cat("number of training and test samples are", nrow(df_train), nrow(df_test))


number of training and test samples are 914 392

Train the Model on Data

I will use the k-nearest neighbours (KNN) algorithm to train the classification model. I will start with k = 1.
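The post does not show how X_train, y_train, X_test and y_test were built, nor the k = 1 call itself, so the following is a minimal sketch under assumptions: knn() is taken from the class package, the predictors are every column except survived, and toy_train and toy_test are made-up stand-ins for df_train and df_test.

```r
library(class) # knn() lives in the class package

# made-up stand-ins for df_train and df_test (assumption, not the real data)
toy_train <- data.frame(
  survived = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE),
  gender   = c(1, 1, 1, 0, 0, 0),
  pclass   = c(1, 1, 2, 3, 3, 2),
  age      = c(29, 25, 35, 30, 40, 22)
)
toy_test <- data.frame(
  survived = c(TRUE, FALSE),
  gender   = c(1, 0),
  pclass   = c(1, 3),
  age      = c(28, 38)
)

# predictors: every column except the label
X_train <- toy_train[, names(toy_train) != "survived"]
X_test  <- toy_test[, names(toy_test) != "survived"]

# labels as factors that share the same level set
y_train <- factor(toy_train$survived, levels = c("FALSE", "TRUE"))
y_test  <- factor(toy_test$survived, levels = c("FALSE", "TRUE"))

# classify each test row by the label of its single nearest training row
knn.pred1 <- knn(X_train, X_test, y_train, k = 1)
print(table(knn.pred1, y_test))

acc1 <- mean(y_test == knn.pred1)
cat("Accuracy:", acc1, "\n")
```

With the real df_train and df_test in place of the toy frames, the same steps should give a confusion matrix of the same shape as the one that follows.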


         y_test
knn.pred1 false true
    false   210   60
    true     41   81
Accuracy: 0.7423469


knn.pred3 <- knn(X_train, X_test, y_train, k = 3)
table(knn.pred3, y_test)


         y_test
knn.pred3 false true
    false   212   57
    true     39   84


cat("Accuracy:", mean(y_test == knn.pred3))


Accuracy: 0.755102


knn.pred5 <- knn(X_train, X_test, y_train, k = 5)
table(knn.pred5, y_test)


         y_test
knn.pred5 false true
    false   204   56
    true     47   85


cat("Accuracy:", mean(y_test == knn.pred5))


Accuracy: 0.7372449


knn.pred20 <- knn(X_train, X_test, y_train, k = 20)
table(knn.pred20, y_test)


          y_test
knn.pred20 false true
     false   222   57
     true     29   84


cat("Accuracy:", mean(y_test == knn.pred20))


Accuracy: 0.7806122

I kept increasing the value of k, and the best result I found was with k = 20. Increasing k further didn’t improve the performance of the model. So, I got about 78% accuracy using the k-nearest neighbours algorithm with k = 20. Other algorithms may give better accuracy on this problem.
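Trying several values of k by hand can also be written as a small loop. The sketch below is not the code used for the results above; knn_accuracy, k_values and the toy vectors are names and data introduced here for illustration, and knn() is again assumed to come from the class package.

```r
library(class) # knn() lives in the class package

# accuracy of knn() on a held-out set for one value of k
knn_accuracy <- function(k, X_train, y_train, X_test, y_test) {
  pred <- knn(X_train, X_test, y_train, k = k)
  mean(pred == y_test)
}

# made-up data: two well-separated groups of four points each
X_train <- data.frame(
  a = c(1, 2, 3, 4, 11, 12, 13, 14),
  b = c(1, 2, 1, 2, 11, 12, 11, 12)
)
y_train <- factor(c(rep("false", 4), rep("true", 4)))
X_test  <- data.frame(a = c(2.5, 12.5), b = c(1.5, 11.5))
y_test  <- factor(c("false", "true"), levels = levels(y_train))

set.seed(123) # knn() breaks voting ties at random
k_values <- c(1, 3, 5)
accuracies <- sapply(
  k_values, knn_accuracy,
  X_train = X_train, y_train = y_train, X_test = X_test, y_test = y_test
)

# which.max() returns the first k that reaches the highest accuracy
best_k <- k_values[which.max(accuracies)]
print(data.frame(k = k_values, accuracy = accuracies))
cat("Best k:", best_k, "\n")
```

On the real data, k_values could be something like 1:30, with the X_train, y_train, X_test and y_test objects from the modelling step passed in instead of the toy ones.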