1. Problem Statement

The goal is to predict the survival of passengers aboard the RMS Titanic using logistic regression.

2. Data Loading and Description


  • The dataset contains information about the passengers aboard the RMS Titanic, with variables such as age, sex, fare and ticket number.
  • The dataset comprises 891 observations of 12 columns. Below is a table showing the names of all the columns and their descriptions.

Column Name   Description
PassengerId   Passenger identity
Survived      Whether the passenger survived or not
Pclass        Class of ticket
Name          Name of passenger
Sex           Sex of passenger
Age           Age of passenger
SibSp         Number of siblings and/or spouses travelling with the passenger
Parch         Number of parents and/or children travelling with the passenger
Ticket        Ticket number
Fare          Price of ticket
Cabin         Cabin number
Embarked      Port of embarkation

Importing packages

import numpy as np                                                 # Implements multi-dimensional arrays and matrices
import pandas as pd                                                # For data manipulation and analysis
import matplotlib.pyplot as plt                                    # Plotting library for Python and its numerical mathematics extension NumPy
import seaborn as sns                                              # Provides a high-level interface for drawing attractive and informative statistical graphics
%matplotlib inline
sns.set()

Importing the Dataset

titanic_data = pd.read_csv("LogReg/titanic_train.csv")     # Importing training dataset using pd.read_csv
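
As a quick sanity check (an added step, not in the original notebook), we can inspect the shape and the first few rows right after loading:

titanic_data.shape                                                 # expected: (891, 12)
titanic_data.head(3)                                               # peek at the first three records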

3. Preprocessing the data

  • Dealing with missing values
    • Replacing missing entries of Embarked with the most frequent value (mode).
    • Replacing missing values of Age and Fare with their median values.
    • Dropping the column 'Cabin' as it has too many null values.
titanic_data['Embarked'] = titanic_data['Embarked'].fillna(titanic_data['Embarked'].mode()[0])   # fill Embarked with its mode
median_age = titanic_data['Age'].median()
median_fare = titanic_data['Fare'].median()
titanic_data['Age'] = titanic_data['Age'].fillna(median_age)                                     # fill Age with its median
titanic_data['Fare'] = titanic_data['Fare'].fillna(median_fare)                                  # fill Fare with its median
titanic_data.drop('Cabin', axis=1, inplace=True)                                                 # Cabin has too many nulls to be useful
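
To confirm that the cleaning worked, a small added check (not part of the original notebook) counts the remaining nulls per column:

print(titanic_data.isnull().sum())                                 # every column should now report zero missing values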
  • Creating a new feature named FamilySize.
titanic_data['FamilySize'] = titanic_data['SibSp'] + titanic_data['Parch'] + 1                   # +1 counts the passenger themselves
  • Segmenting the Sex column by Age: passengers with Age less than 15 are labelled 'child', while everyone else keeps their gender label ('male' or 'female').
titanic_data['GenderClass'] = titanic_data.apply(lambda x: 'child' if x['Age'] < 15 else x['Sex'],axis=1)
titanic_data[titanic_data.Age<15].head(2)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Embarked FamilySize GenderClass
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 S 5 child
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 C 2 child
titanic_data[titanic_data.Age>15].head(2)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Embarked FamilySize GenderClass
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 S 2 male
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C 2 female
  • Dummification (one-hot encoding) of GenderClass & Embarked.
titanic_data = pd.get_dummies(titanic_data, columns=['GenderClass','Embarked'], drop_first=True)
  • Dropping the columns 'Name', 'Ticket', 'Sex', 'SibSp' and 'Parch'.
titanic = titanic_data.drop(['Name','Ticket','Sex','SibSp','Parch'], axis = 1)
titanic.head()
PassengerId Survived Pclass Age Fare FamilySize GenderClass_female GenderClass_male Embarked_Q Embarked_S
0 1 0 3 22.0 7.2500 2 0 1 0 1
1 2 1 1 38.0 71.2833 2 1 0 0 0
2 3 1 3 26.0 7.9250 1 1 0 0 1
3 4 1 1 35.0 53.1000 2 1 0 0 1
4 5 0 3 35.0 8.0500 1 0 1 0 1

Drawing a pair plot to show the joint relationships between 'Fare', 'Age', 'Pclass' & 'Survived'

sns.pairplot(titanic_data[["Fare","Age","Pclass","Survived"]],vars = ["Fare","Age","Pclass"],hue="Survived", dropna=True,markers=["o", "s"])
plt.title('Pair Plot')

Observing the diagonal elements,

  • More people of Pclass 1 survived than died (the first peak of red is higher than blue).
  • More people of Pclass 3 died than survived (the third peak of blue is higher than red).
  • More people in the age group 20-40 died than survived.
  • Most of the people who paid a low fare died.

Establishing the correlation between all the features using a heatmap.

corr = titanic_data.corr()
plt.figure(figsize=(10,10))
sns.heatmap(corr,vmax=.8,linewidth=.01, square = True, annot = True,cmap='YlGnBu',linecolor ='black')
plt.title('Correlation between features')
  • Age and Pclass are negatively correlated with Survived.
  • FamilySize is derived from Parch and SibSp, hence the high positive correlation among them.
  • Fare and FamilySize are positively correlated with Survived.
  • Highly correlated features introduce redundancy into the model.
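
To read the same relationships off the matrix programmatically, a small added sketch sorts the features by their correlation with Survived (reusing the corr DataFrame computed above):

print(corr['Survived'].drop('Survived').sort_values())             # features ordered by correlation with Survived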

4. Logistic Regression

4.1 Introduction to Logistic Regression

Logistic regression is a technique used for solving classification problems.
Classification is the problem of identifying to which of a set of categories a new observation belongs, on the basis of a training dataset containing observations (or instances) whose category membership is known.
For example, to predict:
Whether an email is spam (1) or not (0), or
Whether a tumor is malignant (1) or not (0).
A basic logistic regression model could, for example, classify a set of images as happy or sad.

Both linear regression and logistic regression are supervised learning techniques. But for a regression problem the output is continuous, unlike a classification problem where the output is discrete.

  • Logistic regression is used when the dependent variable (target) is categorical.
  • The sigmoid function, or logistic function, is used as the hypothesis function for logistic regression. Unlike linear regression, logistic regression produces a logistic curve, which is limited to values between 0 and 1.

4.2 Mathematics behind Logistic Regression

The odds for an event is the (probability of the event occurring) / (probability of the event not occurring):

$$\text{odds} = \frac{p}{1-p}$$

For linear regression, a continuous response is modeled as a linear combination of the features: $y = \beta_0 + \beta_1 x$.
For logistic regression, the log-odds of a categorical response being "true" (1) is modeled as a linear combination of the features:

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x$$

This is called the logit function.
On solving for the probability (p) you will get:

$$p = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$
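
To make the logit-to-probability relationship concrete, here is a small illustrative NumPy sketch; the coefficients beta0 and beta1 are made-up values for demonstration, not fitted ones:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))                                # maps any log-odds z to a probability in (0, 1)

beta0, beta1 = -1.5, 0.8                                           # illustrative coefficients, not estimated from data
x = np.array([-2.0, 0.0, 2.0, 4.0])
p = sigmoid(beta0 + beta1 * x)                                     # p = 1 / (1 + e^-(beta0 + beta1*x))
odds = p / (1 - p)                                                 # np.log(odds) recovers beta0 + beta1*x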

[Figure: plot comparing the linear model and the logistic model]

In other words:

  • Logistic regression outputs the probabilities of a specific class.
  • Those probabilities can be converted into class predictions.

The logistic function has some nice properties:

  • Takes on an "s" shape
  • Output is bounded by 0 and 1

We have covered how this works for binary classification problems (two response classes). But what about multi-class classification problems (more than two response classes)?

  • The most common solution for classification models is "one-vs-all" (also known as "one-vs-rest"): decompose the problem into multiple binary classification problems.
  • Multinomial logistic regression can solve this as a single problem (see the sketch below).
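
As a rough illustration of the two options in scikit-learn (assuming a version where the multi_class parameter is available, as in the release used for this notebook):

from sklearn.linear_model import LogisticRegression

ovr = LogisticRegression(multi_class='ovr', solver='liblinear')          # one-vs-rest: one binary classifier per class
multi = LogisticRegression(multi_class='multinomial', solver='lbfgs')    # multinomial: a single joint model over all classes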

4.3 Applications of Logistic Regression

Logistic regression was used in the biological sciences in the early twentieth century. It was then used in many social science applications. For instance,

  • The Trauma and Injury Severity Score (TRISS), which is widely used to predict mortality in injured patients, was originally developed by Boyd et al. using logistic regression.
  • Many other medical scales used to assess severity of a patient have been developed using logistic regression.
  • Logistic regression may be used to predict the risk of developing a given disease (e.g. diabetes; coronary heart disease), based on observed characteristics of the patient (age, sex, body mass index, results of various blood tests, etc.).

Nowadays, logistic regression has the following applications:

  1. Image segmentation and categorization
  2. Geographic image processing
  3. Handwriting recognition
  4. Detection of myocardial infarction
  5. Predicting whether a person is depressed or not, based on a bag of words from a corpus.

The reason logistic regression is widely used, despite the state of the art in deep neural networks, is that it is very efficient and does not require much computational power, which makes it affordable to run in production.

4.4 Preparing X and y using pandas

X = titanic.loc[:,titanic.columns != 'Survived']
X.head()
PassengerId Pclass Age Fare FamilySize GenderClass_female GenderClass_male Embarked_Q Embarked_S
0 1 3 22.0 7.2500 2 0 1 0 1
1 2 1 38.0 71.2833 2 1 0 0 0
2 3 3 26.0 7.9250 1 1 0 0 1
3 4 1 35.0 53.1000 2 1 0 0 1
4 5 3 35.0 8.0500 1 0 1 0 1
y = titanic.Survived 
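
Before splitting, it is worth confirming the class balance of the target; this small check is an addition to the original notebook:

print(y.value_counts())                                            # counts of died (0) vs. survived (1)
print(y.value_counts(normalize=True))                              # the same, as proportions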

4.5 Splitting X and y into training and test datasets.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)
print(X_train.shape)
print(y_train.shape)
(712, 9)
(712,)

4.6 Logistic regression in scikit-learn

To apply any machine learning algorithm to your dataset, there are basically 4 steps:

  1. Load the algorithm
  2. Instantiate and Fit the model to the training dataset
  3. Prediction on the test set
  4. Calculating the accuracy of the model

The code block given below shows how these steps are carried out:

from sklearn.linear_model import LogisticRegression       # 1. load the algorithm
from sklearn.metrics import accuracy_score

logreg = LogisticRegression()                              # 2. instantiate the model
logreg.fit(X_train, y_train)                               #    and fit it to the training set
y_pred_test = logreg.predict(X_test)                       # 3. predict on the test set
accuracy_score(y_test, y_pred_test)                        # 4. calculate the accuracy

from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train,y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

4.7 Using the Model for Prediction

y_pred_train = logreg.predict(X_train)                                                         # make predictions on the training set
y_pred_test = logreg.predict(X_test)                                                           # make predictions on the testing set
  • We need an evaluation metric in order to compare our predictions with the actual values.

5. Model evaluation

Error is the deviation of the values predicted by the model from the true values.
We will use the accuracy score and the confusion matrix for evaluation.

5.1 Model Evaluation using accuracy classification score

from sklearn.metrics import accuracy_score
print('Accuracy score for test data is:', accuracy_score(y_test,y_pred_test))
Accuracy score for test data is: 0.7988826815642458

5.2 Model Evaluation using confusion matrix

A confusion matrix is a summary of prediction results on a classification problem.

The number of correct and incorrect predictions are summarized with count values, broken down by each class.

from sklearn.metrics import confusion_matrix

cm = pd.DataFrame(confusion_matrix(y_test, y_pred_test))   # wrap in a DataFrame; naming it 'cm' avoids shadowing the imported function

print(cm)
    0   1
0  95  11
1  25  48
cm.index = ['Actual Died','Actual Survived']
cm.columns = ['Predicted Died','Predicted Survived']
print(cm)
                 Predicted Died  Predicted Survived
Actual Died                  95                  11
Actual Survived              25                  48

This means 95 + 48 = 143 correct predictions & 25 + 11 = 36 false predictions.
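
From these four counts we can derive other common metrics by hand; the following added sketch uses the values read off the matrix above:

TN, FP, FN, TP = 95, 11, 25, 48                                    # counts from the confusion matrix above
accuracy = (TP + TN) / (TP + TN + FP + FN)                         # 143/179 ≈ 0.799, matching section 5.1
precision = TP / (TP + FP)                                         # of predicted survivors, the share that actually survived
recall = TP / (TP + FN)                                            # of actual survivors, the share correctly identified
print(accuracy, precision, recall)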

Adjusting the threshold for predicting Died or Survived.

  • In section 4.7 we used the .predict method for classification. This method uses 0.5 as the default threshold for prediction.
  • Now we are going to see the impact of changing the threshold on the accuracy of our logistic regression model.
  • For this we are going to use the .predict_proba method instead of the .predict method.

Setting the threshold to 0.75

preds1 = np.where(logreg.predict_proba(X_test)[:, 1] > 0.75, 1, 0)
print('Accuracy score for test data is:', accuracy_score(y_test,preds1))
Accuracy score for test data is: 0.7374301675977654

The accuracy has been reduced significantly, dropping from roughly 0.80 to 0.74. Hence, 0.75 is not a good threshold for our model.

Setting the threshold to 0.25

preds2 = np.where(logreg.predict_proba(X_test)[:, 1] > 0.25, 1, 0)
print('Accuracy score for test data is:', accuracy_score(y_test,preds2))
Accuracy score for test data is: 0.7486033519553073

The accuracy has again been reduced, from roughly 0.80 to 0.75. Hence, 0.25 is also not a good threshold for our model.
Later on we will see methods to identify the best threshold.
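
As a preview, one simple (if naive) approach is to scan candidate thresholds and keep the one with the highest accuracy; the sketch below reuses logreg, X_test, y_test and accuracy_score from the sections above. Note that in practice the threshold should be tuned on a validation set, not the test set:

probs = logreg.predict_proba(X_test)[:, 1]                         # predicted probability of survival
thresholds = np.arange(0.05, 0.95, 0.05)
scores = [accuracy_score(y_test, (probs > t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmax(scores))]
print('Best threshold by test accuracy:', best_t)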