Titanic Survival Prediction Machine Learning Project

This project develops a classification model to predict which passengers survived the Titanic disaster, using the well-known Kaggle dataset. Beyond the prediction task, the project explores the socioeconomic factors that influenced survival rates and demonstrates the essential steps of a machine learning workflow.

Key Features

Binary Classification: Predicts passenger survival with multiple algorithms
Feature Engineering: Derives meaningful features from passenger data, including family size, titles extracted from names, and imputed ages
Socioeconomic Analysis: Analyzes how social class, gender, and age affected survival probability
Model Comparison: Implements and compares multiple classification algorithms
Hyperparameter Tuning: Tunes model hyperparameters to improve predictive performance
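
The feature engineering step listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes the standard Kaggle column names (Name, SibSp, Parch), a few hand-typed sample rows, and a hypothetical helper called add_engineered_features.

```python
import pandas as pd

# Hand-typed sample rows in the Kaggle column layout (illustrative only).
df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Heikkinen, Miss. Laina",
        "Palsson, Master. Gosta Leonard",
    ],
    "SibSp": [1, 1, 0, 3],
    "Parch": [0, 0, 0, 1],
})


def add_engineered_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: add FamilySize and Title columns."""
    out = frame.copy()
    # Family size = siblings/spouses + parents/children + the passenger.
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    # The title sits between the comma and the first period in Name.
    out["Title"] = (
        out["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
    )
    return out


features = add_engineered_features(df)
print(features[["FamilySize", "Title"]])
```

Rare titles are often merged into a single "Other" bucket before modeling; whether this project does so is not stated here.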

Tech Stack

Python
Scikit-learn
Pandas
NumPy
Matplotlib/Seaborn
XGBoost
Random Forest

Model Performance

The final ensemble model achieved 83.5% accuracy on the test set and performed well across all passenger classes. Of the individual algorithms tested, Random Forest and XGBoost performed best.
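
One way such an ensemble might be assembled and scored is sketched below. Several things here are assumptions rather than the project's setup: the data is synthetic (make_classification), soft voting is only one possible way to combine models, scikit-learn's GradientBoostingClassifier stands in for XGBoost so the example needs no extra dependency, and the hyperparameters are arbitrary. The printed score will not match the 83.5% reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the engineered Titanic features (assumption).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Soft voting averages the predicted class probabilities of each model.
ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        # Stand-in for XGBoost (assumption).
        ("gb", GradientBoostingClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",
)

# Stratified folds keep the class ratio roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(ensemble, X, y, cv=cv, scoring="accuracy")
print(f"mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```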

Historical Insights

The analysis is consistent with the "women and children first" policy: women survived at a far higher rate (74%) than men (19%). Passenger class mattered as well: 63% of first-class passengers survived, compared with 47% in second class and only 24% in third class.
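
Group survival rates like these can be computed with a pandas groupby. The sketch below uses six made-up rows in the Kaggle column layout (Sex, Pclass, Survived), so its output illustrates the method only and does not reproduce the percentages above.

```python
import pandas as pd

# Made-up rows for illustration (assumption); not the Kaggle data.
df = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Pclass": [1, 3, 3, 1, 2, 2],
    "Survived": [1, 0, 0, 1, 0, 1],
})

# The mean of a 0/1 column is the fraction of survivors in each group.
rate_by_sex = df.groupby("Sex")["Survived"].mean()
rate_by_class = df.groupby("Pclass")["Survived"].mean()

print(rate_by_sex)
print(rate_by_class)
```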

Technical Highlights

Missing Data Strategy: Imputed missing age values based on passenger class and title rather than a single global value
Feature Importance: Identified passenger sex, class, and fare as the strongest predictors of survival
Cross-Validation: Used stratified k-fold cross-validation to ensure robust model evaluation
Ensemble Methods: Combined predictions from multiple models to improve overall accuracy
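
The missing-data strategy above might look something like the following sketch. The use of the group median, the fallback to the overall median, and the helper name impute_age are all assumptions; the project only states that imputation is based on passenger class and title.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration (assumption); Title is an engineered column.
df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Title": ["Mr", "Mr", "Mr", "Miss", "Miss", "Master"],
    "Age": [40.0, 50.0, np.nan, 20.0, np.nan, np.nan],
})


def impute_age(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: fill missing Age from (Pclass, Title) medians."""
    out = frame.copy()
    # Median age of each (class, title) group, aligned to the original rows.
    group_median = out.groupby(["Pclass", "Title"])["Age"].transform("median")
    out["Age"] = out["Age"].fillna(group_median)
    # Groups with no observed age fall back to the overall median.
    out["Age"] = out["Age"].fillna(frame["Age"].median())
    return out


imputed = impute_age(df)
print(imputed)
```

In a real pipeline the medians would be computed on the training folds only and then applied to held-out data, so that the imputation does not leak information into the evaluation.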

Educational Value

This project serves as an excellent introduction to core machine learning concepts including data cleaning, feature engineering, model selection, and evaluation metrics, using a dataset with historical significance and easy interpretability.