Titanic Survival Prediction Machine Learning Project

A classification model that predicts survival outcomes for Titanic passengers, demonstrating fundamental machine learning concepts through a historically significant dataset.

Tags: classification, machine learning, python, data science, kaggle

This project builds a classification model that predicts which passengers survived the Titanic disaster, using the well-known Kaggle dataset. Beyond the prediction task, it examines the socioeconomic factors that influenced survival rates and walks through the essential steps of a machine learning workflow.

Key Features

  • Binary Classification: Predicts whether each passenger survived (yes or no)
  • Feature Engineering: Derives features from the raw passenger data, including family size, titles extracted from names, and imputed ages
  • Socioeconomic Analysis: Examines how social class, gender, and age affected survival probability
  • Model Comparison: Implements and compares multiple classification algorithms
  • Hyperparameter Tuning: Tunes model parameters to improve predictive performance
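
The feature engineering steps listed above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: it assumes the standard Kaggle column names (Name, Pclass, SibSp, Parch, Age), uses a handful of toy rows, and assumes a group-median strategy for age imputation, which the project description does not specify. The helper name add_features is hypothetical.

```python
import pandas as pd

# Toy rows using the Kaggle Titanic column names; values are illustrative only.
df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Heikkinen, Miss. Laina",
        "Allen, Mr. William Henry",
        "Palsson, Master. Gosta Leonard",
    ],
    "Pclass": [3, 1, 3, 3, 3],
    "SibSp": [1, 1, 0, 0, 3],
    "Parch": [0, 0, 0, 0, 1],
    "Age": [22.0, 38.0, 26.0, None, 2.0],
})


def add_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: adds FamilySize and Title, and fills missing Age."""
    out = frame.copy()

    # Family size = siblings/spouses + parents/children + the passenger.
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1

    # The title is the text between the comma and the first period in Name.
    out["Title"] = (
        out["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
    )

    # Assumed strategy: fill missing ages with the median age of passengers
    # sharing the same class and title, falling back to the overall median.
    group_median = out.groupby(["Pclass", "Title"])["Age"].transform("median")
    out["Age"] = out["Age"].fillna(group_median).fillna(out["Age"].median())
    return out


features = add_features(df)
print(features[["Title", "FamilySize", "Age"]])
```

In this toy data the one missing age belongs to a third-class "Mr", so it is filled from the other third-class "Mr" row. On the real dataset, rare titles would likely need to be merged into broader groups before imputing.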

Tech Stack

  • Python
  • Scikit-learn
  • Pandas
  • NumPy
  • Matplotlib/Seaborn
  • XGBoost
  • Random Forest (via Scikit-learn)

Model Performance

The final ensemble model achieved 83.5% accuracy on the test set, with strong performance across all passenger classes. Random Forest and XGBoost were the strongest individual models among the algorithms tested.
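
As a rough sketch of how an ensemble like this might be assembled and scored on a held-out set, the example below combines two tree-based models with soft voting. Several things here are assumptions: the data is synthetic (so the printed accuracy has no relation to the 83.5% figure), scikit-learn's GradientBoostingClassifier stands in for XGBoost to keep the example dependency-free, and soft voting is only one of several ways the project could have combined its models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered Titanic feature matrix (assumption).
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)

# Hold out 25% of the rows, preserving the class balance in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Soft voting averages the predicted class probabilities of both models.
# GradientBoostingClassifier is a stand-in for XGBoost here.
ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)

test_accuracy = accuracy_score(y_test, ensemble.predict(X_test))
print(f"held-out accuracy: {test_accuracy:.3f}")
```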

Historical Insights

The analysis is consistent with the "women and children first" protocol: women survived at a far higher rate (74%) than men (19%). Passenger class mattered as well: first-class passengers had a survival rate of 63%, compared with 47% for second class and only 24% for third class.
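
Group-level survival rates like these are typically computed as the mean of the 0/1 survival label within each group. The sketch below shows the idea on a few made-up rows, so its output does not reproduce the percentages above; the column names follow the Kaggle dataset, and the helper survival_rate is hypothetical.

```python
import pandas as pd

# Made-up rows for illustration; not the real Titanic data.
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "female",
            "male", "male", "male", "male"],
    "Pclass": [1, 2, 3, 1, 1, 2, 3, 3],
    "Survived": [1, 1, 0, 1, 1, 0, 0, 0],
})


def survival_rate(frame: pd.DataFrame, by: str) -> pd.Series:
    """Hypothetical helper: the mean of a 0/1 label is the survival rate."""
    return frame.groupby(by)["Survived"].mean()


rate_by_sex = survival_rate(toy, "Sex")
rate_by_class = survival_rate(toy, "Pclass")
print(rate_by_sex)
print(rate_by_class)
```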

Technical Highlights

  • Missing Data Strategy: Imputed missing age values from groups defined by passenger class and title, rather than from a single global average
  • Feature Importance: Identified passenger sex, class, and fare as the strongest predictors of survival
  • Cross-Validation: Used stratified k-fold cross-validation to ensure robust model evaluation
  • Ensemble Methods: Combined predictions from multiple models to improve overall accuracy
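
The cross-validation, tuning, and feature-importance steps above could look roughly like the following. This is a sketch under stated assumptions, not the project's configuration: the data is synthetic, the feature names are placeholders, and the parameter grid and fold count are chosen only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the engineered features (assumption).
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholders

# Stratified folds keep the survived/not-survived ratio similar in each fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Small illustrative grid; every candidate is scored with the same folds.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, 5]},
    cv=cv,
    scoring="accuracy",
)
search.fit(X, y)

# Impurity-based importances from the best model, sorted high to low.
importances = search.best_estimator_.feature_importances_
ranked = sorted(
    zip(feature_names, importances), key=lambda pair: pair[1], reverse=True
)

print(f"best params: {search.best_params_}")
print(f"mean CV accuracy: {search.best_score_:.3f}")
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

Impurity-based importances can overstate high-cardinality features such as fare, so permutation importance is a common cross-check.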

Educational Value

This project serves as an excellent introduction to core machine learning concepts including data cleaning, feature engineering, model selection, and evaluation metrics, using a dataset with historical significance and easy interpretability.