A robust baseline solution using TensorFlow Decision Forests (TF-DF) for the Kaggle Spaceship Titanic competition.
This project aims to predict whether a passenger on the Spaceship Titanic was transported to an alternate dimension. We utilize TensorFlow Decision Forests (TF-DF), a library that enables the training of tree-based models (such as Random Forest) using the familiar and easy-to-use Keras API.
This approach is highly effective for tabular data as it requires minimal preprocessing compared to classical Neural Networks.
The dataset is sourced from the Kaggle competition: Spaceship Titanic.
- Total Training Data: 8,693 entries with 14 features.
- Target: Transported (Boolean), whether the passenger was transported.
- Key Features: HomePlanet, CryoSleep, Cabin, Destination, Age, VIP, and luxury amenities expenditure (RoomService, FoodCourt, etc.).
The code in this repository covers the following end-to-end steps:
We analyzed the distribution of both numerical and categorical data to understand passenger characteristics and identify patterns.
While TF-DF handles many data types natively, some adjustments were necessary:
- Handling Missing Values: Null values in numerical and boolean columns were imputed with 0.
- Boolean Conversion: Since TF-DF does not currently support boolean data types directly, columns like Transported, VIP, and CryoSleep were converted to integer (0 or 1).
- Dropping Columns: Removed PassengerId and Name as they are irrelevant for training.
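The three adjustments above can be sketched with pandas roughly as follows. This is a minimal illustration, not the notebook's exact code: the `preprocess` helper and the `NUMERIC_COLS` / `BOOL_COLS` lists are hypothetical names, and the numeric column list is assumed from the Kaggle `train.csv` schema.

```python
import pandas as pd

# Assumed from the Kaggle Spaceship Titanic schema; adjust if your columns differ.
NUMERIC_COLS = ["Age", "RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]
BOOL_COLS = ["VIP", "CryoSleep", "Transported"]


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Drop ID-like columns, impute nulls with 0, and cast booleans to 0/1."""
    # drop() returns a new DataFrame, so the caller's frame is left untouched.
    df = df.drop(columns=["PassengerId", "Name"], errors="ignore")

    # Numerical columns: replace missing values with 0.
    for col in NUMERIC_COLS:
        if col in df.columns:
            df[col] = df[col].fillna(0)

    # Boolean columns: go through float so NaN survives the cast,
    # fill it with 0 (False), then convert to integer 0/1.
    for col in BOOL_COLS:
        if col in df.columns:
            df[col] = df[col].astype(float).fillna(0).astype(int)

    return df
```

The same helper can be applied to the test set, where Transported is absent and is simply skipped.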
The Cabin feature, which contains data in the format Deck/Num/Side, was split into three more informative features: Deck, Cabin_num, and Side.
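One way to perform this split is pandas' `str.split` with `expand=True`. The sketch below uses a hypothetical `split_cabin` helper; dropping the original Cabin column afterwards is an assumption, not something stated above.

```python
import pandas as pd


def split_cabin(df: pd.DataFrame) -> pd.DataFrame:
    """Split 'Deck/Num/Side' strings in Cabin into three separate columns."""
    df = df.copy()
    # expand=True returns one column per "/"-separated part;
    # rows with a missing Cabin stay missing in all three new columns.
    df[["Deck", "Cabin_num", "Side"]] = df["Cabin"].str.split("/", expand=True)
    # Assumption: the raw Cabin string is no longer needed once split.
    return df.drop(columns=["Cabin"])
```

Note that Cabin_num remains a string column here; casting it to a number is optional and would require handling the missing rows first.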
We utilized the standard Random Forest algorithm from TF-DF.
- Model: tfdf.keras.RandomForestModel()
- Data Split: 80% Training, 20% Validation.
- Data Format: Converted Pandas DataFrame to tf.data.Dataset for optimal performance.
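The setup listed above might look like the following sketch. The helper names (`split_dataset`, `train_random_forest`), the fixed seed, and the use of `DataFrame.sample` for the 80/20 split are illustrative assumptions; the notebook may split the data differently.

```python
import pandas as pd


def split_dataset(df: pd.DataFrame, valid_ratio: float = 0.2, seed: int = 42):
    """Randomly hold out `valid_ratio` of the rows for validation."""
    valid_df = df.sample(frac=valid_ratio, random_state=seed)
    train_df = df.drop(valid_df.index)
    return train_df, valid_df


def train_random_forest(train_df, valid_df, label: str = "Transported"):
    """Train a TF-DF Random Forest and return it with the validation dataset."""
    # Imported lazily so split_dataset() can be used without TensorFlow installed.
    import tensorflow_decision_forests as tfdf

    # Convert the pandas DataFrames into tf.data.Dataset objects.
    train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_df, label=label)
    valid_ds = tfdf.keras.pd_dataframe_to_tf_dataset(valid_df, label=label)

    model = tfdf.keras.RandomForestModel()
    model.compile(metrics=["accuracy"])
    model.fit(train_ds)
    return model, valid_ds
```

A typical call would be `train_df, valid_df = split_dataset(df)` followed by `model, valid_ds = train_random_forest(train_df, valid_df)`.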
The model was evaluated using accuracy metrics on the validation set and Out-of-Bag (OOB) data.
- Training Time: ~54 seconds.
- OOB Accuracy: ~79.73%
- Validation Accuracy: 80.25%
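Both figures can be read from a trained TF-DF model. The sketch below assumes `model` is a fitted `tfdf.keras.RandomForestModel` and `valid_ds` is the validation `tf.data.Dataset`; `report_accuracy` is a hypothetical helper name.

```python
def report_accuracy(model, valid_ds):
    """Return (oob_accuracy, validation_accuracy) for a trained TF-DF model."""
    # Out-of-bag evaluation is computed during training and exposed by the inspector.
    oob_accuracy = model.make_inspector().evaluation().accuracy

    # Validation accuracy requires the accuracy metric to be attached to the model.
    model.compile(metrics=["accuracy"])
    validation_accuracy = model.evaluate(valid_ds, return_dict=True)["accuracy"]

    return oob_accuracy, validation_accuracy
```

Exact numbers will vary between runs unless the data split and the model's random seed are fixed.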
Based on the NUM_AS_ROOT metric (how often a feature appears as the root of a tree), the most influential features are:
- CryoSleep (Highly dominant)
- RoomService
- Spa
- VRDeck
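A ranking like the one above can be obtained from the model inspector. This is a sketch assuming `model` is a fitted TF-DF model; `top_root_features` is a hypothetical helper name.

```python
def top_root_features(model, k: int = 4):
    """Return the k features that most often appear as a tree root."""
    inspector = model.make_inspector()
    # variable_importances() maps each importance name to a list of
    # (feature, score) pairs, sorted from most to least important.
    importances = inspector.variable_importances()["NUM_AS_ROOT"]
    return [(feature.name, score) for feature, score in importances[:k]]
```

Other importance names exposed by the inspector can be listed with `inspector.variable_importances().keys()`.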
- Install Dependencies: pip install tensorflow tensorflow_decision_forests pandas numpy seaborn matplotlib
- Run the Notebook: Open the notebook file (e.g., spaceship_titanic_tfdf.ipynb) in Jupyter Notebook, Google Colab, or a Kaggle Kernel.
- Output: The script will generate a submission.csv file ready for upload to Kaggle.
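For reference, the submission file could be assembled along these lines. This is a sketch rather than the notebook's exact code: `build_submission` is a hypothetical helper, and the 0.5 decision threshold is an assumption.

```python
import numpy as np
import pandas as pd


def build_submission(passenger_ids, probabilities, threshold: float = 0.5) -> pd.DataFrame:
    """Turn predicted probabilities into the PassengerId/Transported table Kaggle expects."""
    # model.predict() on a binary TF-DF classifier returns shape (n, 1); flatten it.
    probs = np.asarray(probabilities).reshape(-1)
    return pd.DataFrame({
        "PassengerId": list(passenger_ids),
        "Transported": probs > threshold,  # boolean True/False labels
    })
```

With a trained model, usage would be roughly `build_submission(test_df["PassengerId"], model.predict(test_ds)).to_csv("submission.csv", index=False)`, where `test_df` and `test_ds` are the preprocessed test data.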
This repository is designed as a learning baseline. Feel free to fork and experiment with:
- Using GradientBoostedTreesModel instead of Random Forest.
- Conducting deeper hyperparameter tuning.
- Implementing more advanced missing data imputation techniques.
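As a starting point for the first two ideas, the model class can be made a parameter. The `make_model` factory below is a hypothetical helper; the hyperparameter values in the usage note are placeholders, not tuned settings.

```python
# Maps a short name to the corresponding class in tfdf.keras.
MODEL_CLASSES = {
    "random_forest": "RandomForestModel",
    "gradient_boosted_trees": "GradientBoostedTreesModel",
}


def make_model(kind: str = "random_forest", **hyperparameters):
    """Create a TF-DF model of the requested kind with optional hyperparameters."""
    if kind not in MODEL_CLASSES:
        raise ValueError(f"Unknown model kind: {kind!r}")
    # Imported lazily so that argument validation works without TensorFlow installed.
    import tensorflow_decision_forests as tfdf

    model_class = getattr(tfdf.keras, MODEL_CLASSES[kind])
    return model_class(**hyperparameters)
```

For example, `make_model("gradient_boosted_trees", num_trees=300, max_depth=6)` would create a gradient boosted trees model that can be compiled and fitted like the Random Forest baseline.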
Built with TensorFlow Decision Forests v1.2.0.