A modern, from-scratch implementation of an Image Captioning system using a Vision Transformer (ViT) encoder and a Transformer Decoder. This project demonstrates the power of attention mechanisms in both computer vision and natural language processing.
The goal of this project was to build a Transformer-based system entirely from scratch in order to understand the mechanics of cross-modal attention. Unlike traditional approaches that rely on pre-trained ResNet/CNN backbones, this model learns feature representations directly from image patches using a ViT architecture.
- Pure Transformer Architecture: Both encoder (ViT) and decoder are Transformer-based.
- Beam Search (k=5): Implemented for inference to ensure linguistically coherent and higher-quality captions compared to greedy search.
- Advanced Training: Features Automatic Mixed Precision (AMP), Label Smoothing, AdamW optimizer with Weight Decay, and aggressive data augmentation.
- Parquet Support: Directly processes high-performance Parquet datasets from HuggingFace.
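One of the training tricks above is easy to show concretely. Label Smoothing replaces the one-hot target with a softened distribution so the model is never pushed toward full certainty on a single token. A minimal sketch (the smoothing factor 0.1 and the tiny vocabulary are illustrative, not the project's actual settings):

```python
def smooth_labels(target_idx, vocab_size, eps=0.1):
    """One-hot target softened: the true token keeps 1 - eps of the
    probability mass, the rest is spread uniformly over the vocab."""
    dist = [eps / vocab_size] * vocab_size
    dist[target_idx] += 1.0 - eps
    return dist

# Toy vocabulary of 4 tokens, true token at index 2:
dist = smooth_labels(2, 4)  # ~[0.025, 0.025, 0.925, 0.025]
```

In practice this is a one-line flag on the loss function rather than a hand-rolled distribution, but the distribution above is exactly what that flag computes.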
├── checkpoints/ # Saved model weights (.pth files)
├── models/ # Model architecture definitions
│ ├── __init__.py
│ ├── vit.py # Vision Transformer Encoder
│ ├── decoder.py # Transformer Decoder
│ ├── bridge.py # Dimension alignment & normalization
│ └── model.py # Full ImageCaptioningModel class
├── scripts/ # Utility and testing scripts
│ ├── inference.py # Generate captions for custom images
│ ├── test_beam.py # Evaluate metrics using Beam Search
│ ├── test_model.py # Standard greedy evaluation
│ └── sample_test.jpg # Sample image for testing
├── utils/ # Data processing utilities
│ ├── __init__.py
│ └── data_loader.py # Vocabulary & FlickrDataset logic
├── train.py # Main training entry point
└── requirements.txt # Project dependencies
- Full Pipeline: Data loading, training loop, validation, and inference are fully operational and optimized for 6GB+ VRAM cards.
- Beam Search: Significant improvement in caption quality (+5 BLEU points over greedy search).
- Environment Recognition: The model reliably identifies background contexts (e.g., "snow," "beach," "mountain").
- Object Specificity: Since the ViT is trained from scratch on the relatively small Flickr8k dataset, it sometimes confuses similar objects (e.g., labeling "dogs" as "people" or "cats").
- Vocabulary Limit: Limited to common words found in the 8,000 images of the Flickr8k dataset.
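The Beam Search behind the results above keeps the k highest-scoring partial captions at each step instead of committing to the single best token, which is why it recovers more coherent phrasings than greedy decoding. A minimal sketch with a toy next-token scorer (the real `test_beam.py` scores with the decoder's logits; `toy_step` here is a stand-in):

```python
import math

def beam_search(step_fn, bos, eos, k=5, max_len=20):
    """step_fn(seq) -> list of (token, prob) pairs. Keeps the k partial
    sequences with the highest total log-probability at every step."""
    beams = [([bos], 0.0)]                      # (sequence, log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, p in step_fn(seq):
                cand = (seq + [tok], score + math.log(p))
                (finished if tok == eos else candidates).append(cand)
        if not candidates:
            break
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    if not finished:                            # hit max_len: fall back
        finished = beams
    return max(finished, key=lambda c: c[1])[0]

# Toy scorer standing in for the decoder's softmax output.
def toy_step(seq):
    if len(seq) >= 3:
        return [(0, 1.0)]                       # force <eos> (token 0)
    return [(1, 0.6), (2, 0.4)]

best = beam_search(toy_step, bos=-1, eos=0, k=5)
```

Summing log-probabilities rather than multiplying raw probabilities avoids numeric underflow on long captions; production implementations usually also add a length penalty, omitted here for brevity.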
Ensure you have Python 3.10+ and CUDA installed. Install dependencies:
```bash
pip install -r requirements.txt
```
Run the training script to rebuild the vocabulary and start learning:
```bash
python train.py
```
To generate a caption for your own image using the best-trained model:
```bash
python scripts/inference.py --image "path/to/your/image.jpg"
```
To run the full BLEU metric benchmark on the 1,000-image test set:
```bash
python scripts/test_beam.py
```
- Transfer Learning: Integrate pre-trained weights (e.g., a ResNet-50 or an ImageNet-pre-trained ViT) to drastically improve object recognition.
- Larger Datasets: Scale up to Flickr30k or MS-COCO for a more robust vocabulary.
- Attention Visualization: Add Grad-CAM or heatmap visualizations to see which part of the image the decoder is looking at while generating words.
- Patch Size Reduction: Move from 16x16 to 8x8 patches for higher-resolution feature extraction.
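For context on the patch-size point: halving the patch side quadruples the token sequence the encoder must attend over (the 224x224 input resolution below is the common ViT default, assumed here rather than confirmed by the project):

```python
def num_patches(image_size, patch_size):
    """Number of non-overlapping square patches a ViT splits the image into."""
    return (image_size // patch_size) ** 2

print(num_patches(224, 16))  # 196 tokens
print(num_patches(224, 8))   # 784 tokens, 4x the sequence length
```

Since self-attention cost grows quadratically with sequence length, moving from 16x16 to 8x8 patches means roughly 16x the attention compute, so the gain in feature resolution is not free.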
The aim: to explore the boundaries of Transformers in vision. By deliberately avoiding pre-trained models, this project serves as a deep dive into how raw pixels can be transformed into meaningful human language through the power of self-attention.