This report presents a comprehensive analysis of different approaches for deep fake image classification tasks. The project implements and compares multiple convolutional neural network architectures, including individual CNN models and ensemble methods. The main objective is to develop an effective classification system while documenting the complete experimental process, including hyperparameter tuning and model optimization strategies.
The training set consists of 12,500 images, the validation set of 1,250 images, and the test set of 6,500 images. Each image has a 100x100 resolution, and the CSV files (train and validation) follow this format:
image_id,label
532de967-c8fb-49a6-9a8c-3c32cfa93d3e,0
c0519e94-1422-405c-a847-ce726f4a13cf,2
13a99838-2919-4b79-b9fd-bce8f0e59e09,2
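A custom dataset class (as in dataset.py) that reads this CSV format might look like the following sketch; the class name, the `.jpg` extension, and the argument names are illustrative, not the project's actual code.

```python
import os

import pandas as pd
from PIL import Image
from torch.utils.data import Dataset


class DeepfakeDataset(Dataset):
    """Reads (image_id, label) rows and loads the matching image file."""

    def __init__(self, csv_path, image_dir, transform=None):
        self.df = pd.read_csv(csv_path)  # columns: image_id, label
        self.image_dir = image_dir
        self.transform = transform

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        path = os.path.join(self.image_dir, f"{row['image_id']}.jpg")
        image = Image.open(path).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, int(row["label"])
```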
The data preprocessing pipeline includes several stages designed to improve model generalization and performance:
The training data undergoes the following transformations:
-> resize to 100×100 pixels
-> random horizontal flip with probability 0.5
-> random rotation up to 15 degrees
-> normalization using ImageNet statistics (mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225])
For validation and test data, only essential preprocessing is applied:
-> resize to 100×100 pixels
-> normalization using the same ImageNet statistics
The DataLoader configuration is as follows:
-> batch size: 4 (Ensemble CNN), 8 (Skip Connections CNN), and 16 (Three Layers CNN)
-> number of workers: 4
-> pin memory: true (faster GPU transfer)
-> drop last: true (training only, to maintain consistent batch sizes)
Layer Type of layer Output Shape No. of Parameters
1 Conv2d [ -1, 128, 100, 100 ] 3,584
2 BatchNorm2d [ -1, 128, 100, 100 ] 256
3 ReLU [ -1, 128, 100, 100 ] 0
4 Conv2d [ -1, 256, 100, 100 ] 295,168
6 BatchNorm2d [ -1, 256, 100, 100 ] 512
6 ReLU [ -1, 256, 100, 100 ] 0
7 Max Pooling2d [ -1, 256, 50, 50 ] 0
Residual Block:
8 Conv2d [ -1, 512, 50, 50 ] 1,180,160
9 BatchNorm2d [ -1, 512, 50, 50 ] 1,024
10 ReLU [ -1, 512, 50, 50 ] 0
11 Conv2d [ -1, 512, 50, 50 ] 2,359,808
12 BatchNorm2d [ -1, 512, 50, 50 ] 1,024
13 Conv2d [ -1, 512, 50, 50 ] 131,584
14 BatchNorm2d [ -1, 512, 50, 50 ] 1,024
15 ReLU [ -1, 512, 50, 50 ] 0
16 ResidualBlock [ -1, 512, 50, 50 ] 0
17 Max Pooling2d [ -1, 512, 25, 25 ] 0
18 Conv2d [ -1, 512, 25, 25 ] 2,359,808
19 BatchNorm2d [ -1, 512, 25, 25 ] 1,024
20 ReLU [ -1, 512, 25, 25 ] 0
21 Dropout [ -1, 512, 25, 25 ] 0
Classifier:
22 AdaptiveAvgPool2d [ -1, 512, 6, 6 ] 0
23 Flatten [ -1, 512*6*6 ] 0
24 Linear-1 [ -1, 128 ] 2,359,424
25 ReLU [ -1, 128 ] 0
26 Dropout [ -1, 128 ] 0
27 Linear-2 [ -1, 5 ] 645
Total params 8,695,045
Trainable params 8,695,045
Non-trainable params 0
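Layers 8 to 15 of the table form a residual block whose parameter counts imply a 256-to-512 block with a 1x1 projection shortcut (131,584 params = 256*512 + 512). The sketch below is a reconstruction consistent with those counts, not the project's exact code.

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Two 3x3 convs with a 1x1 projection shortcut (256 -> 512 channels)."""

    def __init__(self, in_ch=256, out_ch=512):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # 1x1 projection so the skip path matches the output channel count
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=1),
            nn.BatchNorm2d(out_ch),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # the skip connection: input (projected) is added to the output
        return self.relu(out + self.shortcut(x))
```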
Layer Type of layer Output Dimension No. of Parameters
1 Conv2D [ -1, 64, 100, 100 ] 1,792
2 BatchNorm2D [ -1, 64, 100, 100 ] 128
3 ReLU [ -1, 64, 100, 100 ] 0
4 Max Pooling2D [ -1, 64, 50, 50 ] 0
5 Conv2D [ -1, 128, 50, 50 ] 73,856
6 BatchNorm2D [ -1, 128, 50, 50 ] 256
7 ReLU [ -1, 128, 50, 50 ] 0
8 Max Pooling2D [ -1, 128, 25, 25 ] 0
9 Conv2D [ -1, 192, 23, 23 ] 221,376
10 BatchNorm2D [ -1, 192, 23, 23 ] 384
11 ReLU [ -1, 192, 23, 23 ] 0
12 Max Pooling2D [ -1, 192, 11, 11 ] 0
13 Dropout2D [ -1, 192, 11, 11 ] 0
14 AdaptiveAvgPool2D [ -1, 192, 4, 4 ] 0
15 Flatten [ -1, 192*4*4 ] 0
16 Linear-1 [ -1, 128] 393,344
17 ReLU [ -1, 128 ] 0
18 Dropout [ -1, 128 ] 0
19 Linear-2 [ -1, 5 ] 645
Total params 691,781
Trainable params 691,781
Non-trainable params 0
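The table above corresponds to a compact three-layer CNN. The sketch below is a reconstruction whose parameter count matches the 691,781 total; the dropout rates are assumptions, as the table does not record them.

```python
import torch
import torch.nn as nn


class ThreeCNN(nn.Module):
    """Three conv blocks, adaptive pooling, and a small classifier head."""

    def __init__(self, num_classes=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.BatchNorm2d(64),
            nn.ReLU(), nn.MaxPool2d(2),                   # 100 -> 50
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128),
            nn.ReLU(), nn.MaxPool2d(2),                   # 50 -> 25
            nn.Conv2d(128, 192, 3), nn.BatchNorm2d(192),  # no padding: 25 -> 23
            nn.ReLU(), nn.MaxPool2d(2),                   # 23 -> 11
            nn.Dropout2d(0.5),                            # rate assumed
            nn.AdaptiveAvgPool2d(4),                      # 11 -> 4
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(192 * 4 * 4, 128), nn.ReLU(),
            nn.Dropout(0.5),                              # rate assumed
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```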
Component Type of layer Output Dimension No. of Parameters
Base Models
SkipConCNN Pre-trained CNN [ -1, 5 ] 8,695,045
ThreeCNN Pre-trained CNN [ -1, 5 ] 691,781
Weights Layer
1 Linear [ -1, 16 ] 176
2 ReLU [ -1, 16 ] 0
3 Linear [ -1, 2 ] 34
4 Softmax [ -1, 2 ] 0
Meta Classifier
1 Linear [ -1, 32 ] 352
2 ReLU [ -1, 32 ] 0
3 Dropout [ -1, 32 ] 0
4 Linear [ -1, 5 ] 165
Total params 9,387,553
Total frozen params 9,386,826
Total trainable params 727
Ensemble Strategy:
-> Dynamic Weight Assignment: Neural network learns optimal weights for
combining base model predictions
-> Meta Classification: Additional classifier processes concatenated
softmax outputs
-> Confidence-based Blending: Final prediction combines weighted and meta
predictions based on confidence scores
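The strategy above can be sketched as follows. The blending rule and the dropout rate are illustrative assumptions, and `model_a`/`model_b` stand in for the frozen SkipConCNN and ThreeCNN; only the weights layer and meta classifier (727 parameters) are trainable.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EnsembleCNN(nn.Module):
    def __init__(self, model_a, model_b, num_classes=5):
        super().__init__()
        self.model_a, self.model_b = model_a, model_b
        for p in self.model_a.parameters():
            p.requires_grad = False     # base models stay frozen
        for p in self.model_b.parameters():
            p.requires_grad = False
        # learns per-sample weights for the two base models
        self.weights_layer = nn.Sequential(
            nn.Linear(2 * num_classes, 16), nn.ReLU(),
            nn.Linear(16, 2), nn.Softmax(dim=1),
        )
        # meta classifier over the concatenated softmax outputs
        self.meta = nn.Sequential(
            nn.Linear(2 * num_classes, 32), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(32, num_classes),
        )

    def forward(self, x):
        pa = F.softmax(self.model_a(x), dim=1)
        pb = F.softmax(self.model_b(x), dim=1)
        both = torch.cat([pa, pb], dim=1)
        w = self.weights_layer(both)                  # per-sample model weights
        weighted = w[:, :1] * pa + w[:, 1:] * pb
        meta = F.softmax(self.meta(both), dim=1)
        # confidence-based blending: trust the weighted vote more when confident
        conf = weighted.max(dim=1, keepdim=True).values
        return conf * weighted + (1 - conf) * meta
```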
Parameter SkipConCNN ThreeCNN EnsembleCNN
Batch Size 8 16 4
Loss Function CrossEntropyLoss CrossEntropyLoss CrossEntropyLoss
Label Smoothing - - 0.1
Optimizer Adam Adam Adam
Learning Rate 0.0005 0.0005 0.0001
Weight Decay 0.0001 0.0001 0.0001
Scheduler ReduceLROnPlateau ReduceLROnPlateau ReduceLROnPlateau
Mode 'max' 'max' 'max'
Patience 5 15 5
Factor 0.5 0.8 0.5
Parameter SkipConCNN ThreeCNN EnsembleCNN
Epochs 200 200 80
Early Stop 25 15 5
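The settings from the tables translate directly to PyTorch; a sketch for the ThreeCNN configuration, where `model` is only a placeholder.

```python
import torch

model = torch.nn.Linear(10, 5)  # placeholder for the actual ThreeCNN
criterion = torch.nn.CrossEntropyLoss()  # the ensemble adds label_smoothing=0.1
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, weight_decay=1e-4)
# 'max' mode because the scheduler monitors validation accuracy
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", patience=15, factor=0.8)

# per epoch, after validation:
#     scheduler.step(val_accuracy)
```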
Confusion Matrix:
0 1 2 3 4 predicted
+------------------------+
0 | 236 2 3 0 9 |
1 | 4 230 2 1 13 |
2 | 3 0 242 0 5 |
3 | 0 0 1 248 1 |
4 | 24 14 10 0 202 |
actual +------------------------+
Classification Report:
precision recall f1-score support
0 0.88 0.94 0.91 250
1 0.93 0.92 0.93 250
2 0.94 0.97 0.95 250
3 1.00 0.99 0.99 250
4 0.88 0.81 0.84 250
accuracy 0.93 1250
macro avg 0.93 0.93 0.93 1250
weighted avg 0.93 0.93 0.93 1250
Confusion Matrix:
0 1 2 3 4 predicted
+------------------------+
0 | 228 1 4 0 17 |
1 | 2 228 1 0 19 |
2 | 7 0 235 0 8 |
3 | 0 0 0 250 0 |
4 | 14 26 11 0 199 |
actual +------------------------+
Classification Report:
precision recall f1-score support
0 0.91 0.91 0.91 250
1 0.89 0.91 0.90 250
2 0.94 0.94 0.94 250
3 1.00 1.00 1.00 250
4 0.82 0.80 0.81 250
accuracy 0.91 1250
macro avg 0.91 0.91 0.91 1250
weighted avg 0.91 0.91 0.91 1250
Confusion Matrix:
0 1 2 3 4 predicted
+------------------------+
0 | 234 2 3 0 11 |
1 | 0 238 1 0 11 |
2 | 3 0 242 0 5 |
3 | 0 0 1 248 1 |
4 | 11 23 10 0 206 |
actual +------------------------+
Classification Report:
precision recall f1-score support
0 0.94 0.94 0.94 250
1 0.90 0.95 0.93 250
2 0.94 0.97 0.95 250
3 1.00 0.99 1.00 250
4 0.88 0.82 0.85 250
accuracy 0.93 1250
macro avg 0.93 0.93 0.93 1250
weighted avg 0.93 0.93 0.93 1250
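The confusion matrices and classification reports above follow the format produced by scikit-learn; a sketch with toy stand-in labels (`y_true`/`y_pred` would be the real validation labels and predictions).

```python
from sklearn.metrics import classification_report, confusion_matrix

y_true = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]  # toy stand-ins for validation labels
y_pred = [0, 4, 1, 1, 2, 2, 3, 3, 4, 0]

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, digits=2))
```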
Geometric augmentations significantly improved model performance, while photometric augmentations degraded accuracy; advanced augmentation techniques were not evaluated in this study. The gap between training with and without photometric augmentation was around 3%. My hypothesis is that, at 100x100 resolution, the images already carry limited visual information, so altering brightness and color removed the few discriminative details the model could learn from.
Residual connections are architectural components that create skip pathways, allowing input information to bypass one or more layers and be added directly to the output. This mechanism lets the network learn residual mappings rather than complete transformations. The SkipConCNN with residual connections achieved 92.64% accuracy compared to the 89.78% of ThreeCNN (the starting model). My hypothesis is that, at 100x100 resolution, the images contain limited visual information, and residual connections preserved critical details that would otherwise be lost through successive convolutions. The skip connections allowed the model to maintain both the local texture patterns and the global structural information necessary for effective deepfake classification.
Ensemble learning combines predictions from multiple models to achieve better performance than the individual components. The EnsembleCNN architecture merges SkipConCNN and ThreeCNN through a weighting mechanism that dynamically balances their contributions based on prediction confidence. The ensemble achieved 93.44% accuracy compared to SkipConCNN's 92.64% and ThreeCNN's 93.03% (after I adjusted the dilation of the last convolutional layer and made it wider). The ensemble uses a dual-pathway approach: a learned weights layer determines optimal model combination ratios, while a separate meta classifier processes the concatenated softmax outputs. The final output combines the weighted predictions with the meta classifier's results based on prediction confidence, allowing the model to rely on individual model expertise when confident and fall back to ensemble learning when uncertain. Combining models with different architectural strengths (residual learning vs. traditional convolution) captures complementary feature representations.
Several architectural experiments failed to improve performance beyond the baseline models. I implemented an EfficientNet-inspired architecture using SE blocks and multi-scale attention mechanisms, but it did not exceed 90% accuracy, significantly underperforming the simpler CNNs. A brute-force hyperparameter search on a two-convolutional CNN found the best parameters (out1=64, out2=128, kernel=5) yielding 76% accuracy, still below the three-layer baseline.
Additionally, a five-convolutional CNN suffered from overfitting despite regularization techniques, confirming that deeper architectures were counterproductive for this dataset. The conclusion was that 3-4 convolutional layers represent the optimal depth balance for this specific task and dataset constraints.
The ReduceLROnPlateau scheduler with patience=5 and factor=0.5 destabilized training of the three-convolutional CNN through overly aggressive learning rate reductions; I arrived at the final settings (patience=15, factor=0.8) through a rough binary search. The difference between the Adam and AdamW optimizers was negligible, with both achieving similar performance. Label smoothing (0.1) did not yield significant improvements, and in some cases it even degraded performance.
The primary limitation identified during this study was the poor classification performance for Class 4, which was discovered relatively late in the experimental process. This class demonstrated significantly lower prediction accuracy compared to other categories, indicating potential issues with feature distinguishability. Two mitigation strategies were attempted but proved unsuccessful due to time constraints and suboptimal implementation:
-> Specialized CNN for Class 4: a dedicated binary classifier was trained specifically to distinguish Class 4 from the other categories, but failed to achieve meaningful improvements
-> Weighted Loss Function: label smoothing was replaced with a class-weighted tensor approach, implemented by continuing training from the best checkpoint rather than retraining from scratch, resulting in degraded performance
Both approaches were implemented as quick fixes rather than systematic solutions, contributing to their failure.
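The class-weighted tensor approach can be sketched as follows; the weight values are assumptions for illustration, not the ones used in the experiment.

```python
import torch
import torch.nn as nn

# Assumed weights (not taken from the report): upweight Class 4, which all
# three models confused most often.
class_weights = torch.tensor([1.0, 1.0, 1.0, 1.0, 2.0])
criterion = nn.CrossEntropyLoss(weight=class_weights)  # replaces label smoothing

# Per-sample view: with reduction="none", a Class-4 error costs twice as much
# as the same error on Class 0.
per_sample = nn.CrossEntropyLoss(weight=class_weights, reduction="none")
losses = per_sample(torch.zeros(2, 5), torch.tensor([0, 4]))
```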
Future research should prioritize addressing the Class 4 classification challenge through:
-> Data Analysis: comprehensive investigation of Class 4 characteristics and potential mislabeling issues
-> Balanced Sampling: advanced sampling techniques (e.g., SMOTE) or loss reweighting (e.g., focal loss) to handle class imbalance
-> Feature Engineering: specialized feature extraction methods targeting Class 4's distinguishing characteristics
The project is organized as follows:
project/
|-- main.py: core logic for training and evaluation
|-- models.py: CNN architectures and ensemble implementation
|-- dataloader.py: data loading and preprocessing
|-- dataset.py: custom dataset class
|-- dataset/
| |-- train.csv
| |-- validation.csv
| |-- test.csv
| |-- train/
| | |-- image1.jpg
| | |-- image2.jpg
| | ...
| |-- validation/
| | |-- image1.jpg
| | |-- image2.jpg
| | ...
| |-- test/
| | |-- image1.jpg
| | |-- image2.jpg
| | ...
|-- env/
|-- models_pth/
|-- requirements.txt
GPU: NVIDIA GeForce GTX 1650 Driver: 576.02 CUDA: 12.9
Memory: 628MiB/4096MiB GPU Util: 4%
CPU: Intel Core i5-10300H @ 2.50GHz Cores: 4 Threads: 8
Cache: L1: 256KB L2: 1MB L3: 8MB Architecture: x86_64
OS: Windows with WSL (Windows Subsystem for Linux)