A deep learning system for classifying speech into 8 emotions using a dual-input fusion architecture.
Trained on 4 public datasets — achieves 89.3% test accuracy and 91.1% macro F1.
Record 3 seconds of audio or upload a file and watch the signal race through the pipeline in real time.
Hosted on HuggingFace Spaces · Docker · Keras 3.10 · TensorFlow 2.18
| Metric | Score |
|---|---|
| Test Accuracy | 89.29% |
| Macro F1 | 91.09% |
| Best Val Accuracy (training) | 81.41% (epoch 67/80) |
| TTA Test Accuracy (15 passes) | ~82–83% on Kaggle |
Full evaluation on held-out test data in Google Colab (CPU, no TTA).
Pre-trained model: huggingface.co/VeerR13/Speech-Emotion-Recognition
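The TTA row above averages predictions over several randomly augmented passes. A minimal sketch of that idea, assuming gain/noise augmentations like the training list and the preprocessing helpers from the quickstart below (the notebook's exact TTA transforms may differ):

```python
import numpy as np

def predict_tta(model, audio, scaler, extract_features, extract_mel_multiscale,
                n_passes=15, sr=22050):
    """Average softmax outputs over n_passes randomly augmented copies.

    The augmentations here (random gain + light Gaussian noise) are
    assumptions; they stand in for whatever the notebook applies at test time.
    """
    probs = []
    for _ in range(n_passes):
        aug = audio * np.random.uniform(0.8, 1.2)           # random gain
        aug = aug + np.random.normal(0, 0.002, len(aug))    # light noise
        flat = scaler.transform([extract_features(aug)])
        spec = extract_mel_multiscale(aug)[np.newaxis]
        probs.append(model.predict([spec, flat], verbose=0)[0])
    return np.mean(probs, axis=0)
```

Averaging over augmented views smooths out prediction noise at the cost of n extra forward passes.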
| Emotion | Precision | Recall | F1 |
|---|---|---|---|
| calm | 1.00 | 1.00 | 1.00 |
| surprise | 1.00 | 1.00 | 1.00 |
| angry | 0.98 | 0.98 | 0.98 |
| fear | 0.97 | 0.97 | 0.97 |
| happy | 0.94 | 0.97 | 0.95 |
| disgust | 0.93 | 0.87 | 0.90 |
| neutral | 0.82 | 0.79 | 0.81 |
| sad | 0.79 | 0.81 | 0.80 |
The live demo is a premium dark-mode SPA built with Flask and vanilla JS:
- 3-second WAV recorder — uses `AudioContext` + `ScriptProcessor` for cross-browser compatibility (works on iOS Safari, Chrome, Firefox)
- Speed-of-sound analysis overlay — when audio is submitted, a waveform blasts through 5 labeled pipeline stages at full speed, then the overlay whooshes off-screen left to reveal results
- Live neural architecture canvas — section 03 shows the dual-branch CNN + MLP model as a live-animated flow diagram with data packets racing through processing nodes
- Three.js 3D audio sphere — hero section Fibonacci-distributed frequency bars that respond to scroll
- Result panel — emotion label, confidence %, animated probability bars per class
Source: VeerR13/Speech-Emotion-App on HuggingFace Spaces
angry · happy · neutral · sad
The web app's model is trained on a 4-class subset. The full model supports all 8 classes — see the per-class table above.
Dual-input fusion — CNN branch processes mel spectrograms, MLP branch processes handcrafted acoustic features. Both fuse via a learned sigmoid gate.
```
Audio Input (22,050 Hz · 3 s · mono)
        │
        ├──── Spectrogram Branch ──────────────────────────────────────────┐
        │     Multi-scale mel spectrograms (3 channels, FFT 512/1024/2048) │
        │     4× ResBlock (64 → 128 → 256 → 512 filters)                   │
        │     CBAM attention (channel + spatial) per block                 │
        │     Attention pooling → Dense(256)                               │
        │          ├─→ Gated Fusion → Dense(256) → Dense(128) → Softmax
        └──── Features Branch ─────────────────────────────────────────────┘
              645-dim handcrafted features:
                MFCCs (40) × 4 stats + delta/delta-delta, chroma (12),
                mel per-band (128), spectral contrast (7), tonnetz (6),
                ZCR, RMS, centroid, bandwidth, rolloff, piptrack pitch
              MLP: Dense(512) → Dense(256) → Dense(128) with skip connections
```
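The gated fusion at the merge point computes `gate = sigmoid(W·[cnn; mlp] + b)` and blends the two branch embeddings as `gate ⊙ cnn + (1 − gate) ⊙ mlp`. A minimal NumPy sketch of that math (in the trained model, `W` and `b` live inside a learned Dense layer; the exact gating form is inferred from the description above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(cnn_emb, mlp_emb, W, b):
    """Learned sigmoid gate over the concatenated branch embeddings.

    gate = sigmoid(W @ [cnn; mlp] + b)
    out  = gate * cnn + (1 - gate) * mlp

    W (units, 2*units) and b (units,) stand in for a trained Dense
    layer's weights — an assumption for illustration.
    """
    concat = np.concatenate([cnn_emb, mlp_emb])
    gate = sigmoid(W @ concat + b)
    return gate * cnn_emb + (1.0 - gate) * mlp_emb
```

Because the gate is a vector, each embedding dimension gets its own CNN-vs-MLP weighting per prediction, rather than a single global mixing coefficient.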
| Component | Role |
|---|---|
| Multi-scale spectrograms | 3 parallel mel spectrograms stacked as RGB-like channels — captures fine temporal detail and broad spectral shape |
| CBAM attention | Channel + spatial attention inside each residual block |
| Attention pooling | Learned 1D weights per time frame — focuses on emotionally salient segments |
| Gated fusion | Sigmoid gate vector learns per-prediction weighting of CNN vs MLP branch |
| Focal Loss (γ=2.0, label smoothing=0.1) | Downweights easy examples, prevents overconfidence on hard pairs |
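The focal loss row above combines the `(1 − p)^γ` downweighting term with label smoothing. A per-sample NumPy sketch of the standard formulation (the notebook's exact variant, e.g. any α class weighting, may differ):

```python
import numpy as np

def focal_loss(y_true, y_pred, gamma=2.0, smoothing=0.1, eps=1e-7):
    """Focal loss with label smoothing, computed per sample.

    FL = -sum_c (1 - p_c)^gamma * y_smooth_c * log(p_c)

    gamma downweights easy (high-confidence-correct) examples;
    smoothing spreads a little target mass over all classes to
    prevent overconfidence.
    """
    n_classes = y_true.shape[-1]
    y_smooth = y_true * (1.0 - smoothing) + smoothing / n_classes
    p = np.clip(y_pred, eps, 1.0 - eps)
    return -np.sum((1.0 - p) ** gamma * y_smooth * np.log(p), axis=-1)
```

With γ=0 and smoothing=0 this reduces to plain categorical cross-entropy; raising γ shrinks the loss on well-classified examples so training focuses on hard pairs like neutral vs sad.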
| Parameter | Value |
|---|---|
| Datasets | RAVDESS, TESS, CREMA-D, SAVEE |
| Total audio clips | ~7,400 |
| Sample rate | 22,050 Hz |
| Clip duration | 3 seconds |
| Batch size | 64 |
| Epochs | 80 (early stop patience=25) |
| Optimizer | AdamW (weight decay=1e-4) |
| LR schedule | Linear warmup (5 epochs) → Cosine decay (1e-3 → 1e-6) |
| Loss | Focal Loss (γ=2.0, label smoothing=0.1) |
| Augmentation | SpecAugment, random gain, Gaussian noise, Mixup (α=0.4) |
| Training platform | Kaggle T4 GPU (~8 hours) |
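The LR schedule in the table (linear warmup for 5 epochs, then cosine decay from 1e-3 to 1e-6) can be sketched as a pure function of the step count. The step numbers below are assumptions (~116 steps/epoch from ~7,400 clips at batch 64, 80 epochs total); the notebook's `WarmupCosine` may use different constants:

```python
import math

def warmup_cosine(step, peak=1e-3, min_lr=1e-6, warmup_steps=580, total_steps=9280):
    """Linear warmup to `peak`, then cosine decay to `min_lr`.

    warmup_steps ≈ 5 epochs and total_steps ≈ 80 epochs at an assumed
    ~116 steps/epoch — illustrative values, not the notebook's exact ones.
    """
    if step < warmup_steps:
        return peak * step / warmup_steps
    t = (step - warmup_steps) / (total_steps - warmup_steps)  # 0 → 1
    return min_lr + 0.5 * (peak - min_lr) * (1.0 + math.cos(math.pi * t))
```

Warmup avoids large early updates while batch statistics settle; the cosine tail lets the model fine-tune at a near-zero rate in the final epochs.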
| File | Description |
|---|---|
| `SER_v2_kaggle.ipynb` | Full training notebook — Kaggle T4 GPU, ~8 hours |
| `SER_colab_eval.ipynb` | Evaluation-only notebook — Google Colab, loads pretrained weights |
| `best_ser_v2.keras` | Trained model weights (89.3% test accuracy) |
| `label_encoder.pkl` | Sklearn LabelEncoder mapping emotion strings ↔ class indices |
| `scaler.pkl` | Sklearn StandardScaler fitted on training features |
Pre-trained model: huggingface.co/VeerR13/Speech-Emotion-Recognition
Download `best_ser_v2.keras`, `label_encoder.pkl`, and `scaler.pkl` to skip the 8-hour training run.
- Open `SER_colab_eval.ipynb` in Google Colab
- Mount Google Drive when prompted
- Paste your Kaggle API token in Cell 2: `KAGGLE_TOKEN = '{"username":"YOUR_USERNAME","key":"YOUR_KEY"}'`
- Upload `best_ser_v2.keras`, `label_encoder.pkl`, `scaler.pkl` to `MyDrive/SER/`
- Run all cells — feature extraction ~8 min, then evaluation runs automatically

Feature arrays are cached to Drive after first extraction.
- Create a Notebook on kaggle.com and upload `SER_v2_kaggle.ipynb`
- + Add Data → search for:
  - `uwrfkaggler/ravdess-emotional-speech-audio`
  - `ejlok1/toronto-emotional-speech-set-tess`
  - `ejlok1/cremad`
  - `ejlok1/surrey-audiovisual-expressed-emotion-savee`
- Settings → Accelerator → GPU P100
- Run All — downloads, trains, and saves weights to the Output tab (~8 hours)
```python
import pickle
import numpy as np
import librosa
import keras as k

# Load artifacts
with open('label_encoder.pkl', 'rb') as f:
    le = pickle.load(f)
with open('scaler.pkl', 'rb') as f:
    scaler = pickle.load(f)

# Register custom objects so the .keras file deserializes.
# These are inference-time stubs: the model is loaded with compile=False,
# so the loss and LR schedule are never actually invoked.
@k.saving.register_keras_serializable(package='custom')
class FocalLoss(k.losses.Loss):
    def __init__(self, gamma=2.0, smoothing=0.1, **kw):
        super().__init__(**kw)
        self.gamma = gamma
        self.smoothing = smoothing

    def call(self, y_true, y_pred):
        return k.losses.categorical_crossentropy(
            y_true, y_pred, label_smoothing=self.smoothing)

    def get_config(self):
        return {**super().get_config(),
                'gamma': self.gamma, 'smoothing': self.smoothing}

@k.saving.register_keras_serializable(package='custom')
class WarmupCosine(k.optimizers.schedules.LearningRateSchedule):
    def __init__(self, peak=1e-3, wu_steps=1000, total=10000, min_lr=1e-6, **kw):
        super().__init__(**kw)
        self.peak = peak
        self.wu_steps = wu_steps
        self.total = total
        self.min_lr = min_lr

    def __call__(self, step):  # stub — unused at inference
        return self.peak

    def get_config(self):
        return {'peak': self.peak, 'wu_steps': self.wu_steps,
                'total': self.total, 'min_lr': self.min_lr}

k.config.enable_unsafe_deserialization()
model = k.models.load_model('best_ser_v2.keras', compile=False)

# Preprocess audio: resample to 22,050 Hz, trim silence, pad/crop to 3 s
SR, DUR = 22050, 3
audio, _ = librosa.load('speech.wav', sr=SR, duration=DUR)
audio, _ = librosa.effects.trim(audio, top_db=25)
audio = np.pad(audio, (0, max(0, SR * DUR - len(audio))))[:SR * DUR]

# Predict (implement extract_features / extract_mel_multiscale from notebook)
flat = scaler.transform([extract_features(audio)])
spec = extract_mel_multiscale(audio)[np.newaxis]
pred = model.predict([spec, flat], verbose=0)
emotion = le.classes_[np.argmax(pred)]
print(f'{emotion} {pred.max()*100:.1f}%')
```

Requirements:

```
tensorflow-cpu>=2.18
keras>=3.10
librosa>=0.10
scikit-learn>=1.3
numpy
soundfile
```
| Dataset | Speakers | Emotions | Clips |
|---|---|---|---|
| RAVDESS | 24 actors | 8 | ~1,440 |
| TESS | 2 actresses | 7 | ~2,800 |
| CREMA-D | 91 actors | 6 | ~7,442 (subset) |
| SAVEE | 4 male speakers | 7 | ~480 |
