A deep learning system for classifying speech into 8 emotions using a dual-input fusion architecture.
Trained on 4 public datasets — achieves 89.3% test accuracy and 91.1% macro F1.
Record 3 seconds of audio or upload a file and watch the signal race through the pipeline in real time.
Hosted on HuggingFace Spaces · Docker · Keras 3.10 · TensorFlow 2.18
| Metric | Score |
|---|---|
| Test Accuracy | 89.29% |
| Macro F1 | 91.09% |
| Best Val Accuracy (training) | 81.41% (epoch 67/80) |
| TTA Test Accuracy (15 passes) | ~82–83% on Kaggle |
Full evaluation on held-out test data in Google Colab (CPU, no TTA).
Pre-trained model: huggingface.co/VeerR13/Speech-Emotion-Recognition
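The TTA row above averages predictions over several randomly augmented passes. A minimal sketch of that idea, assuming gain/noise augmentations like the training list and the preprocessing helpers from the quickstart below (the notebook's exact TTA transforms may differ):

```python
import numpy as np

def predict_tta(model, audio, scaler, extract_features, extract_mel_multiscale,
                n_passes=15, sr=22050):
    """Average softmax outputs over n_passes randomly augmented copies.

    The augmentations here (random gain + light Gaussian noise) are
    assumptions; they stand in for whatever the notebook applies at test time.
    """
    probs = []
    for _ in range(n_passes):
        aug = audio * np.random.uniform(0.8, 1.2)           # random gain
        aug = aug + np.random.normal(0, 0.002, len(aug))    # light noise
        flat = scaler.transform([extract_features(aug)])
        spec = extract_mel_multiscale(aug)[np.newaxis]
        probs.append(model.predict([spec, flat], verbose=0)[0])
    return np.mean(probs, axis=0)
```

Averaging over augmented views smooths out prediction noise at the cost of n extra forward passes.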
| Emotion | Precision | Recall | F1 |
|---|---|---|---|
| calm | 1.00 | 1.00 | 1.00 |
| surprise | 1.00 | 1.00 | 1.00 |
| angry | 0.98 | 0.98 | 0.98 |
| fear | 0.97 | 0.97 | 0.97 |
| happy | 0.94 | 0.97 | 0.95 |
| disgust | 0.93 | 0.87 | 0.90 |
| neutral | 0.82 | 0.79 | 0.81 |
| sad | 0.79 | 0.81 | 0.80 |
The live demo is a premium dark-mode SPA built with Flask and vanilla JS:
- 3-second WAV recorder — uses `AudioContext` + `ScriptProcessor` for cross-browser compatibility (works on iOS Safari, Chrome, Firefox)
- Speed-of-sound analysis overlay — when audio is submitted, a waveform blasts through 5 labeled pipeline stages at full speed, then the overlay whooshes off-screen left to reveal results
- Live neural architecture canvas — section 03 shows the dual-branch CNN + MLP model as a live-animated flow diagram with data packets racing through processing nodes
- Three.js 3D audio sphere — hero section Fibonacci-distributed frequency bars that respond to scroll
- Result panel — emotion label, confidence %, animated probability bars per class
Source: VeerR13/Speech-Emotion-App on HuggingFace Spaces
angry · happy · neutral · sad
The web app's model is trained on a 4-class subset. The full model supports all 8 classes — see the per-class table above.
Dual-input fusion — CNN branch processes mel spectrograms, MLP branch processes handcrafted acoustic features. Both fuse via a learned sigmoid gate.
```
Audio Input (22,050 Hz · 3 s · mono)
        │
        ├──── Spectrogram Branch ──────────────────────────────────────────┐
        │     Multi-scale mel spectrograms (3 channels, FFT 512/1024/2048) │
        │     4× ResBlock (64 → 128 → 256 → 512 filters)                   │
        │     CBAM attention (channel + spatial) per block                 │
        │     Attention pooling → Dense(256)                               │
        │          ├─→ Gated Fusion → Dense(256) → Dense(128) → Softmax
        └──── Features Branch ─────────────────────────────────────────────┘
              645-dim handcrafted features:
                MFCCs (40) × 4 stats + delta/delta-delta, chroma (12),
                mel per-band (128), spectral contrast (7), tonnetz (6),
                ZCR, RMS, centroid, bandwidth, rolloff, piptrack pitch
              MLP: Dense(512) → Dense(256) → Dense(128) with skip connections
```
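The gated fusion at the merge point computes `gate = sigmoid(W·[cnn; mlp] + b)` and blends the two branch embeddings as `gate ⊙ cnn + (1 − gate) ⊙ mlp`. A minimal NumPy sketch of that math (in the trained model, `W` and `b` live inside a learned Dense layer; the exact gating form is inferred from the description above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(cnn_emb, mlp_emb, W, b):
    """Learned sigmoid gate over the concatenated branch embeddings.

    gate = sigmoid(W @ [cnn; mlp] + b)
    out  = gate * cnn + (1 - gate) * mlp

    W (units, 2*units) and b (units,) stand in for a trained Dense
    layer's weights — an assumption for illustration.
    """
    concat = np.concatenate([cnn_emb, mlp_emb])
    gate = sigmoid(W @ concat + b)
    return gate * cnn_emb + (1.0 - gate) * mlp_emb
```

Because the gate is a vector, each embedding dimension gets its own CNN-vs-MLP weighting per prediction, rather than a single global mixing coefficient.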
| Component | Role |
|---|---|
| Multi-scale spectrograms | 3 parallel mel spectrograms stacked as RGB-like channels — captures fine temporal detail and broad spectral shape |
| CBAM attention | Channel + spatial attention inside each residual block |
| Attention pooling | Learned 1D weights per time frame — focuses on emotionally salient segments |
| Gated fusion | Sigmoid gate vector learns per-prediction weighting of CNN vs MLP branch |
| Focal Loss (γ=2.0, label smoothing=0.1) | Downweights easy examples, prevents overconfidence on hard pairs |
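The focal loss row above combines the `(1 − p)^γ` downweighting term with label smoothing. A per-sample NumPy sketch of the standard formulation (the notebook's exact variant, e.g. any α class weighting, may differ):

```python
import numpy as np

def focal_loss(y_true, y_pred, gamma=2.0, smoothing=0.1, eps=1e-7):
    """Focal loss with label smoothing, computed per sample.

    FL = -sum_c (1 - p_c)^gamma * y_smooth_c * log(p_c)

    gamma downweights easy (high-confidence-correct) examples;
    smoothing spreads a little target mass over all classes to
    prevent overconfidence.
    """
    n_classes = y_true.shape[-1]
    y_smooth = y_true * (1.0 - smoothing) + smoothing / n_classes
    p = np.clip(y_pred, eps, 1.0 - eps)
    return -np.sum((1.0 - p) ** gamma * y_smooth * np.log(p), axis=-1)
```

With γ=0 and smoothing=0 this reduces to plain categorical cross-entropy; raising γ shrinks the loss on well-classified examples so training focuses on hard pairs like neutral vs sad.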
| Parameter | Value |
|---|---|
| Datasets | RAVDESS, TESS, CREMA-D, SAVEE |
| Total audio clips | ~7,400 |
| Sample rate | 22,050 Hz |
| Clip duration | 3 seconds |
| Batch size | 64 |
| Epochs | 80 (early stop patience=25) |
| Optimizer | AdamW (weight decay=1e-4) |
| LR schedule | Linear warmup (5 epochs) → Cosine decay (1e-3 → 1e-6) |
| Loss | Focal Loss (γ=2.0, label smoothing=0.1) |
| Augmentation | SpecAugment, random gain, Gaussian noise, Mixup (α=0.4) |
| Training platform | Kaggle T4 GPU (~8 hours) |
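The LR schedule in the table (linear warmup for 5 epochs, then cosine decay from 1e-3 to 1e-6) can be sketched as a pure function of the step count. The step numbers below are assumptions (~116 steps/epoch from ~7,400 clips at batch 64, 80 epochs total); the notebook's `WarmupCosine` may use different constants:

```python
import math

def warmup_cosine(step, peak=1e-3, min_lr=1e-6, warmup_steps=580, total_steps=9280):
    """Linear warmup to `peak`, then cosine decay to `min_lr`.

    warmup_steps ≈ 5 epochs and total_steps ≈ 80 epochs at an assumed
    ~116 steps/epoch — illustrative values, not the notebook's exact ones.
    """
    if step < warmup_steps:
        return peak * step / warmup_steps
    t = (step - warmup_steps) / (total_steps - warmup_steps)  # 0 → 1
    return min_lr + 0.5 * (peak - min_lr) * (1.0 + math.cos(math.pi * t))
```

Warmup avoids large early updates while batch statistics settle; the cosine tail lets the model fine-tune at a near-zero rate in the final epochs.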
| File | Description |
|---|---|
| `SER_v2_kaggle.ipynb` | Full training notebook — Kaggle T4 GPU, ~8 hours |
| `SER_colab_eval.ipynb` | Evaluation-only notebook — Google Colab, loads pretrained weights |
| `best_ser_v2.keras` | Trained model weights (89.3% test accuracy) |
| `label_encoder.pkl` | Sklearn LabelEncoder mapping emotion strings ↔ class indices |
| `scaler.pkl` | Sklearn StandardScaler fitted on training features |
Pre-trained model: huggingface.co/VeerR13/Speech-Emotion-Recognition
Download `best_ser_v2.keras`, `label_encoder.pkl`, and `scaler.pkl` to skip the 8-hour training run.
- Open `SER_colab_eval.ipynb` in Google Colab
- Mount Google Drive when prompted
- Paste your Kaggle API token in Cell 2: `KAGGLE_TOKEN = '{"username":"YOUR_USERNAME","key":"YOUR_KEY"}'`
- Upload `best_ser_v2.keras`, `label_encoder.pkl`, `scaler.pkl` to `MyDrive/SER/`
- Run all cells — feature extraction ~8 min, then evaluation runs automatically

Feature arrays are cached to Drive after first extraction.
- Create a Notebook on kaggle.com and upload `SER_v2_kaggle.ipynb`
- + Add Data → search for:
  - `uwrfkaggler/ravdess-emotional-speech-audio`
  - `ejlok1/toronto-emotional-speech-set-tess`
  - `ejlok1/cremad`
  - `ejlok1/surrey-audiovisual-expressed-emotion-savee`
- Settings → Accelerator → GPU P100
- Run All — downloads, trains, and saves weights to the Output tab (~8 hours)
```python
import pickle
import numpy as np
import librosa
import keras as k

# Load artifacts
with open('label_encoder.pkl', 'rb') as f:
    le = pickle.load(f)
with open('scaler.pkl', 'rb') as f:
    scaler = pickle.load(f)

# Register custom objects so the .keras file deserializes.
# These are inference-time stubs: the model is loaded with compile=False,
# so the loss and LR schedule are never actually invoked.
@k.saving.register_keras_serializable(package='custom')
class FocalLoss(k.losses.Loss):
    def __init__(self, gamma=2.0, smoothing=0.1, **kw):
        super().__init__(**kw)
        self.gamma = gamma
        self.smoothing = smoothing

    def call(self, y_true, y_pred):
        return k.losses.categorical_crossentropy(
            y_true, y_pred, label_smoothing=self.smoothing)

    def get_config(self):
        return {**super().get_config(),
                'gamma': self.gamma, 'smoothing': self.smoothing}

@k.saving.register_keras_serializable(package='custom')
class WarmupCosine(k.optimizers.schedules.LearningRateSchedule):
    def __init__(self, peak=1e-3, wu_steps=1000, total=10000, min_lr=1e-6, **kw):
        super().__init__(**kw)
        self.peak = peak
        self.wu_steps = wu_steps
        self.total = total
        self.min_lr = min_lr

    def __call__(self, step):  # stub — unused at inference
        return self.peak

    def get_config(self):
        return {'peak': self.peak, 'wu_steps': self.wu_steps,
                'total': self.total, 'min_lr': self.min_lr}

k.config.enable_unsafe_deserialization()
model = k.models.load_model('best_ser_v2.keras', compile=False)

# Preprocess audio: resample to 22,050 Hz, trim silence, pad/crop to 3 s
SR, DUR = 22050, 3
audio, _ = librosa.load('speech.wav', sr=SR, duration=DUR)
audio, _ = librosa.effects.trim(audio, top_db=25)
audio = np.pad(audio, (0, max(0, SR * DUR - len(audio))))[:SR * DUR]

# Predict (implement extract_features / extract_mel_multiscale from notebook)
flat = scaler.transform([extract_features(audio)])
spec = extract_mel_multiscale(audio)[np.newaxis]
pred = model.predict([spec, flat], verbose=0)
emotion = le.classes_[np.argmax(pred)]
print(f'{emotion} {pred.max()*100:.1f}%')
```

Requirements:

```
tensorflow-cpu>=2.18
keras>=3.10
librosa>=0.10
scikit-learn>=1.3
numpy
soundfile
```
| Dataset | Speakers | Emotions | Clips |
|---|---|---|---|
| RAVDESS | 24 actors | 8 | ~1,440 |
| TESS | 2 actresses | 7 | ~2,800 |
| CREMA-D | 91 actors | 6 | ~7,442 (subset) |
| SAVEE | 4 male speakers | 7 | ~480 |
