Okh2891996/speech-emotion-cnn-pca-nmf

Cross-Corpus Speech-Emotion Recognition

MFCC ➜ PCA/NMF Compression ➜ CNN & SVM Benchmarks

Reproducible EE8104 Adaptive Signal Processing project (Winter 2025). Supervisor: Prof. Dr. Sridhar Krishnan

Speech-emotion recognition (SER) models often perform brilliantly on a single dataset yet stumble the moment recording conditions—or speakers—change.
This repo tackles that mismatch by training and evaluating on two publicly available corpora at once:

  • TESS — Toronto Emotional Speech Set (2 800 clips)
  • RAVDESS — Ryerson Audio-Visual Database of Emotional Speech and Song (1 260 clips)

Our pipeline walks through every stage, end-to-end:

  1. Feature extraction – 13 × 85 MFCC maps for each utterance.
  2. Low-rank analysis – PCA & NMF reveal shared structure and let us compress features to 26 / 52 dimensions.
  3. Classical baseline – RBF-SVMs on the compressed vectors (48 – 65 % accuracy).
  4. Deep baselines
    • A 5-block CNN on full MFCC maps (≈ 91 % test accuracy)
    • A dual-input CNN that fuses MFCC maps with 26-D PCA+NMF vectors (≈ 92 % test accuracy).
  5. Learning-curve study – How data volume affects SVM vs. CNN performance.
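
Step 1 yields maps of one fixed shape, which requires every clip to have the same length; the notebook table lists 16 kHz audio padded or trimmed to 0.5 s. A minimal sketch of that length fix is shown here. The helper name `fix_length`, trimming from the start, and zero-padding at the end are assumptions for illustration, not necessarily what the notebooks do.

```python
import numpy as np

SAMPLE_RATE = 16_000   # Hz, as listed for the MFCC notebooks
CLIP_SECONDS = 0.5     # s, as listed for the MFCC notebooks
TARGET_LEN = int(SAMPLE_RATE * CLIP_SECONDS)  # 8000 samples


def fix_length(signal, target_len=TARGET_LEN):
    """Hypothetical helper: keep the first target_len samples,
    or zero-pad at the end when the clip is shorter (assumed policy)."""
    signal = np.asarray(signal, dtype=np.float32)
    if signal.shape[0] >= target_len:
        return signal[:target_len]
    return np.pad(signal, (0, target_len - signal.shape[0]))


# Synthetic stand-ins for a short and a long mono recording.
short_clip = fix_length(np.ones(5_000))
long_clip = fix_length(np.ones(12_000))
```

Both outputs have exactly `TARGET_LEN` samples, so any MFCC routine applied afterwards produces maps with a consistent number of frames.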

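Step 2 compresses the MFCC features with PCA and NMF. As a rough illustration of the PCA half only, the sketch below flattens each 13 × 85 map and projects it onto 26 principal directions with a plain NumPy SVD. The random stand-in data, the flattening, the helper names `fit_pca` and `compress`, and taking all 26 dimensions from PCA are assumptions; the notebooks may build their 26-D and 52-D vectors differently.

```python
import numpy as np


def fit_pca(X, n_components):
    """Return (mean, components); components has shape (n_components, n_features)."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]


def compress(X, mean, components):
    """Project centered samples onto the retained principal directions."""
    return (X - mean) @ components.T


rng = np.random.default_rng(0)
maps = rng.normal(size=(200, 13, 85))   # synthetic stand-in for pooled MFCC maps
flat = maps.reshape(len(maps), -1)      # (200, 1105)
mean, comps = fit_pca(flat, n_components=26)
Z = compress(flat, mean, comps)         # (200, 26) compressed features
```

In practice the mean and components should be fitted on the training split only and then reused to compress the test split.
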
The full report is available here: Document

| # | Notebook | 1-line tagline | Details & key outputs |
|---|---|---|---|
| 01 | MFCC_TESS_NOTEBOOK.ipynb | Generate 13 × 85 MFCC maps for the 2 800 TESS recordings. | • Loads TESS wav files → 16 kHz → 0.5 s pads/trims → MFCC<br>• Saves X_tess.npy, y_tess.npy, and 80/20 split indices.<br>• Quick sanity MLP reaches ~100 % (shows dataset is easy to over-fit). |
| 02 | MFCC_RAVDESS_NOTEBOOK.ipynb | Same MFCC pipeline for the 1 260 RAVDESS speech clips. | • Produces X_ravdess.npy, y_ravdess.npy.<br>• Sanity MLP tops out at ~80 %. |
| 03 | PCA and NMF COM.ipynb | Run PCA + NMF on the pooled corpus and plot the bases. | • Concatenates both MFCC sets.<br>• Computes first 6 PCA & NMF components.<br>• Saves scatter plot (PCA) and bar chart (NMF) used in Figs 2–3.<br>• Dumps pca_com.npy, nmf_com.npy for later feature compression. |
| 04 | NMF PCA TESS.ipynb | Stand-alone PCA/NMF exploration for TESS only. | • Generates TESS-specific panels for Fig 2/3. |
| 05 | NMF PCA RAVESS.ipynb | Stand-alone PCA/NMF exploration for RAVDESS only. | • Generates RAVDESS-specific panels for Fig 2/3. |
| 06 | SVM_NFM_PCa.ipynb | Train SVM baselines on compressed (PCA + NMF) features. | • Builds 26-dim (2 × 2) and 52-dim (4 × 4) vectors.<br>• LOOCV + hold-out test: 2 × 2 → ~0.48 test acc; 4 × 4 → ~0.65 test acc<br>• Saves 4 × 4 confusion matrix for the appendix. |
| 07 | CNN_COMBINEdata_Gaussian.ipynb | Baseline CNN on full MFCC maps with Gaussian-noise augmentation. | • Doubles training data with σ = 0.01 noise.<br>• 5-block CNN, 50 epochs → ~0.91 test acc.<br>• 7 × 7 confusion matrix = Fig 1; per-class PR table. |
| 08 | CNN_PCA_NMF.ipynb | Lightweight two-input CNN that consumes 26-dim compressed features. | • Dual branch: MFCC map & 26-dim side vector.<br>• ~0.92 test acc (numbers in Table VI). |
| 09 | ML vs DL.ipynb | Compare learning curves of SVM vs. CNN as data size grows. | • Trains both models on 10 – 100 % of data.<br>• Plots Figure 4 and prints the CSV behind it. |
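
Notebook 07 doubles the training data with σ = 0.01 Gaussian noise. One way this could look is sketched below. Adding the noise to the MFCC maps (rather than to the raw waveform), the helper name `augment_with_noise`, and the all-zero stand-in data are assumptions made for illustration.

```python
import numpy as np


def augment_with_noise(X, y, sigma=0.01, seed=0):
    """Hypothetical helper: append one noisy copy of every sample,
    so the returned set is twice as large as the input."""
    rng = np.random.default_rng(seed)
    noisy = X + rng.normal(0.0, sigma, size=X.shape)
    # Labels are unchanged by the perturbation, so they are simply repeated.
    return np.concatenate([X, noisy]), np.concatenate([y, y])


X_train = np.zeros((100, 13, 85))   # synthetic stand-in for MFCC maps
y_train = np.arange(100) % 7        # 7 classes, matching the 7 × 7 confusion matrix
X_aug, y_aug = augment_with_noise(X_train, y_train)
```

Augmentation of this kind is normally applied to the training split only, so that test accuracy is still measured on clean, unseen clips.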

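Notebook 09 trains both models on 10 – 100 % of the data. A minimal sketch of such a learning-curve loop follows. The ten evenly spaced fractions, the nested subsets, the function name `learning_curve`, and the placeholder scorer are all assumptions; a real run would pass a callback that fits the SVM or CNN and returns test accuracy.

```python
import numpy as np

FRACTIONS = np.linspace(0.1, 1.0, 10)   # assumed grid: 10 %, 20 %, ..., 100 %


def learning_curve(X, y, fit_and_score, fractions=FRACTIONS, seed=0):
    """Call fit_and_score on growing subsets; return (subset_size, score) pairs."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))     # one shuffle, so the subsets are nested
    results = []
    for frac in fractions:
        n = max(1, int(round(frac * len(X))))
        idx = order[:n]
        results.append((n, fit_and_score(X[idx], y[idx])))
    return results


def placeholder_score(X_sub, y_sub):
    """Stand-in scorer (not a model): share of the subset labelled class 0."""
    return float(np.mean(y_sub == 0))


X = np.zeros((50, 13, 85))              # synthetic stand-in data
y = np.arange(50) % 7
curve = learning_curve(X, y, placeholder_score)
sizes = [n for n, _ in curve]
```

Reusing one shuffled order keeps each smaller subset inside the next larger one, which makes the curve less noisy than drawing independent samples at each size.
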
About

Reproducible EE8104 project – cross-corpus speech-emotion recognition on TESS + RAVDESS using MFCC extraction, PCA/NMF compression, and CNN / SVM baselines (91–92 % top accuracy).
