Example Separations from the Attention-Based Separation Model

Libri2Mix

[20 audio examples: each presents the mixed input followed by the estimated audio for Speaker 1 and Speaker 2.]

Abstract

We present a highly efficient speech separation algorithm based on a novel compact attention mechanism within the Transformer architecture. By significantly reducing the model's computational complexity and size relative to existing baselines, our approach achieves state-of-the-art efficiency on the Libri2Mix and LRS2-2Mix datasets.
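The details of the compact attention mechanism are not given on this page, but the Transformer building block it refines is standard scaled dot-product attention over the encoded mixture frames. The sketch below (NumPy, illustrative only; function and variable names are our own, not the authors') shows that baseline operation: each output frame is a softmax-weighted combination of all value frames.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention over a frame sequence.

    Q, K, V: arrays of shape (T, d) -- T time frames, d channels.
    Returns attended features of shape (T, d).
    """
    d = Q.shape[-1]
    # Frame-to-frame similarity scores, scaled by sqrt(d) for stability.
    scores = Q @ K.T / np.sqrt(d)                    # (T, T)
    # Numerically stable softmax over the key axis, so each row of
    # weights is a convex combination over the T value frames.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Toy example: 4 frames of 8-dimensional encoded mixture features
# used as queries, keys, and values (self-attention).
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
out = scaled_dot_product_attention(X, X, X)
print(out.shape)  # (4, 8)
```

The quadratic (T, T) score matrix is exactly what makes vanilla attention expensive on long audio sequences, which is the cost that a compact attention variant aims to cut.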