
ISSN 2063-5346

EXPLORING DEEP SPECTRAL AND TEMPORAL FEATURE REPRESENTATIONS WITH ATTENTION-BASED NEURAL NETWORK ARCHITECTURES FOR ACCENTED MALAYALAM SPEECH - A LOW-RESOURCED LANGUAGE


Rizwana Kallooravi Thandil, Mohamed Basheer K.P
» doi: 10.53555/ecb/2023.12.si5a.0388

Abstract

Constructing an Accented Automatic Speech Recognition (AASR) system for a language is a challenging endeavor due to variations in pronunciation, intonation, and rhythm. This study proposes a novel approach to AASR for Malayalam speech, employing Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), and Bidirectional Long Short-Term Memory (BiLSTM) architectures, each incorporating an attention block. To conduct the study, the authors assembled an accented speech corpus containing samples from five different Malayalam accents. The research was carried out in six distinct phases, each combining a different feature set with a different model architecture.

In the first phase, Mel Frequency Cepstral Coefficients (MFCC) were used for feature vectorization and an RNN was used to model the accented speech data, yielding a Word Error Rate (WER) of 11.98% and a Match Error Rate (MER) of 76.03%. The second phase combined MFCC and tempogram features with an RNN and an attention mechanism to construct a unified model for the accented data, achieving a WER of 7.98% and an MER of 82.31%. In the third phase, the MFCC and tempogram feature vectors were modeled with an LSTM, resulting in a WER of 8.95% and an MER of 83.64%. The fourth phase used the same feature set with an LSTM and an attention mechanism, yielding a WER of 3.8% and an MER of 87.11%. The fifth and sixth phases used a BiLSTM and a BiLSTM with attention, respectively, over the same feature set, achieving WERs of 3.5% and 3.25% and MERs of 90.12% and 92.25%. The experiments demonstrate that the BiLSTM-with-attention architecture, incorporating appropriate accent attributes, performed well even on unseen accents.
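The attention block applied on top of the recurrent layers can be pictured as a softmax-weighted pooling over the sequence of hidden states. The sketch below is illustrative only: the paper does not publish its implementation, and the function name `attention_pool`, the frame count, and the hidden size are hypothetical choices for the example.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_pool(hidden_states, w):
    """Collapse a (T, d) sequence of recurrent hidden states into one
    context vector via softmax-weighted averaging (hypothetical sketch)."""
    scores = hidden_states @ w        # (T,) alignment score per frame
    alpha = softmax(scores)           # attention weights, sum to 1
    context = alpha @ hidden_states   # (d,) weighted sum of states
    return context, alpha

rng = np.random.default_rng(0)
H = rng.normal(size=(50, 128))  # e.g. 50 frames of BiLSTM outputs (illustrative)
w = rng.normal(size=128)        # scoring vector; learned in practice, random here
context, alpha = attention_pool(H, w)
print(context.shape)  # (128,)
```

In a trained model, `w` (and usually an extra projection) would be learned jointly with the recurrent layers, so frames carrying accent-discriminative cues receive higher weights.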
Performance evaluation using WER and MER indicated an error reduction of 50% to 65% when attention mechanisms were combined with the RNN, LSTM, and BiLSTM approaches.
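WER, the headline metric in the evaluation above, is the word-level edit distance (substitutions, deletions, insertions) between the reference and hypothesis transcripts, divided by the reference word count. A minimal pure-Python sketch (the function name `word_error_rate` is ours, not the paper's):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference words,
    computed via the Levenshtein edit distance over word tokens."""
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(h) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution/match
    return dp[len(r)][len(h)] / len(r)

print(word_error_rate("the cat sat", "the cat sat"))  # 0.0
print(word_error_rate("a b c d", "a x c"))            # 0.5 (1 sub + 1 del over 4)
```

Because insertions are counted, WER can exceed 100% on short references; that is why the paper additionally reports MER, which normalizes differently.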
