
Deep Neural Networks for Enhanced Speech Transcription: A Case Study on Huntington’s Disease

By Kevin Liu, High Technology High School, Morganville, New Jersey, USA


ABSTRACT 


This paper addresses the challenges of creating a speech model for individuals with neurodegenerative diseases such as Huntington's Disease (HD), who often struggle to communicate effectively because of the disease's effects on speech. The proposed approach fine-tunes a pre-trained DeepSpeech model on HD voice data using a checkpointing process, improving speech recognition accuracy to 85%. A web application was developed to provide a modern, user-friendly interface for transcribing speech in real time. The findings highlight the potential of checkpointed speech models to improve the communication abilities of individuals with neurodegenerative diseases and to help monitor disease progression.

 

1. INTRODUCTION 


1.1 Background 

Neurodegenerative diseases are a group of disorders that affect the nervous system, leading to the gradual loss of neurons and cognitive decline. Alzheimer’s, Parkinson’s, amyotrophic lateral sclerosis (ALS), and Huntington’s Disease (HD) are all neurodegenerative diseases, together affecting millions of people worldwide. Huntington's Disease is a fatal genetic neurodegenerative disease that affects approximately one in 10,000 people worldwide. It is caused by a CAG trinucleotide repeat expansion in the huntingtin (Htt) gene, which produces an abnormally long polyglutamine expansion in the mutant Htt protein and ultimately damages neurons associated with motion, cognition, and emotion (Murray, 2000). Nearly 41,000 people in the US have HD, and it has become the most common and deadly hereditary disease and the third most prevalent neurodegenerative condition in Europe and North America (Booth, 2007). Although certain treatments may help relieve some of the physical symptoms associated with neurodegeneration, there are currently no disease-modifying therapies to slow the progression of these diseases. As such, they can severely impact an individual's quality of life, causing mobility issues, speech impediments, and other debilitating symptoms.

 

The mean age at onset of HD symptoms is 30-50 years, depending on the CAG-repeat length (40-50), much earlier than in other neurodegenerative diseases, and juvenile HD (CAG > 55, onset before age 20) accounts for roughly 16% of all cases (Roos, 2010). One of the earliest features of HD is the deterioration of the basal ganglia, a deep brain region involved not only in motor planning and control but also in cognitive functions such as language processing and production, including speech, word fluency, and sentence construction (Silveri, 2020). As the disease progresses, control of the muscles used for speech declines, leading to difficulty pronouncing words and controlling volume. Ultimately, an estimated 78-93% of affected patients will lose the ability to speak proficiently (Gordon, 2004).


Individuals with HD often face significant communication barriers, with speech impairment being one of the most prevalent symptoms of the disease. Since the inability to communicate effectively can have a profound impact on their daily life, accurate transcription of their speech is crucial in facilitating communication with caregivers, family members, and healthcare professionals (Tan, 2021). However, existing open-source speech models trained on data from healthy individuals do not accurately capture the speech patterns of those with neurodegenerative diseases, leading to inaccurate transcription. These flawed transcriptions can result in confusion, further exacerbating the already challenging communication barriers faced by individuals with HD (Grimsdvedt, 2021). Therefore, there is a critical need for speech models that accurately transcribe the speech of individuals with HD to facilitate effective communication and improve quality of life. 


1.2 Research Objectives  

Recent advances in artificial intelligence (AI) and machine learning (ML) have shown promise in speech and language processing, such as predicting cognitive decline in Alzheimer’s disease (Garcia, 2020) and diagnosing dementia from spontaneous speech using large language models such as GPT-3 (Agbavor, 2022). Applying AI and ML to help people whose speech is impaired by neurological disease, however, remains an area in need of further attention and development. The primary objective of this study is to create an open-source speech model that can transcribe HD patients' speech with high accuracy, thereby improving their ability to communicate and alleviating the impact of speech deterioration on their lives. The study proposes a novel approach: fine-tuning a pre-trained DeepSpeech model with HD patient data to develop an accurate speech recognition system for individuals with HD. This will improve communication for individuals with HD and other neurodegenerative diseases, ultimately enhancing their quality of life and promoting social inclusion. Additionally, this research may pave the way for further advances in AI and ML applications addressing the unique challenges faced by individuals with neurological disorders.


2. LITERATURE REVIEW 


2.1 Overview of Huntington’s Disease Symptoms 

HD speech is characterized by dysarthria, a motor speech disorder caused by weakness of the muscles used to articulate words, resulting in slurred speech and stuttering (Perez, 2018). While previous research has shown that speech recognition technology can aid in the diagnosis and management of neurodegenerative diseases such as ALS (MacDonald, 2014), there has been limited research on speech recognition and transcription models for individuals with HD, who tend to have slower speaking rates, longer pauses between words, and greater variability in pitch and loudness (Gordon, 2004).

Figure 1. Subjects recited the phrase “I have been out of the U.K. for three years.” Note the higher peaks of the HD clip, indicating higher pitch and loudness, and the fewer pauses between words.


2.2 Limitations of Current Speech Models 

The development of a speech model for HD patients presents several challenges, such as the lack of data available for training. Moreover, most existing open-source speech models have limited accuracy when transcribing the speech of individuals with vocal disabilities. These models, such as CMUSphinx, are trained almost entirely on data from healthy individuals and cannot process moderately distorted or difficult-to-understand speech (Singh, 2003). Consequently, they often produce inaccuracies and inconsistencies in transcription, rendering them unsuitable for individuals with HD.


Additionally, traditional speech recognition software usually requires keyboard or mouse input, which can be challenging for individuals with HD who struggle with fine motor skills. Moreover, many models are closed-source and restricted to controlled research environments. To maximize their benefits to end-users, novel speech models should be accessible through interfaces that do not rely on traditional input devices and not limited to particular devices or research environments. 


2.3 Utilizing the DeepSpeech Engine 

Using DeepSpeech, an open-source Speech-To-Text engine trained with machine learning techniques, together with a novel approach to training can address these challenges. DeepSpeech supports a process called “checkpointing” to continue training from a pre-existing model, compensating for the lack of training data, and transfer learning techniques can be applied to further improve the model's accuracy (Hannun, 2014). This approach was inspired by Google's Project Euphonia, which created a speech recognition model for ALS using voice data from volunteer subjects to supplement data from ALS subjects (MacDonald, 2021) and achieved significant improvements in transcription accuracy for ALS speech. Additionally, DeepSpeech is open source, so models can be accessed through a public API from a variety of devices and platforms, without being limited to any particular device or system.


3. METHODOLOGY 


3.1 Data Collection and Preparation 

To perform the checkpointing process, a pre-trained open-source voice recognition model, “deepspeech-0.9.3,” was used, along with additional training on HD patient voice audio. The HD audio data were sourced from a CHDI Foundation dataset. The audio clips were converted to a single (mono) channel and a 16,000 Hz sample rate for compatibility with the DeepSpeech model. The final dataset consisted of 650 audio clips of varying lengths and quality, totaling approximately two hours of audio.
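As a rough illustration, the sketch below (assuming the librosa and soundfile libraries and hypothetical folder names) shows how each clip could be converted to the mono, 16,000 Hz, 16-bit format DeepSpeech expects; the actual preprocessing pipeline may have differed.

```python
# Sketch of the audio preprocessing described above: resample each clip to
# 16 kHz mono, 16-bit PCM, as expected by DeepSpeech. Folder names are hypothetical.
import os
import librosa        # audio loading and resampling
import soundfile as sf

RAW_DIR = "raw_clips"        # hypothetical input folder
OUT_DIR = "processed_clips"  # hypothetical output folder
os.makedirs(OUT_DIR, exist_ok=True)

for name in os.listdir(RAW_DIR):
    if not name.lower().endswith(".wav"):
        continue
    # librosa loads audio as float32 in [-1, 1]; mono=True mixes channels,
    # sr=16000 resamples to the target rate
    audio, _ = librosa.load(os.path.join(RAW_DIR, name), sr=16000, mono=True)
    # Write back as 16-bit PCM WAV at 16 kHz
    sf.write(os.path.join(OUT_DIR, name), audio, 16000, subtype="PCM_16")
```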


To train the model, the data was split into training, validation, and testing sets containing 70%, 15%, and 15% of the data, respectively. Each set consisted of individual audio clips and their associated transcriptions, along with a CSV (comma-separated values) file listing the filenames, file sizes, and transcripts of the clips.
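A minimal sketch of this split and of the CSV manifests follows, assuming a hypothetical transcripts dictionary mapping each processed clip to its text; the column names (wav_filename, wav_filesize, transcript) follow DeepSpeech's standard training CSV format.

```python
# Sketch of the 70/15/15 split and the CSV manifests described above.
# `transcripts` is a hypothetical mapping from clip path to transcript text.
import csv
import os
import random

transcripts = {
    "processed_clips/clip_0001.wav": "i have been out of the uk for three years",
    # ... one entry per processed clip ...
}

random.seed(42)
clips = list(transcripts.keys())
random.shuffle(clips)

n = len(clips)
splits = {
    "train": clips[: int(0.70 * n)],
    "dev":   clips[int(0.70 * n): int(0.85 * n)],
    "test":  clips[int(0.85 * n):],
}

for split, files in splits.items():
    with open(f"{split}.csv", "w", newline="") as f:
        writer = csv.writer(f)
        # DeepSpeech expects these three columns in its training CSVs
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for path in files:
            writer.writerow([path, os.path.getsize(path), transcripts[path]])
```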


3.2 Overview of Technology 

The model was implemented in an Ubuntu 20.04 environment, with TensorFlow used to build the neural network and DeepSpeech layers. Ubuntu 20.04 was installed on an OMEN 15t laptop as a dual-boot system alongside Windows 11, and an NVIDIA GeForce GTX 1660 Ti GPU provided GPU acceleration for TensorFlow. To train the model, a Python virtual environment was used to create an isolated setup with its own interpreter and dependencies, keeping TensorFlow separate from other Python installations.


3.3 Overview of Model Architecture 

The model architecture consists of two main parts: convolutional neural networks (CNNs) and recurrent neural networks (RNNs). The CNNs take audio as input and convert it into probabilities over the characters of the English alphabet, while the RNNs analyze these probabilities to predict which words are likely to follow one another, turning them into a series of words that form a coherent sentence.


In a CNN, the first layer, the temporal convolution layer, is responsible for extracting features from the audio signal. It convolves a series of filters over the audio to generate a set of output feature maps that highlight different aspects of the signal, such as frequency components and temporal patterns. These feature maps are then passed through rectified linear unit (ReLU) activation layers, which allow the model to suppress unimportant audio features such as silence and background noise.




Figure 2. The convolutional neural network (CNN) is used as a feature extractor to extract acoustic features from raw audio signals.  
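For illustration only, here is a small Keras sketch of the temporal convolution and ReLU stages described above; it is not the exact DeepSpeech layer stack, and the filter counts, kernel sizes, and 26 input coefficients per frame are assumptions.

```python
# Illustrative Keras sketch (not the exact DeepSpeech layers) of the
# temporal-convolution feature extractor described above. Filter counts,
# kernel sizes, and the 26 feature coefficients per frame are assumptions.
import tensorflow as tf

feature_extractor = tf.keras.Sequential([
    # Input: a sequence of audio feature frames (time steps x 26 coefficients)
    tf.keras.Input(shape=(None, 26)),
    # Temporal convolution slides filters over time to produce feature maps
    tf.keras.layers.Conv1D(filters=256, kernel_size=11, strides=2, padding="same"),
    # ReLU keeps only positive activations, suppressing uninformative features
    tf.keras.layers.ReLU(),
    tf.keras.layers.Conv1D(filters=256, kernel_size=11, padding="same"),
    tf.keras.layers.ReLU(),
])
feature_extractor.summary()
```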


The RNN layers in the model are implemented using Long Short-Term Memory (LSTM) cells, which are capable of capturing long-term dependencies and relationships between input sequences. The LSTM cells are connected recurrently, with each cell taking as input the previous cell's output and the current feature map from the CNN layers. The output of the LSTM layer is a matrix of probabilities over characters and words, which is passed through a softmax layer to produce the final transcription.


Figure 3. The LSTM helps to capture long-term dependencies in the speech using various gates, allowing the model to “remember” earlier parts of the speech. 
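Similarly, a small Keras sketch of the recurrent stage is given below: LSTM cells over the CNN feature maps followed by a softmax over the character set. The 29-character inventory and layer sizes are assumptions rather than the exact DeepSpeech configuration.

```python
# Illustrative Keras sketch of the recurrent layers described above: LSTM cells
# over the CNN feature maps, followed by a softmax over the character set.
# The 29-character alphabet and layer sizes are assumptions.
import tensorflow as tf

NUM_CHARS = 29  # assumed character inventory (a-z, space, apostrophe, blank)

recurrent_head = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 256)),            # feature maps from the CNN stage
    # return_sequences=True emits one output per time step
    tf.keras.layers.LSTM(512, return_sequences=True),
    # Softmax turns each time step's scores into character probabilities
    tf.keras.layers.Dense(NUM_CHARS, activation="softmax"),
])
recurrent_head.summary()
```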


3.4 Model Training and Validation 

To train and validate the speech model, the pre-trained DeepSpeech model was used as a starting point. This model had been pre-trained on a large dataset of voice data from healthy individuals, giving it high transcription accuracy for healthy speech. It was then fine-tuned on the HD voice data using the checkpointing process. Checkpointing involves periodically saving the model’s parameters so that training can resume from where it left off. By using checkpointing, the model could be trained on the HD voice data without losing the progress made during pre-training.
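Conceptually, the checkpointing workflow looks like the following TensorFlow sketch; the study used DeepSpeech's own training scripts, so the stand-in model and checkpoint directory here are only illustrative.

```python
# Conceptual sketch of the checkpointing workflow described above (the study
# used DeepSpeech's own training scripts; this only illustrates the idea).
import tensorflow as tf

# Stand-in for the speech model (the real work used the DeepSpeech network)
model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 26)),
    tf.keras.layers.LSTM(512, return_sequences=True),
    tf.keras.layers.Dense(29, activation="softmax"),
])

ckpt = tf.train.Checkpoint(model=model)
manager = tf.train.CheckpointManager(ckpt, directory="checkpoints", max_to_keep=3)

# Resume from previously saved (pre-trained) weights if a checkpoint exists
if manager.latest_checkpoint:
    ckpt.restore(manager.latest_checkpoint)

# ... fine-tune on the HD clips here ...

manager.save()  # periodically save so training can resume without losing progress
```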


Stochastic gradient descent (SGD) optimization was used to minimize loss during neural network training, with a learning rate of 0.0001 to prevent overshooting and ensure smooth convergence. Mini-batch gradient descent was implemented with a batch size of 128, improving computational efficiency by processing 128 clips before weight updates. To prevent overfitting, a dropout rate of 0.4 randomly deactivated neurons during training, forcing the remaining ones to learn robust features that generalize better to unseen data. 
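Below is a hedged sketch of how these hyperparameters (SGD with a 0.0001 learning rate, a 0.4 dropout rate, and a batch size of 128) would be wired into a generic Keras model; DeepSpeech itself sets them through its training flags, and the model, loss, and dataset objects here are placeholders.

```python
# Sketch of the stated hyperparameters applied to a placeholder Keras model.
# In practice DeepSpeech configures these via its training flags.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 26)),
    tf.keras.layers.LSTM(512, return_sequences=True),
    tf.keras.layers.Dropout(0.4),                    # randomly drop 40% of activations
    tf.keras.layers.Dense(29, activation="softmax"),
])

model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.0001),  # small LR avoids overshooting
    loss="categorical_crossentropy",                           # placeholder loss
)

# train_ds / val_ds: placeholder tf.data pipelines of (features, labels) pairs
# model.fit(train_ds.batch(128), validation_data=val_ds.batch(128), epochs=7)
```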


The model was trained for 7 epochs, each a complete pass through the training dataset taking approximately 6 hours on an NVIDIA GeForce GTX 1660 Ti GPU with 1,536 CUDA cores. NVIDIA CUDA, a platform for large-scale parallel computation, was used to distribute training computations across the GPU's cores, enabling the model to be trained much faster than if the computations were performed sequentially. Although training on the 16-core CPU was possible, it was impractical given the size and complexity of the deep neural network and the scale of the training data. Hence, CUDA-based parallel training was instrumental in training the model efficiently.
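As a quick sanity check before training, TensorFlow can confirm that the CUDA-enabled GPU is visible, so computation runs on its cores rather than falling back to the CPU; this check is not part of the training code itself.

```python
# Quick check that TensorFlow sees the CUDA-enabled GPU, so training runs on
# its cores instead of falling back to the (much slower) CPU.
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if gpus:
    print(f"Training will use GPU(s): {[g.name for g in gpus]}")
else:
    print("No GPU detected; training would fall back to the CPU.")
```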


3.5 Web Application Development 

A web application, HD-Transcribe, was developed to provide HD patients with a user-friendly interface to the trained speech model. The application is free and can be accessed from any iPhone or Android phone or tablet, as well as any Windows, macOS, or Linux desktop. It offers microphone recording and audio upload functionality so users can generate the transcriptions they need, and it provides several additional quality-of-life features to improve the user experience. The application is hosted on the Vercel cloud platform.


The web application is mainly built with TypeScript, a superset of JavaScript that adds static typing for data expected from APIs, making the code more robust and easier to maintain. It uses the Next.js framework, a React-based web development framework that offers server-side rendering and automatic code splitting for faster load times and a better user experience. The main frontend library is React.js, which offers fast performance, functional components, and state hooks for managing transcription and recording state. The website is styled with Tailwind, a utility-first CSS framework.


3.6 Enhancing User Experience with Quality-of-Life Features 

To generate transcriptions from the speech model, the DeepSpeech JavaScript API is used, allowing the model to be accessed at runtime. Transcriptions are stored in global state and displayed on the website after the user’s audio is processed. For additional interactive features, the OpenAI API is used to integrate GPT-3.5, the model behind the ChatGPT chatbot, into the web application, allowing users to ask questions about their condition, treatment, or other topics related to HD. Additionally, the React-TTS module provides a text-to-speech feature, allowing users to hear the transcriptions of their speech spoken aloud, thereby giving them a “voice” to communicate with others.
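The web application calls the DeepSpeech JavaScript API, but the DeepSpeech Python bindings follow the same load-model, feed-audio, get-transcript pattern; the sketch below, with placeholder model and audio file names, illustrates that flow.

```python
# For illustration: the same transcription flow using the DeepSpeech Python
# bindings (the web app calls the equivalent JavaScript API). File paths are
# placeholders.
import wave
import numpy as np
from deepspeech import Model

ds = Model("hd_finetuned_model.pbmm")            # placeholder: exported fine-tuned model
# ds.enableExternalScorer("kenlm.scorer")        # optional external language-model scorer

with wave.open("patient_clip.wav", "rb") as w:   # expects 16 kHz, mono, 16-bit PCM
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

print(ds.stt(audio))                             # prints the transcription
```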


4. RESULTS 


4.1 Word Error Rate Analysis 

Figure 4. The table shows the evolution of the training loss, validation loss, and word error rate (WER) on the training, validation, and test sets over the course of 7 training epochs. 


The results relate the number of epochs, i.e., the number of complete passes the model made over the training dataset, to the accuracy of the model. One key accuracy statistic is the word error rate (WER), the proportion of words that were transcribed incorrectly; a lower WER means higher accuracy. At the start of training, the WER on the training set was 0.97, indicating that the model was correctly transcribing only about 3% of the words it encountered. By the end of training, the WER on the test set had improved to around 0.15, indicating that the model was correctly transcribing around 85% of the words it encountered.
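For reference, WER can be computed as the word-level edit distance between a reference transcript and the model's hypothesis, divided by the number of reference words; the short function below is a generic implementation of that formula, not DeepSpeech's own evaluation code.

```python
# Minimal word error rate (WER) calculation: word-level edit distance between
# a reference transcript and the model's hypothesis, divided by the number of
# reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("i have been out of the uk for three years",
          "i have been out the uk for tree years"))  # 0.2 (2 errors / 10 words)
```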


Figure 5. WER vs. Epoch Graph over the course of 7 training epochs/20,000 batches. Each batch refers to a subset of data that was iterated over for training. 


4.2 Loss Analysis 

Training and validation loss, or the difference between the predicted output of the model and the actual target output, decreases with each epoch, indicating that the model is generating more and more accurate transcriptions. Each decrease in the WER is accompanied by a decrease in the training loss and the validation loss, which indicates that the model is becoming better at generalizing to unseen data. The training loss, which measures how well the model is fitting the training data, decreases steadily from 2.032 at the start of training to 0.413 at the end of training. The validation loss, which measures how well the model is performing on a separate validation set, also decreases from 1.289 to 0.273. The fact that the validation loss tracks the training loss closely indicates that the model is not overfitting to the training data.  


Figure 6. Loss vs Epoch Graph over the course of 7 training epochs/20,000 batches. 


Additionally, the WER decreases at a much faster rate than the loss, suggesting that the model is improving its ability to generalize to new input data rather than simply memorizing the training data. This is because WER measures the model's ability to produce the correct output at the word level, whereas the loss measures the difference between the predicted and target output at a more granular level (e.g., character level), so a faster decrease in WER indicates that the model is improving at capturing the semantic and syntactic information of the input.


5. DISCUSSION AND CONCLUSIONS 


5.1 Discussion 

The results of the study show that it is feasible to create a highly effective, accurate, and customized speech model for HD patients by applying an innovative modification to the training of the DeepSpeech model. By fine-tuning a pre-trained model on HD patient voice data, the model's accuracy in transcribing HD patients' speech was significantly improved: it achieved a word error rate of less than 15% on test cases featuring HD patients' audio clips, notably lower than the error rates of other open-source speech models.


The study highlights the challenges of developing a speech model for HD patients while also demonstrating the potential of voice recognition and transcription as a tool for alleviating the speech degradation experienced by individuals with not only HD but also other neurodegenerative diseases such as Alzheimer’s, Parkinson’s, and ALS. As such, the model is intended to be open source and can be accessed through its GitHub repository by other researchers who wish to improve upon it.


However, one significant limitation of our study was the small sample size of HD patient voice data used for training the model. The data only consisted of 2 hours of audio, of which only 85% could be reserved specifically for training and validating the model. While it was still possible to achieve a high level of accuracy with this data, a larger dataset could further improve the model's performance. Additionally, the study focused on transcribing HD patients' speech in isolation and did not consider the potential difficulties of transcribing speech in real-world settings with background noise or multiple speakers. 


Nevertheless, continued training of the model on a larger amount of data, together with the novel approach of fine-tuning existing models, could help overcome these limitations by adapting to the specific speech characteristics of HD and other neurodegenerative disease patients. This approach has been successful in other areas, such as natural language processing. Moreover, speech analysis using ML algorithms and deep neural networks is promising for the early detection and monitoring of neurodegenerative disease. Given its viability as a non-invasive and cost-effective tool for detecting cognitive decline in Alzheimer's disease patients, the presented approach has the potential to enable earlier detection of HD and help better manage these conditions, ultimately improving patients’ quality of life and outcomes.


5.2 Future Studies 

There are many potential avenues for future studies in using voice recognition and transcription as a tool for aiding individuals with HD. One direction is to customize the model for individual users' voices. Since the speech patterns and impediments of HD patients can vary widely, it may be beneficial to train a personalized model for each individual, which could further improve transcription accuracy for each user.


The model could also store voice data for monitoring disease progression. As the disease progresses, speech patterns can change, and tracking these changes may be valuable in understanding disease progression. By storing voice data over time and comparing transcriptions, it may be possible to monitor changes in speech patterns and identify potential disease markers. This could be particularly useful in clinical trials, where monitoring the disease condition is critical for evaluating the effectiveness of treatments. 


5.3 Conclusion 

The novel approach of fine-tuning existing models to train a speech model for HD patients yields a marked improvement in the accuracy of transcribing HD patients' speech. Combined with its accessibility from any device, this technology has the potential to improve the quality of life of individuals with HD by providing a means of communication that is both accessible and accurate. Overall, this study represents a significant step forward in the development of speech recognition technology for individuals with HD and other neurodegenerative diseases.


References


 


