
Solution Concept

Deaf people have few options for communicating with a hearing person in a healthcare setting, and the existing alternatives all have major flaws: interpreters are rarely available and can be expensive. Our solution is an easy-to-use application that we called "ESMAANI". The computer's web camera is placed in front of the deaf person while the application translates their gestures, or sign language, into text and/or speech. The application can transcribe Tunisian Sign Language as quickly as a person speaks. We use computer vision algorithms and neural networks to recognize Tunisian Sign Language in real time, then translate the recognized signs into text or speech.

Solution Development

Data Collection

After our interview with Doctor Syrine Ben Othmen, we managed to collect a considerable number of sentences relevant to our topic. Thanks to the book provided by Mr. Rachid from AVST, we also have our first data source. The next steps are to collect real-life data and train the model on it. Here are some pictures of the medical sign language terms from the book:



You can find below a link to our dictionary which includes the full list of the medical terms that we'll continue working on in the future.

📁 Resources: Dictionary


This part was rather tedious, but it was absolutely necessary: we could not advance and train the model without it. Collecting sign language data basically means shooting videos of people making different signs in real life. We decided to start with a dictionary of 4 words. Not a huge amount, but it would have to do as a start, to test our model's efficiency. Each of us shot 10 videos per sign while adding slight variations. Here is an example of the videos we shot, which simply means "Pain":
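To keep the clips organized for training, it helps to follow a fixed folder layout: one folder per sign, with numbered video files inside. The sketch below is illustrative only; the sign names and file-naming scheme are assumptions, not our actual dataset layout.

```python
import os

# Assumed 4-word starter dictionary (illustrative names, not the real signs)
SIGNS = ["pain", "doctor", "medicine", "help"]
VIDEOS_PER_SIGN = 10  # each of us shot 10 clips per sign

def clip_paths(root, signs, videos_per_sign):
    """Return the expected on-disk path of every clip in the dataset."""
    paths = []
    for sign in signs:
        for i in range(videos_per_sign):
            paths.append(os.path.join(root, sign, f"{sign}_{i:02d}.mp4"))
    return paths

paths = clip_paths("dataset", SIGNS, VIDEOS_PER_SIGN)
print(len(paths))  # 4 signs x 10 clips = 40
```

A predictable layout like this makes it trivial to iterate over the clips later when extracting key points.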




Data Preparation


For this step, we used MediaPipe, an open-source framework for building cross-platform multimedia processing pipelines. One of its components is the Holistic model, a machine learning model for extracting key points from videos. It uses deep learning to detect and track key points such as the locations of a person's hands, face, and other body parts.
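A minimal sketch of how Holistic results can be flattened into one fixed-length feature vector per frame. The landmark counts (33 pose, 468 face, 21 per hand, giving 1662 values) follow MediaPipe's published Holistic layout; treat this as an illustration of the step, not our exact preprocessing code.

```python
import numpy as np

def extract_keypoints(results):
    """Flatten MediaPipe Holistic results into a 1662-value vector.
    Parts that were not detected (e.g. a hand out of frame) become
    zero vectors, so every frame has the same shape."""
    pose = (np.array([[lm.x, lm.y, lm.z, lm.visibility]
                      for lm in results.pose_landmarks.landmark]).flatten()
            if results.pose_landmarks else np.zeros(33 * 4))
    face = (np.array([[lm.x, lm.y, lm.z]
                      for lm in results.face_landmarks.landmark]).flatten()
            if results.face_landmarks else np.zeros(468 * 3))
    lh = (np.array([[lm.x, lm.y, lm.z]
                    for lm in results.left_hand_landmarks.landmark]).flatten()
          if results.left_hand_landmarks else np.zeros(21 * 3))
    rh = (np.array([[lm.x, lm.y, lm.z]
                    for lm in results.right_hand_landmarks.landmark]).flatten()
          if results.right_hand_landmarks else np.zeros(21 * 3))
    return np.concatenate([pose, face, lh, rh])

# Typical use with MediaPipe (not run here):
# import mediapipe as mp
# with mp.solutions.holistic.Holistic() as holistic:
#     results = holistic.process(rgb_frame)  # rgb_frame from the video
#     vector = extract_keypoints(results)    # one vector per frame
```

Zero-filling missing parts keeps every frame the same length, which is what a sequence model expects as input.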

Modeling

With the key points extracted from the videos and saved as ready-to-use input for model training, we were able to train an LSTM neural network that recognizes the chosen signs with an accuracy of 91%.
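For readers curious what such a network can look like, here is a sketch in Keras of an LSTM architecture of the kind described above. The layer sizes, 30-frame sequence length, and training hyperparameters are illustrative assumptions rather than our exact configuration; the 1662 input features correspond to the flattened MediaPipe Holistic key points.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

SEQ_LEN = 30       # frames per clip (assumed)
N_FEATURES = 1662  # flattened Holistic key points per frame
N_SIGNS = 4        # size of our starter dictionary

# Stacked LSTMs read the key-point sequence; dense layers classify the sign.
model = Sequential([
    LSTM(64, return_sequences=True, input_shape=(SEQ_LEN, N_FEATURES)),
    LSTM(128, return_sequences=True),
    LSTM(64, return_sequences=False),
    Dense(64, activation="relu"),
    Dense(32, activation="relu"),
    Dense(N_SIGNS, activation="softmax"),  # one probability per sign
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```

The softmax output gives one probability per word in the dictionary, so the predicted sign is simply the most probable class.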

Implementation

We deployed our solution as a real-time sign language transcription application: a website we built that implements the model we trained on our dictionary and processes video input from the web camera in order to recognize and interpret sign language gestures. The website also includes other business aspects and features, such as a landing page, account creation, and signing in. The following video is a demonstration of our work:
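At its core, real-time transcription of this kind buffers the most recent frames' key points and runs the model on that sliding window. The sketch below shows that loop's logic under stated assumptions (a 30-frame window, a confidence threshold, and illustrative sign names); in the real application the frames would come from the webcam and the key points from MediaPipe, and `predict_fn` would wrap the trained model.

```python
from collections import deque
import numpy as np

class SignTranscriber:
    """Buffer per-frame key-point vectors and predict a sign from the
    last `seq_len` frames. `predict_fn` maps a (seq_len, n_features)
    array to a probability vector over signs."""

    def __init__(self, predict_fn, labels, seq_len=30, threshold=0.7):
        self.predict_fn = predict_fn
        self.labels = labels
        self.threshold = threshold           # only emit confident predictions
        self.buffer = deque(maxlen=seq_len)  # sliding window of frames

    def push_frame(self, keypoints):
        self.buffer.append(keypoints)
        if len(self.buffer) < self.buffer.maxlen:
            return None  # not enough frames yet
        probs = self.predict_fn(np.stack(self.buffer))
        best = int(np.argmax(probs))
        return self.labels[best] if probs[best] >= self.threshold else None

# Illustrative use with a dummy model that always favors the first sign:
dummy = lambda seq: np.array([0.9, 0.05, 0.03, 0.02])
t = SignTranscriber(dummy, ["pain", "doctor", "medicine", "help"], seq_len=3)
for frame in [np.zeros(1662)] * 3:
    word = t.push_frame(frame)
print(word)  # "pain" once the buffer holds 3 frames
```

The threshold keeps the transcript from flickering between low-confidence guesses while the signer transitions between gestures.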




