CAPSTONE PROJECT
Real-time Translation of Indian Sign Language to Hindi/Kannada
Final Year Capstone Project @ PES University with 2 Publications
Overview
Indian Sign Language (ISL) is a visual-gestural language used by the deaf community in India, with its own distinct grammar and syntax. This project aims to address the communication gap faced by the deaf-mute community through an application that translates ISL into Hindi and Kannada text in real time, making it accessible to non-English speakers.
The system uses Mediapipe for video preprocessing and implements CNNs, RNN-LSTMs, and Transformer-Encoders, with LSTMs and Transformers achieving 95-97% training accuracy. An ensemble model combines their strengths, and the output text is further processed into coherent Hindi or Kannada sentences.
This affordable, portable solution tackles the lack of interpreters in India, enabling better communication for the hearing-impaired population.
The project was undertaken as a capstone initiative at PES University in collaboration with Attili Subha Vidisha, Anubuthi Kottapalli, and Anshula Aithal.
Roles and Responsibilities
Researcher
Data Analyst
AI Developer
Tools used
Python
Google Colab Notebook
VS Code
Project Context
Final Year Capstone project
@ PES University, Bengaluru
Under mentorship of Dr. Ashwini M Joshi
ISL Translation Model Results
CNN: 75% training accuracy
RNN-LSTM: 87% validation accuracy
Transformer-Encoder: 83.1% validation accuracy
Publications
“Analysis of Vision based Techniques for the Translation of Indian Sign Language”
Presented @ International Conference of Engineering and Technology (ICET 2023)
Published in International Journal on Recent and Innovation Trends in Computing and Communication
“Real-time Translation of Indian Sign Language to Hindi and Kannada”
Presented @ ICC Robins 2024 (International Conference on Cognitive Robotics and Intelligent Systems)
Available on IEEE Xplore Digital Library
Defining the Problem
Deaf and Mute Population in India:
The WHO estimates that approximately 63 million Indians have a hearing disability, yet India has only 300 certified interpreters, which severely limits communication between non-verbal individuals and the rest of the population.
Existing Solutions:
Widespread solutions available for American Sign Language unfortunately do not extend to Indian Sign Language because ASL requires only the hands, while ISL combines hands, facial expressions and body language to deliver contextual meaning.
Implementation
Data Acquisition:
A dataset of 4292 videos was obtained from zenodo.org. Each category contained multiple words, and each word was signed several times by 3 or 4 different signers. The dataset did not include the verbs needed for basic conversation, so these had to be recorded separately. We selected the 48 most commonly used verbs (Fall, Run, Sleep, Break, Smell, Think, Suggest, Walk, Want, Watch, etc.) and recorded 9 videos for each of the 48 signs, adding a total of 432 videos to the obtained dataset.
Pre-processing:
Frames were generated from the input video and then annotated using the Mediapipe framework, which assigns coordinates to 21 unique positions on each hand and 33 pose positions. Each position has 3 coordinates (x, y, z), resulting in 225 coordinates per frame.
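The per-frame feature count follows from (33 + 21 + 21) positions times 3 coordinates. A minimal sketch of how one frame's positions could be flattened into that 225-value vector is shown below. It assumes the landmark lists have already been extracted by Mediapipe and are passed in as plain (x, y, z) tuples; the function names, the pose-then-hands ordering, and the zero-filling of an undetected hand are illustrative assumptions, not the project's confirmed implementation.

```python
from typing import List, Optional, Sequence, Tuple

Landmark = Tuple[float, float, float]  # (x, y, z)

N_POSE = 33    # pose positions per frame
N_HAND = 21    # positions per hand
N_COORDS = 3   # x, y, z


def _flatten(points: Optional[Sequence[Landmark]], expected: int) -> List[float]:
    """Flatten landmarks to [x0, y0, z0, x1, ...]; zero-fill if nothing was detected."""
    if points is None:
        # Assumption: a missing hand or pose is represented by zeros.
        return [0.0] * (expected * N_COORDS)
    if len(points) != expected:
        raise ValueError(f"expected {expected} landmarks, got {len(points)}")
    return [float(c) for point in points for c in point]


def frame_features(
    pose: Optional[Sequence[Landmark]],
    left_hand: Optional[Sequence[Landmark]],
    right_hand: Optional[Sequence[Landmark]],
) -> List[float]:
    """Build one frame's feature vector: pose, then left hand, then right hand."""
    return (
        _flatten(pose, N_POSE)
        + _flatten(left_hand, N_HAND)
        + _flatten(right_hand, N_HAND)
    )


# Usage: a frame where the left hand was not detected.
pose = [(0.5, 0.5, 0.0)] * N_POSE
right = [(0.1, 0.2, 0.3)] * N_HAND
features = frame_features(pose, None, right)
print(len(features))  # 225
```

Keeping the vector length fixed even when a hand is out of view means every frame in a video has the same shape, which is what sequence models such as LSTMs expect.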
Frames captured from the dataset
Annotated Frames using Mediapipe
SLR Model Training and Evaluation:
All models were built with TensorFlow and Keras, which provide the layers used here, such as LSTM, Dense, Dropout, Conv3D, MaxPooling1D, and Flatten.
CNN: Trained for over 300 epochs with 75% training accuracy. However, it performed poorly on unseen data, with testing accuracy ranging between 16-21% and validation accuracy around 15-16%. Despite extensive training, the model struggled to generalize.
RNN - LSTMs: Achieved 95-97% training accuracy and 82-85% testing accuracy after 150-200 epochs. It also performed well on the validation dataset (83.1%).
Transformer Encoder: Trained for 30-40 epochs with 97% training accuracy and 87% testing accuracy. The model performed consistently on both training and validation datasets.
Ensemble Model: Combined LSTM and Transformer models to enhance performance. By comparing the probabilities of each model's output for each gloss, the ensemble model selects the label with higher confidence, improving predictive capabilities.
DeepTranslate Modules: Translated the final output sequence (English to Hindi and Kannada) using two modules, each containing Deep Translate layers to ensure coherence and produce meaningful translations from the derived labels. In the future, this can be extended to any language, particularly to include all Indian languages, enabling the system to cater to users across the nation.
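The ensemble step described above, where the label from the more confident model is kept, can be sketched as follows. This is an illustrative sketch only: the function name, the plain-list inputs, and the rule that ties go to the LSTM are assumptions rather than details taken from the project.

```python
from typing import Sequence, Tuple


def ensemble_predict(
    lstm_probs: Sequence[float],
    transformer_probs: Sequence[float],
    labels: Sequence[str],
) -> Tuple[str, float, str]:
    """Return (label, confidence, source model) for one gloss.

    Each model's top class is found first; the prediction whose top
    probability is higher is kept. Ties go to the LSTM (an assumption).
    """
    if not (len(lstm_probs) == len(transformer_probs) == len(labels)):
        raise ValueError("probability vectors and labels must have equal length")

    lstm_idx = max(range(len(labels)), key=lambda i: lstm_probs[i])
    trans_idx = max(range(len(labels)), key=lambda i: transformer_probs[i])

    if lstm_probs[lstm_idx] >= transformer_probs[trans_idx]:
        return labels[lstm_idx], lstm_probs[lstm_idx], "lstm"
    return labels[trans_idx], transformer_probs[trans_idx], "transformer"


# Usage: the two models disagree, and the transformer is more confident.
labels = ["run", "walk", "sleep"]
result = ensemble_predict([0.2, 0.7, 0.1], [0.05, 0.05, 0.9], labels)
print(result)  # ('sleep', 0.9, 'transformer')
```

Picking the maximum-confidence prediction is only one way to combine models; averaging the two probability vectors is a common alternative when the models are similarly calibrated.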
Usability & Accuracy Considerations
Throughout the project, several measures were taken to prioritize user experience, accessibility, and ethics.
Removal of Z-coordinate: Since 2D image frames were used, the Z-coordinate was excluded to avoid inaccuracies and reduce unnecessary computational load.
Bias Mitigation: Mediapipe was chosen for its neutrality, as it does not introduce skin tone bias, addressing challenges in traditional computer vision approaches, especially for diverse Indian skin tones.
Multilingual Output: The system offers multilingual support (Hindi and Kannada) to cater to non-English speaking users. English serves as an intermediate language for translation, with potential to extend to all local Indian languages.
Minimal Latency: Seamless conversation flow was prioritized by ensuring minimal latency, achieved through stream processing using Flask, to enhance the user experience with real-time interaction.
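As an illustration of the Z-coordinate removal, the sketch below drops every third value from a flattened landmark vector. It assumes the vector is laid out as x, y, z per landmark, which is an assumption about the storage order rather than something stated in the project; the function name is also hypothetical.

```python
from typing import List, Sequence


def drop_z(features: Sequence[float], coords_per_landmark: int = 3) -> List[float]:
    """Remove the last coordinate (z) of every landmark from a flat vector.

    Assumes the layout [x0, y0, z0, x1, y1, z1, ...].
    """
    if len(features) % coords_per_landmark != 0:
        raise ValueError("feature length is not a multiple of coords_per_landmark")
    z_position = coords_per_landmark - 1
    return [v for i, v in enumerate(features) if i % coords_per_landmark != z_position]


# Usage: two landmarks, then a full 225-value frame.
print(drop_z([1, 2, 3, 4, 5, 6]))   # [1, 2, 4, 5]
print(len(drop_z([0.0] * 225)))     # 150
```

With 75 landmarks per frame, removing z reduces each frame from 225 to 150 values, which shrinks the input to every downstream model by a third.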
Application Design
The design aimed to provide a seamless user experience by first displaying the camera input for real-time sign language recognition. Alongside the camera view, the system shows the currently translating sentence, ensuring users can follow the translation in real time. To enhance usability, the design includes an option to easily scroll and view previous sentences, giving users control over the conversation flow. Additionally, recognizing the load-heavy nature of the recognition process, the system offers an option to pause the video input streaming, allowing users to control when they want input to be processed and when they prefer a break from the system’s recognition tasks. This feature ensures a more personalized and manageable interaction. It’s important to note that only a single screen was developed, as the project focused on the implementation of the core concept rather than a full-scale design.
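The screen behaviour described above (a sentence currently being translated, a scrollable history, and a pause control) can be modelled with a small state object. The sketch below is a hypothetical illustration; the class and method names are assumptions and do not come from the project's code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TranslationSession:
    """Holds the state shown on the single application screen."""

    paused: bool = False
    current: List[str] = field(default_factory=list)   # sentence in progress
    history: List[str] = field(default_factory=list)   # previous sentences

    def add_word(self, word: str) -> bool:
        """Append a recognised word unless input is paused; return True if accepted."""
        if self.paused:
            return False
        self.current.append(word)
        return True

    def end_sentence(self) -> None:
        """Move the sentence in progress into the scrollable history."""
        if self.current:
            self.history.append(" ".join(self.current))
            self.current = []

    def toggle_pause(self) -> bool:
        """Flip the pause state and return the new value."""
        self.paused = not self.paused
        return self.paused


# Usage: finish one sentence, then pause so further input is ignored.
session = TranslationSession()
session.add_word("I")
session.add_word("walk")
session.end_sentence()
session.toggle_pause()
accepted = session.add_word("run")
print(session.history, accepted)  # ['I walk'] False
```

Ignoring input while paused, rather than queueing it, matches the stated goal of letting users decide when the load-heavy recognition runs.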
Challenges & Limitations
Several challenges fell outside the scope of this project, resulting in unresolved issues:
Contextual Limitations: The system processes individual sentences without tracking conversation context, leading to potential inaccuracies when coherence depends on prior sentences.
Sign Language Variations: Differences in regional and institutional sign language practices, such as varied symbols for punctuation, were not addressed, leading to potential ambiguities.
Dataset Constraints: The limited and non-diverse dataset impacted the system's ability to generalize across different users, regional variations, and linguistic contexts.
These limitations highlight areas for improvement and opportunities for future iterations of the project.
Future Works
Several areas have been identified for future improvement to enhance the system's performance and usability:
Context Awareness: Implement methods to track conversational context, ensuring coherence and continuity across multiple sentences.
Optimized Efficiency: Explore resource optimization techniques to reduce computational intensity, enabling real-time scalability.
Expanded Dataset: Build a more diverse and expansive dataset to improve generalization across varied users, regions, and linguistic contexts.
Sign Language Variations: Incorporate regional and institutional differences in sign language practices to accommodate diverse user needs.
Conference Presentations & Publications
Real-time Sign Language Translation using Computer Vision and Machine Learning
I presented my paper at ICC Robins 2024 (International Conference on Cognitive Robotics and Intelligent Systems), and it is available on the IEEE Xplore Digital Library. The paper documented our implementation, advancing research in the field by addressing existing gaps and furthering the practical application of the concept.
Analysis of Vision based Techniques for the Translation of Indian Sign Language
I presented my paper at the International Conference of Engineering and Technology (ICET 2023) and it was subsequently published in the International Journal on Recent and Innovation Trends in Computing and Communication. The paper served as a comprehensive literature survey, thoroughly documenting all advancements in Indian Sign Language (ISL) recognition, focusing on both hardware and software-based approaches for app development.