Automating Document Narration: A Deep Learning-Based Speech Captioning System for Visually Impaired Persons

Authors

  • Pritam Langde, Shrinivas Patil, Prachi Langde

Keywords

Assistive Technology, Deep Learning, LSTM, Image Captioning, Optical Character Recognition (OCR), Text-to-Speech (TTS)

Abstract

This paper presents a deep learning-based speech captioning system designed to enhance document accessibility for visually impaired individuals. Leveraging a modular architecture that integrates a Convolutional Neural Network (ResNet50), Long Short-Term Memory (LSTM) networks, Optical Character Recognition (OCR), and Text-to-Speech (TTS) technology, the system transforms both textual and visual content from printed documents into real-time, natural-sounding audio. The proposed framework employs image preprocessing and intelligent segmentation techniques to distinguish between text and image regions, followed by content-specific processing: text is extracted via Tesseract OCR, while visual regions are described using a ResNet50-LSTM image captioning model. The summarized content is then converted into speech using the Google TTS API. A custom-built hardware assembly with a mobile-mounted camera, adjustable alignment, and portable design ensures ease of use in real-world settings.
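
The routing step described above (segmented regions dispatched to OCR or to the captioning model, then merged for TTS) can be sketched as follows. This is a minimal illustration with stubbed-out components: in the paper's system, `ocr_text` would call Tesseract (e.g. via `pytesseract`) and `caption_image` would invoke the ResNet50-LSTM model; the `Region` class and all function names here are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Region:
    kind: str      # "text" or "image", as decided by the segmentation step
    content: str   # raw pixel data in the real system; a placeholder string here

def ocr_text(region: Region) -> str:
    # Stand-in for Tesseract OCR (pytesseract.image_to_string in practice).
    return region.content

def caption_image(region: Region) -> str:
    # Stand-in for the ResNet50-LSTM captioning model.
    return f"Image showing {region.content}"

def narrate(regions: List[Region]) -> str:
    """Route each segmented region to content-specific processing and
    join the results into one narration string for the TTS stage."""
    parts = [ocr_text(r) if r.kind == "text" else caption_image(r)
             for r in regions]
    return " ".join(parts)

# A toy "page" with one text block and one figure.
page = [Region("text", "Quarterly sales rose 12%."),
        Region("image", "a bar chart of sales by region")]
print(narrate(page))
```

The resulting string would then be passed to a TTS engine (the paper uses the Google TTS API) to produce the audio narration.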

Experimental evaluation on a 100-document dataset demonstrates high accuracy rates: 94% for text recognition, 91% for image detection, and 89% for caption generation. Quality assessments reveal minimal error margins, affirming the system’s reliability and effectiveness. This study underscores the potential of AI-driven multimodal solutions in promoting inclusive information access and enabling independent navigation of printed materials by visually impaired users. Future work will focus on real-time enhancements and participatory design inputs from end users to further optimize the system’s usability and impact.


References

WHO, “Blindness and Vision Impairment,” Sustainable Development Goals Series, 2023. https://www.who.int/news-room/fact-sheets/detail/blindness-and-visual-impairment (accessed Mar. 31, 2024).

H. Akula, G. D. Reddy, and M. Kolla, “Assistive System for the Visually Impaired using Multiple Cameras and Sensors,” in Proceedings of 4th International Conference on Cybernetics, Cognition and Machine Learning Applications, ICCCMLA 2022, 2022, pp. 363–369. doi: 10.1109/ICCCMLA56841.2022.9989288.

S. C. Jakka, Y. V. Sai, A. Jesudoss, and A. Viji Amutha Mary, “Blind Assistance System using Tensor Flow,” 3rd Int. Conf. Electron. Sustain. Commun. Syst. ICESC 2022 - Proc., pp. 1505–1511, 2022, doi: 10.1109/ICESC54411.2022.9885356.

S. Zulaikha Beevi, P. Harish Kumar, S. Harish, and S. J. Lakshan, “Decision Making Algorithm for Blind Navigation Assistance using Deep Learning,” in 2022 1st International Conference on Computational Science and Technology, ICCST 2022 - Proceedings, 2022, pp. 268–272. doi: 10.1109/ICCST55948.2022.10040269.

A. Saicharan, C. Jayalakshmi, B. Sowjanya, K. Raveendra, and M. P. Aslam, “Breaking Boundaries: Advancing Accessibility with Camera Vision to Voice Object Recognition for the Visually Impaired,” in Proceedings of 2nd International Conference on Advancements in Smart, Secure and Intelligent Computing, ASSIC 2024, 2024. doi: 10.1109/ASSIC60049.2024.10507906.

A. Navhule, Akif, K. Byndoor, A. N. Mohammed Nayaz, D. Shetty, and H. Sarojadevi, “Citizen Cane - An Object Detection and Image Description System for the Visually Impaired,” in Proceedings of the 2024 3rd Edition of IEEE Delhi Section Flagship Conference, DELCON 2024, 2024. doi: 10.1109/DELCON64804.2024.10866517.

A. Kariri and K. Elleithy, “Astute Support System for Visually Impaired and Blind with Highest Intersection over Union for Object Detection and Recognition with Voice Feedback,” in 2024 IEEE Long Island Systems, Applications and Technology Conference, LISAT 2024, 2024. doi: 10.1109/LISAT63094.2024.10807856.

K. M. Safiya and R. Pandian, “Computer Vision and Voice Assisted Image Captioning Framework for Visually Impaired Individuals using Deep Learning Approach,” in 2023 4th IEEE Global Conference for Advancement in Technology, GCAT 2023, 2023. doi: 10.1109/GCAT59970.2023.10353449.

T. Ghandi, H. Pourreza, and H. Mahyar, “Deep Learning Approaches on Image Captioning: A Review,” ACM Comput. Surv., vol. 56, no. 3, Mar. 2023, doi: 10.1145/3617592.

P. Sharma, “Generating Caption From Images Using Flickr Image Dataset,” 2024 15th Int. Conf. Comput. Commun. Netw. Technol., pp. 1–7, 2024, doi: 10.1109/ICCCNT61001.2024.10724963.

S. Das and R. Sharma, “A TextGCN-Based Decoding Approach for Improving Remote Sensing Image Captioning,” IEEE Geosci. Remote Sens. Lett., 2024, doi: 10.1109/LGRS.2024.3523134.

K. Jivrajani et al., “AIoT-Based Smart Stick for Visually Impaired Person,” IEEE Trans. Instrum. Meas., vol. 72, 2023, doi: 10.1109/TIM.2022.3227988.

2019 IEEE International Conference on Signal Processing, Information, Communication & Systems (SPICSCON), Nov. 28–30, 2019, FARS Hotel and Resorts, Dhaka, Bangladesh. IEEE, 2019.

S. Ji, V. Jayaswal, K. Deeksha, S. Kumari, A. Kumar, and P. Bhagat, “A Novel Approach for Image Captioning using Deep Learning Techniques,” in 2024 1st International Conference on Advanced Computing and Emerging Technologies, ACET 2024, 2024. doi: 10.1109/ACET61898.2024.10730753.

Q. Mohd, I. Hussain, C. V. S. Satyamurty, and R. K. Godi, “Enhancing Accessibility: Image Captioning for Visually Impaired Individuals in the Realm of ECE Advancements,” 2024 4th Int. Conf. Technol. Adv. Comput. Sci., pp. 317–321, 2024, doi: 10.1109/ICTACS62700.2024.10840791.

S. Samundeswari, V. Lalitha, V. Archana, and K. Sreshta, “Optical Character Recognition for Visually Challenged People with Shopping Cart using AI,” 2022 Int. Virtual Conf. Power Eng. Comput. Control Dev. Electr. Veh. Energy Sect. Sustain. Futur. PECCON 2022, 2022, doi: 10.1109/PECCON55017.2022.9851037.

F. Sen Apu, F. I. Joyti, M. A. U. Anik, M. W. U. Zobayer, A. K. Dey, and S. Sakhawat, “Text and Voice to Braille Translator for Blind People,” 2021 Int. Conf. Autom. Control Mechatronics Ind. 4.0, ACMI 2021, pp. 8–9, 2021, doi: 10.1109/ACMI53878.2021.9528283.


Published

29.10.2024

How to Cite

Pritam Langde. (2024). Automating Document Narration: A Deep Learning-Based Speech Captioning System for Visually Impaired Persons. International Journal of Intelligent Systems and Applications in Engineering, 12(23s), 2794 –. Retrieved from https://www.ijisae.org/index.php/IJISAE/article/view/7469

Issue

Section

Research Article