Beyond Speech-to-Text: Voice-to-Voice Generative AI Systems for Emotion-Aware, Real-Time Healthcare Contact Centers
Keywords:
Voice-to-Voice Generative AI, Speech Emotion Recognition, Healthcare Telephony, Affective Computing, Neural Vocoder, Prosody Modeling, Speech-Native AI, HIPAA Compliance, Real-Time Inference, Multilingual Voice AI
Abstract
Healthcare contact centers represent one of the most demanding environments for conversational artificial intelligence, requiring real-time responsiveness, emotional sensitivity, regulatory compliance, and operational scalability. Traditional speech-based AI systems rely on speech-to-text (STT) and text-to-speech (TTS) pipelines that introduce latency, flatten emotional nuance, and fragment conversational context. Recent advances in generative modeling have enabled a new paradigm: voice-to-voice (V2V) generative AI, where spoken input is transformed directly into spoken output without intermediate text representation. This paper explores the architectural foundations, system design principles, and operational implications of deploying V2V generative AI systems for emotion-aware, real-time healthcare contact centers. We analyze how speech-native representations preserve prosody, cadence, and affect, enabling more human-like interactions across administrative, clinical, and insurance-related telephony workflows. The paper presents a cloud-native reference architecture for V2V systems, examines latency optimization strategies critical for telephony-grade performance, discusses multilingual equity considerations, evaluates safety and compliance requirements in regulated healthcare environments, and outlines evaluation frameworks for emotional fidelity, conversational trust, and operational impact. By moving beyond text-mediated conversational AI, voice-to-voice systems represent a foundational shift toward speech-native intelligence capable of transforming healthcare contact center operations at a national scale.
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
IJISAE open access articles are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. This license permits reuse provided that users give appropriate credit, provide a link to the license, and indicate if changes were made. If they remix, transform, or build upon the material, they must distribute their contributions under the same license as the original.


