Beyond Speech-to-Text: Voice-to-Voice Generative AI Systems for Emotion-Aware, Real-Time Healthcare Contact Centers

Bhargavi Kalicheti

Keywords:

Voice-to-Voice Generative AI, Speech Emotion Recognition, Healthcare Telephony, Affective Computing, Neural Vocoder, Prosody Modeling, Speech-Native AI, HIPAA Compliance, Real-Time Inference, Multilingual Voice AI

Abstract

Healthcare contact centers are among the most demanding environments for conversational artificial intelligence, requiring real-time responsiveness, emotional sensitivity, regulatory compliance, and operational scalability. Traditional speech-based AI systems rely on speech-to-text (STT) and text-to-speech (TTS) pipelines that introduce latency, flatten emotional nuance, and fragment conversational context. Recent advances in generative modeling have enabled a new paradigm: voice-to-voice (V2V) generative AI, in which spoken input is transformed directly into spoken output without an intermediate text representation. This paper explores the architectural foundations, system design principles, and operational implications of deploying V2V generative AI systems in emotion-aware, real-time healthcare contact centers. We analyze how speech-native representations preserve prosody, cadence, and affect, enabling more human-like interactions across administrative, clinical, and insurance-related telephony workflows. The paper presents a cloud-native reference architecture for V2V systems, examines latency optimization strategies critical for telephony-grade performance, discusses multilingual equity considerations, evaluates safety and compliance requirements in regulated healthcare environments, and outlines evaluation frameworks for emotional fidelity, conversational trust, and operational impact. By moving beyond text-mediated conversational AI, voice-to-voice systems represent a foundational shift toward speech-native intelligence capable of transforming healthcare contact center operations at national scale.
