Embedded Hallucination Detection Widgets as UI-Level Model Health Indicators in Web-Based LLM Applications

Authors

  • Sairam Jalakam Devarajulu

Keywords:

Hallucination Detection, Large Language Models, User Interface Design, Model Health Monitoring, Responsible AI, Uncertainty Quantification, Human-AI Interaction, Web Applications

Abstract

Large Language Models (LLMs) deployed in web-based applications are increasingly susceptible to generating hallucinated content: outputs that are fluent yet factually incorrect, unsupported, or fabricated. While significant research has focused on backend hallucination detection pipelines, comparatively little attention has been devoted to surfacing model reliability signals directly within the user interface (UI) layer. This paper introduces a lightweight, embeddable hallucination detection widget framework designed as a real-time, UI-level model health indicator for web-based LLM applications. The proposed system integrates a multi-signal hallucination detection pipeline, combining semantic entropy estimation, cross-referential consistency verification, and token-level uncertainty quantification, into a modular front-end widget that provides end-users with interpretable, actionable confidence indicators alongside LLM-generated responses. We evaluate the framework across five production-grade LLM backends (GPT-4o, GPT-3.5-Turbo, LLaMA-3-70B, Mistral-Large, and Claude-3-Sonnet) using a curated benchmark of 12,400 query–response pairs spanning four high-stakes domains: biomedical question answering, legal document summarization, financial report generation, and educational content synthesis. Our results demonstrate that the framework achieves a hallucination detection F1-score of 0.891 (±0.017), introduces a median latency overhead of only 145 ms per response, and significantly improves end-user trust calibration by 34.7% as measured through a controlled user study (n = 186). Furthermore, we show that the widget's visual affordances reduce user over-reliance on hallucinated content by 41.2% compared to unaugmented interfaces. This work contributes a novel paradigm for treating hallucination detection not merely as a backend audit mechanism but as a first-class UI component integral to responsible AI deployment.
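As a rough illustration of the widget concept summarized in the abstract, the following TypeScript sketch shows one way a UI-level health badge could fuse backend-supplied signals (a semantic-entropy estimate, mean token log-probability, and a cross-referential consistency score) into a single colour-coded indicator rendered beside an LLM response. The interface, function names, thresholds, and weights (`HallucinationSignals`, `combineSignals`, `renderHealthBadge`) are hypothetical; this is a minimal sketch of the general pattern, not the paper's actual widget API or scoring scheme.

```typescript
// Illustrative sketch only: names, weights, and thresholds below are assumptions,
// not the published implementation.

/** Per-response signals assumed to be supplied by a backend detection pipeline. */
interface HallucinationSignals {
  semanticEntropy: number;   // entropy over meaning-clusters of sampled answers, in nats
  meanTokenLogProb: number;  // average log-probability of generated tokens (<= 0)
  consistencyScore: number;  // 0..1 agreement with retrieved reference passages
}

/** Map raw signals to a single 0..1 "health" score (higher = more trustworthy). */
function combineSignals(s: HallucinationSignals): number {
  const entropyTerm = Math.exp(-s.semanticEntropy);    // low entropy -> near 1
  const confidenceTerm = Math.exp(s.meanTokenLogProb); // high log-prob -> near 1
  const weights = { entropy: 0.4, confidence: 0.3, consistency: 0.3 }; // illustrative
  return (
    weights.entropy * entropyTerm +
    weights.confidence * confidenceTerm +
    weights.consistency * s.consistencyScore
  );
}

/** Render a colour-coded badge next to a rendered response element. */
function renderHealthBadge(responseEl: HTMLElement, s: HallucinationSignals): void {
  const score = combineSignals(s);
  const badge = document.createElement("span");
  badge.textContent =
    score > 0.75 ? "High confidence" :
    score > 0.45 ? "Verify key facts" : "Likely unreliable";
  badge.style.cssText =
    "margin-left:8px;padding:2px 8px;border-radius:10px;color:#fff;" +
    `background:${score > 0.75 ? "#2e7d32" : score > 0.45 ? "#f9a825" : "#c62828"}`;
  badge.title =
    `health score ${score.toFixed(2)} (semantic entropy ${s.semanticEntropy.toFixed(2)} nats)`;
  responseEl.appendChild(badge);
}

// Example wiring: attach a badge to an existing response container in the page.
const el = document.getElementById("llm-response");
if (el) {
  renderHealthBadge(el, {
    semanticEntropy: 1.2,
    meanTokenLogProb: -0.35,
    consistencyScore: 0.6,
  });
}
```

Because the sketch manipulates the DOM directly and takes its signals as plain data, a widget of this shape could in principle be embedded in any web front end regardless of framework, which is consistent with the modular, embeddable design the abstract describes.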

DOI: https://doi.org/10.17762/ijisae.v13i2s.8147


References

Amershi, S., Weld, D., Vorvoreanu, M., Fourney, A., Nushi, B., Collisson, P., Suh, J., Iqbal, S., Bennett, P.N., Inkpen, K., Teevan, J., Kikin-Gil, R., & Horvitz, E. (2019). Guidelines for human-AI interaction. Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 1–13.

Bhatt, U., Xiang, A., Sharma, S., Weller, A., Taly, A., Jia, Y., Ghosh, J., Puri, R., Moura, J.M.F., & Eckersley, P. (2020). Explainable machine learning in deployment. Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 648–657.

Breck, E., Cai, S., Nielsen, E., Salib, M., & Sculley, D. (2017). The ML test score: A rubric for ML production readiness and technical debt reduction. IEEE International Conference on Big Data, 1123–1132.

Brooke, J. (1996). SUS: A quick and dirty usability scale. Usability Evaluation in Industry, 189, 4–7.

Chang, Y., Wang, X., Wang, J., Wu, Y., Yang, L., Zhu, K., Chen, H., Yi, X., Wang, C., Wang, Y., Ye, W., Zhang, Y., Chang, Y., Yu, P.S., Yang, Q., & Xie, X. (2024). A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology, 15(3), 1–45.

Chen, S., Zhang, Y., & Liu, P. (2024). Ensemble methods for hallucination detection in large language models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2847–2863.

Davis, F.D. (1989). Perceived usefulness, perceived ease of use, and user acceptance of information technology. MIS Quarterly, 13(3), 319–340.

Dhuliawala, S., Komeili, M., Xu, J., Raileanu, R., Li, X., Celikyilmaz, A., & Weston, J. (2023). Chain-of-verification reduces hallucination in large language models. arXiv preprint arXiv:2309.11495.

Doshi-Velez, F., & Kim, B. (2017). Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.

Drews, F.A., & Westenskow, D.R. (2006). The right picture is worth a thousand numbers: Data displays in anesthesia. Human Factors, 48(1), 59–71.

Dzindolet, M.T., Peterson, S.A., Pomranky, R.A., Pierce, L.G., & Beck, H.P. (2003). The role of trust in automation reliance. International Journal of Human-Computer Studies, 58(6), 697–718.

Dziri, N., Milton, S., Yu, M., Zaiane, O., & Reddy, S. (2022). On the origin of hallucinations in conversational models: Is it the dataset or the model? Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics, 5271–5285.

European Union. (2024). Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence (AI Act). Official Journal of the European Union.

Felt, A.P., Ainslie, A., Reeder, R.W., Consolvo, S., Thyagaraja, S., Bettes, A., Harris, H., & Grimes, J. (2015). Improving SSL warnings: Comprehension and adherence. Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 2893–2902.

Gao, L., Madaan, A., Zhou, S., Alon, U., Liu, P., Yang, Y., Callan, J., & Neubig, G. (2023). PAL: Program-aided language models. Proceedings of the 40th International Conference on Machine Learning, 10764–10799.

Gao, T., Zhong, R., & Chen, D. (2024). Continuous hallucination monitoring for production LLM systems. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 5632–5648.

Honovich, O., Aharoni, R., Herzig, J., Taitelbaum, H., Kuber, D., Krauth, V., Schick, T., Scialom, T., Szpektor, I., & Sznajder, B. (2022). TRUE: Re-evaluating factual consistency evaluation. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics, 3905–3920.

Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., & Liu, T. (2023). A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232.

Hullman, J., Kay, M., Kim, Y.S., & Shrestha, S. (2019). In pursuit of error: A survey of uncertainty visualization evaluation. IEEE Transactions on Visualization and Computer Graphics, 25(1), 903–913.

Izacard, G., Caron, M., Hosseini, L., Riedel, S., Bojanowski, P., Joulin, A., & Grave, E. (2022). Unsupervised dense information retrieval with contrastive learning. Transactions on Machine Learning Research.

Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., & Fung, P. (2023). Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12), 1–38.

Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., Schiefer, N., Hatfield-Dodds, Z., DasSarma, N., Tran-Johnson, E., Johnston, S., El-Showk, S., Jones, A., Elhage, N., Hume, T., Chen, A., Bai, Y., Bowman, S., Fort, S., Ganguli, D., Hernandez, D., Jacobson, J., Kernion, J., Kravec, S., Lovitt, L., Ndousse, K., Olsson, C., Ringer, S., Amodei, D., Brown, T., Clark, J., Joseph, N., Mann, B., McCandlish, S., Olah, C., & Kaplan, J. (2022). Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221.

Kalai, A.T., & Vempala, S.S. (2024). Calibrated language models must hallucinate. Proceedings of the 56th Annual ACM Symposium on Theory of Computing, 160–171.

Kasneci, E., Seßler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., Gasser, U., Groh, G., Günnemann, S., Hüllermeier, E., Kruber, S., Kuber, G., Lam, N., Nerdel, C., Prasser, F., & Kasneci, G. (2023). ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences, 103, 102274.

Kay, M., Kola, T., Hullman, J.R., & Munson, S.A. (2016). When (ish) is my bus? User-centered visualizations of uncertainty in everyday, mobile predictive systems. Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, 5092–5103.

Klaise, J., Van Looveren, A., Vacanti, G., & Coca, A. (2021). Monitoring machine learning models in production: A comprehensive survey. arXiv preprint arXiv:2105.02811.

Kuhn, L., Gal, Y., & Farquhar, S. (2023). Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. Proceedings of the 11th International Conference on Learning Representations.

Lee, J.D., & See, K.A. (2004). Trust in automation: Designing for appropriate reliance. Human Factors, 46(1), 50–80.

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33, 9459–9474.

Li, J., Cheng, X., Zhao, W.X., Nie, J.Y., & Wen, J.R. (2023). HaluEval: A large-scale hallucination evaluation benchmark for large language models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 6449–6464.

Liao, Q.V., & Vaughan, J.W. (2024). AI transparency in the age of LLMs: A human-centered research roadmap. Harvard Data Science Review, 6(1).

Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., & Zhu, C. (2023). G-Eval: NLG evaluation using GPT-4 with better human alignment. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2511–2522.

Lundberg, S.M., & Lee, S.I. (2017). A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30.

Manakul, P., Liusie, A., & Gales, M.J. (2023). SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 9004–9017.

McKenna, N., Li, T., Cheng, L., Hosseini, M.J., Johnson, M., & Steedman, M. (2023). Sources of hallucination by large language models on inference tasks. Findings of the Association for Computational Linguistics: EMNLP 2023, 2758–2774.

McKinsey & Company. (2024). The state of AI in early 2024: Gen AI adoption spikes and starts to generate value. McKinsey Global Survey.

Min, S., Krishna, K., Lyu, X., Lewis, M., Yih, W.T., Koh, P.W., Iyyer, M., Zettlemoyer, L., & Hajishirzi, H. (2023). FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 12076–12100.

Nielsen, J. (1993). Usability Engineering. Academic Press.

National Institute of Standards and Technology. (2023). Artificial Intelligence Risk Management Framework (AI RMF 1.0). NIST AI 100-1.

Parasuraman, R., & Riley, V. (1997). Humans and automation: Use, misuse, disuse, abuse. Human Factors, 39(2), 230–253.

Pirolli, P., & Card, S. (1999). Information foraging. Psychological Review, 106(4), 643.

Raptis, D., Tselios, N., Kjeldskov, J., & Skov, M.B. (2015). Does size matter? Investigating the impact of mobile phone screen size on users’ perceived usability, effectiveness and efficiency. Proceedings of the 15th International Conference on Human-Computer Interaction with Mobile Devices and Services, 127–136.

Rawte, V., Sheth, A., & Das, A. (2023). A survey of hallucination in large foundation models. arXiv preprint arXiv:2309.05922.

Ribeiro, M.T., Singh, S., & Guestrin, C. (2016). “Why should I trust you?” Explaining the predictions of any classifier. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1135–1144.

Ribeiro, M.T., Lundberg, S., Guestrin, C., & Nushi, B. (2023). Adaptive testing and debugging of NLP models. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 1068–1083.

Schuster, T., Fisch, A., & Barzilay, R. (2021). Get your vitamin C! Robust fact verification with contrastive evidence. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics, 624–643.

Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., Young, M., Crespo, J.F., & Dennison, D. (2015). Hidden technical debt in machine learning systems. Advances in Neural Information Processing Systems, 28.

U.S. Securities and Exchange Commission. (2024). Staff Statement on Artificial Intelligence and the Securities Industry.

Sunshine, J., Egelman, S., Almuhimedi, H., Atri, N., & Cranor, L.F. (2009). Crying wolf: An empirical study of SSL warning effectiveness. Proceedings of the 18th USENIX Security Symposium, 399–416.

Thirunavukarasu, A.J., Ting, D.S.J., Elangovan, K., Gutierrez, L., Tan, T.F., & Ting, D.S.W. (2023). Large language models in medicine. Nature Medicine, 29(8), 1930–1940.

Thorne, J., Vlachos, A., Christodoulopoulos, C., & Mittal, A. (2018). FEVER: A large-scale dataset for fact extraction and verification. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics, 809–819.

Varshney, N., Yao, W., Zhang, H., Chen, J., & Yu, D. (2023). A stitch in time saves nine: Detecting and mitigating hallucinations of LLMs by validating low-confidence generation. arXiv preprint arXiv:2307.03987.

Vasconcelos, H., Jörke, M., Grunde-McLaughlin, M., Gerstenberg, T., Bernstein, M.S., & Krishna, R. (2023). Explanations can reduce overreliance on AI systems during decision-making. Proceedings of the ACM on Human-Computer Interaction, 7(CSCW1), 1–38.

Vinyals, O., & Le, Q. (2015). A neural conversational model. Proceedings of the ICML Deep Learning Workshop.

Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., & Zhou, D. (2023). Self-consistency improves chain of thought reasoning in language models. Proceedings of the 11th International Conference on Learning Representations.

Weiser, B. (2023, May 27). Here’s what happens when your lawyer uses ChatGPT. The New York Times.

Wickens, C.D., Hollands, J.G., Banbury, S., & Parasuraman, R. (2015). Engineering Psychology and Human Performance (4th ed.). Psychology Press.

Yin, M., Wortman Vaughan, J., & Wallach, H. (2019). Understanding the effect of accuracy on trust in machine learning models. Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 1–12.

Zhang, Y., Li, Y., Cui, L., Cai, D., Liu, L., Fu, T., Huang, X., Zhao, E., Zhang, Y., Chen, Y., Wang, L., Luu, A.T., Bi, W., Shi, F., & Shi, S. (2023). Siren’s song in the AI ocean: A survey on hallucination in large language models. arXiv preprint arXiv:2309.01219.

Zhao, W.X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., Du, Y., Yang, C., Chen, Y., Chen, Z., Jiang, J., Ren, R., Li, Y., Tang, X., Liu, Z., Liu, P., Nie, J.Y., & Wen, J.R. (2023). A survey of large language models. arXiv preprint arXiv:2303.18223.

Zhou, J., Müller, H., Holzinger, A., & Chen, F. (2024). Ethical AI and accountability: Designing AI nutrition labels. AI and Ethics, 4(1), 215–227.

Published

31.08.2025

How to Cite

Sairam Jalakam Devarajulu. (2025). Embedded Hallucination Detection Widgets as UI-Level Model Health Indicators in Web-Based LLM Applications. International Journal of Intelligent Systems and Applications in Engineering, 13(2s), 226–254. Retrieved from https://www.ijisae.org/index.php/IJISAE/article/view/8147

Section

Research Article