Embedded Hallucination Detection Widgets as UI-Level Model Health Indicators in Web-Based LLM Applications
Keywords:
Hallucination Detection, Large Language Models, User Interface Design, Model Health Monitoring, Responsible AI, Uncertainty Quantification, Human-AI Interaction, Web Applications
Abstract
Large Language Models (LLMs) deployed in web-based applications are increasingly susceptible to generating hallucinated content: outputs that are fluent yet factually incorrect, unsupported, or fabricated. While significant research has focused on backend hallucination detection pipelines, comparatively little attention has been devoted to surfacing model reliability signals directly within the user interface (UI) layer. This paper introduces a lightweight, embeddable hallucination detection widget framework designed as a real-time, UI-level model health indicator for web-based LLM applications. The proposed system integrates a multi-signal hallucination detection pipeline, combining semantic entropy estimation, cross-referential consistency verification, and token-level uncertainty quantification, into a modular front-end widget that provides end-users with interpretable, actionable confidence indicators alongside LLM-generated responses. We evaluate the framework across five production-grade LLM backends (GPT-4o, GPT-3.5-Turbo, LLaMA-3-70B, Mistral-Large, and Claude-3-Sonnet) using a curated benchmark of 12,400 query-response pairs spanning four high-stakes domains: biomedical question answering, legal document summarization, financial report generation, and educational content synthesis. Our results demonstrate that the framework achieves a hallucination detection F1-score of 0.891 (±0.017), introduces a median latency overhead of only 145 ms per response, and significantly improves end-user trust calibration by 34.7% as measured through a controlled user study (N = 186). Furthermore, we show that the widget's visual affordances reduce user over-reliance on hallucinated content by 41.2% compared to unaugmented interfaces. This work contributes a novel paradigm for treating hallucination detection not merely as a backend audit mechanism but as a first-class UI component integral to responsible AI deployment.
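The abstract does not specify how the three detection signals are fused into a single widget-facing score. As a rough illustration only, the sketch below shows one plausible fusion scheme: semantic entropy over clustered answer samples (in the style of Kuhn et al.), mean token-level entropy, and a self-consistency agreement ratio, combined by a weighted sum. All function names, weights, and the `max_entropy` normalization cap are hypothetical assumptions, not details taken from the paper.

```python
import math
from collections import Counter

def semantic_entropy(cluster_labels):
    """Entropy over semantic clusters of sampled answers.
    cluster_labels: one cluster id per sampled answer."""
    counts = Counter(cluster_labels)
    n = len(cluster_labels)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def mean_token_entropy(token_probs):
    """Average per-token entropy over the model's token distributions.
    token_probs: list of probability lists, one per generated token."""
    ents = [-sum(p * math.log(p) for p in dist if p > 0) for dist in token_probs]
    return sum(ents) / len(ents)

def consistency_score(cluster_labels):
    """Fraction of samples agreeing with the majority cluster."""
    counts = Counter(cluster_labels)
    return counts.most_common(1)[0][1] / len(cluster_labels)

def fuse_confidence(cluster_labels, token_probs,
                    w_sem=0.4, w_tok=0.3, w_con=0.3, max_entropy=2.0):
    """Weighted fusion of the three signals into a 0..1 confidence
    value that a front-end widget could render (hypothetical weights)."""
    sem = 1.0 - min(semantic_entropy(cluster_labels) / max_entropy, 1.0)
    tok = 1.0 - min(mean_token_entropy(token_probs) / max_entropy, 1.0)
    con = consistency_score(cluster_labels)
    return w_sem * sem + w_tok * tok + w_con * con
```

For example, five samples that all land in the same semantic cluster with fully confident token distributions yield a confidence of 1.0, while samples that scatter across clusters drive the score down; the widget would map this scalar onto its visual indicator.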
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.