A Systematic Literature Review of Deep Learning-Based Multimodal Approaches for Detecting Abusive Language in Short Videos
Keywords: BERT, Generative, LSTM, Multimodal, Short Video

Abstract
This research aims to design and implement a comprehensive deep learning-based multimodal framework for accurately detecting abusive language in video content on social media platforms. The framework integrates visual, audio, and textual modalities to capture the context within videos; by combining insights from these modalities, it aims to improve the precision, recall, and overall reliability of abusive language detection systems. Optimizing multimodal fusion techniques is central to this research and involves testing various fusion architectures to identify the most effective configuration for real-world applications. One significant challenge addressed by this research is imbalanced datasets. To tackle it, Generative Adversarial Networks (GANs) are employed for synthetic data generation, producing realistic and diverse abusive content samples and improving the model's ability to generalize across different contexts and content types. The proposed framework incorporates state-of-the-art architectures such as Long Short-Term Memory (LSTM) networks, which model the temporal dependencies within video sequences, and Bidirectional Encoder Representations from Transformers (BERT), which encodes the textual modality. This paper focuses on multimodal representations of the visual, audio, and text modalities as deep features in the hate speech classification process. We divide the data into visual (image), audio, and text streams, and optimize the model by enhancing transformer-based fusion to maximize classification performance. We chose a generative approach because it performs better than the other models considered. Finally, we suggest that future studies combine Generative Adversarial Networks (GANs) with BERT and LSTM for more effective abusive language detection.
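The fusion pipeline described above can be illustrated as a small PyTorch model: BERT encodes the transcript, two LSTMs summarize the visual-frame and audio feature sequences, and a transformer encoder fuses one token per modality before classification. This is a minimal sketch under assumed inputs (pre-extracted CNN frame features and MFCC-style audio features); the class name, dimensions, and fusion depth are illustrative and not the authors' implementation.

```python
# Minimal sketch of a BERT + LSTM multimodal fusion classifier (illustrative only).
# Assumes frame-level visual features (e.g., CNN embeddings) and audio features
# (e.g., MFCC vectors) are already extracted; names and dimensions are hypothetical.
import torch
import torch.nn as nn
from transformers import AutoModel


class MultimodalAbuseClassifier(nn.Module):
    def __init__(self, visual_dim=2048, audio_dim=40, hidden_dim=256, num_classes=2):
        super().__init__()
        # Text branch: pretrained BERT; the [CLS] embedding summarizes the transcript.
        self.text_encoder = AutoModel.from_pretrained("bert-base-uncased")
        self.text_proj = nn.Linear(self.text_encoder.config.hidden_size, hidden_dim)
        # Visual and audio branches: LSTMs capture temporal dependencies across frames.
        self.visual_lstm = nn.LSTM(visual_dim, hidden_dim, batch_first=True)
        self.audio_lstm = nn.LSTM(audio_dim, hidden_dim, batch_first=True)
        # Fusion: a small transformer encoder attends across the three modality tokens.
        fusion_layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=4,
                                                  batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, num_layers=2)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, input_ids, attention_mask, visual_frames, audio_frames):
        # Text: [CLS] token embedding projected to the shared hidden size.
        text_out = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask)
        text_emb = self.text_proj(text_out.last_hidden_state[:, 0])
        # Visual / audio: the last LSTM hidden state summarizes each sequence.
        _, (v_h, _) = self.visual_lstm(visual_frames)
        _, (a_h, _) = self.audio_lstm(audio_frames)
        # Stack one token per modality and fuse them with self-attention.
        tokens = torch.stack([text_emb, v_h[-1], a_h[-1]], dim=1)  # (B, 3, hidden)
        fused = self.fusion(tokens).mean(dim=1)
        return self.classifier(fused)  # logits over abusive / non-abusive
```

In training, GAN-generated minority-class samples could be mixed into the batches to address class imbalance, as the abstract proposes; that augmentation step is independent of the fusion model sketched here.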