A Multimodal Deep Learning Framework for Video-Based Sentiment Analysis
DOI: https://doi.org/10.71204/rxht1w29
Keywords: Multimodal Deep Learning, Video Sentiment Analysis, Dynamic Fusion Graph, CMU-MOSEI
Abstract
Understanding human emotions and sentiment from video data is crucial for intelligent engineering systems such as surveillance platforms, human-computer interaction interfaces, and affective computing applications. To address the limitations of unimodal models, this study investigates a multimodal deep learning approach that combines textual, acoustic, and visual information to improve predictive performance. Using the CMU-MOSEI dataset of over 23,000 annotated video utterances, a Dynamic Fusion Graph Memory Network is developed that integrates multimodal features through an adaptive memory mechanism, adjusting modality weights during training. Experimental evaluation shows that the Dynamic Fusion Graph (DFG) model outperforms text-only and text-vision fusion baselines, with higher accuracy and F1-score on both the training and test sets, particularly on sentiment prediction. The results also highlight the greater complexity and generalization challenges of sentiment analysis relative to emotion recognition. The proposed method is a step toward the system-level design of multimodal sentiment analysis (MSA) tools and exposes both the opportunities and the engineering challenges of real-world deployment. Future work will refine the dynamic fusion architecture to improve robustness and efficiency, with the goal of deployable, high-performance multimodal sentiment and emotion analysis systems for practical engineering applications.
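To illustrate the idea of adaptive modality weighting with a memory state described in the abstract, the following is a minimal PyTorch sketch. The feature dimensions, the softmax gating conditioned on the memory, and the GRUCell-based memory update are illustrative assumptions for exposition, not the authors' exact Dynamic Fusion Graph Memory Network.

```python
# Minimal sketch (PyTorch) of dynamic multimodal fusion with adaptive modality
# weights and a memory state. Feature sizes, the gating formulation, and the
# GRUCell memory update are assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn


class DynamicFusionSketch(nn.Module):
    def __init__(self, d_text=768, d_audio=74, d_visual=35, d_model=128):
        super().__init__()
        # Project each modality into a shared representation space.
        self.proj = nn.ModuleDict({
            "text": nn.Linear(d_text, d_model),
            "audio": nn.Linear(d_audio, d_model),
            "visual": nn.Linear(d_visual, d_model),
        })
        # Scores one weight per modality, conditioned on the current memory.
        self.gate = nn.Linear(d_model * 2, 1)
        # Memory cell that accumulates the fused representation over time steps.
        self.memory = nn.GRUCell(d_model, d_model)
        self.head = nn.Linear(d_model, 1)  # sentiment regression head

    def forward(self, text, audio, visual):
        # Each input: (batch, seq_len, d_modality), aligned at the utterance level.
        batch, seq_len, _ = text.shape
        h = text.new_zeros(batch, self.memory.hidden_size)
        for t in range(seq_len):
            feats = {
                "text": torch.tanh(self.proj["text"](text[:, t])),
                "audio": torch.tanh(self.proj["audio"](audio[:, t])),
                "visual": torch.tanh(self.proj["visual"](visual[:, t])),
            }
            # Modality weights depend on the memory state, so they adapt as
            # training and the sequence progress.
            scores = torch.cat(
                [self.gate(torch.cat([f, h], dim=-1)) for f in feats.values()],
                dim=-1,
            )
            weights = torch.softmax(scores, dim=-1)             # (batch, 3)
            stacked = torch.stack(list(feats.values()), dim=1)  # (batch, 3, d_model)
            fused = (weights.unsqueeze(-1) * stacked).sum(dim=1)
            h = self.memory(fused, h)
        return self.head(h)  # one sentiment score per sequence
```

In this sketch the modality weights are recomputed at every step from the memory state, so a modality that is uninformative for a given utterance can be down-weighted; the actual model may realize this adaptivity differently.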
License
Copyright (c) 2025 Jinge Bai, Ainuddin Wahid Bin Abdul Wahab (Author)

This work is licensed under a Creative Commons Attribution 4.0 International License.
All articles published in this journal are licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). This license permits unrestricted use, distribution, and reproduction in any medium, provided the original author(s) and source are properly credited. Authors retain copyright of their work, and readers are free to copy, share, adapt, and build upon the material for any purpose, including commercial use, as long as appropriate attribution is given.
