A Multi-Modal Machine Learning Architecture for Resource-Efficient Sensing and Sustainable Edge Intelligence
DOI: https://doi.org/10.63503/j.ijaimd.2025.214

Keywords: Edge AI, Multi-Modal Learning, Sensor Fusion, Resource-Efficient AI, Sustainable Computing, Adaptive Neural Networks, Embedded Systems, Computational Efficiency

Abstract
Intelligent sensing at the network edge is not simply a matter of maximizing accuracy; it is a constant negotiation with limited resources. As embedded systems incorporate ever more sensors and become ubiquitous, demand for real-time, multi-modal interpretation is growing rapidly, rendering traditional cloud-reliant or computationally intensive machine learning models ineffective. What is needed are architectures designed from the outset for constrained compute and energy budgets under real-time operation, not monolithic models transplanted from data centers. This paper presents a computational framework for multi-modal learning at the edge that directly addresses the efficiency-accuracy trade-off. We treat the WSM-2023 benchmark streams not as isolated classification tasks but as an approximation of the noisy, degraded, and changing sensory conditions of actual deployments. Specifically, we apply a Controlled Optimization Procedure (COP) to conduct a rigorous comparison of three multi-modal fusion approaches: Early Fusion, Late Fusion, and Adaptive Gating-based Hierarchical Fusion, the last being an algorithmic paradigm that synthesizes information from heterogeneous sensors without a fixed, predetermined fusion plan. Through detailed statistical and energy analyses, we show that while each fusion strategy has its strengths, Adaptive Gating-based Hierarchical Fusion delivers superior computational efficiency and adaptive robustness, reconfiguring itself to operate under degraded and variable sensory conditions. The work's original contribution is an adaptive, context-sensitive architecture for sustainable edge intelligence, together with a practical roadmap for building perceptual systems that sense and reason rather than merely push data through a preset, rigidly programmed pipeline.
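The abstract describes Adaptive Gating-based Hierarchical Fusion only at a high level. The following is a minimal PyTorch sketch of what an input-dependent gated fusion module of this kind might look like; all module names, dimensions, and the softmax gating scheme are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch of an adaptive gating-based hierarchical fusion module.
# All names, dimensions, and the gating scheme are illustrative assumptions;
# the paper's actual architecture is not specified in the abstract.
import torch
import torch.nn as nn

class AdaptiveGatedFusion(nn.Module):
    """Fuses per-modality embeddings with input-dependent gate weights,
    so degraded or uninformative sensors can be down-weighted at runtime."""
    def __init__(self, modal_dims, fused_dim=64):
        super().__init__()
        # One lightweight encoder per sensor modality (hierarchical stage 1).
        self.encoders = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, fused_dim), nn.ReLU()) for d in modal_dims]
        )
        # Gating network scores each modality from the concatenated embeddings.
        self.gate = nn.Linear(fused_dim * len(modal_dims), len(modal_dims))

    def forward(self, inputs):  # inputs: list of per-modality tensors, each (B, d_i)
        feats = [enc(x) for enc, x in zip(self.encoders, inputs)]   # each (B, fused_dim)
        stacked = torch.stack(feats, dim=1)                         # (B, M, fused_dim)
        weights = torch.softmax(self.gate(torch.cat(feats, dim=-1)), dim=-1)  # (B, M)
        # Gated weighted sum forms the fused representation (hierarchical stage 2).
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)         # (B, fused_dim)

# Usage: fuse a 12-d IMU reading with a 128-d audio embedding.
fusion = AdaptiveGatedFusion(modal_dims=[12, 128])
fused = fusion([torch.randn(4, 12), torch.randn(4, 128)])
print(fused.shape)  # torch.Size([4, 64])
```

By contrast, an Early Fusion baseline would concatenate raw sensor inputs before a single shared encoder, and a Late Fusion baseline would combine per-modality predictions after independent classifiers; the gating variant sits between the two, adapting the contribution of each modality per input.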