Multimodal Learning
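Multimodal learning fuses information from different modalities (text, images, audio, 3D data) using cross-modal encoders and attention. A typical architecture encodes each modality with its own modality-specific encoder, merges the resulting representations in a fusion module, and trains the whole system with joint objectives that align the modalities and support cross-modal retrieval.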
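To make the pieces concrete, the following is a minimal pure-Python sketch rather than a reference implementation. It assumes the modality-specific encoders have already produced token vectors, uses single-head cross-attention as one possible fusion module, and uses a symmetric contrastive (InfoNCE-style) loss as one possible alignment objective. The text above names neither choice, and all function names and the `temperature` default are illustrative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Fusion sketch: each query token (e.g. text) attends over the
    tokens of another modality (e.g. image patches)."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    fused = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        fused.append([sum(w * v[d] for w, v in zip(weights, values))
                      for d in range(dim)])
    return fused

def contrastive_loss(text_embs, image_embs, temperature=0.1):
    """Alignment sketch (assumed objective): symmetric InfoNCE where
    text_embs[k] and image_embs[k] form a matched pair."""
    t = [l2_normalize(e) for e in text_embs]
    im = [l2_normalize(e) for e in image_embs]
    n = len(t)
    logits = [[dot(a, b) / temperature for b in im] for a in t]
    # Text-to-image: each row should put its mass on the diagonal entry.
    t2i = -sum(math.log(softmax(logits[k])[k]) for k in range(n)) / n
    # Image-to-text: same check along each column.
    i2t = -sum(math.log(softmax([logits[r][k] for r in range(n)])[k])
               for k in range(n)) / n
    return (t2i + i2t) / 2

def retrieve(query_emb, candidate_embs):
    """Retrieval sketch: index of the candidate with the highest cosine similarity."""
    q = l2_normalize(query_emb)
    sims = [dot(q, l2_normalize(c)) for c in candidate_embs]
    return max(range(len(sims)), key=sims.__getitem__)

# Toy inputs standing in for encoder outputs (hypothetical values).
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
image_tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
fused_tokens = cross_attention(text_tokens, image_tokens, image_tokens)

text_embs = [[1.0, 0.0], [0.0, 1.0]]
image_embs = [[0.9, 0.1], [0.1, 0.9]]
aligned_loss = contrastive_loss(text_embs, image_embs)
shuffled_loss = contrastive_loss(text_embs, image_embs[::-1])
best_match = retrieve(text_embs[0], image_embs)
```

In this toy setup the loss is lower when matched pairs share an index than when the image embeddings are shuffled, which is the property an alignment objective is meant to reward. Once the embeddings are aligned, retrieval reduces to a nearest-neighbour search in the shared space.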