In recent years, the landscape of machine learning has been reshaped by the incorporation of multi-sourced, multimodal data, reflecting a concerted effort to attain a more holistic understanding of a given task. Specifically, the shift from single-modal to multimodal learning has emerged as a promising direction for emulating the human capacity to understand the world through a diversity of senses (e.g., vision, hearing, taste, smell, and touch). This multimodal approach not only enables models to process and interpret various modalities of data but also enriches their capacity to extract intricate patterns and produce more informed outcomes. Moreover, from both theoretical and practical perspectives, the evolution of multimodal learning can be further advanced by the integration of spatial knowledge, a distinctive and critical component of human cognition that offers machine learning models a deeper understanding of spatial context and of the relationships between entities in their environment. Therefore, this dissertation aims to develop a foundational multimodal learning framework strengthened by spatial knowledge. The significance of this framework will be demonstrated by strengthening multimodal foundation models (MFMs) for various geospatial applications (e.g., image geo-localization, urban mixed land use detection, and urban perception prediction), by establishing high-quality, large-scale geospatial multimodal datasets as benchmarks to evaluate their zero-shot performance, and by integrating the techniques of spatial-context prompt tuning and spatially explicit contrastive learning to ultimately develop geospatial artificial intelligence-empowered MFMs (GeoAI MFMs).