Abstract: To address the limited accuracy and large parameter scale of traditional single-branch networks in semantic segmentation of remote sensing images, a large-kernel convolution-based multi-modal feature fusion network (LMFNet) was proposed. An improved large-kernel MobileNetV3 (GMBNetV3) was adopted as the parallel backbone, multi-source features were fused through a cross-self-attention enhancement module, and a gated aggregator was utilized to integrate abstract and texture information in the decoding stage. On the public Potsdam and Vaihingen datasets, LMFNet was compared with current advanced multi-modal image segmentation models, and ablation experiments were conducted to verify the contribution of each module. The results show that on the Potsdam dataset, LMFNet improves mIoU by approximately 0.32 to 6.50 percentage points over other advanced multi-modal segmentation models, while reducing the parameter count by 29.3% to 73.6% and increasing inference speed by 1.7% to 45.9%. The proposed model effectively fuses complementary multi-modal image features and produces clearer semantic segmentation of remote sensing images, providing strong support for instance segmentation of remote sensing images in urban management.
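The gated aggregation described above can be illustrated with a minimal sketch. This is not the paper's implementation: the gate form (a sigmoid over the concatenated branch features), the weight shapes, and all variable names are assumptions chosen only to show the idea of a learned gate blending abstract (semantic) and texture (detail) features.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_aggregate(abstract_feat, texture_feat, w_gate, b_gate):
    """Illustrative gated fusion (hypothetical form, not the paper's exact design):
    a per-channel gate in (0, 1) weighs abstract features against texture features."""
    # Gate is computed from the concatenated branch features
    concat = np.concatenate([abstract_feat, texture_feat], axis=-1)
    g = sigmoid(concat @ w_gate + b_gate)  # gate values in (0, 1)
    # Convex combination: each output element lies between the two inputs
    return g * abstract_feat + (1.0 - g) * texture_feat

# Toy example: 4 spatial positions, 8 channels per branch
rng = np.random.default_rng(0)
a = rng.standard_normal((4, 8))   # abstract (deep, semantic) features
t = rng.standard_normal((4, 8))   # texture (shallow, detail) features
w = rng.standard_normal((16, 8)) * 0.1
b = np.zeros(8)
fused = gated_aggregate(a, t, w, b)
print(fused.shape)  # (4, 8)
```

Because the gate is a sigmoid, the fused output is an elementwise convex combination of the two branches, which is what lets the decoder trade off semantic abstraction against spatial detail at each position.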