Very high-resolution (VHR) Earth observation systems provide an ideal data source for man-made structure detection tasks such as building footprint extraction. Manually delineating building footprints from remotely sensed VHR images, however, is laborious and time-intensive; automating the building extraction process is therefore needed to increase productivity. Recently, many researchers have focused on developing building extraction algorithms based on the encoder–decoder architecture of convolutional neural networks. However, we observe that this widely adopted architecture does not preserve the precise boundaries and integrity of the extracted buildings well. Moreover, features obtained by shallow convolutional layers contain irrelevant background noise that degrades building feature representations. This article addresses these problems by presenting a feature decoupling network (FD-Net) that exploits two types of essential building information from the input image: semantic information, which concerns building integrity, and edge information, which improves building boundaries. The proposed FD-Net improves the existing encoder–decoder framework by decoupling image features into an edge subspace and a semantic subspace; the decoupled features are then integrated by a supervision-guided fusion process that accounts for the heterogeneity between edge and semantic features. Furthermore, a lightweight and effective global context attention module is introduced to capture contextual building information and thus enhance feature representations. Comprehensive experimental results on three real-world datasets confirm the effectiveness of FD-Net in large-scale building mapping. We also applied the proposed framework to various encoder–decoder variants to verify its generalizability. Experimental results show remarkable accuracy improvements at a lower computational cost.
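To make the decoupling idea concrete, the following minimal PyTorch sketch shows one way the components described above could be arranged: an edge branch and a semantic branch that split the encoder features into two subspaces, a squeeze-and-excitation-style global context attention inside the semantic branch, a fusion of the two subspaces, and per-branch heads for edge and mask supervision. The module names, channel sizes, and attention design are illustrative assumptions for exposition, not the authors' released FD-Net implementation.

```
# Illustrative sketch only: module names, shapes, and the attention design
# are assumptions, not the authors' released FD-Net implementation.
import torch
import torch.nn as nn


class GlobalContextAttention(nn.Module):
    """Lightweight global-context attention (squeeze-and-excitation style):
    pools a global context vector and reweights channels to emphasise building cues."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # global context vector, shape (B, C, 1, 1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(self.pool(x))  # channel-wise reweighting


class FeatureDecoupling(nn.Module):
    """Decouple features into an edge subspace and a semantic subspace, then fuse them;
    each branch exposes a head so it can be supervised by edge / mask labels."""
    def __init__(self, channels):
        super().__init__()
        self.edge_branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.semantic_branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            GlobalContextAttention(channels))
        self.fuse = nn.Conv2d(2 * channels, channels, 1)  # fuse the two subspaces
        self.edge_head = nn.Conv2d(channels, 1, 1)        # edge supervision head
        self.mask_head = nn.Conv2d(channels, 1, 1)        # semantic supervision head

    def forward(self, feat):
        edge = self.edge_branch(feat)
        sem = self.semantic_branch(feat)
        fused = self.fuse(torch.cat([edge, sem], dim=1))
        return fused, self.edge_head(edge), self.mask_head(sem)


if __name__ == "__main__":
    feats = torch.randn(2, 64, 128, 128)  # dummy encoder features
    fused, edge_logits, mask_logits = FeatureDecoupling(64)(feats)
    print(fused.shape, edge_logits.shape, mask_logits.shape)
```

In this sketch, the edge and mask heads stand in for the supervision-guided fusion described in the abstract: auxiliary losses on the two branches keep the subspaces specialised before their outputs are combined.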