With extensive applications of Unmanned Aircraft Vehicle (UAV) in the field of remote sensing, 3D reconstruction using aerial images has been a vibrant area of research. However, fast large-scale 3D reconstruction is a challenging task. For aerial image datasets, large scale means that the number and resolution of images are enormous, which brings significant computational cost to the 3D reconstruction, especially in the process of Structure from Motion (SfM). In this paper, for fast large-scale SfM, we propose a clustering-aligning framework that hierarchically merges partial structures to reconstruct the full scene. Through image clustering, an overlapping relationship between image subsets is established. With the overlapping relationship, we propose a similarity transformation estimation method based on joint camera poses of common images. Finally, we introduce the closed-loop constraint and propose a similarity transformation-based hybrid optimization method to make the merged complete scene seamless. The advantage of the proposed method is a significant efficiency improvement without a marginal loss in accuracy. Experimental results on the Qinling dataset captured over Qinling mountain covering 57 square kilometers demonstrate the efficiency and robustness of the proposed method.