Significant progress has been witnessed for the descriptor and detector of local features, but there still exist several challenging and intractable limitations, such as insufficient localization accuracy and non-discriminative description, especially in repetitive-or blank-texture regions, which haven’t be well addressed. The coarse feature representation and limited receptive field are considered as the main issues for these limitations. To address these issues, we propose a novel Soft Point-Wise Transformer for Descriptor and Detector, simultaneously mining long-range intrinsic and cross-scale dependencies of local features. Furthermore, our model leverages the distinct transformers based on the soft point-wise attention, substantially decreasing the memory and computation complexity, especially for high-resolution feature maps. In addition, multi-level decoder is constructed to guarantee the high detection accuracy and discriminative description. Extensive experiments demonstrate that our model outperforms the existing stateof-the-art methods on the image matching and visual localization benchmarks.