Understanding the cognitive state of the brain is a long-standing research topic. Because fluctuations in brain signals and the functional connectome (FC) are related to specific human behaviors, deep learning methods have shown promising results in predicting such behaviors from biological signals. Existing methods either model the brain from a static perspective or apply spatial-temporal graph convolution to extract dynamic properties. However, static and dynamic information are complementary: the former reflects global brain activity, while the latter reflects local brain activity. We therefore propose BrainNetFormer, which incorporates both static and dynamic properties for human behavior prediction. Specifically, a spatial cross-attention module and a temporal cross-attention module are introduced for information fusion. In addition, since a subject's behavior can be decomposed into a series of sub-tasks, we introduce a sub-task regularization loss that assists training and enables the model to recognize the sub-task at each moment. Experiments on the HCP-Task dataset demonstrate the superior performance of the proposed model.
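
As a rough illustration of the two ideas named above, the fusion step and the training objective might be written as follows. The notation is an illustrative assumption and not the paper's definition: $X_{\mathrm{dyn}}$ and $X_{\mathrm{sta}}$ denote dynamic and static feature matrices, $W_Q$, $W_K$, $W_V$ are learned projections of width $d$, $\hat{y}$ is the predicted behavior label, $\hat{s}_t$ is the predicted sub-task at time step $t$ of $T$, and $\lambda$ is a hypothetical weighting coefficient. The direction of attention (dynamic features querying static ones) and the use of cross-entropy for both terms are likewise assumptions.

```latex
% Generic scaled dot-product cross attention (assumed form of the fusion step):
% queries come from the dynamic features, keys and values from the static ones.
\mathrm{CrossAttn}(X_{\mathrm{dyn}}, X_{\mathrm{sta}})
  = \mathrm{softmax}\!\left(
      \frac{(X_{\mathrm{dyn}} W_Q)\,(X_{\mathrm{sta}} W_K)^{\top}}{\sqrt{d}}
    \right) X_{\mathrm{sta}} W_V

% One plausible combined objective: behavior classification loss plus a
% per-time-step sub-task term, weighted by the hypothetical coefficient \lambda.
\mathcal{L}
  = \mathcal{L}_{\mathrm{CE}}(\hat{y}, y)
  + \lambda \, \frac{1}{T} \sum_{t=1}^{T} \mathcal{L}_{\mathrm{CE}}(\hat{s}_t, s_t)
```

Under this reading, setting $\lambda = 0$ recovers plain behavior classification, while larger values push the model to also label the sub-task active at each time step.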