Protein-protein interactions (PPIs) underpin a wide range of biological processes and cellular functions in all living organisms. Detecting PPIs helps in understanding the roles of proteins and the complexes they form. Proteins are commonly represented by their amino acid sequences. Computational PPI identification typically proceeds in two steps: first, a feature vector is extracted from the protein representation; then, a model is trained on these feature vectors to predict novel interactions. With the growing availability of multimodal biomedical data and the successful adoption of deep-learning algorithms across bioinformatics, more informative feature vectors can now be obtained, improving a model's ability to predict PPIs. The present work exploits multimodal data, combining tertiary (3D) structure information with sequence-based information. A deep-learning model, ResNet50, extracts features from a 3D voxel representation of each protein. To obtain a compact feature vector from the amino acid sequence, a quasi-sequence-order (QSO) encoding is combined with a stacked autoencoder: QSO converts the symbolic representation (the amino acid sequence) of a protein into a numerical one, which the autoencoder then compresses. After features are extracted from the two modalities, they are concatenated in pairs and fed into a bidirectional GRU-based classifier to predict PPIs. The proposed approach achieves an accuracy of 0.9829, the best fold accuracy under 3-fold cross-validation on the human PPI dataset. These results indicate that the proposed approach outperforms existing computational methods, including state-of-the-art stacked autoencoder-based classifiers.