The human visual system actively seeks out salient regions in the real world, in images, and in videos to reduce the search effort for tasks such as object detection and recognition. A spatial attention representation can reveal the salient segments, blocks, or regions in such images. The Conners’ continuous performance test is a visual assessment technique that evaluates attention and the response inhibition component of executive control, and it is used to assess attention deficit hyperactivity disorder (ADHD) and other neurological disorders. Artificial intelligence and machine learning models are becoming ever more complex, moving from shallow to deep learning over time, which allows higher accuracy and greater precision. However, this complexity also tends to make these models ‘black boxes’, reducing the comprehensibility of the logic behind their predictions and outcomes. This raises an obvious question: how do we understand the predictions or recommendations of these machine learning models so that we can place trust in them? Explainable Artificial Intelligence (XAI) attempts to strike a trade-off between precision, accuracy, and interpretability to achieve this. This research work presents an XAI model for a continuous performance test, using multisensor data monitoring and multimodal machine learning for engagement analysis. The sensor data considered included body pose, electrocardiogram (ECG), eye gaze, interaction data, and facial features, accurately labelled as engagement or disengagement for cognitive attention during the execution of a Seek-X type task. We used decision trees and XAI to visualize the multisensor multimodal data, which helps us assess the model’s accuracy intuitively and explains engagement or disengagement in visual interactions.
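To illustrate why a decision tree lends itself to this kind of explanation, the following is a minimal sketch, not the model used in this work. It assumes one hypothetical scalar feature per modality (the names in `FEATURES` are illustrative only) and a handful of synthetic, hand-made samples that are not data from the study. It fits a small Gini-based tree in plain Python and returns, for any sample, the threshold tests that led to the engaged or disengaged label.

```python
from collections import Counter

# Hypothetical feature names, one per modality named in the text (assumptions).
FEATURES = ["body_pose_lean", "ecg_heart_rate", "gaze_on_target_ratio",
            "interaction_latency_s", "facial_brow_raise"]

# Synthetic, hand-made samples for illustration only (NOT study data).
DATA = [
    ([0.20, 72, 0.90, 0.45, 0.30], "engaged"),
    ([0.50, 80, 0.80, 0.60, 0.10], "engaged"),
    ([0.10, 65, 0.85, 0.50, 0.60], "engaged"),
    ([0.60, 90, 0.70, 0.80, 0.20], "engaged"),
    ([0.40, 75, 0.20, 0.70, 0.40], "disengaged"),
    ([0.70, 68, 0.30, 0.55, 0.50], "disengaged"),
    ([0.30, 85, 0.10, 0.90, 0.15], "disengaged"),
    ([0.55, 78, 0.40, 0.65, 0.35], "disengaged"),
]

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows):
    """Return (weighted_gini, feature_index, threshold), or None if no split exists."""
    best = None
    for f in range(len(FEATURES)):
        values = sorted({x[f] for x, _ in rows})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for x, y in rows if x[f] <= t]
            right = [y for x, y in rows if x[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def build(rows, depth=0, max_depth=2):
    """Grow a tree: a leaf is a label, a node is (feature, threshold, left, right)."""
    labels = [y for _, y in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if depth == max_depth or gini(labels) == 0.0:
        return majority
    split = best_split(rows)
    if split is None:
        return majority
    _, f, t = split
    left = [r for r in rows if r[0][f] <= t]
    right = [r for r in rows if r[0][f] > t]
    return (f, t, build(left, depth + 1, max_depth),
            build(right, depth + 1, max_depth))

def explain(tree, x):
    """Return (path, label): the threshold tests sample x passed, and the leaf label."""
    path = []
    while isinstance(tree, tuple):
        f, t, left, right = tree
        if x[f] <= t:
            path.append(f"{FEATURES[f]} <= {t:.2f}")
            tree = left
        else:
            path.append(f"{FEATURES[f]} > {t:.2f}")
            tree = right
    return path, tree

def predict(tree, x):
    """Predicted label for sample x."""
    return explain(tree, x)[1]

TREE = build(DATA)
path, label = explain(TREE, [0.30, 70, 0.75, 0.50, 0.20])
print(label, "because", " and ".join(path))
```

Each prediction comes with a short chain of human-readable conditions on the sensor features, which is the sense in which a decision tree can be inspected rather than treated as a black box. A real pipeline would need per-modality feature extraction, held-out evaluation, and far more data than this toy set.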