Sensor-based human activity recognition (HAR) has become an important task in various application domains. However, existing HAR practices such as convolutional networks and recurrent architectures can lose information when they treat the temporal and sensor-modality dimensions separately. In this article, we propose a shallow large receptive field (LRF) attention architecture that addresses this issue by extracting both temporal-wise and modality-wise information for sensor-based HAR scenarios. The proposed LRF architecture decomposes a large-kernel convolution into three cascaded submodules: a local depth-wise convolution across different sensor modalities, a long-range depth-wise dilated convolution along temporal sequences, and a plain convolution. We further organize these submodules in a vision transformer (ViT)-like hierarchical style. We validate the proposed LRF architecture on four commonly used HAR datasets as well as a weakly labeled dataset, all involving multimodal sensing data from smartphones or wearable sensors. Our experiments show that the proposed method outperforms several state-of-the-art (SOTA) benchmarks with a similar number of parameters, achieving accuracies of 97.35% on UCI-HAR, 98.88% on WISDM, 97.26% on USC-HAD, 96.77% on the weakly labeled dataset, and 91.15% on KU-HAR.
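To make the three-submodule decomposition concrete, the following is a minimal PyTorch sketch of such an attention block. It assumes the input is laid out as (batch, channels, time, modality); the module name LRFAttention, the kernel sizes, the dilation rate, and the multiplicative gating at the end are illustrative assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn as nn

class LRFAttention(nn.Module):
    """Sketch of a large-receptive-field attention block: a large-kernel
    convolution decomposed into three cascaded submodules (assumed design)."""

    def __init__(self, channels, local_kernel=5, dilated_kernel=7, dilation=3):
        super().__init__()
        # 1) Local depth-wise convolution across the sensor-modality axis.
        self.local_dw = nn.Conv2d(
            channels, channels, kernel_size=(1, local_kernel),
            padding=(0, local_kernel // 2), groups=channels)
        # 2) Long-range depth-wise dilated convolution along the temporal axis.
        self.dilated_dw = nn.Conv2d(
            channels, channels, kernel_size=(dilated_kernel, 1),
            padding=(dilation * (dilated_kernel // 2), 0),
            dilation=(dilation, 1), groups=channels)
        # 3) Plain 1x1 convolution mixing information across channels.
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        # x: (batch, channels, time, modality)
        attn = self.pointwise(self.dilated_dw(self.local_dw(x)))
        # Use the cascaded-convolution output as attention weights (assumption).
        return attn * x

# Usage: 64 feature channels, 128 time steps, 9 sensor channels
# (e.g., 3-axis accelerometer/gyroscope/magnetometer).
x = torch.randn(8, 64, 128, 9)
out = LRFAttention(64)(x)
print(out.shape)  # torch.Size([8, 64, 128, 9])
```

Decomposing one large kernel this way keeps the receptive field wide (the dilated temporal convolution alone spans 1 + (7-1)x3 = 19 time steps in this sketch) while using far fewer parameters than a single dense large-kernel convolution.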