follow an unknown joint Markov model and users select a channel on which to transmit data. The
objective is to find a policy that maximizes the expected long-term number of successful
transmissions. The problem is formulated as a partially observable Markov decision process
with unknown system dynamics. To overcome the challenges of unknown dynamics and
prohibitive computational cost, we apply reinforcement learning and implement a …