Cross-modal compression (CMC) was recently proposed to compress highly redundant visual data into a compact, common, human-comprehensible domain (such as text) while preserving semantic fidelity for semantic-related applications. However, CMC achieves only a certain level of semantic fidelity at a constant rate, and its model is trained to maximize the probability of the ground-truth text rather than to optimize semantic fidelity directly. To address these problems, we propose a novel scheme named rate-distortion optimized CMC (RDO-CMC). Specifically, we model text generation as a Markov decision process and propose a rate-distortion reward, which is used in reinforcement learning to optimize text generation. In the rate-distortion reward, the distortion measures both the semantic fidelity and the naturalness of the encoded text. The rate of the text is estimated as the sum of the information content of all its tokens, since the information content of each token is a lower bound on the number of bits needed to code it. Experimentally, RDO-CMC effectively controls the rate in the CMC framework and achieves competitive performance on the MSCOCO dataset.
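
As an illustration of the rate estimate described above, the following minimal sketch sums the self-information, -log2(p), of each token in a generated text. It assumes the per-token probabilities come from the text generator's conditional distribution. The abstract does not state how rate and distortion are combined, so the Lagrangian form of the reward, the weight `lam`, and both function names are hypothetical choices made only for this example.

```python
import math
from typing import Sequence


def estimated_rate_bits(token_probs: Sequence[float]) -> float:
    """Estimate the rate of a text as the sum of per-token self-information in bits."""
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("each token probability must lie in (0, 1]")
    return sum(-math.log2(p) for p in token_probs)


def rate_distortion_reward(distortion: float,
                           token_probs: Sequence[float],
                           lam: float) -> float:
    """Hypothetical reward: the negated Lagrangian cost, distortion + lam * rate.

    The combination rule is an assumption for illustration, not the paper's stated form.
    """
    return -(distortion + lam * estimated_rate_bits(token_probs))


# Three tokens with probabilities 1/2, 1/4, 1/8 carry 1 + 2 + 3 = 6 bits in total.
probs = [0.5, 0.25, 0.125]
print(estimated_rate_bits(probs))
print(rate_distortion_reward(distortion=0.2, token_probs=probs, lam=0.1))
```

Under this assumed form, a larger `lam` penalizes long or improbable token sequences more heavily, which is one way a single weight could trade rate against distortion.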