This paper demonstrates how semi-supervised learning and human-in-the-loop crowdsourcing can address machine translation challenges common in low-resource languages. According to a Facebook study, only 53% of the world’s population has access to encyclopedic knowledge in their first language.¹ Just over half of all tweets are in English (Hong, Convertino, and Chi 2011). The implications of this gap in information by language are substantial, particularly when sources of factual information are limited. For instance, the United Nations launched the ‘Verified’ initiative to combat the adverse effects on public health of false information about COVID-19.²
Machine translation is a challenging problem in itself (Korošec 2011; Okpor 2014), and although notable advances have been made in recent years, particularly in neural machine translation (Rapp, Sharoff, and Zweigenbaum 2016; Hirschberg and Manning 2015; Lample et al. 2017), few systems (Martinus and Abbott 2019; Leventhal et al. 2020; Dossou and Emezue 2020) have been trained to communicate in African languages (Heine and Nurse 2000). This research focuses on Bambara, the most widely spoken language in the Mande family (Vydrin 2018), a family in Western Sub-Saharan Africa that includes 60 to 75 languages spoken by 30 to 40 million people. Bambara, the vernacular language of Mali, has approximately 16 million L1 (primary language) and L2 (secondary language) speakers. We attempt to combine the recent success of neural machine translation (NMT) with semi-supervised approaches to generalized machine translation (Cheng 2019). Used alone, NMT appears to be unsuited to under-resourced languages such as Bambara because the required quantity of labeled data (parallel digital texts) is not available. Incorporating semi-supervised learning allows real-time human-computer collaboration in disambiguating word translations.