Review Article Open Access

Speech Corpora for Different Languages: A Systematic Review

Vladimir Igorevich Fedoseev1, Anton Aleksandrovich Konev1 and Natalia Sergeevna Repyuk1
  • 1 Department of Complex Information Security of Computer Systems, Tomsk State University of Control Systems and Radio Electronics, Tomsk, Russia

Abstract

The study of speech signals relies on carefully curated audio recordings, which are compiled and stored in specialized speech corpora. This article provides an overview of such corpora across multiple languages, with particular focus on Russian, English, and Arabic. It notes that Russian and Arabic are represented by fewer corpora than English, for which more extensive resources are available. The discussion covers typical speech corpus structures, the standard parameters used to characterize corpora, and the common metrics used to describe the speech signal itself.

Journal of Computer Science
Volume 22 No. 1, 2026, 9-24

DOI: https://doi.org/10.3844/jcssp.2026.9.24

Submitted On: 29 July 2024
Published On: 2 February 2026

How to Cite: Fedoseev, V. I., Konev, A. A. & Repyuk, N. S. (2026). Speech Corpora for Different Languages: A Systematic Review. Journal of Computer Science, 22(1), 9-24. https://doi.org/10.3844/jcssp.2026.9.24

Keywords

  • Dataset
  • Pronunciation
  • Speech Corpora
  • Transcript
  • Speech Recognition