Why this matters
Khasi is a low-resource language with very few publicly available aligned speech–text corpora. This sample release gives researchers a vetted, studio-quality preview (100 sentence pairs with aligned WAV audio and metadata) so teams can evaluate dataset format and baseline performance before requesting broader access.
What Sets It Apart
- Gold-standard, human-created alignments: each English sentence was translated by native Khasi speakers and linked to a validated studio-quality WAV recording (16-bit PCM). This avoids much of the noise found in web-scraped speech datasets, so small-scale experiments give a cleaner estimate of best-case model performance.
- Multi-modality and narrow scope: the sample pairs parallel text with audio and includes speaker gender metadata, making it immediately usable for ASR, TTS prototyping, and machine translation evaluation without heavy preprocessing.
- Access-controlled full corpus: the public sample is intentionally small (100 examples), enough to support evaluation and help potential collaborators assess fit. The extended production-grade corpus is available through a permissioned request process that respects contributor consent and licensing constraints.
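When verifying ingestion, it can help to confirm that each downloaded recording really is 16-bit PCM WAV, as stated above. The following is a minimal sketch using only Python's standard `wave` module; the function name `describe_wav` and the returned field names are illustrative assumptions, not part of the dataset's schema.

```python
import wave


def describe_wav(path):
    """Summarize a WAV file and flag whether it is 16-bit PCM.

    The returned keys are illustrative names, not dataset metadata fields.
    """
    # wave.open raises wave.Error if the file is not a readable PCM WAV.
    with wave.open(path, "rb") as wav:
        sample_width = wav.getsampwidth()  # bytes per sample; 2 means 16-bit
        frame_rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": sample_width,
            "sample_rate_hz": frame_rate,
            "duration_seconds": wav.getnframes() / frame_rate,
            # The wave module reports "NONE" for uncompressed PCM data.
            "is_16bit_pcm": sample_width == 2 and wav.getcomptype() == "NONE",
        }
```

Calling `describe_wav` on each audio file in the sample (file names depend on how the release is packaged) and checking `is_16bit_pcm` gives a quick sanity check before feeding the audio into an ASR or TTS pipeline.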
Who It's For (and trade-offs)
Great fit if you need a reliable small testbed to evaluate ASR/TTS pipelines or translation models for Khasi, verify ingestion/metadata schemas, or demonstrate feasibility to stakeholders. The dataset’s curated quality makes it useful for benchmarking and linguistic analysis.
Look elsewhere if you need large-scale training data out of the box: 100 pairs are not enough to train robust production ASR/TTS models without augmentation or external data. Also note that the release is under a restricted license; commercial use or redistribution requires contacting Medharvix and obtaining approval.
Where It Fits
Use this sample as a quality-controlled validation set or for early-stage experiments that measure model behavior on clean, native-speaker data. For production training, plan to combine this resource with additional in-domain or synthetic data and follow the dataset’s access/licensing workflow.
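For ASR experiments, one common way to use the sample as a validation set is to score model transcripts against the reference text with word error rate (WER). The sketch below is an assumption about how such an evaluation might look, not a metric prescribed by the dataset. It splits on whitespace, which may or may not be the right tokenization for Khasi, and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words.

    Assumes whitespace tokenization; adjust normalization to your pipeline.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

With only 100 sentences, per-utterance scores will be noisy, so it is usually more informative to report the total number of word errors divided by the total number of reference words across the whole sample.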
Notes on access and provenance
- Publisher: Medharvix Systems (BhasaFlow initiative).
- Public sample size: 100 sentence pairs.
- Audio: WAV, 16-bit PCM.
- Created: 2026-04-28.
- Contact for full-corpus access and licensing: [email protected].
This introduction aims to clarify where the sample is most useful and the concrete limitations you’ll face when moving from evaluation to production.
