AIAny - BhasaFlow Khasi-English Parallel Sample v1

Why this matters

Khasi is a low-resource language with very few publicly available aligned speech–text corpora. This sample release gives researchers a vetted, studio-quality preview (100 sentence pairs with aligned WAV audio and metadata) so teams can evaluate dataset format and baseline performance before requesting broader access.

What Sets It Apart

Gold-standard, human-created alignments: each English sentence was translated by native Khasi speakers and linked to a validated studio-quality WAV recording (16-bit PCM). This avoids much of the noise found in web-scraped speech datasets, so results from small-scale experiments reflect performance on clean data.
Multi-modality and narrow scope: the sample pairs parallel text with audio and includes speaker gender metadata, making it immediately usable for ASR, TTS prototyping, and machine translation evaluation without heavy preprocessing.
Access-controlled full corpus: the public sample is intentionally small (100 examples), enough for evaluation and for identifying potential collaborations. The extended production-grade corpus is available through a permissioned request process that preserves contributor consent and licensing constraints.

Who It's For (and trade-offs)

Great fit if you need a reliable small testbed to evaluate ASR/TTS pipelines or translation models for Khasi, verify ingestion/metadata schemas, or demonstrate feasibility to stakeholders. The dataset’s curated quality makes it useful for benchmarking and linguistic analysis.

Look elsewhere if you need large-scale training data out of the box: the sample size (100 pairs) is insufficient for training robust production ASR/TTS models without augmentation or external data. Also note that the release is under a restricted license; commercial use or redistribution requires contacting Medharvix for approval.

Where It Fits

Use this sample as a quality-controlled validation set or for early-stage experiments that measure model behavior on clean, native-speaker data. For production training, plan to combine this resource with additional in-domain or synthetic data and follow the dataset’s access/licensing workflow.
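When the sample serves as a validation set for ASR, one common way to score a model's transcript against the reference text is word error rate (WER). The dataset description does not prescribe a metric, so the following is only a minimal sketch of one option; the function name is hypothetical, and whitespace tokenization is an assumption that may need adjusting for Khasi orthography.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()  # assumption: whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

With only 100 pairs, a corpus-level WER (total edits over total reference words) is likely to be more stable than an average of per-sentence scores.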

Notes on access and provenance

Publisher: Medharvix Systems (BhasaFlow initiative).
Public sample size: 100 sentence pairs.
Audio: WAV, 16-bit PCM.
Created: 2026-04-28.
Contact for full-corpus access and licensing: [email protected].
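
The stated audio format (WAV, 16-bit PCM) can be checked at ingestion time before files enter a pipeline. Below is a minimal sketch using Python's standard `wave` module; the function name and the returned field names are hypothetical and do not reflect the publisher's own tooling or metadata schema.

```python
import wave


def check_wav_16bit_pcm(path):
    """Return (is_16bit, details) for a WAV file readable by the wave module."""
    # wave.open raises wave.Error for files it cannot parse,
    # including compressed (non-PCM) WAV variants.
    with wave.open(path, "rb") as wav:
        sample_width = wav.getsampwidth()  # bytes per sample
        frame_rate = wav.getframerate()
        details = {
            "channels": wav.getnchannels(),
            "sample_rate_hz": frame_rate,
            "sample_width_bytes": sample_width,
            "duration_s": wav.getnframes() / frame_rate,
        }
    # 16-bit audio corresponds to 2 bytes per sample.
    return sample_width == 2, details
```

Running this over all files in the sample and collecting the distinct sample rates and channel counts is a quick way to confirm the audio is uniform before building a data loader.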

This introduction aims to clarify where the sample is most useful and the concrete limitations you’ll face when moving from evaluation to production.
