Sovereign Odia TTS · Kaggle Prototype Plan

What You'll Build

The 4-Weekend Plan

▸ NOTE

By the end of weekend 4, you will have: a working IndicF5 demo, a fine-tuned LoRA adapter that improves Odia output, a blind A/B comparison demo on Hugging Face Spaces, and concrete data showing whether sovereign Odia TTS is worth pursuing further. Total compute cost: ₹0 (the only cash outlay is a small payment to your evaluators).

Weekend 1 · 12 hrs

Baseline + Test Set

Run IndicF5 on 200 Odia sentences, build diagnostic test set, get 3 native speakers to listen and rate.

Weekend 2 · 16 hrs

Data Collection + Pipeline

Download 30-40 hours of OTV / Doordarshan content, run it through the VAD + Whisper + DNSMOS pipeline, and end with ~5 hours of clean training data.

Weekend 3 · 12 hrs

LoRA Fine-Tune

Train LoRA adapter on the 5 hours of clean data. Inference comparison vs baseline on test set.

Weekend 4 · 12 hrs

Public Demo + Iteration

Hugging Face Space with A/B comparison. Share with Odia community. Decide go/no-go on real funding pursuit.

Reality Check

Kaggle Constraints — Read This Before You Start

What Kaggle gives you (free)

GPU quota

30 hours/week of P100 or T4

Resets Saturday 00:00 UTC every week

VRAM

16 GB (P100) or 16 GB (T4)

P100 is faster for FP32; T4 has tensor cores that accelerate FP16. Neither GPU supports BF16 natively. Pick T4 for IndicF5.

Session length

Max 12 hours per session

Hard cutoff. Plan training to checkpoint every hour.

Idle timeout

Killed after 20 min of inactivity

Use a keep-alive cell or check progress every 15 min

Persistent storage

20 GB Datasets + Models

Upload your data once as a Kaggle Dataset, reuse across sessions

RAM

13 GB system RAM, 16 GB GPU VRAM

Tight. Be careful with audio loading — use streaming or chunks.

Disk

73 GB working directory (/kaggle/working)

Wiped between sessions. Save outputs to /kaggle/working then download or push to Kaggle Dataset.

✕ TRAP

The session-length trap: If your fine-tune takes 14 hours, you cannot complete it in one session. You MUST checkpoint to disk every hour and design training to resume from checkpoint. We'll handle this properly.
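One way to respect both the hourly checkpoint rule and the 12-hour cutoff is a small time-budget helper that the training loop polls on every step. This is a minimal sketch, not part of any Kaggle API: the class name `SessionBudget`, the half-hour safety margin, and the injectable `clock` argument are all assumptions made for illustration.

```python
import time

class SessionBudget:
    """Tracks elapsed session time and signals when to checkpoint or stop."""

    def __init__(self, limit_h=12.0, safety_margin_h=0.5,
                 checkpoint_every_h=1.0, clock=time.monotonic):
        self._clock = clock                      # injectable so it can be tested
        self._start = clock()
        self._stop_after_s = (limit_h - safety_margin_h) * 3600
        self._ckpt_every_s = checkpoint_every_h * 3600
        self._last_ckpt = self._start

    def elapsed_s(self):
        return self._clock() - self._start

    def should_checkpoint(self):
        """True once per checkpoint interval; resets the interval timer."""
        now = self._clock()
        if now - self._last_ckpt >= self._ckpt_every_s:
            self._last_ckpt = now
            return True
        return False

    def should_stop(self):
        """True once the session is within the safety margin of the cutoff."""
        return self.elapsed_s() >= self._stop_after_s
```

In a training loop, create the budget in the first cell of the session, save a checkpoint whenever `should_checkpoint()` returns True, and break out cleanly (after one final save) when `should_stop()` returns True.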

⚠ WATCH OUT

The 30-hour weekly quota trap: A single LoRA experiment burns ~8 hours, so the quota covers at most 3 full experiments per week. Plan each experiment carefully, and don't spend GPU hours on debugging that you could have done locally.

Pre-flight Checklist

What You Need Before Weekend 1

Account setup (do this Friday evening)

Kaggle account

Verify phone number — required for GPU access

Hugging Face account

Free. You'll need it to download IndicF5 weights and to gate-accept its license

Hugging Face token

Settings → Access Tokens → New token (read access). Save it — you'll add it to Kaggle Secrets

Accept IndicF5 terms

Visit huggingface.co/ai4bharat/IndicF5 and click 'Agree' to access weights

GitHub account

For backing up notebooks. Optional but recommended.

Identify 3 native Odia speakers

Friends/family who'll listen to 30 audio samples each weekend. They are your evaluators — non-negotiable.

Goal: Know exactly where IndicF5 fails on Odia

Baseline Evaluation — Establish Truth

▸ NOTE

Saturday: build the test set and run IndicF5 inference. Sunday: get human evaluation from 3 native speakers. Output: a CSV that becomes your forever-benchmark for measuring every future improvement.

3 hours · No GPU needed

Saturday Morning: Test Set Creation

1.1

Build the 200-sentence Odia diagnostic test set · 2 hrs

This is the most important asset you build the entire weekend. Every future experiment compares against the IndicF5 output on these sentences. Spend the time to do it right.

# /kaggle/working/test_sentences.csv format:
# id,category,sentence,priority

ID    CATEGORY              SENTENCE                              COUNT
─────────────────────────────────────────────────────────────────────
T01   retroflex_contrast    ଟମାଟୋ ତମ ଘରେ ଅଛି                       30
T02   retroflex_lateral_ɭ   ଜଳ ଓ ଫଳ ସବୁଠାରୁ ଭଲ                      20
T03   conjunct_clusters     ସ୍କୁଲ ସ୍ୱାସ୍ଥ୍ୟ ଦ୍ୱାର                     30
T04   schwa_deletion        ଘର ବଜାର ଯିବାକୁ ସମୟ                      20
T05   yes_no_question       ତୁ ଆସିଛ କି?                              15
T06   wh_question           କେବେ ଆସିବ ତୁ?                            15
T07   exclamation           କେଡେ ସୁନ୍ଦର ଅଛି!                          10
T08   numbers_dates         ଆଜି ୧୫ ଅଗଷ୍ଟ ୨୦୨୪                         20
T09   news_register         ଭାରତ ସରକାର ନୂଆ ନୀତି ଘୋଷଣା                10
T10   conversational        ଆରେ ଭାଇ କୁଆଡେ ଯିବୁ?                       10
T11   long_sentence         (multi-clause sentences > 15 words)      10
T12   names_places          ଭୁବନେଶ୍ୱର, ସମ୍ବଲପୁର, କଟକ                   10

# DO NOT use GPT/Claude to generate Odia sentences.
# Source from: Odia Wikipedia, Odia news websites, Odia textbooks.
# Better: Have a native Odia speaker write the sentences for each category (COUNT column above, 200 in total).

⚠ WATCH OUT

The temptation to ask Claude or GPT to generate Odia sentences is huge. Don't. LLMs make subtle errors in low-resource languages — wrong grammar, anglicized constructions, missing diacritics. Your test set must be verified by a native Odia speaker before you use it. This takes 2 hours but saves you from invalidating your entire benchmark.
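Before handing the file to a native speaker, a mechanical check can catch the cheap mistakes (duplicate IDs, rows with no Odia script, a total that is not 200). The sketch below assumes the `id,category,sentence,priority` header described above; the function names are hypothetical, and it checks only the character set, not grammar or naturalness, so it complements human verification rather than replacing it.

```python
import csv
import io
from collections import Counter

ODIA_START, ODIA_END = 0x0B00, 0x0B7F  # Unicode block for the Odia script

def has_odia(text):
    """True if the text contains at least one character from the Odia block."""
    return any(ODIA_START <= ord(ch) <= ODIA_END for ch in text)

def validate_test_set(csv_text, expected_total=200):
    """Return (category_counts, problems) for the contents of test_sentences.csv."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts, problems, seen = Counter(), [], set()
    for row in reader:
        sid, sentence = row["id"], row["sentence"].strip()
        if sid in seen:
            problems.append(f"{sid}: duplicate id")
        seen.add(sid)
        if not has_odia(sentence):
            problems.append(f"{sid}: no Odia characters")
        counts[row["category"]] += 1
    total = sum(counts.values())
    if total != expected_total:
        problems.append(f"expected {expected_total} sentences, found {total}")
    return counts, problems
```

Call it with the file contents, for example `validate_test_set(open("test_sentences.csv", encoding="utf-8").read())`, and compare the returned category counts against the planned distribution.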

1.2

Identify 3 reference voice samples · 1 hr

IndicF5 is reference-conditioned — you give it a 5-10 second audio clip and it synthesizes in that voice. You need 3 different reference voices to test how voice choice affects quality.

# Where to get reference voices (zero cost, today):

# Option A: YouTube Odia content (download with yt-dlp on your laptop)
#   • OTV news anchor — clean, formal, professional
#   • Odia podcast voice — conversational
#   • Doordarshan archive voice — broadcast standard

# Use yt-dlp on laptop, NOT Kaggle:
yt-dlp -x --audio-format wav --audio-quality 0 \
       --postprocessor-args "-ar 24000 -ac 1" \
       "https://youtube.com/watch?v=XXX"

# Then trim to a clean 6-10 second segment with no music:
ffmpeg -i input.wav -ss 00:00:30 -t 7 -ar 24000 -ac 1 ref_otv.wav

# What makes a GOOD reference clip:
#   ✓ 6-10 seconds (longer = better voice fingerprint)
#   ✓ Clear speech, no background music
#   ✓ Single speaker, no overlap
#   ✓ Natural prosody (not over-emphasized)
#   ✓ Matches the gender/style you want output to be

# What makes a BAD reference clip:
#   ✗ Too short (< 4s) — voice characteristics not captured
#   ✗ Background music — IndicF5 will copy the music too
#   ✗ Multiple speakers — model gets confused
#   ✗ Heavy emotion — biases all output to that emotion
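The measurable parts of that checklist (duration, sample rate, mono) can be verified automatically before upload. This is a minimal sketch using Python's standard `wave` module, which reads integer PCM WAV files such as those produced by the ffmpeg command above; the function name and default thresholds are assumptions taken from the checklist, and music, overlap, and emotion still need a human ear.

```python
import wave

def check_reference_clip(path, min_s=6.0, max_s=10.0, want_rate=24000):
    """Return a list of problems found in a PCM WAV reference clip."""
    with wave.open(str(path), "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
        duration = wf.getnframes() / rate
    problems = []
    if not (min_s <= duration <= max_s):
        problems.append(f"duration {duration:.1f}s outside {min_s}-{max_s}s")
    if rate != want_rate:
        problems.append(f"sample rate {rate} Hz, expected {want_rate} Hz")
    if channels != 1:
        problems.append(f"{channels} channels, expected mono")
    return problems
```

An empty list means the clip passes the mechanical checks; anything else is a reason to re-trim or re-encode it.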

6 hours · GPU notebook

Saturday Afternoon: Kaggle Setup + IndicF5 Inference

1.3

Create the Kaggle notebook · 30 min

# Kaggle.com → New Notebook → Settings:
#   Accelerator: GPU T4 x2  (or P100 if T4 unavailable)
#   Internet: ON  (required to download IndicF5)
#   Persistence: Files only

# In Notebook → Add-ons → Secrets:
#   Add: HF_TOKEN = your_hugging_face_token

# Cell 1 — Setup (run once per session)
import os
from kaggle_secrets import UserSecretsClient
os.environ['HF_TOKEN'] = UserSecretsClient().get_secret("HF_TOKEN")

!pip install -q git+https://github.com/ai4bharat/IndicF5.git 2>&1 | tail -5
!pip install -q soundfile librosa pandas 2>&1 | tail -3
!huggingface-cli login --token $HF_TOKEN

# Verify GPU
import torch
print(f"CUDA: {torch.cuda.is_available()}")
print(f"Device: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
# Expected: Tesla T4, ~15.8 GB (about 14.7 GiB)

1.4

Upload test data to Kaggle Datasets · 30 min

Don't upload audio files inside notebooks. Create a Kaggle Dataset, upload once, mount it into all future notebooks. This survives session resets.

# Locally on your laptop:
mkdir odia-tts-baseline-data
cp test_sentences.csv odia-tts-baseline-data/
cp ref_otv.wav ref_dd.wav ref_podcast.wav odia-tts-baseline-data/
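# One transcript file per reference clip; the inference notebook reads all three.
# Replace the placeholder text with the actual Odia transcript of each clip.
for ref in otv dd podcast; do
  echo "TRANSCRIPT OF ref_${ref}.wav GOES HERE" > odia-tts-baseline-data/ref_${ref}_text.txt
done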

# Upload to Kaggle:
# kaggle.com/datasets → New Dataset → Upload folder
# Title: "odia-tts-baseline-data"
# License: CC0 if you own it, else "Other"
# Visibility: Private (it has your unreleased work)

# In Kaggle notebook → Add Data → search your dataset → Add
# It mounts at: /kaggle/input/odia-tts-baseline-data/

1.5

Load IndicF5 and run baseline inference · 3 hrs

# Cell 2 — Load IndicF5
from transformers import AutoModel
import numpy as np
import soundfile as sf
import pandas as pd
from pathlib import Path
import time

print("Loading IndicF5 (400M params, ~3 GB download)...")
t0 = time.time()
model = AutoModel.from_pretrained("ai4bharat/IndicF5", trust_remote_code=True)
model = model.to("cuda")
model.eval()
print(f"Loaded in {time.time()-t0:.1f}s")

# Cell 3 — Setup paths
DATA = Path("/kaggle/input/odia-tts-baseline-data")
OUT  = Path("/kaggle/working/baseline_outputs")
OUT.mkdir(exist_ok=True)

# Load test sentences
df = pd.read_csv(DATA / "test_sentences.csv")
print(f"Test sentences: {len(df)}")

# Load reference voices
REFS = {
    "otv":     (DATA / "ref_otv.wav",     open(DATA / "ref_otv_text.txt").read().strip()),
    "dd":      (DATA / "ref_dd.wav",      open(DATA / "ref_dd_text.txt").read().strip()),
    "podcast": (DATA / "ref_podcast.wav", open(DATA / "ref_podcast_text.txt").read().strip()),
}

# Cell 4 — Run inference (THE CRITICAL CELL)
results = []
errors = []

for i, row in df.iterrows():
    for ref_name, (ref_path, ref_text) in REFS.items():
        out_path = OUT / f"{row['id']}_{ref_name}.wav"

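        # Skip if already done (in case session resumes). Keep category and
        # sentence so the evaluation form built from `results` has no gaps.
        if out_path.exists():
            results.append({"id": row['id'], "category": row['category'],
                            "sentence": row['sentence'], "ref": ref_name,
                            "ok": True, "path": str(out_path)})
            continue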

        try:
            t0 = time.time()
            audio = model(
                row['sentence'],
                ref_audio_path=str(ref_path),
                ref_text=ref_text,
            )
            elapsed = time.time() - t0

            # Normalize and save
            if audio.dtype == np.int16:
                audio = audio.astype(np.float32) / 32768.0
            sf.write(out_path, np.array(audio, dtype=np.float32), samplerate=24000)

            results.append({
                "id": row['id'], "category": row['category'],
                "sentence": row['sentence'], "ref": ref_name,
                "duration_s": len(audio) / 24000,
                "synth_time_s": elapsed,
                "rtf": elapsed / (len(audio) / 24000),
                "ok": True, "path": str(out_path),
            })
            print(f"[{i:3d}/{len(df)}] {row['id']}_{ref_name} · {elapsed:.1f}s")
        except Exception as e:
            errors.append({"id": row['id'], "ref": ref_name, "error": str(e)})
            print(f"[ERROR] {row['id']}_{ref_name}: {e}")

# Cell 5 — Save results manifest
results_df = pd.DataFrame(results)
results_df.to_csv("/kaggle/working/baseline_results.csv", index=False)
pd.DataFrame(errors).to_csv("/kaggle/working/baseline_errors.csv", index=False)
print(f"\nSuccess: {len(results)}/{len(df)*3}")
print(f"Mean synthesis time: {results_df['synth_time_s'].mean():.2f}s")
print(f"Mean RTF: {results_df['rtf'].mean():.3f}")  # Real-Time Factor

✓ TIP

200 sentences × 3 reference voices = 600 inference calls. At ~5-10 s each on a T4, that is roughly 50-100 minutes of GPU time, which fits comfortably in one session.
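The same back-of-envelope estimate is worth re-running whenever the test set or the number of reference voices changes. A minimal sketch, with a hypothetical function name and the per-call timings quoted above taken as assumptions rather than measurements:

```python
def gpu_minutes(n_sentences, n_refs, sec_per_call_low, sec_per_call_high):
    """Return (low, high) estimated GPU minutes for a full inference sweep."""
    calls = n_sentences * n_refs
    return calls * sec_per_call_low / 60, calls * sec_per_call_high / 60

low, high = gpu_minutes(200, 3, 5, 10)
print(f"Estimated sweep time: {low:.0f}-{high:.0f} minutes")
```

Once the first few dozen clips have run, replace the assumed timings with the measured `synth_time_s` values for a tighter estimate.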

⚠ WATCH OUT

Memory leak warning: If you hit OOM after ~300 inferences, add torch.cuda.empty_cache() every 50 iterations. IndicF5's flow-matching decoder accumulates intermediate tensors.

1.6

Package outputs for human evaluation · 2 hrs

# Cell 6 — Package for evaluation
# Create a zip of all audio + a CSV that evaluators will fill in

import shutil

EVAL_DIR = Path("/kaggle/working/eval_package")
EVAL_DIR.mkdir(exist_ok=True)

# Copy audio
for r in results:
    shutil.copy(r['path'], EVAL_DIR / Path(r['path']).name)

# Build evaluator CSV
eval_df = pd.DataFrame(results)[['id', 'category', 'sentence', 'ref']]
eval_df['filename'] = eval_df['id'] + '_' + eval_df['ref'] + '.wav'
eval_df['naturalness_1to5'] = ''      # MOS
eval_df['intelligibility_1to5'] = ''  # Can you understand it?
eval_df['accent_correct_yn'] = ''     # Does it sound Odia (not Hindi-influenced)?
eval_df['mispronunciations'] = ''     # List specific words pronounced wrong
eval_df['overall_pass'] = ''          # 'pass' or 'fail' for production use
eval_df.to_csv(EVAL_DIR / "evaluation_form.csv", index=False)

# Build instructions for evaluator
instructions = '''
Odia TTS Baseline Evaluation
============================

You'll listen to ~200 audio clips of synthesized Odia speech (one reference voice).
For each clip in your assigned voice, fill in the matching row of evaluation_form.csv:

1. naturalness_1to5: How natural does it sound? (1=robotic, 5=human)
2. intelligibility_1to5: How clearly can you understand? (1=unclear, 5=perfectly clear)
3. accent_correct_yn: Does it sound like Odia, not Hindi? (y/n)
4. mispronunciations: List any words you noticed mispronounced
5. overall_pass: Would this be acceptable in production? (pass/fail)

Listen on headphones in a quiet room.
Estimated time: 3-4 hours total. Spread across multiple sessions.
'''
(EVAL_DIR / "INSTRUCTIONS.txt").write_text(instructions)

# Zip everything
shutil.make_archive("/kaggle/working/odia_baseline_eval", 'zip', EVAL_DIR)
print(f"Package size: {Path('/kaggle/working/odia_baseline_eval.zip').stat().st_size / 1e6:.0f} MB")
# Download from Kaggle's Output panel — should be ~150-200 MB

Critical phase — distribute the work

Sunday: Human Evaluation

Send the eval package to your 3 native speakers

Each evaluator gets

200 sentences × 1 reference voice = 200 clips

Split the 3 ref voices across 3 evaluators — they each rate one voice condition

Time per evaluator

3-4 hours of focused listening

Strongly suggest split across 2 days

Required

Headphones, quiet room, sober

Not exaggerating — drunken evaluation is real and produces invalid data

Pay them

₹500-1000 per evaluator

Even friends. Their time is worth money. This is the ONLY money you spend this weekend.
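Splitting the form so that each evaluator sees only their assigned voice can be done mechanically. The sketch below groups the evaluation rows by the `ref` column and shuffles each group; the shuffle (with a fixed seed) is an added assumption, intended to keep clips from always being heard in category order, and the function name is hypothetical.

```python
import random
from collections import defaultdict

def split_by_reference(rows, seed=0):
    """Group evaluation rows by reference voice and shuffle each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["ref"]].append(row)
    rng = random.Random(seed)          # fixed seed keeps the split reproducible
    for ref in sorted(groups):
        rng.shuffle(groups[ref])
    return dict(groups)
```

Feed it the rows of evaluation_form.csv (for example from `csv.DictReader`) and write each group to its own file with `csv.DictWriter`, one file per evaluator.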

1.7

Aggregate results into baseline metrics · 2 hrs

# Sunday evening: collect 3 filled CSVs from evaluators
# Upload them to Kaggle as new dataset version

# Cell 7 — Analyze baseline
import pandas as pd
import numpy as np

ev1 = pd.read_csv("/kaggle/input/.../eval_v1.csv")
ev2 = pd.read_csv("/kaggle/input/.../eval_v2.csv")
ev3 = pd.read_csv("/kaggle/input/.../eval_v3.csv")
all_evals = pd.concat([ev1, ev2, ev3])

# Overall MOS by category
print("=== BASELINE MOS BY CATEGORY ===")
print(all_evals.groupby('category')['naturalness_1to5'].agg(['mean', 'std', 'count']))

# Per-reference comparison
print("\n=== MOS BY REFERENCE VOICE ===")
print(all_evals.groupby('ref')['naturalness_1to5'].agg(['mean', 'std']))

# Intelligibility breakdown
print("\n=== INTELLIGIBILITY ===")
print(f"Mean: {all_evals['intelligibility_1to5'].mean():.2f}")
print(f"% scoring >= 4: {(all_evals['intelligibility_1to5'] >= 4).mean() * 100:.1f}%")

# Accent correctness
print("\n=== ACCENT CORRECTNESS ===")
print(f"% rated as 'sounds Odia': {(all_evals['accent_correct_yn'] == 'y').mean() * 100:.1f}%")

# Top failure categories
print("\n=== WORST CATEGORIES (need fine-tuning attention) ===")
worst = all_evals.groupby('category')['naturalness_1to5'].mean().sort_values()
print(worst.head(5))

# Common mispronunciations (will guide G2P fixes)
print("\n=== TOP MISPRONUNCIATIONS ===")
mispron = all_evals['mispronunciations'].dropna().str.split(',').explode().str.strip()
print(mispron.value_counts().head(20))
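Per-category means rest on small samples here (a category with 10 sentences gets only 30 ratings across the three voices), so it helps to report an uncertainty band alongside each mean. The sketch below uses a normal approximation for a 95% interval, which is a simplifying assumption for 1-5 ordinal ratings; the function name is hypothetical.

```python
import math
import statistics

def mos_with_ci(scores, z=1.96):
    """Return (mean, half_width) of an approximate 95% confidence interval."""
    n = len(scores)
    mean = statistics.fmean(scores)
    if n < 2:
        return mean, float("nan")   # no spread estimate from a single rating
    half_width = z * statistics.stdev(scores) / math.sqrt(n)
    return mean, half_width
```

When comparing a future fine-tuned model against this baseline, treat a difference smaller than the half-widths as inconclusive rather than as an improvement.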

Decision gate at end of Weekend 1

# Three possible outcomes. Your strategy depends on which one you got:

OUTCOME A: Mean MOS >= 3.8, intelligibility > 85%, accent correct > 80%
  → IndicF5 is already very good for Odia
  → Weekend 2-3: Skip heavy fine-tuning. Build the demo. Apply for funding NOW.
  → You may not need to fine-tune at all. The product question (distribution,
     UX, packaging) is more important than the model question.

OUTCOME B: Mean MOS 3.0-3.8, some specific failure categories
  → STANDARD PATH. Proceed with weekends 2-3 as planned.
  → Focus fine-tuning data collection on the failing categories from your analysis.
  → Don't waste training data on what already works.

OUTCOME C: Mean MOS < 3.0, accent feels wrong, intelligibility < 70%
  → IndicF5's Odia training was insufficient.
  → Heavier intervention needed. Consider whether to fine-tune harder
     OR pivot strategy (Indic Parler-TTS as alternative base, or different approach).
  → Don't proceed blindly with weekend 2-3 plan if you're in this bucket.
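Encoding the gate as a function removes the temptation to reinterpret the thresholds after seeing the numbers. This sketch uses the thresholds listed above, with two assumptions the text leaves open: outcome C is triggered by either a mean MOS below 3.0 or intelligibility below 70%, and any mixed result that is neither A nor C falls into B. Here `intelligibility_pct` means the percentage of ratings scoring 4 or higher.

```python
def decision_gate(mean_mos, intelligibility_pct, accent_pct):
    """Map baseline metrics to outcome 'A', 'B', or 'C'."""
    if mean_mos >= 3.8 and intelligibility_pct > 85 and accent_pct > 80:
        return "A"   # already good: build the demo
    if mean_mos < 3.0 or intelligibility_pct < 70:
        return "C"   # heavier intervention or a different base model
    return "B"       # standard path: targeted fine-tuning
```

Write the returned letter into your notes on Sunday evening, before planning Weekend 2.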

Goal: ~5 hours of clean Odia audio with transcripts

Build the Fine-Tuning Dataset

▸ NOTE

5 hours sounds small but is enough for a LoRA fine-tune that shows whether the approach works. The goal isn't a production model yet; it's evidence that fine-tuning IndicF5 on Odia improves it. If 5 hrs gives a measurable bump, 50 hrs is likely to give a larger one.

⚠ WATCH OUT

Quality over quantity, hard rule: 5 hours of clean studio-quality audio outperforms 50 hours of noisy YouTube. Set strict thresholds and reject aggressively. Most of your weekend is filtering, not collecting.

6-8 hours, mostly unattended

Saturday: Source & Download

2.1

Pick 3-5 high-quality Odia channels · 1 hr

# Best sources for Odia speech, ranked by expected yield:

#1 Doordarshan Odia (DD Odia)
   - Government broadcaster, formal Odia
   - Minimal background music in news segments
   - Consistent broadcast-quality audio
   - https://youtube.com/@ddodisha
   - Expected yield after filtering: 40-50%

#2 OTV (Odisha TV) - news only
   - Studio anchor segments, clear formal Odia
   - Skip field reports, debates, talk shows
   - Search: "OTV news anchor" or "OTV bulletin"
   - Expected yield: 25-30%

#3 Odia audiobook channels
   - Search "ଓଡ଼ିଆ ଗଳ୍ପ" (Odia stories) on YouTube
   - Single narrator, clean recording, narrative prosody
   - Expected yield: 50-60%

#4 Odia podcast channels (skip if amateur)
   - Only if professionally recorded
   - Expected yield: variable

# Avoid for Weekend 2 dataset:
✗ Movie/serial dialogues (heavy music, multiple speakers)
✗ Bhajans / religious chants (musical, repetitive)
✗ Phone-quality interviews (poor SNR)
✗ Outdoor / field reports (wind, traffic noise)

2.2

Download with yt-dlp on your LAPTOP (not Kaggle) · 3-4 hrs (mostly unattended)

⚠ WATCH OUT

Don't use Kaggle for downloading. Kaggle's outbound bandwidth is unreliable for yt-dlp at scale, sessions time out, and you're wasting GPU time on CPU-bound work. Run downloads on your laptop overnight; upload the results to Kaggle as a Dataset.

# On your laptop (Linux/Mac/WSL):
pip install yt-dlp

# Create a list of YouTube URLs (50-100 videos targeting ~30-40 hours total)
cat > odia_urls.txt << 'EOF'
https://www.youtube.com/watch?v=VIDEO_ID_1
https://www.youtube.com/watch?v=VIDEO_ID_2
... (50-100 URLs)
EOF

# Download audio only, 16kHz mono WAV
mkdir -p raw_audio
yt-dlp \
  --batch-file odia_urls.txt \
  --extract-audio \
  --audio-format wav \
  --audio-quality 0 \
  --postprocessor-args "-ar 16000 -ac 1" \
  --output "raw_audio/%(id)s.%(ext)s" \
  --write-info-json \
  --no-overwrites \
  --sleep-interval 3

# Check download size:
du -sh raw_audio/
# Expected: roughly 4-5 GB for ~40 hours (16 kHz mono 16-bit PCM is ~115 MB per hour)

2.3

Pre-filter on laptop before uploading to Kaggle · 2 hrs

# Filter on laptop — saves Kaggle storage + GPU time

# Quick filter: drop very short or very long files
from pathlib import Path

raw_dir = Path("raw_audio")
keep_dir = Path("filtered_audio")
keep_dir.mkdir(exist_ok=True)

# 16 kHz mono 16-bit PCM = 32,000 bytes/s, i.e. about 1.92 MB per minute
MB_PER_MIN = 16000 * 2 * 60 / 1e6

for wav in raw_dir.glob("*.wav"):
    # Estimate duration from file size (the small WAV header is ignored)
    duration_min = wav.stat().st_size / 1e6 / MB_PER_MIN
    if 5 <= duration_min <= 30:  # Keep 5-30 minute files
        wav.rename(keep_dir / wav.name)

# Check what survived
kept = list(keep_dir.glob("*.wav"))
total_minutes = sum(f.stat().st_size / 1e6 / MB_PER_MIN for f in kept)
print(f"Kept: {len(kept)} files, ~{total_minutes/60:.1f} hours")
# Target: 30-40 hours of raw audio survives this filter

# Upload to Kaggle as new Dataset
# kaggle datasets create -p filtered_audio --dir-mode zip
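Size and duration figures for uncompressed audio are easy to sanity-check, because PCM size is just sample rate × channels × bytes per sample × seconds. A minimal sketch, assuming 16-bit samples (the usual WAV output from ffmpeg) and ignoring the small file header; both function names are hypothetical.

```python
def pcm_wav_gb(hours, sample_rate=16000, channels=1, bytes_per_sample=2):
    """Approximate size in GB (1e9 bytes) of uncompressed PCM audio."""
    return hours * 3600 * sample_rate * channels * bytes_per_sample / 1e9

def duration_minutes(size_bytes, sample_rate=16000, channels=1, bytes_per_sample=2):
    """Approximate duration of a PCM WAV file from its size in bytes."""
    return size_bytes / (sample_rate * channels * bytes_per_sample) / 60

print(f"40 h at 16 kHz mono: {pcm_wav_gb(40):.2f} GB")
print(f"40 h at 24 kHz mono: {pcm_wav_gb(40, sample_rate=24000):.2f} GB")
```

The second line matters for storage planning: the same audio resampled to 24 kHz for training takes 1.5 times the space, which counts against the 20 GB Kaggle Dataset allowance.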

The actual filtering — most of GPU time goes here

Sunday: Run the Quality Pipeline on Kaggle

2.4

Stage 1: VAD + segmentation · 2 hrs GPU

# Kaggle notebook — Stage 1 of pipeline
!pip install -q silero-vad torch torchaudio soundfile

import torch, torchaudio
from silero_vad import load_silero_vad, read_audio, get_speech_timestamps
from pathlib import Path
import json

vad_model = load_silero_vad()

INPUT_DIR = Path("/kaggle/input/odia-raw-audio")
OUT_DIR = Path("/kaggle/working/segments")
OUT_DIR.mkdir(exist_ok=True)
manifest = []

for wav_file in INPUT_DIR.glob("*.wav"):
    audio = read_audio(str(wav_file), sampling_rate=16000)

    # Get speech segments
    segments = get_speech_timestamps(
        audio, vad_model,
        threshold=0.5,
        sampling_rate=16000,
        min_speech_duration_ms=3000,    # Min 3s segments for TTS
        max_speech_duration_s=20,       # Max 20s — keeps GPU memory bounded
        min_silence_duration_ms=500,
        speech_pad_ms=200,
    )

    for i, ts in enumerate(segments):
        start_s, end_s = ts['start'] / 16000, ts['end'] / 16000
        duration = end_s - start_s
        if duration < 3 or duration > 20:
            continue

        seg_audio = audio[ts['start']:ts['end']]
        seg_id = f"{wav_file.stem}_{i:04d}"
        seg_path = OUT_DIR / f"{seg_id}.wav"
        torchaudio.save(str(seg_path), seg_audio.unsqueeze(0), 16000)

        manifest.append({
            "segment_id": seg_id,
            "source_file": wav_file.name,
            "start_s": start_s, "end_s": end_s, "duration_s": duration,
            "path": str(seg_path),
        })

# Save manifest
import pandas as pd
pd.DataFrame(manifest).to_csv("/kaggle/working/segments_manifest.csv", index=False)
print(f"Total segments: {len(manifest)}")
print(f"Total speech hours: {sum(s['duration_s'] for s in manifest)/3600:.1f}")
# Expected: ~10-15 hours after VAD (from 30-40 hr raw)

2.5

Stage 2: Whisper transcription + filter · 3-4 hrs GPU

# Kaggle notebook — Stage 2: Transcribe + filter low-confidence
!pip install -q faster-whisper

from faster_whisper import WhisperModel
import pandas as pd
import json
from pathlib import Path

# Load Whisper large-v3
print("Loading Whisper large-v3 (3 GB)...")
whisper = WhisperModel("large-v3", device="cuda", compute_type="float16")

manifest = pd.read_csv("/kaggle/working/segments_manifest.csv")
results = []

for i, row in manifest.iterrows():
    try:
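        # Check first that your Whisper checkpoint accepts "or": the stock
        # Whisper language list does not appear to include Odia, in which
        # case this call raises an error and an Odia-capable ASR model
        # is needed for this stage.
        segments, info = whisper.transcribe(
            row['path'],
            language="or",                  # Force Odia
            word_timestamps=True,
            beam_size=5,
            temperature=0.0,
            condition_on_previous_text=False,
        )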
        segments = list(segments)
        if not segments:
            continue

        text = " ".join(s.text for s in segments).strip()
        avg_logprob = sum(s.avg_logprob for s in segments) / len(segments)
        no_speech = max(s.no_speech_prob for s in segments)
        word_probs = [w.probability for s in segments for w in s.words]
        min_word_prob = min(word_probs) if word_probs else 0

        # FILTER: reject low-quality transcripts
        if avg_logprob < -0.7:        continue  # Whisper unsure
        if no_speech > 0.3:           continue  # Probably not speech
        if min_word_prob < 0.5:       continue  # Some word very unsure
        if len(text.split()) < 3:     continue  # Too short
        if len(text.split()) > 60:    continue  # Too long for TTS training
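        # With a forced language, faster-whisper reports language_probability
        # as 1.0, so this check only rejects anything when language is auto-detected
        if info.language_probability < 0.85: continue  # Not confidently Odia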

        results.append({
            **row.to_dict(),
            "transcript": text,
            "whisper_confidence": float(avg_logprob),
            "lang_prob": float(info.language_probability),
        })

        if i % 100 == 0:
            print(f"[{i}/{len(manifest)}] Kept: {len(results)}")
    except Exception as e:
        print(f"[ERROR] {row['segment_id']}: {e}")

# Save filtered manifest
filtered_df = pd.DataFrame(results)
filtered_df.to_csv("/kaggle/working/transcribed_segments.csv", index=False)
print(f"\nAfter Whisper filter: {len(filtered_df)} segments")
print(f"Hours: {filtered_df['duration_s'].sum()/3600:.1f}")
# Expected: ~6-8 hours survive
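When far fewer hours survive than expected, it helps to know which threshold is doing the rejecting. The sketch below restates the same thresholds as a pure function that returns a reason instead of silently skipping, so the reasons can be counted; the function names and reason labels are hypothetical, and the language-probability check is left out for brevity.

```python
from collections import Counter

def rejection_reason(text, avg_logprob, no_speech_prob, min_word_prob):
    """Return None if the segment passes, else a short reason string."""
    n_words = len(text.split())
    if avg_logprob < -0.7:
        return "low_avg_logprob"
    if no_speech_prob > 0.3:
        return "no_speech"
    if min_word_prob < 0.5:
        return "low_word_prob"
    if n_words < 3:
        return "too_short"
    if n_words > 60:
        return "too_long"
    return None

def summarize(candidates):
    """Count outcomes for (text, avg_logprob, no_speech_prob, min_word_prob) tuples."""
    return Counter(rejection_reason(*c) or "kept" for c in candidates)
```

If one reason dominates the counts (for example `low_word_prob`), relax that single threshold and listen to a sample of the newly admitted segments before loosening anything else.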

2.6

Stage 3: Quality scoring (DNSMOS) · 1-2 hrs GPU

# Stage 3 — DNSMOS scoring (Microsoft's automatic MOS predictor)
!pip install -q onnxruntime-gpu requests
!wget -q https://github.com/microsoft/DNS-Challenge/raw/master/DNSMOS/DNSMOS/sig_bak_ovr.onnx
!wget -q https://github.com/microsoft/DNS-Challenge/raw/master/DNSMOS/DNSMOS/model_v8.onnx

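import onnxruntime as ort
import librosa
import numpy as np
import pandas as pd

# DNSMOS scoring function (simplified; full code on MS DNS-Challenge GitHub).
# sig_bak_ovr.onnx takes raw audio and returns [signal, background, overall];
# model_v8.onnx is the separate P.808 model and expects mel features instead.
sess = ort.InferenceSession(
    "sig_bak_ovr.onnx",
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider'],
)

WINDOW = int(9.01 * 16000)  # window length used by the reference script

def score_dnsmos(audio_path):
    audio, sr = librosa.load(audio_path, sr=16000)
    # Pad or trim to one window (only the first ~9 s of a segment is scored)
    if len(audio) < WINDOW:
        audio = np.pad(audio, (0, WINDOW - len(audio)))
    audio = audio[:WINDOW]

    # Run inference
    input_features = audio.astype(np.float32).reshape(1, -1)
    out = sess.run(None, {'input_1': input_features})
    # Raw overall MOS; the reference script also applies a polynomial calibration
    return float(out[0][0][2])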

# Score all transcribed segments
df = pd.read_csv("/kaggle/working/transcribed_segments.csv")
df['dnsmos'] = df['path'].apply(score_dnsmos)

# Final filter: only keep DNSMOS >= 3.5 (good quality)
final = df[df['dnsmos'] >= 3.5].copy()
final = final.sort_values('dnsmos', ascending=False)

# Cap at 5 hours total (we don't need more for prototype)
total_seconds = 0
keep_indices = []
for idx, row in final.iterrows():
    if total_seconds + row['duration_s'] > 5 * 3600:
        break
    keep_indices.append(idx)
    total_seconds += row['duration_s']

final_5hr = final.loc[keep_indices]
final_5hr.to_csv("/kaggle/working/training_dataset_v1.csv", index=False)
print(f"Final training set: {len(final_5hr)} segments, {total_seconds/3600:.2f} hours")
print(f"Mean DNSMOS: {final_5hr['dnsmos'].mean():.2f}")
print(f"Mean Whisper confidence: {final_5hr['whisper_confidence'].mean():.2f}")

# Save as Kaggle Dataset for Weekend 3
# Files to save: training_dataset_v1.csv + all the .wav files referenced in it

End of Weekend 2 — Yield Reality

Raw downloaded

30-40 hours

From YouTube

After VAD segmentation

10-15 hours

Pure speech segments only

After Whisper confidence filter

6-8 hours

Reliable transcripts only

After DNSMOS quality filter

4-6 hours

This is your training set

Yield rate

~15%

As predicted in original pipeline design
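The funnel above can be recomputed from your own stage totals to see where most of the audio is lost. A minimal sketch with a hypothetical function name; the example numbers are the midpoints of the ranges in the table, not measurements.

```python
def funnel(stages):
    """Given [(name, hours), ...] in pipeline order, return retention per stage.

    Each row is (name, hours, pct_of_previous_stage, pct_of_raw).
    """
    rows = []
    raw = prev = stages[0][1]
    for name, hours in stages:
        rows.append((name, hours, 100 * hours / prev, 100 * hours / raw))
        prev = hours
    return rows

stages = [("raw", 35), ("vad", 12.5), ("whisper", 7), ("dnsmos", 5)]
for name, hours, pct_prev, pct_raw in funnel(stages):
    print(f"{name:8s} {hours:5.1f} h  {pct_prev:5.1f}% of previous  {pct_raw:5.1f}% of raw")
```

With those midpoints the final stage retains about 14% of the raw audio, in line with the ~15% yield quoted in the table.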

The moment of truth — does it actually help?

LoRA Fine-Tune of IndicF5

▸ NOTE

This is the technical climax of the prototype. By Sunday evening, you'll know if fine-tuning IndicF5 on Odia data measurably improves output. The answer determines whether the entire sovereign Odia TTS thesis is viable.

⚠ WATCH OUT

Important caveat: IndicF5's repository as of late 2025 doesn't ship an official fine-tuning script. You'll be using the underlying F5-TTS training infrastructure with IndicF5 weights as the starting checkpoint. This works because IndicF5 is architecturally identical to F5-TTS — but expect some integration debugging.

3 hours — most goes to debugging the integration

Saturday Morning: Setup F5-TTS Training Code

3.1

Clone F5-TTS training code · 30 min

# Kaggle notebook for Weekend 3
# Settings: GPU T4 x2, Internet ON

!git clone https://github.com/SWivid/F5-TTS.git /kaggle/working/F5-TTS
%cd /kaggle/working/F5-TTS
!pip install -q -e ".[eval]"
!pip install -q peft accelerate

# Verify import
import sys
sys.path.insert(0, "/kaggle/working/F5-TTS/src")
from f5_tts.model import CFM, DiT
print("F5-TTS training code loaded")

3.2

Convert IndicF5 weights to F5-TTS-compatible format · 1 hr

# IndicF5 uses HF transformers wrapper. F5-TTS training expects raw checkpoint.
# We need to extract the underlying F5-TTS state_dict from IndicF5.

import torch
from transformers import AutoModel

# Load IndicF5
indicf5 = AutoModel.from_pretrained("ai4bharat/IndicF5", trust_remote_code=True)

# Extract the underlying model state
# (Inspect indicf5.state_dict().keys() to find the F5 backbone)
backbone_state = {}
prefix = "model."
# state_dict() includes buffers as well as parameters
for name, tensor in indicf5.state_dict().items():
    # IndicF5 prefixes the F5 model; strip only the leading prefix,
    # not every occurrence of "model." inside the name
    if name.startswith(prefix):
        backbone_state[name[len(prefix):]] = tensor.detach().clone()

# Save as F5-TTS-compatible checkpoint
torch.save({
    "model_state_dict": backbone_state,
    "step": 0,  # We're starting fresh fine-tuning
    "ema_model_state_dict": backbone_state,  # Initialize EMA same as model
}, "/kaggle/working/indicf5_base_for_finetune.pt")

print(f"Backbone params: {len(backbone_state)}")
print("Saved as indicf5_base_for_finetune.pt")

⚠ WATCH OUT

The exact parameter name mapping between IndicF5 and F5-TTS might need adjustment. Inspect with indicf5.state_dict().keys() first. This is the most likely place where you'll spend 1-2 hours debugging — IndicF5's HF wrapper may name things differently than vanilla F5-TTS.
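A quick diff of parameter names and shapes makes that debugging much faster than reading two long key lists by eye. The sketch below works on plain `{name: shape}` dictionaries, so it runs without loading either model; the function name is hypothetical, and the names in the usage note are only examples of how you might build the inputs.

```python
def compare_keys(source_shapes, target_shapes):
    """Compare two {param_name: shape} mappings.

    Returns (missing, unexpected, mismatched): names the target needs but the
    source lacks, names only the source has, and shared names whose shapes differ.
    """
    src, tgt = set(source_shapes), set(target_shapes)
    missing = sorted(tgt - src)
    unexpected = sorted(src - tgt)
    mismatched = sorted(
        name for name in src & tgt
        if tuple(source_shapes[name]) != tuple(target_shapes[name])
    )
    return missing, unexpected, mismatched
```

Build each mapping with `{k: tuple(v.shape) for k, v in some_model.state_dict().items()}`, once for the extracted IndicF5 backbone and once for a freshly constructed F5-TTS model. A consistent prefix in the `unexpected` list usually points to the renaming rule you still need to apply.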

3.3

Prepare training data in F5-TTS format · 1 hr

# F5-TTS training expects: a directory of .wav files + a CSV with (audio_path, text)
# Format your Weekend 2 output to match

import pandas as pd
import shutil
from pathlib import Path

src_csv = "/kaggle/input/odia-training-data-v1/training_dataset_v1.csv"
src_audio = Path("/kaggle/input/odia-training-data-v1/")
out_dir = Path("/kaggle/working/training_data")
out_dir.mkdir(exist_ok=True)

df = pd.read_csv(src_csv)

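# F5-TTS expects 24kHz audio. Resample if needed.
# (Upsampling 16 kHz recordings adds no content above 8 kHz.)
import librosa, soundfile as sf
ft_records = []
for i, row in df.iterrows():
    # row['path'] points into last weekend's /kaggle/working, which is gone;
    # resolve the file inside the mounted dataset (assumes a flat layout)
    src_path = src_audio / Path(row['path']).name
    audio, sr = librosa.load(src_path, sr=24000)
    out_path = out_dir / f"{row['segment_id']}.wav"
    sf.write(out_path, audio, 24000)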
    ft_records.append({
        "audio_path": str(out_path),
        "text": row['transcript'],
        "duration": row['duration_s'],
    })

ft_df = pd.DataFrame(ft_records)
ft_df.to_csv("/kaggle/working/training_manifest_f5.csv", index=False)
print(f"Training manifest: {len(ft_df)} samples, {ft_df['duration'].sum()/3600:.2f} hrs")

2 hours

Saturday Afternoon: Configure LoRA

3.4

Wrap F5-TTS model with PEFT LoRA · 1.5 hrs

from peft import LoraConfig, get_peft_model, TaskType
import torch
from f5_tts.model import CFM, DiT

# Load base F5-TTS architecture
model = CFM(
    transformer=DiT(
        dim=1024, depth=22, heads=16, ff_mult=2,
        text_dim=512, conv_layers=4,
    ),
    mel_spec_kwargs=dict(
        n_fft=1024, hop_length=256, win_length=1024,
        n_mel_channels=100, target_sample_rate=24000,
    ),
)

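# Load IndicF5 weights into it
ckpt = torch.load("/kaggle/working/indicf5_base_for_finetune.pt", map_location="cpu")
# strict=False skips names that don't line up, so report what did not load
missing, unexpected = model.load_state_dict(ckpt['model_state_dict'], strict=False)
print(f"Missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
print("Loaded IndicF5 weights into F5-TTS architecture")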

# Configure LoRA — target attention layers in transformer
lora_config = LoraConfig(
    r=16,                              # Low rank — fits Kaggle VRAM
    lora_alpha=32,
    target_modules=[
        "to_q", "to_v",                # Attention query+value
        "ff.0", "ff.2",                # FFN layers (sometimes named differently)
    ],
    lora_dropout=0.1,
    bias="none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Expected: ~4-8M trainable / 400M total (1-2%)

# Move to GPU
model = model.to("cuda")
print(f"VRAM: {torch.cuda.memory_allocated() / 1e9:.1f} GB")
# Should be ~4-5 GB — plenty of room on T4 16 GB

✕ TRAP

The target_modules trap: F5-TTS's exact attention layer names may be to_q/to_kv instead of to_q/to_v. Run print([n for n,_ in model.named_modules()][:50]) to see actual names before configuring LoRA. If none of the names match, PEFT raises an error; if only some match, LoRA silently adapts fewer layers than you intended.
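A dry run over the module names shows which targets would actually be adapted before any training time is spent. The sketch below approximates how a list-valued `target_modules` is matched (a module is selected when its name equals a target or ends with `.` plus the target); treat that rule as an assumption and confirm it against your installed PEFT version. The module names in the example are hypothetical, not taken from F5-TTS.

```python
def matched_modules(module_names, targets):
    """Return {target: [module names it would match]} using exact-or-suffix matching."""
    hits = {t: [] for t in targets}
    for name in module_names:
        for t in targets:
            if name == t or name.endswith("." + t):
                hits[t].append(name)
    return hits

# Hypothetical names for illustration; use [n for n, _ in model.named_modules()]
names = [
    "transformer.blocks.0.attn.to_q",
    "transformer.blocks.0.attn.to_k",
    "transformer.blocks.0.attn.to_v",
    "transformer.blocks.0.ff.ff.0",
]
for target, found in matched_modules(names, ["to_q", "to_kv", "ff.0"]).items():
    print(f"{target}: {len(found)} match(es)")
```

Any target with zero matches is a typo or a naming difference to resolve before calling `get_peft_model`.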

3.5

Set up training loop with Kaggle session safety · 1 hr

# Critical: Kaggle sessions die at 12 hours. Build resumability from the start.

import torch
from torch.utils.data import DataLoader
from pathlib import Path

CHECKPOINT_DIR = Path("/kaggle/working/lora_checkpoints")
CHECKPOINT_DIR.mkdir(exist_ok=True)

def save_checkpoint(model, optimizer, step, loss):
    ckpt = {
        "step": step,
        "lora_state_dict": {k: v for k, v in model.state_dict().items() if "lora" in k.lower()},
        "optimizer_state_dict": optimizer.state_dict(),
        "loss": loss,
    }
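    # Atomic write: don't corrupt latest.pt if the session dies mid-save
    tmp = CHECKPOINT_DIR / "latest.pt.tmp"
    torch.save(ckpt, tmp)
    tmp.replace(CHECKPOINT_DIR / "latest.pt")  # overwrites the old file in one step
    # Also keep periodic snapshots. This check only runs on steps where
    # save_checkpoint() is called, so keep the interval a multiple of SAVE_EVERY
    # (with SAVE_EVERY = 200, "every 500" would only ever fire at 1000, 2000, ...)
    if step % 1000 == 0:
        torch.save(ckpt, CHECKPOINT_DIR / f"step_{step}.pt")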

def load_checkpoint(model, optimizer):
    latest = CHECKPOINT_DIR / "latest.pt"
    if not latest.exists():
        return 0
    ckpt = torch.load(latest, map_location="cuda")
    # Load only LoRA params
    model.load_state_dict({**model.state_dict(), **ckpt['lora_state_dict']}, strict=False)
    optimizer.load_state_dict(ckpt['optimizer_state_dict'])
    print(f"Resumed from step {ckpt['step']}")
    return ckpt['step']

# Optimizer — only train LoRA params
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad],
    lr=1e-4,
    weight_decay=0.01,
)

# Resume if checkpoint exists
start_step = load_checkpoint(model, optimizer)
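Because the 12-hour cutoff is hard, it can help to stop training deliberately a little before it rather than being killed mid-step. The sketch below is one way to track that budget. The class name SessionBudget, the 30-minute safety margin, and the injectable clock are all illustrative choices, not part of the plan or of any Kaggle API.

```python
import time


class SessionBudget:
    """Tracks wall-clock time against a hard session limit (hypothetical helper)."""

    def __init__(self, limit_hours=12.0, safety_margin_min=30.0, clock=time.monotonic):
        self._clock = clock                  # injectable so the logic can be tested
        self._start = clock()
        self._limit_s = limit_hours * 3600 - safety_margin_min * 60

    def elapsed_s(self):
        """Seconds since this object was created."""
        return self._clock() - self._start

    def should_stop(self):
        """True once the session is within the safety margin of the limit."""
        return self.elapsed_s() >= self._limit_s
```

Create the object in the first cell of the session (so it measures session time, not training time), then check budget.should_stop() once per step inside the training loop; when it returns True, call save_checkpoint(...) and break.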

8 hours of GPU time, then evaluate

Saturday Evening + Sunday: Train

3.6

Run LoRA training · 6-8 hrs

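# Training loop: the actual thing
# NOTE: F5TTSDataset and its import path are placeholders. Check your installed
# f5_tts package for the dataset class it actually ships and adapt the arguments.
from f5_tts.dataset import F5TTSDataset  # adapt to your installed version
from torch.utils.data import DataLoader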

# Build dataset
dataset = F5TTSDataset(
    csv_path="/kaggle/working/training_manifest_f5.csv",
    target_sample_rate=24000,
    max_duration=20.0,
)
loader = DataLoader(
    dataset, batch_size=4,             # Small batch for T4 VRAM
    num_workers=2, shuffle=True,
)

# Train
TOTAL_STEPS = 5000                     # ~6-8 hrs on T4
LOG_EVERY = 50
SAVE_EVERY = 200

model.train()
step = start_step
import time
t_start = time.time()
running_loss = 0.0
last_loss = float("nan")               # stays NaN if no step runs this session

while step < TOTAL_STEPS:
    for batch in loader:
        if step >= TOTAL_STEPS: break

        # Move to GPU
        mel = batch['mel'].to("cuda")
        text = batch['text']
        durations = batch['durations'].to("cuda")

        # Forward + flow-matching loss
        # Assumes forward() returns a tuple whose first item is the loss;
        # the number of extra return values depends on your f5_tts version
        loss, *_ = model(mel, text=text, lens=durations)

        # Backward
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        optimizer.zero_grad()

        last_loss = loss.item()
        running_loss += last_loss
        step += 1

        # Logging
        if step % LOG_EVERY == 0:
            avg_loss = running_loss / LOG_EVERY
            elapsed = time.time() - t_start
            # Rate uses steps done in THIS session so the ETA stays correct after a resume
            eta = (elapsed / (step - start_step)) * (TOTAL_STEPS - step)
            print(f"step {step:5d} | loss {avg_loss:.4f} | "
                  f"elapsed {elapsed/60:.1f}m | ETA {eta/60:.1f}m")
            running_loss = 0.0

        # Checkpoint
        if step % SAVE_EVERY == 0:
            save_checkpoint(model, optimizer, step, last_loss)

# Final save
save_checkpoint(model, optimizer, step, last_loss)
print(f"\nTraining complete: reached step {step} ({step - start_step} this session) "
      f"in {(time.time()-t_start)/3600:.1f} hours")

⚠ WATCH OUT

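If your session dies mid-training: Restart the notebook, re-run cells 1-5, and load_checkpoint() will resume from the last saved step. This only works if lora_checkpoints/ survived the restart. /kaggle/working is wiped between sessions, so copy latest.pt somewhere persistent (a Kaggle Dataset or a private Hugging Face repo) at regular intervals during training, and restore it before calling load_checkpoint().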

⚠ WATCH OUT

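If loss isn't decreasing: Common causes: (1) target_modules wrong, so no params are actually training; (2) learning rate too low, try 3e-4; (3) data preprocessing mismatch with what F5-TTS expects. Expect a clear downward trend within the first 500 steps. The drop from ~5.0 to ~2.5 is the target for the full 5000-step run, and the absolute values depend on how your F5-TTS version scales the loss.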
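"Is the loss decreasing?" is easier to answer from numbers than from scrolling logs. The helper below is a minimal sketch that compares the average of the earliest and latest logged losses. The function name, the window of 5 logged values, and the 2% threshold are illustrative assumptions, not values from F5-TTS or this plan.

```python
from statistics import mean


def loss_is_improving(losses, window=5, min_rel_drop=0.02):
    """Compare the mean of the first and last `window` logged losses.

    Returns None when there is not enough history, otherwise True if the
    late mean is at least `min_rel_drop` (relative) below the early mean.
    """
    if len(losses) < 2 * window:
        return None                      # not enough history to judge
    early = mean(losses[:window])
    late = mean(losses[-window:])
    if early == 0:
        return False                     # avoid dividing by zero
    return (early - late) / abs(early) >= min_rel_drop
```

With LOG_EVERY = 50, appending each avg_loss to a list gives ten values after the first 500 steps, which is exactly enough for the default window. A False result at that point is the cue to check target_modules before spending more GPU hours.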

3.7

Sunday: Inference comparison · 3 hrs

# Sunday morning: generate fine-tuned outputs on the SAME 200 sentences from Weekend 1

# Load fine-tuned model
ft_ckpt = torch.load(CHECKPOINT_DIR / "latest.pt", map_location="cuda")
model.load_state_dict({**model.state_dict(), **ft_ckpt['lora_state_dict']}, strict=False)
model.eval()

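# Inference on test sentences
import pandas as pd
import soundfile as sf                 # needed for sf.write below
test_df = pd.read_csv("/kaggle/input/odia-baseline-data/test_sentences.csv")
OUT_FT = Path("/kaggle/working/finetune_outputs")
OUT_FT.mkdir(exist_ok=True)

# NOTE: the sample() call below is schematic. Use the same inference entry point
# (and vocoder step, if any) that produced the Weekend 1 baseline, so the only
# difference between the two sets of audio is the LoRA adapter.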
for i, row in test_df.iterrows():
    for ref_name, (ref_path, ref_text) in REFS.items():  # Same refs as Weekend 1
        out_path = OUT_FT / f"{row['id']}_{ref_name}_ft.wav"
        with torch.no_grad():
            audio = model.sample(
                text=row['sentence'],
                ref_audio=ref_path,
                ref_text=ref_text,
                steps=32,
                cfg_strength=2.0,
            )
        sf.write(out_path, audio.cpu().numpy(), 24000)

# Now: package side-by-side comparison for evaluators
EVAL = Path("/kaggle/working/comparison_eval")
EVAL.mkdir(exist_ok=True)

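import random
import shutil

random.seed(0)                         # reproducible label assignment
key_rows = []                          # private answer key, never sent to evaluators
for i, row in test_df.iterrows():
    for ref_name, _ in REFS.items():
        # Baseline (Weekend 1) and fine-tuned (Weekend 3) get randomized labels for blind eval
        labels = ['A', 'B']
        random.shuffle(labels)
        baseline_label, ft_label = labels[0], labels[1]
        key_rows.append({"id": row['id'], "ref": ref_name, "ft_label": ft_label})

        shutil.copy(
            f"/kaggle/input/baseline-outputs/{row['id']}_{ref_name}.wav",
            EVAL / f"{row['id']}_{ref_name}_{baseline_label}.wav"
        )
        shutil.copy(
            OUT_FT / f"{row['id']}_{ref_name}_ft.wav",
            EVAL / f"{row['id']}_{ref_name}_{ft_label}.wav"
        )

# Save the answer key OUTSIDE the folder you send to evaluators
pd.DataFrame(key_rows).to_csv("/kaggle/working/ab_key.csv", index=False)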

# Build blind A/B form
records = [{"id": row['id'], "ref": rn, "category": row['category'],
            "sentence": row['sentence'],
            "preferred_AB": "", "much_better_yn": ""}
           for _, row in test_df.iterrows() for rn in REFS]
pd.DataFrame(records).to_csv(EVAL / "ab_form.csv", index=False)

Sunday evening: collect blind A/B preferences

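# Send same 3 evaluators the comparison_eval package
# They listen to A vs B for each sentence, pick the better one
# DO NOT tell them which is baseline vs fine-tuned: blind comparison

# Aggregate results.
# Assumes each evaluator's filled-in ab_form.csv was collected into one folder
# (eval_returns/ is a placeholder name) and that the A/B answer key was saved
# as ab_key.csv with columns id, ref, ft_label
from pathlib import Path
import pandas as pd
eval_csvs = sorted(Path("/kaggle/working/eval_returns").glob("*.csv"))
ev = pd.concat([pd.read_csv(f) for f in eval_csvs])
key = pd.read_csv("/kaggle/working/ab_key.csv")

# Decode A/B back to baseline/finetune using the saved mapping
ev = ev.merge(key, on=["id", "ref"])
choice = ev['preferred_AB'].astype(str).str.strip().str.upper()
ev['winner'] = 'tie'                   # blank or anything other than A/B counts as no preference
ev.loc[choice.isin(['A', 'B']) & (choice == ev['ft_label']), 'winner'] = 'finetune'
ev.loc[choice.isin(['A', 'B']) & (choice != ev['ft_label']), 'winner'] = 'baseline'
preference = ev['winner'].value_counts(normalize=True)

print("Fine-tune preferred:", preference.get('finetune', 0) * 100, "%")
print("Baseline preferred:", preference.get('baseline', 0) * 100, "%")
print("No preference:", preference.get('tie', 0) * 100, "%")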

# Decision rule:
# Fine-tune preferred > 60%  → fine-tuning works, scale up
# 50-60%                     → marginal improvement, need more data
# < 50%                      → fine-tuning hurt, debug pipeline
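A raw percentage near a threshold can be misleading on a small sample, so it is worth putting an interval around the fine-tune preference share before applying the decision rule. The sketch below uses the standard Wilson score interval for a binomial proportion. It treats every judgment as independent, which is an approximation here because the same three evaluators rate every sentence, and it assumes ties are dropped before counting.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Example: fine-tune preferred in 60 of 100 decided (non-tie) judgments
low, high = wilson_interval(60, 100)
print(f"fine-tune share 60.0%, interval {low:.1%} to {high:.1%}")
```

If the lower bound sits above 50%, the preference for the fine-tune is unlikely to be noise. If the interval straddles 50% or 60%, treat the result as belonging to the weaker of the two adjacent decision bands.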

Make it real, get external feedback

Public Demo + Strategic Decision

▸ NOTE

By end of Weekend 4: a public demo on Hugging Face Spaces that anyone can use. Real native Odia speakers from outside your circle test it. You collect honest feedback. You decide whether to pursue real funding or pivot.

6 hours

Saturday: Build the Demo

4.1

Create Hugging Face Space (free hosting) · 2 hrs

# Create new Space at huggingface.co/new-space
# - Owner: your username
# - Space name: odia-tts-demo
# - SDK: Gradio
# - Hardware: CPU basic (free) — add GPU later if traffic warrants
# - Visibility: Public

# In the Space repo, create app.py:

import gradio as gr
import torch
from transformers import AutoModel
from peft import PeftModel
import soundfile as sf
import numpy as np
import tempfile

# Load base IndicF5
print("Loading IndicF5...")
base_model = AutoModel.from_pretrained("ai4bharat/IndicF5", trust_remote_code=True)

# Load YOUR LoRA adapter (uploaded as separate HF model)
# After Weekend 3, upload the lora_state_dict to:
# huggingface.co/YOUR_USERNAME/odia-tts-lora-v1
# Then reference it here:
# (Note: actual LoRA loading depends on how IndicF5 was wrapped; adjust as needed)
ft_model = None  # set this to the adapter-wrapped model once loading works

REFERENCE_VOICES = {
    "OTV Anchor": ("examples/ref_otv.wav", "OTV reference transcript here"),
    "DD Odia":    ("examples/ref_dd.wav",  "DD Odia reference transcript"),
    "Narrative":  ("examples/ref_pod.wav", "Podcast narrator transcript"),
}

def synthesize(text, voice_choice, use_finetune):
    ref_path, ref_text = REFERENCE_VOICES[voice_choice]

    # Pick model; fall back to the base model while ft_model is still unset
    model = ft_model if (use_finetune and ft_model is not None) else base_model

    audio = model(text, ref_audio_path=ref_path, ref_text=ref_text)
    if audio.dtype == np.int16:
        audio = audio.astype(np.float32) / 32768.0

    out = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
    sf.write(out.name, audio, 24000)
    return out.name

with gr.Blocks(title="Sovereign Odia TTS Demo") as demo:
    gr.Markdown("# 🎙 ସ୍ଵାଧୀନ ଓଡ଼ିଆ ସ୍ୱର — Sovereign Odia TTS")
    gr.Markdown("Compare base IndicF5 vs our fine-tuned version on Odia text.")

    with gr.Row():
        text = gr.Textbox(label="Odia Text", placeholder="ଆଜି ଭୁବନେଶ୍ୱରରେ...", lines=3)
    with gr.Row():
        voice = gr.Dropdown(choices=list(REFERENCE_VOICES.keys()), value="OTV Anchor", label="Reference Voice")
        ft = gr.Checkbox(label="Use fine-tuned model", value=True)

    btn = gr.Button("Synthesize", variant="primary")
    audio_out = gr.Audio(label="Output", type="filepath")

    btn.click(synthesize, inputs=[text, voice, ft], outputs=audio_out)

    gr.Markdown("---")
    gr.Markdown("Built by [you] using IndicF5 (AI4Bharat). MIT license.")
    gr.Markdown("Feedback: [your-email] · GitHub: [link]")

demo.launch()

4.2

Upload LoRA adapter to Hugging Face · 1 hr

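# Push your fine-tuned LoRA weights to HF for the Space to use

import torch
from huggingface_hub import HfApi, create_repo

create_repo("YOUR_USERNAME/odia-tts-lora-v1", private=False, exist_ok=True)

# latest.pt also holds optimizer state; publish only the LoRA tensors
ckpt = torch.load("/kaggle/working/lora_checkpoints/latest.pt", map_location="cpu")
torch.save(ckpt["lora_state_dict"], "/kaggle/working/lora_state_dict.pt")

api = HfApi()
api.upload_file(
    path_or_fileobj="/kaggle/working/lora_state_dict.pt",
    # This is a plain state dict, not a PEFT save_pretrained() folder, so avoid
    # the adapter_model.bin name that PEFT loaders look for
    path_in_repo="lora_state_dict.pt",
    repo_id="YOUR_USERNAME/odia-tts-lora-v1",
)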

# Add a model card describing what this is
README = '''
# Sovereign Odia TTS — LoRA adapter v1

LoRA fine-tune of [ai4bharat/IndicF5](https://huggingface.co/ai4bharat/IndicF5)
on 5 hours of clean Odia speech (DD Odia + OTV news).

- Base model: IndicF5 (400M params)
- LoRA rank: 16, alpha: 32
- Training: 5000 steps, ~6 hours on T4
- License: MIT (matches base)

This is an early prototype. Quality is being evaluated.

## Usage
[code snippet here]

## Evaluation
[link to MOS results]
'''
with open("/tmp/README.md", "w") as f:
    f.write(README)
api.upload_file(path_or_fileobj="/tmp/README.md", path_in_repo="README.md",
                repo_id="YOUR_USERNAME/odia-tts-lora-v1")

4.3

Add example sentences and visual polish · 2 hrs

# Add to app.py — examples that showcase the model

EXAMPLES = [
    ["ଆଜି ଭୁବନେଶ୍ୱରରେ ବର୍ଷା ହେଉଛି ।", "OTV Anchor", True],
    ["ତୁ ସକାଳେ କଣ ଖାଇଛୁ?", "Narrative", True],
    ["ସରକାର ନୂଆ ନୀତି ଘୋଷଣା କରିଛନ୍ତି।", "DD Odia", True],
    ["୨୦୨୪ ମସିହାରେ ଭାରତ ଚନ୍ଦ୍ରଯାନ ୩ ଉତ୍ସର୍ଗ କରିଥିଲା।", "DD Odia", True],
]

# Inside the existing `with gr.Blocks(...) as demo:` block in app.py,
# after the btn.click(...) line, add:
    gr.Examples(examples=EXAMPLES, inputs=[text, voice, ft])

# Make the Space discoverable:
# - Add tags: "odia", "indic", "tts", "sovereign-ai"
# - Add proper README to the Space
# - Submit to https://huggingface.co/spaces/lhoestq/awesome-spaces lists

6 hours

Sunday: Distribute & Collect Real Feedback

4.4

Share with Odia community · 2 hrs

# Post in these communities — be specific, ask for feedback:

# r/Odia subreddit (~5k members)
"""
Built a sovereign Odia TTS prototype, free demo: [HF Space link]
Compares fine-tuned IndicF5 vs base. Feedback specifically wanted on:
- Does it sound like Odia or Hindi-influenced?
- Mispronunciation of common words?
- Voice quality on news vs conversational text?
"""

# Twitter/X — tag @ai4bharat, @indianai, Odia influencers
# LinkedIn — your network, mention Bhashini Mission
# Telegram Odia tech groups — search for active ones
# Direct outreach: 5-10 Odia journalists, Odia Wikipedia editors

# What you're optimizing for:
# - 50+ unique users try the demo
# - 10+ leave structured feedback
# - 1-2 connections that lead to funding conversations

4.5

Track usage and feedback · ongoing

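# Add basic usage logging and a feedback form to your Space

import gradio as gr
import json
import os
from datetime import datetime, timezone

# /data exists only if the Space has persistent storage attached (a paid add-on).
# Otherwise fall back to the working directory, which resets on every restart.
LOG_DIR = "/data" if os.path.isdir("/data") else "."

def synthesize_with_logging(text, voice, ft):
    # Log usage
    log = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "text_length": len(text),
        "voice": voice,
        "finetuned": ft,
    }
    with open(os.path.join(LOG_DIR, "usage.jsonl"), "a") as f:
        f.write(json.dumps(log) + "\n")

    return synthesize(text, voice, ft)

def save_feedback(rating, detail_text):
    # One JSON line per click; ensure_ascii=False keeps Odia comments readable
    entry = {"ts": datetime.now(timezone.utc).isoformat(),
             "rating": rating, "detail": detail_text}
    with open(os.path.join(LOG_DIR, "feedback.jsonl"), "a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")

# Inside the gr.Blocks context, point the Synthesize button at the wrapper:
#   btn.click(synthesize_with_logging, inputs=[text, voice, ft], outputs=audio_out)

# Add feedback buttons (also inside the gr.Blocks context)
with gr.Row():
    good_btn = gr.Button("👍 Good")
    bad_btn  = gr.Button("👎 Bad")
    detail   = gr.Textbox(label="What was wrong? (optional)")
    submit   = gr.Button("Submit feedback")
good_btn.click(lambda d: save_feedback("good", d), inputs=detail)
bad_btn.click(lambda d: save_feedback("bad", d), inputs=detail)
submit.click(lambda d: save_feedback("comment", d), inputs=detail)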
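To check progress against the "50+ users, 10+ structured responses" targets, the usage log needs to be summarised somehow. The function below is a minimal sketch that reads JSONL records with the fields logged above (voice, finetuned). The name summarize_usage is hypothetical. Note that the log as written has no user identifier, so this counts requests, not unique users.

```python
import json
from collections import Counter


def summarize_usage(lines):
    """Summarise JSONL usage records that carry 'voice' and 'finetuned' fields."""
    total = 0
    finetuned = 0
    by_voice = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # tolerate blank lines in the log
        record = json.loads(line)
        total += 1
        by_voice[record["voice"]] += 1
        finetuned += bool(record["finetuned"])
    return {
        "total": total,
        "by_voice": dict(by_voice),
        "finetuned_share": finetuned / total if total else 0.0,
    }
```

Typical use is to open the log file and pass the file object directly, for example summarize_usage(open("usage.jsonl", encoding="utf-8")), since iterating a file yields one line at a time.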

4.6

The strategic decision meeting (with yourself) · 2 hrs

End of Weekend 4: the honest assessment

Sunday evening, sit with the data and decide. Three possible outcomes:

OUTCOME A: "It works — fine-tune meaningfully better than base"
  Evidence:  Blind A/B prefers fine-tune > 60%
             MOS bump >= 0.4 from baseline
             Native speakers say "this sounds Odia"
             Demo gets 100+ uses, several positive comments
  Decision:  Pursue real funding (Bhashini Mission, AI4Bharat grants,
             AWS Activate, NVIDIA Inception, MeitY).
             Use the prototype as your pitch.
  Timeline:  Funding takes 3-6 months. Use that time to build studio
             voice, expand training data to 50+ hrs, refine pipeline.

OUTCOME B: "Marginal — small improvement, not transformative"
  Evidence:  Fine-tune slight preference (50-60%)
             MOS bump 0.1-0.3
             Some categories better, some worse
             Demo gets some interest, no clear product-market fit
  Decision:  Two paths:
             (1) Iterate — collect 20+ hrs better data, retrain
                 (2 more weekends, see if quality jumps)
             (2) Pivot — focus on PRODUCT layer instead of model
                 (Build Odia voice-over service using base IndicF5)

OUTCOME C: "It doesn't work — fine-tune is no better than base, or is worse"
  Evidence:  No clear A/B preference
             Audio quality issues (artifacts, prosody breaks)
             Native speakers reject both versions
  Decision:  Don't throw good money after bad. Three options:
             (1) Pivot to using base IndicF5 as-is, focus on product
             (2) Try Indic Parler-TTS as alternative base
             (3) Decide TTS isn't the right product, use 4 weekends as
                 valuable learning, redirect to a different problem.

The actual deliverables

What you have at the end of 4 weekends

Concrete artifacts

Diagnostic test set

200 sentences × 3 voices, native-validated

Reusable benchmark for all future work

Baseline metrics

MOS, intelligibility, accent correctness

Quantified IndicF5 capability on Odia

Training dataset v1

~5 hours clean Odia, transcribed

Reusable for any future TTS work

LoRA adapter weights

50 MB file on Hugging Face

MIT licensed, public

Public demo

Hugging Face Space, free, accessible

Anyone can try it. Includes A/B comparison.

Real user feedback

50+ users, 10+ structured responses

Validates or invalidates the entire thesis

A/B preference data

Blind comparison results

Statistically meaningful sample, not anecdotes

Funding-ready pitch

Demo + data + clear next steps

If Outcome A, you can apply to grants tomorrow

Total cost

₹0 in cloud + ₹3000 evaluator fees

Plus ~50 hours of your time across 4 weekends

✓ TIP

This is what veteran engineers mean by "build the smallest thing that gives you real information." Four weekends, zero cloud spend, you go from "we should build Odia TTS" to one of three clear strategic positions backed by data. That clarity is worth more than any infrastructure.

Common Failures + Fixes

Things That Will Go Wrong

⚠ WATCH OUT

These aren't hypothetical — they're the issues that hit every team building this kind of pipeline. Knowing about them in advance saves hours.

PROBLEM 1

OOM error during inference (Weekend 1)

Cause: IndicF5 + reference audio + intermediate tensors > 15 GB on T4

# 1. Reduce reference audio length to 5s max
# 2. Add explicit cache clearing every 50 samples
torch.cuda.empty_cache()

# 3. If still OOM, switch to FP16:
model = model.half()
# IndicF5 may have FP16 issues — test with single sample first

PROBLEM 2

Whisper transcribes Odia as Hindi or Bengali

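Cause: Auto language detection picks a neighbouring language. Odia is barely represented in stock Whisper checkpoints, and their language list may not include an Odia code at all

# Force Odia explicitly, if your checkpoint supports it:
segments, info = whisper.transcribe(
    audio_path,
    language="or",         # NOT 'hi' or auto; fails with an unsupported-language error if the checkpoint has no Odia token
    initial_prompt="ଓଡ଼ିଆ ଭାଷାରେ ସମ୍ବାଦ",  # Odia hint helps
)
# If "or" is rejected, switch to a Whisper checkpoint fine-tuned for Odia
# rather than forcing 'hi' or 'bn', which yields text in the wrong script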

PROBLEM 3

Kaggle session dies, lost training progress

Cause: Idle timeout (20 min) or 12 hr limit hit

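# 1. Check checkpoint frequency: every 200 steps minimum
# 2. Add a keep-alive cell that prints every 5 min:
import time
while True:
    print(f"Active at step {step}")
    time.sleep(300)
# (This loop blocks the kernel, so it cannot run alongside training in the same
# notebook. During training, the loop's own log lines are your output. Only
# useful for evaluation phases)

# 3. Always check checkpoint before assuming work is lost
#    (the leading "!" runs a shell command from a notebook cell):
!ls /kaggle/working/lora_checkpoints/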

PROBLEM 4

LoRA target_modules error: 'no modules named X found'

Cause: F5-TTS attention layer naming differs from your config

# Print actual module names (Linear layers are the usual LoRA targets):
import torch
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        print(name)

# Names this plan expects in F5-TTS (verify against the printout above):
# - attn.to_q, attn.to_kv, attn.to_out
# - ff.0, ff.2 (FFN sublayers)
# Update LoraConfig target_modules accordingly

PROBLEM 5

Training loss is NaN after a few hundred steps

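Cause: Learning rate too high OR reduced-precision overflow/underflow OR data preprocessing bug

# 1. Lower LR first:
optimizer = torch.optim.AdamW(..., lr=5e-5)  # was 1e-4

# 2. If you switched the model to FP16, go back to FP32 first.
#    BF16 is more stable than FP16, but Kaggle's T4 and P100 have no native
#    BF16 hardware support, so expect it to be slow or unsupported there:
model = model.bfloat16()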

# 3. Check data: print 5 random batches, verify mel shapes are sane:
for i, batch in enumerate(loader):
    if i >= 5: break
    print(batch['mel'].shape, batch['mel'].mean(), batch['mel'].std())

PROBLEM 6

Fine-tuned output sounds worse than baseline

Cause: Either: (a) too few training steps, (b) bad training data, (c) catastrophic forgetting

# Diagnostic checks:

# 1. Did loss decrease meaningfully?
# Should drop from ~5.0 to ~2.5 over 5000 steps
# If stuck at 4.0+ → LoRA isn't learning, check target_modules

# 2. Run inference on TRAINING set
# If model can't reproduce training data → undertrained, train more
# If it can but test fails → overfitting, reduce steps or add data

# 3. A/B compare on multiple categories
# If failing only on rare categories → need more diverse training data
# If failing across the board → something fundamentally wrong with FT
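Diagnostic 3 needs the A/B results split by sentence category. A minimal sketch of that breakdown follows. It assumes each judgment has already been decoded into a row with a 'category' field (from the test set) and a 'winner' field holding 'finetune', 'baseline', or 'tie'. Both the 'winner' field and the function name are illustrative assumptions.

```python
from collections import defaultdict


def preference_by_category(rows):
    """Fine-tune win share per category, ignoring ties.

    rows: iterable of dicts with 'category' and 'winner' keys, where
    winner is one of 'finetune', 'baseline', 'tie'.
    Returns {category: share or None when every judgment was a tie}.
    """
    counts = defaultdict(lambda: {"finetune": 0, "baseline": 0, "tie": 0})
    for row in rows:
        counts[row["category"]][row["winner"]] += 1

    shares = {}
    for category, c in counts.items():
        decided = c["finetune"] + c["baseline"]
        shares[category] = c["finetune"] / decided if decided else None
    return shares
```

With a pandas DataFrame of decoded results, ev.to_dict("records") produces rows in the expected shape. Categories whose share sits well below the overall figure are the ones that need more training data.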

PROBLEM 7

yt-dlp gets blocked by YouTube

Cause: Too many requests, IP rate-limited

# Slow down, use cookies:
yt-dlp --cookies-from-browser firefox \
       --sleep-interval 10 \
       --max-sleep-interval 30 \
       --rate-limit 1M \
       URL

# Or: spread downloads across days, smaller batches

PROBLEM 8

Hugging Face Space crashes or runs out of memory

Cause: Free CPU tier has 16 GB RAM total, IndicF5 alone is ~3 GB

# Options:
# 1. Don't load both base and FT versions in memory
#    Load on-demand, unload after use
#
# 2. Upgrade to ZeroGPU (included with HF Pro, $9/mo)
#    Provides on-demand GPU access for short bursts
#
# 3. Replace inference with cached pre-generated samples
#    Pre-generate 100 example outputs, serve from disk
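Option 3 can be sketched as a lookup keyed on the request. The code below is one possible layout, not something the plan prescribes. The cached_outputs folder, the 16-character key length, and both function names are hypothetical. The idea is to hash the text, voice, and model choice into a filename, pre-generate those files on Kaggle, and have the Space serve a file when one exists.

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path("cached_outputs")       # hypothetical folder committed to the Space repo


def cache_key(text, voice, use_finetune):
    """Stable short key for a (text, voice, model) request."""
    variant = "ft" if use_finetune else "base"
    payload = "\x1f".join([text.strip(), voice, variant])   # unit separator avoids collisions
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def cached_path(text, voice, use_finetune, cache_dir=CACHE_DIR):
    """Return the pre-generated WAV path for this request, or None if absent."""
    path = Path(cache_dir) / f"{cache_key(text, voice, use_finetune)}.wav"
    return path if path.exists() else None
```

In synthesize(), calling cached_path(...) first and returning str(path) on a hit means the model only has to run for text outside the pre-generated examples. Use the same cache_key() when naming the files during pre-generation so the lookups line up.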

PROBLEM 9

Native speakers say 'sounds wrong' but can't articulate why

Cause: Prosody issues — model getting words right but rhythm/intonation wrong

# This is a HARD problem. Common causes:

# 1. Reference audio prosody bleeds through too much
#    Try: shorter reference (4-5s vs 10s)
#    Try: different reference voices, see if pattern is consistent

# 2. Odia phrase boundaries different from IndicF5's training distribution
#    Won't be fixed with current data — needs more Odia-specific FT data

# 3. Pitch contour issues at sentence ends
#    Diagnostic: look at F0 of generated audio, compare to ground truth
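#    librosa.yin(audio, fmin=65, fmax=400, sr=24000) for pitch extraction
#    (fmin and fmax are required arguments; 65-400 Hz is a typical speech range)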

PROBLEM 10

License confusion: can I commercially deploy this?

Cause: IndicF5 is MIT, but training data has mixed licenses

# IndicF5 was trained on:
# - Rasa: research use only (CC BY-NC)
# - IndicTTS: research use 
# - LIMMITS: research challenge data
# - IndicVoices-R: CC BY 4.0

# Your fine-tuned LoRA inherits ambiguity.
# For PROTOTYPE/DEMO: fine, MIT covers you
# For COMMERCIAL: need lawyer review, OR retrain only on
#    your own data + clearly-licensed data
# 
# This is a problem worth solving BEFORE pursuing serious funding.

Honest decision rules

When to give up vs when to push through

Push through if...

Loss is decreasing during training (even slowly)
You've made measurable improvement on at least one category in test set
Native speakers say specific things are better (even if overall MOS is mixed)
You haven't yet given it 2 full weekends

Stop and reassess if...

Loss is NaN or oscillating wildly after 1000+ steps
3 evaluators independently say fine-tuned is worse
You've spent 2+ weekends debugging infrastructure, not training
Your test set MOS dropped from baseline — you're moving backward
Demo gets 100+ uses with consistently negative feedback

▸ NOTE

The hardest skill: distinguishing "this approach is wrong" from "this approach needs more time." Most people quit too early on hard problems and persist too long on dead-end ones. Set explicit decision criteria before you start, then trust them when the data comes in.