What You'll Build
The 4-Weekend Plan
▸ NOTE
By the end of weekend 4, you will have: a working IndicF5 demo,
a fine-tuned LoRA adapter that improves Odia output, a blind A/B
comparison demo on Hugging Face Spaces, and concrete data
showing whether sovereign Odia TTS is worth pursuing further.
Total compute cost: ₹0 (the only cash outlay is paying your evaluators in Weekend 1).
Weekend 1 · 12 hrs
Baseline + Test Set
Run IndicF5 on 200 Odia sentences, build diagnostic test set,
get 3 native speakers to listen and rate.
Weekend 2 · 16 hrs
Data Collection + Pipeline
Download 30 hours of OTV / Doordarshan content, run through
VAD + Whisper pipeline, end with ~5 hours clean training data.
Weekend 3 · 12 hrs
LoRA Fine-Tune
Train LoRA adapter on the 5 hours of clean data. Inference
comparison vs baseline on test set.
Weekend 4 · 12 hrs
Public Demo + Iteration
Hugging Face Space with A/B comparison. Share with Odia
community. Decide go/no-go on real funding pursuit.
Reality Check
Kaggle Constraints — Read This Before You Start
What Kaggle gives you (free)
GPU quota
30 hours/week of P100 or T4
Resets Saturday 00:00 UTC every week
VRAM
16 GB (P100) or 16 GB (T4)
P100 is faster for FP32; T4 has Tensor Cores and is better for FP16.
Neither GPU has native BF16 support. Pick T4 for IndicF5.
Session length
Max 12 hours per session
Hard cutoff. Plan training to checkpoint every hour.
Idle timeout
Killed after 20 min of inactivity
Use a keep-alive cell or check progress every 15 min
Persistent storage
20 GB Datasets + Models
Upload your data once as a Kaggle Dataset, reuse across
sessions
RAM
13 GB system RAM, 16 GB GPU VRAM
Tight. Be careful with audio loading — use streaming or
chunks.
Disk
73 GB working directory (/kaggle/working)
Wiped between sessions. Save outputs to /kaggle/working then
download or push to Kaggle Dataset.
✕ TRAP
The session-length trap: If your fine-tune
takes 14 hours, you cannot complete it in one session. You MUST
checkpoint to disk every hour and design training to resume from
checkpoint. We'll handle this properly.
⚠ WATCH OUT
The 30-hour weekly quota trap: A single LoRA
experiment burns ~8 hours. You can run 3-4 experiments per week
max. Plan each experiment carefully — don't waste hours on
debugging that you could have done locally.
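The quota arithmetic above can be sketched in a few lines. This is only an illustration: the function name `experiments_per_week` and the `overhead_hrs` parameter are hypothetical, while the 30-hour quota and ~8-hour experiment figures come from the text.

```python
import math

def experiments_per_week(quota_hrs, hrs_per_experiment, overhead_hrs=0.0):
    """Whole experiments that fit in the weekly GPU quota after fixed overhead."""
    usable = quota_hrs - overhead_hrs
    if usable <= 0 or hrs_per_experiment <= 0:
        return 0
    return math.floor(usable / hrs_per_experiment)

# No debugging overhead on the GPU
print(experiments_per_week(30, 8))
# Seven GPU hours lost to debugging that could have been done locally
print(experiments_per_week(30, 8, overhead_hrs=7))
```

Rounding down matters here: a partially finished run still consumes quota, so every hour of on-GPU debugging can cost a whole experiment.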
Pre-flight Checklist
What You Need Before Weekend 1
Account setup (do this Friday evening)
Kaggle account
Verify phone number — required for GPU access
Hugging Face account
Free. You'll need it to download IndicF5 weights and to
gate-accept its license
Hugging Face token
Settings → Access Tokens → New token (read access). Save it
— you'll add it to Kaggle Secrets
Accept IndicF5 terms
Visit huggingface.co/ai4bharat/IndicF5 and click 'Agree' to
access weights
GitHub account
For backing up notebooks. Optional but recommended.
Identify 3 native Odia speakers
Friends/family who'll listen to 30 audio samples each
weekend. They are your evaluators — non-negotiable.
Goal: Know exactly where IndicF5 fails on Odia
Baseline Evaluation — Establish Truth
▸ NOTE
Saturday: build the test set and run IndicF5 inference. Sunday:
get human evaluation from 3 native speakers. Output: a CSV that
becomes your forever-benchmark for measuring every future
improvement.
3 hours · No GPU needed
Saturday Morning: Test Set Creation
1.1
Build the 200-sentence Odia diagnostic test set · 2 hrs
This is the most important asset you build the entire weekend.
Every future experiment compares against the IndicF5 output on
these sentences. Spend the time to do it right.
# /kaggle/working/test_sentences.csv format:
# id,category,sentence,priority

ID   CATEGORY             SENTENCE                               COUNT
─────────────────────────────────────────────────────────────────────
T01  retroflex_contrast   ଟମାଟୋ ତମ ଘରେ ଅଛି                       30
T02  retroflex_lateral_ɭ  ଜଳ ଓ ଫଳ ସବୁଠାରୁ ଭଲ                     20
T03  conjunct_clusters    ସ୍କୁଲ ସ୍ୱାସ୍ଥ୍ୟ ଦ୍ୱାର                  30
T04  schwa_deletion       ଘର ବଜାର ଯିବାକୁ ସମୟ                     20
T05  yes_no_question      ତୁ ଆସିଛ କି?                            15
T06  wh_question          କେବେ ଆସିବ ତୁ?                          15
T07  exclamation          କେଡେ ସୁନ୍ଦର ଅଛି!                       10
T08  numbers_dates        ଆଜି ୧୫ ଅଗଷ୍ଟ ୨୦୨୪                      20
T09  news_register        ଭାରତ ସରକାର ନୂଆ ନୀତି ଘୋଷଣା             10
T10  conversational       ଆରେ ଭାଇ କୁଆଡେ ଯିବୁ?                    10
T11  long_sentence        (multi-clause sentences > 15 words)    10
T12  names_places         ଭୁବନେଶ୍ୱର, ସମ୍ବଲପୁର, କଟକ               10

# DO NOT use GPT/Claude to generate Odia sentences.
# Source from: Odia Wikipedia, Odia news websites, Odia textbooks.
# Better: Have a native Odia speaker write 50 sentences from each category.
⚠ WATCH OUT
The temptation to ask Claude or GPT to generate Odia
sentences is huge. Don't. LLMs make subtle errors in
low-resource languages — wrong grammar, anglicized
constructions, missing diacritics. Your test set must be
verified by a native Odia speaker before you use it. This
takes 2 hours but saves you from invalidating your entire
benchmark.
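Before handing the file to a native speaker, a mechanical check can catch the errors a script is able to see. The sketch below is an assumption about how such a check might look: `validate` and `has_odia` are hypothetical helper names, the per-category targets are copied from the table above, and the script test only confirms that a sentence contains characters from the Odia Unicode block (U+0B00 to U+0B7F). It cannot judge grammar or naturalness, so it does not replace human verification.

```python
import csv
import io
from collections import Counter

# Target counts per category, copied from the test-set table
TARGET_COUNTS = {
    "retroflex_contrast": 30, "retroflex_lateral_ɭ": 20,
    "conjunct_clusters": 30, "schwa_deletion": 20,
    "yes_no_question": 15, "wh_question": 15, "exclamation": 10,
    "numbers_dates": 20, "news_register": 10, "conversational": 10,
    "long_sentence": 10, "names_places": 10,
}

def has_odia(text):
    """True if the text has at least one character in the Odia Unicode block."""
    return any("\u0b00" <= ch <= "\u0b7f" for ch in text)

def validate(csv_text):
    """Return a list of human-readable problems; an empty list means it passed."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    problems = []
    id_counts = Counter(r["id"] for r in rows)
    dupes = sorted(i for i, n in id_counts.items() if n > 1)
    if dupes:
        problems.append(f"duplicate ids: {dupes}")
    for r in rows:
        if not has_odia(r["sentence"]):
            problems.append(f"{r['id']}: no Odia characters")
    cat_counts = Counter(r["category"] for r in rows)
    for cat, want in TARGET_COUNTS.items():
        have = cat_counts.get(cat, 0)
        if have != want:
            problems.append(f"{cat}: have {have}, want {want}")
    return problems
```

Sentences that contain commas (the place-name category, for example) need to be quoted in the CSV, otherwise the reader will split them across columns.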
1.2
Identify 3 reference voice samples · 1 hr
IndicF5 is reference-conditioned — you give it a 5-10 second
audio clip and it synthesizes in that voice. You need 3
different reference voices to test how voice choice affects
quality.
# Where to get reference voices (zero cost, today):
# Option A: YouTube Odia content (download with yt-dlp on your laptop)
# • OTV news anchor — clean, formal, professional
# • Odia podcast voice — conversational
# • Doordarshan archive voice — broadcast standard
# Use yt-dlp on laptop, NOT Kaggle:
yt-dlp -x --audio-format wav --audio-quality 0 \
--postprocessor-args "-ar 24000 -ac 1" \
"https://youtube.com/watch?v=XXX"
# Then trim to a clean 6-10 second segment with no music:
ffmpeg -i input.wav -ss 00:00:30 -t 7 -ar 24000 -ac 1 ref_otv.wav
# What makes a GOOD reference clip:
# ✓ 6-10 seconds (longer = better voice fingerprint)
# ✓ Clear speech, no background music
# ✓ Single speaker, no overlap
# ✓ Natural prosody (not over-emphasized)
# ✓ Matches the gender/style you want output to be
# What makes a BAD reference clip:
# ✗ Too short (< 4s) — voice characteristics not captured
# ✗ Background music — IndicF5 will copy the music too
# ✗ Multiple speakers — model gets confused
# ✗ Heavy emotion — biases all output to that emotion
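The measurable parts of the checklist above (length, sample rate, mono) can be verified with the standard library before uploading. This is a minimal sketch: `check_reference_clip` is a hypothetical helper, the default thresholds mirror the 6-10 second and 24 kHz mono figures used in the commands above, and it assumes uncompressed PCM WAV files such as the ffmpeg output. Background music, speaker overlap, and emotion still need a human ear.

```python
import wave

def check_reference_clip(path, want_sr=24000, min_s=6.0, max_s=10.0):
    """Return a list of problems with a reference WAV; empty means it passed."""
    with wave.open(str(path), "rb") as w:
        sr = w.getframerate()
        channels = w.getnchannels()
        frames = w.getnframes()
    duration = frames / sr
    problems = []
    if sr != want_sr:
        problems.append(f"sample rate is {sr} Hz, expected {want_sr} Hz")
    if channels != 1:
        problems.append(f"{channels} channels, expected mono")
    if not (min_s <= duration <= max_s):
        problems.append(f"duration {duration:.1f}s outside {min_s}-{max_s}s")
    return problems
```

Calling `check_reference_clip("ref_otv.wav")` on each of the three clips and fixing anything it reports is cheaper than discovering a 16 kHz or stereo reference after 600 inference calls.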
6 hours · GPU notebook
Saturday Afternoon: Kaggle Setup + IndicF5 Inference
1.3
Create the Kaggle notebook · 30 min
# Kaggle.com → New Notebook → Settings:
# Accelerator: GPU T4 x2 (or P100 if T4 unavailable)
# Internet: ON (required to download IndicF5)
# Persistence: Files only
# In Notebook → Add-ons → Secrets:
# Add: HF_TOKEN = your_hugging_face_token
# Cell 1 — Setup (run once per session)
import os
from kaggle_secrets import UserSecretsClient
os.environ['HF_TOKEN'] = UserSecretsClient().get_secret("HF_TOKEN")
!pip install -q git+https://github.com/ai4bharat/IndicF5.git 2>&1 | tail -5
!pip install -q soundfile librosa pandas 2>&1 | tail -3
!huggingface-cli login --token $HF_TOKEN
# Verify GPU
import torch
print(f"CUDA: {torch.cuda.is_available()}")
print(f"Device: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
# Expected: Tesla T4, about 15.8 GB (total_memory / 1e9 is decimal GB; roughly 14.7 GiB)
1.4
Upload test data to Kaggle Datasets · 30 min
Don't upload audio files inside notebooks. Create a Kaggle
Dataset, upload once, mount it into all future notebooks. This
survives session resets.
# Locally on your laptop:
mkdir odia-tts-baseline-data
cp test_sentences.csv odia-tts-baseline-data/
cp ref_otv.wav ref_dd.wav ref_podcast.wav odia-tts-baseline-data/
echo "OTV news anchor reference" > odia-tts-baseline-data/ref_otv_text.txt  # transcript of ref clip
# (write actual Odia transcript of each reference clip; the inference cell
#  also reads ref_dd_text.txt and ref_podcast_text.txt, so create all three)

# Upload to Kaggle:
#   kaggle.com/datasets → New Dataset → Upload folder
#   Title: "odia-tts-baseline-data"
#   License: CC0 if you own it, else "Other"
#   Visibility: Private (it has your unreleased work)

# In Kaggle notebook → Add Data → search your dataset → Add
# It mounts at: /kaggle/input/odia-tts-baseline-data/
1.5
Load IndicF5 and run baseline inference · 3 hrs
# Cell 2 — Load IndicF5
from transformers import AutoModel
import numpy as np
import soundfile as sf
import pandas as pd
from pathlib import Path
import time
print("Loading IndicF5 (400M params, ~3 GB download)...")
t0 = time.time()
model = AutoModel.from_pretrained("ai4bharat/IndicF5", trust_remote_code=True)
model = model.to("cuda")
model.eval()
print(f"Loaded in {time.time()-t0:.1f}s")
# Cell 3 — Setup paths
DATA = Path("/kaggle/input/odia-tts-baseline-data")
OUT = Path("/kaggle/working/baseline_outputs")
OUT.mkdir(exist_ok=True)
# Load test sentences
df = pd.read_csv(DATA / "test_sentences.csv")
print(f"Test sentences: {len(df)}")
# Load reference voices
REFS = {
    # read_text closes the file and decodes the Odia transcript as UTF-8
    "otv": (DATA / "ref_otv.wav", (DATA / "ref_otv_text.txt").read_text(encoding="utf-8").strip()),
    "dd": (DATA / "ref_dd.wav", (DATA / "ref_dd_text.txt").read_text(encoding="utf-8").strip()),
    "podcast": (DATA / "ref_podcast.wav", (DATA / "ref_podcast_text.txt").read_text(encoding="utf-8").strip()),
}
# Cell 4 — Run inference (THE CRITICAL CELL)
results = []
errors = []
for i, row in df.iterrows():
    for ref_name, (ref_path, ref_text) in REFS.items():
        out_path = OUT / f"{row['id']}_{ref_name}.wav"
        # Skip if already done (in case session resumes). Keep category and
        # sentence so the evaluation form can still be built from `results`.
        if out_path.exists():
            results.append({
                "id": row['id'], "category": row['category'],
                "sentence": row['sentence'], "ref": ref_name,
                "ok": True, "path": str(out_path),
            })
            continue
        try:
            t0 = time.time()
            audio = model(
                row['sentence'],
                ref_audio_path=str(ref_path),
                ref_text=ref_text,
            )
            elapsed = time.time() - t0
            # Normalize and save
            if audio.dtype == np.int16:
                audio = audio.astype(np.float32) / 32768.0
            sf.write(out_path, np.array(audio, dtype=np.float32), samplerate=24000)
            results.append({
                "id": row['id'], "category": row['category'],
                "sentence": row['sentence'], "ref": ref_name,
                "duration_s": len(audio) / 24000,
                "synth_time_s": elapsed,
                "rtf": elapsed / (len(audio) / 24000),
                "ok": True, "path": str(out_path),
            })
            print(f"[{i:3d}/{len(df)}] {row['id']}_{ref_name} · {elapsed:.1f}s")
        except Exception as e:
            errors.append({"id": row['id'], "ref": ref_name, "error": str(e)})
            print(f"[ERROR] {row['id']}_{ref_name}: {e}")
# Cell 5 — Save results manifest
results_df = pd.DataFrame(results)
results_df.to_csv("/kaggle/working/baseline_results.csv", index=False)
pd.DataFrame(errors).to_csv("/kaggle/working/baseline_errors.csv", index=False)
print(f"\nSuccess: {len(results)}/{len(df)*3}")
print(f"Mean synthesis time: {results_df['synth_time_s'].mean():.2f}s")
print(f"Mean RTF: {results_df['rtf'].mean():.3f}") # Real-Time Factor
✓ TIP
At 200 sentences × 3 reference voices = 600 inference calls.
Each takes ~5-10s on T4. Total ≈ 60-90 minutes of GPU time.
Comfortably fits in one session.
⚠ WATCH OUT
Memory leak warning: If you hit OOM after
~300 inferences, add
torch.cuda.empty_cache()
every 50 iterations. IndicF5's flow-matching decoder
accumulates intermediate tensors.
1.6
Package outputs for human evaluation · 2 hrs
# Cell 6 — Package for evaluation
# Create a zip of all audio + a CSV that evaluators will fill in
import shutil
EVAL_DIR = Path("/kaggle/working/eval_package")
EVAL_DIR.mkdir(exist_ok=True)
# Copy audio
for r in results:
    shutil.copy(r['path'], EVAL_DIR / Path(r['path']).name)
# Build evaluator CSV
eval_df = pd.DataFrame(results)[['id', 'category', 'sentence', 'ref']]
eval_df['filename'] = eval_df['id'] + '_' + eval_df['ref'] + '.wav'
eval_df['naturalness_1to5'] = '' # MOS
eval_df['intelligibility_1to5'] = '' # Can you understand it?
eval_df['accent_correct_yn'] = '' # Does it sound Odia (not Hindi-influenced)?
eval_df['mispronunciations'] = '' # List specific words pronounced wrong
eval_df['overall_pass'] = '' # 'pass' or 'fail' for production use
eval_df.to_csv(EVAL_DIR / "evaluation_form.csv", index=False)
# Build instructions for evaluator
instructions = '''
Odia TTS Baseline Evaluation
============================
You'll listen to ~200 audio clips of synthesized Odia speech (one reference voice).
For each clip assigned to you, fill in its row in evaluation_form.csv:
1. naturalness_1to5: How natural does it sound? (1=robotic, 5=human)
2. intelligibility_1to5: How clearly can you understand? (1=unclear, 5=perfectly clear)
3. accent_correct_yn: Does it sound like Odia, not Hindi? (y/n)
4. mispronunciations: List any words you noticed mispronounced
5. overall_pass: Would this be acceptable in production? (pass/fail)
Listen on headphones in a quiet room.
Estimated time: 3-4 hours total. Spread across multiple sessions.
'''
(EVAL_DIR / "INSTRUCTIONS.txt").write_text(instructions)
# Zip everything
shutil.make_archive("/kaggle/working/odia_baseline_eval", 'zip', EVAL_DIR)
print(f"Package size: {Path('/kaggle/working/odia_baseline_eval.zip').stat().st_size / 1e6:.0f} MB")
# Download from Kaggle's Output panel — should be ~150-200 MB
Critical phase — distribute the work
Sunday: Human Evaluation
Send the eval package to your 3 native speakers
Each evaluator gets
200 sentences × 1 reference voice = 200 clips
Split the 3 ref voices across 3 evaluators — they each rate
one voice condition
Time per evaluator
3-4 hours of focused listening
Strongly suggest split across 2 days
Required
Headphones, quiet room, sober
Not exaggerating — drunken evaluation is real and produces
invalid data
Pay them
₹500-1000 per evaluator
Even friends. Their time is worth money. This is the ONLY
money you spend this weekend.
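Since each evaluator rates a single voice condition, it helps to hand each one a form that contains only their 200 rows. The following is a sketch under stated assumptions: `split_form_by_ref` and the output filename pattern `evaluation_form_<ref>.csv` are hypothetical, and the only column it relies on is the `ref` column that the packaging cell already writes.

```python
import csv
from collections import defaultdict
from pathlib import Path

def split_form_by_ref(form_path, out_dir):
    """Write one evaluation CSV per reference voice; return {ref: row_count}."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    by_ref = defaultdict(list)
    with open(form_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        for row in reader:
            by_ref[row["ref"]].append(row)
    for ref, rows in by_ref.items():
        out_path = out_dir / f"evaluation_form_{ref}.csv"
        with open(out_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
    return {ref: len(rows) for ref, rows in by_ref.items()}
```

One consequence of this split is worth keeping in mind when reading the results: differences between voices are mixed with differences between raters, because no two evaluators hear the same clips.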
1.7
Aggregate results into baseline metrics · 2 hrs
# Sunday evening: collect 3 filled CSVs from evaluators
# Upload them to Kaggle as new dataset version
# Cell 7 — Analyze baseline
import pandas as pd
import numpy as np
ev1 = pd.read_csv("/kaggle/input/.../eval_v1.csv")
ev2 = pd.read_csv("/kaggle/input/.../eval_v2.csv")
ev3 = pd.read_csv("/kaggle/input/.../eval_v3.csv")
all_evals = pd.concat([ev1, ev2, ev3])
# Overall MOS by category
print("=== BASELINE MOS BY CATEGORY ===")
print(all_evals.groupby('category')['naturalness_1to5'].agg(['mean', 'std', 'count']))
# Per-reference comparison
print("\n=== MOS BY REFERENCE VOICE ===")
print(all_evals.groupby('ref')['naturalness_1to5'].agg(['mean', 'std']))
# Intelligibility breakdown
print("\n=== INTELLIGIBILITY ===")
print(f"Mean: {all_evals['intelligibility_1to5'].mean():.2f}")
print(f"% scoring >= 4: {(all_evals['intelligibility_1to5'] >= 4).mean() * 100:.1f}%")
# Accent correctness
print("\n=== ACCENT CORRECTNESS ===")
print(f"% rated as 'sounds Odia': {(all_evals['accent_correct_yn'] == 'y').mean() * 100:.1f}%")
# Top failure categories
print("\n=== WORST CATEGORIES (need fine-tuning attention) ===")
worst = all_evals.groupby('category')['naturalness_1to5'].mean().sort_values()
print(worst.head(5))
# Common mispronunciations (will guide G2P fixes)
print("\n=== TOP MISPRONUNCIATIONS ===")
mispron = all_evals['mispronunciations'].dropna().str.split(',').explode().str.strip()
print(mispron.value_counts().head(20))
Decision gate at end of Weekend 1
# Three possible outcomes. Your strategy depends on which one you got:
OUTCOME A: Mean MOS >= 3.8, intelligibility > 85%, accent correct > 80%
→ IndicF5 is already very good for Odia
→ Weekend 2-3: Skip heavy fine-tuning. Build the demo. Apply for funding NOW.
→ You may not need to fine-tune at all. The product question (distribution,
UX, packaging) is more important than the model question.
OUTCOME B: Mean MOS 3.0-3.8, some specific failure categories
→ STANDARD PATH. Proceed with weekends 2-3 as planned.
→ Focus fine-tuning data collection on the failing categories from your analysis.
→ Don't waste training data on what already works.
OUTCOME C: Mean MOS < 3.0, accent feels wrong, intelligibility < 70%
→ IndicF5's Odia training was insufficient.
→ Heavier intervention needed. Consider whether to fine-tune harder
OR pivot strategy (Indic Parler-TTS as alternative base, or different approach).
→ Don't proceed blindly with weekend 2-3 plan if you're in this bucket.
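The decision gate can be written down as a small function so the outcome is decided by the numbers and not by mood on Sunday night. This is a sketch, not the author's rule: `classify_outcome` is a hypothetical name, the thresholds are taken from the three outcomes above, and two choices are my assumptions because the text leaves them open. First, "intelligibility" is read as the percentage of clips scoring 4 or higher. Second, combinations the text does not cover (for example high MOS with weak accent scores) fall into bucket B.

```python
def classify_outcome(mean_mos, intelligibility_pct, accent_pct):
    """Map baseline metrics to outcome 'A', 'B', or 'C'."""
    # Outcome A requires all three conditions together
    if mean_mos >= 3.8 and intelligibility_pct > 85 and accent_pct > 80:
        return "A"
    # Outcome C is triggered by either a low MOS or low intelligibility (assumption)
    if mean_mos < 3.0 or intelligibility_pct < 70:
        return "C"
    # Everything else is treated as the standard path (assumption)
    return "B"

print(classify_outcome(3.4, 80.0, 75.0))
```

Fixing the rule before the evaluation CSVs come back keeps the go/no-go decision honest.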
Goal: ~5 hours of clean Odia audio with transcripts
Build the Fine-Tuning Dataset
▸ NOTE
5 hours sounds small but is enough for a LoRA fine-tune that
proves whether the approach works. The goal isn't a production
model yet — it's evidence that fine-tuning IndicF5 on Odia
improves it. If 5 hrs gives a measurable bump, 50 hrs is likely to
give a larger one, though that still has to be tested.
⚠ WATCH OUT
Quality over quantity, hard rule: 5 hours of
clean studio-quality audio outperforms 50 hours of noisy
YouTube. Set strict thresholds and reject aggressively. Most of
your weekend is filtering, not collecting.
6-8 hours, mostly unattended
Saturday: Source & Download
2.1
Pick 3-5 high-quality Odia channels · 1 hr
# Best sources for Odia speech, ranked by expected yield:

#1 Doordarshan Odia (DD Odia)
   - Government broadcaster, formal Odia
   - Minimal background music in news segments
   - Consistent broadcast-quality audio
   - https://youtube.com/@ddodisha
   - Expected yield after filtering: 40-50%

#2 OTV (Odisha TV) - news only
   - Studio anchor segments, clear formal Odia
   - Skip field reports, debates, talk shows
   - Search: "OTV news anchor" or "OTV bulletin"
   - Expected yield: 25-30%

#3 Odia audiobook channels
   - Search "ଓଡ଼ିଆ ଗଳ୍ପ" (Odia stories) on YouTube
   - Single narrator, clean recording, narrative prosody
   - Expected yield: 50-60%

#4 Odia podcast channels (skip if amateur)
   - Only if professionally recorded
   - Expected yield: variable

# Avoid for Weekend 2 dataset:
✗ Movie/serial dialogues (heavy music, multiple speakers)
✗ Bhajans / religious chants (musical, repetitive)
✗ Phone-quality interviews (poor SNR)
✗ Outdoor / field reports (wind, traffic noise)
2.2
Download with yt-dlp on your LAPTOP (not Kaggle) · 3-4 hrs (mostly unattended)
⚠ WATCH OUT
Don't use Kaggle for downloading. Kaggle's
outbound bandwidth is unreliable for yt-dlp at scale,
sessions time out, and you're wasting GPU time on CPU-bound
work. Run downloads on your laptop overnight; upload the
results to Kaggle as a Dataset.
# On your laptop (Linux/Mac/WSL):
pip install yt-dlp

# Create a list of YouTube URLs (50-100 videos targeting ~40-50 hours total)
cat > odia_urls.txt << 'EOF'
https://www.youtube.com/watch?v=VIDEO_ID_1
https://www.youtube.com/watch?v=VIDEO_ID_2
... (50-100 URLs)
EOF

# Download audio only, 16kHz mono WAV
mkdir -p raw_audio
yt-dlp \
  --batch-file odia_urls.txt \
  --extract-audio \
  --audio-format wav \
  --audio-quality 0 \
  --postprocessor-args "-ar 16000 -ac 1" \
  --output "raw_audio/%(id)s.%(ext)s" \
  --write-info-json \
  --no-overwrites \
  --sleep-interval 3

# Check download size:
du -sh raw_audio/
# Expected: roughly 4-6 GB for ~40-50 hours of 16-bit 16 kHz mono WAV
# (32,000 bytes per second, about 115 MB per hour)
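As a sanity check on disk and upload planning, the size of uncompressed PCM audio follows directly from sample rate, sample width, and channel count. The sketch below assumes 16-bit samples, which is the usual ffmpeg default for WAV output; `pcm_wav_gb` is a hypothetical helper name and the small WAV header is ignored.

```python
def pcm_wav_gb(hours, sample_rate=16000, bytes_per_sample=2, channels=1):
    """Approximate size in decimal GB of uncompressed PCM audio."""
    total_bytes = hours * 3600 * sample_rate * bytes_per_sample * channels
    return total_bytes / 1e9

# 40 hours of 16 kHz mono, 16-bit audio
print(f"{pcm_wav_gb(40):.2f} GB")
# The same 40 hours at 24 kHz mono
print(f"{pcm_wav_gb(40, sample_rate=24000):.2f} GB")
```

At these rates the whole raw download fits comfortably inside a single Kaggle Dataset, so there is no need to compress to a lossy format before uploading.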
2.3
Pre-filter on laptop before uploading to Kaggle · 2 hrs
# Filter on laptop — saves Kaggle storage + GPU time
# Quick filter: drop very short or very long files
import os
from pathlib import Path
raw_dir = Path("raw_audio")
keep_dir = Path("filtered_audio")
keep_dir.mkdir(exist_ok=True)
# 16-bit PCM at 16 kHz mono is 32,000 bytes/s, i.e. about 1.92 MB per minute
MB_PER_MIN = 16000 * 2 * 60 / 1e6
for wav in list(raw_dir.glob("*.wav")):
    duration_min = wav.stat().st_size / 1e6 / MB_PER_MIN
    if 5 <= duration_min <= 30:  # Keep 5-30 minute files
        wav.rename(keep_dir / wav.name)
# Check what survived
kept = list(keep_dir.glob("*.wav"))
total_minutes = sum(f.stat().st_size / 1e6 / MB_PER_MIN for f in kept)
print(f"Kept: {len(kept)} files, ~{total_minutes/60:.1f} hours")
# Target: 30-40 hours of raw audio survives this filter
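# Upload to Kaggle as new Dataset
# kaggle datasets init -p filtered_audio   (writes dataset-metadata.json; edit its title and id first)
# kaggle datasets create -p filtered_audio --dir-mode zip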
The actual filtering — most of GPU time goes here
Sunday: Run the Quality Pipeline on Kaggle
2.4
Stage 1: VAD + segmentation · 2 hrs GPU
# Kaggle notebook — Stage 1 of pipeline
!pip install -q silero-vad torch torchaudio soundfile
import torch, torchaudio
from silero_vad import load_silero_vad, read_audio, get_speech_timestamps
from pathlib import Path
import json
vad_model = load_silero_vad()
INPUT_DIR = Path("/kaggle/input/odia-raw-audio")
OUT_DIR = Path("/kaggle/working/segments")
OUT_DIR.mkdir(exist_ok=True)
manifest = []
for wav_file in INPUT_DIR.glob("*.wav"):
    audio = read_audio(str(wav_file), sampling_rate=16000)
    # Get speech segments (timestamps are returned in samples)
    segments = get_speech_timestamps(
        audio, vad_model,
        threshold=0.5,
        sampling_rate=16000,
        min_speech_duration_ms=3000,  # Min 3s segments for TTS
        max_speech_duration_s=20,     # Max 20s, keeps GPU memory bounded
        min_silence_duration_ms=500,
        speech_pad_ms=200,
    )
    for i, ts in enumerate(segments):
        start_s, end_s = ts['start'] / 16000, ts['end'] / 16000
        duration = end_s - start_s
        if duration < 3 or duration > 20:
            continue
        seg_audio = audio[ts['start']:ts['end']]
        seg_id = f"{wav_file.stem}_{i:04d}"
        seg_path = OUT_DIR / f"{seg_id}.wav"
        torchaudio.save(str(seg_path), seg_audio.unsqueeze(0), 16000)
        manifest.append({
            "segment_id": seg_id,
            "source_file": wav_file.name,
            "start_s": start_s, "end_s": end_s, "duration_s": duration,
            "path": str(seg_path),
        })
# Save manifest
import pandas as pd
pd.DataFrame(manifest).to_csv("/kaggle/working/segments_manifest.csv", index=False)
print(f"Total segments: {len(manifest)}")
print(f"Total speech hours: {sum(s['duration_s'] for s in manifest)/3600:.1f}")
# Expected: ~10-15 hours after VAD (from 30-40 hr raw)
2.5
Stage 2: Whisper transcription + filter · 3-4 hrs GPU
# Kaggle notebook — Stage 2: Transcribe + filter low-confidence
!pip install -q faster-whisper
from faster_whisper import WhisperModel
import pandas as pd
import json
from pathlib import Path
# Load Whisper large-v3
print("Loading Whisper large-v3 (3 GB)...")
whisper = WhisperModel("large-v3", device="cuda", compute_type="float16")
manifest = pd.read_csv("/kaggle/working/segments_manifest.csv")
results = []
# Caution: the stock Whisper language list does not appear to include Odia
# ("or"); if so, forcing it raises an error on every segment. Check
# whisper.supported_languages (available in recent faster-whisper releases)
# and swap in an Odia-capable fine-tuned checkpoint if "or" is missing.
for i, row in manifest.iterrows():
    try:
        segments, info = whisper.transcribe(
            row['path'],
            language="or",  # Force Odia
            word_timestamps=True,
            beam_size=5,
            temperature=0.0,
            condition_on_previous_text=False,
        )
        segments = list(segments)
        if not segments:
            continue
        text = " ".join(s.text for s in segments).strip()
        avg_logprob = sum(s.avg_logprob for s in segments) / len(segments)
        no_speech = max(s.no_speech_prob for s in segments)
        word_probs = [w.probability for s in segments for w in s.words]
        min_word_prob = min(word_probs) if word_probs else 0
        # FILTER: reject low-quality transcripts
        if avg_logprob < -0.7: continue        # Whisper unsure
        if no_speech > 0.3: continue           # Probably not speech
        if min_word_prob < 0.5: continue       # Some word very unsure
        if len(text.split()) < 3: continue     # Too short
        if len(text.split()) > 60: continue    # Too long for TTS training
        # Note: with a forced language, language_probability is reported as 1,
        # so this check only filters when language detection is left on
        if info.language_probability < 0.85: continue
        results.append({
            **row.to_dict(),
            "transcript": text,
            "whisper_confidence": float(avg_logprob),
            "lang_prob": float(info.language_probability),
        })
        if i % 100 == 0:
            print(f"[{i}/{len(manifest)}] Kept: {len(results)}")
    except Exception as e:
        print(f"[ERROR] {row['segment_id']}: {e}")
# Save filtered manifest
filtered_df = pd.DataFrame(results)
filtered_df.to_csv("/kaggle/working/transcribed_segments.csv", index=False)
print(f"\nAfter Whisper filter: {len(filtered_df)} segments")
print(f"Hours: {filtered_df['duration_s'].sum()/3600:.1f}")
# Expected: ~6-8 hours survive
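The filter thresholds in Stage 2 are easier to tune when they live in a pure function that reports which rule rejected a segment. This is a sketch: `TranscriptStats` and `rejection_reason` are hypothetical names, the thresholds and their order are copied from the cell above, and nothing here calls Whisper itself.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class TranscriptStats:
    text: str
    avg_logprob: float
    no_speech_prob: float
    min_word_prob: float
    lang_prob: float

def rejection_reason(s):
    """Return the name of the first failed rule, or None if the segment is kept."""
    n_words = len(s.text.split())
    if s.avg_logprob < -0.7:
        return "low_avg_logprob"
    if s.no_speech_prob > 0.3:
        return "no_speech"
    if s.min_word_prob < 0.5:
        return "low_word_prob"
    if n_words < 3:
        return "too_short"
    if n_words > 60:
        return "too_long"
    if s.lang_prob < 0.85:
        return "low_lang_prob"
    return None

def tally(stats_list):
    """Count segments per outcome; kept segments are counted under 'kept'."""
    return Counter(rejection_reason(s) or "kept" for s in stats_list)
```

Printing `tally(...)` after a run shows which threshold is responsible for most of the loss, which is the number to look at before loosening any of them.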
2.6
Stage 3: Quality scoring (DNSMOS) · 1-2 hrs GPU
# Stage 3 — DNSMOS scoring (Microsoft's automatic MOS predictor)
!pip install -q onnxruntime-gpu requests
!wget -q https://github.com/microsoft/DNS-Challenge/raw/master/DNSMOS/DNSMOS/sig_bak_ovr.onnx
!wget -q https://github.com/microsoft/DNS-Challenge/raw/master/DNSMOS/DNSMOS/model_v8.onnx
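import onnxruntime as ort
import librosa
import numpy as np
import pandas as pd
# DNSMOS scoring function (simplified; full code on MS DNS-Challenge GitHub)
# sig_bak_ovr.onnx takes raw 16 kHz audio and returns [SIG, BAK, OVRL].
# model_v8.onnx is the separate P.808 model and expects mel features,
# so it is not the right session for this raw-audio call.
sess = ort.InferenceSession(
    "sig_bak_ovr.onnx",
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider'],
)
INPUT_LEN = int(9.01 * 16000)  # fixed window length used by the reference script
def score_dnsmos(audio_path):
    audio, sr = librosa.load(audio_path, sr=16000)
    # Pad or truncate to one fixed-length window
    if len(audio) < INPUT_LEN:
        audio = np.pad(audio, (0, INPUT_LEN - len(audio)))
    audio = audio[:INPUT_LEN]
    # Run inference
    input_features = audio.astype(np.float32).reshape(1, -1)
    out = sess.run(None, {'input_1': input_features})
    return float(out[0][0][2])  # Raw overall (OVRL) score, before the reference script's calibration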
# Score all transcribed segments
df = pd.read_csv("/kaggle/working/transcribed_segments.csv")
df['dnsmos'] = df['path'].apply(score_dnsmos)
# Final filter: only keep DNSMOS >= 3.5 (good quality)
final = df[df['dnsmos'] >= 3.5].copy()
final = final.sort_values('dnsmos', ascending=False)
# Cap at 5 hours total (we don't need more for prototype)
total_seconds = 0
keep_indices = []
for idx, row in final.iterrows():
    if total_seconds + row['duration_s'] > 5 * 3600:
        break
    keep_indices.append(idx)
    total_seconds += row['duration_s']
final_5hr = final.loc[keep_indices]
final_5hr.to_csv("/kaggle/working/training_dataset_v1.csv", index=False)
print(f"Final training set: {len(final_5hr)} segments, {total_seconds/3600:.2f} hours")
print(f"Mean DNSMOS: {final_5hr['dnsmos'].mean():.2f}")
print(f"Mean Whisper confidence: {final_5hr['whisper_confidence'].mean():.2f}")
# Save as Kaggle Dataset for Weekend 3
# Files to save: training_dataset_v1.csv + all the .wav files referenced in it
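The capping step is a greedy selection: rank by quality score, then take segments until the time budget is spent. A standalone sketch of that logic follows, with a hypothetical `select_within_budget` helper operating on plain tuples so it can be tested without pandas. It mirrors the cell above, including the choice to stop at the first segment that does not fit.

```python
def select_within_budget(segments, budget_s=5 * 3600):
    """Pick highest-scoring segments until the duration budget is reached.

    segments: iterable of (segment_id, score, duration_s) tuples.
    Returns (kept_ids, total_seconds).
    """
    ranked = sorted(segments, key=lambda s: s[1], reverse=True)
    kept, total = [], 0.0
    for seg_id, _score, duration in ranked:
        if total + duration > budget_s:
            break  # same behavior as the pipeline cell: stop at first overflow
        kept.append(seg_id)
        total += duration
    return kept, total

demo = [("a", 4.2, 10.0), ("b", 3.9, 15.0), ("c", 4.5, 8.0)]
print(select_within_budget(demo, budget_s=20))
```

Replacing `break` with `continue` would pack the budget more tightly by skipping only the segments that do not fit, at the cost of admitting slightly lower-scoring clips.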
End of Weekend 2 — Yield Reality
Raw downloaded
30-40 hours
From YouTube
After VAD segmentation
10-15 hours
Pure speech segments only
After Whisper confidence filter
6-8 hours
Reliable transcripts only
After DNSMOS quality filter
4-6 hours
This is your training set
Yield rate
~15%
As predicted in original pipeline design
The moment of truth — does it actually help?
LoRA Fine-Tune of IndicF5
▸ NOTE
This is the technical climax of the prototype. By Sunday
evening, you'll know if fine-tuning IndicF5 on Odia data
measurably improves output. The answer determines whether the
entire sovereign Odia TTS thesis is viable.
⚠ WATCH OUT
Important caveat: IndicF5's repository as of
late 2025 doesn't ship an official fine-tuning script. You'll be
using the underlying F5-TTS training infrastructure with IndicF5
weights as the starting checkpoint. This works because IndicF5
is architecturally identical to F5-TTS — but expect some
integration debugging.
3 hours — most goes to debugging the integration
Saturday Morning: Setup F5-TTS Training Code
3.1
Clone F5-TTS training code · 30 min
# Kaggle notebook for Weekend 3
# Settings: GPU T4 x2, Internet ON
!git clone https://github.com/SWivid/F5-TTS.git /kaggle/working/F5-TTS
%cd /kaggle/working/F5-TTS
!pip install -q -e ".[eval]"
!pip install -q peft accelerate
# Verify import
import sys
sys.path.insert(0, "/kaggle/working/F5-TTS/src")
from f5_tts.model import CFM, DiT
print("F5-TTS training code loaded")
3.2
Convert IndicF5 weights to F5-TTS-compatible format · 1 hr
# IndicF5 uses HF transformers wrapper. F5-TTS training expects raw checkpoint.
# We need to extract the underlying F5-TTS state_dict from IndicF5.
import torch
from transformers import AutoModel
# Load IndicF5
indicf5 = AutoModel.from_pretrained("ai4bharat/IndicF5", trust_remote_code=True)
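# Extract the underlying model state
# (Inspect indicf5.state_dict().keys() to find the F5 backbone)
PREFIX = "model."  # assumed wrapper prefix; confirm against the printed keys
backbone_state = {}
# state_dict() includes buffers as well as parameters
for name, tensor in indicf5.state_dict().items():
    # IndicF5 prefixes the F5 model: strip only the leading prefix, because
    # str.replace would also remove "model." from the middle of a name
    if name.startswith(PREFIX):
        backbone_state[name[len(PREFIX):]] = tensor.detach().clone()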
# Save as F5-TTS-compatible checkpoint
torch.save({
"model_state_dict": backbone_state,
"step": 0, # We're starting fresh fine-tuning
"ema_model_state_dict": backbone_state, # Initialize EMA same as model
}, "/kaggle/working/indicf5_base_for_finetune.pt")
print(f"Backbone params: {len(backbone_state)}")
print("Saved as indicf5_base_for_finetune.pt")
⚠ WATCH OUT
The exact parameter name mapping between IndicF5 and F5-TTS
might need adjustment. Inspect with
indicf5.state_dict().keys()
first. This is the most likely place where you'll spend 1-2
hours debugging — IndicF5's HF wrapper may name things
differently than vanilla F5-TTS.
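Key-name debugging goes faster with two small helpers that work on plain lists of names, so they can be tried on the output of `state_dict().keys()` without loading anything onto the GPU. This is a sketch: `strip_prefix` and `compare_keys` are hypothetical names, and the example keys below are invented for illustration, not real IndicF5 parameter names.

```python
def strip_prefix(keys, prefix):
    """Split keys into (stripped, skipped); only a leading prefix is removed."""
    stripped = [k[len(prefix):] for k in keys if k.startswith(prefix)]
    skipped = [k for k in keys if not k.startswith(prefix)]
    return stripped, skipped

def compare_keys(source_keys, target_keys):
    """Return (missing, unexpected): target keys absent from source, and extras."""
    src, tgt = set(source_keys), set(target_keys)
    return sorted(tgt - src), sorted(src - tgt)

# Invented key names, for illustration only
wrapper_keys = ["model.transformer.model.weight", "vocoder.w"]
stripped, skipped = strip_prefix(wrapper_keys, "model.")
print(stripped, skipped)
print(compare_keys(stripped, ["transformer.model.weight", "transformer.bias"]))
```

An empty `missing` list from `compare_keys` is the signal that the converted checkpoint covers every tensor the training architecture expects.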
3.3
Prepare training data in F5-TTS format · 1 hr
# F5-TTS training expects: a directory of .wav files + a CSV with (audio_path, text)
# Format your Weekend 2 output to match
import pandas as pd
import shutil
from pathlib import Path
src_csv = "/kaggle/input/odia-training-data-v1/training_dataset_v1.csv"
src_audio = Path("/kaggle/input/odia-training-data-v1/")
out_dir = Path("/kaggle/working/training_data")
out_dir.mkdir(exist_ok=True)
df = pd.read_csv(src_csv)
# F5-TTS expects 24kHz audio. Resample if needed.
import librosa, soundfile as sf
ft_records = []
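for i, row in df.iterrows():
    # The manifest's 'path' column points at last weekend's /kaggle/working,
    # which is gone; resolve by filename inside the mounted dataset instead
    # (assumes the .wav files sit at the dataset root)
    src_path = src_audio / Path(row['path']).name
    audio, sr = librosa.load(src_path, sr=24000)
    out_path = out_dir / f"{row['segment_id']}.wav"
    sf.write(out_path, audio, 24000)
    ft_records.append({
        "audio_path": str(out_path),
        "text": row['transcript'],
        "duration": row['duration_s'],
    })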
ft_df = pd.DataFrame(ft_records)
ft_df.to_csv("/kaggle/working/training_manifest_f5.csv", index=False)
print(f"Training manifest: {len(ft_df)} samples, {ft_df['duration'].sum()/3600:.2f} hrs")
2 hours
Saturday Afternoon: Configure LoRA
3.4
Wrap F5-TTS model with PEFT LoRA · 1.5 hrs
from peft import LoraConfig, get_peft_model, TaskType
import torch
from f5_tts.model import CFM, DiT
# Load base F5-TTS architecture
model = CFM(
transformer=DiT(
dim=1024, depth=22, heads=16, ff_mult=2,
text_dim=512, conv_layers=4,
),
mel_spec_kwargs=dict(
n_fft=1024, hop_length=256, win_length=1024,
n_mel_channels=100, target_sample_rate=24000,
),
)
# Load IndicF5 weights into it
ckpt = torch.load("/kaggle/working/indicf5_base_for_finetune.pt", map_location="cpu")
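load_result = model.load_state_dict(ckpt['model_state_dict'], strict=False)
# strict=False hides name mismatches, so report them before trusting the load
print(f"Missing keys: {len(load_result.missing_keys)}, "
      f"unexpected keys: {len(load_result.unexpected_keys)}")
print("Loaded IndicF5 weights into F5-TTS architecture")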
# Configure LoRA — target attention layers in transformer
lora_config = LoraConfig(
r=16, # Low rank — fits Kaggle VRAM
lora_alpha=32,
target_modules=[
"to_q", "to_v", # Attention query+value
"ff.0", "ff.2", # FFN layers (sometimes named differently)
],
lora_dropout=0.1,
bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Expected: ~4-8M trainable / 400M total (1-2%)
# Move to GPU
model = model.to("cuda")
print(f"VRAM: {torch.cuda.memory_allocated() / 1e9:.1f} GB")
# Should be ~4-5 GB — plenty of room on T4 16 GB
✕ TRAP
The target_modules trap: F5-TTS's exact
attention layer names may be
to_q/to_kv
instead of
to_q/to_v. Run
print([n for n,_ in model.named_modules()][:50])
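to see actual names before configuring LoRA. If no name matches,
PEFT raises an error; if only some match, the rest are skipped
silently and those layers never train, so check the count from
print_trainable_parameters().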
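A quick way to see which targets would attach is to replay the matching rule against the module names yourself. The sketch below assumes PEFT's list-style matching (a module matches when its full name equals the target or ends with a dot followed by the target); treat that as my understanding of the rule and confirm against your installed PEFT version. `match_targets` is a hypothetical helper and the module names are invented for illustration.

```python
def match_targets(module_names, target_modules):
    """Map each target string to the module names it would match."""
    hits = {t: [] for t in target_modules}
    for name in module_names:
        for t in target_modules:
            if name == t or name.endswith("." + t):
                hits[t].append(name)
    return hits

# Invented module names, for illustration only
names = [
    "transformer_blocks.0.attn.to_q",
    "transformer_blocks.0.attn.to_k",
    "transformer_blocks.0.attn.to_v",
    "transformer_blocks.0.ff.ff.0.0",
    "transformer_blocks.0.ff.ff.2",
]
for target, matched in match_targets(names, ["to_q", "to_v", "ff.0", "ff.2"]).items():
    print(f"{target}: {len(matched)} match(es)")
```

In the real notebook, `names` would come from `[n for n, _ in model.named_modules()]`; any target with zero matches should be renamed or removed before calling `get_peft_model`.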
3.5
Set up training loop with Kaggle session safety · 1 hr
# Critical: Kaggle sessions die at 12 hours. Build resumability from the start.
import torch
from torch.utils.data import DataLoader
from pathlib import Path
CHECKPOINT_DIR = Path("/kaggle/working/lora_checkpoints")
CHECKPOINT_DIR.mkdir(exist_ok=True)
def save_checkpoint(model, optimizer, step, loss):
    ckpt = {
        "step": step,
        "lora_state_dict": {k: v for k, v in model.state_dict().items() if "lora" in k.lower()},
        "optimizer_state_dict": optimizer.state_dict(),
        "loss": loss,
    }
    # Atomic write: save to a temp file, then rename over the old checkpoint
    tmp = CHECKPOINT_DIR / "latest.pt.tmp"
    torch.save(ckpt, tmp)
    tmp.rename(CHECKPOINT_DIR / "latest.pt")
    # Also keep periodic snapshots
    if step % 500 == 0:
        torch.save(ckpt, CHECKPOINT_DIR / f"step_{step}.pt")

def load_checkpoint(model, optimizer):
    latest = CHECKPOINT_DIR / "latest.pt"
    if not latest.exists():
        return 0
    ckpt = torch.load(latest, map_location="cuda")
    # Load only LoRA params
    model.load_state_dict({**model.state_dict(), **ckpt['lora_state_dict']}, strict=False)
    optimizer.load_state_dict(ckpt['optimizer_state_dict'])
    print(f"Resumed from step {ckpt['step']}")
    return ckpt['step']
# Optimizer — only train LoRA params
optimizer = torch.optim.AdamW(
[p for p in model.parameters() if p.requires_grad],
lr=1e-4,
weight_decay=0.01,
)
# Resume if checkpoint exists
start_step = load_checkpoint(model, optimizer)
8 hours of GPU time, then evaluate
Saturday Evening + Sunday: Train
3.6
Run LoRA training · 6-8 hrs
# Training loop — the actual thing
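# Placeholder import: F5TTSDataset stands in for whatever dataset class your
# installed F5-TTS provides (recent versions keep the dataset helpers under
# f5_tts.model.dataset). Adapt the name and arguments to your version.
from f5_tts.dataset import F5TTSDataset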
from torch.utils.data import DataLoader
# Build dataset
dataset = F5TTSDataset(
csv_path="/kaggle/working/training_manifest_f5.csv",
target_sample_rate=24000,
max_duration=20.0,
)
loader = DataLoader(
dataset, batch_size=4, # Small batch for T4 VRAM
num_workers=2, shuffle=True,
)
# Train
TOTAL_STEPS = 5000 # ~6-8 hrs on T4
LOG_EVERY = 50
SAVE_EVERY = 200
model.train()
step = start_step
import time
t_start = time.time()
running_loss = 0
while step < TOTAL_STEPS:
    for batch in loader:
        if step >= TOTAL_STEPS: break
        # Move to GPU (batch keys depend on the dataset class you ended up using)
        mel = batch['mel'].to("cuda")
        text = batch['text']
        durations = batch['durations'].to("cuda")
        # Forward + flow-matching loss. The number of extra return values
        # differs across F5-TTS versions, so keep only the first.
        loss, *_ = model(mel, text=text, lens=durations)
        # Backward
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        optimizer.zero_grad()
        running_loss += loss.item()
        step += 1
        # Logging
        if step % LOG_EVERY == 0:
            avg_loss = running_loss / LOG_EVERY
            elapsed = time.time() - t_start
            # Rate is measured over steps done in THIS session, so the ETA
            # stays correct after resuming from a checkpoint
            eta = (elapsed / (step - start_step)) * (TOTAL_STEPS - step)
            print(f"step {step:5d} | loss {avg_loss:.4f} | "
                  f"elapsed {elapsed/60:.1f}m | ETA {eta/60:.1f}m")
            running_loss = 0
        # Checkpoint
        if step % SAVE_EVERY == 0:
            save_checkpoint(model, optimizer, step, loss.item())
# Final save
save_checkpoint(model, optimizer, step, loss.item())
print(f"\nTraining complete: {TOTAL_STEPS} steps in {(time.time()-t_start)/3600:.1f} hours")
⚠ WATCH OUT
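If your session dies mid-training: Restart the
notebook, re-run cells 1-5, and load_checkpoint() resumes from
the last saved step. This only works if the checkpoint files
survived: /kaggle/working is wiped between sessions, so copy
lora_checkpoints/ to a Kaggle Dataset (or download it)
periodically while training is still running.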
⚠ WATCH OUT
If loss isn't decreasing: Common causes:
(1) target_modules wrong, no params actually training; (2)
learning rate too low — try 3e-4; (3) data preprocessing
mismatch with what F5-TTS expects. Loss should drop from
~5.0 to ~2.5 in first 500 steps.
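To check the loss trend without eyeballing the notebook output, you can parse the log lines the training loop prints. This is a minimal sketch, assuming the `step N | loss X | ...` format shown earlier; `parse_loss_log`, `loss_is_dropping`, and the 20% threshold are hypothetical choices, not part of F5-TTS.

```python
import re

# Matches lines such as: "step   150 | loss 4.2031 | elapsed 3.1m | ETA 98.0m"
LOG_LINE = re.compile(r"step\s+(\d+)\s*\|\s*loss\s+([0-9.]+)")


def parse_loss_log(text):
    """Return [(step, loss), ...] for every training log line found in text."""
    return [(int(m.group(1)), float(m.group(2))) for m in LOG_LINE.finditer(text)]


def loss_is_dropping(points, min_rel_drop=0.2):
    """True if the last logged loss is at least min_rel_drop below the first."""
    if len(points) < 2:
        return False
    first, last = points[0][1], points[-1][1]
    if first <= 0:
        return False
    return (first - last) / first >= min_rel_drop
```

Paste the first few hundred steps of output into `parse_loss_log()`; if `loss_is_dropping()` returns False, work through the three causes listed above before burning more GPU quota.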
3.7
Sunday: Inference comparison, 3 hrs
# Sunday morning: generate fine-tuned outputs on the SAME 200 sentences from Weekend 1
from pathlib import Path
import pandas as pd
import soundfile as sf

# Load fine-tuned model
ft_ckpt = torch.load(CHECKPOINT_DIR / "latest.pt", map_location="cuda")
model.load_state_dict(ft_ckpt['lora_state_dict'], strict=False)
model.eval()

# Inference on test sentences
test_df = pd.read_csv("/kaggle/input/odia-baseline-data/test_sentences.csv")
OUT_FT = Path("/kaggle/working/finetune_outputs")
OUT_FT.mkdir(exist_ok=True)
for _, row in test_df.iterrows():
    for ref_name, (ref_path, ref_text) in REFS.items():  # Same refs as Weekend 1
        out_path = OUT_FT / f"{row['id']}_{ref_name}_ft.wav"
        with torch.no_grad():
            audio = model.sample(
                text=row['sentence'],
                ref_audio=ref_path,
                ref_text=ref_text,
                steps=32,
                cfg_strength=2.0,
            )
        sf.write(out_path, audio.cpu().numpy(), 24000)
# Now: package side-by-side comparison for evaluators
import random
import shutil
EVAL = Path("/kaggle/working/comparison_eval")
EVAL.mkdir(exist_ok=True)
rng = random.Random(0)  # fixed seed so the packaging is reproducible
key_rows = []           # which label hides which system; never send this to evaluators
for _, row in test_df.iterrows():
    for ref_name in REFS:
        # Baseline (Weekend 1) and Fine-tuned (Weekend 3), with randomized labels for blind eval
        labels = ['A', 'B']
        rng.shuffle(labels)
        baseline_label, ft_label = labels
        shutil.copy(
            f"/kaggle/input/baseline-outputs/{row['id']}_{ref_name}.wav",
            EVAL / f"{row['id']}_{ref_name}_{baseline_label}.wav"
        )
        shutil.copy(
            OUT_FT / f"{row['id']}_{ref_name}_ft.wav",
            EVAL / f"{row['id']}_{ref_name}_{ft_label}.wav"
        )
        key_rows.append({"id": row['id'], "ref": ref_name,
                         "baseline_label": baseline_label, "ft_label": ft_label})
# Save the A/B key OUTSIDE the evaluator package; you need it to decode the votes
pd.DataFrame(key_rows).to_csv("/kaggle/working/ab_key.csv", index=False)

# Build blind A/B form
records = [{"id": row['id'], "ref": rn, "category": row['category'],
            "sentence": row['sentence'],
            "preferred_AB": "", "much_better_yn": ""}
           for _, row in test_df.iterrows() for rn in REFS]
pd.DataFrame(records).to_csv(EVAL / "ab_form.csv", index=False)
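Sunday evening: collect blind A/B preferences
# Send same 3 evaluators the comparison_eval package
# They listen to A vs B for each sentence, pick the better one
# DO NOT tell them which is baseline vs fine-tuned (blind comparison)
# Aggregate results:
from glob import glob
eval_csvs = glob("/kaggle/working/eval_returns/*.csv")  # assumed folder holding the returned forms
ev = pd.concat([pd.read_csv(f) for f in eval_csvs])
# Decode A/B back to baseline/finetune using the label mapping you saved when packaging
# (assumed here: a CSV with columns id, ref, ft_label)
key = pd.read_csv("/kaggle/working/ab_key.csv")
ev = ev.merge(key, on=["id", "ref"])
voted = ev["preferred_AB"].isin(["A", "B"])
ev["winner"] = "tie"  # blank or anything other than A/B counts as no preference
ev.loc[voted & (ev["preferred_AB"] == ev["ft_label"]), "winner"] = "finetune"
ev.loc[voted & (ev["preferred_AB"] != ev["ft_label"]), "winner"] = "baseline"
preference = ev["winner"].value_counts(normalize=True)
print("Fine-tune preferred:", preference.get('finetune', 0) * 100, "%")
print("Baseline preferred:", preference.get('baseline', 0) * 100, "%")
print("No preference:", preference.get('tie', 0) * 100, "%")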
# Decision rule:
# Fine-tune preferred > 60% → fine-tuning works, scale up
# 50-60% → marginal improvement, need more data
# < 50% → fine-tuning hurt, debug pipeline
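Before trusting the 60% threshold, it helps to ask whether the split could be chance. The sketch below decodes blind votes and runs an exact two-sided sign test using only the standard library. `decode_votes` and `sign_test_p` are hypothetical helper names, and the key format (item id mapped to the label given to the fine-tuned clip) is an assumption about how you stored the mapping.

```python
from math import comb


def decode_votes(votes, key):
    """Map blind 'A'/'B' votes to 'finetune'/'baseline' using the saved key.

    votes: list of (item_id, vote); any vote other than 'A' or 'B' counts as a tie.
    key:   dict item_id -> label ('A' or 'B') that was given to the fine-tuned clip.
    """
    decoded = []
    for item_id, vote in votes:
        if vote not in ("A", "B"):
            decoded.append("tie")
        elif vote == key[item_id]:
            decoded.append("finetune")
        else:
            decoded.append("baseline")
    return decoded


def sign_test_p(wins, losses):
    """Two-sided exact binomial p-value for wins vs losses under a 50/50 null."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = max(wins, losses)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Ties are left out of the test. Treat the p-value as a rough guide only, since many votes come from the same three evaluators and are not independent.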
Make it real, get external feedback
Public Demo + Strategic Decision
▸ NOTE
By end of Weekend 4: a public demo on Hugging Face Spaces that
anyone can use. Real native Odia speakers from outside your
circle test it. You collect honest feedback. You decide whether
to pursue real funding or pivot.
6 hours
Saturday: Build the Demo
4.1
Create Hugging Face Space (free hosting), 2 hrs
# Create new Space at huggingface.co/new-space
# - Owner: your username
# - Space name: odia-tts-demo
# - SDK: Gradio
# - Hardware: CPU basic (free) — add GPU later if traffic warrants
# - Visibility: Public
# In the Space repo, create app.py:
import gradio as gr
import torch
from transformers import AutoModel
from peft import PeftModel
import soundfile as sf
import numpy as np
import tempfile

# Load base IndicF5
print("Loading IndicF5...")
base_model = AutoModel.from_pretrained("ai4bharat/IndicF5", trust_remote_code=True)

# Load YOUR LoRA adapter (uploaded as separate HF model)
# After Weekend 3, upload the lora_state_dict to:
# huggingface.co/YOUR_USERNAME/odia-tts-lora-v1
# Then reference it here:
# (Note: actual LoRA loading depends on how IndicF5 was wrapped; adjust as needed)
# Placeholder so the app runs: this points at the base model until you load the adapter
ft_model = base_model

REFERENCE_VOICES = {
    "OTV Anchor": ("examples/ref_otv.wav", "OTV reference transcript here"),
    "DD Odia": ("examples/ref_dd.wav", "DD Odia reference transcript"),
    "Narrative": ("examples/ref_pod.wav", "Podcast narrator transcript"),
}

def synthesize(text, voice_choice, use_finetune):
    ref_path, ref_text = REFERENCE_VOICES[voice_choice]
    # Pick model
    model = ft_model if use_finetune else base_model
    audio = model(text, ref_audio_path=ref_path, ref_text=ref_text)
    # Normalize int16 PCM to float32 in [-1, 1]
    if audio.dtype == np.int16:
        audio = audio.astype(np.float32) / 32768.0
    out = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
    out.close()  # release the handle before soundfile writes to the path
    sf.write(out.name, audio, 24000)
    return out.name

with gr.Blocks(title="Sovereign Odia TTS Demo") as demo:
    gr.Markdown("# 🎙 ସ୍ଵାଧୀନ ଓଡ଼ିଆ ସ୍ୱର · Sovereign Odia TTS")
    gr.Markdown("Compare base IndicF5 vs our fine-tuned version on Odia text.")
    with gr.Row():
        text = gr.Textbox(label="Odia Text", placeholder="ଆଜି ଭୁବନେଶ୍ୱରରେ...", lines=3)
    with gr.Row():
        voice = gr.Dropdown(choices=list(REFERENCE_VOICES.keys()), value="OTV Anchor", label="Reference Voice")
        ft = gr.Checkbox(label="Use fine-tuned model", value=True)
    btn = gr.Button("Synthesize", variant="primary")
    audio_out = gr.Audio(label="Output", type="filepath")
    btn.click(synthesize, inputs=[text, voice, ft], outputs=audio_out)
    gr.Markdown("---")
    gr.Markdown("Built by [you] using IndicF5 (AI4Bharat). MIT.")
    gr.Markdown("Feedback: [your-email] · GitHub: [link]")

demo.launch()
4.2
Upload LoRA adapter to Hugging Face, 1 hr
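# Push your fine-tuned LoRA weights to HF for the Space to use
import torch
from huggingface_hub import HfApi, create_repo
create_repo("YOUR_USERNAME/odia-tts-lora-v1", private=False, exist_ok=True)
# latest.pt also holds optimizer state; publish only the LoRA tensors
ckpt = torch.load("/kaggle/working/lora_checkpoints/latest.pt", map_location="cpu")
torch.save(ckpt["lora_state_dict"], "/kaggle/working/odia_lora_state_dict.pt")
api = HfApi()
# Note: this is a raw state dict, not a PEFT-format adapter (no adapter_config.json);
# load it with torch.load() + model.load_state_dict(..., strict=False)
api.upload_file(
    path_or_fileobj="/kaggle/working/odia_lora_state_dict.pt",
    path_in_repo="adapter_model.bin",
    repo_id="YOUR_USERNAME/odia-tts-lora-v1",
)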
# Add a model card describing what this is
README = '''
# Sovereign Odia TTS — LoRA adapter v1
LoRA fine-tune of [ai4bharat/IndicF5](https://huggingface.co/ai4bharat/IndicF5)
on 5 hours of clean Odia speech (DD Odia + OTV news).
- Base model: IndicF5 (400M params)
- LoRA rank: 16, alpha: 32
- Training: 5000 steps, ~6 hours on T4
- License: MIT (matches base)
This is an early prototype. Quality is being evaluated.
## Usage
[code snippet here]
## Evaluation
[link to MOS results]
'''
with open("/tmp/README.md", "w") as f:
    f.write(README)
api.upload_file(path_or_fileobj="/tmp/README.md", path_in_repo="README.md",
                repo_id="YOUR_USERNAME/odia-tts-lora-v1")
4.3
Add example sentences and visual polish, 2 hrs
# Add to app.py — examples that showcase the model
EXAMPLES = [
["ଆଜି ଭୁବନେଶ୍ୱରରେ ବର୍ଷା ହେଉଛି ।", "OTV Anchor", True],
["ତୁ ସକାଳେ କଣ ଖାଇଛୁ?", "Narrative", True],
["ସରକାର ନୂଆ ନୀତି ଘୋଷଣା କରିଛନ୍ତି।", "DD Odia", True],
["୨୦୨୪ ମସିହାରେ ଭାରତ ଚନ୍ଦ୍ରଯାନ ୩ ଉତ୍ସର୍ଗ କରିଥିଲା।", "DD Odia", True],
]
with gr.Blocks(...):
    # ...
    gr.Examples(examples=EXAMPLES, inputs=[text, voice, ft])
# Make the Space discoverable:
# - Add tags: "odia", "indic", "tts", "sovereign-ai"
# - Add proper README to the Space
# - Submit to https://huggingface.co/spaces/lhoestq/awesome-spaces lists
6 hours
Sunday: Distribute & Collect Real Feedback
4.4
Share with Odia community, 2 hrs
# Post in these communities. Be specific, ask for feedback:
# r/Odia subreddit (~5k members)
"""
Built a sovereign Odia TTS prototype, free demo: [HF Space link]
Compares fine-tuned IndicF5 vs base. Feedback specifically wanted on:
- Does it sound like Odia or Hindi-influenced?
- Mispronunciation of common words?
- Voice quality on news vs conversational text?
"""
# Twitter/X: tag @ai4bharat, @indianai, Odia influencers
# LinkedIn: your network, mention Bhashini Mission
# Telegram Odia tech groups: search for active ones
# Direct outreach: 5-10 Odia journalists, Odia Wikipedia editors
# What you're optimizing for:
# - 50+ unique users try the demo
# - 10+ leave structured feedback
# - 1-2 connections that lead to funding conversations
4.5
Track usage and feedback, ongoing
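# Add basic usage logging to your Space
import gradio as gr
import json
from datetime import datetime, timezone

def synthesize_with_logging(text, voice, ft):
    # Log usage (length only, never the text itself)
    log = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "text_length": len(text),
        "voice": voice,
        "finetuned": ft,
    }
    # /data is only available when persistent storage is enabled for the Space;
    # without it, write somewhere else and expect logs to vanish on restart
    with open("/data/usage.jsonl", "a") as f:
        f.write(json.dumps(log) + "\n")
    return synthesize(text, voice, ft)

# Add feedback buttons (place inside your gr.Blocks layout and wire each
# one with .click() to a handler that stores the rating)
with gr.Row():
    good_btn = gr.Button("👍 Good")
    bad_btn = gr.Button("👎 Bad")
detail = gr.Textbox(label="What was wrong? (optional)")
submit = gr.Button("Submit feedback")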
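The feedback buttons need a handler that stores what people say. Below is a minimal sketch of one, using only the standard library; `record_feedback` and its field names are hypothetical, and the log path is whatever writable location your Space has.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def record_feedback(log_path, rating, detail, text_length, voice, finetuned):
    """Append one feedback record as a JSON line and return it."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "rating": rating,                  # "good" or "bad"
        "detail": (detail or "").strip(),  # optional free-text comment
        "text_length": text_length,        # length only, not the text itself
        "voice": voice,
        "finetuned": bool(finetuned),
    }
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record
```

In the Space you would call this from the `.click()` callbacks of the Good and Bad buttons, passing the current values of the text box, voice dropdown, and fine-tune checkbox as inputs.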
4.6
The strategic decision meeting (with yourself), 2 hrs
End of Weekend 4: the honest assessment
Sunday evening, sit with the data and decide. Three possible
outcomes:
OUTCOME A: "It works — fine-tune meaningfully better than base"
Evidence: Blind A/B prefers fine-tune > 60%
MOS bump >= 0.4 from baseline
Native speakers say "this sounds Odia"
Demo gets 100+ uses, several positive comments
Decision: Pursue real funding (Bhashini Mission, AI4Bharat grants,
AWS Activate, NVIDIA Inception, MeitY).
Use the prototype as your pitch.
Timeline: Funding takes 3-6 months. Use that time to build studio
voice, expand training data to 50+ hrs, refine pipeline.
OUTCOME B: "Marginal — small improvement, not transformative"
Evidence: Fine-tune slight preference (50-60%)
MOS bump 0.1-0.3
Some categories better, some worse
Demo gets some interest, no clear product-market fit
Decision: Two paths:
(1) Iterate — collect 20+ hrs better data, retrain
(2 more weekends, see if quality jumps)
(2) Pivot — focus on PRODUCT layer instead of model
(Build Odia voice-over service using base IndicF5)
OUTCOME C: "It doesn't work — fine-tune is no better than base, or is worse"
Evidence: No clear A/B preference
Audio quality issues (artifacts, prosody breaks)
Native speakers reject both versions
Decision: Don't throw good money after bad. Three options:
(1) Pivot to using base IndicF5 as-is, focus on product
(2) Try Indic Parler-TTS as alternative base
(3) Decide TTS isn't the right product, use 4 weekends as
valuable learning, redirect to a different problem.
The actual deliverables
What you have at the end of 4 weekends
Concrete artifacts
Diagnostic test set
200 sentences × 3 voices, native-validated
Reusable benchmark for all future work
Baseline metrics
MOS, intelligibility, accent correctness
Quantified IndicF5 capability on Odia
Training dataset v1
~5 hours clean Odia, transcribed
Reusable for any future TTS work
LoRA adapter weights
50 MB file on Hugging Face
MIT licensed, public
Public demo
Hugging Face Space, free, accessible
Anyone can try it. Includes A/B comparison.
Real user feedback
50+ users, 10+ structured responses
Validates or invalidates the entire thesis
A/B preference data
Blind comparison results
Statistically meaningful sample, not anecdotes
Funding-ready pitch
Demo + data + clear next steps
If Outcome A, you can apply to grants tomorrow
Total cost
₹0 in cloud + ₹3000 evaluator fees
Plus ~50 hours of your time across 4 weekends
✓ TIP
This is what veteran engineers mean by "build the smallest thing
that gives you real information." Four weekends, zero cloud
spend, you go from "we should build Odia TTS" to one of three
clear strategic positions backed by data. That clarity is worth
more than any infrastructure.
Common Failures + Fixes
Things That Will Go Wrong
⚠ WATCH OUT
These aren't hypothetical — they're the issues that hit every
team building this kind of pipeline. Knowing about them in
advance saves hours.
PROBLEM 1
OOM error during inference (Weekend 1)
Cause: IndicF5 + reference audio + intermediate tensors > 15 GB on
T4
# 1. Reduce reference audio length to 5s max
# 2. Add explicit cache clearing every 50 samples
torch.cuda.empty_cache()
# 3. If still OOM, switch to FP16:
model = model.half()
# IndicF5 may have FP16 issues, so test with a single sample first
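For step 1, one way to shorten a reference clip without extra dependencies is Python's built-in `wave` module. This is a sketch under the assumption that your reference files are uncompressed PCM WAV; `trim_wav` is a hypothetical helper name.

```python
import wave


def trim_wav(src_path, dst_path, max_seconds=5.0):
    """Copy at most max_seconds from the start of a PCM WAV file.

    Returns the number of audio frames written.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        n_keep = min(src.getnframes(), int(max_seconds * src.getframerate()))
        frames = src.readframes(n_keep)
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)    # same channels, sample width, and rate
        dst.writeframes(frames)  # header frame count is corrected on close
    return n_keep
```

If you trim the audio, trim the reference transcript to match what is actually spoken in the shortened clip, since the reference text is expected to correspond to the reference audio.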
PROBLEM 2
Whisper transcribes Odia as Hindi or Bengali
Cause: Auto language detection fails — Odia is low-resource in Whisper
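# Force Odia explicitly:
# (First confirm your Whisper checkpoint accepts "or". If it rejects the
# code, you likely need an Odia fine-tuned checkpoint instead.)
segments, info = whisper.transcribe(
    audio_path,
    language="or",  # NOT 'hi' or auto
    initial_prompt="ଓଡ଼ିଆ ଭାଷାରେ ସମ୍ବାଦ",  # Odia hint helps
)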
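A cheap way to catch transcripts that came back in the wrong language is to check which Unicode script the characters belong to. The sketch below counts characters in the Odia, Devanagari, and Bengali blocks; `script_counts`, `looks_odia`, and the 0.9 threshold are hypothetical choices for illustration.

```python
# Unicode block ranges for the scripts Whisper tends to confuse here
SCRIPT_RANGES = {
    "odia": (0x0B00, 0x0B7F),
    "devanagari": (0x0900, 0x097F),
    "bengali": (0x0980, 0x09FF),
}
# Danda and double danda sit in the Devanagari block but are used across Indic scripts
SHARED_PUNCT = {0x0964, 0x0965}


def script_counts(text):
    """Count characters per script block, ignoring shared punctuation."""
    counts = {name: 0 for name in SCRIPT_RANGES}
    for ch in text:
        cp = ord(ch)
        if cp in SHARED_PUNCT:
            continue
        for name, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= cp <= hi:
                counts[name] += 1
                break
    return counts


def looks_odia(text, min_share=0.9):
    """True if at least min_share of the Indic-script characters are Odia."""
    counts = script_counts(text)
    total = sum(counts.values())
    return total > 0 and counts["odia"] / total >= min_share
```

Run `looks_odia()` over each transcribed segment and drop or re-transcribe the ones that fail. This only checks the script, so Odia-script output with wrong words still needs a human spot check.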
PROBLEM 3
Kaggle session dies, lost training progress
Cause: Idle timeout (20 min) or 12 hr limit hit
# 1. Check checkpoint frequency: every 200 steps minimum
# 2. Add a keep-alive cell that prints every 5 min:
import time
while True:
    print(f"Active at step {step}")
    time.sleep(300)
# (Run in a separate cell. The loop blocks the kernel, so it cannot run
# alongside training; it is only useful for evaluation phases.)
# 3. Always check for a checkpoint before assuming work is lost:
!ls /kaggle/working/lora_checkpoints/
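For the 12-hour limit, a small wall-clock budget lets the training loop stop and save on its own terms instead of being cut off. This is a sketch: `SessionBudget` is a hypothetical helper, and it assumes you create it at the start of the session, since it can only measure time from its own creation.

```python
import time


class SessionBudget:
    """Tracks wall-clock time against a hard session limit."""

    def __init__(self, limit_hours=12.0, safety_margin_min=20.0, clock=time.monotonic):
        self._clock = clock
        self._start = clock()
        # Usable seconds: the hard limit minus a margin reserved for the final save
        self._limit_s = limit_hours * 3600 - safety_margin_min * 60

    def remaining_s(self):
        """Seconds left before the safety margin is reached."""
        return self._limit_s - (self._clock() - self._start)

    def should_stop(self, est_step_s=0.0):
        """True when there is no room left for another step of est_step_s seconds."""
        return self.remaining_s() <= est_step_s
```

Inside the training loop you would check `budget.should_stop(...)` once per step, and on True call `save_checkpoint(...)` and break, leaving the margin for copying checkpoints somewhere persistent.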
PROBLEM 4
LoRA target_modules error: 'no modules named X found'
Cause: F5-TTS attention layer naming differs from your config
# Print actual module names:
for name, _ in model.named_modules():
    if 'attn' in name.lower() or 'linear' in name.lower():
        print(name)
# Common actual names in F5-TTS:
# - attn.to_q, attn.to_kv, attn.to_out
# - ff.0, ff.2 (FFN sublayers)
# Update LoraConfig target_modules accordingly
PROBLEM 5
Training loss is NaN after a few hundred steps
Cause: Learning rate too high OR FP16 underflow OR data preprocessing
bug
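# 1. Lower LR first:
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad],
    lr=5e-5,  # was 1e-4
    weight_decay=0.01,
)
# 2. Switch to BF16 (more stable than FP16):
# (T4 and P100 have no native BF16 support, so this may run slowly;
# if it does, fall back to FP32 for training.)
model = model.bfloat16()
# 3. Check data: print 5 random batches, verify mel shapes are sane:
for i, batch in enumerate(loader):
    if i >= 5:
        break
    print(batch['mel'].shape, batch['mel'].mean(), batch['mel'].std())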
PROBLEM 6
Fine-tuned output sounds worse than baseline
Cause: Either: (a) too few training steps, (b) bad training data, (c)
catastrophic forgetting
# Diagnostic checks:
# 1. Did loss decrease meaningfully?
#    Should drop from ~5.0 to ~2.5 over 5000 steps
#    If stuck at 4.0+ → LoRA isn't learning, check target_modules
# 2. Run inference on TRAINING set
#    If model can't reproduce training data → undertrained, train more
#    If it can but test fails → overfitting, reduce steps or add data
# 3. A/B compare on multiple categories
#    If failing only on rare categories → need more diverse training data
#    If failing across the board → something fundamentally wrong with FT
PROBLEM 7
yt-dlp gets blocked by YouTube
Cause: Too many requests, IP rate-limited
# Slow down, use cookies:
yt-dlp --cookies-from-browser firefox \
--sleep-interval 10 \
--max-sleep-interval 30 \
--rate-limit 1M \
URL
# Or: spread downloads across days, smaller batches
PROBLEM 8
Hugging Face Space crashes or runs out of memory
Cause: Free CPU tier has 16 GB RAM total, IndicF5 alone is ~3 GB
# Options:
# 1. Don't load both base and FT versions in memory
#    Load on-demand, unload after use
#
# 2. Upgrade to ZeroGPU (free for HF Pro $9/mo)
#    Provides A100 access for short bursts
#
# 3. Replace inference with cached pre-generated samples
#    Pre-generate 100 example outputs, serve from disk
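Option 1 can be as simple as a one-slot cache that drops the current model before loading the other. The sketch below shows the pattern in plain Python; `SingleSlotCache` is a hypothetical helper, and the loader functions stand in for whatever code you use to build the base and fine-tuned models.

```python
import gc


class SingleSlotCache:
    """Keeps at most one loaded model in memory, loading others on demand."""

    def __init__(self, loaders):
        self._loaders = loaders  # name -> zero-argument function returning a model
        self._name = None
        self._model = None

    def get(self, name):
        if name != self._name:
            # Drop the old model first so both are never resident together
            self._model = None
            self._name = None
            gc.collect()
            self._model = self._loaders[name]()
            self._name = name
        return self._model
```

In `synthesize()` you would replace the direct model references with something like `cache.get("ft" if use_finetune else "base")`. The trade-off is a reload delay whenever a user flips the toggle.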
PROBLEM 9
Native speakers say 'sounds wrong' but can't articulate why
Cause: Prosody issues — model getting words right but
rhythm/intonation wrong
# This is a HARD problem. Common causes:
# 1. Reference audio prosody bleeds through too much
#    Try: shorter reference (4-5s vs 10s)
#    Try: different reference voices, see if pattern is consistent
# 2. Odia phrase boundaries different from IndicF5's training distribution
#    Won't be fixed with current data; needs more Odia-specific FT data
# 3. Pitch contour issues at sentence ends
#    Diagnostic: look at F0 of generated audio, compare to ground truth
#    e.g. librosa.yin(audio, fmin=65, fmax=500, sr=24000) for pitch extraction
#    (fmin and fmax are required; the values here are a rough speech range)
PROBLEM 10
License confusion: can I commercially deploy this?
Cause: IndicF5 is MIT, but training data has mixed licenses
# IndicF5 was trained on:
# - Rasa: research use only (CC BY-NC)
# - IndicTTS: research use
# - LIMMITS: research challenge data
# - IndicVoices-R: CC BY 4.0
# Your fine-tuned LoRA inherits ambiguity.
# For PROTOTYPE/DEMO: fine, MIT covers you
# For COMMERCIAL: need lawyer review, OR retrain only on
# your own data + clearly-licensed data
#
# This is a problem worth solving BEFORE pursuing serious funding.
Honest decision rules
When to give up vs when to push through
Push through if...
- Loss is decreasing during training (even slowly)
- You've made measurable improvement on at least one category in test set
- Native speakers say specific things are better (even if overall MOS is mixed)
- You haven't tried 2 weekends yet
Stop and reassess if...
- Loss is NaN or oscillating wildly after 1000+ steps
- 3 evaluators independently say fine-tuned is worse
- You've spent 2+ weekends debugging infrastructure, not training
- Your test set MOS dropped from baseline — you're moving backward
- Demo gets 100+ uses with consistently negative feedback
▸ NOTE
The hardest skill: distinguishing "this
approach is wrong" from "this approach needs more time." Most
people quit too early on hard problems and persist too long on
dead-end ones. Set explicit decision criteria
before you start, then trust them when the data comes
in.