
Building MultiSub: From Weekend Idea to 3,000+ Users

February 10, 2026 · 6 min read
SaaS · AI · TypeScript · OpenAI Whisper · Next.js

The Problem Worth Solving

I watch a lot of content in languages I don't speak fluently. Korean dramas with my girlfriend, Japanese interviews, English podcasts with mediocre auto-captions. The subtitle experience on most platforms is either locked behind a paywall, machine-generated garbage, or simply absent.

I kept thinking: someone should build a clean, fast subtitle tool that actually works. Then I realized I was a developer. So I built it.

Starting Small

MultiSub started as a weekend experiment. I wanted to see how good OpenAI's Whisper model had gotten at transcription. The answer: very good. Surprisingly good. Good enough to build a product around.

The first version was embarrassingly simple — a Next.js app with a file upload, a server action that called the Whisper API, and a textarea that showed the output. No database, no auth, no styling. Just a proof of concept I could share with people to get reactions.

The reactions were positive. So I kept going.

The Stack Decision

For a SaaS that needs to ship fast and scale reliably, I went with what I know:

  • Next.js — full-stack in one repo, great DX, easy to deploy
  • TypeScript — I don't touch production code without it anymore
  • Supabase — auth + database + storage without the infra headaches
  • Stripe — billing that just works
  • Vercel — deployments that take 30 seconds

The only unusual choice was keeping everything in a single Next.js monorepo instead of splitting into a separate API. For a solo developer, fewer moving parts means faster iteration. When you're building alone, time is your scarcest resource.

The Architecture

The core of MultiSub is a processing pipeline:

// Simplified processing flow
async function processFile(fileUrl: string, options: ProcessOptions) {
  // 1. Download and validate the file
  const audio = await extractAudio(fileUrl);

  // 2. Chunk for long files (Whisper has a 25MB limit)
  const chunks = await splitAudioIfNeeded(audio, MAX_CHUNK_SIZE);

  // 3. Transcribe each chunk in parallel
  const transcriptions = await Promise.all(
    chunks.map((chunk) => transcribeChunk(chunk, options.language))
  );

  // 4. Merge and format as SRT/VTT
  return formatSubtitles(mergeTranscriptions(transcriptions), options.format);
}
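Step 4 is mostly string formatting. As a minimal sketch of the SRT side, assume merged cues carry start/end times in seconds (the `Cue` shape and function names here are illustrative, not MultiSub's actual code):

```typescript
interface Cue {
  start: number; // seconds from the beginning of the file
  end: number;   // seconds
  text: string;
}

// Format seconds as the SRT timestamp "HH:MM:SS,mmm"
function srtTimestamp(seconds: number): string {
  const ms = Math.round(seconds * 1000);
  const h = Math.floor(ms / 3_600_000);
  const m = Math.floor((ms % 3_600_000) / 60_000);
  const s = Math.floor((ms % 60_000) / 1000);
  const frac = ms % 1000;
  const pad = (n: number, w = 2) => String(n).padStart(w, "0");
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(frac, 3)}`;
}

// Cues become numbered blocks separated by blank lines
function toSrt(cues: Cue[]): string {
  return cues
    .map(
      (cue, i) =>
        `${i + 1}\n${srtTimestamp(cue.start)} --> ${srtTimestamp(cue.end)}\n${cue.text}\n`
    )
    .join("\n");
}
```

VTT is nearly the same format with `.` instead of `,` in timestamps and a `WEBVTT` header, which is why a single formatter with a format flag covers both.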

The chunking logic was the trickiest part. Whisper's API caps uploads at 25MB, but users upload feature films. You need to split on silence boundaries, not arbitrary timestamps; otherwise you get broken sentences at the seams. Getting this right took three rewrites.

The Translation Layer

Transcription was the first half. Translation made it actually useful.

I integrated with the OpenAI chat API for translations — it handles context far better than dedicated translation APIs for subtitle text. Subtitles have unique quirks: they're short, they carry timing, they sometimes reference previous lines. A general translation API treats each line as independent. GPT-4 understands the conversation.

The prompt engineering here matters a lot:

const systemPrompt = `You are a professional subtitle translator.
Preserve the tone, timing cues, and conversational flow.
Keep translations concise — subtitles must be readable at speed.
Output only the translated text, maintaining the same line structure.`;

It's simple, but getting this right took iteration. Early versions produced technically correct but unnaturally long translations that nobody wanted to read at subtitle speed.
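To give the model conversational context, lines are sent in batches rather than one at a time. A sketch of the batching side, with the chat call abstracted behind an injected function (in the real app that function would wrap the OpenAI chat completions endpoint; the numbered-line convention and helper names are illustrative choices):

```typescript
// Stand-in for a call to a chat completions API:
// takes system + user prompts, returns the model's reply text.
type ChatComplete = (system: string, user: string) => Promise<string>;

// Number each line so the model's output can be mapped back to cues
function buildUserPrompt(lines: string[], targetLanguage: string): string {
  const numbered = lines.map((line, i) => `${i + 1}. ${line}`).join("\n");
  return `Translate the following subtitle lines to ${targetLanguage}:\n${numbered}`;
}

// Strip the leading "N. " markers from the model's reply
function parseNumberedReply(reply: string): string[] {
  return reply
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => line.replace(/^\s*\d+\.\s*/, ""));
}

async function translateBatch(
  lines: string[],
  targetLanguage: string,
  systemPrompt: string,
  chatComplete: ChatComplete
): Promise<string[]> {
  const reply = await chatComplete(
    systemPrompt,
    buildUserPrompt(lines, targetLanguage)
  );
  return parseNumberedReply(reply);
}
```

Numbering the lines is what lets each translated line be matched back to its original timing cue, which a free-form reply would make unreliable.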

Growing to 3,000 Users

I didn't have a marketing strategy. I had a product that worked well, and I talked about it in the places where people who needed it spent time.

Language learning communities were the most receptive. People studying Japanese or Korean often can't find subtitles for obscure content. They were generating subtitles manually, or suffering through bad auto-captions. MultiSub solved a real problem for them, and they told each other about it.

A few things that worked:

Showing the output quality. Instead of explaining what the tool does, I'd drop side-by-side comparisons of YouTube auto-captions vs Whisper transcriptions. The quality difference is obvious and visual.

Free tier that actually works. Too many tools give you a crippled free tier to force upgrades. MultiSub's free tier processes real files. People convert. That trust-building matters more than any growth hack.

Responding to every support request personally. In the early days, every user who emailed got a response from me within a few hours. Several of them became paying customers specifically because of that. Word-of-mouth is underrated.

What I Got Wrong

Underpricing. My initial prices were too low. When I raised them, churn didn't spike the way I expected. People who get value from tools pay for them. I left money on the table for months.

Building features nobody asked for. I spent two weeks building a batch processing dashboard that exactly three users have ever used. Those two weeks could have gone into a much-requested feature: subtitle burn-in (embedding subtitles directly into the video). I built that in four days and it's now one of the most-used features.

Not talking to users early enough. I launched and waited to see what would happen instead of proactively reaching out to early users. The gap between "what you think users want" and "what they actually want" is large. User interviews close it fast.

The Technical Lesson I Keep Relearning

Every time I try to be clever — caching aggressively, premature optimization, complex state management — it comes back to bite me. The code that's lasted longest is the simple code. The pipeline I showed above hasn't changed in six months. The complex smart-chunking algorithm I wrote early on? Rewrote it twice, then replaced it with a dumb version that was easier to reason about.

Build the simple version first. You can always add complexity. You can't easily remove it.

What's Next

MultiSub is profitable as a side project, but it's not where I spend most of my engineering hours day-to-day. My day job at Janitos gives me interesting problems in document processing and AI automation. Some of those learnings feed back into how I think about MultiSub's architecture.

The plan for 2026 is straightforward: better real-time processing (right now it's async with status polling — users want instant), speaker diarization (who's speaking when), and better mobile support. The user base is surprisingly mobile-heavy for a file processing tool.
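The current async flow mentioned above boils down to a polling loop on the client. A minimal sketch, assuming a status endpoint wrapped as `getStatus(jobId)` (the names, states, and intervals are illustrative, not MultiSub's actual API):

```typescript
type JobStatus = "queued" | "processing" | "done" | "failed";

// Poll until the job leaves the queued/processing states or we hit the deadline
async function waitForJob(
  getStatus: (jobId: string) => Promise<JobStatus>,
  jobId: string,
  intervalMs = 2000,
  timeoutMs = 10 * 60 * 1000
): Promise<JobStatus> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const status = await getStatus(jobId);
    if (status === "done" || status === "failed") return status;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Job ${jobId} timed out after ${timeoutMs}ms`);
}
```

Moving to real-time means replacing this loop with a push channel (server-sent events or a websocket) so progress arrives as it happens instead of on the next poll tick.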

The journey from weekend project to a product with real users has been the most valuable education I've gotten as a developer. No bootcamp or tutorial teaches you what shipping something real does.

If you're on the fence about building that thing you've been thinking about — just start. The first version doesn't have to be good. It just has to exist.