From Kill the Ums to Voice Infrastructure

A post-mortem on my filler-removal app

1. Context

In early 2025 I built a web app that takes an audio recording, removes filler words like “um”, “uh”, and “äh”, and outputs a cleaner version in the speaker’s own cloned voice.

Under the hood it used:

A simple web frontend and backend (Bolt.new, Supabase, external APIs).
Stripe for payments.
A pay-as-you-go model via Stripe Checkout: users bought processing credits priced per minute of input audio, with no subscriptions.
ASR (Whisper or similar) to transcribe and timestamp the audio.
An LLM to remove fillers and light disfluencies from the transcript.
ElevenLabs fast voice cloning to re-synthesise the cleaned text in the original speaker’s voice.

The goal was to offer podcasters and creators a simple way to sound more polished, without manual editing and without losing their authentic voice.

The app shipped. It technically worked. It got essentially zero traction.

This is a post-mortem of that project: what I thought would happen, what actually happened, and what I learned about AI audio, user needs, and distribution along the way.

2. Original Hypothesis

2.1 Problem Hypothesis

Belief

Podcasters and creators hate their own filler words and disfluencies. They would gladly pay for an automatic way to clean those up.

More specifically:

Filler words make people sound unprofessional.
Manual editing is tedious and time-consuming.
Existing tools do some cleanup, but there is room for a dedicated and smarter product.

2.2 Solution Hypothesis

Belief

A web app that removes fillers and re-voices the cleaned script in the user’s own cloned voice is a valuable and differentiated product.

Key assumptions baked into that:

Full re-synthesis (TTS plus cloning) is better than surgical waveform editing.
Podcasters are okay with an AI version of their voice as long as it sounds good.
“Fewer ums” is a strong enough value proposition on its own.
It is fine that the product lives outside their main editing tools.

Most of these assumptions turned out to be either wrong or only partially true.

3. What I Actually Built

3.1 Product Shape

The shipped version allowed users to:

Upload audio or record in the browser.
Create a one-time voice clone via ElevenLabs samples.

The pipeline looked like this:

Transcribe audio with timestamps.
Use an LLM to clean the text, remove fillers, and fix some disfluencies.
Use ElevenLabs to re-synthesise the cleaned script in the cloned voice.
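
The three steps above can be sketched as a chain of functions. This is a minimal illustration, not the app's actual code: the stage signatures, the `Word` type, and `run_pipeline` are hypothetical, and the real ASR, LLM, and TTS clients would be wrapped to fit them.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Word:
    text: str
    start: float  # seconds from the beginning of the recording
    end: float    # seconds from the beginning of the recording


# Hypothetical stage signatures (assumptions for this sketch only).
Transcribe = Callable[[bytes], List[Word]]   # audio -> timestamped words
CleanText = Callable[[str], str]             # raw transcript -> cleaned script
Synthesise = Callable[[str, str], bytes]     # (script, voice_id) -> audio


def run_pipeline(
    audio: bytes,
    voice_id: str,
    transcribe: Transcribe,
    clean_text: CleanText,
    synthesise: Synthesise,
) -> bytes:
    """Run transcription, text cleanup, and re-synthesis in sequence."""
    words = transcribe(audio)
    raw_transcript = " ".join(word.text for word in words)
    cleaned_script = clean_text(raw_transcript)
    return synthesise(cleaned_script, voice_id)
```

Passing the stages in as callables keeps each provider swappable and makes the chain easy to exercise with stubs before any paid API is involved.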

The output was:

A new, polished audio file.
Optional subtitles.

Under the hood it used a Bolt.new frontend, Supabase for authentication, storage, and edge functions, Stripe Checkout for credits, and ElevenLabs plus Whisper for the heavy lifting.
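
The per-minute credit model reduces to a small calculation. The sketch below assumes that every started minute costs one full credit and uses a made-up price parameter; the app's real rounding rule and prices are not stated in this post.

```python
import math


def credits_needed(duration_seconds: float) -> int:
    """Return the number of one-minute credits for a clip.

    Assumption: any started minute is billed as a full credit.
    """
    if duration_seconds <= 0:
        return 0
    return math.ceil(duration_seconds / 60)


def price_cents(duration_seconds: float, cents_per_minute: int) -> int:
    """Return the price in cents; cents_per_minute is a hypothetical input."""
    return credits_needed(duration_seconds) * cents_per_minute
```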

3.2 What Worked Technically

The pipeline itself worked and the before-and-after difference was clear.
Voice cloning was surprisingly good for short-form content.
The infrastructure choices (Supabase plus external AI APIs plus Stripe) were fast to build and easy to operate.

I also learned a lot about:

ASR → text → TTS chains.
Latency, batching, and cost trade-offs.
Voice cloning UX and consent considerations.

On paper, it was a nice project. It just was not a good product.

3.3 Timeline and Numbers

I built the first version over a couple of days for a hackathon. After that, I kept it online for a few months as a live experiment.

During that time:

Traffic was in the low hundreds of visitors in total.
Only a handful of people ever uploaded a file.
No one converted to paid minutes.

I also shared it in a few niche communities and entered it into a hackathon. Nothing suggested there was a strong pull for this product in this particular shape.

From a numbers perspective, the app, ClipClean, never got beyond the stage of “toy used mostly by the builder”.

4. Why It Did Not Get Traction

4.1 The Problem Was Already “Good Enough” Solved

Most podcasters who care about filler words are already using tools like Descript, Cleanvoice, Podcastle, Adobe Podcast and others.

These tools:

Remove fillers in one click.
Handle silence trimming, noise reduction, compression and similar tasks.
Live directly in the editing workflow.

For the average podcaster, the pain of “ums” is not large enough to justify:

Cloning a voice.
Uploading episodes to a separate app.
Waiting for TTS re-synthesis.

Lesson: competing against “good enough” inside a user’s existing tool is much harder than it looks from the outside.

4.2 The Solution Was Heavier Than the Pain

The pipeline (transcribe, clean the text, re-voice in your own clone) is objectively cool, but that was not enough.

For most users this is a sledgehammer for a thumbtack:

They do not need full re-synthesis to remove fillers.
They are not actively asking for an AI replica of their voice.
Many are happy with some imperfections, because authenticity matters more.

The result was technically impressive and practically overkill.
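
For contrast, the lighter alternative, cutting fillers out of the original waveform instead of re-synthesising it, can be sketched from word-level timestamps alone. This is an illustration under assumptions: the ASR is assumed to return fillers as separate timestamped words (which is not guaranteed), and `FILLERS` and `keep_segments` are hypothetical names.

```python
from typing import List, Tuple

# Illustrative set, taken from the fillers named in this post.
FILLERS = {"um", "uh", "äh"}

Word = Tuple[str, float, float]  # (text, start_seconds, end_seconds)


def keep_segments(words: List[Word], total_duration: float) -> List[Tuple[float, float]]:
    """Return the (start, end) spans of the original audio to keep."""
    segments: List[Tuple[float, float]] = []
    cursor = 0.0  # start of the span currently being kept
    for text, start, end in words:
        if text.lower().strip(".,!?") in FILLERS:
            # Close the kept span just before the filler, then skip past it.
            if start > cursor:
                segments.append((cursor, start))
            cursor = max(cursor, end)
    if cursor < total_duration:
        segments.append((cursor, total_duration))
    return segments
```

The resulting spans can be concatenated by any audio tool, and the speaker's real voice is never replaced.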

Lesson: just because something is technically possible and fun does not mean it maps to a strong enough user desire.

4.3 Wrong Default Target: Generic Podcasters

I implicitly optimised for a “generic podcaster” as the user:

Hosts who want cleaner sounding shows.
People annoyed by their own speech patterns.

In reality, this group splits into three subgroups:

Authenticity-first creators. They accept fillers as part of their style and removing everything actually feels wrong.
Process-heavy shows. They already have an editor or a solid editing workflow and they are not hunting for a new tool for one specific step.
Non-native speakers. This is the interesting niche, but the product and messaging were never focused squarely on them.

The one group where the idea does resonate is:

Non-native speakers who are self-conscious about sounding hesitant and would like more fluent-sounding audio in their own voice.

I never leaned fully into that niche, and the product was never clearly designed or branded as a “speak more fluently in your second language” tool.

Lesson: a vague “for podcasters” target hides the one niche that might actually care enough.

4.4 The Video Story Was Weak

For audio-only content, timing is not a big problem.

For video, which is where many creators live, timing becomes much more important:

Re-synthesised audio is shorter and differently paced.
Lip sync breaks for talking heads.
Fixing that properly means either editing the video to match the new audio, or generating a synthetic talking head that matches the TTS.

Both of those options are effectively different products.

So the app was implicitly audio-only in a world where much creator output is video-first.

Lesson: full TTS re-synthesis and lip-visible video do not work well together unless you are willing to solve video as well.
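
The lip-sync problem can be made concrete by tracking how far the new audio drifts from the original, segment by segment. This is a rough sketch: the function name is hypothetical, and it assumes the original and re-synthesised audio have already been split into matching segments.

```python
from itertools import accumulate
from typing import List


def cumulative_drift(original: List[float], synthesised: List[float]) -> List[float]:
    """Return the running offset in seconds after each segment.

    Inputs are per-segment durations in seconds. Positive values mean the
    re-synthesised audio is running ahead of the original video.
    """
    if len(original) != len(synthesised):
        raise ValueError("segment lists must have the same length")
    deltas = [o - s for o, s in zip(original, synthesised)]
    return list(accumulate(deltas))
```

Even when each segment loses only a fraction of a second, the offsets add up, which is why a talking-head video falls out of sync within a few sentences.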

4.5 Distribution Reality and SEO Check

There was also a distribution problem.

There is no real plugin marketplace for this kind of thing:

Podcast hosts and directories do not expose audio-processing plugin stores.
Descript, Riverside and similar tools do not have public marketplaces for third-party audio effects.
Audio plugin marketplaces for VST or AU exist, but they are music-oriented and require a very different product shape.

There was no natural place to list it and get organic discovery. The app would have had to win attention the hard way, by manually reaching creators, running content, or building a full distribution engine.

I also did a sanity check on SEO. When I looked at search volumes for obvious queries such as “remove filler words from audio” or “podcast um remover”, demand was essentially non-existent. The organic search channel for this exact value proposition was close to zero.

Given that:

The product did not live inside existing tools.
There was no marketplace to piggyback on.
SEO demand for the “remove ums” idea was tiny.

I did not invest heavily in further distribution, because the product–market fit signals were weak from the start.

I could have forced the issue with aggressive outreach, cold emails, paid ads, or content marketing. Given the weak pull, I decided not to spend weeks or months trying to brute-force adoption. The distribution story was not only hard. It also did not feel worth solving for this particular product.

Lesson: there is no distribution cheat code here. Either the product is embedded where work already happens, or it has to earn its own audience from scratch. Ideally, search demand should at least confirm that an audience exists.

4.6 Even as an Asset, It Did Not Move

As a last step, I tried to sell the project as an asset (code plus domain plus pipeline) on a marketplace such as Flippa.

It was pre-revenue, with no traction and a narrow use case.
Interest was low and it did not sell.

That is another data point. Even among buyers who like small AI tools, a filler-removal app with no traction and no clear wedge is a hard sell.

Lesson: if both users and acquirers are lukewarm, the right move is to archive the project and keep only the parts that compound, such as skills, infrastructure patterns, and code.

5. What I Learned About AI Audio and Voice

Beyond the specific product, the project was a good deep dive into the current state of AI audio.

5.1 The Core Stack Is Mature

Whisper-class ASR models are strong enough to drive production pipelines.
Modern voice cloning, such as ElevenLabs, can produce convincing and on-brand voices with relatively little data.
LLMs are capable of decent disfluency removal and light rewriting on real and messy transcripts.

This combination opens up many possibilities beyond “kill the ums”.

5.2 UX, Consent, and Expectations Matter

Working with cloned voices forces you to think about:

How you obtain and store consent.
How you communicate what is synthetic versus original.
How much control users have over what is re-voiced and what stays real.
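
One way to make the consent question concrete is to treat consent as an explicit record and not as a checkbox state. The sketch below is purely illustrative: the `VoiceConsent` type and all of its fields are assumptions, not a description of what the app actually stored.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class VoiceConsent:
    """Hypothetical record of a user's consent to clone their voice."""

    user_id: str
    voice_id: str
    consent_text_version: str  # which wording the user actually agreed to
    granted_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    revoked_at: Optional[datetime] = None

    def is_active(self) -> bool:
        """Consent counts only while it has not been revoked."""
        return self.revoked_at is None
```

Recording the version of the consent text and a revocation timestamp makes it possible to answer later what a user agreed to and when they withdrew it.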

Some people are very comfortable with an AI version of their voice. Others are uneasy once they hear it. That alone changes how you need to design the product, the onboarding, and the defaults.

Lesson: with voice, the technical pipeline is only half the problem. The other half is comfort, expectations, and trust.

5.3 Timing, Alignment, and Human Perception

Small timing differences that do not matter for audio-only can be very noticeable on video.

I gained a much better intuition for:

How tightly audio needs to track original pacing to feel natural.
Where listeners are sensitive versus where they do not care.
How important context (music, b-roll, or slides) is for hiding small desync.

These instincts are reusable for any future voice or dubbing product.

6. Opportunity: Pivoting the Tech, Not the Product

While this specific product did not land, the underlying machinery is reusable.

The stack I built – ASR → text → LLM → TTS with voice cloning and per-minute billing – can be a foundation for more promising directions, for example:

Multilingual dubbing
- Take scripts or recordings and produce fluent voiceovers in multiple languages using the same voice identity.
- Here full re-synthesis is a clear advantage, because you cannot just edit the original waveform.
Script-to-speech tools for founders and experts
- Turn written content, such as blog posts, documentation, or newsletters, into narrated audio in the author’s voice.
- This plays nicely with asynchronous content and does not fight existing editing workflows.
Internal “voice infrastructure” for future projects
- Reuse the ASR → LLM → TTS chain as an internal component for other SaaS ideas.
- Avoid solving auth, billing, file handling, and basic orchestration from scratch next time.

Each of these plays more to the strengths of the stack, without competing head-on with Descript’s filler-removal button.

The key shift is to focus on use cases where full re-synthesis is clearly an advantage, not an over-engineered way to solve an editing problem that is already solved.

7. Biggest Takeaways

If I had to condense the whole experience into a few points:

Distribution and workflow integration matter more than clever pipelines.
- If you are not inside the tools people already use, you need a very strong and distinct value proposition.
“Technically impressive” is not the same as “emotionally compelling”.
- Getting rid of “um” and “ah” is nice. It is rarely the thing people are desperate to fix.
Niches beat generic personas.
- “For podcasters” was too broad. “For non-native speakers who want to sound more fluent in English” would have been more honest and more interesting.
Post-mortems are assets, not obituaries.
- The code, the infrastructure patterns, and the learnings about AI audio are a toolbox. The fact that this particular product did not take off does not make the work wasted.

8. Closing

I did not get the outcome I imagined when I started this app. There is no neat MRR chart, no case studies from happy podcasters, no “we scaled to X users” story.

What I do have is:

A tested, working pipeline for ASR → LLM → TTS with cloned voices.
A much clearer mental model of the audio and podcasting tool landscape.
A concrete reminder to validate distribution and workflow fit earlier.

That is enough to call the project a useful experiment, and a good foundation for whatever voice-related thing comes next.
