Founder Notes
From Kill the Ums to Voice Infrastructure
What a filler-removal app taught me about niches, infra cost, and product focus
A post-mortem on my filler-removal app
In early 2025 I built a web app that takes an audio recording, removes filler words like “um”, “uh”, and “äh”, and outputs a cleaner version in the speaker’s own cloned voice.
Under the hood it used:
The goal was to offer podcasters and creators a simple way to sound more polished, without manual editing and without losing their authentic voice.
The app shipped. It technically worked. It got essentially zero traction.
This is a post-mortem of that project: what I thought would happen, what actually happened, and what I learned about AI audio, user needs, and distribution along the way.
Belief
Podcasters and creators hate their own filler words and disfluencies. They would gladly pay for an automatic way to clean those up.
More specifically:
Belief
A web app that removes fillers and re-voices the cleaned script in the user’s own cloned voice is a valuable and differentiated product.
Key assumptions baked into that:
Most of these assumptions turned out to be either wrong or only partially true.
The shipped version allowed users to:
The pipeline looked like this:
The output was:
Under the hood it used a Bolt.new frontend, Supabase for authentication, storage, and edge functions, Stripe Checkout for credits, and ElevenLabs plus Whisper for the heavy lifting.
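To make the middle "clean text" step of that pipeline concrete, here is a minimal Python sketch of filler removal on a transcript. It is not the app's actual code: the filler list, the `remove_fillers` name, and the regex approach are all assumptions for illustration, and a plain regex pass is much cruder than an LLM-based cleanup.

```python
import re

# Hypothetical filler list (an assumption); the post only names "um", "uh" and "äh".
FILLERS = ("um", "uh", "äh", "ähm")

# Match a filler as a whole word, plus an optional trailing comma and whitespace.
_FILLER_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(f) for f in FILLERS) + r")\b,?\s*",
    flags=re.IGNORECASE,
)


def remove_fillers(transcript: str) -> str:
    """Return the transcript with standalone filler words stripped out."""
    cleaned = _FILLER_RE.sub("", transcript)
    # Remove any space left in front of punctuation, then collapse repeated spaces.
    cleaned = re.sub(r"\s+([,.!?])", r"\1", cleaned)
    cleaned = re.sub(r"\s{2,}", " ", cleaned)
    return cleaned.strip()


if __name__ == "__main__":
    print(remove_fillers("So, um, I think, uh, we should ship."))
```

The cleaned text would then be handed to the text-to-speech step. This sketch does not restore capitalisation when a sentence starts with a filler, and it ignores repeated words and false starts.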
I also learned a lot about:
On paper, it was a nice project. It just was not a good product.
I built the first version over a couple of days for a hackathon. After that, I kept it online for a few months as a live experiment.
During that time:
I also shared it in a few niche communities and entered it into a hackathon. Nothing suggested there was a strong pull for this product in this particular shape.
From a numbers perspective, ClipClean never got beyond the stage of “toy used mostly by the builder”.
Most podcasters who care about filler words are already using tools like Descript, Cleanvoice, Podcastle, Adobe Podcast and others.
These tools:
For the average podcaster, the pain of “ums” is not large enough to justify:
Lesson: competing against “good enough” inside a user’s existing tool is much harder than it looks from the outside.
The pipeline is objectively cool, but that was not enough.
Transcribe, clean text, re-voice in your own clone.
For most users this is a sledgehammer for a thumbtack:
The result was technically impressive and practically overkill.
Lesson: just because something is technically possible and fun does not mean it maps to a strong enough user desire.
I implicitly optimised for a “generic podcaster” as the user:
In reality, this group splits into three subgroups:
The one group where the idea does resonate is:
Non-native speakers who are self-conscious about sounding hesitant and would like more fluent-sounding audio in their own voice.
I never leaned fully into that niche, and the product itself was never clearly designed or branded as a “speak more fluently in your second language” tool.
Lesson: a vague “for podcasters” target hides the one niche that might actually care enough.
For audio-only content, timing is not a big problem.
For video, which is where many creators live, timing becomes much more important:
Both of those options are essentially different products.
So the app was implicitly audio-only in a world where much creator output is video-first.
Lesson: full TTS re-synthesis and lip-visible video do not work well together unless you are willing to solve video as well.
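The timing problem can be illustrated with a small Python sketch that compares each original segment's length with the length of its re-synthesized replacement and tracks the running drift. The `Segment` fields, the function name, and the sample values are hypothetical; the post does not describe how the app measured timing.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    start: float         # start time in the original recording, in seconds
    end: float           # end time in the original recording, in seconds
    tts_duration: float  # length of the re-synthesized audio, in seconds


def cumulative_drift(segments: List[Segment]) -> List[float]:
    """Running offset after each segment; negative means the new audio runs short."""
    drift = 0.0
    running = []
    for seg in segments:
        drift += seg.tts_duration - (seg.end - seg.start)
        running.append(round(drift, 3))
    return running


if __name__ == "__main__":
    # Illustrative values only (an assumption, not measurements from the app).
    sample = [Segment(0, 4, 3.5), Segment(4, 10, 5.2), Segment(10, 12, 2.1)]
    print(cumulative_drift(sample))
```

For audio-only output the drift simply makes the file shorter. For video, every non-zero value is a point where the picture and the new audio no longer line up.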
There was also a distribution problem.
There is no real plugin marketplace for this kind of thing:
There was no natural place to list it and get organic discovery. The app would have had to win attention the hard way, by manually reaching creators, running content, or building a full distribution engine.
I also did a sanity check on SEO. When I looked at search volumes for obvious queries such as “remove filler words from audio” or “podcast um remover”, demand was essentially non-existent, so organic search was not a viable channel for this exact value proposition.
Given that:
I did not invest heavily in further distribution, because the product–market fit signals were weak from the start.
I could have forced the issue with aggressive outreach, cold emails, paid ads, or content marketing. Given the weak pull, I decided not to spend weeks or months trying to brute-force adoption. The distribution problem was hard, and it did not feel worth solving for this particular product.
Lesson: there is no distribution cheat code here. Either the product is embedded where work already happens, or it has to earn its own audience from scratch. Ideally, SEO research should at least show that an audience exists.
As a last step, I tried to sell the project as an asset (code plus domain plus pipeline) on a marketplace such as Flippa.
That is another data point. Even among buyers who like small AI tools, a filler-removal app with no traction and no clear wedge is a hard sell.
Lesson: if both users and acquirers are lukewarm, the right move is to archive the project and keep only the parts that compound, such as skills, infrastructure patterns, and code.
Beyond the specific product, the project was a good deep dive into the current state of AI audio.
This combination opens up many possibilities beyond “kill the ums”.
Working with cloned voices forces you to think about:
Some people are very comfortable with an AI version of their voice. Others are uneasy once they hear it. That alone changes how you need to design the product, the onboarding, and the defaults.
Lesson: with voice, the technical pipeline is only half the problem. The other half is comfort, expectations, and trust.
Small timing differences that do not matter for audio-only can be very noticeable on video.
I gained a much better intuition for:
These instincts are reusable for any future voice or dubbing product.
While this specific product did not land, the underlying machinery is reusable.
The stack I built – ASR → text → LLM → TTS with voice cloning and per-minute billing – can be a foundation for more promising directions, for example:
Multilingual dubbing
Script-to-speech tools for founders and experts
Internal “voice infrastructure” for future projects
Each of these plays more to the strengths of the stack, without competing head-on with Descript’s filler-removal button.
The key shift is to focus on use cases where full re-synthesis is clearly an advantage, not an over-engineered way to solve an editing problem that existing tools already handle.
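Of the reusable pieces, per-minute billing is the simplest to carry over. The Python sketch below shows one way to turn audio length into credits. The rule of one credit per started minute and the function names are assumptions; the post does not state the app's actual pricing logic.

```python
import math


def credits_for(duration_seconds: float, credits_per_minute: int = 1) -> int:
    """Credits needed for a recording, charging for every started minute (assumed rule)."""
    if duration_seconds <= 0:
        return 0
    return math.ceil(duration_seconds / 60) * credits_per_minute


def can_process(balance: int, duration_seconds: float) -> bool:
    """True if the user's credit balance covers the recording."""
    return balance >= credits_for(duration_seconds)


if __name__ == "__main__":
    print(credits_for(61), can_process(1, 61))
```

Rounding up per started minute keeps the arithmetic easy to explain to users. A real system would also need to decide what happens to credits when a processing job fails.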
If I had to condense the whole experience into a few points:
Distribution and workflow integration matter more than clever pipelines.
“Technically impressive” is not the same as “emotionally compelling”.
Niches beat generic personas.
Post-mortems are assets, not obituaries.
I did not get the outcome I imagined when I started this app. There is no neat MRR chart, no case studies from happy podcasters, no “we scaled to X users” story.
What I do have is:
That is enough to call the project a useful experiment, and a good foundation for whatever voice-related thing comes next.