On our first day as a company, we had pretty much nothing: a small pile of GPUs (eight, to be exact), a command-line interface to Stable Diffusion, and a hunch that this technology could be transformative. But what we had in spades was technical curiosity and frustration. We were frustrated by how long it took to generate an image: 19 seconds for a single SD 1.5 render. So we optimized. Hard.
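For a sense of what that baseline looked like, here is a minimal timing harness. This is a sketch rather than our original CLI; the diffusers API, model ID, and step count are our assumptions for illustration:

```python
# Illustrative baseline only: a minimal SD 1.5 render timed end to end.
# Our starting point was a bare command-line interface; the diffusers
# library, model ID, and step count here are assumptions for the sketch.
import time

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,  # half precision: less memory, faster math
).to("cuda")

start = time.perf_counter()
image = pipe(
    "an astronaut riding a horse", num_inference_steps=50
).images[0]
print(f"render took {time.perf_counter() - start:.1f}s")
```

Depending on the GPU, a naive pipeline like this can take tens of seconds per image. Closing that gap was the whole job.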
That tiny frustration became the foundation for our focus: building infrastructure that could help generative models perform in the real world, on real workloads, at real scale.
Since then, generative media (not just images, but now also video and audio) has exploded in complexity and ambition. This ecosystem is moving faster than anything we've seen since the early days of mobile. The video modeling space, especially, is astonishing in its pace: new models drop nearly every two weeks, yet you never hear about most of them, because anything that isn't best-in-class is quietly shelved.
For a long time, LLMs got most of the spotlight, and for good reason — the leap in capabilities has been incredible. But we chose to focus elsewhere. The creative flow — image, video, sound — is different. It’s fast-twitch, iterative, often subjective. We felt this domain deserved its own native infrastructure, one tuned for speed, for fluid experimentation, and for the particular alchemy of prompt + model + post-processing.
This is why we didn’t compete in LLMs. Instead, we built tools to support the artists, engineers, and indie researchers trying to move pixels, not paragraphs.
In hindsight, that seems obvious. It didn’t at the time.
We believed that generative media would become more than a parlor trick. Optimizing inference times, caching model weights locally, building a distributed orchestration system: these were not glamorous problems, but they were consequential ones. Our infrastructure now manages tens of thousands of GPUs across multiple clouds. We've cut cold start times for video models from five seconds (on Kubernetes) to under one. It's easy to forget how much that matters until you're the user waiting on a loading bar.
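To illustrate the weight-caching idea in its simplest form (a sketch under assumed names, not our orchestration system): warm the on-disk cache once when a node is provisioned, so a cold start pays for deserialization rather than a multi-gigabyte download.

```python
# Hypothetical sketch of the local weight-caching idea, not our actual
# orchestration code. Weights are fetched once when a node joins the
# pool, so every later cold start reads from local disk instead of
# pulling gigabytes over the network. Model ID and path are assumptions.
import torch
from diffusers import StableDiffusionPipeline
from huggingface_hub import snapshot_download

CACHE_DIR = "/mnt/local-ssd/model-cache"  # assumed fast local volume
MODEL_ID = "runwayml/stable-diffusion-v1-5"

def provision_node() -> None:
    """Run once at node-provisioning time: warm the on-disk cache."""
    snapshot_download(MODEL_ID, cache_dir=CACHE_DIR)

def cold_start() -> StableDiffusionPipeline:
    """Per-worker cold start: the download is already done, so this is
    dominated by deserializing weights and moving them to the GPU."""
    return StableDiffusionPipeline.from_pretrained(
        MODEL_ID, cache_dir=CACHE_DIR, torch_dtype=torch.float16
    ).to("cuda")

provision_node()     # once per node
pipe = cold_start()  # per worker, now network-free
```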
The thing is, this ecosystem isn't about raw models, and it isn't just a leaderboard game. Adoption, we've found, hinges on fit to purpose. Video models, especially, still carry distinct personalities: one excels at motion, another at style, yet another at coherence. Sometimes older models hang around because they've nailed a niche (logos, say, or minimalist renders) that newer ones weren't trained for. In the image domain, fine-tuning happens 1,000x more often than with LLMs, simply because visual needs are so idiosyncratic. Everyone wants their own look, their own style, their own brand consistency.
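To make the fine-tuning point concrete: in the image world, a custom look typically ships as a small LoRA adapter layered onto a shared base model. A hedged sketch; the LoRA repository name below is a placeholder, not a real artifact:

```python
# A minimal sketch of how idiosyncratic looks typically ship in the
# image domain: small LoRA adapters layered onto a shared base model.
# The LoRA repository below is a placeholder, not a real artifact.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Swap in a brand-specific style without retraining base weights;
# the same base can serve many such adapters.
pipe.load_lora_weights("your-org/brand-style-lora")  # hypothetical repo

image = pipe("product shot, minimalist brand style").images[0]
image.save("branded.png")
```

Because an adapter is a fraction of a full checkpoint's size, serving hundreds of custom styles against one base model stays tractable.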
So our approach has been to remain flexible rather than canonical. We're building not just infrastructure but a kind of marketplace: a full platform that connects developers with the best tools for their use case, while supporting model developers in packaging and scaling their work. We sometimes partner ahead of release, helping push the models themselves to the performance thresholds that make them viable in production.
This tight feedback loop — build, test, optimize, deploy — is only possible because our team structure supports it. Most of us are engineers. More than half work directly in applied ML. Our go-to-market team is small by design, and salespeople are chosen less for their persuasion and more for their empathy — we look for people who want to serve engineers, not just sell to them.
Part of what’s kept us aligned is maintaining a deep closeness with customers. We live inside Slack Connect channels, troubleshooting side-by-side, logging usage patterns, listening far more than we pitch. Even today, founders still carry customer conversations. It helps us stay grounded.
Looking ahead, we think 2025 will be the year AI video becomes ubiquitous, not just in flashy demos but in everyday pipelines. The tech stack isn't finished, not by a long shot, but already the edges are being redrawn. Ad creatives are generated on the fly. Short-form videos are A/B tested in real time. Some of our favorite experiments are the unexpected ones: IKEA playing with virtual room unpacking, or game studios exploring generative cutscenes that respond to player behavior.
This is where we're headed — generative media not as a novelty, but as core infrastructure for how things get made. The rise of video games with interactive story logic or adaptive visuals is not five years away. It’s next year.
None of this feels accidental. Generative video’s rise, in particular, feels inevitable now. The pieces are in place: model capabilities are catching up to real creative vision, infrastructure is finally responsive enough for complex outputs, and the number of builders is multiplying.
For us, staying useful means staying curious. It means responding quickly when the terrain shifts. It means holding performance and reliability as sacred, because even the most beautiful model can't succeed if it stalls under load.
We’ll keep investing in what makes creative tooling not just possible, but delightful. That might mean faster inference, smarter scheduling, or better surfaces for chaining tasks. But it will always mean staying unnervingly close to our customers — listening at the edge, where the future is being sketched in real time.