An issue we've had when looking into some of these is that they provide a layer of software abstraction we're not looking for. I don't want to wrap my model code in some provider's bespoke library; I just want to run NVIDIA Triton, either by providing a container image or a model repo. I only want the inference provider to handle the hardware.
I understand that's exactly what those providers _don't_ want, because it means they can't lock us in. But particularly when comparing an inference provider to GCP, where we already run everything in Triton on GKE, I don't want to rewrite my code just to see what their hardware layer is like.
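For context, "providing a model repo" really is the whole integration surface with Triton: a directory tree with one folder per model, served by the stock container. A minimal sketch (model names, formats, and the image tag are illustrative placeholders, not anything provider-specific):

```
model_repository/
├── model_a/               # hypothetical model name
│   ├── config.pbtxt
│   └── 1/
│       └── model.onnx
└── model_b/
    ├── config.pbtxt
    └── 1/
        └── model.plan     # e.g. a TensorRT engine

# Serve it with the stock Triton image -- no provider SDK involved:
docker run --gpus=all -p 8000:8000 \
  -v $(pwd)/model_repository:/models \
  nvcr.io/nvidia/tritonserver:24.05-py3 \
  tritonserver --model-repository=/models
```

That's the entire contract I'd want a provider to honor: take the repo (or the image), give me GPUs.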
Another complication is that we often run multiple tightly integrated models for a single application, where having them on the same GPU is critical. This is tricky or impossible in some inference providers' frameworks.
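In Triton, co-locating models is just an `instance_group` entry in each model's `config.pbtxt`. A sketch using the same hypothetical models, both pinned to GPU 0 (names and platform are placeholders):

```
# model_a/config.pbtxt  (model_b/config.pbtxt gets the same instance_group)
name: "model_a"
platform: "onnxruntime_onnx"
max_batch_size: 8
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]   # pin to the same physical GPU as model_b
  }
]
```

Since both models live in one Triton process, they share that GPU's memory and avoid a network hop between stages, which is exactly what gets lost when a provider insists each model be its own deployment.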
There are plenty of options for running the latest LLM, but far fewer for running a bespoke set of fine-tuned models on GPUs.