
What are folks' experiences with alternative cloud GPUs for inference?

If you're doing a lot of model training, buying GPUs or taking long-term GPU reservations is a no-brainer. But for inference, latency matters, and things get trickier when e.g. your AWS infra has to talk to GPUs hosted somewhere else.

It seems like lots of providers can give you enough capacity to get by on inference in a company's earliest stages. But what if I need hundreds or thousands of A100s during peak usage? Is anyone doing this successfully with a non-hyperscaler?



Just go with an inference provider like fireworks/together/modal/baseten


An issue we've had when looking into some of these is that they provide a layer of software abstraction we're not looking for. I don't want to use some provider's bespoke library to wrap my model code; I just want to use NVIDIA Triton, either by providing an image or by providing a model repo. I only want the inference provider to handle the hardware.
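
To make that concrete, the client side of this is tiny. A minimal sketch using the tritonclient library (the URL, model name, and tensor names here are hypothetical placeholders): ideally, only the URL changes when we swap hardware providers, and the model code stays untouched.

    import numpy as np
    import tritonclient.http as httpclient

    # Only this URL should change when moving between GKE and an
    # alternative GPU provider.
    client = httpclient.InferenceServerClient(url="triton.example.internal:8000")

    # Build a request against a model in our Triton model repo
    # (hypothetical model and tensor names).
    batch = np.random.rand(1, 3, 224, 224).astype(np.float32)
    inp = httpclient.InferInput("INPUT__0", list(batch.shape), "FP32")
    inp.set_data_from_numpy(batch)

    result = client.infer(model_name="my_finetuned_model", inputs=[inp])
    out = result.as_numpy("OUTPUT__0")

A provider-specific SDK forces a rewrite of exactly this layer, which is the part we'd rather keep portable.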

I understand that's exactly what those providers _don't_ want, because it means they can't lock us in. But particularly when comparing an inference provider to GCP, where we already run everything in Triton on GKE, I don't want to rewrite my code just to see what their hardware layer is like.

Another complication is that we often run multiple tightly integrated models for a single application, where having them on the same GPU is critical. That's tricky or impossible in some inference providers' frameworks (see the sketch below).
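
A sketch of what "tightly integrated" means here (model and tensor names are made up for illustration): two models chained against a single Triton instance, so the intermediate tensor never leaves the box.

    import numpy as np
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="triton.example.internal:8000")

    # Stage 1: an encoder produces an embedding (hypothetical names).
    x = np.random.rand(1, 3, 224, 224).astype(np.float32)
    enc_in = httpclient.InferInput("INPUT__0", list(x.shape), "FP32")
    enc_in.set_data_from_numpy(x)
    emb = client.infer(model_name="encoder", inputs=[enc_in]).as_numpy("EMBEDDING")

    # Stage 2: a ranker consumes the embedding. Because both models are
    # loaded behind the same Triton instance on the same GPU, the hop
    # between stages is local rather than a cross-provider network call.
    rank_in = httpclient.InferInput("EMBEDDING", list(emb.shape), "FP32")
    rank_in.set_data_from_numpy(emb)
    score = client.infer(model_name="ranker", inputs=[rank_in]).as_numpy("SCORE")

Triton's ensemble scheduler can even do that chaining server-side in one request, which is harder still to replicate on a provider that only exposes single-model endpoints.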

There are plenty of options for running the latest LLM, but far fewer for running a bespoke set of fine-tuned models on GPUs.



