Good point, although it's possible that, with the extreme price of GPUs, it costs more to train by buying hardware than it would to rent. For example, it might take two to three years before the GPUs are paid for by customers.
Linux reserved cost of a p3.16xlarge is $146,362.08 annually. On-demand cost is $214,444.80 annually.
I am pretty damn sure I could build an 8-GPU Intel Xeon E5-2686 v4 (Broadwell) server (that's the CPU Amazon uses, and it's $30 to $75 on eBay) for less than that and come out ahead on electricity even at full throttle. RTX 4090s are just under $2,000 each on eBay.
8 GPUs × $2,000 (RTX 4090) + $1,000 (for the rest of the computer) = $17,000
If it pulls 2 kW continuously at 15 cents per kWh, that's 2 kW × 24 h/day × 365 days × $0.15/kWh, or $2,628 a year.
In total the computer will cost $19,628 a year even if you throw it in the dumpster at the end of each calendar year of using it.
If you stack an internet cost of $200 a month on top, that's $2,400 a year, which raises your annual cost to $22,028.
This is still $124,334 cheaper per year than one AWS 8-GPU server if you fully depreciate your own hardware at the end of year 1 to $0.
I could hire an engineer in America to babysit it with the money left over.
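For anyone who wants to check the math, the whole comparison fits in a few lines of Python; every figure is taken from the numbers above, nothing new:

```python
# DIY vs. AWS, using the numbers from this thread.

aws_reserved_annual = 146_362.08   # p3.16xlarge, 1-yr reserved, Linux
aws_ondemand_annual = 214_444.80   # p3.16xlarge, on-demand (= $24.48/hr)

hardware    = 8 * 2_000 + 1_000    # eight 4090s + rest of the box = $17,000
electricity = 2 * 24 * 365 * 0.15  # 2 kW continuous at $0.15/kWh  = $2,628
internet    = 200 * 12             # $200/month                    = $2,400

# $22,028, with the hardware fully written off in year one
diy_annual = hardware + electricity + internet

print(f"DIY year-one total:   ${diy_annual:,.0f}")
print(f"Saved vs. reserved:   ${aws_reserved_annual - diy_annual:,.2f}")
print(f"Saved vs. on-demand:  ${aws_ondemand_annual - diy_annual:,.2f}")
```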
Are consumer-grade RTX 4090 cards going to be suitable for running full tilt 24/7 for a year? Those things are fine to stress on the latest game for a few hours at a time, but would probably develop defects from significant heat stress after just a few days at 100%.
This is inconsequential when you're playing Overwatch for a few hours a night and a frame drops now and again. If you're training an iteratively developed LLM, though, physical defects could propagate into huge deficiencies in the final model.
Yep absolutely, crypto miners have been doing it for years.
I still think it would be impractical at scale because they are so much hotter and more power-hungry than the datacenter cards, and you would be lucky to score one or two if you're on a wait list.
Except you can absolutely obtain 4090s today, while enterprise hardware is (was? I haven't looked at the data recently) the stuff with the long wait lists, which is the exact opposite of the scenario you mentioned.
I'm actually really surprised that you can still buy 4090s for under $2,000 (the cheapest I saw was $1,800 new, and I only took 30 seconds to look), but you can usually sell certain models for quite a bit more. For example, my used 4090 FE is currently worth more than I paid for it.
I've played with AI, and while admittedly I've not done anything super serious, I can tell you that both the 3090 and 4090 are more than capable of performing. Pair them with a power-efficient AMD CPU and you have something that can be (somewhat) competitive with enterprise hardware.
I've seen the pricing of "cloud" offerings and I've toyed with the idea of creating an "AI cloud," because I have access to really fast internet and super cheap electricity, but I haven't executed because I'm most certainly not a salesperson. I do, however, know enough about marketing to know that one should not compete on price, so there is that...
I don't think they'd become a fire hazard, but it is true that one would likely pick something else for this application.
Having said that, switching to something like the Tesla V100-SXM2-16GB wouldn't cost that much more.
TBH, I'm shocked at how many people treat Amazon as the first choice for this stuff. Much of it isn't even what most would consider a "production" workload. You are paying for a lot of enterprise-readiness that you don't need for training.
> TBH, I'm shocked at how many people treat Amazon as the first choice for this stuff
You can thank Amazon's legions of salespeople for that, particularly the end of year junket in Las Vegas where attendees are so pampered that about the only thing they won't do is suck your dick
Oh, yeah, they'll also yell at you on stage if you complain about their UI
Though this comparison is really only relevant for a couple of machines. Beyond that, at this cost, if you pay AWS list prices "at scale" you're doing something very wrong.
Don't get me wrong - I've frequently argued that AWS is price gouging and relying on people's lack of understanding of how the devops costs of running your own work out, but it doesn't take a huge budget before this calculation will look very different (still cheaper to own your own, though).
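As a rough sketch of where that break-even lands: the $150k engineer salary below is purely my placeholder, the per-machine numbers reuse the figures from upthread, and real AWS discounts at scale would shift this.

```python
# Break-even sketch: DIY fleet + one engineer vs. AWS reserved list price.
# The salary is an assumed placeholder; per-machine costs are from upthread.

aws_per_machine = 146_362.08   # 8-GPU p3.16xlarge, 1-yr reserved
diy_per_machine = 22_028       # year-one cost incl. power + internet
engineer_salary = 150_000      # assumption: one person babysits the fleet

for n in (1, 2, 4, 8):
    diy = n * diy_per_machine + engineer_salary
    aws = n * aws_per_machine
    print(f"{n} machine(s): DIY ${diy:>9,.0f}  vs.  AWS ${aws:>11,.0f}")
```

Even with staff costs folded in, DIY pulls ahead by the second machine at list prices.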
You can build an old Xeon-based machine, but it only has 40 PCIe lanes. For training on 8 GPUs, how do you push data fast enough? I'm using a 7000-series Epyc for this to get 128 lanes. Have you built this kind of machine? Do you see good speed with 40 lanes? Curious, because then I could use an old Tyan motherboard that comes in a full case with a good layout for multiple GPUs. With the Epyc build I have to use risers and a custom frame, which is painful.
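Here is the napkin math behind my worry, as a rough sketch; the per-lane figure is the PCIe Gen3 theoretical peak, and the even-split/no-PLX-switch assumption is mine:

```python
# Theoretical per-GPU PCIe bandwidth, one direction, assuming lanes are
# split evenly across 8 GPUs with no PLX switches. Both platforms treated
# as Gen3 (an Epyc 7002 "Rome" is Gen4, which would double its numbers).

GEN3_GBPS_PER_LANE = 0.985  # 8 GT/s with 128b/130b encoding

def slot_width(total_lanes: int, gpus: int) -> int:
    """Largest standard PCIe width (x1/x2/x4/x8/x16) each GPU can get."""
    per_gpu = total_lanes // gpus
    width = 1
    while width * 2 <= per_gpu and width < 16:
        width *= 2
    return width

for name, lanes in (("Xeon E5-2686 v4 (40 lanes)", 40),
                    ("Epyc 7000-series (128 lanes)", 128)):
    w = slot_width(lanes, 8)
    print(f"{name}: x{w} per GPU ≈ {w * GEN3_GBPS_PER_LANE:.1f} GB/s")
```

x4 vs. x16 per card is a 4x gap on host-to-device transfers, which is why the lane count worries me for keeping 8 GPUs fed.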