
We're absolutely seeing the same thing even in non-GPU traditional servers. Here is an estimated monthly cost breakdown for a new GCP instance type, the n4-standard-2, in us-central1:

    2 vCPU + 8 GB memory: $69.18
    10 GB Hyperdisk Balanced: $0.80
    3060 provisioned IOPS: $15.30
    155 MB/s provisioned throughput: $6.20
    Total: $91.48
Like, you can tear apart that $69/mo for 2 vCPU + 8 GB of memory, no problem. That's utterly insane. It's Emerald Rapids, so you're paying a premium for new chips, whatever. You can also tear apart the network egress, obviously.

But just look at the SSD pricing. That's a ten-gigabyte SSD provisioned for 155 MB/s, for about $22/month. You can go outright buy a 256 GB NVMe with significantly higher bandwidth on Amazon for like $25, flat. The n4-tier instances removed the ability to use the cheaper general-purpose SSDs; you have to use Hyperdisk.
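
For anyone who wants to sanity-check that, here's a quick back-of-the-envelope in Python using only the numbers quoted above. The implied per-unit rates are back-derived from this single example, so treat them as approximations rather than Google's official rate card:

    # Rough sanity check of the Hyperdisk Balanced line items quoted above.
    # Per-unit rates are back-derived from this one example, not official pricing.
    capacity_cost = 0.80     # $/month for 10 GB      -> ~$0.08 per GB-month
    iops_cost = 15.30        # $/month for 3060 IOPS  -> ~$0.005 per IOPS-month
    throughput_cost = 6.20   # $/month for 155 MB/s   -> ~$0.04 per (MB/s)-month

    disk_total = capacity_cost + iops_cost + throughput_cost
    print(f"Hyperdisk total: ${disk_total:.2f}/month")        # ~$22.30/month

    # Compare with the ~$25, 256 GB retail NVMe mentioned above, amortized
    # over three years (ignoring durability, which is the real difference).
    nvme_price = 25.00
    print(f"NVMe amortized:  ${nvme_price / 36:.2f}/month")   # ~$0.69/month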

I'd be surprised if we don't see the big cloud providers start struggling over the next ten years. I think they engineered planetary-scale systems that are just way too expensive and complex to justify the prices they're charging; a ZIRP-era phenomenon.



You feel like it's new; I don't. I've evaluated migrating from full-blown servers to the cloud at various scales and for various usages: an OTA server for 1M active devices, a stats server for 1M active devices, a build server for Android (both pure Android, which parallelizes and checks out neatly, and vendor Android, where they broke all of that). In every case the cloud was roughly an order of magnitude costlier. You mention SSDs: the various clouds I tried (I can't remember which ones, sorry...) had bad storage performance, which hurt especially when building Android. Also, not all of those servers were maintained by my employer; when you pay for a top-end 100 Gbps server, you can pay someone to maintain it for you as well, for far less money than the "cloud tax".

I have no doubt there are great use cases for the cloud, that at the proper scale you can negotiate, and I understand that startups might move faster with the cloud. But I feel like the highest value the cloud provides is 1. replacing capex with opex, and 2. making scaling easier to get past management: cloud is pay first, answer management's questions later; "on-prem" is negotiating with management until the service degrades, then scrambling to integrate the new server under pressure.


> In all those cases the cost of cloud was like an order of magnitude costlier.

Don't discount the value-add of your skills. Giving any of my family members a stack of bare metal servers would serve no end-purpose or web requests. Approximately zero of said servers would even end up plugged in at all, much less operationalized. My family all works in tech or similarly demanding fields. That is, the cost of "the ops team / guy" can be significant for small and medium enterprises.

Despite this, the AWS pricing margins have grown to the point of excess and are no longer competitive, even against the other major cloud players. This is strange to me because the tooling is all interoperable (e.g., Terraform), with no fatal friction or lock-in. It is a rare use case where the latest CPU is worth paying a penny extra for compared to a chip three or four generations old. The only winner is the cloud provider, because it improves their COGS ratio.

TL;DR: the top-tier clouds are priced on the order of luxury goods, on par with a Bugatti or a G6 jet. That is, unless you ruthlessly track and prune each expense, which costs you time and attention that could have been spent growing the business, or at least on non-overhead tasks. Meanwhile, managing your own DIY fleet of machines eventually becomes a total headache, as any given computer may work flawlessly for the next ten years or only the next ten minutes. When it goes south, you're back to being a monkey plugging in cables and scratching your head. That sad activity is only a few people's cup of tea.

Obligatory reference to Warren G's "Regulate":

"You've gotta be handy with the steel, if you know what I mean."

https://youtube.com/watch?v=hms5vmekId4

Edit: Sorry for the rambling comment, I suppose it is a more complex topic than I realized prior to crafting the words above.


> When it goes south, you're back to being a monkey plugging in cables and scratching your head.

That is if you rent rack space. If you rent bare metal and the bare metal has a failing disk, you file a request to replace said disk, and the data center engineer will be plugging cables, not you.

You still have to worry about hardware failures, but you don't have the inconvenience of having to be physically present.


> start struggling over the next

They’re already struggling!

In the past, the three big clouds would deploy cutting-edge CPUs at scale ahead of general availability for ordinary rackmount servers.

Now?

The AMD EPYC 9004 series processors were announced over a year ago in March 2023, but are still trickling out as “preview” in selected regions in Azure. Similarly, Intel Xeon fourth-gen CPUs haven’t even been announced by Azure, but Intel is already shipping fifth-generation CPUs!

I suspect that up until a couple of years ago, the usage of public cloud was increasing at such a pace that the providers were buying a truckload of CPUs every six months, so they were keeping up with the latest tech.

They must have had new signups dry up as soon as interest rates went up, and they’re now milking their existing kit instead of expanding with new generation servers.


Hyperdisk is a SAN; it's not comparable to local storage. Unfortunately Google's local SSDs are also overpriced.


Sure, but you literally cannot use any other kind of SSD with the n4-class instances, and n4 are the only instances they offer on Emerald Rapids; they're advertised as general purpose, flexible, and high performance, basically their workhorses. If you want to use a local SSD you have to use older generation chips.


IMO, the big thing holding back customers (especially smaller ones) from going on-premise these days is networking. Getting a fat pipe similar to what you get with an instance on a cloud provider can be prohibitively expensive (internet service, gear, staff, etc), especially when you want it to be highly available.


It is really the second one. The minute you want a second site or even HA at a single site the complexity and costs start to explode.


k3s is very simple to set up and add nodes to. If the machines are on the same LAN, or reachable over the internet, it's not such a complex job, albeit you need to know the basics of Kubernetes.


Kubernetes is only the software layer; a lot of the cost is in the hardware and infrastructure, and in the salaries of the experts who run them.


Agreed 100 percent. Software is the easy part. Getting HVAC, power and network up to the levels of cloud providers is difficult to get right and prohibitively expensive.

For instance, a pair of redundant symmetric gigabit fiber links runs thousands a month and may require tens of thousands of dollars in construction costs. These quickly add up, and the upfront costs can easily reach six figures.


Not to mention security compliance. If you can afford all of that, seems pretty likely you'll also have SOC2/etc needs. Being able to "ignore" the whole physical security aspect of that stuff is a huge benefit of the cloud.


There's a huge middle ground between on-prem and GCP/AWS. You can rent space and connectivity from a very competent datacenter without any of these big fixed costs.


Can rent the space, but you still have to buy the hardware. Maybe there's money to be made running some low-availability cloud service offering newer hardware.


Have you checked the price for a system capable of using two redundant 10Gbps links lately? It’s cheap. You could put gear like this in your closet at home and not feel particularly silly about it, especially if you are willing to buy still-current used enterprise gear.

For that matter, have you checked the price, in qty 1, of a server that will absolutely destroy anything reasonable from a major cloud vendor in terms of IOPS to stick behind that switch or router? Even if you believe the numbers on the website of a major server vendor and forget to ask for a discount, it’s still quite reasonable in comparison to a major cloud.


Yeah, it tends to be. But it's more efficient for multiple customers who don't need the hardware full-time to share it. Someone could set that up without all the expensive HA guarantees and other stuff a regular cloud provides. Maybe it was too niche in the past, but now with the AI boom...


I remember seeing a quote for 500/500 metro E from Comcast several years ago. $12k to install, $1.2k/mo. And that only involved laying a few miles of fiber, no redundancy. Dedicated lines are no joke. If you're AWS or GCP, you can be your own ISP and mitigate this to some extent, but that's just the physical connection they save on.
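
To put that quote in perspective, here's a minimal Python sketch using the install and monthly figures above; the 36-month term is my own assumption:

    # Total cost of the 500/500 Metro-E quote mentioned above.
    # The 36-month term is an assumed contract length; no redundancy included.
    install_fee = 12_000      # one-time construction/install
    monthly_fee = 1_200       # recurring
    term_months = 36          # assumption

    total = install_fee + monthly_fee * term_months
    print(f"Total over {term_months} months: ${total:,}")          # $55,200
    print(f"Effective monthly cost: ${total / term_months:,.0f}")  # ~$1,533/month

    # A second, diversely routed link for redundancy would roughly double this.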

You can always save by going on-prem, assuming you have no uptime requirements. But the moment you sign an SLA, those savings go out the window.


Yeah, k3s doesn't buy you 2 routers, 2 switches, 2 PDUs, 2 firewalls, 2 proxies to sit in front of k3s, 2 internet connections (if those are even offered), etc.; the list goes on. Not to mention that HA things like to come in threes.

Then you have to remember that cloud networking is pretty beefy, and if you want k3s to do distributed storage you will need some pretty beefy network hardware of your own.

There are a lot of things hidden in the cloud costs that people forget about.

The one thing running your own stuff does allow you to do is make choices and trade-offs: if this switch goes down and we eat six hours of downtime while we replace it, what is that worth, etc.
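
You can even put rough numbers on that trade-off. A minimal expected-value sketch in Python; every figure below is a made-up placeholder you'd swap for your own estimates:

    # Is keeping a cold-spare switch worth it? All inputs are illustrative placeholders.
    p_failure_per_year = 0.05     # assumed chance the switch dies in a given year
    downtime_hours = 6            # assumed time to source and swap a replacement
    cost_per_hour = 500           # assumed cost of an hour of downtime
    spare_switch_price = 2_000    # assumed price of a spare sitting on the shelf

    expected_annual_loss = p_failure_per_year * downtime_hours * cost_per_hour
    print(f"Expected annual downtime cost: ${expected_annual_loss:,.0f}")  # $150

    # With these placeholder numbers the spare only pays for itself if outages
    # cost far more per hour or happen far more often -- and deciding that is
    # exactly the kind of choice you get to make when you own the hardware.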


Starting with a basic web backend, you probably have a database that you can't simply run replicas of.


Are you talking about colo or an office? Because carrier-neutral colos are pre-wired with plenty of bandwidth that's 10x-100x cheaper than public clouds. Yes, you need routers but the savings elsewhere should pay for them.


Also on GCP, the vCPUs are usually hyperthreads (except for t2d- instance types, and perhaps a few others). So that machine you've described has 1 CPU core.


Reference: https://cloud.google.com/compute/docs/machine-resource#recom...

The t2d, t2a and h3 instance types have vCPU = core, and all other instance types have vCPU = thread.
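
A tiny Python illustration of what that mapping means when you're sizing instances (the helper and its name are mine, purely for illustration):

    # Physical cores behind N vCPUs on GCP, per the machine-resource doc above:
    # t2d, t2a and h3 expose one vCPU per core; other families expose one vCPU
    # per SMT thread, i.e. two vCPUs per core.
    CORE_EQUALS_VCPU = {"t2d", "t2a", "h3"}

    def physical_cores(machine_type: str, vcpus: int) -> int:
        family = machine_type.split("-")[0]
        return vcpus if family in CORE_EQUALS_VCPU else vcpus // 2

    print(physical_cores("n4-standard-2", 2))   # 1 core
    print(physical_cores("t2d-standard-2", 2))  # 2 cores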


Clouds have historically been designed for high-availability workloads, which are very hard to handle yourself. It doesn't always make sense for experimentation or AI training, though they might be trying to optimize more for that now. At past startups, we were fine just buying machines to run on-prem.


The $22/mo hyperdisk is for customers who want effortless durability. The $25 NVMe device is for applications where the data is worthless (not pejorative; there are many worthwhile applications where the data written to storage is of no durable value). It makes sense that there are two different price points.


If you don't need Hyperdisk, why are you on that type of server?


On Google Cloud hyperdisks can be used on h3, c3, c3d, m3, n4 instance types and are required for n4. I.e., you are not allowed to use the n4 instance type without using a hyperdisk.


It's the only instance class that's on Emerald Rapids. So if you want the best that Intel has to offer, you need to adopt Hyperdisk.

But, to be clear: We're not.


Curious what the use case is for targeting a specific CPU. These are virtual CPUs anyway, so what benefit does using the latest Intel chip offer?


Sometimes the instruction sets change (like the relatively recently announced AVX10 extensions) and you have a workload that specifically needs them? I'm just guessing, though.
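
If a workload really does depend on particular extensions, you can at least check what the VM actually exposes. A Linux-only Python sketch that parses /proc/cpuinfo; the flags listed are just examples, swap in whatever your workload needs:

    # Report which instruction-set extensions this (virtual) machine exposes.
    # Linux-only: reads the "flags" line from /proc/cpuinfo.
    WANTED = {"avx2", "avx512f", "avx512_vnni", "amx_tile"}  # example flags

    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                present = set(line.split(":", 1)[1].split())
                break
        else:
            present = set()

    for flag in sorted(WANTED):
        print(f"{flag}: {'present' if flag in present else 'missing'}")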



