
A JSON file and a case statement is hard? You can even validate the JSON as you’re creating it in the console - not just that it is valid JSON, but that it is a correctly formatted state machine. Anyone who can’t handle creating the step function and a case statement is going to have a heck of a time using Terraform, CloudFormation, IAM policies, YAML files for builds and deployments, etc.
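For context, the "JSON file" here is an Amazon States Language document, and the "case statement" is a Choice state. A minimal sketch of what that looks like - state names, function names, and ARNs are all made up for illustration:

```json
{
  "Comment": "Illustrative state machine - names and ARNs are made up",
  "StartAt": "CheckFileType",
  "States": {
    "CheckFileType": {
      "Type": "Choice",
      "Choices": [
        { "Variable": "$.fileType", "StringEquals": "csv", "Next": "ProcessCsv" }
      ],
      "Default": "ProcessOther"
    },
    "ProcessCsv": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-csv",
      "End": true
    },
    "ProcessOther": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-other",
      "End": true
    }
  }
}
```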


Just run the stupid thing on a VM for however long it takes. Why would anyone need all this complexity to run some dumb program that should just be a cron job to begin with? And why should the limitations of some platform dictate how one structures their code? You're solving the wrong problem.


“Just do everything like we would do on premise” is the very reason that cloud implementations end up costing more than they should.

Let’s see all the reasons not to use a VM.

1. If this job runs when a file comes in, then instead of a simple S3 event -> lambda, you have to do S3 event -> SNS -> SQS and then poll the queue from the VM.
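The "simple S3 event -> lambda" side of point 1 is just a handler that receives the event directly - no SNS/SQS plumbing and no polling loop. A sketch, with the handler name and bucket/key values made up, and the fake event shaped the way S3 actually delivers notifications:

```python
# Sketch of the "S3 event -> lambda" path: the handler gets the event
# pushed to it; there is no queue to poll. Names here are illustrative.

def handler(event, context):
    results = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # ... download and process the file here ...
        results.append((bucket, key))
    return results

# Minimal fake event in the shape S3 sends to Lambda:
fake_event = {
    "Records": [
        {"s3": {"bucket": {"name": "incoming"}, "object": {"key": "data/file1.csv"}}}
    ]
}
print(handler(fake_event, None))  # [('incoming', 'data/file1.csv')]
```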

2. Again, if you are running a job when a file comes in, what do you do if 10 files come in at the same time and you want to process them simultaneously? Do you have ten processes running and polling the queue? What if there is a spike of 100 files, do you have 100 processes running just in case? With the lambda approach you get autoscaling. True, you can set up two CloudWatch alarms - one to scale out when the number of items in the queue is above X and one to scale in when it is below Y - then set up an autoscaling group and launch configuration. And since we are automating our deployments, that means we also have to write a CloudFormation template with a user data section telling the VM what to download, plus the alarm definitions, the scale-in and scale-out rules, the autoscaling group, the launch configuration, the EC2 definition, etc.

Alternatively, you can vastly overprovision that one VM to handle spikes.

The template for the lambda and the state machine will be a lot simpler.

And after all of that work, we still don’t have the fine-grained autoscaling control with the EC2 instance and cron job that we would have with the lambda and state machine.

3. Now if we do have a VM and just one cron job, not only is that VM running and costing us money while it waits for a file to come in, it is also a single point of failure. Yes, we can alleviate that with an autoscaling group of one instance, a health check that automatically kills an unhealthy instance and brings up another, and a group that spans multiple AZs.
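For a sense of the plumbing point 2 describes, here is a sketch of just the scale-out half in CloudFormation. Resource names, the queue, and the threshold are all illustrative, and the scale-in alarm, autoscaling group, launch configuration, and user data section would all still be needed on top of this:

```yaml
# Illustrative CloudFormation fragment - names and thresholds are made up.
ScaleOutAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    Namespace: AWS/SQS
    MetricName: ApproximateNumberOfMessagesVisible
    Dimensions:
      - Name: QueueName
        Value: !GetAtt FileQueue.QueueName
    Statistic: Average
    Period: 60
    EvaluationPeriods: 1
    Threshold: 100
    ComparisonOperator: GreaterThanThreshold
    AlarmActions:
      - !Ref ScaleOutPolicy

ScaleOutPolicy:
  Type: AWS::AutoScaling::ScalingPolicy
  Properties:
    AutoScalingGroupName: !Ref WorkerGroup
    AdjustmentType: ChangeInCapacity
    ScalingAdjustment: 1
```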


I disagree - the rampant complexity of systems you can build using all these extra services is the bigger problem, and at scale, all these extra services on AWS cost more than their OSS equivalent running on VMs. Take a look at ELB for example, there's a per-request charge that yes is very small, but at 10k req/s it's significant - not to mention the extra bytes processed charges. Nginx/haproxy with DNS load balancing can do pretty much what ELB is offering. In the very small none of this matters, but you can just run smaller/bigger VMs as needed; prices starting at the cost of a hamburger per month.
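A minimal sketch of the nginx-plus-DNS alternative being described here - hostnames and ports are made up, and DNS round-robin across the nginx boxes is assumed to be configured separately:

```nginx
# Illustrative nginx reverse proxy standing in for ELB; names are made up.
upstream app_backend {
    server app1.internal:8080;
    server app2.internal:8080;
    keepalive 32;
}

server {
    listen 80;
    location / {
        proxy_pass http://app_backend;
        proxy_set_header Host $host;
    }
}
```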

Re reliability, I have VMs on AWS with multiple years of uptime - the VM is the base building block of the entire system, so it's expected to work all the time. You monitor those systems with any sort of nagios-type tool and alert into something like PagerDuty for when things do go wrong, but they mostly don't. Redundancy I build in at the VM layer.

Additional services on AWS however have their own guarantees and SLAs and they're more often down than the low-level virtual machines are - so it's an inherently less reliable way to build systems imo, but to each their own.

Also, this notion that you can conveniently split your processing up into 15-minute chunks is just crazy -- you may never know how long some long-running batch job is going to take on any given day. Such jobs typically lean on other systems outside your control that may have differing availability over time. I met a guy who built some convoluted scraper on lambda and was happy to make about 100k updates in the db overnight -- I built a similar thing as a single-core Python job running on a VM that did 40MM updates in the same amount of time. It's just silly what people think simple systems are not capable of.


I disagree - the rampant complexity of systems you can build using all these extra services is the bigger problem,

So now I am going to manage my own queueing system, object storage system, messaging system, Load Balancer, etc? My time is valuable.

and at scale, all these extra services on AWS cost more than their OSS equivalent running on VMs. Take a look at ELB for example, there's a per-request charge that yes is very small, but at 10k req/s it's significant - not to mention the extra bytes processed charges. Nginx/haproxy with DNS load balancing can do pretty much what ELB is offering.

ELB automatically scales with load. Do you plan to overprovision to handle peak load? My manager would laugh at me if I were more concerned with saving a few dollars than with having the cross-AZ, fault-tolerant, AWS-managed ELB.

In the very small none of this matters, but you can just run smaller/bigger VMs as needed; prices starting at the cost of a hamburger per month.

Again, we have processes that don’t handle many messages during a lull but can quickly scale up 20x. Do we provision our VMs to handle peak load?

Re reliability, I have VMs on AWS with multiple years of uptime - it's the base building block of the entire system, so it's expected to work all the time. You monitor those systems with any sort of nagios-type tool and alert into something like pager duty for when things do go wrong, but they mostly don't.

Why would I want to be alerted when things go wrong and be awakened in the middle of the night? I wouldn’t even take a job where management was so cheap that they wouldn’t spend money on fault tolerance and scalability and instead expected me to wake up in the middle of the night.

The lambda runtime doesn’t “go down”. Between Route53 health checks, ELB health checks, EC2 health checks, and autoscaling, things would really have to hit the fan for something to go down after a successful deployment.

Even if we had a slow resource leak, health checks would just kill the instance and bring up a new one. The alert I would get is not something I would have to wake up for in the middle of the night; it is something we could take our time looking at the next day, or we could even detach the instance from the autoscaling group to investigate.

Redundancy I build in at the VM layer. Additional services on AWS however have their own guarantees and SLAs and they're more often down than the low-level virtual machines are - so it's an inherently less reliable way to build systems imo, but to each their own.

When have you ever heard that the lambda runtime is down? (I’m completely making up that term - i.e., the thing that runs lambdas.)

Also, this notion that you can conveniently split your processing up into 15m chunks is just crazy -- you may never know how long some batch job is going to take any given day if it's long running.

If one query is taking more than 15 minutes, it’s probably locking tables and we need to optimize the ETL process...

And you still didn’t answer the other question. How do you handle scaling? What if your workload goes up 10x-100x?

I met a guy who built some scraper on lambda that did something convoluted and he was happy to make like 100k updates in the db overnight -- I built a similar thing using a single core python job running on a VM that did 40MM updates in the same amount of time. It's just silly what people think simple systems are not capable of.

Then he was doing it wrong. What do you think a lambda is? It’s just a preconfigured VM. There is nothing magic you did by running on a Linux EC2 instance that you couldn’t also have done running in a lambda - which is just a Linux VM with preinstalled components.
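The 100k-vs-40MM gap is usually about batching writes, not about Lambda vs. EC2. A sketch of the two patterns, using sqlite3 purely as a stand-in for whatever database was actually involved:

```python
import sqlite3

# sqlite3 stands in for the real database; the batching pattern is the
# same whether the code runs in a Lambda or on an EC2 instance.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, val INTEGER)")
conn.executemany("INSERT INTO items VALUES (?, 0)", [(i,) for i in range(10_000)])

# Slow pattern: one UPDATE and one commit per row.
def update_one_at_a_time(conn, updates):
    for val, item_id in updates:
        conn.execute("UPDATE items SET val = ? WHERE id = ?", (val, item_id))
        conn.commit()

# Fast pattern: one executemany call in a single transaction.
def update_batched(conn, updates):
    conn.executemany("UPDATE items SET val = ? WHERE id = ?", updates)
    conn.commit()

updates = [(i * 2, i) for i in range(10_000)]
update_batched(conn, updates)
print(conn.execute("SELECT val FROM items WHERE id = 5").fetchone())  # (10,)
```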

Yes, costs do scale as you grow, and they may scale linearly, but cost should scale at a lower slope than your revenues. Then again, I don’t work for a low-margin B2C company. B2B companies have much greater margins to work with, and the cost savings of not having to manage infrastructure are well worth it.

If I get called once after I get off work about something going down, it’s automatically my top priority to find the single point of failure and get rid of it.

Another example is SFTP. Amazon just announced a managed SFTP service. We all know that we could kludge something together much cheaper than what Amazon is offering, but my manager jumped at the chance to get rid of a home-grown solution that we would otherwise have to manage ourselves.




