I work for a big shipping company which uses AWS to host our nextgen platform. We run on a wide variety of AWS resources and technologies, and thus we depend on AWS being healthy in our region while our business is running the cash registers. So, we built out alerting that pages our On-Call Guy whenever AWS Health tells us something is broken. The trouble is, our alerting system catches anything and everything, up to and including notifications that a database will be upgraded to the latest Aurora release two months from now. Worse, those notifications come out at very odd times, and our On-Call Guy (who is often me) complains that he’s paged at 3:00am for such nonsense.

We rotate a one-week On-Call shift. The first few days, as the On-Call Guy, you’re either dumbstruck by how many nuisance pages you get all night long, or you don’t get bothered and sleep like a baby. It often takes several days before you’re ready to be pissed off by the all-night nuisance pages, but by then you’re so busy, heads down working on day-to-day changes, that you don’t prioritize fixing them. So, the system has stayed broken for years.

I keep threatening to fix it, but I never do. I recently found this blog post, though, which may outline how to build a reasonable AWS Health alerting system, so I’m going to spend some time now figuring this out.

Design Ideas

I envision a Lambda which reads configuration data from a JSON object in an S3 bucket or from a record in DynamoDB: specifically, perhaps, lists of alerts to ignore, to convert to daily emails, to alert on immediately, etc. The Lambda would be invoked by AWS Health and would itself send the message on to SNS etc. for delivery to the On-Call Guy. Thus we filter the messages AWS Health gives us into four quadrants.
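
Here is a rough sketch of how that Lambda might look, mostly to make the idea concrete. Nothing here is settled: the config layout (the "ignore", "daily_email", "page" and "default" keys), the environment variable names (CONFIG_BUCKET, CONFIG_KEY, PAGE_TOPIC_ARN, DIGEST_TOPIC_ARN) and the event type codes in EXAMPLE_CONFIG are all my own placeholders, and I'm assuming the event arrives in the usual EventBridge shape with the Health fields under "detail". I've only written three buckets plus a fallback for anything unlisted, since that's as far as the design goes so far.

```python
import json
import os

# Hypothetical config shape; in the design above this would live in S3
# (or DynamoDB). The event type codes are illustrative placeholders.
EXAMPLE_CONFIG = {
    "ignore": ["AWS_RDS_MAINTENANCE_SCHEDULED"],
    "daily_email": ["AWS_EC2_INSTANCE_RETIREMENT_SCHEDULED"],
    "page": ["AWS_RDS_OPERATIONAL_ISSUE"],
    # Assumption: anything not listed goes to the daily email, not a 3:00am page.
    "default": "daily_email",
}


def classify(event, config):
    """Return 'ignore', 'daily_email' or 'page' for an AWS Health event."""
    code = event.get("detail", {}).get("eventTypeCode", "")
    for action in ("ignore", "daily_email", "page"):
        if code in config.get(action, []):
            return action
    return config.get("default", "page")


def load_config():
    """Read the JSON config from S3 (bucket and key come from env vars)."""
    import boto3  # imported lazily so classify() can be tried without AWS

    obj = boto3.client("s3").get_object(
        Bucket=os.environ["CONFIG_BUCKET"], Key=os.environ["CONFIG_KEY"]
    )
    return json.loads(obj["Body"].read())


def lambda_handler(event, context):
    """Entry point: classify the event and forward it to the right SNS topic."""
    action = classify(event, load_config())
    if action == "ignore":
        return {"action": action}

    import boto3

    # One topic pages the On-Call Guy; the other feeds the daily email.
    env_name = "PAGE_TOPIC_ARN" if action == "page" else "DIGEST_TOPIC_ARN"
    code = event.get("detail", {}).get("eventTypeCode", "unknown")
    boto3.client("sns").publish(
        TopicArn=os.environ[env_name],
        Subject=f"AWS Health: {code}"[:100],  # SNS subjects max out at 100 chars
        Message=json.dumps(event.get("detail", {}), indent=2),
    )
    return {"action": action}
```

Keeping classify() as a plain function with no AWS calls means I can test the routing rules on my laptop, and changing what gets paged is just an edit to the config object rather than a redeploy.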