2018-06-05 - 12:22-12:47 ET (25 minutes total outage)
At 12:22 pm ET on 5 June 2018 we experienced a widespread outage across all of our production applications lasting approximately 25 minutes. The outage was caused by a bug in our Web Application Firewall (WAF) service, which we use to block malicious traffic. To resolve it we disabled the WAF across all of our applications, which required a configuration change and a reboot in at least 9 different production environments. As each environment was rebooted and new web servers came online, traffic handling returned to normal.
12:21 - Security team creates manual rule in WAF (standard operation)
12:23 - Greenhouse engineering notified that our apps are down
12:23 - Incident response team convenes
12:24 - Status Page incident created
12:32 - (approx) WAF implicated in outage
12:34 - Status Page update clarified which apps were down
12:35 - Status Page update clarified “firewall issue” was cause
12:35 - (approx) Attempted to switch WAF from blocking to warning mode
12:42 - (approx) Began disabling WAF in apps one-by-one
12:44 - Monitoring systems indicate apps are recovering
12:47 - Monitoring systems indicate apps are fully recovered
12:53 - Status Page incident updated to Monitoring status
13:05 - Status Page incident updated to Resolved status
The root cause was found to be a bug in our WAF provider’s server-side service. A feature they had recently released contained a bug that, in some corner cases, caused all traffic to be blocked.
On the morning of the incident we noticed that our marketing site at www.greenhouse.io was being scanned by a particular set of IP addresses. Our security team created a rule in the WAF to block those IPs and set an expiry time on the rule, which is part of their standard operating procedures. The rule expiry time was the corner case that triggered the bug in our WAF vendor's system. Soon after the change, the WAF was blocking all of our traffic, not just the malicious traffic that we had specified. The vendor acknowledged that they had introduced the bug on May 24th, and they remedied it on June 5th.
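For illustration only, the intended behavior of a block rule with an expiry time can be sketched as follows. This is not the vendor's code and not our actual WAF configuration: the `BlockRule` class, its fields, and the example addresses (taken from the reserved documentation ranges) are all hypothetical. The point is simply that such a rule should match only the listed addresses, and only until it expires.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class BlockRule:
    """Hypothetical block rule: a set of networks plus an expiry time."""
    networks: tuple       # tuple of ipaddress network objects
    expires_at: datetime  # timezone-aware expiry timestamp

    def blocks(self, client_ip: str, now: datetime) -> bool:
        # An expired rule must never block anything.
        if now >= self.expires_at:
            return False
        # An active rule blocks only the addresses it lists.
        addr = ip_address(client_ip)
        return any(addr in net for net in self.networks)


# Example rule: block one scanner range for an hour (made-up values).
created = datetime(2018, 6, 5, 16, 21, tzinfo=timezone.utc)
rule = BlockRule(
    networks=(ip_network("203.0.113.0/24"),),
    expires_at=created + timedelta(hours=1),
)

scanner_blocked = rule.blocks("203.0.113.7", created)
customer_blocked = rule.blocks("198.51.100.1", created)
blocked_after_expiry = rule.blocks("203.0.113.7", created + timedelta(hours=2))
```

In the incident, the observed behavior differed from this on the second check: traffic from addresses that were not in the rule was blocked as well.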