Greenhouse Apps Unavailable

Incident Report for Greenhouse

Postmortem

Greenhouse Software Outage

2018-06-05 - 12:22-12:47 ET (25 minutes total outage)

Summary

At 12:22 pm ET on 5 June 2018 we experienced a widespread outage across all of our production applications, lasting approximately 25 minutes. The outage was caused by a bug in our Web Application Firewall (WAF) service, which we use to block malicious traffic. To resolve it we disabled the WAF across all of our applications, which required a configuration change plus a reboot of at least 9 different production environments. As each environment was rebooted and new web servers came online, traffic handling returned to normal.

Timeline

12:21 - Security team creates manual rule in WAF (standard operation)
12:23 - Greenhouse engineering notified that our apps are down
12:23 - Incident response team convenes
12:24 - Status Page incident created
12:32 - (approx) WAF implicated in outage
12:34 - Status Page update clarified which apps were down
12:35 - Status Page update clarified “firewall issue” was cause
12:35 - (approx) Attempted to switch WAF from blocking to warning mode
12:42 - (approx) Began disabling WAF in apps one-by-one
12:44 - Monitoring systems indicate apps are recovering
12:47 - Monitoring systems indicate apps are fully recovered
12:53 - Status Page incident updated to Monitoring status
13:05 - Status Page incident updated to Resolved status
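
The timeline above is in Eastern Time, while the status updates further down this page are stamped in UTC. The following is a minimal sketch for lining the two up and for checking the outage duration. It assumes a fixed UTC-4 offset (Eastern Daylight Time, which applies in June); the helper names `et_to_utc` and `minutes_between` are illustrative only and not part of any Greenhouse tooling.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the incident date falls in daylight saving time, so ET is UTC-4.
EDT = timezone(timedelta(hours=-4), "EDT")
INCIDENT_DATE = "2018-06-05"


def _parse_et(hhmm: str) -> datetime:
    """Parse an HH:MM timeline entry as an Eastern Time datetime on the incident date."""
    naive = datetime.strptime(f"{INCIDENT_DATE} {hhmm}", "%Y-%m-%d %H:%M")
    return naive.replace(tzinfo=EDT)


def et_to_utc(hhmm: str) -> str:
    """Convert an HH:MM timeline entry from ET to the HH:MM used in the UTC status posts."""
    return _parse_et(hhmm).astimezone(timezone.utc).strftime("%H:%M")


def minutes_between(start_hhmm: str, end_hhmm: str) -> int:
    """Return the number of whole minutes between two timeline entries."""
    delta = _parse_et(end_hhmm) - _parse_et(start_hhmm)
    return int(delta.total_seconds() // 60)


if __name__ == "__main__":
    print(et_to_utc("12:24"))                  # 16:24, the "Investigating" post
    print(et_to_utc("13:05"))                  # 17:05, the "Resolved" post
    print(minutes_between("12:22", "12:47"))   # 25, the stated outage length
```

A fixed offset keeps the sketch free of time zone database dependencies; a general-purpose tool would want a real time zone definition instead.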

Root Cause and Next Steps

The root cause was found to be a bug in our WAF provider’s server-side service. A feature the provider had recently released contained a bug that, in some corner cases, caused all traffic to be blocked.

Background

On the morning of the incident we noticed that our marketing site at www.greenhouse.io was being scanned by a particular set of IP addresses. Following its standard operating procedure, our security team created a rule in the WAF to block those IPs and set an expiry time on the rule. The rule expiry time was the corner case that triggered the bug in our WAF vendor's system. Soon after the change, the WAF was blocking all of our traffic, not just the malicious traffic we had specified. The vendor confirmed that the bug behind this incident was introduced on May 24th and remedied on June 5th.
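
For readers unfamiliar with expiring block rules, the sketch below models the intended behavior of such a rule: only the listed addresses are blocked, and only until the rule expires. It is a simplified illustration, not the vendor's implementation, and it does not attempt to reproduce the vendor's bug, whose details are not described here. The `BlockRule` class, the 24-hour expiry, and the `203.0.113.0/24` documentation range are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class BlockRule:
    """Hypothetical model of a manual WAF block rule with an expiry time."""
    networks: tuple          # ip_network objects the rule should block
    expires_at: datetime     # the rule stops applying at this instant

    def blocks(self, client_ip: str, now: datetime) -> bool:
        # An expired rule must block nothing.
        if now >= self.expires_at:
            return False
        # An active rule blocks only the addresses it lists.
        addr = ip_address(client_ip)
        return any(addr in net for net in self.networks)


# 12:21 ET on the incident date is 16:21 UTC; the 24-hour lifetime is an assumption.
created = datetime(2018, 6, 5, 16, 21, tzinfo=timezone.utc)
rule = BlockRule(
    networks=(ip_network("203.0.113.0/24"),),
    expires_at=created + timedelta(hours=24),
)

if __name__ == "__main__":
    print(rule.blocks("203.0.113.7", created))                        # True: listed address
    print(rule.blocks("198.51.100.1", created))                       # False: ordinary traffic
    print(rule.blocks("203.0.113.7", created + timedelta(hours=25)))  # False: rule expired
```

During the incident the second case is the one that failed: traffic from addresses outside the rule was blocked as well.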

Posted Jun 11, 2018 - 14:35 UTC

Resolved

Our team has not identified any additional performance degradations for Greenhouse Recruiting, Greenhouse Onboarding, or Job Boards. We apologize for any inconvenience this issue has caused for you and your teams. We will publish a post-mortem with more details shortly.

Posted Jun 05, 2018 - 17:05 UTC

Monitoring

Our team has deployed a change and Greenhouse Recruiting, Greenhouse Onboarding, and Job Boards are now back online. We will continue monitoring for any additional disruption.

Posted Jun 05, 2018 - 16:53 UTC

Identified

Our team has identified a firewall issue that is blocking traffic to Greenhouse Recruiting, Greenhouse Onboarding, and Job Boards. We are working on deploying a fix and will post another update soon.

Posted Jun 05, 2018 - 16:35 UTC

Update

Greenhouse Recruiting, Job Boards, and Onboarding are all currently unavailable. We are actively investigating and will provide an update when we have more information.

Posted Jun 05, 2018 - 16:34 UTC

Investigating

Greenhouse Recruiting is currently unavailable.

Posted Jun 05, 2018 - 16:24 UTC

This incident affected: Greenhouse Recruiting (Silo 1) and Greenhouse Job Boards, Greenhouse Onboarding.