Increased Errors in Greenhouse Recruiting
Incident Report for Greenhouse
Postmortem

WHAT HAPPENED?

At 3:22pm ET on 1/24/2019, Greenhouse Recruiting began serving increased errors for some users.

At 5:06pm ET, we applied a partial fix and started to see improvements for many Greenhouse Recruiting users.

At 5:34pm ET, we identified the root cause of the issue and applied a fix for all users. Error rates then returned to normal for all users.

WHAT WAS THE EFFECT?

For the duration of this incident, users intermittently received errors or blank pages while accessing Greenhouse Recruiting.

WHO WAS AFFECTED?

All Greenhouse Recruiting customers were affected intermittently for the duration of this incident, though some may have experienced errors for a shorter period of time.

Candidates applying through Greenhouse Job Boards or the Job Board API were unaffected. Greenhouse Onboarding and Harvest API were also unaffected.

WHAT WAS THE CAUSE?

During the afternoon of the incident, we rotated our underlying servers as part of routine infrastructure maintenance. When the new servers came back up, some of them started with a build of Greenhouse Recruiting that ran the same code but carried a different version of the compiled assets. That version of the compiled assets had never been pushed to our CDN, so any request that hit one of these new servers was unable to fetch the JS, CSS, and image files referenced by our front end.
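The failure mode described above can be illustrated with a minimal sketch. It assumes compiled assets are published under content-hashed filenames, which this report does not confirm; the function names, the hashing scheme, and the sample files are all hypothetical. Under that assumption, two builds of the same code that differ by even one byte of compiled output reference different filenames, and a server running the second build points at files the CDN has never received.

```python
import hashlib
import os


def fingerprint(name: str, content: bytes) -> str:
    """Return a content-hashed filename (hypothetical naming scheme)."""
    stem, ext = os.path.splitext(name)
    digest = hashlib.sha256(content).hexdigest()[:8]
    return f"{stem}-{digest}{ext}"


def build_manifest(compiled: dict[str, bytes]) -> dict[str, str]:
    """Map each logical asset name to its fingerprinted filename."""
    return {name: fingerprint(name, content) for name, content in compiled.items()}


def missing_from_cdn(manifest: dict[str, str], published: set[str]) -> list[str]:
    """List fingerprinted files the manifest references but the CDN lacks."""
    return sorted(path for path in manifest.values() if path not in published)


if __name__ == "__main__":
    # The build whose assets were pushed to the CDN.
    pushed = build_manifest({"app.js": b"console.log(1);", "site.css": b"body{}"})
    published = set(pushed.values())

    # Same source, but the compiled JS differs by one trailing byte.
    rebuilt = build_manifest({"app.js": b"console.log(1);\n", "site.css": b"body{}"})

    # Only the JS file is reported, since the CSS bytes are identical.
    print(missing_from_cdn(rebuilt, published))
```

Comparing a server's manifest against the set of published files before the server takes traffic is one way to catch this kind of mismatch early.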

Once we isolated the root cause, we were able to push assets to our CDN for some users, but not for all. Ultimately we decided that rolling back to a previous version of Greenhouse Recruiting was the safest fix, and after we took that action errors were resolved for all users.

The errors that occurred during this incident were unrelated to those that impacted a portion of our customers on 1/23/2019.

WHAT ARE WE DOING TO PREVENT THIS FROM OCCURRING AGAIN?

This incident exposed a monitoring gap at our CDN layer. We will be adding monitoring for this particular error condition and will audit our front end monitoring to identify other gaps.
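As a rough illustration of what monitoring for this error condition could look like, the sketch below requests every asset a server references and reports any that the CDN does not serve. It is not a description of our actual monitoring: the function names, the CDN base URL, and the asset paths are hypothetical, and the probe is injectable so the check can be exercised without network access.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable


def http_probe(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of a HEAD request, or 0 on a network failure."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 403 or 404.
        return err.code
    except OSError:
        # DNS failure, refused connection, timeout, and similar.
        return 0


def unreachable_assets(
    cdn_base: str,
    asset_paths: Iterable[str],
    probe: Callable[[str], int] = http_probe,
) -> list[str]:
    """Return the URLs of referenced assets that do not answer with 200."""
    failures = []
    for path in asset_paths:
        url = f"{cdn_base.rstrip('/')}/{path.lstrip('/')}"
        if probe(url) != 200:
            failures.append(url)
    return failures


if __name__ == "__main__":
    # Stand-in for a real CDN: only one of the two files has been published.
    published = {"https://cdn.example.com/assets/app-1a2b3c4d.js"}

    def fake_probe(url: str) -> int:
        return 200 if url in published else 404

    print(unreachable_assets(
        "https://cdn.example.com/",
        ["assets/app-1a2b3c4d.js", "assets/site-5e6f7a8b.css"],
        probe=fake_probe,
    ))
```

A check like this could run when a new server starts and periodically afterwards, alerting whenever the returned list is not empty.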

We are working on a fix to the underlying issue that allowed two versions of assets to exist for a single version of our code.

We have high expectations for the reliability of our applications. In the past two days we have not met those expectations. We deeply apologize for the inconvenience these incidents have caused. If you have any questions or concerns, please reach out via: https://support.greenhouse.io/hc/en-us/requests/new

Posted Jan 25, 2019 - 14:55 EST

Resolved
Our monitoring system has indicated that error rates in Greenhouse Recruiting have returned to normal. We deeply apologize for the inconvenience that this incident has caused.

We will be issuing a detailed post-mortem on this incident.
Posted Jan 24, 2019 - 18:06 EST
Monitoring
Our team identified the cause of the increased errors and has deployed a fix. We will continue monitoring for any additional disruptions.
Posted Jan 24, 2019 - 17:38 EST
Investigating
Our team is investigating reports of slow performance and errors loading certain pages in Greenhouse Recruiting for some customers.
Posted Jan 24, 2019 - 16:52 EST
This incident affected: Greenhouse Recruiting (Silo 1).