Greenhouse Job Boards unavailable
Incident Report for Greenhouse
Postmortem

WHAT HAPPENED?

Greenhouse Job Boards and the Job Board API were unavailable for approximately 10 minutes, starting at 10:18pm EDT on May 8th, 2017. The outage was caused by a failure in the caching layer of our job boards infrastructure.

WHAT WAS THE EFFECT?

Greenhouse Job Boards and the Job Board API were unavailable for approximately 10 minutes, and no external candidate applications could be submitted during this period. Other parts of the Greenhouse platform, such as Greenhouse Recruiting, Greenhouse Onboarding, and Greenhouse Analytics, were not affected and remained 100% available.

WHO WAS AFFECTED?

This disruption affected all of our customers using hosted or embedded job boards, as well as any career sites built using the Job Board API.

WHAT WAS THE CAUSE?

Due to a memory issue, our caching servers were automatically restarted at 10:18pm EDT on May 8th, 2017. Unfortunately, the web servers that depended on these caching servers were unable to reconnect to them after the restart. This caused the web servers to enter an 'unhealthy' state, prompting our routing infrastructure to serve error pages with a 503 Service Unavailable status.

Once the issue was identified, it was resolved by restarting the web servers. By 10:28pm EDT, the job boards and Job Board API were fully functional.
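As an illustration of the failure mode described above, the following is a minimal sketch of a cache client that reconnects on its own after a cache restart and falls back to computing the value directly while the cache is unreachable. It is not Greenhouse's implementation. The names ReconnectingCache, CacheUnavailable, FakeServer, and FakeConnection are hypothetical, and the in-memory fake stands in for a real cache server purely for demonstration.

```python
import time
from typing import Callable


class CacheUnavailable(Exception):
    """Raised by the (hypothetical) cache connection when the server is unreachable."""


class FakeServer:
    """In-memory stand-in for a cache server that can be restarted."""

    def __init__(self):
        self.up = True
        self.data = {}
        self.generation = 0  # bumped on every restart

    def restart(self):
        # A restart drops all cached data and invalidates old connections.
        self.data = {}
        self.generation += 1
        self.up = True

    def connect(self):
        if not self.up:
            raise CacheUnavailable("server down")
        return FakeConnection(self, self.generation)


class FakeConnection:
    def __init__(self, server, generation):
        self._server = server
        self._generation = generation

    def _check(self):
        # Connections opened before a restart are stale and must fail.
        if not self._server.up or self._generation != self._server.generation:
            raise CacheUnavailable("stale connection")

    def get(self, key):
        self._check()
        return self._server.data.get(key)

    def set(self, key, value):
        self._check()
        self._server.data[key] = value


class ReconnectingCache:
    """Wraps a connection factory; reconnects lazily instead of failing permanently."""

    def __init__(self, connect: Callable[[], object], retry_after: float = 1.0):
        self._connect = connect
        self._retry_after = retry_after  # seconds to wait between reconnect attempts
        self._conn = None
        self._next_attempt = 0.0

    def _mark_down(self):
        self._conn = None
        self._next_attempt = time.monotonic() + self._retry_after

    def _connection(self):
        if self._conn is None and time.monotonic() >= self._next_attempt:
            try:
                self._conn = self._connect()
            except CacheUnavailable:
                self._mark_down()
        return self._conn

    def get_or_compute(self, key, compute: Callable[[], object]):
        conn = self._connection()
        if conn is not None:
            try:
                value = conn.get(key)
                if value is not None:
                    return value
            except CacheUnavailable:
                self._mark_down()
                conn = None
        # Cache miss or cache unavailable: serve from the source of truth.
        value = compute()
        if conn is not None:
            try:
                conn.set(key, value)
            except CacheUnavailable:
                self._mark_down()
        return value
```

With a wrapper along these lines, a request that arrives right after a cache restart is served uncached and the next request reconnects, so the web server does not need to be restarted to recover. Whether falling back to the source of truth is acceptable depends on how much load the backing store can absorb without the cache.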

WHAT ARE WE DOING TO PREVENT THIS FROM OCCURRING AGAIN?

We will be upgrading our caching infrastructure to make these memory issues less likely to occur in the future and to ensure that restarts are handled more gracefully. We will also be improving our monitoring on this part of our infrastructure to provide us with an earlier indication that something is wrong.
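One way to get an earlier indication of a memory problem is to alert on cache memory usage before it reaches the point where an automatic restart is triggered. The sketch below shows the idea only. The function name memory_alert_level and the 80% and 95% thresholds are assumptions chosen for illustration, not values taken from Greenhouse's monitoring.

```python
def memory_alert_level(used_bytes: int, limit_bytes: int,
                       warn_ratio: float = 0.80,
                       critical_ratio: float = 0.95) -> str:
    """Classify cache memory usage as 'ok', 'warn', or 'critical'.

    The thresholds are illustrative assumptions; tune them so that 'warn'
    fires well before the cache server would be restarted.
    """
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    ratio = used_bytes / limit_bytes
    if ratio >= critical_ratio:
        return "critical"
    if ratio >= warn_ratio:
        return "warn"
    return "ok"
```

A check like this would typically run on a schedule against each caching server, with 'warn' notifying the on-call engineer and 'critical' paging them.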

We take the availability of our software very seriously, and are committed to making changes to prevent this kind of downtime from happening again. Please accept our apologies for any inconvenience caused.

Posted May 09, 2017 - 22:24 EDT

Resolved
This incident has now been resolved. We will publish a postmortem once we identify the root cause.

We apologize for any inconvenience that this issue may have caused.
Posted May 08, 2017 - 22:50 EDT
Monitoring
Job Boards are now operational. The outage was caused by an unresponsive component of our caching infrastructure and affected the availability of our hosted and embedded job boards as well as our Job Board API for a period of approximately 9 minutes from 10:19pm to 10:28pm EDT. We are monitoring the situation and investigating the root cause, and will provide updates as we learn more.
Posted May 08, 2017 - 22:37 EDT
Investigating
We have received alerts regarding the availability of our job boards. We are currently investigating the cause and scope of this issue, and will provide updates as we learn more.
Posted May 08, 2017 - 22:27 EDT