Increased Errors for some Greenhouse Customers

Incident Report for Greenhouse

Postmortem

WHAT HAPPENED?

Beginning at 6:15 PM ET on Friday, 1/25/2019, some users of Greenhouse Recruiting began encountering errors when using the application.

At around 4:00 PM ET on Sunday, 1/27/2019, a customer notified us of periodic failures, and the report was escalated to our support team.

At 5:15 PM ET on Sunday, 1/27/2019, we implemented a fix and error rates returned to normal.

WHAT WAS THE EFFECT?

For the duration of this incident, some users received a generic "We're sorry, but something went wrong" error message. Refreshing the page consistently cleared the error.

WHO WAS AFFECTED?

Customers who access Greenhouse Recruiting through https://app2.greenhouse.io received these errors intermittently. Customers using https://app.greenhouse.io were unaffected by this incident. Job boards and the job board API were not affected by this incident.

WHAT WAS THE CAUSE?

The disks that host one of our key datastores were taken out of service due to disk pressure and were not replaced. These disks were approximately 80% full, and we believed we had enough space, but the cluster scheduler removed them preemptively. The incident lasted as long as it did because weekend traffic was light, so the resulting low error rate stayed under the threshold needed to trigger internal alerts.
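
To illustrate how light traffic can keep errors under an alert threshold, here is a minimal Python sketch. It assumes the alert was based on raw error volume, which this report does not state. The thresholds, traffic figures, and function names are all hypothetical and are not taken from Greenhouse's monitoring.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request and error counts for one monitoring window (hypothetical shape)."""
    requests: int
    errors: int


# Hypothetical thresholds, chosen only for illustration.
ABSOLUTE_ERRORS_PER_WINDOW = 50  # fires on raw error volume
ERROR_RATIO = 0.02               # fires on the fraction of failed requests
MIN_REQUESTS = 100               # skip windows too small to judge


def absolute_alert(window: Window) -> bool:
    """Alert when the raw number of errors reaches a fixed volume."""
    return window.errors >= ABSOLUTE_ERRORS_PER_WINDOW


def ratio_alert(window: Window) -> bool:
    """Alert when the share of failed requests reaches a fixed fraction."""
    if window.requests < MIN_REQUESTS:
        return False
    return window.errors / window.requests >= ERROR_RATIO


# Made-up traffic: the same 3% failure ratio at two traffic levels.
weekday = Window(requests=10_000, errors=300)
weekend = Window(requests=500, errors=15)

for name, window in (("weekday", weekday), ("weekend", weekend)):
    print(name, absolute_alert(window), ratio_alert(window))
```

With these made-up numbers, the volume-based check fires only for the weekday window, while the ratio-based check fires for both. A ratio-based rule is one possible way to make alerting less sensitive to traffic level; it is not necessarily the change described in this report.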

WHAT ARE WE DOING TO PREVENT THIS FROM OCCURRING AGAIN?

Our infrastructure team had already planned an upgrade to the disk storage space on the Redis Sentinels. This upgrade will occur as scheduled. Greenhouse will also re-examine our monitoring to ensure that errors like these trigger internal alerts on a more appropriate timeline.

If you have any questions or concerns, please reach out via: https://support.greenhouse.io/hc/en-us/requests/new

Posted Jan 28, 2019 - 13:15 EST

Resolved

We implemented a fix and the error rate has been brought back to normal. Work continues to prevent the error from recurring. We will provide a post-mortem tomorrow.

Posted Jan 27, 2019 - 17:36 EST

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Jan 27, 2019 - 17:12 EST

Identified

The issue has been identified and a fix is being implemented.

Posted Jan 27, 2019 - 17:07 EST

Investigating

We are currently investigating this issue.

Posted Jan 27, 2019 - 16:58 EST

This incident affected: Greenhouse Recruiting (Silo 1).