Beginning at 6:15 PM ET on Friday, 1/25/2019, some users of Greenhouse Recruiting began encountering errors when using the application.
At around 4:00 PM ET on Sunday, 1/27/2019, a customer notified us of periodic failures, and the report was escalated to our support team.
At 5:15 PM ET on Sunday, 1/27/2019, we implemented a fix and error rates returned to normal.
WHAT WAS THE EFFECT?
For the duration of this incident, some users received a generic "We're sorry, but something went wrong" error message. Refreshing the page consistently resolved the error.
WHO WAS AFFECTED?
Customers who access Greenhouse Recruiting through https://app2.greenhouse.io received these errors intermittently. Customers using https://app.greenhouse.io were unaffected by this incident. Job boards and the job board API were not affected by this incident.
WHAT WAS THE CAUSE?
The disks that host one of our key datastores were taken out of service due to disk pressure and were not replaced. These disks were approximately 80% full, and we believed we had enough space, but the cluster scheduler removed them preemptively. The incident lasted as long as it did because weekend traffic was light and the resulting low error rate stayed under the threshold necessary to trigger internal alerts.
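To illustrate how a low, steady error rate can stay under an alerting threshold, here is a minimal sketch in Python. It is not our actual monitoring configuration: the thresholds, window sizes, traffic numbers, and function names are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request and error counts for one monitoring window (hypothetical shape)."""
    requests: int
    errors: int


def rate_alert(window: Window, rate_threshold: float) -> bool:
    """Fire only when the error rate in a single window exceeds the threshold."""
    if window.requests == 0:
        return False
    return window.errors / window.requests > rate_threshold


def sustained_alert(windows: list[Window], min_rate: float, min_windows: int) -> bool:
    """Fire when a lower error rate persists for several consecutive windows."""
    streak = 0
    for w in windows:
        if w.requests > 0 and w.errors / w.requests > min_rate:
            streak += 1
            if streak >= min_windows:
                return True
        else:
            streak = 0  # a healthy window resets the streak
    return False


# Assumed numbers: light weekend traffic with a steady 2% error rate.
weekend = [Window(requests=500, errors=10) for _ in range(6)]

# A single-window check with an assumed 5% threshold never fires.
single_window_fired = any(rate_alert(w, rate_threshold=0.05) for w in weekend)

# A check for a lower rate sustained over three windows does fire.
sustained_fired = sustained_alert(weekend, min_rate=0.01, min_windows=3)
```

In this sketch the single-window check stays silent for the whole period, while the sustained check fires on the third window. Pairing a high short-term threshold with a lower long-term one is one common way to catch slow, persistent failures without adding noise.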
WHAT ARE WE DOING TO PREVENT THIS FROM OCCURRING AGAIN?
An upgrade to the disk storage space on the Redis Sentinels was already planned by our infrastructure team. This upgrade will occur as scheduled. Greenhouse will also re-examine our monitoring to ensure that errors like these trigger internal alerts on a more appropriate timeline.
If you have any questions or concerns, please reach out via: https://support.greenhouse.io/hc/en-us/requests/new