System outage

Incident Report for redirection.io status page

Postmortem

On March 10th, 2021, our hosting company had a major outage in one of its datacenters.

Our SaaS infrastructure was impacted and all our systems went down. While we were prepared for individual machines to fail without interrupting normal service, we did not anticipate that a whole datacenter would burn.

We want to apologize to our customers for this incident, and explain the measures we have taken to prevent such issues from happening again.

Minutes

March 10th, 6:47 am We become aware of the issue.

March 10th, 8:06 am We communicate publicly about the incident.

March 10th, 8:20 am Information from our hosting provider is sparse. We decide to deploy all our infrastructure in a new location. However, the account management tools of our hosting provider have issues, and we cannot order new servers. We decide to switch to another hosting company as an emergency measure.

March 10th, 9:30 am Deployment in progress.
In the meantime, we collect the daily production backups. While creating the new servers, we hit a quota limit and go through account verification with our new hosting company.

March 10th, 10:15 am The quota has been raised, and we are spinning up all the infrastructure machines.
We regain access to the DNS management for our domain.

March 10th, 10:45 am More details are coming from our hosting provider. While not destroyed, the hourly incremental backups that we run will not be accessible in the short term. However, we do have access to a complete daily backup of all non-volatile data (accounts, organizations, projects, rules, instances, etc.) performed yesterday at 8am and stored in another location. Until power is restored to our former datacenter, we will not have access to the hourly backups, nor to the backups of volatile data (mainly historical statistics and logs).

March 10th, 12:00 pm The platform is back up, with all our customers' data apart from:

historical logs, which we will be able to restore once we get access to the datacenter storing our backups.
changes made to projects since yesterday's backup (March 9th, 2021, 8am UTC).

March 10th, 2:00 pm Operations work and verifications continue on our platform.

March 10th, 3:20 pm The first version of this page is made available.

March 10th, 6:40 pm The systems are monitored and healthy. Most of the platform should be working correctly. We are still missing the packages repository, which will be restored tomorrow morning.

March 11th, 7:30 am The packages repository for the v2 of the agent and the proxy is back online. We are starting to restore legacy versions.

March 11th, 11:00 am Some of the 1.x branch packages are available, and we are restoring more of them. If your infrastructure relies on one specific version of our packages, please send an email to our support team so we can prioritize this version.
Our benchmarking suite is available again.

March 11th, 1:00 pm Re-uploaded some missing media from our documentation pages. Wrote this post-mortem. We are starting the remediation plan that will help avoid such issues in the future.


Further actions and lessons

While we were able to mitigate the issue and redeploy all the infrastructure in a few hours, there are several lessons that can be learned. We are not happy to have faced an hours-long downtime, so there are a few actions that we intend to take in the next few days or weeks. Being available to help our customers manage their website traffic when they need it is our mission, and we will strive to deliver the best availability possible.

✅ availability of an external status page (hosted somewhere else).

✅ incremental backups replication: the incremental hourly backups will be replicated to another location (with another storage provider).

✅ more points of presence for the APIs serving Cloudflare customers: the API used by Cloudflare workers will be made available in several locations, operated by several hosting providers. Edit: we now have several PoPs for our Cloudflare and Fastly APIs.

✅ we are adding a rules cache to the Cloudflare workers, so that cached rules can still be applied if these nodes become unavailable.
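
To illustrate how the last two actions fit together, here is a minimal, language-neutral sketch written in Python. It is not our actual worker code: the class name `CachedRulesClient`, the `fetch` callable, and the endpoint names are all hypothetical. The idea is to try each API location in turn, remember the last rules that were fetched successfully, and fall back to that cached copy when no location answers.

```python
from typing import Any, Callable, Dict, Optional, Sequence

Rules = Dict[str, Any]  # hypothetical shape of a rules payload


class RulesUnavailable(Exception):
    """Raised when no endpoint answers and nothing has been cached yet."""


class CachedRulesClient:
    """Tries several API locations, then falls back to the last good copy."""

    def __init__(self, endpoints: Sequence[str], fetch: Callable[[str], Rules]):
        self.endpoints = list(endpoints)
        self.fetch = fetch  # hypothetical: performs the real network call
        self.cached: Optional[Rules] = None

    def get_rules(self) -> Rules:
        for endpoint in self.endpoints:
            try:
                rules = self.fetch(endpoint)
            except Exception:
                continue  # this location is down, try the next one
            self.cached = rules  # remember the last successful answer
            return rules
        if self.cached is not None:
            return self.cached  # every location failed: serve cached rules
        raise RulesUnavailable("no endpoint reachable and no cached rules")
```

With this shape, losing one location only costs an extra request, and losing all of them still lets previously fetched rules be applied. The trade-off is that cached rules can be stale, so a real implementation would probably also track how old the cached copy is.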

Posted Mar 22, 2021 - 12:17 CET

Resolved

Our hosting company has had a major outage in one of its datacenters. Our SaaS infrastructure is impacted and all our systems are down. While we were prepared for individual machines to fail without interrupting normal service, we did not anticipate that a whole datacenter would burn.

We are investigating and working on getting the situation back to normal.

Posted Mar 10, 2021 - 07:00 CET