On March 10th, 2021, our hosting company had a major outage in one of its datacenters.
Our SaaS infrastructure was impacted and all our systems went offline. While we were prepared for individual machines to fail without interrupting normal service, we did not anticipate that a whole datacenter would burn.
We want to apologize to our customers for this incident, and explain the measures we have taken to prevent such issues from happening again.
March 10th, 6:47 am We become aware of the issue.
March 10th, 8:06 am We communicate publicly about the incident.
March 10th, 8:20 am Information from our hosting provider is coming in sparsely. We decide to deploy all our infrastructure in a new location. However, the account management tools of our hosting provider have issues, and we cannot order new servers. We decide to switch to another hosting company as an emergency measure.
March 10th, 9:30 am Deployment in progress.
In the meantime, we collect the daily backups from production. While creating the new servers, we hit a quota limit and go through account verification with our new hosting company.
March 10th, 10:15 am The quota has been raised, and we are spinning up all the infrastructure machines.
We regain access to the DNS management for our domain.
March 10th, 10:45 am More details are coming from our hosting provider. While not destroyed, the hourly incremental backups that we run will not be accessible in the short term. However, we do have access to a complete daily backup of all non-volatile data (accounts, organizations, projects, rules, instances, etc.) performed yesterday at 8 am and stored in another location. Until power is restored to our former datacenter, we will not have access to the hourly backups, nor to the volatile data backups (mainly historical statistics and logs).
March 10th, 12 pm The platform is back up, with all our customers' data apart from:
March 10th, 2:00 pm Operations work and verifications continue on our platform.
March 10th, 3:20 pm The first version of this page is made available.
March 10th, 6:40 pm The systems are monitored and healthy. Most of the platform should be working correctly. We are still missing the package repository, which will be restored tomorrow morning.
March 11th, 7:30 am The package repository for the v2 of the agent and the proxy is back online. We are starting to restore legacy versions.
March 11th, 11:00 am Some of the 1.x branch packages are available, and we are restoring more of them. If your infrastructure relies on one specific version of our packages, please send an email to our support team so we can prioritize this version.
Our benchmarking suite is available again.
March 11th, 1 pm Re-uploaded some missing media from our documentation pages. Wrote this post-mortem. We are starting the remediation plan that will help avoid such issues in the future.
Further actions and lessons
While we were able to mitigate the issue and redeploy all the infrastructure in a few hours, there are several lessons that can be learned. We are not happy to have faced hours-long downtime, so there are a few actions that we intend to undertake in the next few days or weeks. Being available to help our customers manage their website traffic when they need it is our mission, and we will strive to deliver the best availability possible.
✅ availability of an external status page (hosted somewhere else).
✅ incremental backup replication: the hourly incremental backups will be replicated to another location (with another storage provider).
🔲 more points of presence for the Cloudflare customer APIs: the API used by Cloudflare Workers will be made available in several locations, operated by several hosting providers.
🔲 we are adding a rules cache in the Cloudflare Workers, so that cached rules can still be applied if these nodes are unavailable.
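
To illustrate the last item, here is a minimal sketch of the "serve cached rules when the API is down" idea. It is not our production code. The `Rule` shape, the `RuleLoader` type and the `withStaleFallback` helper are hypothetical names made up for this example, and the cache is a plain in-memory map rather than any Cloudflare-specific storage.

```typescript
// Hypothetical rule shape; the real rule format is not described in this post.
type Rule = { source: string; target: string };

// Anything that can load the rules of a project, e.g. a call to the rules API.
type RuleLoader = (projectId: string) => Promise<Rule[]>;

// Wraps a loader so that the last successful response for each project is
// kept in memory and served again whenever the loader fails.
function withStaleFallback(load: RuleLoader): RuleLoader {
  const lastKnown = new Map<string, Rule[]>();

  return async (projectId: string): Promise<Rule[]> => {
    try {
      const rules = await load(projectId);
      lastKnown.set(projectId, rules); // refresh the cached copy
      return rules;
    } catch (err) {
      const cached = lastKnown.get(projectId);
      if (cached !== undefined) {
        return cached; // API unreachable: apply the cached rules
      }
      throw err; // nothing cached yet, so surface the failure
    }
  };
}
```

With a wrapper of this kind, a project whose rules were loaded at least once keeps being served with its last known rules during an API outage. A project that was never loaded still fails, which is why the additional points of presence listed above remain necessary.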