Disaster Recovery

Purpose

The purpose of this Disaster Recovery (DR) Procedure is to provide a general overview of the steps the ErgoPlus team takes to prepare for disaster scenarios and the steps it will take should one occur.

Recovery Scenarios

The following are some of the scenarios that would require disaster recovery steps to be taken:

Human Error
The ErgoPlus team strives to automate as much as possible; however, some processes remain subject to human error. Examples include, but are not limited to, deployment, data migration, database maintenance, and customer support.
Data Corruption
The integrity of one or more tenants' data is compromised, or data is lost entirely.
Machine Recovery
Hardware failures in the cloud architecture require the provisioning and deployment of new cloud instances.
Data Center Destruction
Part or all of an AWS data center is destroyed.

Recovery Objectives

Recovery Point Objective (RPO) is the maximum acceptable amount of time since the last data recovery point. This metric describes what ErgoPlus considers to be an acceptable amount of data loss (measured in time) between the last backup point and a service interruption. ErgoPlus considers a maximum RPO of four (4) hours to be tolerable.
Recovery Time Objective (RTO) is the maximum acceptable delay between the interruption of service and the restoration of service. This metric describes what ErgoPlus considers to be an acceptable length of time for service to be unavailable when a disaster scenario occurs. ErgoPlus considers a maximum RTO of twenty-four (24) hours to be tolerable.
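
As an illustration of how these two objectives can be evaluated against an incident timeline, the following is a minimal Python sketch. Only the four-hour RPO and twenty-four-hour RTO come from this document; the function names and the sample timestamps are hypothetical and do not describe ErgoPlus tooling.

```python
from datetime import datetime, timedelta

RPO = timedelta(hours=4)   # maximum tolerable data-loss window (from this document)
RTO = timedelta(hours=24)  # maximum tolerable outage duration (from this document)


def data_loss_window(last_backup: datetime, interruption: datetime) -> timedelta:
    """Time between the last recovery point and the service interruption."""
    return interruption - last_backup


def meets_rpo(last_backup: datetime, interruption: datetime) -> bool:
    """True if the potential data loss is within the RPO."""
    return data_loss_window(last_backup, interruption) <= RPO


def meets_rto(interruption: datetime, restored: datetime) -> bool:
    """True if service was restored within the RTO."""
    return restored - interruption <= RTO


# Hypothetical incident timeline, for illustration only.
last_backup = datetime(2024, 1, 10, 9, 0)
interruption = datetime(2024, 1, 10, 12, 30)
restored = datetime(2024, 1, 11, 10, 0)

print(meets_rpo(last_backup, interruption))  # 3.5 hours of potential data loss -> True
print(meets_rto(interruption, restored))     # 21.5 hours of downtime -> True
```

In this sample timeline both objectives are met; a last backup taken more than four hours before the interruption would fail the RPO check.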

Recovery Process

In the event of a DR scenario, the Director of Product will declare a state of disaster and communicate an initial assessment to all impacted customers within eight (8) hours of the service disruption.
Once a disaster has been declared, the Director of Product will notify all critical engineering and operations staff and activate immediate mitigation efforts.
Engineering and Operations staff will identify all affected services and, where automated recovery has failed, will manually reprovision those services using source-controlled infrastructure-as-code (IaC) resources.
Following successful deployment and smoke testing of new resources, the Director of Product will communicate “All Clear” status to all customers and stakeholders.
A final message will be coordinated with Customer Support to communicate relevant incident information and future prevention steps to affected customers.
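
The time limits in this process can be derived from the moment service is disrupted. The sketch below is one possible way to compute them, assuming the eight-hour customer notification window and the twenty-four-hour RTO stated in this document; the function name, the milestone labels, and the sample timestamp are hypothetical.

```python
from datetime import datetime, timedelta, timezone

CUSTOMER_NOTICE_WINDOW = timedelta(hours=8)  # initial assessment to customers (from this document)
RTO = timedelta(hours=24)                    # maximum tolerable outage (from this document)


def recovery_deadlines(disruption_start: datetime) -> dict[str, datetime]:
    """Return the latest acceptable time for each time-bound milestone."""
    return {
        "initial_customer_notice": disruption_start + CUSTOMER_NOTICE_WINDOW,
        "service_restored": disruption_start + RTO,
    }


# Hypothetical disruption time, for illustration only.
start = datetime(2024, 3, 5, 14, 0, tzinfo=timezone.utc)
for milestone, deadline in recovery_deadlines(start).items():
    print(f"{milestone}: {deadline.isoformat()}")
```

Using timezone-aware timestamps keeps the deadlines unambiguous when staff and customers are in different time zones.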