This document details best practices to maximize uptime for applications built on top of ClustrixDB. This covers a wide range of topics, from environmental requirements to change management procedures, all of which ultimately impact availability of your application. Many of these are standard best practices or concepts you are likely already familiar with.
Designing for high availability minimizes risk with the following strategies:
- Careful control of environment and application changes
- Elimination of single points of failure
- Provisioning of sufficient additional resources to allow for continued operation in the event of failure
- Regularly scheduled and tested backup and recovery plan
Use Multiple Operation Environments
An operationally mature organization may have as many as four different environments:
- Disaster Recovery
For highest availability, a production environment may contain a pair of identical clusters which replicate in a Master-Master configuration. Only one cluster should actively take writes, with the other available for immediate failover (active/passive); alternatively, if there are distinct applications or databases, these may be configured such that one cluster is active for a set of applications, while the other cluster is active for another. Having both clusters take writes (active/active) is possible, but presents a number of operational challenges (see Master-Master Replication). Managing cut-over of application load from one cluster to the other can be handled through use of an external load balancer, or by reconfiguration of application servers.
A disaster recovery environment is typically sited in a geographically distinct location, and includes a cluster which replicates from the production environment, along with application servers which are able to provide site functionality (possibly with degraded performance) in case of site-wide failure of the production environment.
A staging environment will typically be a (scaled down) facsimile of the production environment, including comparable application servers and data sets. It is used to validate changes to the application software as well as upgrades to the ClustrixDB software. A crucial part of the staging environment is a test automation framework which allows for the exercise of the application and database with a load approximating peak load in the production environment.
A development environment allows for more ad hoc development, where the risk of developers interfering with each other's work is inconsequential.
To achieve optimum uptime goals Clustrix highly recommends the use of multiple clusters, which can provide the following:
- Isolation of application development from the production database
- Isolation of analytic workload from time-critical transactional workload
- Disaster recovery/business continuance failover for site-wide outages
- Pre-production validation of application changes and ClustrixDB software upgrades
Highly Available Production Environment
To take full advantage of ClustrixDB's fault tolerant architecture, the following environmental and provisioning requirements should be met:
- Each node's power supply should be connected to a separate power circuit
- Each production cluster should have two backend network (InfiniBand or Ethernet) switches
- Each backend network (InfiniBand or Ethernet) switch should be connected to a separate power circuit
- Each node should connect to the production network via two Ethernet ports (interfaces are bonded), to two separate Ethernet switches
- Cluster nodes should be over-provisioned to accomodate storage and concurrency requirements in the event of node failure
The vast majority of software failures arise from changes in application behavior, whether due to a bug in the application itself, or exposure of a bug in an underlying layer such as the database. Accordingly, best practice is to first thoroughly validate such changes in non-production environment(s), and then carefully roll out into production, with rollback plans in place to undo changes in case of surprises.
Application Change Management
The existence of development and staging environments allows an organization to roll out new applications and application changes in a safe manner.
- New application development is restricted to the development environment where malformed queries and other abuse of the database can only impact other development work.
- Once the code
has stabilized, it can be incorporated into the staging environment,
where it is tested at load, and in conjunction with the existing
- Note that the Historical Workload Analysis tool in the Clustrix Insight user interface provides a simple but powerful way to compare the composition of workload before and after a code change.
- After validating proper functionality and performance in the staging environment, the changes can be rolled into production.
ClustrixDB Upgrade Best Practices
Upgrading the software on your ClustrixDB cluster can be treated in much the same way as application code changes. While ClustrixDB software releases are thoroughly tested in-house, changes such as new compiler optimizations can have unanticipated impact in customer workloads; validation of the new release on a staging cluster running a simulated workload allows discovery of such issues prior to production roll-out.
In an ideal operational environment, customers can participate in ClustrixDB' beta program, obtaining early release candidates for use in their development cluster. Once a qualified release is available, they can test in their staging environment to eliminate problems evident under heavy workloads. When upgrading production, if a pair of clusters is available, one cluster should be upgraded first; the cluster can then undergo a full day or week of load before upgrading the second cluster, providing for failback to the second cluster running the prior, stable release.
Application Configuration for High Availability
While eliminating single points of failure in your application stack is beyond the scope of this document, the following guidelines pertain specifically to how your application interacts with the ClustrixDB database layer:
- Application servers should connect through the VIP
- Application software should have retry logic for database transaction or connection failures
- Initial retry timing should be aggressive as cluster typically handles component failures in a few seconds
- Retry frequency and iterations should also accomodate possible longer duration recovery
Backup and Recovery
ClustrixDB parallel fast backup provides for rapid backup which allows for near-constant backup time as your cluster grows, as the work is split across the nodes. When planning your backup strategy, consider the following:
- The FTP server typically becomes the bottleneck during backup
- It should be provisioned with 10Gb or several bonded 1Gb Ethernet interfaces
- It should have fast enough disk I/O to sink parallel writes from all nodes in the cluster
- It should have sufficient disk capacity to handle two or more full backups of the cluster (1/2 of total used capacity, as we back up only one replica from each slice)
- Off-site archival of backups
- Regular tests of the ability to restore from backup
- ClustrixDB parallel restore supports restoring subsets of a backup, as well as restoring to alternate locations (different database names), allowing for regular "spot checks" where a sufficiently large cluster for full restore is unavailable
- (Off-site) archival of binlogs using the repclient utility to support point-in-time restoration
- This also requires the use of repserver to replay binlogs
- The full process for point in time restoration is described in Point in Time Restoration