Risk Management Strategies

What is Risk Management?

The importance of online computing systems such as extranets, CRM and ERP practically goes without saying; simply put, the modern company is not able to function if these systems fail. Even systems of a less crucial nature have high value, in terms of the opportunity cost of work that was already devoted and the impact on user perception when sites go down. However, systems do go down; through errors, through malicious activity, through external factors that are beyond control, any number of forces exist which can destroy the delicate equilibrium defined as "a working system."

Risk of something going catastrophically wrong is a given factor, which very little can be done to eliminate; few IT departments are equipped to completely prevent natural disaster, power failure, cascading equipment failure, administrator mistakes, faulty vendor patches, security breaches, or data center accidents. While many of these scenarios can be mitigated, there is no perfect silver bullet. Rather, every testing plan, security measure, and building feature can be thought of as something which is done to decrease likelihood and impact of disaster.

Because risk cannot truly be removed, system designers and administrators must think through how the risk in a given scenario will be managed, as defined by the terms "disaster recovery" and "business continuance."

Disaster Recovery

The most basic level of risk mitigation is to ensure that the complete loss of a system can eventually be recovered from; that is, to ensure that the system could be recreated. At its simplest, DR can be a commitment to plan, perform, and verify system and data backups to nonvolatile removable media on a regular basis; other recommendations are complete and updated documentation of what's been built and a plan for how to turn a box of tapes into a working system again. Strictly speaking, DR helps the business ensure that they will be able to recover from a disaster eventually. However, it doesn't say how quickly that recovery will happen; there are usually a great many complex components to replace, and in the worst of cases there will be heavy contention for those replacement resources from the other customers at an affected data center.

Business Continuance

Planning for resources quickly leads into the next stage of risk mitigation, which is business continuance. This is the process of attempting to ensure a time frame in which business can continue. Strategies may involve "cold" spare systems set aside at another location, a stand-by system, or it could go all the way to a fully fault-tolerant and load-balanced multi-data center deployment. The challenge is maintaining focus on planning rather than gear. Many of the systems destroyed in the 9/11 attack had full or partial redundancy across the river in New Jersey, redundancy that was useless for days or even weeks because it relied on the systems and network administrators to bring it up.

What Can Be Done?

Clustering

Within an online system, the simplest and most effective form of fault-tolerance is clustering of components. At the presentation layer this may mean several small web servers or mail transfer agents behind a hardware load balancer; at the intelligence layer this may mean multiple application servers in a master-dispatcher/multiple-slaves arrangement; at the storage layer this may mean a primary database with another in standby. This sort of clustering does not prevent system-wide outages, but it does provide safety from the sort of management errors and equipment failures that happen in day-to-day administration. Additionally, designing for this form of fault-tolerance provides the incidental benefit of greatly improved scalability, because nodes in a group can be added or subtracted with ease.

Load-Balancing

At the presentation layer, many vendors offer load-balancing via dedicated switches (Foundry, F5, Cisco, Alteon). These switches simply present a virtual IP address (VIP) which represents the server farm, then direct traffic for that VIP to specific servers which they know to be alive and ready to answer the query. Load-balancing algorithms include round-robin, least-connections and primary/secondary; while different names may be used by different companies, these algorithms are basic throughout the industry. Additionally, most load-balancers now have some ability to support persistent sessions (required for SSL and stateful functions like shopping carts).
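The three algorithms named above can be sketched in a few lines of Python. This is only an illustration of the selection logic, not any vendor's implementation; the Server class, its fields, and the server names are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Server:
    """Hypothetical view of one farm member as a load balancer might track it."""
    name: str
    alive: bool = True
    connections: int = 0


class RoundRobin:
    """Hand out live servers in turn, skipping any that are down."""

    def __init__(self, servers):
        self.servers = servers
        self._next = 0

    def pick(self):
        # Look at each server at most once, starting where the last pick stopped.
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if server.alive:
                return server
        return None  # the whole farm is down


def least_connections(servers):
    """Pick the live server with the fewest open connections."""
    live = [s for s in servers if s.alive]
    return min(live, key=lambda s: s.connections) if live else None


def primary_secondary(servers):
    """Pick the first live server in priority order (primary first)."""
    return next((s for s in servers if s.alive), None)
```

In each case a dead server is simply never chosen, which is what makes the farm fault-tolerant as well as load-balanced.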

In the intelligence layer, load-balancing is almost always implemented as a software function provided by the application server suite or the operating system. These systems typically work in a master-dispatcher/multiple-slave fashion, which is to say that one node receives all requests and then dispatches those requests to other, slave nodes. The designer may choose to provide for slaves to take over the master's dispatch role via a virtual IP transfer. Other fault-tolerance and load-balancing methodologies used in this layer include "tying" of application servers to presentation servers (A always uses D, B always uses E, C always uses F; if E fails then B stops accepting new connections), or implementation of another hardware load-balancer.
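The "tying" arrangement can be expressed as a fixed lookup plus a health check. The sketch below assumes the A/D, B/E, C/F pairing from the example above; the function name and the shape of the health map are hypothetical.

```python
# Pairing taken from the example in the text: A always uses D, B uses E, C uses F.
TIES = {"A": "D", "B": "E", "C": "F"}


def accepts_new_connections(presentation_server, app_server_alive):
    """Return True only while the tied application server is known to be alive.

    app_server_alive is a hypothetical health map such as {"D": True, "E": False};
    an application server missing from the map is treated as down.
    """
    tied_app_server = TIES[presentation_server]
    return app_server_alive.get(tied_app_server, False)
```

The design is deliberately simple: there is no dispatcher to fail, at the cost of losing one presentation server's capacity whenever its partner goes down.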

In the storage layer, fault-tolerance becomes much more challenging to implement, largely because of the data integrity requirements imposed by modern databases. Enterprise-class databases implement Atomicity, Consistency, Isolation, and Durability in every transaction, a set of requirements known as the ACID test. The details of ACID are unimportant for the purposes of this paper, with one exception: they are designed to ensure data integrity under adverse conditions, and they make server-farm and master-dispatcher load-balancing schemes very difficult to implement. However, there are still ways to provide fault-tolerance in the database system. On a side note, one thing to look for is clients who are using an ACID-compliant database for tasks that would be better served by raw storage from a NAS -- these clients can save a great deal of money by downsizing the database and adding a file server.

Active/Passive Clustering

The simplest way to cluster a database is to use a passive standby system with the same configuration as a fallback. This system can be manually brought up after the primary node fails, which is called cold standby, or it can be an active and aware partner via clustering technology (e.g. Veritas or Microsoft Cluster Server). The latter method is called hot standby and typically assumes a separate storage solution.
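The difference between the two modes is who notices the failure. In hot standby the passive node watches for a heartbeat from its partner and promotes itself when the heartbeat stops. The following is a minimal sketch of that decision only; it is not how Veritas or Microsoft Cluster Server work internally, and the class name, timeout, and time units are assumptions.

```python
class StandbyMonitor:
    """Passive node's view of its partner: promote after a silent interval."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_heartbeat = None  # no heartbeat seen yet
        self.role = "passive"

    def heartbeat(self, now):
        """Record a heartbeat from the active node at time `now` (seconds)."""
        self.last_heartbeat = now

    def check(self, now):
        """Promote to active if the partner has been silent longer than the timeout."""
        silent_too_long = (
            self.last_heartbeat is not None
            and now - self.last_heartbeat > self.timeout
        )
        if self.role == "passive" and silent_too_long:
            self.role = "active"
        return self.role
```

A real cluster also has to take over the storage and the service address, which is where most of the complexity lies.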

Storage is an issue in active/passive clustering, because there are not a great many solutions for mounting the same storage read-write from multiple servers. In other words, your architecture options are to do cold standby and manually mount the storage device from the passive node (assuming that the active node didn't corrupt the filesystem on its way down), or to maintain two equally-sized chunks of storage. SAN or NAS solutions will allow the passive node to mount the storage read-only though, so the passive node can be safely used to do things like database reporting, live export-style backups, &c.

Active/Passive clustering is an unpopular option with many cash-strapped companies, because it requires standby hardware and storage -- typically some very high-end systems, as well. While less powerful systems and lower-quality storage can certainly be used for the passive node, doing so is considered risky. After all, if the system load is too great for the backup database to handle, the backup might as well not exist. Additionally, low performance on the backup system will act as a goad to shift operations back to the primary node as soon as possible, which is a downtime-inducing operation that 7x24 shops will be loath to consider. If the systems are equally sized, there is no need to switch back to A until B has a problem.

Active/Active Clustering

The other way to cluster is Active/Active. There are recent commercial products for master-dispatcher database clustering, but the essentially conservative DBA crowd is slow to move in this direction (especially those who were burned by Oracle Parallel Server). Oracle 9i RAC is the most well-known database in this model. A far more common solution for Active/Active Clustering is to manually redesign the database into multiple instances running on multiple nodes. This is particularly common for ASPs, who will divide databases by client, but the database can also be divided up by functions as long as cross-database traffic is strictly minimized. Once the databases have been logically divided into multiple instances (e.g. Instance 1 and Instance 2), Active/Passive Clustering can be used to distribute them:
Active/Active Clustering

            Instance 1    Instance 2
Node A      Primary       Secondary
Node B      Secondary     Primary

Of course, this sort of design raises a whole new set of design challenges, particularly in load-distribution and planning. For one thing, resource utilization must be carefully monitored, as the failure of a heavily-loaded node can produce a cascading effect that may take down the other node as well. Additionally, the division of data into database instances must be carefully considered according to the rules of data normalization, as excessive cross-instance traffic will be very bad for performance. Finally, this sort of design effectively isolates data storage into silos of paired nodes, which must be located on the same LAN and vertically scaled; a third or fourth node would make this very complicated indeed. A master-dispatcher load-balancing and fault-tolerance solution would be the better choice by far.
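The cascading-failure risk can be checked with simple arithmetic: after a node fails, the survivor carries both instances, so the combined load must still fit. The sketch below assumes each node is primary for one instance and secondary for the other, and measures load as a percentage of one node's capacity; the function name and all figures are hypothetical.

```python
def survives_failover(instance_load, placement, capacity=100):
    """For each node, report whether the rest of the cluster can absorb its failure.

    instance_load: {instance: load as a percentage of one node's capacity}
    placement:     {instance: (primary_node, secondary_node)}
    """
    nodes = {node for pair in placement.values() for node in pair}
    result = {}
    for failed in sorted(nodes):
        load = {node: 0 for node in nodes if node != failed}
        for instance, (primary, secondary) in placement.items():
            # An instance runs on its primary unless that is the failed node.
            target = secondary if primary == failed else primary
            load[target] += instance_load[instance]
        result[failed] = all(total <= capacity for total in load.values())
    return result


placement = {"Instance 1": ("A", "B"), "Instance 2": ("B", "A")}
comfortable = survives_failover({"Instance 1": 40, "Instance 2": 50}, placement)
overloaded = survives_failover({"Instance 1": 70, "Instance 2": 60}, placement)
```

With the first set of figures either node can carry the pair; with the second, the survivor would be asked for 130% of its capacity, which is the cascading failure described above.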

Backup and Restore

All of the techniques and technologies reviewed to this point are intended to prevent outages due to simple failures and mistakes: if an administrator accidentally reboots the wrong machine or a power supply fails, the site will not go down. Higher level problems such as data corruption or a building-wide disaster can only be protected against in one way: the data must be regularly copied from the production machines to something else. Modern backup systems are far from perfect, and a great deal of effort has been expended in search of something better. It is advisable to look into multiple layers of backup, particularly when using a storage vendor. It is also wise to consider backups in terms of restoral time: on the one hand, the configuration and data files are far more important than the base OS and frequently represent a small percentage of the disk utilization; on the other hand, installing a base OS, bringing it up to the same patch level, and then restoring configuration and data may be more time-consuming than simply restoring the latest weekly backup of the entire system and then restoring three or four incrementals. Backup types fall into two general complementary categories, and some combination of the two is recommended.
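The restoral-time comparison is worth working through with real numbers for each system. The sketch below shows only the arithmetic; the function names are hypothetical, and any durations passed in are placeholders to be replaced with measured figures.

```python
def rebuild_hours(os_install, patching, config_and_data_restore):
    """Reinstall the base OS, patch it, then restore only configuration and data."""
    return os_install + patching + config_and_data_restore


def full_restore_hours(full_restore, per_incremental, incrementals):
    """Restore the latest weekly full backup, then each incremental in turn."""
    return full_restore + per_incremental * incrementals
```

For example, if a rebuild takes 1 hour to install, 2 hours to patch and half an hour to restore data (3.5 hours), while a full restore takes 2 hours plus four 15-minute incrementals (3 hours), the full-system backup wins despite moving far more data.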

Network Backup

A network level backup is simply a copy of the data to another location: web server configuration files may be copied to the file server, the database export may be copied to the DBA's laptop, the development server's CVS repository may be copied to a backup system in another data center. This sort of copy is easy to perform and restore from, but can rapidly become costly in terms of metered long-distance bandwidth and is typically not used for large amounts of data. The rsync tool can help reduce bandwidth consumption, but typically network-based backup is best used to consolidate data within a data center in preparation for backup to removable media or for making emergency backups (belt-and-suspenders to the main removable backup). Network backup to another system is also costly in terms of disk space, and it is unusual to see more than a week or two of backups kept.

Removable Media Backup

For long-term archiving and offsite storage, removable media is the only reasonable way to back things up. Historically re-usable magnetic tape has been the price-performance ratio leader, as well as offering well-defined storage lifetimes when maintained properly. However, for some backup tasks the best solution may be an optical disk (CD-R or DVD-R), magneto-optical hybrid systems, or even fault-tolerant arrays of inexpensive ATA hard drives. Backup systems are as a rule expensive to purchase, configure, manage, and operate, with typical restoral success rates falling short of 100%, but the cost is certainly sustainable when compared with the alternative of losing the data without hope of retrieval.

In the typical shared data-center storage system the data is backed up over the management network on a daily incremental, weekly full basis, which means that the daily backup contains only those files which have changed in the last twenty-four hours. An additional complication is that live systems typically have many files open for writing; these files can change in the middle of the backup procedure, leaving a corrupt copy on tape. Not all backup systems are created equal, and some are better at handling open files than others.
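One consequence of the daily incremental, weekly full schedule is that a restore needs the last full backup plus every incremental taken since. A small sketch of working out that chain follows; it assumes the weekly full runs on Sunday, which is an illustrative choice rather than anything stated above.

```python
from datetime import date, timedelta


def restore_chain(target, full_weekday=6):
    """List the backup sets needed to restore a system to `target`.

    full_weekday uses Python's convention (Monday=0 ... Sunday=6); the Sunday
    default is an assumption for this example.
    """
    days_since_full = (target.weekday() - full_weekday) % 7
    full_day = target - timedelta(days=days_since_full)
    chain = [("full", full_day)]
    for offset in range(1, days_since_full + 1):
        chain.append(("incremental", full_day + timedelta(days=offset)))
    return chain
```

A failure on a Wednesday therefore needs four sets (Sunday's full and three incrementals), and the loss of any one of them leaves a gap in the restored data.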

However, databases complicate matters because the on-disk files are always active and cannot simply be copied without introducing inconsistencies. The database must be backed up using a software agent (essentially a client that requests a full dump of the data), or by using the database's own export tools to dump the entire database to disk. Both methods are resource-intensive, and the export method requires doubling the available storage. Note however that scratch space for database export does not need to be on the highly reliable and expensive SAN system; a direct-attached disk array in RAID0 will be entirely sufficient. However scratch space is implemented, its cost and the time required to perform an export should be weighed carefully against an agent-based backup.

Off-site Storage

Whatever type of removable media is chosen, its removable nature allows for long-term off-site storage. Off-site storage is highly recommended as a disaster recovery measure, as opposed to losing the backup in the same disaster that takes out the data center. Typically media is stored on site for a week, as most recovery requests will occur in this first week. Next, a courier will remove the media to a nearby storage facility where it will be kept for one year. Most data centers require the storage facility to be located no closer than five miles and no farther than twenty miles away; while the risk of a generalized disaster disabling or preventing access to the storage facility exists, the inconvenience of long-distance courier service is an overriding consideration. Few managed service providers offer customized off-site storage plans, but customers can pursue contracts directly with the storage facility operator in order to extend the length of storage.

Data Vaulting

A variant on network backups, the practice known as data vaulting involves long distance copies of large amounts of data on a near-real-time (asynchronous write) or real time (synchronous write) basis. For instance, a company may consider backing up its production-line systems from its offices to a colocation or managed hosting provider's location as a business continuance measure. Tools for enabling this sort of replication solution include SRDF channel extenders (CNT) for customers who use the EMC Symmetrix, hard-drive level appliances (MiraLink), file-system level software (Veritas Storage Replicator), or application level software (Quest Shareplex). There are also companies that offer this service as a main product offering, such as Arsenal Digital and Hewlett Packard's SunGard division. A far simpler solution exists in the capability of common databases such as Oracle and Microsoft SQL Server to replay log files and thereby recreate transactions; a scripted solution to capture logs from the primary database, copy them to the secondary and replay them is colloquially known as "log-shipping." Log-shipping is an effective asynchronous tool, but it is somewhat resource-intensive and data can be lost if the primary system fails before the next run of the script. If the system can afford some degree of data loss, this solution is entirely adequate; if no data loss can be tolerated, the more complex and expensive solutions above must be considered.
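A log-shipping script is little more than a copy loop run on a schedule. The sketch below shows one pass of such a loop using plain files; it is not Oracle's or SQL Server's tooling. The directory layout, the ".log" naming, the assumption that log names sort in sequence order, and the apply_log callback (standing in for the database's own replay command) are all hypothetical.

```python
import shutil
from pathlib import Path


def ship_logs(archive_dir, standby_dir, apply_log):
    """Copy archived logs not yet on the standby, replaying each in name order.

    apply_log is a caller-supplied function that replays one log file on the
    secondary database. Returns the names of the logs shipped during this run.
    """
    archive_dir, standby_dir = Path(archive_dir), Path(standby_dir)
    standby_dir.mkdir(parents=True, exist_ok=True)
    shipped = []
    for log in sorted(archive_dir.glob("*.log")):
        target = standby_dir / log.name
        if target.exists():
            continue  # already shipped on an earlier run
        # Copy under a temporary name so a half-copied log is never replayed.
        partial = target.with_name(target.name + ".part")
        shutil.copy2(log, partial)
        partial.rename(target)
        apply_log(target)
        shipped.append(log.name)
    return shipped
```

Any transactions logged on the primary after the most recent pass exist only on the primary, which is the data-loss window described above; running the script more often narrows that window at the cost of more load.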

Cold standby gear

Whether data is vaulted on a continual basis, log-shipped on a regular basis, or simply restored from removable media in the event of disaster, it does little good without servers, operating systems, and properly configured applications to run it on. Cold standby systems are a requirement of any business continuance plan. It is frequently assumed that there will be ample opportunity in the event of a disaster to locate systems, configure them, and bring them up; this may be true of a single customer's outage, but is a flawed assumption in the event of a complete data center outage. In order to be reliable, systems should be online, patched, and properly configured at all times so that the rollover from primary to secondary requires nothing more than a DNS alteration.

However, it is certainly reasonable for most businesses to scale cold-standby systems down within reason, as few customers will expect 100% performance after a data center failure. Caution is required, but it is certainly reasonable to implement cold-standby without local fault tolerance clustering, on uni-processor systems, or using less-expensive storage media (e.g. a NAS or DADS array instead of a Symmetrix SAN).

Multi-Data Center Deployments

Of course, once the design, setup, and cost of cold-standby systems and data replication have been considered, it is not uncommon for customers to seek something more from their expenditure than simple insurance against a systemic failure. If the data centers are physically far from one another and both located near (in a network sense) concentrations of customers, then it may make sense to consider scaling the second location up and using it as a primary service point.

At its simplest, the multi-data center deployment is two exact copies of the infrastructure, each providing backup to the other as described in the active/active database clustering section. Much like the database cluster, this scenario is most applicable to ASP-style operations in which the data and functionality are divided into discrete units by customer; e.g., a Japan data center may be used to support Asian customers and a California data center may be used to support American customers. However, customers that do not exactly fit this model can also take advantage of multi-data center deployments.

One common model involves databases with a great deal of read activity and relatively little write activity: for instance, information retrieval services, event ticketing systems, and data analysis services. In this model, the intelligence layer services may be modified to read from one database and write to another, after which read-only copies of the database may be established using simple log-shipping. Because writes are still directed back to the central database, write performance will be relatively poor, so use of this model is only encouraged for low-write activity. In the event of failure at the primary data center, one of the read-only databases must be promoted to read-write. This illuminates the core issue of multi-data center deployments: data integrity. Writes to the formerly read-only database will have to be synchronized back into the primary database when it comes back online, a time-consuming and somewhat perilous procedure. Additionally, all active/passive clustering solutions can fall prey to a condition known as "split-brain syndrome," in which each node thinks that it is the active primary node, causing no end of confusion.
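The read/write split described here amounts to a small routing decision in the application layer. The following sketch treats any statement beginning with SELECT as a read, which is a simplification; the class name and the database labels are hypothetical.

```python
class ReadWriteRouter:
    """Send reads to local read-only copies and writes to the central database."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = list(replicas)
        self._next = 0

    def route(self, sql):
        """Return the name of the database that should run this statement."""
        is_read = sql.lstrip().upper().startswith("SELECT")
        if is_read and self.replicas:
            replica = self.replicas[self._next % len(self.replicas)]
            self._next += 1
            return replica
        return self.primary

    def promote(self, replica):
        """After the primary site fails, make one read-only copy the read-write primary."""
        self.replicas.remove(replica)
        self.primary = replica
```

The promote step is the easy half; reconciling the writes taken by the promoted copy with the original primary once it returns is the perilous part noted above, and nothing in this sketch addresses it.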

Global Server Load Balancing

The simplest component in the multi-data center deployment scenario is the one that has received the most vendor attention: global server load balancing, or GSLB. Assuming that the data is replicated properly and the applications are the same at all data centers, GSLB is the process of determining the best data center to send a given user to. This is typically done by setting DNS Time To Live for the site to a low number, thereby forcing all client connections to query the Start Of Authority server, which is the GSLB device. This device will use techniques ranging from geo-location database lookup to BGP analysis to ping and traceroute in order to determine the client's location and the best data center to direct that client to, which will be the DNS A record that is returned. GSLB products are offered by Server Load Balancing switch vendors (Foundry, F5, Cisco) and by Content Delivery Network operators (Akamai, Speedera). The SLB products typically integrate well with local SLB switches, whereas the CDN products are sold as globalized services. Because GSLB appliances need to be installed as active/passive partners at multiple locations to be effectively fault-tolerant, CDN services are typically somewhat more cost-effective, at least in the short term. GSLB systems are prone to split-brain problems, which will result in multiple A records being returned to the client. RFC-compliant DNS resolvers can handle this, but it may still be a problem if one of the GSLB devices is returning wildly inaccurate information. CDN GSLB services, as a competency required to enable their core product, are considered to be somewhat better managed and monitored against this possibility.
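Stripped of the location-detection techniques, the GSLB decision is: among healthy data centers, prefer the one nearest the client, and answer with a short TTL so the choice is revisited soon. The sketch below assumes the client's region has already been determined by one of the methods above; the class, the region labels, the 30-second TTL, and the addresses (taken from the ranges reserved for documentation) are all illustrative.

```python
from dataclasses import dataclass


@dataclass
class DataCenter:
    name: str
    vip: str       # address returned as the DNS A record
    region: str
    healthy: bool = True


def resolve(client_region, data_centers, ttl=30):
    """Return (address, ttl) for the best data center, or None if all are down."""
    live = [dc for dc in data_centers if dc.healthy]
    if not live:
        return None
    # Prefer a healthy data center in the client's region; otherwise any healthy one.
    local = [dc for dc in live if dc.region == client_region]
    chosen = (local or live)[0]
    return (chosen.vip, ttl)
```

The low TTL is what makes fail-over work: a client sent to a data center that subsequently fails will ask again within seconds and receive the other address.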

Advanced Topics

Minor Disaster Definition

It is relatively simple to define what a data center failure might look like: terrorist attack, natural disaster, power outage. These scenarios are also relatively uncommon compared to the myriad of lesser disasters that can make a system unusable. A risk management strategy should include analysis of the cost of fail-over to a secondary site and return of services to the primary, definitions of the triggers required for fail-over, and a plan for return from fail-over to primary. Automated solutions need to be studied and tested carefully before implementation, as flip-flops would be potentially disastrous to data integrity.
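One common guard against flip-flops is to require several consecutive failed checks before failing over and a longer run of successful checks before returning. The sketch below illustrates that idea only; the class name and thresholds are assumptions, and many shops will prefer the return to primary to be a manual, planned step.

```python
class FailoverController:
    """Decide which site is active from a stream of primary health checks."""

    def __init__(self, failures_to_fail_over=3, successes_to_fail_back=10):
        self.fail_over_after = failures_to_fail_over
        self.fail_back_after = successes_to_fail_back
        self.active = "primary"
        self._streak = 0  # consecutive checks arguing for a switch

    def observe(self, primary_ok):
        """Record one health check of the primary and return the active site."""
        if self.active == "primary":
            # Count consecutive failures; any success resets the count.
            self._streak = 0 if primary_ok else self._streak + 1
            if self._streak >= self.fail_over_after:
                self.active, self._streak = "secondary", 0
        else:
            # Count consecutive successes; any failure resets the count.
            self._streak = self._streak + 1 if primary_ok else 0
            if self._streak >= self.fail_back_after:
                self.active, self._streak = "primary", 0
        return self.active
```

Setting the fail-back threshold higher than the fail-over threshold means a primary that is only intermittently healthy never regains control.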

Conclusion

Much like the nebulous IT concept known as "information security," risk management is a matter of trade-offs. Solutions are less than 100% effective, costly to design and implement, difficult to operate and manage, and require constant justification against what seems the relatively unlikely occurrence of system failure and data loss. Statistically speaking, data loss and system failures will happen sooner rather than later, but there are always customers who have been lucky enough to remain at the thin edge of the bell curve and lose very little data. However, to operate with no risk management policy at all is to tempt fate. There is no organization without some need for backup: even those who work with time-sensitive data have a need to preserve their system configurations and should consider providing for simple load-balancing. The challenge is to accurately determine what level of risk can be tolerated, what level of insurance can be afforded, and what solutions should be implemented to provide this insurance.


Copyright 2003, Jack Coates. This document may be redistributed freely as long as it is not altered.

Last modified: Nov 25, 2005 12:48 pm.