The importance of online computing systems such as extranets, CRM and ERP
practically goes without saying; simply put, the modern company is not able
to function if these systems fail. Even systems of a less crucial nature have
high value, in terms of the opportunity cost of work that was already devoted
and the impact on user perception when sites go down. However, systems do
go down; through errors, through malicious activity, through external factors
that are beyond control, any number of forces exist which can destroy the
delicate equilibrium defined as "a working system."
Risk of something going catastrophically wrong is a given factor, which very
little can be done to eliminate; few IT departments are equipped to completely
prevent natural disaster, power failure, cascading equipment failure, administrator
mistakes, faulty vendor patches, security breaches, or data center accidents.
While many of these scenarios can be mitigated, there is no perfect silver
bullet. Rather, every testing plan, security measure, and building feature
can be thought of as something which is done to decrease likelihood and impact
of disaster.
Because risk cannot be removed entirely, system designers and administrators
must think through how each scenario would be managed, in terms of what are
called "disaster recovery" and "business continuance."
The most basic level of risk mitigation is to ensure that the complete loss
of a system can eventually be recovered from; that is, to ensure that the
system could be recreated. At its simplest, disaster recovery (DR) can be a
commitment to plan, perform, and verify system and data backups to nonvolatile
removable media on a regular basis; it should also include complete, current
documentation of what has been built and a plan for turning a box of tapes
back into a working system. Strictly speaking, DR only ensures that the
business will be able to recover from a disaster eventually. It does not say how
quickly that recovery will happen; there are usually a great many complex
components to replace, and in the worst of cases there will be heavy contention
for those replacement resources from the other customers at an affected data
center.
Planning for resources quickly leads into the next stage of risk mitigation,
which is business continuance. This is the process of attempting to ensure
a time frame in which business can continue. Strategies may involve "cold"
spare systems set aside at another location, a stand-by system, or it could
go all the way to a fully fault-tolerant and load-balanced multi-data center
deployment. The challenge is maintaining focus on planning rather than gear.
Many of the systems destroyed in the 9/11 attack had full or partial redundancy
across the river in New Jersey, redundancy that was useless for days or even weeks
because it relied on the systems and network administrators to bring it up.
Within an online system, the simplest and most effective form of fault-tolerance
is clustering of components. At the presentation layer this may mean several
small web servers or mail transfer agents behind a hardware load balancer;
at the intelligence layer this may mean multiple application servers in a
master-dispatcher/multiple-slaves arrangement; at the storage layer this
may mean a primary database with another in standby. This sort of clustering
does not prevent system-wide outages, but it does provide safety from the
sort of management errors and equipment failures that happen in day-to-day
administration. Additionally, designing for this form of fault-tolerance
provides the incidental benefit of greatly improved scalability, because
nodes in a group can be added or subtracted with ease.
At the presentation layer, many vendors offer load-balancing via dedicated switches
(Foundry, F5, Cisco, Alteon). These switches simply present a virtual
IP address (VIP) which represents the server farm, then direct traffic for that
VIP to specific servers which the switch knows to be alive and ready to answer
the query. Load-balancing algorithms include round-robin, least-connections and
primary/secondary; while different names may be used by different companies,
these algorithms are basic throughout the industry. Additionally, most load-balancers
now have some ability to support persistent sessions (required for SSL and
stateful functions like shopping carts).
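As a rough illustration of the three algorithms named above, the following
Python sketch models a VIP choosing among servers it believes to be alive.
The Server and VirtualIP classes and their method names are hypothetical and
do not correspond to any vendor's implementation; the health checking a real
switch performs is reduced here to a simple "alive" flag.

```python
class Server:
    """One real server behind the VIP (hypothetical model)."""
    def __init__(self, name):
        self.name = name
        self.alive = True       # stand-in for the switch's health check
        self.connections = 0    # currently open connections


class VirtualIP:
    """A VIP fronting a server farm, with three selection policies."""
    def __init__(self, servers):
        self.servers = list(servers)
        self._turn = 0

    def _live(self):
        live = [s for s in self.servers if s.alive]
        if not live:
            raise RuntimeError("no live servers behind the VIP")
        return live

    def round_robin(self):
        # Walk the farm in order, skipping servers marked dead.
        for _ in range(len(self.servers)):
            server = self.servers[self._turn % len(self.servers)]
            self._turn += 1
            if server.alive:
                return server
        raise RuntimeError("no live servers behind the VIP")

    def least_connections(self):
        # Prefer the live server with the fewest open connections.
        return min(self._live(), key=lambda s: s.connections)

    def primary_secondary(self):
        # Always use the first live server in configured order.
        return self._live()[0]
```

In this sketch a dead server is simply skipped by every policy, which is the
property that makes the farm fault-tolerant; persistent sessions would need
an additional client-to-server table that is not shown.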
In the intelligence layer, load-balancing is almost always implemented as
a software function provided by the application server suite or the operating
system. These systems typically work in a master-dispatcher/multiple-slave
fashion, which is to say that one node receives all requests and then dispatches
those requests to other, slave nodes. The designer may choose to provide
for slaves to take over the master's dispatch role via a virtual IP transfer.
Other fault-tolerance and load-balancing methodologies used in this layer
include "tying" of application servers to presentation servers (A always
uses D, B always uses E, C always uses F, if E fails then B stops accepting
new connections), or implementation of another hardware load-balancer.
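The "tying" arrangement described above can be sketched in a few lines of
Python. The tie table uses the server letters from the example in the text;
the function names are hypothetical, and how liveness is detected is left
out as an assumption.

```python
# Tie table from the example: presentation server -> application server.
TIES = {"A": "D", "B": "E", "C": "F"}


def accepting_new_connections(presentation_server, live_app_servers):
    """A presentation server takes new connections only while the
    application server it is tied to is alive."""
    return TIES[presentation_server] in live_app_servers


def available_front_ends(live_app_servers):
    """List the presentation servers that may still accept new connections."""
    return sorted(p for p in TIES
                  if accepting_new_connections(p, live_app_servers))
```

With E down, only A and C remain in service, so the load balancer in front
of the presentation layer would need to stop sending new traffic to B.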
In the storage layer, fault-tolerance becomes much more challenging to implement,
largely because of the data integrity requirements imposed by modern databases.
Enterprise-class databases guarantee Atomicity, Consistency, Isolation, and
Durability in every transaction, a set of properties known as the ACID
test. The details of ACID are unimportant
for the purposes of this paper, with one exception: they are designed to ensure
data integrity under adverse conditions, and they make server-farm and master-dispatcher
load-balancing schemes very difficult to implement. However, there are still
ways to provide fault-tolerance in the database system. On a side note, one
thing to look for is clients who are using an ACID-compliant database for
tasks that would be better served by raw storage from a NAS -- these clients
can save a great deal of money by downsizing the database and adding a file
server.
The simplest way to cluster a database is to use a passive standby system
with the same configuration as a fallback. This system can be manually brought
up after the primary node fails, which is called cold standby, or it can be
an active and aware partner via clustering technology (e.g. Veritas or Microsoft
Cluster Server). The latter method is called hot standby and typically assumes
a separate storage solution.
Storage is an issue in active/passive clustering, because there are not a
great many solutions for mounting the same storage read-write from multiple
servers. In other words, your architecture options are to do cold standby
and manually mount the storage device from the passive node (assuming that
the active node didn't corrupt the filesystem on its way down), or to maintain
two equally-sized chunks of storage. SAN or NAS solutions will allow the passive
node to mount the storage read-only though, so the passive node can be safely
used to do things like database reporting, live export-style backups, &c.
Active/Passive clustering is an unpopular option with many cash-strapped companies,
because it requires standby hardware and storage -- typically some very high-end
systems, as well. While less powerful systems and lower-quality storage can
certainly be used for the passive node, doing so is considered risky. After
all, if the system load is too great for the backup database to handle, the
backup might as well not exist. Additionally, low performance on the backup
system will act as a goad to shift operations back to the primary node as
soon as possible, which is a down-time inducing operation that 7x24 shops
will be loath to consider. If the systems are equally sized, there is no need
to switch back to A until B has a problem.
The other way to cluster is Active/Active. There are recent commercial products
for master-dispatcher database clustering, but the essentially conservative
DBA crowd is slow to move in this direction (especially those who were burned
by Oracle Parallel Server). Oracle 9iRAC is the best-known database
in this model. A far more common solution for Active/Active Clustering is
to manually redesign the database into multiple instances running on multiple
nodes. This is particularly common for ASPs, who will divide databases by
client, but the database can also be divided up by functions as long as cross-database
traffic is strictly minimized. Once the databases have been logically divided
into multiple instances (e.g. Instance 1 and Instance 2), Active/Passive Clustering
can be used to distribute them, with each node acting as primary for one
instance and secondary for the other:
| Active/Active Clustering | Instance 1 | Instance 2 |
| Node A | Primary | Secondary |
| Node B | Secondary | Primary |
Of course, this sort of design raises a whole new set of design challenges,
particularly in load-distribution and planning. For one thing, resource utilization
must be carefully monitored, as the failure of a heavily-loaded node can produce
a cascading effect that may take down the other node as well. Additionally,
the division of data into database instances must be carefully considered
according to the rules of data normalization, as excessive cross-instance
traffic will be very bad for performance. Finally, this sort of design effectively
isolates data storage into silos of paired nodes, which must be located on
the same LAN and vertically scaled; a third or fourth node would make this
very complicated indeed. A master-dispatcher load-balancing and fault-tolerance
solution would be the better choice by far.
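The cascading-failure risk noted above comes down to simple arithmetic: when
one node of the pair fails, the survivor must carry both instances. The
sketch below assumes two equally sized nodes and expresses load as a
percentage of one node's capacity; the function names and the idea of a
single load figure per node are simplifying assumptions.

```python
def can_absorb_failover(load_a, load_b, capacity=100):
    """True if a single surviving node could carry both instances.

    Loads are percentages of one node's capacity; both nodes are
    assumed to be the same size.
    """
    return load_a + load_b <= capacity


def failover_headroom(load_a, load_b, capacity=100):
    """Capacity left on the survivor after failover (negative means overload)."""
    return capacity - (load_a + load_b)
```

Two nodes running at 40% and 45% leave 15% of headroom after a failover,
while nodes at 60% and 50% would overload the survivor, which is the
cascade the monitoring is meant to catch.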
All of the techniques and technologies reviewed to this point are intended
to prevent outages due to simple failures and mistakes: if an administrator
accidentally reboots the wrong machine or a power supply fails, the site will
not go down. Higher level problems such as data corruption or a building-wide
disaster can only be protected against in one way: the data must be regularly
copied from the production machines to something else. Modern backup systems
are far from perfect, and a great deal of effort has been expended in search
of something better. It is advisable to look into multiple layers of backup,
particularly when using a storage vendor. It is also wise to consider backups
in terms of restoral time: on the one hand, the configuration and data files
are far more important than the base OS and frequently represent a small
percentage of the disk utilization; on the other hand installing a base OS,
bringing it up to the same patch level, and then restoring configuration
and data may be more time consuming than simply restoring the latest weekly
backup of the entire system and then restoring three or four incrementals.
Backup types fall into two general complementary categories, and some combination
of the two is recommended.
A network level backup is simply a copy of the data to another location:
web server configuration files may be copied to the file server, the database
export may be copied to the DBA's laptop, the development server's CVS repository
may be copied to a backup system in another data center. This sort of copy
is easy to perform and restore from, but can rapidly become costly in terms
of metered long-distance bandwidth and is typically not used for large amounts
of data. The rsync tool can help reduce bandwidth consumption, but typically
network-based backup is best used to
consolidate data within a data center in preparation for back up to removable
media or for making emergency backups (belt-and-suspenders to the main removable
backup). Network backup to another system is also costly in terms of disk
space, and it is unusual to see more than a week or two of backups kept.
For long-term archiving and offsite storage, removable media is the only
reasonable way to back things up. Historically, reusable magnetic tape has
been the price-performance ratio leader, as well as offering well-defined
storage lifetimes when maintained properly. However, for some backup tasks
the best solution may be an optical disk (CD-R or DVD-R), magneto-optical
hybrid systems, or even fault-tolerant arrays of inexpensive ATA hard drives.
Backup systems are as a rule expensive to purchase, configure, manage, and
operate, with typical restoral success rates falling short of 100%, but the
cost is certainly sustainable when compared with the alternative of losing
the data without hope of retrieval.
In the typical shared data-center storage system the data is backed up over
the management network on a daily incremental, weekly full basis, which means
that the daily backup is only those files which have changed in the last
twenty-four hours. An additional complication is that live systems typically
have many files open for writing; a file that changes in the middle of the
backup procedure may be copied to tape in a corrupt state. Not all backup
systems are created equal, and some are better at handling open files than others.
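One consequence of the daily incremental, weekly full schedule is that a
restore needs the last full backup plus every incremental taken since. The
Python sketch below works out that chain for a given failure date. It
assumes the full backup runs on Sunday and that a backup completes every
night; both are illustrative assumptions, as is the function name.

```python
from datetime import date, timedelta


def restore_chain(failure_day, full_weekday=6):
    """List the backups needed to restore to the night before failure_day.

    full_weekday uses date.weekday() numbering (6 = Sunday), which is an
    assumption about the schedule rather than a fixed convention.
    """
    last_backup = failure_day - timedelta(days=1)
    days_since_full = (last_backup.weekday() - full_weekday) % 7
    full_day = last_backup - timedelta(days=days_since_full)
    chain = [("full", full_day)]
    for offset in range(1, days_since_full + 1):
        chain.append(("incremental", full_day + timedelta(days=offset)))
    return chain
```

For a failure on a Friday the chain is Sunday's full backup followed by four
incrementals, all of which must be readable for the restore to succeed.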
However, databases complicate matters because the on-disk files are always
active and cannot simply be copied without introducing inconsistencies. The
database must be backed up using a software agent (essentially a client that
requests a full dump of the data), or by using the database's own export
tools to dump the entire database to disk. Both methods are resource-intensive,
and the export method requires doubling the available storage. Note however
that scratch space for database export does not need to be on the highly
reliable and expensive SAN system; a direct-attached disk array in RAID0
will be entirely sufficient. However scratch space is implemented, its cost
and the time required to perform an export should be weighed carefully against
an agent-based backup.
Whatever type of removable media is chosen, its removable nature allows for
long-term off-site storage. Off-site storage is highly recommended as a
disaster recovery measure, since it avoids losing the backup in the same
disaster that takes out the data center. Typically media is stored on site
for a week, as most recovery requests will occur in this first week. Next, a
courier will remove the media to a nearby storage facility where it will be
kept for one year. Most data centers require the storage facility to be
located no closer than five miles and no farther than twenty miles away; while
the risk of a generalized disaster disabling or preventing access to the
storage facility exists, the inconvenience of long-distance courier service is
the overriding consideration. Few managed service providers offer customized
off-site storage plans, but customers can pursue contracts directly with the
storage facility operator in order to extend the length of storage.
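The rotation described above (one week on site, then one year off site) can
be expressed as a small lookup, which is useful when deciding how long a
restore request will take to fulfil. This is a minimal sketch: the function
name is hypothetical, and treating "one year" as 365 days is an assumption.

```python
from datetime import date, timedelta

ON_SITE = timedelta(days=7)      # first week: kept at the data center
OFF_SITE = timedelta(days=365)   # then one year at the storage facility


def media_location(written, today):
    """Return where a piece of media written on a given date should be."""
    age = today - written
    if age < timedelta(0):
        raise ValueError("media cannot be written in the future")
    if age < ON_SITE:
        return "on-site"
    if age < ON_SITE + OFF_SITE:
        return "off-site"
    return "expired"
```

A customer who contracts directly with the storage facility for longer
retention would, in this sketch, simply use a larger OFF_SITE value.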
A variant on network backups, the practice known as data vaulting involves
long distance copies of large amounts of data on a near-real-time (asynchronous
write) or real time (synchronous write) basis. For instance, a company may
consider backing up its production-line systems from its offices to a colocation
or managed hosting provider's location as a business continuance measure.
Tools for enabling this sort of replication solution include SRDF channel
extenders (CNT) for customers who use the EMC Symmetrix, hard-drive level
appliances (MiraLink), file-system level software (Veritas Storage
Replicator), or application level software (Quest Shareplex). There
are also companies that offer this service as a main product offering, such
as Arsenal Digital and Hewlett Packard's SunGard division. A far simpler
solution exists in the capability of common databases such as Oracle and
Microsoft SQL Server to replay log files and thereby recreate transactions;
a scripted solution to capture logs from the primary database, copy them
to the secondary and replay them is colloquially known as "log-shipping."
Log-shipping is an effective asynchronous tool, but it is somewhat resource-intensive
and data can be lost if the primary system fails before the next run of the
script. If the system can afford some degree of data loss, this solution
is entirely adequate; if no data loss can be tolerated, the more complex
and expensive solutions above must be considered.
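A log-shipping script of the kind described above might look like the
following Python sketch. It assumes the primary writes closed, archived log
files with sortable names into a directory, and it takes the replay step as
a caller-supplied function, since the actual replay command is specific to
each database. The directory layout, the ".log" suffix, and the function
names are all assumptions made for illustration.

```python
import shutil
from pathlib import Path


def ship_logs(primary_dir, standby_dir, apply_log):
    """Copy logs not yet present on the standby, oldest first, and replay each.

    apply_log is a stand-in for the database's own replay mechanism.
    Returns the names of the logs shipped on this run.
    """
    primary, standby = Path(primary_dir), Path(standby_dir)
    standby.mkdir(parents=True, exist_ok=True)
    shipped = []
    for log in sorted(primary.glob("*.log")):
        target = standby / log.name
        if target.exists():
            continue  # shipped on an earlier run
        shutil.copy2(log, target)
        apply_log(target)
        shipped.append(log.name)
    return shipped
```

Run on a schedule, this ships only what has accumulated since the previous
run, so the exposure to data loss is whatever the primary wrote after the
last run completed.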
Whether data is vaulted on a continual basis, log-shipped on a regular basis,
or simply restored from removable media in the event of disaster, it does
little good without servers, operating systems, and properly configured applications
to run it on. Cold standby systems are a requirement of any business continuance
plan. It is frequently assumed that there will be ample opportunity in the
event of a disaster to locate systems, configure them, and bring them up;
this may be true of a single customer's outage, but is a flawed assumption
in the event of complete data center outage. In order to be reliable,
systems should be online, patched, and properly configured at all times so
that the rollover from primary to secondary requires nothing more than a
DNS alteration.
However, it is certainly reasonable for most businesses to scale cold-standby
systems down within reason, as few customers will expect 100% performance
after a data center failure. Caution is required, but it is certainly reasonable
to implement cold-standby without local fault tolerance clustering, on uni-processor
systems, or using less-expensive storage media (e.g. a NAS or DADS array
instead of a Symmetrix SAN).
Of course, once the design and setup and cost of cold-standby systems and
data replication has been considered, it is not uncommon for customers to
seek something more from their expenditure than simple insurance against
a systemic failure. If the data centers are physically far from one
another and both located near (in a network sense) concentrations of customers,
then it may make sense to consider scaling the second location up and using
it as a primary service point.
At its simplest, the multi-data center deployment is two exact copies of
the infrastructure, each providing backup to the other as described in the
active/active database clustering section. Much like the database cluster,
this scenario is most applicable to ASP-style operations in which the data
and functionality is divided into discrete units by customer; e.g., a Japan
data center may be used to support Asian customers and a California data
center may be used to support American customers. However, customers that
do not exactly fit this model can also take advantage of multi-data center
deployments.
One common model involves databases with a great deal of read activity and
relatively little write activity: for instance, information retrieval services,
event ticketing systems, and data analysis services. In this model, the intelligence
level services may be modified to read from one database and write to another,
after which read-only copies of the database may be established using simple
log-shipping. Because writes are still directed back to the central database,
write performance will be relatively poor, so use of this model is only
encouraged for low-write activity. In the event of failure at the primary
data center, one of the read-only databases must be promoted to read-write.
This illuminates the core issue of multi-data center deployments: data integrity.
Writes to the formerly read-only database will have to be synchronized back
into the primary database when it comes back online, a time-consuming and
somewhat perilous procedure. Additionally, all active/passive clustering
solutions can fall prey to a condition known as "split-brain syndrome," in
which each node thinks that it is the active primary node, causing no end
of confusion.
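The read-from-one, write-to-another model described above, including the
promotion step after a primary failure, can be sketched as a small router in
the intelligence layer. The class and its method names are hypothetical, and
database handles are represented by plain strings to keep the example
self-contained.

```python
class DatabaseRouter:
    """Send writes to the central database and reads to read-only copies."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = list(replicas)
        self._next = 0

    def for_write(self):
        if self.primary is None:
            raise RuntimeError("no read-write database available")
        return self.primary

    def for_read(self):
        if not self.replicas:
            return self.for_write()  # no copies left: read from the primary
        replica = self.replicas[self._next % len(self.replicas)]
        self._next += 1
        return replica

    def promote(self, replica):
        """After the primary fails, make one read-only copy read-write."""
        self.replicas.remove(replica)
        self.primary = replica
```

The sketch deliberately has no automatic promotion: deciding that the
primary is really gone is exactly where split-brain arises, so promote() is
left as an explicit operator action, and the later resynchronization of
writes is not modeled.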
The simplest component in the multi-data center deployment scenario is the
one that has received the most vendor attention: global server load balancing,
or GSLB. Assuming that the data is replicated properly and the applications
are the same at all data centers, GSLB is the process of determining the
best data center to send a given user to. This is typically done by setting
the DNS Time To Live for the site to a low number, thereby forcing client
resolvers to re-query the site's authoritative name server, which is the GSLB device.
This device will use techniques ranging from geo-location database lookup
to BGP analysis to ping and traceroute in order to determine the client's
location and the best data center to direct that client to, which will be
the DNS A record that is returned. GSLB products are offered by Server Load
Balancing switch vendors (Foundry, F5, Cisco) and by Content Delivery Network
operators (Akamai, Speedera). The SLB products typically integrate well with
local SLB switches, whereas the CDN products are sold as globalized services.
Because GSLB appliances need to be installed as active/passive partners at
multiple locations to be effectively fault-tolerant, CDN services are typically
somewhat more cost-effective, at least in the short term. GSLB systems are
prone to split-brain problems, which will result in multiple A records being
returned to the client. RFC-compliant DNS resolvers can handle this, but it
may still be a problem if one of the GSLB devices is returning wildly inaccurate
information. CDN GSLB services, as a competency required to enable their core
product, are considered to be somewhat better managed and monitored against
this possibility.
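The decision a GSLB device makes for each DNS query can be sketched as
follows. The DataCenter class, the per-region latency table, and the resolve
function are hypothetical; a real device would derive its estimates from
geo-location, BGP, or probe data as described above. The addresses come from
ranges reserved for documentation.

```python
from dataclasses import dataclass, field


@dataclass
class DataCenter:
    name: str
    address: str                 # the A record handed out for this site
    healthy: bool = True
    rtt_ms: dict = field(default_factory=dict)  # client region -> latency estimate


def resolve(client_region, data_centers, ttl=30):
    """Answer a query with the closest healthy data center and a low TTL."""
    healthy = [dc for dc in data_centers if dc.healthy]
    if not healthy:
        raise RuntimeError("no healthy data center to answer with")
    best = min(healthy,
               key=lambda dc: dc.rtt_ms.get(client_region, float("inf")))
    return {"type": "A", "address": best.address, "ttl": ttl}
```

The low TTL is what makes failover take effect quickly: once a data center
is marked unhealthy, clients pick up the other site's address the next time
their cached answer expires.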
It is relatively simple to define what a data center failure might look like:
terrorist attack, natural disaster, power outage. These scenarios are also
relatively uncommon compared to the myriad of lesser disasters that can make
a system unusable. A risk management strategy should include analysis of
the cost of fail-over to a secondary site and return of services to the primary,
definitions of the triggers required for fail-over, and a plan for return
from fail-over to primary. Automated solutions need to be studied and tested
carefully before implementation, as flip-flops would be potentially disastrous
to data integrity.
Much like the nebulous IT concept known as "information security," risk management
is a matter of trade-offs. Solutions are less than 100% effective, costly
to design and implement, difficult to operate and manage, and require constant
justification against what seems the relatively unlikely occurrence of system
failure and data loss. Statistically speaking, data loss and system failures
will happen sooner rather than later, but there are always customers who
have been lucky enough to remain at the thin edge of the bell curve and lose
very little data. However, to operate with no risk management policy at all
is to tempt fate. There is no organization without some need for backup:
even those who work with time-sensitive data have a need to preserve their
system configurations and should consider providing for simple load-balancing.
The challenge is to accurately determine what level of risk can be tolerated,
what level of insurance can be afforded, and what solutions should be implemented
to provide this insurance.
Last modified: Nov 25, 2005 12:48 pm.