Data Center Disaster Recovery
KwaiSeng Consulting Systems Engineer
© 2006 Cisco Systems, Inc. All rights reserved.
Cisco Confidential
Agenda
• Data Center: The Evolution
• Data Center Disaster Recovery
  • Objectives
  • Failure Scenarios
  • Design Options
• Components of Disaster Recovery
  • Site Selection: Front End GSLB
  • Server High Availability: Clustering
  • Data Replication and Synchronization: SAN Extension
• Data Center Technology Trends
• Summary
The Evolution of Data Centers
Data Center Evolution
[Figure: timeline (1960, 1980, 2000, 2010) with the axes "Compute Evolution" and "Network Evolution". Compute labels: Mainframes, Client/Server, Internet Computing. Network labels: Terminal, TCP/IP, Thin Client: HTTP, Content Networking, Data Center Networking. Later stages: Data Center Consolidation, Network Optimization, Data Center Virtualization, Data Center Continuous Availability, Business Agility (Networked Data Center Phase).]
Phases: 1. Consolidation 2. Integration 3. Virtualization 4. High Availability
Today’s Data Center: Integration of Many Systems and Services
[Figure: N-tier data center architecture. Front-end network: WAN/Internet, resilient IP, firewall, cache, content switch, IDS. N-tier applications: web servers, app servers, DB servers, mainframe, IP communications, operations. Storage network: FC switches, VSANs, FC SAN, NAS, RAID, tape. A metro network (DWDM/SONET/Ethernet) and MAN/Internet connect the site to a secondary (DR) data center.]
Design themes called out on the slide:
• Scalable infrastructure
• Application and server optimization
• Data center security
• DC storage networks
• Distributed data centers
What Is a Distributed Data Center?
[Figure: a primary data center and a secondary data center hosting applications App A, App B, and App C (App A appears twice), each with FC-attached storage and data replication between the two sites.]
Distributed Data Centers
• Required for disaster recovery and business continuance
• Avoid a single, concentrated data repository
• High availability of applications and data access
• Load balancing together with performance scalability
• Better response and optimal content routing through proximity to clients
Front-End IP Access Layer
“Content Routing”: Site Selection
[Figure: the primary and secondary data centers (App A, App B, App C, FC storage) with the front-end IP access layer highlighted.]
Application and Database Layer
• “Content Switching”: Load Balancing
• “Server Clustering”: High Availability
[Figure: the primary and secondary data centers (App A, App B, App C, FC storage) with the application and database layer highlighted.]
Backend SAN Extension
“Storage” and “Optical”: Data Replication and Transport
[Figure: the primary and secondary data centers (App A, App B, App C, FC storage) with the backend SAN extension highlighted.]
Data Center Disaster Recovery
Disaster Recovery
• Recovery of data and resumption of service: ensuring the business can recover and continue after a failure or disaster
• Ability of a business to adapt, change, and continue when confronted with various outside impacts
• Mitigating the impact of a disaster
Disaster Recovery: What It Means for Business
• Business Resilience: continued operation of the business during a failure
• Business Continuance: restoration of the business after a failure
• Disaster Recovery: protecting data through offsite data replication and backup
Zero downtime is the ultimate goal.
Disaster Recovery Planning
• Business Impact Analysis (BIA): determines the impact of various disasters on specific business functions and company assets
• Risk analysis: identifies the functions and assets that are critical to the company’s operations
• Disaster Recovery Plan (DRP): restores operability of the target systems, applications, or computing facility at the secondary data center after a disaster
Disaster Recovery Objectives
Recovery Point Objective (RPO)
• The point in time (prior to the outage) to which systems and data must be restored
• The tolerable loss of data in the event of a disaster or failure
• Reflects the impact of data loss and the cost associated with that loss
Recovery Time Objective (RTO)
• The period of time after an outage within which the systems and data must be restored to the predetermined RPO
• The maximum tolerable outage time
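The RPO and RTO definitions can be made concrete with a small calculation. The sketch below is illustrative only: the function names and the incident timestamps are hypothetical and are not taken from the presentation. It measures the data-loss window and the outage duration of one incident and compares them with the stated objectives.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_recoverable_copy: datetime, outage_start: datetime) -> timedelta:
    """Data-loss window: time between the last recoverable copy and the outage."""
    return outage_start - last_recoverable_copy


def achieved_rto(outage_start: datetime, service_restored: datetime) -> timedelta:
    """Outage duration: time from the outage until service is restored."""
    return service_restored - outage_start


def meets_objectives(last_copy: datetime, outage_start: datetime, restored: datetime,
                     rpo_target: timedelta, rto_target: timedelta) -> bool:
    """True only if both the data-loss window and the outage fit the targets."""
    return (achieved_rpo(last_copy, outage_start) <= rpo_target
            and achieved_rto(outage_start, restored) <= rto_target)


# Hypothetical incident: nightly copy at 02:00, outage at 14:30, service back at 18:30.
last_copy = datetime(2006, 1, 10, 2, 0)
outage = datetime(2006, 1, 10, 14, 30)
restored = datetime(2006, 1, 10, 18, 30)

print(achieved_rpo(last_copy, outage))   # 12:30:00
print(achieved_rto(outage, restored))    # 4:00:00
```

With a 24-hour RPO and a 4-hour RTO this incident is within objectives; tightening the RPO to one hour would require more frequent copies or replication.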
Recovery Point/Time vs. Cost
[Figure: timeline with three markers: "Critical Data Is Recovered" (recovery point, time t0), "Disaster Strikes" (time t1), and "Systems Recovered and Operational" (time t2). The recovery point axis is scaled in days, hours, minutes, and seconds before the disaster; the recovery time axis is scaled in seconds, minutes, hours, days, and weeks after it. Technologies on the recovery point side: tape backup, periodic replication, asynchronous replication, synchronous replication. Technologies on the recovery time side: extended cluster, manual migration, tape restore. Cost increases as both intervals shrink.]
• Smaller RPO/RTO: higher cost (replication, hot standby)
• Larger RPO/RTO: lower cost (tape backup/restore, cold standby)
Failure Scenarios
Disaster could mean many types of failure:
• Network failure
• Device failure
• Storage failure
• Site failure
Network Failures
• ISP failure
  • Dual ISP connections
  • Multiple ISPs
• Connection failure within the network
  • EtherChannel®
  • Multiple route paths
[Figure: data center connected to the Internet through Service Provider A and Service Provider B.]
Device Failures
• Routers, switches, firewalls
  • HSRP
  • VRRP
• Hosts
  • HA cluster
  • Load-balanced (LB) server farm
  • NIC teaming
[Figure: data center connected to the Internet through Service Provider A and Service Provider B.]
Storage Failures
• Disk arrays: RAID
• Disk controllers
• Storage replication
• Site-to-site mirroring
• Optimization
[Figure: data center connected to the Internet through Service Provider A and Service Provider B.]
Site Failures
• Partial site failure
  • Application maintenance
  • Application migration
  • Application scheduled DR exercise
• Complete site failure
  • Disaster
[Figure: data center connected to the Internet through Service Provider A and Service Provider B.]
Warm Standby
• A data center that is equipped with hardware and communications interfaces capable of providing backup operating support
• The latest backups from the production data center must be delivered
• Network access needs to be activated
• Applications need to be started manually
Disaster Recovery: Active/Standby
[Figure: primary data center and secondary data center (warm standby), with applications App A, App B, and App C and FC storage, connected by an IP/optical network.]
Hot Standby
• A data center that is environmentally ready and has sufficient hardware and software to provide data processing service with little downtime
• A hot backup site offers disaster recovery with little or no human intervention
• Application data is replicated from the primary site
• A hot backup site provides better RTO/RPO than warm standby but costs more to implement
• Business continuance
Disaster Recovery: Active/Standby
[Figure: primary data center and secondary data center, with applications App A, App B, and App C and FC storage, connected by an IP/optical network.]
Active/Active DR Design
Multiple Tiers of Application
• Presentation Tier
• Application Tier
• Storage Tier
[Figure: data centers connected to the Internet through Service Provider A and Service Provider B.]
Active/Active Data Centers
• Active/Active Web Hosting
• Active/Active Application Processing
• Active/Standby Database Processing, or Active/Active for Different Applications
[Figure: two data centers, each with an internal network, reached from the Internet through Service Provider A and Service Provider B.]
Components of Disaster Recovery
Site Selection Mechanisms
Site selection mechanisms depend on the technology or mix of technologies adopted for request routing:
1. HTTP redirect
2. DNS-based
3. L3 routing with Route Health Injection (RHI)
The health of servers and/or applications needs to be taken into account.
Optionally, other metrics (such as load) can be measured and used for a better selection.
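Whichever mechanism carries the decision, the selection logic itself is the same: filter on health first, then optionally rank by a metric such as load. The following is a minimal sketch under stated assumptions; the `Site` type, its fields, and the site names are hypothetical and do not describe any particular GSLB product.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Site:
    name: str
    healthy: bool   # result of the most recent keepalive probe (assumed input)
    load: float     # optional metric, 0.0 (idle) to 1.0 (saturated)


def select_site(sites: List[Site]) -> Optional[Site]:
    """Pick the least-loaded healthy site, or None if every site is down."""
    candidates = [s for s in sites if s.healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s.load)


sites = [
    Site("dc1", healthy=True, load=0.8),
    Site("dc2", healthy=True, load=0.3),
    Site("dc3", healthy=False, load=0.1),  # idle but failing keepalives
]
print(select_site(sites).name)  # dc2
```

Health is applied as a hard filter before load is considered, so an idle but failed site is never chosen.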
HTTP Redirection: Traffic Flow
[Figure: a client requests http://www.cisco.com/ and is redirected to http://www2.cisco.com/. Sites are labeled http://www.cisco.com/, http://www1.cisco.com/, and http://www2.cisco.com/, with keepalives between sites.]
1. GET / HTTP/1.1
   Host: www.cisco.com
2. HTTP/1.1 302 Moved
   Location: www2.cisco.com
3. GET / HTTP/1.1
   Host: www2.cisco.com
   HTTP/1.1 200 OK
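The redirect exchange can be sketched as a pure function that returns the response head for a request to a given host. This is an illustration, not the behavior of any specific product: the `handle_get` name, the `alive` keepalive map, and the 503 fallback when no site is up are all assumptions added here.

```python
from typing import Dict

# Site-specific host names taken from the slide.
SITE_HOSTS = ["www1.cisco.com", "www2.cisco.com"]


def handle_get(host: str, alive: Dict[str, bool]) -> str:
    """Return the raw HTTP response head for 'GET / HTTP/1.1' sent to `host`."""
    if host in SITE_HOSTS:
        # The client has already been steered to a specific site: serve content.
        return "HTTP/1.1 200 OK\r\n\r\n"
    # Generic name: redirect to the first site whose keepalive is passing.
    for candidate in SITE_HOSTS:
        if alive.get(candidate, False):
            return f"HTTP/1.1 302 Found\r\nLocation: http://{candidate}/\r\n\r\n"
    # Assumption: no site is reachable, so report the service as unavailable.
    return "HTTP/1.1 503 Service Unavailable\r\n\r\n"


keepalives = {"www1.cisco.com": False, "www2.cisco.com": True}
print(handle_get("www.cisco.com", keepalives).splitlines()[1])  # Location: http://www2.cisco.com/
```

Because the client follows the Location header to a site-specific name, later requests in the session stay on that site, which is why this mode provides site persistence.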
DNS-Based Site Selection: Traffic Flow
[Figure: numbered steps 1 to 10 showing a client resolving and then reaching http://www.cisco.com/. The client queries its DNS proxy (UDP port 53), which works through the root name server, the authoritative name server for .com, the authoritative name server for cisco.com, and the authoritative name server for www.cisco.com. Keepalives run to Data Center 1 and Data Center 2. The client then connects to the selected data center over TCP port 80.]
Route Health Injection: Implementation
[Figure: Client A and Client B reach VIP x.y.w.z through Routers 10, 11, 12, and 13. Location B is the preferred location for the VIP and is advertised at low cost; Location A is the backup location and is advertised at very high cost.]
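The effect of Route Health Injection can be sketched as follows: each location advertises a route for the VIP only while its servers are healthy, and the routers follow the lowest-cost advertisement. The cost values and function names below are hypothetical; in a real network a routing protocol performs this selection.

```python
from typing import Dict, Optional


def advertised_routes(health: Dict[str, bool], cost: Dict[str, int]) -> Dict[str, int]:
    """A location injects its route for the VIP only while its servers are healthy."""
    return {loc: cost[loc] for loc, ok in health.items() if ok}


def best_location(routes: Dict[str, int]) -> Optional[str]:
    """Routers forward VIP traffic toward the lowest-cost advertised route."""
    return min(routes, key=routes.get) if routes else None


# Hypothetical costs: Location B is preferred (low), Location A is backup (very high).
cost = {"Location A": 1000, "Location B": 10}

normal = advertised_routes({"Location A": True, "Location B": True}, cost)
print(best_location(normal))      # Location B

# Servers at B fail: B withdraws its route and traffic converges on A.
b_failed = advertised_routes({"Location A": True, "Location B": False}, cost)
print(best_location(b_failed))    # Location A
```

Only one location attracts traffic for the VIP at a time, which matches the active/standby behavior attributed to RHI in the summary table.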
Site Selection Summary

Mode             Redundancy       Convergence   App Health Visibility   Site Persistence
HTTP Re-Direct   Active/Active    No            No                      Yes
DNS              Active/Active    DNS Cache     Yes                     No
RHI              Active/Standby   Within Secs   Yes                     No
Cluster Overview
• Load balancing cluster: multiple copies of the same application run against the same data set, usually read only
• High availability cluster: multiple copies of an application that require access to a common data repository, usually read and write
• Clustering provides benefits for availability, reliability, scalability, and manageability
[Figure: clusters of web servers, application servers, and database servers.]
High Availability Cluster Design
• Public network: client/application requests
• Private network: interconnection between nodes
• Storage disk: shared storage array, NAS, or SAN
[Figure: node software stack consisting of the application (APP), cluster software, cluster enabler, and OS.]
HA Cluster Application View
Active/standby
• Standby takes over when active fails
• Two-node or multi-node
Active/active
• Database requests are load balanced across all nodes
• Lock mechanism ensures data integrity
Shared everything
• Each node mounts all storage resources
• Provides a single layout reference system for all nodes
Shared nothing
• Each node mounts only its “semi-private” storage
• Data stored on the peer system’s storage is accessed via peer-to-peer communication
[Figure: Node1 and Node2.]
Geo-Cluster Considerations
Geo-cluster: a cluster that spans multiple data centers
Challenges:
• Split brain
• L2 heartbeats
• Storage
[Figure: Node1 in the local data center and Node2 in the remote data center, connected over a WAN, with synchronous or asynchronous disk replication between sites (2 x RTT).]
HA Cluster Challenges: Split-Brain
• Split-brain: active nodes concurrently accessing the same disk, which leads to data corruption
• Resolution: use a quorum, a tie-breaker for gaining access to the disk
[Figure: Node1 and Node2 both accessing the same disk, resulting in data corruption.]
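One common way to realize the quorum tie-breaker is majority voting with an extra vote held by a quorum device. The sketch below assumes that scheme; the presentation does not specify the voting rules, and the function name and vote weights are hypothetical.

```python
def may_access_disk(reachable_nodes: int, holds_quorum_device: bool, cluster_size: int) -> bool:
    """Decide whether this partition may keep writing to the shared disk.

    Assumed scheme: every node has one vote and the quorum device has one vote.
    A partition needs a strict majority of all votes, so two isolated halves
    can never both win.
    """
    total_votes = cluster_size + 1
    votes = reachable_nodes + (1 if holds_quorum_device else 0)
    return votes > total_votes / 2


# Two-node cluster whose heartbeat link has failed: each node sees only itself.
print(may_access_disk(reachable_nodes=1, holds_quorum_device=True, cluster_size=2))   # True
print(may_access_disk(reachable_nodes=1, holds_quorum_device=False, cluster_size=2))  # False
```

The node that loses the race for the quorum device must stop accessing the disk, which prevents the concurrent writes that cause corruption.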
Layer 2 Heartbeats
• Extended L2 network: L2 adjacency is required for the nodes’ heartbeats. Extending a VLAN across sites is hazardous.
• Resolution: L3 capability for the cluster heartbeat. EoMPLS to carry L2 heartbeats across DR sites.
[Figure: Node1 in the local data center and Node2 in the remote data center, connected over a WAN by a public Layer 2 network and a private Layer 2 network, with synchronous or asynchronous disk replication.]
Storage Disk Zoning
• Storage zoning: taking over the storage disk array when the active node fails
• Resolution: the cluster software communicates with the cluster enabler and instructs the disk array to perform a failover when a failure is detected
[Figure: Node1 (active) and Node2 (standby) attached to an extended SAN. Disk array sym1320 is marked RW; disk array sym1291 is marked WD.]
Storage for Applications
Presentation tier
• Unrelated small data files commonly stored on internal disks
• Manual distribution
Application processing tier
• Transitional, unrelated data
• Small files residing on file systems
• May use RAID to spread data over multiple disks
Storage tier
• Large, permanent data files or raw data
• Large batch updates, most likely real time
• Log and data on separate volumes
Replication: Modes of Operation
Synchronous
• All data is written to the local and remote arrays before the I/O is complete and acknowledged to the host
• Speed of light = 3 x 10^8 m/s (vacuum) ≈ 3.3 µs/km
• Speed through fiber ≈ ⅔ c ≈ 5 µs/km
• 2 RTT per write I/O = 20 µs/km
Asynchronous
• The write is acknowledged and the I/O is complete after the write to the local array; changes (writes) are replicated to the remote array asynchronously
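The 20 µs/km figure follows directly from the numbers on the slide: about 5 µs/km one way in fiber, 10 µs/km per round trip, and two round trips per write. The short sketch below reproduces that arithmetic; the function names and the 100 km example distance are illustrative, and the result covers propagation delay only (no switching, protocol, or array latency).

```python
C_VACUUM_M_PER_S = 3e8        # speed of light in vacuum, from the slide
FIBER_FRACTION = 2 / 3        # light travels at roughly 2/3 c in fiber
ROUND_TRIPS_PER_WRITE = 2     # two round trips per synchronous write I/O


def one_way_delay_us_per_km() -> float:
    """Propagation delay through fiber in microseconds per kilometer."""
    speed_m_per_s = C_VACUUM_M_PER_S * FIBER_FRACTION
    return 1000 / speed_m_per_s * 1e6


def sync_write_penalty_us(distance_km: float) -> float:
    """Added latency per synchronous write caused by distance alone."""
    round_trip_us = 2 * one_way_delay_us_per_km() * distance_km
    return ROUND_TRIPS_PER_WRITE * round_trip_us


print(round(one_way_delay_us_per_km(), 2))    # 5.0
print(round(sync_write_penalty_us(100), 1))   # 2000.0 (about 2 ms at 100 km)
```

This per-write penalty is why synchronous replication is limited by distance, while asynchronous replication is not.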
Synchronous vs. Asynchronous Trade-Off
Enterprises must evaluate the trade-offs.
Synchronous
• Impact to application performance
• Distance limited (are both sites within the same threat radius?)
• No data loss
Asynchronous
• No application performance impact
• Unlimited distance (second site outside the threat radius)
• Exposure to possible data loss
Considerations
• Maximum tolerable distance ascertained by assessing each application
• Cost of data loss
Data Replication with DB Example
Control files
• Identify the other files making up the database and record the content and state of the database
• Contents: DB name, creation date, backup performed, redo log time period, datafile state
Datafiles
• Only updated periodically
• Contents: table spaces, indexes, data dictionary
Redo log files
• Record database changes resulting from transactions
• Used to play back changes that may not have been written to the datafiles when a failure occurred
• Typically archived as they fill to local and DR site destinations
[Figure: control files identify the datafiles and redo log files; redo log files record changes to the datafiles.]
Data Replication with DB Example (Cont.)
[Figure: timeline from time t0 to time t1. A hot backup of datafiles and control files is taken at time t0; archived redo logs and online redo logs cover the interval up to time t1.]
Failure or disaster occurs at time t1:
• Media failure (e.g., disk)
• Human error (datafile deletion)
• Database corruption
Database restored to its state at the time of failure (time t1) by:
1. Restoring control files and datafiles from the last hot backup (time t0)
2. Sequentially replaying changes from subsequent redo logs (archived and online), that is, the changes made between time t0 and t1
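The two recovery steps can be sketched with a toy key-value "database". This is a simplified model, not how any particular database engine implements recovery: the record layout, the integer timestamps, and the `recover` function are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical redo record: (timestamp, key, new_value).
RedoRecord = Tuple[int, str, str]


def recover(backup: Dict[str, str], backup_time: int,
            redo_log: List[RedoRecord], failure_time: int) -> Dict[str, str]:
    """Rebuild the state at failure_time from a backup plus redo records."""
    # Step 1: restore the datafiles from the last hot backup (time t0).
    state = dict(backup)
    # Step 2: replay, in time order, every change logged after t0 and up to t1.
    for ts, key, value in sorted(redo_log, key=lambda record: record[0]):
        if backup_time < ts <= failure_time:
            state[key] = value
    return state


backup_t0 = {"balance": "100"}
log = [(1, "balance", "120"), (2, "owner", "kim"), (5, "balance", "90")]

print(recover(backup_t0, backup_time=0, redo_log=log, failure_time=3))
# {'balance': '120', 'owner': 'kim'}
```

Any redo record that never reached the recovery site is simply missing from step 2, which is why the redo logs are the component most worth replicating synchronously.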
Data Replication with DB Example (Cont.)
[Figure: primary site and secondary site joined by a SAN extension transport. Primary site: database, cyclic redo logs holding a copy of every committed transaction, archive logs, and a point-in-time database copy at time t0 taken when the DB is quiescent. Secondary site: cyclic redo logs (synchronously replicated for zero loss), archive logs and the database copy at time t0 (replicated/copied), and earlier DB backups.]
A mixture of sync and async replication technologies is commonly used:
• Usually only redo logs are synchronously replicated to the remote site
• Archive logs are created from the redo log and copied when the redo log switches
• Point in Time (PiT) copies of datafiles and control files are copied periodically (e.g., nightly)
Data Center Interconnection Options
[Figure: two data centers, each connected to the Internet and built from the same components: stateful firewalls, intrusion detection, content caching, server load balancing, a high-density multilayer LAN switch, front-end application servers, back-end application servers, a high-density multilayer SAN director, and enterprise-class storage arrays. Interconnection options between the sites: SONET/SDH, DWDM/CWDM, and IP/Metro Ethernet.]
Data Center Transport Options
[Figure: transport options plotted against increasing distance (data center, campus, metro, regional, national).]
Optical
• Dark fiber: sync; limited by optics (power budget)
• CWDM: sync (2 Gbps); limited by optics (power budget)
• DWDM: sync (2 Gbps lambda); limited by BB_Credits
• SONET/SDH: sync (1 Gbps+ subrate), async
IP
• MDS9000 FCIP: sync (Metro Eth), async (1 Gbps+)
DATA CENTER ARCHITECTURE TRENDS
Cisco Data Center Vision
[Figure: the Intelligent Information Network linking the data network (LAN, WAN, MAN), the storage network (SAN), and the server fabric network (HPC, cluster, grid) beneath enterprise applications, with compute, network, and storage resources moving through three stages.]
• CONSOLIDATION: centralization and standardization to lower costs and improve efficiency and uptime
• VIRTUALIZATION: management of resources independent of the underlying physical infrastructure to increase utilization, efficiency, and flexibility
• AUTOMATION: dynamic provisioning and autonomic Information Lifecycle Management (ILM) to enable business agility (business policies, on-demand, service oriented)
Summary
What Have We Covered So Far?
DR and its business objectives
• Define budget and technical solution
• Management buy-in
• DR is a process
Components of a data center
• Multi-tier architecture
• Front-end, application, backend database
Techniques in data center disaster recovery
• HTTP redirection/GSS/RHI
• Clustering
• SAN extension
Trends in data center technology
Today’s Data Centers Require an Architectural Approach to…
Protect with business resilience
• Tighten security
• Improve business continuance
Optimize with consolidation
• Improve operational efficiency and resource utilization
• Lower complexity and cost of ownership
Grow toward a services-oriented infrastructure
• Align virtualized resources with business demands
• Automate infrastructure to respond dynamically
The Big Picture: The Cisco Data Center
The Emerging Data Center Architecture
[Figure: the emerging data center architecture built on three switching families. Enterprise SAN switching (MDS 9000 Family), server farm switching (Catalyst 6500 Family), and server fabric switching (Topspin family) connect mainframe connectivity, enterprise tape storage, enterprise disk storage, enterprise NAS storage, UNIX/Windows servers, blade servers, and the enterprise grid.
Service labels on the slide: virtual fabrics (VSANs); embedded intelligent storage services; embedded intelligent network services; embedded intelligent virtualization services; storage virtualization; data replication services; fabric routing services; multiprotocol gateway services; server balancing; VPN termination; SSL termination; firewall services; intrusion detection; server virtualization; VFrame; virtual I/O; grid/utility computing; low latency RDMA services; clustering; virtual private server fabrics #1, #2, and #3.]
What’s Next?
A security strategy to protect the data center
• Understand the vulnerabilities and apply the relevant mitigations
Leverage Cisco’s technology to optimize server resources
• Reducing TCO for DRs
• Virtualization to maximize the resources invested
Grow DC infrastructure, enabling business agility
• Automating the provisioning of computing resources
• Speed of deploying new services
Q and A