Network Improvement Method Road to 5 9’s • Establish a standard measurement method • Define business goals as related to metrics • Categorize failures, root causes, and improvements • Take action for root cause resolution and improvement implementation
What Is “High Availability”? • The ability to define, achieve, and sustain “target availability objectives” across services and/or technologies supported in the network that align with the objectives of the business (e.g., 99.9%, 99.99%, 99.999% availability)
Why Improve Network Availability? Recent Studies by Sage Research Determined That US-Based Service Providers Encountered: • 44% of downtime is unscheduled • 18% of customers experience over 100 hours of unscheduled downtime, i.e., an availability of 98.5% • Average cost of network downtime per year: $21.6 million, or $2,169 per minute!
Downtime: Costs Too Much!!! SOURCE: Sage Research, IP Service Provider Downtime Study: Analysis of Downtime Causes, Costs and Containment Strategies, August 17, 2001, Prepared for Cisco SPLOB NMS-2T20 9594_04_2004_c2
Netrac Base Package: Security and Administration
Graphical Reports
Views
Netrac Integrated GUI: Customer Service Orientation (CNM)
Netrac
Alarm Screen
Applications Service Management Service eView Service Monitoring
Fault
Performance
NeTkT
PMM
Trouble Ticketing
Performance Analysis and Trends
Correlator+ (Optional): Advanced Correlation and Root Cause Analysis; Fault Mgmt.; Alarm Surveillance
NetCAP Planning
NetCAP Configuration, NetCAP Provisioning (Engineering, Inventory, Work Order): Service Views, Impact, Topology, Assign/Service Def., Design, Sync, Activate
Asset Mgmt.
Change Mgmt.
CDR Analysis CallExpert CDR Analysis and Reports
Mediations Device Expert
Netrac APIs to Other NMSs: BellSouth
Circuit Diversity • Problem: if links follow a common path through the service provider network, you are back to a single point of failure • Solution: employ as much circuit diversity as possible Links Terminate at Different Devices
Links Use Different Paths in SP Network
(Physical Diversity)
(Geographic Diversity)
Enterprise
Service Provider Diversity?
Configuration/Change What Are the Time Bombs? • No technical ownership • Large failure domains • Layer (II/III) design • Loose or non risk-aware change management • High levels of network inconsistency • Lack of network standards (SW, HW, config) • No capacity planning or performance management
Configuration/Change MTTR―Mean Time to Repair
• No identified tiered support mechanism with individuals who know and understand the network (lack of expertise)
• Poor documentation (topology and config)
• Large failure domain difficult to understand and determine root cause
• Networks with control-plane resource issues require major topology, config and upgrade changes
Resource Utilization What Happens when Networks Fail? • Resource constraints CPU/memory Inability to process messages Inability to process routing updates Routing or bridging loops
Calculated Availability • Calculated availability based on network design, component MTBF and MTTR • MTBF = Mean Time Between Failure Calculated by measuring the average time between failures on a device
• MTTR = Mean Time To Repair The time between when the device/network broke and when it was brought back into service
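As a quick sketch of the arithmetic behind these definitions (the function names and example MTBF/MTTR figures are illustrative, not from any Cisco tool):

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability implied by MTBF and MTTR."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def downtime_minutes_per_year(avail: float) -> float:
    """Expected annual downtime implied by an availability figure."""
    return (1 - avail) * 365 * 24 * 60

# A device with a 50,000-hour MTBF and a 4-hour MTTR:
a = availability(50_000, 4)
print(f"availability = {a:.6f}")   # ~0.999920
print(f"downtime/yr  = {downtime_minutes_per_year(a):.0f} min")
# Five nines, for comparison, allows only about 5.3 minutes per year:
print(f"{downtime_minutes_per_year(0.99999):.1f} min/yr")
```

This makes the later key point concrete: availability rises either by stretching MTBF or by shrinking MTTR.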
Cisco Internal Tools: Calculated Availability Contact Your Sales Team for Quality Data • MTBF query tool
MTBF for components can be requested from Cisco User enters part number/product family and predicted MTBF is provided A system is a chassis populated with Field Replaceable Units (FRU) and software
• NARC: Network Availability and Reliability
Calculation Excel spreadsheet; calculates availability/downtime for a system/network given MTBF and MTTR
Calculated Availability Key Points • Carried out at design time • Availability can be increased by decreasing MTTR, increasing MTBF, or both • If the service availability target is 99.999%, calculated availability must be better than 99.999% Customer experience shows field MTBF can typically be 2x the listed MTBF; this may not necessarily be a good thing
• Series components reduce availability, parallel (redundant) components increase availability • Complex networks require modelling tools to calculate engineered availability • Core networks are designed for high availability; i.e., need to be 99.999% available with any single point of failure
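The series/parallel rule above can be sketched numerically (the per-component availability of 0.999 is a hypothetical example):

```python
from math import prod

def series(avails):
    """Series: all components must be up, so availabilities multiply."""
    return prod(avails)

def parallel(avails):
    """Parallel (redundant): up unless every component is down."""
    return 1 - prod(1 - a for a in avails)

a = 0.999  # three nines per component
print(f"two in series:   {series([a, a]):.6f}")    # ~0.998001, worse
print(f"two in parallel: {parallel([a, a]):.6f}")  # ~0.999999, better
```

Two components in series lose nearly a nine; the same two in parallel gain three.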
Availability Measurement Methodologies • Ping (network availability, device availability) • Service assurance agent • Trouble ticket reporting DPM: Defects Per Million A defect may be one user/customer down for one minute or one hour IUM: Impacted User Minutes Number of users affected multiplied by the minutes they were impacted
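One common formulation of the DPM metric above, sketched with hypothetical numbers (the scaling against total user minutes is an assumption; exact definitions vary by organization):

```python
def dpm(impacted_user_minutes: float, users: int, period_minutes: float) -> float:
    """Defects per million: impacted user minutes scaled against
    the total user minutes available in the period."""
    total_user_minutes = users * period_minutes
    return impacted_user_minutes / total_user_minutes * 1_000_000

# 500 users down for 30 minutes, out of 10,000 users over a 30-day month:
month = 30 * 24 * 60
ium = 500 * 30
print(f"IUM = {ium}, DPM = {dpm(ium, 10_000, month):.1f}")
```

Note that availability and DPM are two views of the same ratio: availability = 1 - DPM/1,000,000.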
Creating an HA Culture People • Executive messaging: communicate business plans for high availability and the importance of improvement • Reward positive behavior • Provide world-class training to staff • Create a cross-functional technical team and availability champion
Process
• Identify and resolve process deficiencies
• Start an availability improvement quality process
• Root-cause analysis
• Collect and report availability metrics
Tools • Availability measurement • Processes for consistency (automate where possible) • Metrics for identifying areas of service improvement
Change Management • Change management refers to the consistent process of successfully managing change within an organization • Process includes: Change controller Change documentation requirements Risk level assignment Validation and approval procedures Change meetings Emergency change procedures Post mortem review and root-cause Document change output requirements Change management system and metrics
New Solution Deployment • The biggest challenge is minimizing the impact on the existing networking environment • Success requires structured processes that include resources from planning, design, network management, and implementation
Configuration Management • Collection of processes and tools to: Promote network consistency Provide up to date network documentation Asset management
• Benefits Lower support costs Lower network costs due to device, circuit, and user tracking tools and processes that identify unused network components Improved network availability due to improved time to resolve problems (MTTR)
Software Version Control The Process of Software Version Control Is Critical to Software Consistency and Overall Software Reliability! • Publish and communicate certified device software version standards for identified software tracks • Quality gates during implementation process • Scheduled periodic audits to ensure network is in sync with the certified standard • Utilize tools to identify, track and sort software versions
Maintain Documentation • Current device, link, and end-user inventory • Configuration version control system • Software version control • TACACS configuration log • Network topology
Baselining • How does the network behave normally? • CPU, memory, backplane, buffers, link utilization • Collect data (show commands, performance data) • Determine non-normal thresholds
Develop a Capacity Planning Strategy, Including Common Techniques, Tools, MIB Variables, and Thresholds Used for Capacity Planning
Baselining and Exception Management • Alert mechanisms for performance exceptions • Create trouble ticket to track proactive issues • Investigate and make recommendations accordingly
What-If Analysis What-If Analysis Centers around Network Change and How the Change Affects the Environment • Identify higher risk changes • Determine potential resource issues (CPU, memory, buffer, backplane, link util, device resources) • Ask questions • If possible, take it to the lab • If possible, slow start implementation and measure key resource areas
Fault Management • Fault management Process of identifying faults through the use of network management toolsets NMS architecture design and resiliency Syslog collection, monitoring and analysis SNMP trap collection and notification Exception reporting and analysis
Fault Management Architecture • NMS stations: Centralized vs. distributed architecture Located close to the network core Adequate bandwidth and separation from other services Redundant hardware/network connectivity
• NMS UPS (Uninterruptible Power Supply) All NMS systems should be protected against power failures
SNMP Trap Collection and Notification The Collection and Notification of SNMP Traps Is Essential to Rapid Identification and Resolution • SNMP trap collection SNMP traps include generic traps and platform or technology specific traps Traps must be properly and consistently configured on all network devices as well as the network management systems
• SNMP trap notification NMS systems should notify and alert when a trap has been received
Syslog • Collection Establish a centralized system to log all device messages Implement consistent Syslog server and logging configurations on all network devices
• Monitoring A tool or script that parses Syslog files for pre-determined messages and sends real time alerts or notifications to an event management system
• Analysis Periodic review and analysis of Syslog data should be performed daily
Problem Management • Problem tracking systems Allows the organization to document, track and report on infrastructure technology problems Reactive/proactive issues
• Priority and escalation procedures Help to ensure that business-impacting issues are assigned a priority and quickly escalated to support groups that can resolve the issue
• Tiered operations structure The network support structure should allow ample resources for problem resolution, proactive analysis, specialty areas, and escalation
Tier 1:
• Client liaison
• Recording problems
• Closing problems
• Initial problem determination
• Problem classification
• Problem escalation to tier 2
• Problem resolution for non-complex problems, end user issues
• Management reports for case and availability reporting

Tier 2:
• Help desk mentoring
• Help desk support for complex problems
• Recording network incidents
• Resolving network incidents
• Implementing configuration and network system engineering changes
• Feedback to the design team on operational issues
• Problem resolution and network solution documentation
• Network site implementation documentation
• Escalation to tier 3

Tier 3:
• Network design
• Network instrumentation for management
• Complex problem resolution
• Complex problem identification and isolation
• Network monitoring
• Network tuning
• Network troubleshooting
• Network capacity
• Network standards
• Network documentation integration
• Vendor problem resolution
• Vendor problem reporting
• Vendor SLA reporting
Problem Management Problem Priority Definitions
• Urgent (P1): Severe business impact; Site Service Outages (loss of service or outage at a location)
• High (P2): High business impact; Site Service Impairments (degradation, possible workaround exists; service impairment)
• Medium (P3): Minimal business impact; Client Service Problems (some specific network functionality is lost; loss of redundancy)
• Low (P4): No business impact; Client Admin and Change Requests
Preparation, Prevention and Response Security Basics for High Availability Networks • Preparation Create usage policy statements Conduct a risk analysis Establish a security team structure
• Prevention Approving security changes Monitoring security of your network
Disaster Recovery • A disaster recovery plan covers The hardware and software required to run critical business applications The associated processes to transition smoothly in the event of a disaster
• Assess your mission-critical business processes and associated applications before creating the full disaster recovery plan • Critical steps for best-practice disaster recovery: Disaster recovery planning Resiliency and backup services Vendor support services
Resiliency and Backup Services • Resiliency and backup services are a key part of disaster recovery • Cisco defines network resiliency as the ability to recover from any network failure or issue whether it is related to a disaster, link, hardware, design, or network services • A HA network design is often the foundation for disaster recovery and might handle some minor or local disasters • Key tasks for resiliency planning and backup services include the following: Assess the resiliency of your network, identify gaps and risks Review your current backup services Implement network resiliency and backup services
Vendor Support Services • Having support services from your major vendors in place adds a strong value to disaster recovery planning • For example, specific managed hot standby sites or on-site services with rapid response times can significantly ease disaster recovery • Key questions regarding vendor support include: Are support contracts in place? Has the disaster recovery plan been reviewed by the vendors, and are the vendors included in the escalation processes? Does the vendor have sufficient resources to support the disaster recovery?
• Most vendors have experience handling disaster situations and can offer additional support
Global Server Load Balancing, Stateful NAT, Stateful IPSec, DNS, DHCP, Cisco Server Load Balancing, IP QoS HSRP, VRRP, GLBP, MPLS-TE, IP Event Dampening, Graceful Restart (GR) in BGP, OSPF NSF, ISIS NSF, IP QoS
Networking Transport Evolution Enterprise Scenario • Traffic originating in the Enterprise network is transported using: Ethernet/Fast Ethernet/Gigabit Ethernet for a majority of local area networks ATM and Frame Relay for WAN connectivity Metro Ethernet/Metro Optical DPT/RPR, ATM over SONET, Packet over SONET, etc. MPLS/IPSec
Networking Transport Evolution Service Provider Scenario • Traffic originating on service provider network backbone include Circuit-based like TDM voice and fax Packet-based like IP Cell-based like ATM or Frame Relay
• Majority of traffic is transported over SONET/SDH • Explosive growth of data compared to voice: POS • Scalable technologies like DPT/RPR use SONET/SDH framing and infrastructure: Metro and access networks • DWDM provides scalable solutions to prevent fiber exhaustion: Metro and long haul networks
EtherChannel Protocol • A logical aggregation of similar links (up to 8): 10/100/1000/10GE ports • Operates between switches, routers, and certain vendors’ NICs
Configuring EtherChannel On a Catalyst® 6000:
Console> (enable) set port channel 2/2-8 mode desirable
Ports 2/2-8 left admin_group 1.
Ports 2/2-8 joined admin_group 2.
Console> (enable)
On a Cisco 7500:
Router(config)# interface port-channel 1
Router(config-if)# ip address 10.0.0.1 255.255.255.0
Router(config-if)# ip route-cache distributed
Router(config-if)# interface fasteth 0/0
Router(config-if)# no ip address
Router(config-if)# channel-group 1
Router(config-if)# interface fasteth 0/1
Router(config-if)# no ip address
Router(config-if)# channel-group 1
FastEthernet 0/1 added as member-2 to fechannel1
Spanning Tree Extensions • Extensions decrease STP convergence time • PortFast for access ports (Link4) bypasses listening-learning phases • UplinkFast for direct root link failure (Link2): about 3 to 5 seconds convergence
(Diagram: Root Bridge with Links 1, 2, 3; one redundant port marked X Blocked)
• BackboneFast for indirect link failure (Link1): cuts convergence time by Max_Age seconds
IEEE 802.1w: Rapid Spanning Tree • Takes advantage of today’s topologies (full-duplex point-to-point links) • Remarkably similar to UplinkFast/BackboneFast • No more network-wide timers when all switches run 802.1w • Handshake mechanism between bridges
(Diagram: Proposal/Agreement handshake cascading down from the Root, steps 1 through 4)
• Proposal-Agreement messaging (“I want to become designated: do you agree?”) • Can achieve 1+ second of convergence
Resilient Packet Ring (RPR) Standard • RPR is a layer 2 transport architecture Based on dual counter-rotating ring architecture Uses the best of Ethernet and SONET/SDH Uses SRP-fairness algorithm
• Standards-based on IEEE 802.17 RPR Protocol Draft • IEEE 802.17 is based on Cisco’s SRP (RFC 2892) • Supported on high-end devices • DPT/RPR name used interchangeably • Cisco is committed to SRP and IEEE standards Related Session: OPT 2043 802.17 and Spatial Reuse Protocol (SRP) Protocols
• A pair of dark fiber strands • A pair of DWDM derived wavelengths • A SDH add-drop STM-n circuit • Any combination of the above segments
SF: Signal Fail based on PHY-sensed link failure or keepalive failure SD: Signal Degrade based on PHY-sensed link degradation condition
• Manual FS: Forced Switch initiated by the user MS: Manual Switch initiated by the user
• Detection delay L1 Holdoff: Used to delay the protection response to a PHY-sensed failure (0 to 200 ms) Keepalive Timer: Used to determine the duration of keepalive loss before a protection condition is raised (2 to 200 ms); keepalive frames are also fairness updates and are transmitted approx. every 100 ms
802.17 Protocol: Protection Steering • This protection mechanism requires all stations to exchange protection details, flush the existing queues (for strict traffic) and recalculate the new traffic path prior to completing the protection event (Diagram: Flush + Recalculate at S1)
physical-layer restoration • Restoration of failure within 50 ms • Physical state is communicated to L3 • Available on SONET/SDH line cards on routers • K1/K2 link-layer control information of line overhead (LOH) frame • Two types of APS Single router APS Multi-router APS
Multi-Router APS
• The major benefit of multi-router APS is protection against fiber faults, linecard faults and even complete router failures
• Usually the working port is configured on one router and the protect port is configured on a different router
• Supported on Cisco high-end routing platforms
• Multi-Router APS is a hybrid which depends partially on APS switching and partially on layer 3 routing to direct the flow of packets
• The two routers communicate control information using the protect group protocol
Anatomy of an MR-APS Switchover Due to LOS Detected by the Working Router
(Diagram: CE Router (West) connects through ADM#1/ADM#2 on a SONET ring to PE Router#1 (working) and Router#2 (protect); a PGP channel links the PE routers; K1/K2 bytes carry the protect switching signaling; PE Routers (East) and CE Routers (East) on the far side)
1. Initially packets are routed over the working lines, which are active
2. PE Router#1 detects LOS on the received working line and starts to bring the interface down
3. Working router sends a PGP “State Change” message to the protect router
4. Protect router signals a switch-to-protect request to the ADM using K1/K2 bytes
5. ADM selects the protect line and sends a K1/K2 response back to the protect router
6. Router selects the protect line and sends a PGP “Working Disable” message to the working router
7. Working router deselects the working line
8. After the routers reconverge, packets get routed over the newly active protect lines
Protect Group Protocol
• Protect Group Protocol: Proprietary protocol sent as UDP packets (port 172) between routers with MR-APS
• Messages are retransmitted if no reply or Ack
• PGP Hellos are sent at regular intervals
• Authenticated by a configurable authentication string sent with messages
• Supports protocol versioning
• Switching may occur due to LC/router crash, signal degradation, LOS (SF), manual switch
GLBP Entities (Definitions) • GLBP Group A GLBP group consists of one or more GLBP gateways configured with the same GLBP group number
• GLBP Gateway A gateway or router running the Gateway Load Balancing Protocol; it may participate in one or more GLBP groups
• Virtual IP Address (vIP) An IPv4 address or IPv6 prefix; this is the IP address used as the hosts’ default gateway
• Virtual MAC Address A MAC address that a host may receive when it issues an address resolution request for the virtual IP address; there MAY be multiple virtual MAC addresses for each GLBP group
GLBP Entities (Definitions) (Cont.) • Active Virtual Gateway (AVG) One Virtual Gateway in a GLBP group is elected Active Virtual Gateway (AVG), and is responsible for operation of the protocol, i.e. allocating MAC addresses
• Active Virtual Forwarder (AVF) One Virtual Forwarder in a GLBP group is elected the Active Virtual Forwarder (AVF), and is responsible for forwarding packets sent to a particular virtual MAC address; there may be multiple Active Virtual Forwarders in a GLBP group
• Secondary Virtual Forwarder (SVF) A Virtual Forwarder that has learned the virtual MAC address from a Hello message
GLBP • GLBP routers function as one virtual router sharing one virtual IP address but using multiple virtual MAC addresses to forward traffic GLBP uses multicast to communicate between GLBP members, with the following detail: 224.0.0.102, UDP port 3222 Virtual MAC addresses are of the form 0007.b4yy.yyyy, where yy.yyyy is the lower 24 bits; these bits consist of 6 zero bits, 10 bits that correspond to the GLBP group number, and 8 bits that correspond to the virtual forwarder number Example: 0007.b400.0102, last 24 bits = 000000 0000000001 00000010 = GLBP group 1, forwarder 2
• Allows traffic from a single common subnet to go through multiple redundant gateways using a single virtual IP address
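The virtual MAC layout described above (6 zero bits, 10-bit group, 8-bit forwarder in the low 24 bits) can be reproduced with a short sketch; the helper name is illustrative:

```python
def glbp_vmac(group: int, forwarder: int) -> str:
    """Build the GLBP virtual MAC 0007.b4yy.yyyy from group and
    forwarder numbers: low 24 bits = 6 zero bits, 10-bit group,
    8-bit forwarder."""
    assert 0 <= group < 1024 and 0 <= forwarder < 256
    low24 = (group << 8) | forwarder      # top 6 bits remain zero
    raw = (0x0007B4 << 24) | low24        # prepend the fixed OUI prefix
    h = f"{raw:012x}"
    return f"{h[0:4]}.{h[4:8]}.{h[8:12]}"

print(glbp_vmac(1, 2))  # 0007.b400.0102, matching the slide's example
```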
GLBP Configuration Rules • Load balancing operates on a per-host basis All outbound traffic for a given host will use the same gateway
• Maximum of 4 MAC addresses per GLBP Group • Load balancing algorithm, 3 types: Round-robin Each virtual forwarder MAC takes turns Weighted Directed load determined by advertised weighting factor Host-dependent Ensures that each host is always given the same vMAC
• Default algorithm is round-robin
GLBP Implementation Issues • Four entries per GLBP group will be used in
the MAC address filter of Ethernet interfaces configured with GLBP groups This may limit the number of groups configurable on an interface that supports only a hardware MAC address filter
• Security includes MD5 authentication • Only use GLBP for layer 2 switched environments So duplicate IP addresses will not be noticed
• Be careful with other IP services NAT, IPSec, Mobile IP, HA
• IP route reachability • IP route metric threshold
• Enhanced Object Tracking is a stand-alone process that tracks objects • HSRP, GLBP and VRRP act as clients seeking services of Enhanced Object Tracking
Benefits of Enhanced Object Tracking • More options to ensure high availability • Can help verify the end-to-end path is good • Provides a scalable solution • Support for GLBP, HSRP, and VRRP
What Can I Track? • Interface “line-protocol” state Tracking process tracks the line-protocol state of the interface
• Interface “routing” state A tracked IP routing object is up when IP routing is enabled, the interface line-protocol is up and IP routing is active on the interface
• State of an IP route (reachability) A tracked IP route object is considered up and reachable when a routing table entry exists for the route and the route is reachable
• IP route metric threshold Tracks the scaled metric value of an IP route to determine if it is above or below a threshold NMS-2T20 9594_04_2004_c2
Enhanced Tracking Example Interface IP Routing Tracking Router A Configuration track 100 interface serial1/0 ip routing interface Ethernet0/0 ip address 10.1.0.21 255.255.0.0 standby 1 ip 10.1.0.1 standby 1 priority 105 standby 1 track 100 decrement 10
(Diagram: Router A (Active) and Router B (Standby) on subnet 10.1.0.0; each has uplink s1/0 and LAN interface e0/0)
• Interface IP routing will go down if: IP routing is disabled globally Interface IP address is unknown (or IP is disabled or failed to negotiate) Interface line-protocol is down
• Useful for interfaces where IP address is negotiated For example, on a serial interface that uses PPP, the line protocol could be up (LCP negotiated successfully), but IP could be down (IPCP negotiation failed)
switching performance • Multiprotocol support at layer 3 • Redundancy and load balancing Distribution switch redundancy HSRP/GLBP can be tuned to achieve 1+ second recovery!
Access • Availability, load balancing, QoS and provisioning are the important considerations at this layer • Aggregates wiring closets (access layer) and uplinks to core • Use layer 3 switching in the distribution layer • Protects core from high density peering and problems in access layer
Core Layer Detail • Redundant, fast-converging core • Choice to be made for Layer 2 or Layer 3: Layer 3 favored: Less RP neighbors Better multicast support Scales to more campus network modules
Failure Detection Tuning • Cisco IOS ® exposes some timers which can be
tuned to speed failure detection/convergence • Tweaking will not help a network that already has significant problems • Only tweak if: You have a stable, predictable network You have a lab which can provide an accurate simulation You have a backout plan
Layer 3 Failure Detection Tweaking
HSRP: Must Be the Same for All Routers in the Group!
Router(config)# int eth0
Router(config-if)# standby 10 timers 1 3

HSRP also supports subsecond timers with the msec keyword:
Router(config-if)# standby 10 timers msec 30 msec 90

OSPF: Must Be the Same for All Routers on the Subnet!
Router(config)# int eth0
Router(config-if)# ip ospf hello-interval 1
Router(config-if)# ip ospf dead-interval 3

EIGRP: Must Be the Same for All Routers on the Subnet!
Router(config)# int eth0
Router(config-if)# ip hello-interval eigrp <AS#> 1
Router(config-if)# ip hold-time eigrp <AS#> 3
Routing Protocol Optimization
• LSP throttling: provides the ability to generate LSPs quickly after failure, with exponential back-off to handle subsequent multiple failures on the router
• SPF throttling: ability to respond to changes very quickly, followed by exponential back-off to handle instabilities in the network
• Incremental SPF (ISPF): leaf nodes impacted by a failure will not cause a full SPF calculation
• Partial route computation
• Available in Cisco IOS: 12.0(24)S, 12.2(18)S,
(Diagram: F reports a new neighbor; the SPT need only be extended behind F; there is no need for router A to recompute the whole SPT; router A will compute SPF from node F)
IP Event Dampening • Prevents routing protocol churn caused by constant interface state changes • Supports all IP routing protocols Static routing, RIP, EIGRP, OSPF, IS-IS, BGP In addition, it supports HSRP and CLNS routing Applies on physical interfaces and can’t be applied on subinterfaces individually
IP Event Dampening Absorbs Link Flapping Effects on Routing Protocols (Diagram: physical vs. logical up/down state of the primary link; traffic moves to the R3 path)
IP Event Dampening: Algorithm
interface Serial 0
 dampening [half-life] [reuse suppress max-suppress] [restart <penalty>]

• Penalty: A value applied to the interface each time it flaps
• Half-life: Amount of time that must elapse without a flap to reduce the penalty by half
• Suppress: If the penalty exceeds this value, the interface is suppressed from the routing protocols’ perspective
• Reuse: If the penalty goes below this limit, the interface is reintroduced to the routing protocols
• Max-suppress: Maximum amount of time an interface can be suppressed
• Restart <penalty>: Determines the initial penalty (if any) applied to the interface when the system boots
iBGP Multi-path: BGP Behavior before iBGP Multi-path
(Diagram: R1 in AS 100 with iBGP paths via R2 and R3; AS 200 advertises 10.0.0.0/8 via R4 and R5)
• R1 has two paths for 10.0.0.0/8 • Both paths have identical <weight, AS-PATH, origin, localpref, MED>; ONLY next hops are different • R1 selects one path as best and sends all traffic for 10.0.0.0/8 towards one of the exit points • BGP installs only the best path, unlike other routing protocols!!
BGP Multi-path Review • Allows a router to install multiple paths in the RIB • Traffic will be sent to destinations on multiple paths for load balancing and efficient link utilization • Conditions for iBGP multipath selection All attributes (weight, local preference, AS-path (the entire attribute, not just the length), origin, MED, and IGP distance) are the same The next-hops of the paths are different
iBGP Multi-path R1#sh ipentry routefor 10.0.0.0 Routing 10.0.0.0/8 * 20.20.20.3, from 20.20.20.3, Route metric is 0, traffic AS Hops 1 20.20.20.2, from 20.20.20.2, Route metric is 0, traffic AS Hops 1
00:00:09 ago share count is 1 00:00:09 ago share count is 1
R1#show ip cef 10.0.0.0 10.0.0.0/8, version 237, per-destination sharing 0 packets, 0 bytes via 20.20.20.3 20.20.20.3, , 0 dependencies, recursive traffic share 1 next hop 20.20.20.3, FastEthernet0/0 via 20.20.20.3/32 valid adjacency via 20.20.20.2, 20.20.20.2, 0 dependencies, recursive traffic share 1 next hop 20.20.20.2, FastEthernet0/0 via 20.20.20.2/32 valid adjacency
• These two paths are installed in the RIB/FIB • Traffic is load-balanced across the two paths/exit points
• On PE1 [with eiBGP enabled] Traffic coming from Site 1 to Site 2 Incoming traffic is IP Hence only the FIB table will be looked at [not the LFIB] The FIB table will have the labels for iBGP path(s) only, since PE1 doesn’t have any eBGP paths [to Site 2] The iBGP multi-path portion of eiBGP multi-path comes into the picture here Hence PE1 will load-share on iBGP paths only, on the link(s) between PE1 and PE2 [iBGP] and PE1 and PE3 [iBGP]
(Diagram: Site 1 (CE1, CE3) attached to PE1; Site 2 (CE2) attached to PE2 and PE3)
Traffic coming from Site 1 to Site 2 Incoming traffic has at least one label [VPN] [MPLS traffic] Hence only the LFIB table will be looked at [not the FIB] The LFIB will have only eBGP path(s) installed Hence PE2 and PE3 will send traffic on eBGP path(s) only If iBGP paths are also installed in the LFIB, we may get into forwarding loops [e.g., between PE2 and PE3] [Because we don’t want to forward a packet received from the provider network back into it]
• On PE3 [with eiBGP enabled] Traffic coming from Site 3 to Site 2 Incoming traffic is IP Hence only the FIB table will be looked at [not the LFIB] The FIB table will have the label for iBGP path(s) and IP forwarding information for eBGP path(s) [layer 2 header, output interface] Hence PE3 will send traffic on both eBGP and iBGP path(s), on the links between PE3 and PE2 [iBGP] and PE3 and CE2 [eBGP]
(Diagram: Site 1 (CE1, CE3) attached to PE1; Site 2 (CE2) attached to PE2 and PE3)
Traffic coming from Site 3 to Site 2 Incoming traffic has at least one label [VPN] [MPLS traffic] Hence only the LFIB table will be looked at [not the FIB] The LFIB will have only eBGP path(s) installed Hence PE2 will send traffic on eBGP path(s) only If iBGP paths are also installed in the LFIB, we may get into routing loops [e.g., between PE2 and PE3] [Because we don’t want to forward a packet received from the provider network back into it]
to define explicit paths for traffic with some constraint (e.g., a path from A to B with 50 Mbps of bandwidth)
• FRR (Fast ReRoute) is a method of protecting MPLS Traffic Engineering label switched paths
• The idea is to locally repair the LSP at the point of failure by re-routing traffic over a pre-defined backup tunnel, preventing packet loss while the IGP converges
• Protects against both link and node failures
Related Session: RST-2603 Deploying MPLS Traffic Engineering
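As an illustrative sketch of what this looks like in Cisco IOS (interface names, addresses, tunnel numbers, and the explicit-path name are hypothetical): the head-end marks the primary TE tunnel for fast-reroute protection, and the point of local repair pre-builds a backup tunnel for the protected link.

```
! Head-end: primary TE tunnel requesting 50 Mbps (50000 kbps) and FRR protection
interface Tunnel1
 ip unnumbered Loopback0
 tunnel destination 10.0.0.9
 tunnel mode mpls traffic-eng
 tunnel mpls traffic-eng bandwidth 50000
 tunnel mpls traffic-eng path-option 1 explicit name PRIMARY-PATH
 tunnel mpls traffic-eng fast-reroute
!
! Point of local repair: pre-defined backup tunnel protecting this link
interface POS3/0
 mpls traffic-eng backup-path Tunnel2
```

On failure of the protected link, traffic is switched onto Tunnel2 locally while the head-end re-signals the LSP.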
SSO
• Standby RP is in "hot standby"
• Chassis and line card states are in sync
• Line protocols (ATM, Frame Relay, etc.) are in sync
• Forwarding table is in sync

RPR+
• Standby RP is in "hot boot"
• Startup and running configs are in sync
• During failover, line protocols reset
• Forwarding table is NOT in sync

RPR
• Standby RP is in "cold boot"
• Startup configs are in sync; running configs are not
• During failover, the standby resets line cards and restarts the system

NOTE: A Router Reload Forces Both Route Processors to Restart
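On dual-RP platforms the redundancy mode is selected globally; a minimal sketch (exact mode keywords vary by platform and IOS release):

```
redundancy
 mode sso
! Some platforms also accept: mode rpr / mode rpr-plus
```

`show redundancy` then displays the operating mode and the peer RP state.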
What Is Graceful Restart?
• Under certain failure conditions, when a routing process restarts it seeks the help of peer routers to re-learn routes and resume neighbor relationships, while:
Data traffic continues to be routed between the restarting router and its peers
The peers do not prematurely declare the restarting router dead
NSF/SSO Terminology
• NSF capable router (restarting router): a router that preserves its forwarding table and rebuilds its routing topology after an RP switchover; currently a dual-RP router, e.g., Cisco 7500, 10000, 12000, 7304
• NSF aware router (peer): a router that assists an NSF capable router during restart and can preserve routes reachable via the restarting router; e.g., Cisco 7200, 3600, 2600, 1700
• NSF unaware router: a router that is not capable of assisting an NSF capable router during an RP switchover
• An NSF capable router is NSF aware too!
[Diagram: NSF capable Router A peers with NSF aware Router B and NSF unaware Router C]
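Enabling OSPF NSF on the NSF capable router is a one-line addition under the routing process (process number illustrative); the NSF aware peer needs no extra configuration, only an NSF-aware IOS image:

```
router ospf 1
 nsf
```

The `show ip ospf` output on the following slide reflects the result of this configuration after a restart.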
Relevant Show Commands
Active RP: show ip ospf
esr2#show ip ospf
 Routing Process "ospf 1" with ID 2.2.2.1 and Domain ID 0.0.0.1
 Supports only single TOS(TOS0) routes
 <snip>
 Number of areas in this router is 1. 1 normal 0 stub 0 nssa
 External flood list length 0
 Non-Stop Forwarding enabled, last NSF restart 00:02:51 ago (took 37 secs)
    Area BACKBONE(0)
        Number of interfaces in this area is 1 (0 loopback)
Relevant Show Commands (Cont.)
Active RP: show ip ospf neighbor detail
esr2#show ip ospf neighbor det
 Neighbor 3.3.3.1, interface address 192.10.0.3
    In the area 0 via interface GigabitEthernet1/0/0
    Neighbor priority is 1, State is FULL, 7 state changes
    DR is 192.10.0.3 BDR is 192.10.0.2
    Options is 0x52
    LLS Options is 0x1 (LR), last OOB-Resync 00:03:08 ago
    Dead timer due in 00:00:37
    Neighbor is up for 00:03:32
BGP Graceful Restart Timers • Important to keep restart timer below hold time • Default values BGP hold time 180 seconds (3 x 60 sec keepalive) Restart timer default 120 seconds Stale path timer default 180 seconds
• Restart timer is advertised to the peer • Stale path timer is used internally by the router
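A sketch of the corresponding Cisco IOS configuration (AS number illustrative); note the restart time is kept below the 180-second hold time, as recommended above:

```
router bgp 100
 ! Advertise the graceful restart capability to peers
 bgp graceful-restart
 ! Advertised to the peer; must stay below the BGP hold time
 bgp graceful-restart restart-time 120
 ! Used internally: how long stale paths are retained
 bgp graceful-restart stalepath-time 180
```

Both sides must exchange the capability before it takes effect, so existing sessions need to be reset once after configuration.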
BGP GR: Deployment Consideration 1
• Consider routes between R1 and R2 when R1 undergoes graceful restart:
R1 preserves all routes to AS200 and continues forwarding traffic
R2 reaches AS100 via AS300, so all traffic from R2 to R1 goes via R3
All traffic from R1 to R2 goes directly; this can lead to temporary asymmetric routing
No packet loss is experienced from R1 to R2; some packet loss from R2 to R1 during the re-convergence
[Diagram: R1 (NSF/SSO) peers with R2 (AS200, non-NSF), R3 (AS300, NSF aware), and R4 (AS400, non-NSF)]
BGP GR: Deployment Consideration 3
• Consider routes between R1 and R3 when R1 undergoes graceful restart:
R1 preserves all routes to AS300 and continues forwarding traffic
R3 preserves all routes to AS100 and continues forwarding traffic
[Diagram: R1 (NSF/SSO) peers with R2 (AS200, non-NSF), R3 (AS300, NSF aware), and R4 (AS400, non-NSF); within AS100, a route reflector (RR) serves clients RC1 and RC2]
All routers are NSF aware unless indicated otherwise
BGP Graceful Restart Commands R18C12KRP#sh ip bgp nei BGP neighbor is 10.10.104.1, remote AS 100, internal link BGP version 4, remote router ID 10.10.104.1 BGP state = Established, up for 00:00:10 Last read 00:00:09, hold time is 180, keepalive interval is 60 seconds Neighbor capabilities: Route refresh: advertised and received(new) Address family IPv4 Unicast: advertised and received
Indicates Neighbor Is NSF Aware
Graceful Restart Capability: advertised and received
Remote Restart timer is 140 seconds
Address families preserved by peer: IPv4 Unicast
Show Command on Peer Router
On Peer of Restarting Router
ip9-75b# show ip bgp
BGP table version is 209, local router ID is 11.11.11.11
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete
    Network          Next Hop      Metric LocPrf Weight Path
*>  11.0.0.0         0.0.0.0            0        32768 i
*>S 170.10.10.0/24   180.10.10.3        0            0 200 101 e
*>S 180.10.10.0/24   180.10.10.3        0            0 200 101 e
*>S 190.10.10.0/24   180.10.10.3        5            0 200 101 e
EIGRP NSF Operation Summary
Between the NSF capable router and its NSF aware peer:
1. EIGRP NSF capability exchange: Hello with Restart options fields
2. The router restarts, sets its signal timer, and sends a restart notification (Hello with the Restart bit set)
3. The peer responds with a Hello without the Restart bit set, plus Null Restart + INIT
4. Restart + Update packets are exchanged
5. The restarting router stops the signal timer and starts the convergence timer
EIGRP NSF Timers
• On the restarting router
Signal timer: used to send Hellos with the Restart bit set; when this timer expires, Hellos are sent without the Restart bit set
Convergence timer: sets the amount of time the restarting router waits to receive the EOT (end-of-table) marker from peers
• On the peer
Route hold timer: the amount of time the peer waits to receive routing updates and the EOT marker from the restarting router
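These timers map to commands under the EIGRP process; a sketch with illustrative values (the AS number and timer values are assumptions, not recommendations):

```
router eigrp 100
 ! Enable EIGRP NSF on the restarting router
 nsf
 ! Seconds to keep sending Hellos with the Restart bit set
 timers nsf signal 20
 ! Seconds the restarting router waits for the EOT marker from peers
 timers nsf converge 120
 ! Seconds the peer holds routes while waiting for updates + EOT
 timers nsf route-hold 240
```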
• IS-IS graceful restart comes in two flavors
IETF version: draft-ietf-isis-restart-0X
Cisco version
• The difference between them
The IETF version depends on neighbors to rebuild the routing table
The Cisco version does not depend on neighbors to rebuild the routing table; the peer can be non-NSF aware
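Selecting a flavor is done under the IS-IS process; a minimal sketch:

```
router isis
 ! IETF flavor: relies on NSF aware neighbors
 nsf ietf
 ! Cisco flavor instead (peers need not be NSF aware):
 ! nsf cisco
```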
Show and Debug Commands: IETF Version show clns neighbor detail Router#show clns nei detail System Id Interface SNPA State Holdtime Type Protocol esr2 PO1/0/0 *HDLC* Up 24 L2 IS-IS Area Address(es): 49.0002 IP Address(es): 180.10.10.1* Uptime: 00:02:27 NSF capable
• Global Server Load Balancing, Stateful NAT, Stateful IPSec, DNS, DHCP, Cisco Server Load Balancing, IP QoS
• HSRP, VRRP, GLBP, MPLS-TE, IP Event Dampening, Graceful Restart (GR) in BGP, OSPF NSF, ISIS NSF, IP QoS
Network Address Translation (NAT) • Originally defined in RFC 1631 • NAT has been a factor in: Reducing address depletion Allowing interconnection of private networks using addresses as defined in RFC 1918 Hiding networks from outside the administrative domain
• Typically at domain edges To connect B2B To connect to Internet For VPN connections Between “test” and “production” networks
• These domain interconnect points become critical points of failure
More about NAT: Session 2102, Deploying and Troubleshooting NAT
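A classic NAT overload (PAT) configuration at such a domain edge might look like the following sketch; interfaces and addresses are hypothetical:

```
! Inside (RFC 1918) network
interface Ethernet0
 ip address 10.1.1.1 255.255.255.0
 ip nat inside
!
! Outside (Internet-facing) link
interface Serial0
 ip nat outside
!
! Translate inside sources to the outside interface address, many-to-one
access-list 1 permit 10.1.1.0 0.0.0.255
ip nat inside source list 1 interface Serial0 overload
```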
Phased Implementation
• Stateful NAT is being delivered with Cisco IOS in phases
• Phase I: provides support for protocols that do not embed IP address and port information within the payload of the IP packet
Includes HTTP, ICMP, ping, rcp, rlogin, rsh, TCP, Telnet
Requires symmetric routing of return traffic
Supports only "inside" NAT pools
• Phase II: the following protocols and applications are targeted for support:
FTP, H.225, H.245, PPTP/GRE, NetMeeting Directory (ILS), RAS, SIP (both TCP and UDP based), Skinny, TFTP
Asymmetric routing support
Support for outside NAT pools, using the configuration command ip nat outside source pool
Dynamic entries, which are extended out of static definitions
Support for ip nat inside destination
Mapping-ID?
ip nat inside source route-map rm-101 pool SNATPOOL1 mapping-id 10 overload
• Used to specify whether the local SNAT router will distribute a particular set of locally created entries to a peer SNAT router
• Each dynamically created entry inherits a mapping-id number Comes from the mapping defined on the NAT rule At the point of creation
• Mapping list Specifies which of the entries will be forwarded to peers Provides a way to specify that entries from particular NAT rules should be forwarded
ip nat Stateful id 1 redundancy SNATHSRP mapping-id 10 mapping-id 11
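Putting the pieces together, a Stateful NAT pair might be sketched as follows; the HSRP group name ties SNAT redundancy to the HSRP configuration, and the route-map, pool, and addresses are hypothetical:

```
! HSRP group whose name the SNAT configuration references
interface Ethernet0/0
 standby 1 ip 10.1.1.254
 standby 1 name SNATHSRP
!
! Distribute entries carrying this mapping-id to the SNAT peer
ip nat Stateful id 1 redundancy SNATHSRP mapping-id 10
!
! NAT rule whose dynamically created entries inherit mapping-id 10
ip nat pool SNATPOOL1 172.16.10.1 172.16.10.254 prefix-length 24
ip nat inside source route-map rm-101 pool SNATPOOL1 mapping-id 10 overload
```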
IPSec Connection Failures
[Diagram: Main Office with VPN Primary and VPN Backup routers, connected across the WAN to a Remote Site VPN router]
• IPSec connection flows need to be maintained through the correct router in the case of multiple head-end devices • HSRP is used for failover, but can an HSRP vIP be used as the VPN tunnel endpoint?
More on IPSec VPNs: Session SEC-2011, Deploying Site-to-Site IPSec VPN
IPSec Stateful Failover
[Diagram: Main Office with VPN Primary and VPN Backup routers, connected across the WAN to a Remote Site VPN router]
Features
• Ensures the transport network is always available: business resiliency
• Delivers sub-second central-site failover
• Scales to 1000s of remote peers
• Transparent to remote sites
Stateful IPSec Tunneling
[Diagram: an access router connects via IPSec VPN tunnels to the Data Center aggregation site; on a fault, traffic fails over to the surviving tunnel]
• Used in conjunction with HSRP; the HSRP virtual IP is used as source/destination for the IPSec tunnels
• State Synchronization Protocol (SSP) is used to transfer state
• A TCP connection is formed from the Active to each Standby router
SYNC and Sync-Check
• What is exchanged?
Sequence number counters and window states
IKE session keys
Security association attributes, such as cipher, authentication, and compression algorithms
Standby integrity (sync check)
HA for Single Attached Servers
• Single point of failure
• Dual supervisors: fast stateful recovery
• No increase in complexity
[Diagram: a single-attached server running a mission-critical application connects via 100BaseT to a Cisco Catalyst 6000 Series switch with HA dual supervisors and GE or GEC uplinks; harden with intra-chassis redundancy here]
Redundant Servers with Server Load Balancing
[Diagram: a user requesting virtual server 10.1.1.1 reaches the Cisco IOS-SLB device, which directs the request to one of several identical servers at 10.1.1.2, 10.1.1.3, and 10.1.1.4]
Eliminates the server as a single point of failure
ip slb serverfarm WEB-FARM
 real 10.1.1.2
  inservice
 real 10.1.1.3
  inservice
 real 10.1.1.4
  inservice
!
ip slb vserver WEBSVR
 virtual 10.1.1.1
 serverfarm WEB-FARM
 inservice
Cisco IOS Server Load Balancing: available in Cisco IOS images for the Cisco Catalyst 6000 and Cisco 7200, or via the Content Switching Module (CSM)
Cisco Global Site Selector (GSS)
• GSS becomes the authoritative name server for selected applications (i.e., sub-domains)
Works with the existing DNS infrastructure to connect the client to the SLB supporting the requested website
Monitors load and availability of SLBs to select the best SLB (site) to support the request
• Benefits: better control over the request resolution process
High availability for disaster recovery and GSLB applications
Policy-determined, load-balanced resource utilization across sites
Improved performance and fast recovery yield a positive user experience
Cisco Global Server Load Balancing
• In real time, globally load balance all web-based traffic across multiple data centers
• Re-route all traffic to a backup data center in case of a disaster
• Simplify management of the DNS process by providing centralized command and control
[Diagram: Data Center A and Data Center B, each fronted by SLB, CSM, or CSS devices]
In Summary…
• For HA networking, focus on network management, HA technologies, and design optimization (we have covered the first two; breakout sessions cover design optimization in detail)
• Understand and choose the appropriate redundancy protocols available for each network layer
• Outfit critical edge systems with redundant intra-chassis components: processor, power, fans, line cards, switch matrix
• Incorporate load sharing when possible
• Measure and evaluate improvements
• Keep the user perspective
Appendix A: Acronyms 2
• MTTR: Mean Time to Repair
• NAT: Network Address Translation
• NIC: Network Interface Card
• NSF: Non-Stop Forwarding
• PAT: Port Address Translation
• PAgP: Port Aggregation Protocol
• PPP: Point-to-Point Protocol
• PVF: Primary Virtual Forwarder (in GLBP)
• RIB: Routing Information Base (routing table)
• RFC: Request For Comments
• RPR: Resilient Packet Ring (L1/L2 resiliency technology)
• RPR, RPR+: Cisco's Route Processor Redundancy (device resiliency)
• RRI: Reverse Route Injection
• RU: Rack Unit
• SLB: Server Load Balancing
• sNAT: Stateful Network Address Translation
• SNMP: Simple Network Management Protocol
• SPF: Single Point of Failure; Shortest Path First (in routing protocols)
• SSO: Stateful Switchover
• SSP: State Synchronization Protocol
• SVF: Secondary Virtual Forwarder (in GLBP)
• TCP: Transmission Control Protocol
• UDLD: Unidirectional Link Detection