8. High Availability Support (CE Failover Support)

The ForCES protocol provides mechanisms for CE redundancy and failover in order to support High Availability as per Reqs[]. FE redundancy and FE-to-FE interaction are currently out of scope of this draft. There can be multiple redundant CEs and FEs in a ForCES NE. However, at any time there can be only one Primary CE controlling the FEs, and there can be multiple secondary CEs. Both the FE PL and the CE PL are aware of which CEs are primary and which are secondary. This information is configured in the FE PL and the CE PL during pre-association by the FEM and the CEM, respectively. Only the primary CE sends Control messages to the FEs.

The FE may send its event reports and redirection packets to the Primary CE only (Report Primary Mode), or it may send them to both the primary and the secondary CEs (Report All Mode). The latter helps keep state between CEs synchronized, although it does not guarantee synchronization. These HA modes are configured during the Association setup phase but can be changed by the CE at any time during protocol operation. A CE-to-CE synchronization protocol will be needed in most cases to support fast failover; however, such a protocol is not defined by the ForCES protocol.

During a communication failure between the FE and the CE (one caused by the CE or the link, i.e. not FE related), the TML on the FE notifies the FE PL of the failure. The failure can also be detected using the HB messages between FEs and CEs. Either the FE PL sends a message (Event Report) to the Secondary CEs to indicate the failure, or the CE PL detects it; one of the Secondary CEs then takes over as the primary CE for the FE. During this phase, if the original primary CE comes alive and starts sending commands to the FE, the FE should ignore those messages and send an Event to all CEs indicating its change in Primary CE. Thus the FE has only one primary CE at a time.
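
The FE-side behavior described above can be sketched as a small state model. This is only an illustration under stated assumptions: the names FeHaState, ReportMode, report, primary_failed, takeover and control, as well as the message strings, are hypothetical and are not defined by the ForCES protocol. The sketch also assumes that the old primary is kept as a secondary after a takeover, which the text does not specify.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple


class ReportMode(Enum):
    REPORT_PRIMARY = 1  # events and redirects go to the primary CE only
    REPORT_ALL = 2      # events and redirects go to primary and secondaries


@dataclass
class FeHaState:
    """Hypothetical FE-side view of its CEs (not a ForCES-defined structure)."""
    primary: int
    secondaries: List[int]
    mode: ReportMode = ReportMode.REPORT_PRIMARY
    # Log of (CEID, message) pairs standing in for messages handed to the TML.
    outbox: List[Tuple[int, str]] = field(default_factory=list)

    def all_ces(self) -> List[int]:
        return [self.primary] + self.secondaries

    def report(self, event: str) -> None:
        """Send an event report according to the configured HA mode."""
        if self.mode is ReportMode.REPORT_ALL:
            targets = self.all_ces()
        else:
            targets = [self.primary]
        for ceid in targets:
            self.outbox.append((ceid, event))

    def primary_failed(self) -> None:
        """Called when the TML or missed heartbeats signal loss of the primary."""
        for ceid in self.secondaries:
            self.outbox.append((ceid, f"EventReport: primary CE {self.primary} down"))

    def takeover(self, new_primary: int) -> None:
        """A secondary becomes primary. Assumption: old primary becomes a secondary."""
        if new_primary not in self.secondaries:
            raise ValueError("only a configured secondary CE may take over")
        self.secondaries.remove(new_primary)
        self.secondaries.append(self.primary)
        self.primary = new_primary

    def control(self, sender: int, command: str) -> bool:
        """Accept a control message only from the current primary CE."""
        if sender == self.primary:
            return True
        # Stale or non-primary sender: ignore the command and tell every CE
        # which CE the FE currently regards as primary.
        for ceid in self.all_ces():
            self.outbox.append((ceid, f"Event: primary CE is now {self.primary}"))
        return False
```

For example, with primary CE 1 and secondaries 2 and 3, calling primary_failed() queues an Event Report for CEs 2 and 3; after takeover(2), a command from CE 1 is rejected by control() and an Event naming CE 2 as primary is queued for all three CEs.
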
An explicit message (Config message, Move command) from the primary CE can also be used to change the Primary CE for an FE during normal protocol operation.

In order to support fast failover, the FE establishes an association (setup message) and completes the capability exchange with the Primary CE as well as with all the Secondary CEs, in all scenarios and modes. The two scenarios (Report All, Report Primary) are illustrated in Figure 30 and Figure 31.

   FE                  CE Primary          CE Secondary
    |                       |                    |
    | Asso Estb,Caps exchg  |                    |
  1 |<--------------------->|                    |
    |                       |                    |
    |         Asso Estb,Caps|exchange            |
  2 |<----------------------|------------------->|
    |                       |                    |
    |       All msgs        |                    |
  3 |<--------------------->|                    |
    |                       |                    |
    |    packet redirection,|events, HBs         |
  4 |-----------------------|------------------->|
    |                       |                    |
    |                    FAILURE                 |
    |                                            |
    |         Event Report (pri CE down)         |
  5 |------------------------------------------->|
    |                                            |
    |                  All Msgs                  |
  6 |------------------------------------------->|

          Figure 30: CE Failover for Report All mode

   FE                  CE Primary          CE Secondary
    |                       |                    |
    | Asso Estb,Caps exchg  |                    |
  1 |<--------------------->|                    |
    |                       |                    |
    |         Asso Estb,Caps|exchange            |
  2 |<----------------------|------------------->|
    |                       |                    |
    |       All msgs        |                    |
  3 |<--------------------->|                    |
    |                       |                    |
    |            (HeartBeats| only)              |
  4 |-----------------------|------------------->|
    |                       |                    |
    |                    FAILURE                 |
    |                                            |
    |         Event Report (pri CE down)         |
  5 |------------------------------------------->|
    |                                            |
    |                  All Msgs                  |
  6 |------------------------------------------->|

          Figure 31: CE Failover for Report Primary Mode

8.1 Responsibilities for HA

TML level - Transport level:

1. The TML controls logical connection availability and failover.

2. The TML also controls peer HA management.

At this level, control of all lower layers, for example the transport level (IP addresses, MAC addresses, etc.), and the handling of associated links going down are the role of the TML.

PL Level:

All other functionality is the responsibility of the PL. This includes configuring the HA behavior during setup, the CEIDs used to identify primary and secondary CEs, the protocol messages used to report CE failure (Event Report), the Heartbeat messages used to detect association failure, the messages used to change the primary CE (Config - Move), and the other HA-related operations described earlier.

To put the two together: if a path to a primary CE is down, the TML takes care of failing over to a backup path, if one is available. If the CE is totally unreachable, the PL is informed and takes the appropriate actions described earlier.
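
The division of labor between the TML and the PL can be sketched as follows. This is a minimal illustration, not part of the protocol: the class TmlConnection, its path names, and the on_ce_unreachable callback are hypothetical, and the sketch assumes the TML simply picks the next remaining path in order.

```python
from typing import Callable, Iterable, List, Optional


class TmlConnection:
    """Hypothetical TML view of one CE: an ordered list of usable paths."""

    def __init__(self, ceid: int, paths: Iterable[str],
                 on_ce_unreachable: Callable[[int], None]) -> None:
        self.ceid = ceid
        self.paths: List[str] = list(paths)
        self.active: Optional[str] = self.paths[0] if self.paths else None
        # Callback into the PL, invoked only when no path to the CE remains.
        self.on_ce_unreachable = on_ce_unreachable

    def path_down(self, path: str) -> None:
        """Handle loss of one path to the CE."""
        if path in self.paths:
            self.paths.remove(path)
        if path != self.active:
            return  # a standby path failed; the active path is unaffected
        if self.paths:
            # TML-level failover to a backup path; the PL is not involved.
            self.active = self.paths[0]
        else:
            # No backup path: the CE is unreachable, so escalate to the PL.
            self.active = None
            self.on_ce_unreachable(self.ceid)
```

With two paths configured, the first path_down() call on the active path switches to the backup silently; only the second call, which removes the last path, reaches the PL, which would then apply the CE failover handling described in this section.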