Hi,
does anybody have experience with failover duration?
I have an SRX-550M cluster, connected on the downlink side to an (HPE) L3 switch cluster in a 'square' architecture:
  |       |
SRX1 --- SRX2
  |       |
 SW1 --- SW2
Each SRX is connected to its SW via 8 aggregated links.
Routing is done with primary/secondary static routes: on the switch cluster, a default route to SRX1 and a second one with lower priority to SRX2; on the SRX cluster, routes to SW1 and backup routes with lower priority to SW2.
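On the SRX side, that primary/backup routing can be expressed with qualified next-hops and route preferences; a minimal sketch, where the prefix and the next-hop addresses (10.0.1.1 = SW1, 10.0.2.1 = SW2) are hypothetical:

```
# Preferred path via SW1, backup via SW2 (lower preference wins)
set routing-options static route 192.168.0.0/16 qualified-next-hop 10.0.1.1 preference 10
set routing-options static route 192.168.0.0/16 qualified-next-hop 10.0.2.1 preference 20
```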
RG0 and RG1 are configured for the uplink interconnection.
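For context, the redundancy-group setup is the usual one; a sketch with assumed priorities and an assumed monitored interface name:

```
# Node 0 preferred for both the RE (RG0) and the data plane (RG1)
set chassis cluster redundancy-group 0 node 0 priority 200
set chassis cluster redundancy-group 0 node 1 priority 100
set chassis cluster redundancy-group 1 node 0 priority 200
set chassis cluster redundancy-group 1 node 1 priority 100
# Optional: fail RG1 over if a monitored physical link goes down (interface assumed)
set chassis cluster redundancy-group 1 interface-monitor ge-0/0/1 weight 255
```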
I tried 4 configurations, combinations of:
- static or dynamic LAGs,
- BFD to supervise the routes, in order to accelerate secondary-route activation on loss of chassis #1 or of interfaces on chassis #1.
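The two options tested look roughly like this in Junos; a sketch where the reth name, prefix, next-hop address, and BFD timers are all assumptions, not my exact config:

```
# LACP on the redundant-Ethernet LAG toward the switches (reth name assumed)
set interfaces reth1 redundant-ether-options lacp active
set interfaces reth1 redundant-ether-options lacp periodic fast
# BFD session protecting the static route toward SW1 (timers assumed)
set routing-options static route 192.168.0.0/16 qualified-next-hop 10.0.1.1 bfd-liveness-detection minimum-interval 300
set routing-options static route 192.168.0.0/16 qualified-next-hop 10.0.1.1 bfd-liveness-detection multiplier 3
```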
I run traffic crossing the whole chain and measure the traffic interruption when I perform a manual failover (via CLI). Here are the results:
1. without lacp and without bfd : traffic interruption ~ 1s : very good
2. with lacp and without bfd : traffic interruption ~ 18s
3. without lacp and with bfd : traffic interruption ~ 22s
4. with lacp and with bfd : traffic interruption ~ 28s : very bad
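For reference, the manual failover is triggered with the standard CLI commands; note that an RG0 failover moves the Routing Engine itself, which is relevant to the caution quoted below:

```
# Fail the data plane (RG1) over to node 1, then clear the manual-failover flag
request chassis cluster failover redundancy-group 1 node 1
request chassis cluster failover reset redundancy-group 1
# Fail the Routing Engine (RG0) over to node 1
request chassis cluster failover redundancy-group 0 node 1
```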
Is it 'normal' for the SRX to show such long interruptions as soon as I add these protocols? Or do you think there is a 'problem somewhere'?
The only clue I found at Juniper is :
(https://www.juniper.net/documentation/en_US/junos/topics/concept/chassis-cluster-redundancy-group-failover-manual-understanding.html)
Caution: Be cautious and judicious in your use of redundancy group 0 manual failovers. A redundancy group 0 failover implies a Routing Engine failover, in which case all processes running on the primary node are killed and then spawned on the new master Routing Engine. This failover could result in loss of state, such as routing state, and degrade performance by introducing system churn.
Thanks for your advice!