Hello Juniper community! I'm new to Juniper, and I'm at a PreK-12 school with about 800 students and 100 faculty. For the last couple of weeks we've experienced frequent events where we lose our Internet connection for about 10 minutes; they happen between 2 and 10 times a day. Events like this have been occurring a couple of times a week all school year, but the TP-Link TL-ER5120 we had been using would just max out its CPU, so I had no visibility into what was going on. We replaced it with an SRX210 on January 7, but that was also maxing out CPU, sessions per second, and max sessions, so we upgraded to an SRX340 [15.1X49-D150.2] on January 16. Now I finally have some insight into what might be happening, but I need some help identifying a root cause.
During one of these events, a typical output for show security flow session summary looks like this:
inet-fw1> show security flow session summary
Unicast-sessions: 26170
Multicast-sessions: 0
Failed-sessions: 125334
Sessions-in-use: 46416
Valid sessions: 26253
Pending sessions: 0
Invalidated sessions: 20163
Sessions in other states: 0
Maximum-sessions: 262144
That shows tens of thousands of invalidated sessions and well over a hundred thousand failed sessions.
For reference, this is what normal use looked like a few minutes before:
inet-fw1> show security flow session summary
Unicast-sessions: 23597
Multicast-sessions: 0
Failed-sessions: 0
Sessions-in-use: 23822
Valid sessions: 23590
Pending sessions: 0
Invalidated sessions: 232
Sessions in other states: 0
Maximum-sessions: 262144
inet-fw1> show security flow session nat summary
Valid sessions: 23557
Pending sessions: 0
Invalidated sessions: 180
Sessions in other states: 0
Total sessions: 23737
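What I haven't been able to figure out yet is which hosts or protocols are behind the jump in failed and invalidated sessions. My plan for the next event is to filter the session table with something like the commands below (10.0.0.0/8 is just a stand-in for our internal range), but I'm not sure these are the most useful filters:
inet-fw1> show security flow session protocol udp summary
inet-fw1> show security flow session source-prefix 10.0.0.0/8 summary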
After each event I cleared the counters with clear security flow statistics, so following an event show security flow statistics looked like this:
inet-fw1> show security flow statistics
Current sessions: 19784
Packets forwarded: 5265663
Packets dropped: 3270801
Fragment packets: 3322634
Pre fragments generated: 0
Post fragments generated: 0
Fragment packets is usually 0 or a single-digit number until one of these events starts. Once the SRX has recovered and the counter stops incrementing, dividing the total fragment count by the duration of the event works out to roughly 20,000 packets per second, which is about the same pps I see on our ge-0/0/0 Internet interface with monitor interface traffic. In other words, nearly every packet is being fragmented for a period of 5-10 minutes. According to our ISP, these events correspond with a spike in bandwidth utilization that saturates our 200 Mbps fiber connection. However, I don't know which is the cause and which is the effect: are we over-utilizing the fiber connection so that packets fragment and sessions drop, or are these events causing all the sessions to re-establish at once, which maxes out our bandwidth? FWIW, here is our ISP's graph for this week, with red dots marking the events for which I recorded times: [attachment: bandwidth utilization graph.png]
I can definitely say that not every period of maxed-out bandwidth results in one of these events, so I'm not convinced saturation alone is the cause. On the other hand, I don't see any evidence (via show security flow statistics) that the events occur overnight or on weekends when there aren't users on campus.
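To catch the fragment counter climbing in real time instead of doing the math after the fact, I'm planning to leave something like this running in a spare terminal during the school day (assuming the refresh pipe behaves on this command the way it does elsewhere):
inet-fw1> show security flow statistics | refresh 10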
CPU usage does not seem like an issue, and show chassis routing-engine never shows anything concerning. However, we're probably reaching the SRX340 session-creation-per-second limit during these times:
inet-fw1> show security monitoring performance spu
fpc 0 pic 0
Last 60 seconds:
0: 34 1: 28 2: 34 3: 28 4: 35 5: 34
6: 29 7: 34 8: 27 9: 35 10: 29 11: 35
12: 28 13: 34 14: 28 15: 34 16: 27 17: 35
18: 34 19: 31 20: 35 21: 28 22: 34 23: 28
24: 34 25: 28 26: 35 27: 29 28: 34 29: 28
30: 34 31: 34 32: 28 33: 34 34: 28 35: 35
36: 28 37: 33 38: 27 39: 34 40: 28 41: 34
42: 28 43: 34 44: 32 45: 32 46: 34 47: 28
48: 34 49: 28 50: 35 51: 27 52: 34 53: 27
54: 34 55: 28 56: 33 57: 26 58: 33 59: 34
inet-fw1> show security monitoring fpc 0
FPC 0
PIC 0
CPU utilization : 28 %
Memory utilization : 50 %
Current flow session : 52459
Current flow session IPv4: 52459
Current flow session IPv6: 0
Max flow session : 262144
Total Session Creation Per Second (for last 96 seconds on average): 7857
IPv4 Session Creation Per Second (for last 96 seconds on average): 7857
IPv6 Session Creation Per Second (for last 96 seconds on average): 0
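Since that number is a 96-second average, I assume the per-second view from the same command family would show how high the peaks actually get during an event, though I haven't confirmed this is the right variant to use:
inet-fw1> show security monitoring performance session fpc 0 pic 0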
I have typical screens applied to both the trust and untrust zones, and the only non-zero counters are the following:
inet-fw1> show security screen statistics zone trust
TCP SYN flood 9820
SYN flood source 9820
SYN flood destination 0
IP spoofing 3532
TCP FIN no ACK 1
IP block fragment 196
inet-fw1> show security screen statistics zone untrust
IP tear drop 88
I'm not ruling out attacks (either from the outside or from knucklehead students on the inside), but I'm not sure whether these are causes or symptoms. I don't know exactly how the counters increment (for a genuine teardrop attack, should the counter increment for every packet?), but out of 3322634 fragmented packets it seems likely that 88 of them could look like a teardrop attack whether or not one is actually happening. Likewise, when so many sessions are being established in a short amount of time, it seems plausible that the firewall would perceive that as a SYN flood.
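For what it's worth, I've been double-checking the thresholds behind those counters with the commands below, in case the values are low enough that normal spikes trip them (trust-screen is the profile name I used; untrust-screen is just my guess at what I called the other one):
inet-fw1> show configuration security screen ids-option trust-screen
inet-fw1> show configuration security screen ids-option untrust-screen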
I've played around with values for security screen ids-option trust-screen limit-session source-ip-based to try to limit the sessions being created under normal circumstances and smooth out any spikes. A value of 100 doesn't seem to prevent the events, 20 felt like we were back in dial-up times so I didn't leave it in place long enough to test, and 50 is noticeably restrictive for end users but still doesn't prevent the events from happening.
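For completeness, this is the kind of change I was committing each time (with 100, 50, or 20 as the value). I've also been wondering whether adding alarm-without-drop to the profile while I troubleshoot would let the screens count and log hits without actually dropping anything, but I haven't confirmed how that interacts with the session limit:
set security screen ids-option trust-screen limit-session source-ip-based 100
set security screen ids-option trust-screen alarm-without-drop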
Pings from the shell to an IP address a couple hops up the ISP chain time out for most of the event, and the only thing in the logs that seems to correspond with one of these events is:
TOPO_CH: for Instance 0 in routing-instance default received on port ge-0/0/1.0
ge-0/0/1.0 is my trust interface, but again, I'm not sure if this is a cause or symptom.
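To see whether those messages actually line up with the outage windows, I've been searching the log around each event (assuming they land in the main messages file):
inet-fw1> show log messages | match TOPO_CH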
Here is the output for show system statistics tcp for the first 48 hours this device was in production (11am Thursday to 11am Saturday):
Tcp:
1090164 packets sent
288028 data packets (44316564 bytes)
936 data packets retransmitted (888201 bytes)
0 resends initiated by MTU discovery
751285 ack only packets (2199 packets delayed)
0 URG only packets
0 window probe packets
10 window update packets
100184 control packets
1872506 packets received
262132 acks(for 44312689 bytes)
1494766 duplicate acks
0 acks for unsent data
19204 packets received in-sequence(2582789 bytes)
747406 completely duplicate packets(55306 bytes)
29 old duplicate packets
53 packets with some duplicate data(14654 bytes duped)
264 out-of-order packets(19040 bytes)
0 packets of data after window(0 bytes)
0 window probes
202 window update packets
7 packets received after close
23 discarded for bad checksums
0 discarded for bad header offset fields
0 discarded because packet too short
49791 connection requests
625 connection accepts
9 bad connection attempts
0 listen queue overflows
646 connections established (including accepts)
51211 connections closed (including 38 drops)
14 connections updated cached RTT on close
14 connections updated cached RTT variance on close
3 connections updated cached ssthresh on close
49761 embryonic connections dropped
260991 segments updated rtt(of 308477 attempts)
605 retransmit timeouts
18 connections dropped by retransmit timeout
0 persist timeouts
0 connections dropped by persist timeout
746940 keepalive timeouts
746933 keepalive probes sent
7 connections dropped by keepalive
59440 correct ACK header predictions
9342 correct data packet header predictions
666 syncache entries added
63 retransmitted
38 dupsyn
0 dropped
625 completed
0 bucket overflow
0 cache overflow
32 reset
9 stale
0 aborted
0 badack
0 unreach
0 zone failures
0 cookies sent
0 cookies received
4 SACK recovery episodes
3 segment retransmits in SACK recovery episodes
1429 byte retransmits in SACK recovery episodes
119 SACK options (SACK blocks) received
23 SACK options (SACK blocks) sent
0 SACK scoreboard overflow
0 ACKs sent in response to in-window but not exact RSTs
0 ACKs sent in response to in-window SYNs on established connections
0 rcv packets dropped by TCP due to bad address
0 out-of-sequence segment drops due to insufficient memory
49804 RST packets
6 ICMP packets ignored by TCP
0 send packets dropped by TCP due to auth errors
0 rcv packets dropped by TCP due to auth errors
0 outgoing segments dropped due to policing
And here's another output that seems useful:
inet-fw1> show interfaces ge-0/0/0 extensive | find "Flow error statistics"
Flow error statistics (Packets dropped due to):
Address spoofing: 0
Authentication failed: 0
Incoming NAT errors: 0
Invalid zone received packet: 0
Multiple user authentications: 0
Multiple incoming NAT: 0
No parent for a gate: 0
No one interested in self packets: 0
No minor session: 0
No more sessions: 0
No NAT gate: 0
No route present: 0
No SA for incoming SPI: 0
No tunnel found: 0
No session for a gate: 0
No zone or NULL zone binding 0
Policy denied: 19312
Security association not active: 0
TCP sequence number out of window: 122
Syn-attack protection: 0
User authentication errors: 0
Protocol inet, MTU: 1500, Generation: 153, Route table: 0
Flags: Sendbcast-pkt-to-re, Is-Primary
Addresses, Flags: Is-Default Is-Preferred Is-Primary
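I also haven't dug into what is behind that Policy denied counter. If it's worth chasing, I assume I could add session logging to the relevant deny policy (deny-all here is a placeholder for whatever my real policy is called) and then watch for the deny messages, assuming they end up in the syslog with event-mode logging:
set security policies from-zone untrust to-zone trust policy deny-all then log session-init
inet-fw1> show log messages | match RT_FLOW_SESSION_DENY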
We have two lightly-used web servers in our DMZ (that interface shows 15.4MB, 34k packets input and 3.9MB, 30k packets output over 48 hours), but I unplugged the cable anyway to rule them out, and we still experienced events, so I don't think they are to blame.
We also have an inline cachebox and content filter between the SRX340 and our Cisco WS-C3850-12S core switch, and we still experienced the issue when they were bypassed, so I think I can rule them out as well.
I added the following under security flow, but the security-trace log file is still 0 bytes, so I'm not sure what I need to do to get tracing working; I think it would be very helpful here:
traceoptions {
    file size 200k files 5 world-readable;
    flag fragmentation;
    flag session;
    flag tcp-basic;
    rate-limit 1000;
}
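Unless someone tells me otherwise, my next attempt is to name the trace file explicitly, add the basic-datapath flag (which I gather is what actually writes the per-packet detail), and scope it with a packet-filter to a single test machine so the file doesn't explode. flow-trace, pf-test, and 10.1.1.50 are placeholders for my own file name, filter name, and test host:
set security flow traceoptions file flow-trace size 1m files 5 world-readable
set security flow traceoptions flag basic-datapath
set security flow traceoptions packet-filter pf-test source-prefix 10.1.1.50/32
and then check the results with:
inet-fw1> show log flow-trace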
I'm hoping someone can help me use the tools this Juniper box offers to understand what might be causing these events, since I'm still learning. These interruptions have been very disruptive for the last two weeks since school resumed. I'm hoping that with the long weekend I might be able to figure something out, but I'm out of ideas.
Thanks in advance