TCP congestion control

As we know, TCP is called a reliable, connection-oriented protocol. But why?

Basically because it keeps the data in its buffer before and after sending it, makes sure the data is delivered in sequence, and expects the receiver to confirm what it received with an acknowledgement (ACK). Otherwise the data will be re-transmitted. There are different ways to re-send the data, so let’s explore some of them.

So what is happening under the hood of this massive protocol (it has a big overhead compared to UDP)?

Some ideas about TCP congestion mechanisms  :

  1. They were created with the intention that small buffers (of devices) would overflow, packet loss would be triggered and TCP would react. All good and cool, but in 2017 we have huge buffers and this can cause problems: it takes time before they overflow, which creates delays.
  2. With very small buffers we also have a problem: if a burst of packets arrives and some of them get lost because of the small buffer, TCP can treat the loss as congestion. It will then reduce its congestion window and, as a result, the links won’t be filled completely.
  3. Link flapping – this is common in campus networks. Link flapping and fading can trick TCP into thinking that there is extreme congestion in the network and that it should therefore back off its re-transmissions exponentially.

To avoid a lot of bad stuff we have mechanisms like :

  • sliding window – changing the window size depending on successfully received traffic
  • stop and wait – one frame per ACK; basically only one frame can be outstanding until its ACK arrives
  • cumulative acks – if I acknowledge packet 3, that means I also acknowledge that I received the 2 packets before it.
  • Go-Back-N – if a single packet is lost we re-transmit the whole window starting from that packet (good when there are bursts of losses); when the sender’s window is larger than the receiver’s, the protocol will use Go-Back-N.
  • Selective repeat – we re-transmit only the packet that was lost
  • It is very important to make sure that we are not re-transmitting too early.
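To make the difference between these re-transmission styles concrete, here is a minimal sketch in Python. The function names and the use of plain integer sequence numbers are my own assumptions for illustration; real TCP numbers bytes, not packets.

```python
def cumulative_ack(received, first_seq=1):
    """Highest N such that every packet first_seq..N has arrived.

    Returns first_seq - 1 when even the first packet is missing.
    """
    seq = first_seq
    while seq in received:
        seq += 1
    return seq - 1


def go_back_n_resend(lost_seq, last_sent):
    """Go-Back-N: resend the lost packet and everything sent after it."""
    return list(range(lost_seq, last_sent + 1))


def selective_repeat_resend(sent, received):
    """Selective repeat: resend only the packets that never arrived."""
    return sorted(set(sent) - set(received))
```

With packets 1..6 sent and packet 4 lost, the cumulative ACK stays at 3, Go-Back-N resends 4, 5 and 6, and selective repeat resends only 4.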

Mainly, TCP congestion control is an end-to-end, host-based mechanism.

  • It reacts to events observable at the end host
  • Uses TCP’s sliding window and flow control
  • Tries to figure out how many packets can safely be outstanding in the network at a time.

You can memorize a very simple form of the TCP congestion mechanism and build everything on top of it – at least I’m doing so, I might be wrong 🙂

AIMD – Additive increase, multiplicative decrease

  • Basically, if a packet was received without errors and we got the ACK, we increase the window size: w = w + 1/w
  • If a packet was dropped we use the formula w = w/2 – so after the first dropped packet we cut the window in half.

AIMD also helps us to fully use the links – the window size expands according to AIMD to probe how many bytes the pipe can hold.
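The two AIMD rules above fit in a few lines of Python. This is only a sketch: the window is counted in packets, and the floor of one packet after a loss is my own assumption, not something the rules above state.

```python
def aimd_step(window, packet_lost):
    """Apply one AIMD update to a window measured in packets."""
    if packet_lost:
        # Multiplicative decrease: w = w / 2 (never below one packet here).
        return max(1.0, window / 2)
    # Additive increase: w = w + 1/w for every ACK received.
    return window + 1.0 / window


def simulate_aimd(events, window=1.0):
    """Return the window after each event; True in `events` means a loss."""
    history = [window]
    for lost in events:
        window = aimd_step(window, lost)
        history.append(window)
    return history
```

For example, `simulate_aimd([False, False, True])` grows the window from 1.0 to 2.0 and 2.5, then halves it to 1.25.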

Summary for AIMD :

  • The throughput of an AIMD flow is sensitive to the drop probability and very sensitive to RTT (round trip time).
  • With many flows, each flow follows its own AIMD rule.
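To see that sensitivity in numbers, a commonly quoted approximation for a single AIMD flow with periodic loss is throughput ≈ sqrt(3/2) · MSS / (RTT · sqrt(p)). The formula is not from the text above; I am bringing it in as an assumption, and the function and parameter names are mine.

```python
import math


def aimd_throughput(mss_bytes, rtt_seconds, drop_probability):
    """Approximate steady-state AIMD throughput in bytes per second.

    Assumes a single flow and a simple periodic loss model.
    """
    return math.sqrt(1.5) * mss_bytes / (rtt_seconds * math.sqrt(drop_probability))
```

In this model doubling the RTT halves the throughput, while the drop probability has to grow four times to do the same damage.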

We have several TCP congestion control mechanisms :


  • slow start (on connection startup or after a packet timeout, to quickly find the network capacity)
  1. the window starts at one MSS (maximum segment size)
  2. the window is increased for each ACK received
  3. the congestion window grows exponentially to sense the network capacity
  • congestion avoidance state – to carefully probe when close to the maximum network capacity
  • triple duplicate acks
  • fast re-transmission – don’t wait for a timeout to re-transmit a missing segment if you receive a triple duplicate ACK.
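Here is a rough sketch of how the window grows in the slow start and congestion avoidance states, one step per loss-free RTT. The window is counted in MSS units, and capping the slow start doubling at the threshold is my own simplification.

```python
def grow_window(cwnd, ssthresh, mss=1):
    """Congestion window after one RTT without losses."""
    if cwnd < ssthresh:
        # Slow start: one MSS more per ACK, so the window doubles per RTT.
        return min(cwnd * 2, ssthresh)
    # Congestion avoidance: probe carefully, one MSS more per RTT.
    return cwnd + mss


def trace_window(rtts, ssthresh, cwnd=1):
    """Window sizes over `rtts` round trips, starting at `cwnd`."""
    history = [cwnd]
    for _ in range(rtts):
        cwnd = grow_window(cwnd, ssthresh)
        history.append(cwnd)
    return history
```

With a threshold of 8, `trace_window(6, 8)` gives 1, 2, 4, 8 during slow start and then 9, 10, 11 in congestion avoidance.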

FSM for Tahoe Mechanism :



TCP Reno behaves identically to Tahoe on a timeout.

  • On a triple duplicate ACK it :
  1. sets the threshold to congestion window / 2
  2. sets the congestion window to congestion window / 2 (fast recovery)
  3. inflates the congestion window size (fast recovery)
  4. retransmits the missing segment (fast retransmit)
  5. stays in the congestion avoidance state
  • TCP Reno adds an additional optimization: three duplicate ACKs don’t cause TCP to lose an RTT’s worth of transmission while it waits for the missing segment to be acked.

FSM for Reno Mechanism :


Basically, the difference between Tahoe and Reno is fast recovery.
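That difference can be sketched as one function that returns the new congestion window, the new threshold and the next state. The event and state names are hypothetical labels of mine, the window is in MSS units, and the floor of 2 MSS on the threshold is an assumption borrowed from common practice rather than from the description above.

```python
def on_loss(variant, event, cwnd, mss=1):
    """Return (new_cwnd, new_ssthresh, next_state) after a loss signal."""
    if variant not in ("tahoe", "reno"):
        raise ValueError(f"unknown variant: {variant}")
    if event not in ("timeout", "triple_dup_ack"):
        raise ValueError(f"unknown event: {event}")

    ssthresh = max(cwnd // 2, 2 * mss)
    if variant == "reno" and event == "triple_dup_ack":
        # Fast recovery: halve the window and stay in congestion avoidance.
        return ssthresh, ssthresh, "congestion_avoidance"
    # Tahoe on any loss, and Reno on a timeout: start over from one MSS.
    return mss, ssthresh, "slow_start"
```

With a window of 16, Tahoe falls back to 1 and slow start on a triple duplicate ACK, while Reno continues from 8 in congestion avoidance.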

Observation signals :

  • increasing ACKs : transfer is going well
  • duplicate ACKs : something was lost or delayed
  • timeout – bad stuff 🙂
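A sender can read these signals straight from the stream of ACK numbers. The sketch below finds the points where a triple duplicate ACK would trigger a fast re-transmission; the function name and the list-of-integers input are my own assumptions.

```python
def find_fast_retransmits(acks):
    """Return the ACK numbers that were duplicated three times in a row."""
    triggers = []
    last_ack = None
    dup_count = 0
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == 3:
                # Original ACK plus three duplicates: something is missing.
                triggers.append(ack)
        else:
            # A new ACK number: the transfer is going well.
            last_ack = ack
            dup_count = 0
    return triggers
```

For the stream 1, 2, 3, 3, 3, 3, 4 the function reports 3, because the ACK for 3 arrived once and was then repeated three times.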

TCP also uses self-clocking – each arriving ACK tells the sender that a packet has left the network.

Credits to Stanford University for providing such a great course – almost all the info here is taken from their self-paced Networking course.

BGP overview

I would like to put down some facts about the most famous routing protocol.

BGP – Border Gateway Protocol

According to Wikipedia, the version currently in use is BGP4, which was published as RFC 4271 in 2006.

BGP4 has been in use on the Internet since 1994.

Basically, the world was changed by the three-napkin protocol (picture found on the Network Collective Podcast):


Its simplified finite state machine is (taken from wiki) :


Some basics of BGP :

  • Routing protocol between autonomous systems, based on a path-vector mechanism
  • Slowest routing protocol to converge

By default, if a route goes down BGP won’t flap it; it will wait for 30 seconds before notifying.

  • Helps you to be reachable through multiple service providers
  • To reconverge quickly BGP relies on the underlying internal routing protocol (IGP)
  • Unlike other routing protocols, BGP does not trust its neighbors: there are multiple filters, and you need to agree with your BGP peers on how and what will traverse the link before you can establish it.
  • BGP runs on top of TCP port 179
  • Has triggered updates (5 seconds internal, 30 seconds external)
  • 13 different metrics for finding the best route (largest weight, highest local preference, locally originated, shortest AS path, lowest origin type, lowest MED, etc.)
  • All neighbors need to be set manually –
     neighbor ip remote-as as_number
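The first few best-route criteria listed above (weight, local preference, locally originated, AS path length, origin type, MED) can be sketched as a sort key. This is only an illustration: the `Route` class and its default values are hypothetical, only six of the metrics are modelled, and real routers compare MED only under extra conditions.

```python
from dataclasses import dataclass

# Lower rank is preferred: IGP < EGP < incomplete.
ORIGIN_RANK = {"igp": 0, "egp": 1, "incomplete": 2}


@dataclass
class Route:
    next_hop: str
    weight: int = 0
    local_pref: int = 100
    locally_originated: bool = False
    as_path: tuple = ()
    origin: str = "igp"
    med: int = 0


def preference_key(route):
    """Tuple that sorts the most preferred route first."""
    return (
        -route.weight,                 # largest weight wins
        -route.local_pref,             # then highest local preference
        not route.locally_originated,  # then locally originated routes
        len(route.as_path),            # then the shortest AS path
        ORIGIN_RANK[route.origin],     # then the lowest origin type
        route.med,                     # then the lowest MED
    )


def best_path(routes):
    """Pick the preferred route from a non-empty list of candidates."""
    return min(routes, key=preference_key)
```

Because the key is a tuple, a later criterion is only consulted when all earlier ones tie, which mirrors the step-by-step comparison.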


  • RFC rule about traffic and BGP – when a packet leaves your AS it’s not your traffic anymore, so basically you can’t tell anyone else what to do with their traffic.
  • Neighbors must be set manually and be directly reachable
  • Multiple sessions to the same neighbor are not permitted and will be dropped.
  • The network command works differently – the network you want to advertise really needs to be in your routing table, otherwise it won’t be advertised. It needs to be an exact match to the routing table entry.
  • BGP packets :
  • Open – sent to the neighbor router once the neighbor is configured, works like a hello
  • Update – used to update the routing table or to send any changes
  • Keepalive – BGP has its own keepalive mechanism
  • Notification – any BGP error condition generates a notification message, after which the session is closed


  • BGP states
  • Idle – a neighbor is configured but we haven’t talked yet; usually a router gets stuck in this state when something is misconfigured.
  • Active – tries to establish communication; a lot of issues also happen here
  • Open sent
  • Open confirm
  • Established – all good


  • Enabling BGP and adding a neighbor is really simple :
  • conf t
  • router bgp xx
  • neighbor x.x.x.x remote-as xx
  • The default hold time is 180 seconds – this is the interval after which a silent neighbor is considered dead.
  • After applying any BGP rule you need to clear the session : clear ip bgp * (dangerous 🙂 don’t use this in production)
  • BGP won’t advertise anything until you specify what to advertise
  • Usually the ISP puts a filter in place so you are able to send only the routes you have agreed on with them
  • Filtering is done using route maps
  • When reading the AS-path value, read it from right to left and you will see how many autonomous systems the route has passed through. It is also the anti-loop mechanism: if a router sees its own AS in the path it will drop the route.
  • BGP attributes are attached to every route advertisement
  • A route-map is something similar to an access list: it performs if-then statements, called match/set, and is used for modifying BGP attributes, policy routing and route filtering.
  • EBGP is used between different ASes – to receive routes and exchange them with uplinks
  • IBGP – used for connections inside the same AS
  • IBGP does not modify any BGP attributes
  • IBGP has no loop prevention mechanism
  • The BGP split horizon rule is to never advertise a route you received via IBGP to another IBGP peer. Because of this you need a full mesh between your IBGP neighbors, or route reflectors.
  • IBGP peerings should be formed using loopback interfaces – so that the session has multiple paths in case of a link failure.
  • IBGP and EBGP have different administrative distances (AD) : EBGP-learned path AD = 20; IBGP = 200
  • EBGP neighbors must be directly connected, but to bend the rules a bit we can use the ebgp-multihop option.

     neighbor xx.xx.xx.xx ebgp-multihop 2

(for example, in case we are using a loopback for EBGP)

  • An EBGP session won’t form over a default route
  • Don’t forget to remove private AS numbers before advertising to other EBGP peers.
  • Don’t forget to put the route reflectors in a cluster – to avoid loops
  • To create ASes inside an AS, confederations can be used.

And there is much more. BGP nowadays is used everywhere, often for goals other than those it was designed for. Anyway, if you want to learn more I would suggest visiting the following links :
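The AS-path rule from the list above (read right to left, drop the route if your own AS is already in the path) is easy to sketch. The function names and the private AS numbers in the example are my own; the path is modelled as a plain list with the newest AS on the left.

```python
def origin_as(as_path):
    """The AS that originated the route is the rightmost entry."""
    return as_path[-1]


def accept_route(as_path, local_as):
    """Loop prevention: reject a route that already passed through our AS."""
    return local_as not in as_path


def advertise(as_path, local_as):
    """Prepend our own AS when passing the route on to an EBGP peer."""
    return [local_as] + list(as_path)
```

A route with path 65002 65001 was originated by AS 65001 and crossed two autonomous systems; AS 65003 accepts it and advertises it further as 65003 65002 65001.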


Finite State Machines (FSM)

A great approach to understanding how a protocol or program should work is to check its FSM – finite state machine.

It describes the processes of the protocol: where it starts, and how and where it ends.


For example, we can check the TCP FSM :


Picture taken from 

In the picture above we can see the SYN, SYN-ACK, ACK procedure. We can also see how a connection is closed using FIN or timeouts.

By following the link below you can find an overview of the RSTP state machine :
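An FSM like this translates nicely into a transition table. Below is a heavily simplified sketch of the client side of a TCP connection: the state names follow the usual TCP diagram, but the event names are my own labels and many real transitions are left out.

```python
# (current state, event) -> next state
TRANSITIONS = {
    ("CLOSED", "active_open"): "SYN_SENT",         # we send SYN
    ("SYN_SENT", "recv_syn_ack"): "ESTABLISHED",   # we answer with ACK
    ("ESTABLISHED", "close"): "FIN_WAIT_1",        # we send FIN
    ("FIN_WAIT_1", "recv_ack"): "FIN_WAIT_2",
    ("FIN_WAIT_2", "recv_fin"): "TIME_WAIT",       # we answer with ACK
    ("TIME_WAIT", "timeout"): "CLOSED",
}


def run_fsm(events, state="CLOSED"):
    """Feed events to the machine and return the final state."""
    for event in events:
        key = (state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"event {event!r} is not valid in state {state}")
        state = TRANSITIONS[key]
    return state
```

Feeding it `["active_open", "recv_syn_ack"]` ends in ESTABLISHED, and any event that the table does not list for the current state is rejected, which is exactly what makes an FSM useful for checking protocol behavior.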

ARP – Address Resolution Protocol

ARP resolves the mapping problem between two protocols with different address sizes.

It allows us to map an IP address (32 bit) to a MAC address (48 bit).

How does it work in a nutshell :

Let’s assume that we have two hosts on different networks, A and B, and we want them to communicate with each other. What happens when host A tries to reach host B for the first time (ARP cache empty) :

  1. As the IP and subnet mask combination tells us that host B is on another network, host A will send a broadcast packet asking for the gateway, which will contain the following fields :
  • HW : 1 (Ethernet)
  • Protocol : 0x0800 (IP)
  • HW length : (6) 48 bit
  • Protocol length : (4) 32 bit
  • OPCode : 1 (Request)
  • HW source : aa:aa:aa:aa:aa:aa – MAC of host A
  • Protocol source : IP of host A
  • HW dest : ff:ff:ff:ff:ff:ff – MAC of broadcast address
  • Protocol dest : IP of gateway

The gateway will respond with an ARP reply :

  • HW : 1 (Ethernet)
  • Protocol : 0x0800 (IP)
  • HW length : (6) 48 bit
  • Protocol length : (4) 32 bit
  • OPCode : 2 (Reply)
  • HW source : gg:gg:gg:gg:gg:gg – gateway’s MAC
  • Protocol source : IP of gateway
  • HW dest : aa:aa:aa:aa:aa:aa
  • Protocol dest : IP of host A
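The fields of the request and reply above map directly onto the 28-byte ARP payload, which can be packed with Python's `struct` module. This is a sketch only: the helper names and the example IPs in the comment are made up, and it follows the field values as listed here, while real stacks usually leave the target hardware field zeroed in a request and put the broadcast address in the Ethernet header.

```python
import socket
import struct

# hw type, protocol type, hw length, protocol length, opcode,
# sender MAC, sender IP, target MAC, target IP (network byte order)
ARP_FORMAT = "!HHBBH6s4s6s4s"


def mac_to_bytes(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    return bytes.fromhex(mac.replace(":", ""))


def build_arp(opcode, src_mac, src_ip, dst_mac, dst_ip):
    """Pack an ARP payload; opcode 1 is a request, 2 is a reply."""
    return struct.pack(
        ARP_FORMAT,
        1,        # HW: Ethernet
        0x0800,   # Protocol: IP
        6,        # HW length: 48 bit
        4,        # Protocol length: 32 bit
        opcode,
        mac_to_bytes(src_mac), socket.inet_aton(src_ip),
        mac_to_bytes(dst_mac), socket.inet_aton(dst_ip),
    )


# Example with made-up addresses: host A asks for its gateway.
# build_arp(1, "aa:aa:aa:aa:aa:aa", "192.0.2.10",
#           "ff:ff:ff:ff:ff:ff", "192.0.2.1")
```

The reply is the same structure with opcode 2 and the source and destination fields swapped around.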

The same will happen on the other side – if the gateway has no ARP entry for host B it will broadcast an ARP request in B’s broadcast domain (let’s assume we are using only one router), get a response from the host and save its address in the ARP cache.

After the request/reply exchange we have the gateway in our ARP cache, so now we can send packets to B through the gateway.

  1. Host A sends a packet to the default gateway with the destination MAC of the default gateway and the destination IP of host B.
  2. Before sending the packet further, the gateway changes the source MAC to its own and the destination MAC to host B’s; the destination IP is left the same.
  3. Host B replies with the MAC of the gateway as the destination MAC and the IP of host A as the destination IP.

We should not forget about :

Gratuitous ARP – if a machine changes its MAC or services are moved, devices can have wrong data in their ARP caches; to update it the machine can send gratuitous ARP requests.
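The decision host A makes before any of this (ARP for host B directly, or for the gateway) comes down to one subnet check. Here is a small sketch with Python's `ipaddress` module; the function name is hypothetical and the addresses in the example are from documentation ranges.

```python
import ipaddress


def arp_target(src_ip, prefix_len, dst_ip, gateway_ip):
    """Return the IP address the sender has to resolve with ARP."""
    network = ipaddress.ip_interface(f"{src_ip}/{prefix_len}").network
    if ipaddress.ip_address(dst_ip) in network:
        # Same network: resolve the destination host itself.
        return dst_ip
    # Different network: resolve the default gateway instead.
    return gateway_ip
```

For a host at 192.0.2.10/24 with gateway 192.0.2.1, a packet to 192.0.2.20 leads to an ARP request for 192.0.2.20, while a packet to 198.51.100.7 leads to an ARP request for the gateway.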