More about TCP

Just finished the TCP course from INE  :

CCIE R&S: Understanding Transmission Control Protocol (TCP) 

Really recommend this to anyone who is interested in a deeper dive into how TCP works. Great quality content, and Keith Bogart is doing an amazing job as instructor!

 

Some TCP facts which I didn’t mention in my previous post:

  • Before a TCP connection is established, TCP goes through the following steps:
  • Closed – the initial state
  • Active Open (the app notifies the CPU and requests service time) – client side
  • Transmission Control Block setup – the CPU creates a TCB to keep track of each new TCP connection; every TCP session gets a new TCB with a unique identifier (socket – SRC IP; SRC TCP Port; DST IP; DST TCP Port)
  • The initial sequence number is random to prevent attacks on TCP
  • When the TCB is ready we send SYN (seq# x, random) – this is called the SYN-SENT state
  • We receive SYN (seq# y, random, chosen by the server) + ACK (ack# x+1)
  • We send ACK (ack# y+1) in response
  • The client-side TCP state changes to Established

This is Active Open – the TCB creation process, initiated only when the client needs it.
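To make the numbering in the handshake concrete, here is a minimal Python sketch. The function name and the tuple layout are my own assumptions for illustration; only the x / y / +1 logic comes from the steps above.

```python
import random

SEQ_SPACE = 2 ** 32  # TCP sequence numbers are 32-bit and wrap around


def three_way_handshake(client_isn=None, server_isn=None):
    """Return the three handshake segments as (sender, flags, seq, ack) tuples.

    Hypothetical helper: random ISNs are picked when none are given.
    """
    x = random.getrandbits(32) if client_isn is None else client_isn
    y = random.getrandbits(32) if server_isn is None else server_isn
    return [
        ("client", "SYN", x, None),                       # client goes to SYN-SENT
        ("server", "SYN+ACK", y, (x + 1) % SEQ_SPACE),    # server acknowledges x
        ("client", "ACK", (x + 1) % SEQ_SPACE, (y + 1) % SEQ_SPACE),
    ]


for segment in three_way_handshake(client_isn=1000, server_isn=5000):
    print(segment)
```

With fixed ISNs like 1000 and 5000 it is easy to see that each side acknowledges the other side's ISN plus one.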

We also have Passive Open (server – listening process):

 

  • The initial state – Closed
  • The TCB is created with a partially filled socket – a so-called unspecified passive open
  • The server listens on a specific port until it receives a SYN from a client
  • The state changes to SYN-RECEIVED
  • The server responds with SYN+ACK, and once the client's ACK arrives the state changes to Established

A bit about Nagle  :

  • On by default on most operating systems
  • Might be harmful when congestion occurs – storage vendors advise disabling it
  • Basically, after getting CPU time Nagle gathers small writes from the buffer and holds them – it doesn’t send them while we have bytes in flight (= unacknowledged bytes)
  • As long as there are unacknowledged bytes it keeps collecting; after receiving the ACKs for them (or once a full MSS has accumulated) it sends out what it collected from the buffer.
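The Nagle decision described above can be sketched in a few lines of Python. This is a simplified model, not an OS implementation; the function name and its parameters are assumptions made up for the example.

```python
def nagle_can_send(buffered_bytes, mss, unacked_bytes):
    """Simplified Nagle decision: may the sender transmit right now?

    buffered_bytes: data waiting in the send buffer (hypothetical parameter)
    mss:            maximum segment size
    unacked_bytes:  bytes in flight, i.e. sent but not yet acknowledged
    """
    if buffered_bytes >= mss:
        return True   # a full-sized segment is always allowed out
    if unacked_bytes == 0:
        return True   # nothing in flight, so even a small segment goes out
    return False      # small data plus bytes in flight: keep buffering


print(nagle_can_send(buffered_bytes=10, mss=1460, unacked_bytes=0))
print(nagle_can_send(buffered_bytes=10, mss=1460, unacked_bytes=500))
print(nagle_can_send(buffered_bytes=1460, mss=1460, unacked_bytes=500))
```

On a real socket the usual way to switch Nagle off is the TCP_NODELAY option, for example `sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)` in Python.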

MSS VS MTU 

TCP checks the MTU and configures the MSS accordingly. With a jumbo MTU of 9216, the MSS will be 9216 – 20 (IP header) – 20 (TCP header) = 9176 bytes.
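The same arithmetic as a tiny Python sketch, assuming plain 20-byte IPv4 and TCP headers without options (the constant and function names are mine):

```python
IP_HEADER_BYTES = 20   # IPv4 header without options (assumption)
TCP_HEADER_BYTES = 20  # TCP header without options (assumption)


def mss_for_mtu(mtu):
    """MSS is what is left of the MTU after the IP and TCP headers."""
    return mtu - IP_HEADER_BYTES - TCP_HEADER_BYTES


print(mss_for_mtu(9216))  # jumbo frames -> 9176
print(mss_for_mtu(1500))  # standard Ethernet -> 1460
```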

Relative Sequence Numbers in Wireshark and the Subdissector

  • By default Wireshark shows relative sequence numbers, not the real ones – this can be changed in the menu
  • When troubleshooting slow HTTP responses, be aware that "Allow subdissector to reassemble TCP streams" is enabled in Wireshark by default. This can cause a lot of confusion: Wireshark reassembles the packets and waits for the full web page to load, so a response that actually started arriving in 1 ms can show up as taking 30 seconds, even though there was no problem at all. I recommend watching a video on this: Wireshark Tutorial Series #2. Tips and tricks used by insiders and veterans

Sequence Numbers 

  • Built to count the bytes in a segment
  • When a segment is sent with only an ACK and no data, you might see that the seq# is not incrementing – there is nothing to count
  • It can also increment while the segment is empty – this is called a phantom byte: SYN and FIN segments add +1 to the seq#
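A small Python sketch of how the next sequence number follows from these rules. The helper name and signature are assumptions for illustration only.

```python
SEQ_SPACE = 2 ** 32  # sequence numbers wrap at 32 bits


def next_seq(seq, payload_len, syn=False, fin=False):
    """Sequence number the sender will use after this segment.

    Data advances the counter by its length; SYN and FIN each
    consume one extra "phantom" sequence number.
    """
    return (seq + payload_len + int(syn) + int(fin)) % SEQ_SPACE


print(next_seq(100, 0))             # pure ACK: nothing to count
print(next_seq(100, 0, syn=True))   # empty SYN still consumes one number
print(next_seq(100, 50))            # 50 bytes of data
```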

About the Missed ACKs 

  • Not every sent segment gets its own ACK – we might be acking, say, every 5 segments
  • RTO – retransmission timeout – depends on RTT, more precisely on the so-called SRTT (smoothed RTT), which averages multiple RTT samples.
  • Why is TCP reliable? Because it keeps each segment in the retransmission queue when sending it, and removes it from there only after receiving an ACK.
  • Duplicate ACKs appear when one segment was lost but a later one was delivered – the receiver then repeats the ACK for the last in-order data. Selective acknowledgment (the SACK option field) comes into play here and helps to retransmit only what is missing.
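For the RTO/SRTT bullet, here is a minimal sketch of the smoothing in the style of RFC 6298 (weights 1/8 and 1/4, RTO = SRTT + 4 * RTTVAR). It is an assumption that this is the exact variant the course had in mind, and the sketch leaves out the clock granularity term and the minimum RTO that real stacks apply.

```python
ALPHA = 1 / 8  # weight of a new sample in SRTT
BETA = 1 / 4   # weight of a new sample in RTTVAR


def update_rto(srtt, rttvar, sample):
    """Fold one RTT sample (seconds) into SRTT/RTTVAR and return the new RTO.

    Pass srtt=None for the very first measurement.
    """
    if srtt is None:
        srtt = sample
        rttvar = sample / 2
    else:
        # RTTVAR is updated first, using the old SRTT
        rttvar = (1 - BETA) * rttvar + BETA * abs(srtt - sample)
        srtt = (1 - ALPHA) * srtt + ALPHA * sample
    rto = srtt + 4 * rttvar
    return srtt, rttvar, rto


srtt, rttvar, rto = update_rto(None, None, 0.100)
print(srtt, rttvar, rto)
srtt, rttvar, rto = update_rto(srtt, rttvar, 0.200)
print(srtt, rttvar, rto)
```

A single slow sample moves SRTT only a little, but it raises RTTVAR, so the RTO grows noticeably.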

Sliding Window Rules :

  • Bytes sent and acknowledged – removed from the retransmission queue.
  • Bytes sent but not yet acknowledged
  • Bytes not yet sent for which the receiver is ready – the usable window
  • Bytes not yet sent for which the receiver is not ready
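The four categories can be pictured as ranges on the byte stream. A minimal sketch, with made-up names and half-open ranges [start, end) as my own convention:

```python
def window_regions(first_unacked, next_to_send, receive_window):
    """Split the sender's byte stream into the four sliding-window regions.

    first_unacked:  offset of the oldest byte not yet acknowledged
    next_to_send:   offset of the next byte to transmit
    receive_window: bytes the receiver is willing to accept past first_unacked
    """
    window_end = first_unacked + receive_window
    return {
        "sent_and_acked": (0, first_unacked),
        "sent_not_acked": (first_unacked, next_to_send),
        "usable_window": (next_to_send, window_end),
        "receiver_not_ready": (window_end, None),  # open-ended
    }


for name, byte_range in window_regions(1000, 1500, 2000).items():
    print(name, byte_range)
```

When an ACK arrives, first_unacked moves right and the whole window slides with it.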

Urgent Bit 

  • No priorities here, just flagging the segment's urgency – TCP does nothing special, it only allows the upper-layer app to identify the urgent data.
  • So no VIP delivery from the TCP side.
  • The urgent pointer field points to the first byte of non-urgent data.

 

TCP congestion control

As we know, TCP is called a reliable, connection-oriented protocol. But why?

Basically because it keeps the data in its buffer before and after sending it, makes sure the data was delivered in sequence, and expects the receiver to send a confirmation (ACK) for received data. Otherwise the data is retransmitted. There are different ways to resend data; let’s explore some of them.

So what is happening under the hood of this massive protocol (big overhead compared to UDP)?

Some ideas about TCP congestion mechanisms  :

  1. They were created with the expectation that small device buffers would overflow, packet loss would occur, and TCP would react. All good and cool, but in 2017 we have huge buffers, and that can cause problems: it takes time before they overflow, which creates delays.
  2. Very small buffers are also a problem: if a burst of packets arrives and some of them get lost because the buffer is small, TCP can treat the loss as congestion. It then reduces its congestion window, and as a result the links can’t be filled completely.
  3. Link flapping – common in campus networks – and fading can trick TCP into thinking there is extreme congestion in the network, so it backs off its retransmissions exponentially

To avoid a lot of bad stuff we have mechanisms like:

  • sliding window – changing the window size depending on successfully received traffic
  • stop and wait – one frame per ACK: the next frame is sent only after the previous one has been acknowledged
  • cumulative ACKs – if I acknowledge packet 3, that means I also acknowledge the 2 packets before it.
  • Go-Back-N – if a single packet is lost we retransmit the whole window starting from that packet (good when losses come in bursts); when the sender's window is larger than the receiver's, the protocol behaves like Go-Back-N.
  • Selective repeat – we retransmit only the packet that was lost
  • It is very important to make sure we are not retransmitting too early.
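To see the difference between cumulative ACKs, Go-Back-N and selective repeat side by side, here is a small sketch. Packets are plain integers and all function names are assumptions for the example.

```python
def cumulative_ack(received):
    """Highest n such that packets 1..n have all arrived."""
    n = 0
    while n + 1 in received:
        n += 1
    return n


def go_back_n_retransmit(sent_window, lost):
    """Go-Back-N: resend the lost packet and everything sent after it."""
    return [seq for seq in sent_window if seq >= lost]


def selective_repeat_retransmit(sent_window, lost):
    """Selective repeat: resend only the packet that was lost."""
    return [seq for seq in sent_window if seq == lost]


window = [1, 2, 3, 4, 5]
print(cumulative_ack({1, 2, 4, 5}))            # 3 is missing, so we can only ACK 2
print(go_back_n_retransmit(window, 3))         # [3, 4, 5]
print(selective_repeat_retransmit(window, 3))  # [3]
```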

Mainly, TCP congestion control is an end-to-end, host-based mechanism.

  • It reacts to events observable at the end host
  • Uses TCP’s sliding window and flow control
  • Tries to figure out how many packets can safely be outstanding in the network at a time.

You can memorize a very simple form of the TCP congestion mechanism and build everything on top of it – at least I’m doing so, I might be wrong 🙂

AIMD – Additive increase, multiplicative decrease

  • Basically, if a packet was received without errors and we got the ACK, we increase the window additively: w = w + 1/w per ACK (about one packet per RTT)
  • If a packet was dropped we use w = w/2 – so after the first dropped packet we cut the window in half.

AIMD also helps us to fully use the links – window size expands according to AIMD to probe how many bytes the pipe can hold.
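The two AIMD rules fit in one small function. A minimal sketch, with the window measured in packets; the floor of one packet is my own assumption so the window never drops to zero.

```python
def aimd_step(window, lost):
    """Apply one AIMD update to the window (measured in packets)."""
    if lost:
        return max(1.0, window / 2)   # multiplicative decrease
    return window + 1.0 / window      # additive increase, per ACK


w = 10.0
for lost in [False, False, False, True, False]:
    w = aimd_step(w, lost)
    print(round(w, 3))
```

Plotting the window over time gives the well-known sawtooth: slow linear climb, then a sharp cut in half on every drop.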

Summary for AIMD :

  • Throughput of AIMD flow is sensitive to the drop probability and very sensitive to RTT – round trip time.
  • With many flows, each flow follows its own AIMD rule.
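A common back-of-the-envelope approximation for a single AIMD flow is rate ≈ sqrt(3/2) / (RTT * sqrt(p)) packets per second, where p is the drop probability. The sketch below assumes that simplified model (it is not taken from the notes above) just to show the sensitivity to RTT and to p.

```python
import math


def aimd_throughput(rtt_seconds, drop_probability):
    """Approximate AIMD rate in packets per second (simplified model)."""
    return math.sqrt(1.5) / (rtt_seconds * math.sqrt(drop_probability))


base = aimd_throughput(0.1, 0.01)
# Doubling the RTT halves the rate
print(round(aimd_throughput(0.2, 0.01) / base, 2))
# The drop probability has to grow 4x to have the same effect
print(round(aimd_throughput(0.1, 0.04) / base, 2))
```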

We have several TCP congestion control variants:

TCP TAHOE

  • slow start (on connection startup or after a packet timeout, to quickly find the network capacity)
  1. the window starts at one MSS (maximum segment size)
  2. the window increases for each ACKed packet
  3. the congestion window grows exponentially to sense the network capacity
  • congestion avoidance state – to carefully probe when close to the maximum network capacity
  • triple duplicate ACKs
  • fast retransmit means: don’t wait for a timeout to retransmit a missing segment if you receive a triple duplicate ACK.
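A rough sketch of the Tahoe window logic, with the window counted in MSS units. The function names and the floor of 2 MSS for the threshold are assumptions for the example, not something taken from the course.

```python
def tahoe_on_ack(cwnd, ssthresh):
    """New cwnd after one ACK."""
    if cwnd < ssthresh:
        return cwnd + 1         # slow start: +1 MSS per ACK, doubles every RTT
    return cwnd + 1.0 / cwnd    # congestion avoidance: about +1 MSS per RTT


def tahoe_on_loss(cwnd):
    """Timeout or triple duplicate ACK: Tahoe reacts the same way to both."""
    ssthresh = max(cwnd / 2, 2)
    return 1, ssthresh          # back to slow start from one MSS


cwnd, ssthresh = 1, 8
for _ in range(10):
    cwnd = tahoe_on_ack(cwnd, ssthresh)
print(cwnd, ssthresh)
cwnd, ssthresh = tahoe_on_loss(cwnd)
print(cwnd, ssthresh)
```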

FSM for Tahoe Mechanism :

[Figure: Tahoe FSM]

TCP RENO.

Behaves identically to Tahoe on timeout

  • On triple duplicate ACK it:
  1. sets the threshold to congestion window / 2
  2. sets the congestion window to congestion window / 2 (fast recovery)
  3. inflates the congestion window size (fast recovery)
  4. retransmits the missing segment (fast retransmit)
  5. stays in the congestion avoidance state
  • TCP Reno adds an additional optimization: three duplicate ACKs don’t cause TCP to lose an RTT worth of transmission while it waits for the missing segment to be ACKed.
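A simplified sketch of how Reno treats the two loss signals. It deliberately ignores the temporary window inflation during fast recovery and only shows where the window ends up; names and the 2 MSS floor are my own assumptions.

```python
def reno_on_triple_dup_ack(cwnd):
    """Fast retransmit + fast recovery: halve the window, skip slow start.

    Returns (new_cwnd, new_ssthresh) in MSS units.
    """
    ssthresh = max(cwnd / 2, 2)
    return ssthresh, ssthresh   # continue in congestion avoidance


def reno_on_timeout(cwnd):
    """A timeout is handled like in Tahoe: restart slow start from one MSS."""
    ssthresh = max(cwnd / 2, 2)
    return 1, ssthresh


print(reno_on_triple_dup_ack(16))  # window halved
print(reno_on_timeout(16))         # window collapses to 1 MSS
```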

FSM for Reno Mechanism :

[Figure: Reno FSM]

Basically the difference between Tahoe and Reno is fast recovery.

Observation signals :

  • increasing ACKs : transfer is going well
  • duplicate ACKs : something was lost, delayed
  • timeout – bad stuff 🙂

TCP also uses self-clocking – each arriving ACK tells the sender that a packet has left the network.

Credits to Stanford University for providing such a great course – almost all the info here is taken from its self-paced Networking course.

BGP overview

I would like to put down some facts about the most famous routing protocol.

BGP – Border Gateway Protocol

According to Wikipedia, we are currently using version 4 (BGP4), which was published as RFC 4271 in 2006.

BGP4 has been in use on the Internet since 1994.

Basically the world was changed by the three-napkin protocol (picture found on the Network Collective podcast):

[Figure: BGP napkin]

Its simplified finite state machine (taken from Wikipedia):

[Figure: BGP FSM]

Some basics from BGP :

  • Inter-AS (autonomous system) routing protocol based on a path-vector mechanism
  • Slowest routing protocol to converge

By default, if a route goes down BGP won’t flap it; it will wait 30 seconds before notifying

  • Helps you to be reachable from multiple service providers
  • To reconverge quickly BGP uses internal routing protocols – it relies on the lower-layer IGP
  • Unlike other routing protocols, BGP has no trust in its neighbors (there are multiple filters, and you need to agree with your BGP peers on how and what will traverse the link before you can establish it).
  • BGP runs on top of TCP port 179
  • Has triggered updates, batched by the advertisement interval (5 seconds internal, 30 seconds external)
  • 13 different metrics for finding the best route (largest weight, highest local preference, locally originated, shortest AS path, lowest origin type, lowest MED, etc.)
  • All neighbors need to be manually set –
     neighbor ip remote-as as_number
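The first few best-path tie-breakers listed above (weight, local preference, locally originated, AS path length, origin, MED) can be sketched as a sort key in Python. This is a simplified illustration: the Route class and its defaults are made up, the remaining tie-breakers are left out, and MED is compared across all routes, which real routers do not do by default.

```python
from dataclasses import dataclass

ORIGIN_RANK = {"igp": 0, "egp": 1, "incomplete": 2}  # lower is better


@dataclass
class Route:
    """Hypothetical container for the attributes used in the comparison."""
    name: str
    weight: int = 0
    local_pref: int = 100
    locally_originated: bool = False
    as_path: tuple = ()
    origin: str = "igp"
    med: int = 0


def preference_key(route):
    """Smaller tuple wins; 'higher is better' fields are negated."""
    return (
        -route.weight,
        -route.local_pref,
        not route.locally_originated,
        len(route.as_path),
        ORIGIN_RANK[route.origin],
        route.med,
    )


def best_path(routes):
    return min(routes, key=preference_key)


routes = [
    Route("short-path", as_path=(65001,)),
    Route("preferred", local_pref=200, as_path=(65001, 65002, 65003)),
]
print(best_path(routes).name)  # local preference beats the shorter AS path
```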

     

  • RFC rule about traffic and BGP – when a packet leaves your AS it’s not your traffic anymore, so basically you can’t tell anyone else what to do with their traffic.
  • A neighbor must be manually set and directly reachable
  • Multiple sessions to the same neighbor are not permitted and will be dropped.
  • The network command works differently – you really need to have the network you want to advertise in your routing table, otherwise it won’t be advertised. It needs to be an exact match to the routing table
  • BGP packets:
  • Open – sent to the neighbor router after the neighbor is configured, like a hello
  • Update – used to update the routing table or send any updates for changes
  • Keepalive – BGP has its own keepalive mechanism
  • Notification – any BGP error condition generates a notification message, after which the session is closed

 

  • BGP states
  • Idle – a neighbor is configured but we haven’t talked yet – usually we see a router stuck in this state when something is misconfigured.
  • Connect – waiting for the TCP connection to the neighbor to complete
  • Active – tries to establish communication – a lot of issues also happen here
  • OpenSent
  • OpenConfirm
  • Established – all good

 

  • Enabling BGP and adding a neighbor is really simple:
  • conf t
  • router bgp xx
  • neighbor x.x.x.x remote-as xx
  • Default hold time is 180 seconds – this is the interval after which a silent neighbor is considered dead.
  • After applying any BGP rule you need to clear the session: clear ip bgp * (dangerous 🙂 don’t use this in production)
  • BGP won’t advertise anything until you specify what to advertise
  • Usually the ISP puts a filter in place so you can send only the routes you have agreed on with them
  • Filtering is done using route maps
  • When reading the AS-Path value, read it from right to left; you will see how many autonomous systems the route has passed through. This is also the anti-loop mechanism: if a router sees its own AS in the path it drops the route.
  • BGP attributes are attached to every route advertisement
  • A route-map is similar to access lists: it performs if-then statements, called match/set, used for modifying BGP attributes, policy routing, and route filtering.
  • EBGP is used to receive routes and exchange them with uplinks (peers in other ASes)
  • IBGP – used for connections within the same AS
  • IBGP does not modify BGP attributes such as the AS path and next hop by default
  • IBGP has no AS-path-based loop prevention (the AS path doesn't change inside the AS)
  • The BGP split horizon rule is to never advertise a route you received via IBGP to another IBGP peer. Because of this you need a full mesh between your IBGP neighbors, or route reflectors
  • IBGP peers should be formed using loopback interfaces – so the session can use multiple paths in case of a link failure.
  • IBGP and EBGP have different administrative distances: EBGP-learned path AD = 20; IBGP = 200
  • EBGP neighbors must be directly connected, but to bend the rules a bit we can use the ebgp-multihop option.

neighbor xx.xx.xx.xx ebgp-multihop 2 (in case we are using a loopback for EBGP)

  • An EBGP session won’t form if the neighbor is reachable only via the default route 0.0.0.0
  • Don’t forget to remove private AS numbers before advertising to other EBGP peers.
  • Don’t forget to put the route reflectors in a cluster – to avoid loops
  • To create an AS inside the AS, confederations can be used.
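The AS-path rules from the list above (read right to left, drop a route that already contains your own AS, prepend your AS when advertising over EBGP) can be sketched like this. AS paths are plain tuples and all function names are assumptions for the example.

```python
def accept_route(as_path, local_as):
    """Loop prevention: reject a route whose path already contains our AS."""
    return local_as not in as_path


def origin_as(as_path):
    """The rightmost AS is the one that originated the route."""
    return as_path[-1]


def advertise_to_ebgp_peer(as_path, local_as):
    """Our own AS is prepended on the left when the route leaves the AS."""
    return (local_as,) + tuple(as_path)


path = (65003, 65002, 65001)
print(origin_as(path))                      # 65001 originated the route
print(len(path))                            # it has crossed 3 autonomous systems
print(accept_route(path, 65002))            # False: AS 65002 would create a loop
print(advertise_to_ebgp_peer(path, 65010))  # (65010, 65003, 65002, 65001)
```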

And there is much more. BGP nowadays is used everywhere, often for goals other than those it was designed for. Anyway, if you want to learn more I would suggest visiting the following links:

http://packetpushers.net/podcast/podcasts/show-355-whats-wrong-bgp-ietf-99/

http://thenetworkcollective.com/2017/09/hon-li-bgp/

https://www.cbtnuggets.com/it-training/cisco-ccip-bgp-642-661