Just finished the TCP course from INE :
CCIE R&S: Understanding Transmission Control Protocol (TCP)
I really recommend it to anyone interested in a deeper dive into how TCP works. Great quality content, and Keith Bogart does an amazing job as instructor!
Some TCP facts which I didn’t mention in my previous post :
- Before a TCP connection is established, TCP goes through the following steps:
- Closed – the initial state
- Active Open (the app notifies the CPU and requests service time) – client side
- Transmission Control Block setup – the CPU creates a TCB to keep track of each new TCP connection; every TCP session gets its own TCB with a unique identifier (socket – SRC IP; SRC TCP port; DST IP; DST TCP port)
- The initial sequence number is random, to prevent attacks on TCP
- When the TCB is ready we send SYN (seq# = x, random) – this is the SYN-Sent state
- We receive SYN (seq# = y, random, chosen by the server) + ACK (ack# = x + 1)
- We send ACK (ack# = y + 1) for that
- The client-side TCP state changes to Established
This is Active Open – the TCB creation process, initiated only when the client needs it.
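To make the "one TCB per socket 4-tuple" idea a bit more concrete, here is a minimal Python sketch. The `TCB` class, the `tcb_table` dictionary and the `active_open` function are hypothetical names of mine, not how a real TCP stack is written, and a plain random number stands in for the hardened ISN generation that real stacks use.

```python
import random
from dataclasses import dataclass


@dataclass
class TCB:
    """Hypothetical, heavily simplified Transmission Control Block."""
    state: str
    isn: int  # initial sequence number chosen for this connection


# One TCB per connection, keyed by (src_ip, src_port, dst_ip, dst_port).
tcb_table = {}


def active_open(src_ip, src_port, dst_ip, dst_port):
    """Create a TCB for a new client-side connection and move it to SYN-SENT."""
    key = (src_ip, src_port, dst_ip, dst_port)
    if key in tcb_table:
        raise ValueError("a connection with this 4-tuple already exists")
    # Assumption: a plain 32-bit random number is good enough for illustration.
    tcb = TCB(state="SYN-SENT", isn=random.getrandbits(32))
    tcb_table[key] = tcb
    return tcb
```

Two connections from the same client to the same server port still get separate TCBs, because the source port differs.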
We also have Passive Open (server – listening process)
- The initial state – Closed
- The TCB is created with a partially filled socket (only the local IP and port are known) – the so-called unspecified passive open
- The server listens on a specific port until it receives a SYN from a client
- It changes the state to SYN-Received and responds with its own SYN + ACK
- When the client's final ACK arrives, the state changes to Established
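The whole exchange, seen from both sides, can be sketched in a few lines of Python. This is only an illustration under my own assumptions: `three_way_handshake` is a made-up function, the segments are plain dictionaries, and nothing is sent on the wire.

```python
MOD = 2**32  # sequence numbers are 32-bit and wrap around


def three_way_handshake(x, y):
    """Return the three segments and the final states for client ISN x and server ISN y."""
    client_state, server_state = "CLOSED", "LISTEN"
    segments = []

    # 1. Client (active open) sends SYN with its initial sequence number x.
    segments.append({"flags": "SYN", "seq": x, "ack": None})
    client_state = "SYN-SENT"

    # 2. Server (passive open) receives the SYN and answers with SYN + ACK.
    server_state = "SYN-RECEIVED"
    segments.append({"flags": "SYN+ACK", "seq": y, "ack": (x + 1) % MOD})

    # 3. Client acknowledges the server's SYN and is done.
    segments.append({"flags": "ACK", "seq": (x + 1) % MOD, "ack": (y + 1) % MOD})
    client_state = "ESTABLISHED"

    # The server becomes established once that final ACK arrives.
    server_state = "ESTABLISHED"
    return segments, client_state, server_state
```

Calling `three_way_handshake(1000, 5000)` gives ack numbers 1001 and 5001, which is the "x + 1" and "y + 1" from the steps above.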
A bit about Nagle :
- On by default on most operating systems
- Might be harmful when congestion occurs – storage vendors advise disabling it
- Basically, after getting CPU time Nagle gathers the data from the buffer and checks whether we have bytes in flight (= unacknowledged bytes); if not, the data is sent right away
- If yes, it keeps collecting small writes until those unacknowledged bytes are ACKed (or a full MSS worth of data has piled up), and then sends what it collected from the buffer in as few segments as possible.
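Here is how I picture the Nagle decision in code. It is a simplified sketch: `nagle_can_send` is a hypothetical helper that only models the rule described above, not the kernel implementation. The second function shows the usual way to switch Nagle off for one socket from an application, using the standard `TCP_NODELAY` socket option.

```python
import socket


def nagle_can_send(pending_bytes, mss, bytes_in_flight):
    """Decide whether buffered data may be sent now under Nagle's rule."""
    if pending_bytes >= mss:
        return True   # a full-sized segment can always go out
    if bytes_in_flight == 0:
        return True   # nothing unacknowledged, so even a small segment is fine
    return False      # small data + unacked bytes: keep buffering until an ACK arrives


def disable_nagle(sock):
    """Turn Nagle off for a TCP socket (what storage vendors usually ask for)."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
```

With an MSS of 1460, `nagle_can_send(10, 1460, 500)` is `False`: 10 bytes are waiting, 500 bytes are still unacknowledged, so Nagle holds them back.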
MSS VS MTU
TCP checks the MTU and sets the MSS accordingly. With a jumbo MTU of 9216, the MSS will be 9216 - 20 (IP header) - 20 (TCP header) = 9176.
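The same arithmetic as a tiny Python helper, assuming plain IPv4 and TCP headers without any options (options would make the headers longer and the usable payload smaller). `mss_for_mtu` is just a name I picked.

```python
IP_HEADER = 20   # IPv4 header without options, in bytes
TCP_HEADER = 20  # TCP header without options, in bytes


def mss_for_mtu(mtu):
    """Largest TCP payload that fits into one packet of the given MTU."""
    return mtu - IP_HEADER - TCP_HEADER


print(mss_for_mtu(9216))  # jumbo MTU from the example above
print(mss_for_mtu(1500))  # standard Ethernet MTU
```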
Relative Sequence Numbers in Wireshark and the Subdissector
- By default Wireshark shows relative sequence numbers, not the real ones – this can be changed in the menu
- When troubleshooting slow HTTP responses, be aware that "Allow subdissector to reassemble TCP streams" is enabled by default in Wireshark. This can cause a lot of confusion: Wireshark reassembles the packets and waits for the full web page to load, so something that actually arrived in 1 ms can show up in Wireshark as taking 30 seconds, although there was no problem at all. I recommend watching a video on this : Wireshark Tutorial Series #2. Tips and tricks used by insiders and veterans
- Sequence numbers are built to count the bytes in a segment
- When a segment carries only an ACK and no data, the seq# does not increment, as there is nothing to count
- It can still increment for an empty segment: SYN and FIN each add +1 to the seq# – this is called the phantom byte
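A small sketch of the counting rules, including the phantom byte and the relative numbers Wireshark displays. Both function names are mine; the only facts used are that payload bytes, SYN and FIN consume sequence numbers and that the sequence space is 32 bits wide.

```python
MOD = 2**32  # the sequence space wraps at 32 bits


def next_seq(seq, payload_len, syn=False, fin=False):
    """Sequence number that follows a segment: payload bytes plus the SYN/FIN phantom bytes."""
    return (seq + payload_len + int(syn) + int(fin)) % MOD


def relative_seq(seq, isn):
    """What Wireshark shows by default: the distance from the initial sequence number."""
    return (seq - isn) % MOD
```

A pure ACK leaves the number untouched (`next_seq(100, 0)` is 100), while an empty SYN moves it by one (`next_seq(100, 0, syn=True)` is 101).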
About the Missed ACKs
- Not every segment gets its own ACK – we might be ACKing, say, every 5 segments
- RTO – retransmission timeout – depends on the RTT, or more precisely on the SRTT (smoothed RTT), a running weighted average over multiple RTT samples.
- Why is TCP reliable? Because it keeps a copy of every segment it sends in the retransmission queue and removes it only after receiving an ACK for it.
- Duplicate ACKs appear when one segment was lost but later ones were delivered – the receiver keeps repeating the ACK for the last in-order data it got. Here selective acknowledgment (the SACK option field) comes into play and helps to retransmit only what is missing.
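For the RTO part, here is a sketch of the standard calculation as I understand it from RFC 6298: SRTT and the RTT variance are updated with fixed weights, and the RTO is SRTT plus four times the variance, with a 1 second floor. The class name `RtoEstimator` is mine, clock granularity is ignored, and real operating systems tune these values (the floor in particular), so treat the numbers as illustrative.

```python
ALPHA = 1 / 8   # weight of a new sample in SRTT
BETA = 1 / 4    # weight of a new sample in RTTVAR
K = 4           # variance multiplier
MIN_RTO = 1.0   # seconds, lower bound recommended by RFC 6298


class RtoEstimator:
    """Keeps SRTT/RTTVAR and returns a new RTO for every RTT sample."""

    def __init__(self):
        self.srtt = None
        self.rttvar = None

    def update(self, rtt):
        if self.srtt is None:
            # First measurement: take it as is, variance is half of it.
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # RTTVAR is updated first, using the old SRTT.
            self.rttvar = (1 - BETA) * self.rttvar + BETA * abs(self.srtt - rtt)
            self.srtt = (1 - ALPHA) * self.srtt + ALPHA * rtt
        return max(MIN_RTO, self.srtt + K * self.rttvar)
```

Feeding it two samples of 0.5 s gives an RTO of 1.5 s and then 1.25 s: the estimate tightens as the RTT proves to be stable.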
Sliding Window Rules :
- Bytes sent and acknowledged – removing from retransmission queue.
- Bytes sent but not yet acknowledged
- Bytes not yet sent for which receiver is ready – usable window
- Bytes not sent and receiver is not ready to receive them
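The four categories fall out of three numbers the sender tracks. In the sketch below I borrow the usual variable names (`snd_una` for the oldest unacknowledged byte, `snd_nxt` for the next byte to send, `snd_wnd` for the window advertised by the receiver); the function itself is hypothetical and ignores sequence number wraparound to stay readable.

```python
def send_window_view(snd_una, snd_nxt, snd_wnd):
    """Split the byte stream into the four sliding-window categories."""
    window_end = snd_una + snd_wnd
    return {
        "acked_below": snd_una,                       # 1. sent and acknowledged
        "in_flight": snd_nxt - snd_una,               # 2. sent, not yet acknowledged
        "usable_window": max(0, window_end - snd_nxt),  # 3. may be sent right now
        "blocked_from": window_end,                   # 4. receiver not ready from here on
    }
```

For example, with `snd_una=1000`, `snd_nxt=1400` and a 1000-byte window, 400 bytes are in flight and 600 more may be sent before we have to wait for ACKs.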
- Urgent flag: no priorities here, it just flags the segment's urgency – TCP does nothing special, it only lets the upper-layer app identify the urgent data.
- So no VIP delivery from the TCP side.
- The urgent pointer field points to the first byte of non-urgent data.
Old known stuff, probably for everybody who is somehow related to networking, but anyway, putting it here too.
IEEE 802.3x – Wikipedia link
If QoS is enabled and you want to prioritize traffic, flow control needs to be disabled, as it doesn’t care about any higher-level prioritization: whenever ingress traffic comes in faster than the receiver can accept it, flow control kicks in and sends pause frames until the ingress rate is equal to or lower than the egress rate of that interface.
A bit more info from Dell FTOS 9 documentation about flow control :
I would use it only for storage, for example iSCSI traffic in a separate network – there it won’t do any harm.. probably 🙂
But of course there is no way I would use it on trunk links, other switch-facing links, etc.
Installing – let’s use the Debian-based package manager :
apt-get install open-iscsi
apt-get install iperf
apt-get install nload
Scanning for targets :
iscsiadm -m discovery -t st -p target_ip
iscsiadm -m node -u
iscsiadm -m node --login
Check if you see the object :
A new disk should appear in /dev – like sdx, with sdx1 under it if it is already partitioned
How to partition (if wasn’t partitioned yet)
Connecting the drive to system
mount /dev/sdc1 /mnt
Creating a big file with random data :
dd if=/dev/urandom of=rnd.20G count=1024 bs=20M
You can also check the copy speed live via rsync (when copying from one mount location to another, for example) –
rsync --progress source destination
Live interface monitoring can be done via
nload – like top, but for NICs
And at the end, if you’d like to check the performance of the network (without storage), connect a second host to the same switch and run iperf
iperf -s (on the server side)
iperf -c x.x.x.x -d (-d option for simultaneous bi-directional bandwidth measurement)
While transmission speeds keep advancing, the total delay is dominated by propagation delay, which is bound by the speed of light.
End-to-end delay is the time from when we send the first bit onto the first link until the last bit arrives at the destination.
Basically everything can be determined except queuing delay – it’s an unpredictable variable.
Packetization delay is basically how close together we can pack the bits.
So it’s the time from when the first bit of a packet is transmitted until the last one is.
Formula would be : PacketizationDelay = p (packet size) / r (transmit rate)
Propagation delay is the amount of time it takes a single bit to travel over a link of length l at propagation speed c.
Formula would be : PropagationDelay = l (cable length) / c (propagation speed)
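Plugging numbers into both formulas shows why propagation dominates on long links. The example values below are my own assumptions: a 1500-byte packet, a 1 Gbit/s link, 1000 km of cable and a signal speed of about 2 x 10^8 m/s (roughly two thirds of the speed of light, a common rule of thumb for fiber and copper).

```python
def packetization_delay(packet_bits, rate_bps):
    """Time to push all bits of one packet onto the link: p / r."""
    return packet_bits / rate_bps


def propagation_delay(length_m, speed_mps):
    """Time for one bit to travel the length of the link: l / c."""
    return length_m / speed_mps


# Assumed example values, see the text above.
p = packetization_delay(1500 * 8, 1e9)   # 1500-byte packet at 1 Gbit/s
d = propagation_delay(1_000_000, 2e8)    # 1000 km at ~2e8 m/s

print(f"packetization: {p * 1e6:.1f} us")
print(f"propagation:   {d * 1e3:.1f} ms")
```

With these values the packet is on the wire in 12 microseconds, but needs 5 milliseconds to get to the other end, and a faster NIC only shrinks the first number.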
I would like to write down some facts about the most famous routing protocol.
BGP – Border Gateway Protocol
According to Wikipedia currently we are using the version (BGP4), which was published as RFC 4271 in 2006.
BGP4 has been in use on the Internet since 1994.
Basically the world was changed by the 3 napkin protocol(picture was found on Network Collective Podcast):
Its simplified finite state machine is (taken from wiki) :
Some basics from BGP :
- Path-vector routing protocol for routing between autonomous systems
- Slowest routing protocol (in terms of convergence)
- By default, if a route goes down BGP won’t flap it right away; it waits 30 seconds before notifying
- Open – after a neighbor is configured, this is the first message sent to the neighbor router (a kind of hello)
- Update – used to update the routing table or send any updates about changes
- Keepalive – BGP has its own keepalive mechanism
- Notification – any BGP error condition generates a notification message, after which the session is closed
- Idle – a neighbor is configured but we haven’t talked to it yet – usually we see the router stuck in this state when something is misconfigured.
- Active – tries to establish communication – a lot of issues also happen here
- Open sent
- Open confirm
- Established – all good
- Enabling BGP and adding a neighbor is really simple :
- conf t
- router bgp xx
- neighbor x.x.x.x remote-as xx
- Default hold time is 180 seconds – the interval after which a silent neighbor is considered dead.
- After applying any BGP rule you need to clear the session : clear ip bgp * (dangerous 🙂 don’t use this in production)
- BGP won’t advertise anything until you specify what to advertise
- Usually the ISP puts a filter in place, so you can send only the routes you have agreed on with them
- Filtering is done with route maps
- When reading the AS-path value, read it from right to left; you will see how many autonomous systems the route has passed through. It is also the anti-loop mechanism: if a router sees its own AS in the path, it drops the route.
- BGP attributes are attached to every route advertisement
- A route-map is something similar to an access-list: it performs if/then statements, called match/set – used for modifying BGP attributes, policy routing and route filtering.
- EBGP is used to exchange routes with peers in other autonomous systems (uplinks)
- IBGP – used for connections inside the same AS
- By default IBGP does not modify BGP attributes such as the AS-path when passing a route on
- So IBGP has no loop prevention mechanism based on the AS-path
- The BGP split horizon rule is to never advertise a route received via IBGP to another IBGP peer. Because of this you need either a full mesh between your IBGP neighbors or route reflectors
- IBGP peers should be formed using loopback interfaces, so the session can use multiple paths and survives a link failure.
- IBGP and EBGP have different administrative distances : EBGP-learned path AD = 20; IBGP = 200
- EBGP neighbors must be directly connected, but to bend the rules a bit we can use the ebgp-multihop option.
neighbor x.x.x.x ebgp-multihop 2 (in case we are using loopbacks for EBGP)
- An EBGP session won’t form if the neighbor is reachable only via the default route 0.0.0.0
- Don’t forget to remove private AS numbers before advertising to other EBGP peers.
- Don’t forget to put the route reflectors in a cluster, to avoid loops
- To create sub-ASes inside an AS, confederations can be used.
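The AS-path rules from the list above (read right to left, count the hops, drop the route when your own AS shows up) are easy to express in code. This is only a toy model with function names I made up; the AS numbers come from the ranges reserved for documentation and private use, and real BGP implementations deal with many more attributes.

```python
def origin_as(as_path):
    """The rightmost AS is where the route was originated."""
    return as_path[-1]


def as_path_length(as_path):
    """Number of autonomous systems the route has passed through."""
    return len(as_path)


def accept_route(as_path, local_as):
    """EBGP loop prevention: reject a route that already contains our own AS."""
    return local_as not in as_path


def advertise_to_ebgp_peer(as_path, local_as):
    """When a route leaves our AS, our AS number is prepended on the left."""
    return [local_as] + as_path


path = [64500, 64501, 64502]                  # received path, origin is 64502
print(origin_as(path), as_path_length(path))
print(accept_route(path, 64501))              # False: our own AS is in the path, loop
print(advertise_to_ebgp_peer(path, 64510))
```

Since the path only grows when a route crosses an AS boundary, this check cannot catch loops inside one AS, which is why IBGP needs the split horizon rule.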
And there is much more. Nowadays BGP is used everywhere, often for goals other than the ones it was designed for. Anyway, if you want to learn more I would suggest visiting the following links :
ARP resolves the mapping issue between two protocols with addresses of different sizes.
It allows mapping an IP address (32 bit) and a MAC address (48 bit) together.
How does it work in a nutshell :
Let’s assume that we have two hosts on different networks, A and B, and we want them to communicate with each other. What happens when host A tries to reach host B for the first time (ARP cache empty) :
- As host A can determine from its IP and subnet mask combination that host B is on another network, it will send a broadcast ARP request for its default gateway, which will contain the following fields :
- HW : 1 (Ethernet)
- Protocol : 0x0800 IP
- HW length : (6) 48 bit
- Protocol length : (4) 32-bit
- OPCode : 1 (Request)
- H/W Source : aa:aa:aa:aa:aa:aa – MAC of host A
- Protocol source : 192.168.0.2 – IP of host A
- HW Dest – 00:00:00:00:00:00 in the ARP payload (not known yet); the Ethernet frame itself is sent to the broadcast MAC ff:ff:ff:ff:ff:ff
- Protocol dest 192.168.0.1 – ip of gateway
Gateway will respond with ARP reply
- HW : 1 (Ethernet)
- Protocol : 0x0800 IP
- HW length : (6) 48 bit
- Protocol length : (4) 32-bit
- OPCode : 2 (Reply)
- H/W Source : gg:gg:gg:gg:gg:gg – gateway’s MAC
- Protocol source : 192.168.0.1 – IP of gateway
- HW Dest – : aa:aa:aa:aa:aa:aa
- Protocol dest : 192.168.0.2 – IP of host A
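The field lists above map one to one onto the 28-byte ARP payload, which can be packed with Python's `struct` module. This is a sketch for illustration only: `build_arp` and `mac_bytes` are my own helper names, nothing is sent on the network, and `02:00:00:00:00:01` is a stand-in for the gateway MAC, because `gg:gg:...` is not valid hex. In the request, the target hardware address inside the payload is left as zeros since it is unknown; the broadcast address belongs to the Ethernet header, which is not built here.

```python
import socket
import struct


def mac_bytes(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    return bytes.fromhex(mac.replace(":", ""))


def build_arp(opcode, src_mac, src_ip, dst_mac, dst_ip):
    """Pack an ARP payload for Ethernet (HW type 1) and IPv4 (protocol 0x0800)."""
    return struct.pack(
        "!HHBBH6s4s6s4s",
        1,        # HW type: Ethernet
        0x0800,   # protocol type: IP
        6,        # HW address length in bytes (48 bit)
        4,        # protocol address length in bytes (32 bit)
        opcode,   # 1 = request, 2 = reply
        mac_bytes(src_mac), socket.inet_aton(src_ip),
        mac_bytes(dst_mac), socket.inet_aton(dst_ip),
    )


# Host A asks who has 192.168.0.1 (the gateway).
request = build_arp(1, "aa:aa:aa:aa:aa:aa", "192.168.0.2",
                    "00:00:00:00:00:00", "192.168.0.1")
# The gateway answers; its MAC here is an assumed example value.
reply = build_arp(2, "02:00:00:00:00:01", "192.168.0.1",
                  "aa:aa:aa:aa:aa:aa", "192.168.0.2")
print(len(request), request.hex())
```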
The same happens on the other side: if the gateway has no ARP entry for host B, it broadcasts an ARP request into B’s broadcast domain (let’s assume we are using only one router), gets a response from the host and saves its address in the ARP cache.
After the request/reply exchange we have the gateway in our ARP cache, so now we can send packets to B via the gateway.
- Host A sends a packet to the default gateway, with the MAC of the default gateway as destination MAC and the IP of host B as destination IP.
- Before sending the packet further, the gateway changes the source MAC to its own and the destination MAC to the one of host B; the destination IP is left the same.
- Host B replies: the MAC of the gateway is the destination MAC, and the IP of host A is the destination IP.
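The very first decision in this walkthrough, whether to ARP for the destination itself or for the gateway, is a plain subnet check. Here is a sketch with the standard `ipaddress` module; `arp_target` is a hypothetical helper, and 10.0.0.5 is an address I assumed for host B, since it only needs to be outside A's network.

```python
import ipaddress


def arp_target(src_ip_with_mask, dst_ip, gateway_ip):
    """Return the IP address the sender has to resolve with ARP."""
    iface = ipaddress.ip_interface(src_ip_with_mask)
    if ipaddress.ip_address(dst_ip) in iface.network:
        return dst_ip      # same network: resolve the destination directly
    return gateway_ip      # other network: resolve the default gateway instead


print(arp_target("192.168.0.2/24", "192.168.0.50", "192.168.0.1"))
print(arp_target("192.168.0.2/24", "10.0.0.5", "192.168.0.1"))
```

The first call returns the destination itself, the second one returns 192.168.0.1, which is the case described above.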
We should not forget about :
Gratuitous ARP – when a machine changes its MAC or services are moved to another machine, devices can end up with wrong data in their ARP caches; to update it, the machine can send gratuitous ARP requests.