BGP overview

I would like to share some facts about the most famous routing protocol.

BGP – Border Gateway Protocol

According to Wikipedia, the version currently in use is BGP4, whose current specification was published as RFC 4271 in 2006.

BGP4 has been in use on the Internet since 1994.

Basically, the world was changed by the three-napkin protocol (picture found on the Network Collective podcast):

bgp-napkin

Its simplified finite state machine (taken from Wikipedia):

670px-BGP_FSM.svg

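Some BGP basics:

  • Routing protocol between autonomous systems, based on a path-vector mechanism
  • Slowest routing protocol to converge

By default, if a route goes down BGP won’t react immediately: it waits up to 30 seconds before notifying its external peers, which keeps a flapping route from spreading.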

  • Helps you to be reachable through multiple service providers
  • To reconverge quickly inside the AS, BGP relies on the underlying IGP
  • Unlike other routing protocols, BGP does not trust its neighbors: there are multiple filters, and you need to agree with your BGP peers on what will be sent over the link, and how, before you establish the session.
  • BGP runs on top of TCP, port 179
  • Updates are triggered but batched (advertisement interval: 5 seconds internal, 30 seconds external)
  • 13 different criteria for finding the best route (largest weight, highest local preference, locally originated, shortest AS path, lowest origin type, lowest MED, etc.)
  • All neighbors need to be set manually –
     neighbor ip remote-as as_number

     

  • RFC rule about traffic and BGP – when a packet leaves your AS, it’s not your traffic anymore, so basically you can’t tell anyone else what to do with their traffic.
  • A neighbor must be set manually and be directly reachable
  • Multiple sessions to the same neighbor are not permitted and will be dropped.
  • The network command works differently than in an IGP – the network you want to advertise really has to be in your routing table, otherwise it won’t be advertised. It needs to be an exact match to the routing table entry.
  • BGP message types:
     • Open – sent once the TCP session to the configured neighbor is up; it works as the hello that starts the BGP session
     • Update – used to advertise new routes, changed attributes and withdrawn routes
     • Keepalive – BGP has its own keepalive mechanism
     • Notification – any BGP error condition generates a notification message, after which the session is closed
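To make the four message types a bit more tangible, here is a minimal Python sketch of the common BGP message header. The layout (16-byte marker of all ones, 2-byte length, 1-byte type) and the type codes come from RFC 4271; the function names are my own and only for illustration, this is not how any particular router implements it.

```python
import struct

# Message type codes from RFC 4271, section 4.1.
BGP_TYPES = {1: "OPEN", 2: "UPDATE", 3: "NOTIFICATION", 4: "KEEPALIVE"}
HEADER_LEN = 19  # 16-byte marker + 2-byte length + 1-byte type


def build_keepalive():
    """A KEEPALIVE is just the 19-byte header with no body."""
    marker = b"\xff" * 16
    return marker + struct.pack("!HB", HEADER_LEN, 4)


def parse_header(data):
    """Return (total message length, message type name)."""
    if len(data) < HEADER_LEN:
        raise ValueError("a BGP header is 19 bytes long")
    marker, length, msg_type = struct.unpack("!16sHB", data[:HEADER_LEN])
    if marker != b"\xff" * 16:
        raise ValueError("marker must be all ones")
    return length, BGP_TYPES.get(msg_type, "UNKNOWN")


print(parse_header(build_keepalive()))  # (19, 'KEEPALIVE')
```

Open, Update and Notification messages carry a body after the same header, so a parser like this would be the first step before decoding them.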

 

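  • BGP states:
     • Idle – a neighbor is configured but no connection is being attempted yet – a router usually gets stuck in this state when something is misconfigured.
     • Connect – waiting for the TCP connection to the neighbor to complete
     • Active – actively trying to establish the TCP connection – a lot of issues also show up here
     • OpenSent – Open message sent, waiting for the neighbor’s Open
     • OpenConfirm – Open received, waiting for a Keepalive
     • Established – all good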

 

  • Enabling BGP and adding a neighbor is really simple:
     conf t
     router bgp xx
     neighbor x.x.x.x remote-as xx
  • Default hold time is 180 seconds – this is the interval after which a silent neighbor is considered dead.
  • After applying any BGP rule you need to clear the session: clear ip bgp * (dangerous 🙂 don’t use this in production)
  • BGP won’t advertise anything until you specify what to advertise
  • Usually the ISP puts a filter in place, so you will be able to send only the routes you have agreed on with them
  • Filtering is done with route maps
  • Read the AS-Path value from right to left (the originating AS is the rightmost one) to see how many autonomous systems the route has passed through. This is also the anti-loop mechanism: if a router sees its own AS in the path, it drops the route.
  • BGP attributes are attached to every route advertisement
  • A route-map is something similar to an access-list: it performs if/then statements, called match/set, and is used for modifying BGP attributes, policy routing and route filtering.
  • EBGP – used between different autonomous systems, for example to receive routes from your uplinks and advertise yours to them
  • IBGP – used for sessions inside the same AS
  • By default IBGP does not modify BGP attributes such as the AS path or the next hop
  • IBGP has no AS-path-based loop prevention, because the AS path does not change inside the AS
  • The BGP split horizon rule is to never advertise a route received via IBGP to another IBGP peer. Because of this you need a full mesh between your IBGP neighbors, or route reflectors
  • IBGP peers should be formed using loopback interfaces, so that the session survives a link failure as long as another path exists.
  • IBGP and EBGP have different administrative distances: EBGP-learned path AD = 20; IBGP AD = 200
  • EBGP neighbors must be directly connected, but to bend the rules a bit we can use the ebgp-multihop option.

neighbor xx.xx.xx.xx ebgp-multihop 2 (in case we are using loopbacks for EBGP)

  • An EBGP session won’t form if the only route to the neighbor is the default route 0.0.0.0
  • Don’t forget to remove private AS numbers before advertising to other EBGP peers.
  • Don’t forget to put redundant route reflectors in the same cluster – to avoid loops
  • To create an AS inside the AS, confederations can be used.
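The best-path criteria and the AS-path loop check from the list above can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions: only the first six criteria are modelled, MED is compared between all routes (real routers by default compare it only for routes from the same neighboring AS), and the Route class and AS numbers are made up for the example.

```python
from dataclasses import dataclass, field

# Lower rank is preferred: IGP < EGP < incomplete.
ORIGIN_RANK = {"igp": 0, "egp": 1, "incomplete": 2}


@dataclass
class Route:
    prefix: str
    weight: int = 0                    # Cisco-specific, local to the router
    local_pref: int = 100
    locally_originated: bool = False
    as_path: list = field(default_factory=list)
    origin: str = "igp"
    med: int = 0


def preference_key(route):
    # The smallest tuple wins, so "higher is better" attributes are negated.
    return (
        -route.weight,
        -route.local_pref,
        not route.locally_originated,
        len(route.as_path),
        ORIGIN_RANK[route.origin],
        route.med,
    )


def best_path(candidates):
    return min(candidates, key=preference_key)


def accept(route, my_as):
    """AS-path loop check: drop a route that already contains our own AS."""
    return my_as not in route.as_path


a = Route("203.0.113.0/24", as_path=[65010, 65020, 65030])
b = Route("203.0.113.0/24", as_path=[65040, 65030])
c = Route("203.0.113.0/24", local_pref=200, as_path=[65050, 65060, 65070, 65030])

print(best_path([a, b]).as_path)     # [65040, 65030], the shorter AS path
print(best_path([a, b, c]).as_path)  # local preference 200 beats the shorter path
print(accept(a, my_as=65020))        # False, our AS is already in the path
```

Because the comparison is a tuple, a later criterion is only looked at when all earlier ones are equal, which is the same tie-breaking idea the real decision process uses.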

And there is much more. BGP nowadays is used everywhere, often for purposes other than those it was designed for. Anyway, if you want to learn more I would suggest visiting the following links:

http://packetpushers.net/podcast/podcasts/show-355-whats-wrong-bgp-ietf-99/

http://thenetworkcollective.com/2017/09/hon-li-bgp/

https://www.cbtnuggets.com/it-training/cisco-ccip-bgp-642-661

 

Finite State Machines (FSM)

A great way to understand how a protocol or program should work is to check its FSM – Finite State Machine.

It describes the processes of the protocol: where it starts, how it moves on and where it ends.

STATE 1 -> STATE2 -> STATE3.

For example we can check the TCP FSM :

tcpfsm

Picture taken from http://www.tcpipguide.com/free/t_TCPOperationalOverviewandtheTCPFiniteStateMachineF-2.htm 

In the picture above we can see the SYN, SYN-ACK, ACK procedure. We can also see how a connection is closed using FIN or timeouts.
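An FSM maps very naturally to a lookup table. Below is a small Python sketch of the client side of the TCP state machine (active open, then active close). The state names follow the usual TCP naming; the event strings and helper functions are my own simplification, and many transitions of the full diagram (passive open, simultaneous close, resets) are left out.

```python
# (current state, event) -> next state, client side of a TCP connection.
TRANSITIONS = {
    ("CLOSED", "send SYN"): "SYN_SENT",
    ("SYN_SENT", "recv SYN+ACK, send ACK"): "ESTABLISHED",
    ("ESTABLISHED", "send FIN"): "FIN_WAIT_1",
    ("FIN_WAIT_1", "recv ACK"): "FIN_WAIT_2",
    ("FIN_WAIT_2", "recv FIN, send ACK"): "TIME_WAIT",
    ("TIME_WAIT", "timeout"): "CLOSED",
}


def step(state, event):
    """Return the next state, or fail if the event is not valid here."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not valid in state {state}") from None


def run(events, state="CLOSED"):
    for event in events:
        state = step(state, event)
    return state


handshake = ["send SYN", "recv SYN+ACK, send ACK"]
teardown = ["send FIN", "recv ACK", "recv FIN, send ACK", "timeout"]
print(run(handshake))             # ESTABLISHED
print(run(handshake + teardown))  # CLOSED
```

Any event that is not in the table for the current state raises an error, which is exactly what makes an FSM useful for reasoning about a protocol: everything that is allowed is written down.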

By following the link below you can find an overview of the RSTP state machine:

http://www.ieee802.org/1/files/public/docs2000/1wProposedStateDiagrams041.pdf

ARP – Address Resolution Protocol

ARP solves the mapping problem between two protocols with addresses of different sizes.

It maps an IP address (32 bit) to a MAC address (48 bit).

How does it work in a nutshell :

Let’s assume that we have two hosts on different networks, A and B, and we want them to communicate with each other. What happens when host A tries to reach host B for the first time (ARP cache empty)?

  1. Host A determines from its IP and subnet mask combination that host B is on another network, so it needs the MAC address of its gateway. Host A sends a broadcast ARP request which contains the following fields:
  • HW : 1 (Ethernet)
  • Protocol : 0x0800 IP
  • HW length : (6) 48 bit
  • Protocol length : (4) 32-bit
  • OPCode : 1 (Request)
  • H/W Source : aa:aa:aa:aa:aa:aa – MAC of host A
  • Protocol source : 192.168.0.2 – IP of host A
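  • HW Dest : not known yet, so the ARP field is typically left as 00:00:00:00:00:00; the Ethernet frame itself is sent to ff:ff:ff:ff:ff:ff – the broadcast MAC address
  • Protocol dest : 192.168.0.1 – IP of gateway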

The gateway will respond with an ARP reply:

  • HW : 1 (Ethernet)
  • Protocol : 0x0800 IP
  • HW length : (6) 48 bit
  • Protocol length : (4) 32-bit
  • OPCode : 2 (Reply)
  • H/W Source : gg:gg:gg:gg:gg:gg – Gateway’s MAC
  • Protocol source : 192.168.0.1 – IP of gateway
  • HW Dest : aa:aa:aa:aa:aa:aa – MAC of host A
  • Protocol dest : 192.168.0.2 – IP of host A
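The request and reply fields above fit into a 28-byte ARP payload, which can be sketched with Python’s struct module. A few assumptions on my side: the target hardware address of the request is left as zeros (the broadcast address belongs to the Ethernet frame around it), and the gateway MAC 02:00:00:00:00:01 is a made-up stand-in for the gg:gg:... placeholder, which is not valid hex. The function names are hypothetical.

```python
import socket
import struct

# htype, ptype, hlen, plen, opcode, sender MAC, sender IP, target MAC, target IP
ARP_FORMAT = "!HHBBH6s4s6s4s"  # 28 bytes for Ethernet + IPv4


def mac_to_bytes(mac):
    return bytes.fromhex(mac.replace(":", ""))


def build_arp(opcode, sender_mac, sender_ip, target_mac, target_ip):
    return struct.pack(
        ARP_FORMAT,
        1,       # hardware type: Ethernet
        0x0800,  # protocol type: IPv4
        6,       # hardware address length in bytes
        4,       # protocol address length in bytes
        opcode,  # 1 = request, 2 = reply
        mac_to_bytes(sender_mac),
        socket.inet_aton(sender_ip),
        mac_to_bytes(target_mac),
        socket.inet_aton(target_ip),
    )


request = build_arp(1, "aa:aa:aa:aa:aa:aa", "192.168.0.2",
                    "00:00:00:00:00:00", "192.168.0.1")
reply = build_arp(2, "02:00:00:00:00:01", "192.168.0.1",
                  "aa:aa:aa:aa:aa:aa", "192.168.0.2")
print(len(request), request.hex())
print(len(reply), reply.hex())
```

Note how the reply is simply the request with opcode 2 and the sender and target fields swapped, with the gateway filling in its own MAC.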

The same will happen on the other side: if the gateway has no ARP entry for host B, it will broadcast an ARP request in host B’s broadcast domain (let’s assume we are using only one router), get a response from the host and save its address in the ARP cache.

After the request/reply exchange we have the gateway in our ARP cache, and now we can send packets to B via the gateway.

  1. Host A sends a packet to the default gateway, with the destination MAC of the default gateway and the destination IP of host B.
  2. Before sending the packet on, the gateway changes the source MAC to its own and the destination MAC to host B’s; the destination IP is left unchanged.
  3. Host B replies: the MAC of the gateway is the destination MAC, and the IP of host A is the destination IP.
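The decision host A makes before sending anything (ARP for the destination itself, or for the gateway?) can be sketched with the standard ipaddress module. The /24 prefix length and the addresses 192.168.0.50 and 10.0.0.5 are assumptions for the example; the function name is hypothetical.

```python
import ipaddress


def arp_target(src_ip, prefix_len, dst_ip, gateway_ip):
    """Return the IP address the sender has to resolve with ARP."""
    network = ipaddress.ip_interface(f"{src_ip}/{prefix_len}").network
    if ipaddress.ip_address(dst_ip) in network:
        return dst_ip      # same subnet: resolve the destination directly
    return gateway_ip      # other subnet: resolve the default gateway


print(arp_target("192.168.0.2", 24, "192.168.0.50", "192.168.0.1"))  # 192.168.0.50
print(arp_target("192.168.0.2", 24, "10.0.0.5", "192.168.0.1"))      # 192.168.0.1
```

In both cases the destination IP in the packet stays the same; only the MAC address the frame is sent to differs.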

We should not forget about:

Gratuitous ARP – if a machine changes its MAC or services are moved, devices can have wrong data in their ARP caches; to update it, the machine can send gratuitous ARP requests.

https://wiki.wireshark.org/Gratuitous_ARP 

VRRP Concepts

Some quick facts :

Protocol – 112; Multicast address 224.0.0.18; Preemption enabled (by default); Default priority = 100, ties broken by the highest IP; Timers 1/3.6 seconds (advertisement interval / master down interval at the default priority); Only the master sends hellos;

  • During the re-election all members will send multicast packets with the same virtual source MAC – a switch may see that MAC flapping between ports at that moment.
  • You can’t advertise a timer of less than 1 second, because the interval is carried in a 1-byte field expressed in whole seconds; if you want less than 1 second, the timers need to be set locally on each router.
  • You’d better set the timers equally on all routers, otherwise you might end up with a two-master scenario
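The 3.6 second value in the quick facts is not arbitrary. Here is a short Python sketch of the RFC 3768 formulas (Skew_Time and Master_Down_Interval), plus the virtual MAC address, which the RFC defines as 00-00-5E-00-01-{VRID}. The function names are mine and the VRID 50 is just an example value.

```python
def skew_time(priority):
    """Skew_Time from RFC 3768: higher priority means a shorter wait."""
    return (256 - priority) / 256


def master_down_interval(priority, advertisement_interval=1.0):
    """How long a Backup waits without advertisements before taking over."""
    return 3 * advertisement_interval + skew_time(priority)


def virtual_mac(vrid):
    """Virtual router MAC address for a given VRID (1-255)."""
    if not 1 <= vrid <= 255:
        raise ValueError("VRID must be between 1 and 255")
    return f"00:00:5e:00:01:{vrid:02x}"


print(master_down_interval(100))  # 3.609375
print(master_down_interval(254))  # 3.0078125
print(virtual_mac(50))            # 00:00:5e:00:01:32
```

The skew is what makes the highest-priority Backup time out first, so it is the one that takes over when the Master disappears. It also shows why mismatched advertisement intervals are risky: a Backup with a shorter interval can time out while the Master is still alive.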

What problem does it solve?

-It’s designed to eliminate the single point of failure in a statically routed network.

In a nutshell – we are making one logical router of two physical ones.

From user guide :

VRRP specifies a MASTER router that owns the next hop IP and MAC address for end stations on a local area network (LAN). The
MASTER router is chosen from the virtual routers by an election process and forwards packets sent to the next hop IP address. If the
MASTER router fails, VRRP begins the election process to choose a new MASTER router and that new MASTER continues routing traffic

VRRP uses the virtual router identifier (VRID) to identify each virtual router configured. The IP address of the MASTER router is used as
the next hop address for all end stations on the LAN. The other routers the IP addresses represent are BACKUP routers.

 

RFC3768 describes this in detail, but basically we have one virtual router with a virtual IP, and hosts which use that virtual IP as their gateway. If one of the routers in the VRRP instance fails, the other will still be routing the packets.

In VRRP we have Master and Backup routers; the election is based on router priority, with the highest IP address as the tie-breaker.

The Master router actively routes the packets, while the Backup router should “keep silent” and monitor the availability of the Master router (by listening to its advertisements).

What the Backup router does while it’s in the Backup state (taken from RFC3768):

While in this state, a VRRP router MUST do the following:

– MUST NOT respond to ARP requests for the IP address(s) associated
with the virtual router.

– MUST discard packets with a destination link layer MAC address
equal to the virtual router MAC address.

– MUST NOT accept packets addressed to the IP address(es) associated
with the virtual router.

In case of different events :

If a Shutdown event is received, then:

o Cancel the Master_Down_Timer
o Transition to the {Initialize} state

– If the Master_Down_Timer fires, then:

o Send an ADVERTISEMENT
o Broadcast a gratuitous ARP request containing the virtual
router MAC address for each IP address associated with the
virtual router
o Set the Adver_Timer to Advertisement_Interval
o Transition to the {Master} state

– If an ADVERTISEMENT is received, then:

If the Priority in the ADVERTISEMENT is Zero, then:

o Set the Master_Down_Timer to Skew_Time

else:

If Preempt_Mode is False, or If the Priority in the
ADVERTISEMENT is greater than or equal to the local
Priority, then:

o Reset the Master_Down_Timer to Master_Down_Interval

else:

o Discard the ADVERTISEMENT
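The advertisement handling of a Backup router can be written down as a few lines of Python. This is only a sketch of the logic in RFC 3768 section 6.4.2, including the branch for an advertisement with priority zero (a Master that is stepping down); the class, its fields and the returned strings are my own invention for illustration.

```python
from dataclasses import dataclass


@dataclass
class BackupRouter:
    priority: int = 100
    preempt: bool = True
    advertisement_interval: float = 1.0
    master_down_timer: float = 0.0   # seconds until we declare the Master dead

    @property
    def skew_time(self):
        return (256 - self.priority) / 256

    @property
    def master_down_interval(self):
        return 3 * self.advertisement_interval + self.skew_time

    def on_advertisement(self, adv_priority):
        if adv_priority == 0:
            # The Master is giving up its role: wait only Skew_Time.
            self.master_down_timer = self.skew_time
            return "master is stepping down"
        if not self.preempt or adv_priority >= self.priority:
            # A legitimate Master is alive: keep quiet and restart the timer.
            self.master_down_timer = self.master_down_interval
            return "timer reset"
        # We have a higher priority and may preempt: ignore the advertisement
        # and let our timer run out.
        return "discarded"


router = BackupRouter(priority=100)
print(router.on_advertisement(120), router.master_down_timer)  # timer reset 3.609375
print(router.on_advertisement(0), router.master_down_timer)    # master is stepping down 0.609375
print(router.on_advertisement(50))                             # discarded
```

With preemption disabled the last case would also reset the timer, which is why a lower-priority Master can stay in charge when Preempt_Mode is False.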

What the Master router does while in the Master state (also from RFC3768):

While in the {Master} state the router functions as the forwarding
router for the IP address(es) associated with the virtual router.

While in this state, a VRRP router MUST do the following:

– MUST respond to ARP requests for the IP address(es) associated
with the virtual router.

– MUST forward packets with a destination link layer MAC address
equal to the virtual router MAC address.

– MUST NOT accept packets addressed to the IP address(es) associated
with the virtual router if it is not the IP address owner.

– MUST accept packets addressed to the IP address(es) associated
with the virtual router if it is the IP address owner.

Here is an example of VRRP config on Dell N-Series switches.

Configuring two instances for different sub-networks in vlan 50.

vrrp_example

On Cisco :

vrrp

If you are using Dell Force10 switches, you can put VRRP on top of VLT. This allows both VRRP MAC addresses to be populated in the LOCAL_DA CAM table of each switch and allows active-active routing instead of the active-passive routing described in the RFC. To check that the MACs are populated on both VLT peers you can use the command: show cam mac stack-unit 0 port-set 0 | grep vrrp_virtual_mac

A nice article about this can be found under this link.

Dell Networking VLT concepts

So what is VLT and what does it do:

Virtual link trunking (VLT) allows physical links between two Dell switches to appear as a single virtual link to the network core or other
switches such as Edge, Access, or top-of-rack (ToR). As a result, the two physical switches appear as a single switch to the connected
devices.

Basically we are creating one logical switch out of two physical switches.

On the left we see how it looks when interconnected physically, on the right how the end device sees it.

vlt_concept

Configuration steps :

1. Enable spanning tree – RSTP and PVST are supported – this step is optional, but nevertheless recommended.

configure

protocol spanning-tree rstp

bridge-priority 4096 (primary VLT switch)

bridge-priority 8192 (Secondary VLT switch)

no disable

It is recommended to have the root bridge on the primary VLT switch and to set the next-best STP priority on the secondary VLT switch: if the first one fails, there is no topology change in which some unknown third device becomes the root.

2. Configure ports for VLTi link :

configure

interface range fortyGigE 0/56 , fortyGigE 0/60

no shutdown

interface port-channel 100

channel-member fortyGigE 0/56,60

no shutdown

3. Create the VLT domain on both switches, and don’t forget to configure a backup link

configure

vlt domain 1

primary-priority 10 (primary VLT switch)

primary-priority 20 (Secondary VLT switch)

back-up destination 192.168.0.2 (Primary VLT switch, management interface)

back-up destination 192.168.0.3  (Secondary VLT switch management interface)

peer-link port-channel 100

Backup links are needed so that heartbeat messages can flow between the two switches.

heartbeat

VLT would also work without the heartbeat, but then you may run into a split-brain scenario if the VLTi link fails.

After configuring the VLT we should get the following picture :

shvltbrief.png

Now let’s attach a device to our VLT switches.

On both VLT members pick a port for the redundant connection:

interface port-channel xx

no ip address

switchport

channel-member tex/x/x

no shut

vlt-peer-lag port-channel 110

And you are ready to go.

You can tweak things like dampening, to give routing and other protocols some time to come online after rebooting the switch: ports come up faster, and devices that don’t know the routing protocol is not ready yet may black-hole the traffic.

You can also play with spanning-tree metrics to keep the interruption after a reboot as small as possible.

VLT behavior :

vlt_behaviour

You can check that MACs are being synced using the command :

show mac-address-table count

Some interesting points to remember (you can find more by downloading the user guide):

  • When you enable Layer 3 routing protocols on VLT peers, make sure the delay-restore timer is set to a value that allows sufficient time
    for all routes to establish adjacency and exchange all the L3 routes between the VLT peers before you enable the VLT ports.

  • Only RSTP and PVST are supported; no other spanning-tree flavor will work properly in a VLT config

  • Stacking is not allowed when configuring the VLT.

  • If the source is connected to an orphan (non-spanned, non-VLT) port in a VLT peer, the receiver is connected to a VLT (spanned) port-channel, and the VLT port-channel link between the VLT peer connected to the source and ToR is down, traffic is duplicated due to
    route inconsistency between peers. To avoid this scenario, Dell Networking recommends configuring both the source and the receiver
    on a spanned VLT VLAN.

  • In a scenario where one hundred hosts are connected to a Peer1 on a non-VLT domain and traffic flows through Peer1 to Peer2; when
    you move these hosts from a non-VLT domain to a VLT domain and send ARP requests to Peer1, only half of these ARP requests reach
    Peer1, while the remaining half reach Peer2 (because of LAG hashing). The reason for this behavior is that Peer1 ignores the ARP
    requests that it receives on VLTi (ICL) and updates only the ARP requests that it receives on the local VLT. As a result, the remaining
    ARP requests still points to the Non-VLT links and traffic does not reach half of the hosts. To mitigate this issue, ensure that you
    configure the following settings on both the Peers (Peer1 and Peer2):
    arp learn-enable and mac-address-table stationmove refresh-arp

  • Don’t use any VLAN config on VLTi – the switch will match the VLANs automatically

  • Don’t use dynamic LAG on VLTi – static is recommended

  • In a VLT domain, the following software features are supported on VLTi: link layer discovery protocol (LLDP), flow control, port
    monitoring, jumbo frames, and data center bridging (DCB)

  • If the link between the VLT peer switches is established, changing the VLT system MAC address or the VLT unit-id causes the link
    between the VLT peer switches to become disabled. However, removing the VLT system MAC address or the VLT unit-id may
    disable the VLT ports if you happen to configure the unit ID or system MAC address on only one VLT peer at any time.

  • If the link between VLT peer switches is established, any change to the VLT system MAC address or unit-id fails if the changes
    made create a mismatch by causing the VLT unit-ID to be the same on both peers and/or the VLT system MAC address does not
    match on both peers

  • If VLTi connectivity with a peer is lost but the VLT backup connectivity indicates that the peer is still alive, the VLT ports on the
    Secondary peer are orphaned and are shut down.

    The L3 VLANs would be shut down too

Some failure scenarios :

failurescenarios

Overall, VLT is a great thing for load balancing, redundancy and availability (you can upgrade the switches one by one without any downtime) – in a stack this wouldn’t be possible.

All info and images were taken from Dell User guide for S4048-ON switch, you can download it by following this link : http://downloads.dell.com/manuals/all-products/esuprt_ser_stor_net/esuprt_networking/esuprt_net_fxd_prt_swtchs/force10-s4048-on_administrator%20guide15_en-us.pdf  

In the user guide you can find a lot of detailed info about all the possible switch OS functions and how to use, implement and troubleshoot them.

Dell VLT Peer-Routing

Some important points about the VLT peer-routing technology.

Peer routing enables one VLT node to act as a proxy gateway for the other peer in a VLT domain. When you enable routing on VLT peers,
you can also enable the peer routing feature.  

In a nutshell, when peer-routing is enabled on both VLT switches you can load-balance the L3 packets through both switches, as this allows a switch in the VLT domain to forward traffic on behalf of its peer switch.

Example of how VLT forwards the traffic without peer-routing enabled:

without_peer_routing

 

When you enable peer-routing :

with_peer_routing

Images taken from Configuration guide

Peer-routing helps to avoid sub-optimal routing and reduces latency by avoiding another hop in the traffic path; there is also no need for VRRP.

Keep in mind that if the switch Peer-1 fails with peer routing enabled, your traffic will still be forwarded without any interruption – but as you don’t have any virtual IP address, control or management plane requests to Peer-1’s address won’t be answered by its peer.

So basically, by enabling peer routing we have only one goal – redundancy and traffic sharing for L3 traffic.

During the bootup of VLT peer switches, a forwarding loop may occur until the VLT configurations are applied on each switch and the
primary/secondary roles are determined.


To prevent the interfaces in the VLT interconnect trunk and RSTP-enabled VLT ports from entering a Forwarding state and creating a
traffic loop in a VLT domain, take the following steps.


1 Configure RSTP in the core network and on each peer switch as described in
Rapid Spanning Tree Protocol (RSTP).
Disabling RSTP on one VLT peer may result in a VLT domain failure.


2 Enable RSTP on each peer switch.
PROTOCOL SPANNING TREE RSTP mode
no disable

forwarding loop

3 Configure each peer switch with a unique bridge priority.
PROTOCOL SPANNING TREE RSTP mode
bridge-priority

More info about the advantages of peer-routing compared to VRRP:

https://hasanmansur.com/2016/06/09/vlt-peer-routing-and-routed-vlt/

Routed VLT v1.2 – this document covers peer-routing in great detail.

 

Error Detection – Checksum

So how does a checksum work?

  • Both devices need to agree on the checksum number – whether it will be odd or even
  • The higher the number, the more precise the check

Let’s take an example from a YouTube video with a great explanation of how a checksum works.

Taking the numbers which we want to transmit :

25 11 12 7 13 4

Both devices should agree on the checksum number – let it be 16

  1. Sum up the numbers: 25+11+12+7+13+4=72
  2. Divide the sum by the checksum number: 72/16=4.5 (ignore what is after the decimal point), so we have 4
  3. 4*16=64, and 72 – 64 = 8 (the remainder of the division)
  4. Now 8 is the checksum
  5. The first device takes the numbers and hands them to the TCP/IP stack, and also puts the number 8 into the checksum field, to transmit it with the actual message; this field helps the receiver to know whether the rest of the message is correct
  6. The second device reads the numbers from the TCP/IP stack and performs the same calculation; if it gets the same value (8), the data wasn’t corrupted and we can trust it.

This is a very fast check to compute, but unfortunately this checksum is not robust or reliable: it catches many single errors, but errors that cancel each other out go unnoticed.

For example, if we send 25 11 12 7 13 4 but the message arrives as 24 12 12 7 13 4, the sum is still 72 and the checksum won’t detect any problem.
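The whole procedure is just a sum followed by a remainder, so it fits in a couple of lines of Python. This sketch mirrors the toy example above (it is not the real IP checksum); the function names are my own.

```python
def simple_checksum(values, modulus=16):
    """Sum the values and keep the remainder of the division by `modulus`."""
    return sum(values) % modulus


def verify(values, received_checksum, modulus=16):
    return simple_checksum(values, modulus) == received_checksum


sent = [25, 11, 12, 7, 13, 4]
corrupted = [24, 12, 12, 7, 13, 4]         # two errors that cancel out

print(simple_checksum(sent))               # 8
print(verify(corrupted, 8))                # True, the corruption goes unnoticed
print(verify([25, 11, 12, 7, 13, 5], 8))   # False, a single changed value is caught
```

The second call shows the weakness described above: the receiver computes the same 8 and has no way to tell that two of the numbers were changed.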

checksum

IP Checksum picture taken from Stanford Networking Course

A bit of info about ICMP, ping and traceroute

I went through RFC 792 and would like to share some basic (high-level) info about ICMP and how we use it.
ICMP is the Internet Control Message Protocol
  • It runs on top of the network layer – it’s encapsulated in IP datagrams
  • Unreliable – a simple datagram service; there are no retries to retransmit a message if it fails to reach the destination
  • An ICMP error message is generated using the header of the original IP datagram (source address, destination address) plus the first 8 bytes of the original datagram’s payload; the message is then marked with a type and code. Some of the types:

icmp_types

  • Host unreachable – the IP datagram gets to the last router, but that router doesn’t know where the host is
  • Port unreachable – the port contained in the packet is not recognized at the receiver’s end

How does ping use ICMP:

The ping application calls ICMP directly: it sends an ICMP echo request (message type 8, code 0) to the receiver.

It gets encapsulated into an IP datagram and flows through the network; when the receiver gets it, it sends back an echo reply (type 0, code 0).
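Here is a minimal Python sketch of what such an echo request looks like on the wire: type, code, checksum, identifier and sequence number, followed by an optional payload. The checksum is the standard 16-bit ones’ complement Internet checksum. The packet is only built in memory, nothing is sent, and the identifier, sequence number and payload are example values of mine.

```python
import struct


def internet_checksum(data):
    """16-bit ones' complement of the ones' complement sum of the data."""
    if len(data) % 2:
        data += b"\x00"                      # pad to a whole number of 16-bit words
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:                       # fold the carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_echo_request(identifier, sequence, payload=b""):
    # Type 8, code 0; the checksum is computed with its own field set to zero.
    header = struct.pack("!BBHHH", 8, 0, 0, identifier, sequence)
    checksum = internet_checksum(header + payload)
    return struct.pack("!BBHHH", 8, 0, checksum, identifier, sequence) + payload


packet = build_echo_request(identifier=0x1234, sequence=1, payload=b"ping")
print(packet.hex())
print(internet_checksum(packet))  # 0, which is how a receiver validates the message
```

An echo reply has the same layout with type 0, and the receiver copies the identifier, sequence number and payload back so ping can match replies to requests.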

How traceroute uses ICMP and UDP (the *nix version; Windows uses pure ICMP and sends echo requests until it gets an echo reply from the target (link)):

  • For traceroute the goal is to discover all routers in the path, show the path and provide the round-trip delay.

  • When you execute traceroute, it generates a UDP message which is encapsulated in an IP datagram; the TTL is set to 1 for the first message.

  • At the first router the TTL is decremented to 0, which forces the router to drop the packet and send an ICMP message back to the sender with ICMP Type 11, which means TTL expired.

  • To send that TTL expired message back, the router takes the IP header and the first 8 bytes of the IP payload of the dropped datagram.

  • When the TTL expired message reaches the source, traceroute knows that the TTL has expired and that this message came from the first-hop router; traceroute also measures the round-trip time (how long it took from sending the UDP message to receiving TTL expired back).

  • Now it generates a second UDP message with only one change – the TTL field value is increased to 2, then to 3, and so on. It stops only when a destination port unreachable message arrives back.

  • With each probe traceroute also picks a UDP port number that is unlikely to be in use. When the UDP datagram gets through all the routers to the destination, the receiver won’t recognize the UDP port number and will send an ICMP Type 3 Code 3 message – Destination Port Unreachable. After receiving that, traceroute ends the trace.
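The TTL loop described above can be simulated in Python without touching the network. In this sketch the path is a made-up list of router names ending with the destination host, and simulate_probe stands in for "send a UDP probe and wait for the ICMP answer"; a real traceroute would use sockets and measure the round-trip time as well.

```python
def simulate_probe(path, ttl):
    """Return (responder, icmp_type, icmp_code) for one UDP probe.

    `path` is a list of router names followed by the destination host.
    """
    routers, destination = path[:-1], path[-1]
    if ttl <= len(routers):
        # The TTL hits 0 at this router: ICMP Time Exceeded (type 11, code 0).
        return routers[ttl - 1], 11, 0
    # The probe reached the host, where nothing listens on the UDP port:
    # ICMP Destination Unreachable / Port Unreachable (type 3, code 3).
    return destination, 3, 3


def traceroute(path, max_hops=30):
    hops = []
    for ttl in range(1, max_hops + 1):
        responder, icmp_type, icmp_code = simulate_probe(path, ttl)
        hops.append((ttl, responder))
        if (icmp_type, icmp_code) == (3, 3):
            break                      # destination reached, end the trace
    return hops


print(traceroute(["r1", "r2", "r3", "host-b"]))
# [(1, 'r1'), (2, 'r2'), (3, 'r3'), (4, 'host-b')]
```

The max_hops limit plays the same role as in the real tool: if the port unreachable message never comes back, the trace still ends.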

Picture of ICMP types taken from Stanford Networking course http://online.stanford.edu/course/introduction-computer-networking