Digital Communication II Explanations

During my time supervising the University of Cambridge Computer Laboratory Part II Digital Communications II course, various questions or confusions arose. Here is an in depth explanation on the issues raised. To the best of my knowledge it is correct, but if in doubt, ask the lecturer.

Scalability of Routing Protocols

Link state protocols maintain state about all the connections in the network. This enables them to run Djikstra's algorithm and arrive at the shortest route to a packet's destination. Once the network has settled down ("converged"), the number of updates between link state routers should theoretically be small, as the only time an update is made is when a route goes down or comes back up. Convergence is fast as these updates are broadcast to every router throughout the network. However, two problems arise. The first is that with a large and complex topology, there will be an enormous amount of information to store and process -- this tends to become infeasible when we're talking about the entire Internet. The second is that we cannot guarantee that the update packets from any particular router reach any or all of the rest in the network (e.g. if a link goes down an update packet might get lost). Hence there has to be periodic refresh of the soft state that specifies what links are functioning. Such refreshes are costly in terms of bandwidth.

Distance-Vector protocols cause each router to advertise what destinations they are aware of, and what the metric (be it latency, hop count, etc.) is to that destination from them. The routing table is a list of destinations, each with the next hop the packet is to be forwarded to, and the metric. A route is therefore made up of the composition of the next hops from each of the routers that are on the route, in essence "distributed Djikstra". The storage and processing overhead is less than link state, making it more suited to complex networks. Convergence times for D-V are higher than link state (e.g. several hours for BGP) due to the information being propagated out by normal routing updates, cascading through the routers. D-V routers regularly transmit the entire contents of their routing table to their immediate neighbours (not the entire network). As a concrete example, BGP is a D-V protocol, and is hard state, i.e. only updates are transmitted, and these over a TCP connection, where packets are guaranteed not to get lost. Hence the amount of traffic generated is not actually that great. Another good reason for BGP to be D-V is that policy information cannot be conveyed in link state protocols: the links are either up or down. Providers do not want to disclose their policies, and therefore do not broadcast them (as would be needed by some sort of link state protocol), instead choosing to publish metrics and destinations that they can reach. BGP provides a mechanism for using policies which are highly necessary in today's Internet.

The exam question that I set actually has a model answer that says it could be argued both ways -- see above. My general feeling would be to say that link state is not used in the core, and therefore is not as scalable, but provided you outline clearly your reasons (to show that you know what you're talking about), I think you'd have got the marks anyway.

TCP Vegas: Doesn't Bandwidth Decrease to Zero?

TCP Vegas is a modification to TCP to attempt to smooth the value of the congestion window, cwin. This is done by comparing the bandwidth currently being achieved, with the maximum theoretically achievable (calculated using the best RTT that the sender has seen on this connection). If the difference between the two is large, i.e. achieved is much lower than maximum, Vegas reduces the size of cwin.

This has lead to the belief that Vegas would end up reducing cwin to zero (!). In practice, Vegas calculates the difference by:
diff = cwin/baseRTT - cwin/currRTT
Where baseRTT is the minimum ever achieved, and currRTT is the most recently measured RTT.
In this way if cwin is decreased, the difference will also decrease (as baseRTT is smaller than currRTT, and therefore given that the fractions have common numerators, the first will be decreased by a greater amount than the second). Hence cwin is stabilised.

The Packet-Pair Model

This is used to provide an estimate for the available bandwidth on a network. Simplistically, this involves sending two packets, with zero (or very little) spacing between them in time, and at the destination examining the spacing between them when they arrive. The spacing becomes greater as traffic from other flows on the network is "mixed in". Hence, the magnitude of the gap gives an indication of network load.

Unfortunately, the relationship between gap length and available bandwidth is complicated. Hu and Steenkiste's paper "Estimating Available Bandwidth Using Packet Pair Probing" is worth looking at for more details.

Routing in ATM Networks

IP can be routed over ATM using an ATM Adaptation Layer (generally AAL5), which is then carried as usual on circuits with VCIs. To set up a VCI, the ATM network must ascertain what switches must be traversed in order to reach the final IP destination.

ATM routing itself is generally done using the Private Network-Network Interface (PNNI). This is a hierarchical summarisation protocol, that assigns each ATM switch to a peer group. A field similar to IP's subnet mask enables the peer group ID to be derived from the ATM switch's node ID. The lower the peer group ID, the higher in the hierarchy it is. Routing is similar to that used in the Open Shortest Path First (OSPF) protocol.
Cisco's explanation of PNNI may be useful.

When a VCI is to be established, the IP address must be resolved to an ATM node ID. This takes place in a similar fashion to the ARP on ethernet, where if the destionation node is within the same Local IP Subnet (LIS), then a request is sent to the network to obtain the destination node ID. Because ATM has no notion of broadcast, a dedicated server must exist per LIS that will resolve IP addresses to ATM node IDs. The protocol for these requests is ATMARP (ATM Address Resolution Protocol), along with inATMARP (inverse ATMARP). If the destination is not on the LIS, then there must be a dedicated IP router that connects two or more LISs that can route the call accordingly.
RFC 1577 describes this in more detail.

The 3*MSS in Fast Recovery

If a sender receives 3 duplicate ACKs it assumes packet loss has taken place (< 3 duplicates might just have been caused by re-ordering, as if packet i+1 arrives before packet i, we can only give another ACK for packet i-1, until we receive packet i, at which point we can ACK packet i+1 immediately). Three is an arbitrary number that works well (!).
If 3 duplicate ACKs are received, fast retransmit kicks in, and resends packet i. We decrease cwin to (swin/2 + 3). The +3 inflates cwin to take into account the three packets that the duplicate ACKs were sent in response to, that have already arrived at the destination. (see RFC 2001).
If on resend we get another duplicate ACK, we increment cwin by one, again accounting for the packet of data that was just resent having reached the destination. The key point to realise is that cwin is the number of packets that we are allowed on the wire "unacknowledged". A duplicate ACK does not count as a "real" acknowledgement, hence under normal circumstances receiving a duplicate ACK would not increment cwin. To allow another packet to be sent out onto the link (another restransmission), we need to increment cwin once more. Whilst this seems like slow start, where cwin is incremented by one MSS per ACK, remember that the number of packets being sent per RTT is only one: we only ever have "space" in cwin for one more unacknowledged packet to be sent out on the wire. Hence the increase is linear (cwin = cwin + MSS per RTT).
When we get a real ACK (i.e. we've now resent what we needed to), we go back to the normal AIMD algorithm, and start off from cwin = ssthresh (where ssthresh was set to the old swin/2), with additive increase. This is because the increases in cwin that we have been making with duplicate ACKs were not actually telling us how congested the link was: we were only (re)sending one packet per RTT.

Bear in mind that if the cwin is too small or the loss it too large you may not have 3 packets to use to re-synchronise your link's clocking, and therefore you will need to decrease your cwin to 1 MSS and slow start up to ssthresh.

The Need for Duplicate ACKs to "Clock the Link"

When fast recovery takes place, the congestion window (cwin) size decreases to half of what the previous send window was, plus 3*MSS (as mentioned above). The slow start threshold (ssthresh) is also set to half the previous send window, i.e. once the missing packet(s) has/have been transmitted, the connection will already be above the threshold below which slow start takes place, and instead will continue linear increase, probing the link to find at what size of cwin the congestion begins.

By only descending to the previous swin/2, the assumption is being made that the congestion is not significant enough to cause packet loss with this low a cwin. This assumption can be justified if 3 duplicate ACKs are received, as this indicates a transient packet loss, rather than a sudden and long term change in the congestion of the link. If the duplicate ACKs are not received, it is more likely that the link has suddenly become severely congested compared to what it was. Therefore we must assume that the new value of cwin that we must now find could be much lower than the previous swin/2, hence we set cwin = MSS.

Another interesting point stemming from this is that we also set cwin = MSS if a TCP connection has not been used for a significant period of time. Were we to use the old value of cwin we might find that the congestion conditions had substantially changed, hence we "start from scratch" and slow start up.

The usage of the duplicate ACKs to infer whether the congestion on the link has changed substantially is "clocking the link". The origin of this is the fact that the sender is aware of when to put another packet out onto the link by when an ACK is received from the destination. Hence the ACKs "clock" the link by signalling the sender.

Switch Utilisation -- Karol 1987 Paper

The 58% value for the switch utilisation is calculated assuming that the number of switch ports (N) is infinite. An Appendix in the paper (claims to!) show(s) that the steady-state rate of arrival of new packets at input queues which are not blocked on waiting for the switch fabric, becomes Poisson-like. This enables an expression to be derived for the average number of packets that are waiting in each queue to pass through the fabric, in terms of the throughput.

We then note that F, the mean steady-state number of input queues that are free (i.e. not blocked on waiting for the fabric), divided by N, gives the throughput (p).
Using these two equations we can show that:
1 - p = (p^2)/2(1-p)
which solves to give
p = 2 - sqrt(2) = 0.586, i.e. about 58%.

If you're really interested I've got a printed copy of the paper (the DTG has an online subscription the relevant digital library).

Explicit Congestion Notification (ECN)

Explicit Congestion Notification (ECN) is used by routers to indicate to transmitters that they are causing congestion on the network. The mechanism is that a packet or packets on the offending flow have their ECN bit set by the router as they pass through it. When these packets reach the receiver, the ACK that is sent back also has the ECN bit set.

In the DECbit version of ECN (now obsolete), if the transmitter received more than 50% of its ACKs with the ECN bit set, it would perform multiplicative decrease of its congestion window (i.e. react in the same way as it would do to a packet loss when using fast retransmit and fast recovery).

In the newer version of ECN, (RFC 3168), this is changed slightly in that a router need only set the ECN bit in a single packet (and hence, the receiver, in a single ACK), to cause the transmitter to perform multiplicative decrease.

The advantage to ECN over timeouts is that transmitters are notified much more quickly of their causing congestion, whilst not actually requiring a wait for a timeout, or retransmissions.

CSMA/CD vs CSMA/CA

CSMA/CD is used on Ethernet networks (or similar). We can detect any collisions because all the receivers are on the same wire as the transmitters.
CSMA/CA cannot do this as the receivers may suffer a collision due to the hidden terminal problem.
CSMA/CD senses the medium and then immediately transmits. If it detects a collision, it backs off for a random period.
CSMA/CA senses the medium, then waits a random number of slots, then senses again. This avoids the problem of all stations waiting until the medium is free, and then immediately all transmitting (i.e. colliding).
CSMA/CA therefore introduces slight delays, because of this extra random wait time.
We can detect collisions on both by means of listening CSMA/CD, and by the lack of an ACK in CSMA/CA.

RTS/CTS on Wireless Networks

RTS/CTS on WLAN is separate from CSMA/CA. The former can be turned off, the latter cannot.
RTS/CTS allows a station to reserve the medium for a certain amount of time. If a station hears a CTS, (and not the RTS), it still does not transmit for the time period given in the CTS. Hence, RTS/CTS mostly solves the hidden node problem, whereas CSMA/CA suffers from it.

Crossbar Switches and Clos Networks

In a crossbar switch, each input has a wire which has an interconnection (cross point) with every other output. A diagram essentially looks like n input wires running vertically crossing n output wires that run horizontally. At any one time one cross point on any one input wire (and any one output wire) can be active. This means that the switch is fully connected, but evidently does block when (e.g.) two inputs try to send a packet to the same output simultaneously.

In switches with a large number of inputs and outputs, the number of cross points becomes very large, and hence they take up a significant amount of space/power. A Clos network is an arrangement of crossbar switches (crossbars) that requires fewer cross points but still achieves fully-connectedness. Evidently, the are still subject to blocking. Wikipedia's article on Clos networks explains this further.

Banyan Switches

In contrast to crossbar switches, Banyan switches (so called because they look a little like the paths made by the roots of a Banyan tree) use Beta elements. These consist of two inputs and two outputs, with a cross connect between them. The cross connect can either be set in the cross or bar configurations, i.e. either connecting input 0 to output 0, and input 1 to output 1, or input 0 to output 1, and input 1 to output 0. Some buffer memory must also be present for packets arriving at the inputs that do not fit the current state of the cross connect. The switch is non-blocking, so we do require buffer memory. This can be solved somewhat by using a Batcher stage before the Banyan switch (see below). For information on how the Beta elements are connected together, see the Packet Switching page of the networks course at Tel Aviv University (warning: Java applet present).

Batcher-Banyan Switches

The problem with a simple Banyan switch is that we get head of line blocking. Without VOQs if two inputs contend for a particular path through the switch, one of them must block.
To solve this, we use a Batcher stage before the Banyan fabric. This sorts all packets that are at the heads of all the queues.
In each element of the Batcher switch, we examine the (maximum of) two packet output port destinations. The lower output port ID is sent down the output connection of that Batcher stage that is in the opposite direction to the way the arrow for that stage points.
This results in packets for output ports that have higher IDs heading in the directions of the arrows.
The point of the Batcher stage is that, for n inputs, provided there are n packets heading to n different outputs (i.e. there can only be one packet heading to each output), we can sort the packets such that all n can pass simultaneously through the subsequent Banyan switch.
Why does this work? The first stage of a Batcher network simply sorts pairwise. The next stage does a 4-way merge, followed then by an 8-way merge. We end up with the packets sorted in order of output port number, which we can then pass to the Banyan switch.

Knockout Switches

All input ports connected to all output concentrators using a broadcast bus, i.e. there is one wire for each input, with one crosspoint on that line per output. Hence multiple inputs can send to the same output concentrator simultaneously.
Key point is to have a fully connected switch that uses a concentrator for each output port in order to determine which packets should continue to the output.
In the concentrator, packets go through a series of "rounds", e.g. quarter finals, semifinals, finals. These determine which packets get through to the output buffer. Note that there are fewer output buffers than input ports (for a fully connected switch with only output buffering normally you would need one buffer for each input port at every output port, just in case). Hence, if more packets arrive than there are buffers, some of them are discarded (and re-sent by the upper layers of the fabric).
Left on its own, the concentrator would tend to fill up the left-most buffers with packets, and begin to discard some, whilst having space in the right-hand buffers. This is why after the concentrator we also have a shifter, which spreads the packets out. In this way, if the last packet is placed in buffer m, the next batch will be placed in buffers starting at m+1, etc., modulo L (the number of packet buffers there are).

Using Multiple Hash Tables for Longest Prefix Matching

Also known as "How do I do 2004/8/3 part (b)(iii)?"
Essentially, you just need to have a hash table per prefix length. This can be further optimised, as explained in this rather nice paper on using hash tables for routing at SIGCOMM 1997.