Communication Course - Ex3 Description

Last updated: 20/12/04 (added an explanation section of the clock ticks)

Exercise 3: Simulating a Shortest Path Routing Algorithm

In this JAVA exercise you will simulate a Link State routing algorithm. You will implement a simplified but real version of the Shortest Path First (SPF) Routing Algorithm. Routers are simulated by processes. Router processes communicate using UDP.


Content:


 

Introduction:

The network layer is concerned with what routes to choose in order to get a piece of data from the source to the destination. The working horse of the network layer is the router who does the actual decision on which path to transmit the data. Giving the routers "good" distributed algorithms to follow is thus crucial in making networks work smoothly. One class of routing algorithms are the Link State Protocols who maintain a certain degree of global  information about connectivity and link costs.

In this exercise you will use the Dijkstra Shortest Path algorithm and implement it as a dynamic link state routing algorithm. To make the algorithm real-life it will need to adapt to the dynamics of the network. You will thus need to regularly run the shortest path algorithm taking into consideration changing link costs and faulty routers that may be temporarily down and then recover or send corrupt messages. The algorithm will need to prevent (up to a certain degree) routing oscillations.

As we want you to finish the exercise before you finish your degree we will describe the failure detection methods and the program & data structures you should use.

The program should be written in JAVA.

The Nodes:

Each node (router) is implemented by a process. You should write a program called spf.java that implements a node.
The program synopsis (you can assume the input is correct):

java spf <node name> <config file> <clock rate>

The Router's Data Structures:

The network is represented by a directed graph: there exists a link (X,Y)  iff  there exists a link (Y,X), but the costs may be different. Each router in the network is represented by a node in the graph; and each network link is represented by an arc in the graph. Each node in the graph is labeled by the corresponding router name. Thus, each network link L that connects a router X with a router Y (i.e., X is a predecessor of Y in the graph) is uniquely identified through the pair (X, Y).

Each router in the network maintains the following data structure:

LSrec is a 5-tuple: Let LS be an LSrec. Thus, LS is interpreted as the status of the specific network link identified by the pair of adjacent routers (LS.origR, LS.adjR).

The Router Operation at a Glance:

The router operation can be summarized as follows:

Maintaining Routing Information

This includes the following:

Aging LS records in the LSDB

Periodically decrement the age field of each LSrec in the LSDB. Purge all LSrecs whose age field reached 0 from the LSDB. The aging mechanism helps to achieve the following aims:

Failure detection of outgoing links

Each router employs a simple failure detection mechanism in order to discover the outgoing links which are not operational anymore. Whenever a link L is found to be faulty, the router informs other routers about the topology change by flooding an LSrec with 0xffffffff (infinite cost) in its cost field.

The failure detection mechanism is as follows: periodically each router sends a HELLO message out of each of the router's outgoing links (including those which are currently DOWN). If the router does not receive a HELLO message from some of its UP neighbors during a predefined time interval, the link leading to this neighbor is declared DOWN.

In this exercise, to make the things simpler, we will separate between the LS message flooding and the failure detection: i.e., LS messages coming from the DOWN links are ignored and newly originated and incoming LS messages are not flooded out of the DOWN links.

A previously DOWN link becomes UP whenever a HELLO message from the router at the opposite end of this link is received.

The (event driven) Router Operation in Detail:

Startup

Upon startup each router should perform the steps listed below:

Handling Timer Events

Every HELLO_INTERVAL: Every LS_REFRESH_INTERVAL: Every AGING_INTERVAL: Every ROUTE_REFRESH_INTERVAL:

Handling Input Events

Upon receiving a HELLO message from a link L: Upon receiving an LS message lsm carrying an LSrec ls from a link L:

Handling Failure Detection Events

If no HELLO messages have been received from an outgoing link L for MAX_HELLOS * HELLO_INTERVAL time interval:

Message Formats and Time Intervals:

This section defines the format of all the messages a node should know how to handle. All the numbers should be sent as unsigned int. The first 4 bytes of each message are the message type.
 value | Description
--------+----------------------------
   1     |  HELLO message.
   2     |  LS message.
   3     |  A clock tick message.
HELLO messages are used to implement the failure detection mechanism (see above). A HELLO message consists of the message type (always 1) and the sending router name:
    class _HELLOmessage 
    {
        int type;     /* always 1 */
        int router;
    } 
An LS message consists of the type (always 2), the name of an adjacent router which flooded this LS message to the receiving router, and the LSrec corresponding to a link identified by the (from_router, to_router) pair:
    class _LSmessage 
    {
        int type;     /* always 2 */
        int neighbor;
        int from_router;
        int to_router;
        int counter;
        int cost;
        int age;
    } 

A clock tick message indicates to a node that the time has arrived to send delay updates to all its neighbors. The message consists of the type only. Note that clock ticks are not synchronized between the nodes, and the clock on different nodes may run at different rates. The body of the clock tick message is empty.

     /* The TICK message format */
    class  _TICKmessage
    {
        int type; /* always 3 */
    }
 

Time intervals measured in ticks and all the mentioned above constants are defined in the file spf_consts.java. You should NOT submit this file.

Handling Counter Wraparounds:

Note that the number of the counter bits in the LSmessage structure is bounded by the COUNTER_WIDTH constant. This means that in this exercise you are required to cope with the counter wraparounds when comparing counter values. This counter value comparison is done according to the following rules:

The counter's value 'a' is older than the counter's value 'b' if

Getting Link Costs:

To get the link cost, your program should call the function get_link_cost. The function accepts two char arguments. The first is the router's name, and the second is the adjacent router name. The get_link_cost function should be taken from the file spf_consts.java. You must not use any assumptions on the cost function in the solution,  for example knowing that the cost function always return X for a certain link CANNOT be used just because we implemented a simple cost function. In real life the the cost would be a function of physical distance, estimated RTT, up-time, estimated probability of packet loss, bandwidth, queuing delay, average traffic, etc.

The return value of get_link_cost is the cost of the link. The cost is a number in the range 1-100 if the link is UP, and 0xffffffff if the link is DOWN. Treat the function as if successive calls to the function might return different results. 

Explaining The Clock Ticks:

It is easier to program systems when you think about them in an event driven manner. What drives this program are basic time ** clock ticks ** events which in turn are used to drive higher level events (e.g.event of sending a message every HELLO_INTERVAL).

When executing the program you decide on what time scale the clock ticks are, i.e. every 5 second, 10 seconds, 37 seconds, etc. (through the 'clock rate' parameter to the program). This makes the program more scalable as the relative timeliness between the different events is important. I.e. it makes sense that the LS_REFRERSH_INTERVAL is greater than the HELLO_INTERVAL (lets say 5 times longer). Otherwise it would be funny to refresh the db if you had no event in between which could teach you that something has changed
 => thus the relative timeliness between the events is of logical importance.

So why don't we just hardcode that the LS_REFRESH_INTERVAL (for example) is 50 seconds and the HELLO_INTERVAL is 5 seconds?

Imagine you had the actual time scale hard coded (not just relative to the other event but relative to real-time). Now lets say somebody installed 100 of your router (with the SPF program that you are writing embedded into the router) in their network. Imagine if this happens to be a specially slow network with very high RTTs of 10 seconds or more. If you generate a HELLO message every 5 seconds you would very fast make the network congested. On the other hand if this was a very fast network (RTT of 0.01 msec for example) then recalculating the route every 5 min could make a lot of packets be sent through bad or non existing obsolete routes.

Thus the intervals need to be not just relative to each other but also to the time scale of the network. Without supplying the time scale as a parameter then you would have to write and compile a different program for every network in which your router is going to be installed.

Additionally, supplying the time scale as a parameter also introduces greater variability among the routers as they could all be run with different time-scales. This relaxes the network oscillation problem.

So why can you supply a zero time scale which will cause the router to use the ticks of neighboring routers?
Imagine you have a router which is a very central and happens to be very busy. It makes sense to make this router run faster than its neighbors. That way if the route costs are a function of transmission rate/congestion/network load then it will be able to faster recalculate the routes and send the updates to the neighbors. Running this router with a zero parameter means it will run at least as fast as the fastest neighbor. Additionally, it will not have to generate and send TICK messages into the network which takes of a little load.

Of course you could abuse this and run all the routers with a zero parameter and nothing would move in the network. You could also try to run your car on water. But would you do that? This is one of the reasons why correctly configuring routers is an important but not easy task.

If the time scale parameter is zero, then your clock ticks advances with every tick message you get. Yes, that means you will run faster than all your neighbors.
Maybe as fast as the superposed frequency of them all. If they all happen to be very fast then you could get a crazy rate, but that means your router is not configured correctly. Having such a router makes sense if it connects between a lot of slow routers with heavy load, not if it connects a lot of very fast routers with hardly any packets to transmit.

Output:

To enable us to evaluate your program, the program should print to standard output some messages when certain events occur. The standard output should be flushed (use non buffered I/O or flush the output).

The messages we are expecting to see are:

Event
Message
Starting a new clock tick TICK
received hello message HELLO FROM [neighbor_router_name]
received LS message LS FROM [node] FOR LINK [src node] [dest node] [counter] [cost] [age]
the link to a node becomes UP UP [neighbor_router_name]
the link to a node becomes DOWN  DOWN [neighbor_router_name]
Every ROUTE_REFRESH_INTERVAL LSDB
[src node] [adj node] [counter] [cost] [age]
...
[src node] [adj node] [counter] [cost] [age]
Every ROUTE_REFRESH_INTERVAL ROUTE [destination node] [next hop] [cost]
ROUTE [destination node] [next hop] [cost]
...
ROUTE [destination node] [next hop] [cost]

Every ROUTE_REFRESH_INTERVAL, all LSrecs currently in the LSDB should be output. The LSrecs in the summary should be output ordered lexicographically by (orig_router_name, adj_router_name) pair. For example, if the LSDB contains LSrecs for links (A, B), (A, X), (C, B), (B, A), (C, A), the output can look like:

Note, you print INFINITY instead of 0xffffffff (infinite cost).
The routing information (i.e., ROUTE ...) should be output ordered alphabetically according to the destination node name.

Note, you print UNREACHABLE for a next hop, if there is no path to corresponding destination.

Error Handling:

When receiving a message of an illegal message type from a neighbor, you should issue an error message, but you must continue running.

In general, your program should be able to handle erroneous input. You do not have to recognize all possible input errors (e.g. there is no need to check that the graph in the configuration file is connected), but you may not crash on such an error. As a rule of a thumb, if avoiding a check for a specific error might result in "bad" behavior, test for the error. Bad behavior includes program crash, sending illegal messages to other nodes, infinite loops, sending too many messages to nodes, "forgetting" to send messages, etc. 

General:

Note that the data structure and the pseudocode given above is only a general algorithm description. Feel free to use any additional/alternative data structures/functions/procedures/methods etc, according to your needs.

Don't use the example config we supplied without modifying it since if all of you execute concurrently with the same node names and ports at the same computers then...

Policies, Tips and Gradings:

  1. Try to start to work early. A smart design is very important for a correct implementation of this exercise. Don't rush to begin coding before you're sure that you fully understand all the subtle points.
  2. It's highly desirable that you'll write your code in a strict event-driven manner.
  3. This exercise is not a light one; working on a big exercise with a partner can lead to unnecessary tension. DO NOT QUARREL OVER THE EXERCISE. It is practically impossible that two partners do EXACTLY the same amount of work (though the well-known horse-and-rider model should be avoided).  Besides, there are more important things in this world than this exercise (e.g. the next exercise...). If you are not getting along with your partner in such a manner that it seriously hurts the writing then you can send me an email (which will be kept STRICTLY confidential). This will not hurt any of the partners, it is just intended to help you.
  4. The grading of this exercise will be as follows:

ü      85%: Correctness.

ü      10%: Code structure.

ü      5%: Submission.

ü      Up to 5 bonus points for exceptionally readable or well structured code. Up to 15 bonus points for above standard solution (e.g. a more intelligent and life-like cost function but without introducing new problems - must be well documented; a more efficient shortest path algorithm; etc.).

 

 



Start of Page    To the Communication Course Home Page
com1@cs.huji.ac.il