Communication Course - Ex3 Description

Last updated: 20/12/04 (added an explanation section of the clock ticks)

Exercise 3: Simulating a Shortest Path Routing Algorithm

In this JAVA exercise you will simulate a Link State routing algorithm. You will implement a simplified but real version of the Shortest Path First (SPF) Routing Algorithm. Routers are simulated by processes. Router processes communicate using UDP.

Introduction:

The network layer is concerned with what routes to choose in order to get a piece of data from the source to the destination. The working horse of the network layer is the router who does the actual decision on which path to transmit the data. Giving the routers "good" distributed algorithms to follow is thus crucial in making networks work smoothly. One class of routing algorithms are the Link State Protocols who maintain a certain degree of global information about connectivity and link costs.

In this exercise you will use the Dijkstra Shortest Path algorithm and implement it as a dynamic link state routing algorithm. To make the algorithm real-life it will need to adapt to the dynamics of the network. You will thus need to regularly run the shortest path algorithm taking into consideration changing link costs and faulty routers that may be temporarily down and then recover or send corrupt messages. The algorithm will need to prevent (up to a certain degree) routing oscillations.

As we want you to finish the exercise before you finish your degree we will describe the failure detection methods and the program & data structures you should use.

The program should be written in JAVA.

The Nodes:

Each node (router) is implemented by a process. You should write a program called spf.java that implements a node.
The program synopsis (you can assume the input is correct):

java spf <node name> <config file> <clock rate>

The node name is a single uppercase letter. It tells the program which node in the network it is.
The config file describes the structure of the network. Each line in the file describes either a node or a link.
Node lines are of the format:
- NODE node-name host port
  This line indicates that node node-name runs on the specified host, and receives messages on the specified UDP port.
Link lines are of the format:
- LINK Node1 Node2
  The line specifies that there is a link between Node1 and Node2 and vice versa. Empty lines, and white spaces (blanks and tabs) in the input file should be ignored.
  The program should use a configuration file to know to which nodes it should contact and where to find them.
  You can find an example of a configuration file in: config.
The clock rate argument specifies the number of seconds between successive clock ticks. This mechanism avoids network oscillations. If the number is 0, the program should not generate clock ticks. It will get clock ticks from its neighbors via a 'clock tick' message. Clock tick messages should be ignored if the clock rate is not 0.

The Router's Data Structures:

The network is represented by a directed graph: there exists a link (X,Y) iff there exists a link (Y,X), but the costs may be different. Each router in the network is represented by a node in the graph; and each network link is represented by an arc in the graph. Each node in the graph is labeled by the corresponding router name. Thus, each network link L that connects a router X with a router Y (i.e., X is a predecessor of Y in the graph) is uniquely identified through the pair (X, Y).

Each router in the network maintains the following data structure:

My_name: is the router's name;
For each outgoing link L:

State(L) is either DOWN or UP;
Counter(L) is an integer number;

LSDB: is a Link State (LS) Database representing the router's view of the network topology and the network link costs. LSDB consists of LS records (LSrecs) as defined below.

LSrec is a 5-tuple:

origR;
adjR;
counter;
cost;
age.

Let LS be an LSrec. Thus, LS is interpreted as the status of the specific network link identified by the pair of adjacent routers (LS.origR, LS.adjR).

The Router Operation at a Glance:

The router operation can be summarized as follows:

Maintaining routing information;
Aging LS records in the LSDB;
Failure detection of outgoing links;

Maintaining Routing Information

This includes the following:

Periodically update each LS record, ls, in the LSDB, such that ls.origR = My_name. This includes updating ls.cost, incrementing ls.counter and setting ls.age to be MAX_AGE. Originate a new LS message with the updated LS record in the body;

get_link_cost

Upon receiving an LSrec, ls, corresponding to a link L, update the LSDB with ls if LSDB either doesn't contain an LSrec for L or ls is newer (according to its counter field value) than the current LSDB's version of LSrec for L. If the LSDB was updated, flood ls out of all the router's outgoing links whose state is UP, excluding the link from which ls was received.

Aging LS records in the LSDB

Periodically decrement the age field of each LSrec in the LSDB. Purge all LSrecs whose age field reached 0 from the LSDB. The aging mechanism helps to achieve the following aims:

Router recovery after the crash: the recovered router sleeps for a while before it starts to send new LSrecs. This allows other routers to get rid of that router's old LSrecs;
Minimizing influence of corrupted LSrecs on the route calculation: It sometimes happens that a router receives an LSrec which pretends to be an LSrec of some dead router. Such LSrec can be very harmful for the routing algorithm. The aging mechanism guarantees that such an LSrec will eventually be purged from the LSDB thus affecting route computation only for a limited time interval.

Failure detection of outgoing links

Each router employs a simple failure detection mechanism in order to discover the outgoing links which are not operational anymore. Whenever a link L is found to be faulty, the router informs other routers about the topology change by flooding an LSrec with 0xffffffff (infinite cost) in its cost field.

The failure detection mechanism is as follows: periodically each router sends a HELLO message out of each of the router's outgoing links (including those which are currently DOWN). If the router does not receive a HELLO message from some of its UP neighbors during a predefined time interval, the link leading to this neighbor is declared DOWN.

In this exercise, to make the things simpler, we will separate between the LS message flooding and the failure detection: i.e., LS messages coming from the DOWN links are ignored and newly originated and incoming LS messages are not flooded out of the DOWN links.

A previously DOWN link becomes UP whenever a HELLO message from the router at the opposite end of this link is received.

The (event driven) Router Operation in Detail:

Startup
Handling Timer Events
Handling Input Events
Handling Failure Detection Events

Startup

Upon startup each router should perform the steps listed below:

Initialize the router's data structure as follows:

Set My_name and outgoing links according to the configuration file;
For each outgoing link L = (My_name, R):

State(L) = DOWN;
Counter(L) = 0;

Do not send new LSrecs for MAX_AGE * AGING_INTERVAL ticks.

This allows other routers to extinct this router's old LS records (if such exist) from their LSDBs as explained above (see aging mechanism).

Handling Timer Events

Every HELLO_INTERVAL:

Send a HELLO message out of each outgoing links (including those whose state is DOWN);

Every LS_REFRESH_INTERVAL:

For each outgoing link L:

Create a new LSrec, lsnew, such that:

If State(L) = UP:

lsnew.cost = get_link_cost(L);

Else

lsnew.cost = 0xffffffff;

(lsnew.origR, lsnew.adjR) = L;
lsnew.counter = Counter(L) + 1;
lsnew.age = MAX_AGE;

Put lsnew into the LSDB, overriding the previous LSrec for L if such exists;
Create a new LS message, lsm, carrying lsnew;
Send lsm out of each outgoing link L' such that State(L') = UP;

Every AGING_INTERVAL:

For each LSrec , ls, in the LSDB corresponding to non-outgoing (non-local) link:

ls.age := ls.age - 1;
If ls.age = 0:

Purge ls from the LSDB;

Every ROUTE_REFRESH_INTERVAL:

Output the LSDB (see Output section below)
Recalculate routes using the LSDB as an input to the Dijkstra Shortest Path algorithm;
Output the routing information (see Output section below);

Handling Input Events

Upon receiving a HELLO message from a link L:

If State(L) = DOWN:

State(L) := UP
Create a new LSrec, lsnew, such that:

(lsnew.origR, lsnew.adjR) = L;
lsnew.counter = Counter(L) + 1;
lsnew.cost = get_link_cost(L);
lsnew.age = MAX_AGE;

Put lsnew into the LSDB, overriding the previous LSrec for L if such exists;
Create a new LS message, lsm, carrying lsnew;
Send lsm out of each outgoing link L' such that State(L') = UP;

Upon receiving an LS message lsm carrying an LSrec ls from a link L:

If State(L) = UP:

If there exist an LSrec, ls', such that (ls'.origR, ls'.adjR) = (ls.origR, ls.adjR) in the LSDB:

Retrieve ls' from the LSDB;

Else:

Let ls' := NULL;

If ls' = NULL or ls'.count < ls.count:

Replace ls' with ls in the LSDB;
Send lsm out of each outgoing link L' such that State(L') = UP and L' != L;

Handling Failure Detection Events

If no HELLO messages have been received from an outgoing link L for MAX_HELLOS * HELLO_INTERVAL time interval:

State(L) := DOWN;
Create a new LSrec, lsnew, such that:

(lsnew.origR, lsnew.adjR) = L;
lsnew.counter = Counter(L) + 1;
lsnew.cost = 0xffffffff;
lsnew.age = MAX_AGE;

Put lsnew into the LSDB, overriding the previous LSrec for L if such exists;
Create a new LS message, lsm, carrying lsnew;
Send lsm out of each outgoing link L' such that State(L') = UP;

Message Formats and Time Intervals:

This section defines the format of all the messages a node should know how to handle. All the numbers should be sent as unsigned int. The first 4 bytes of each message are the message type.

 value | Description
--------+----------------------------
   1     |  HELLO message.
   2     |  LS message.
   3     |  A clock tick message.

HELLO messages are used to implement the failure detection mechanism (see above). A HELLO message consists of the message type (always 1) and the sending router name:

    class _HELLOmessage 
    {
        int type;     /* always 1 */
        int router;
    }

An LS message consists of the type (always 2), the name of an adjacent router which flooded this LS message to the receiving router, and the LSrec corresponding to a link identified by the (from_router, to_router) pair:

    class _LSmessage 
    {
        int type;     /* always 2 */
        int neighbor;
        int from_router;
        int to_router;
        int counter;
        int cost;
        int age;
    }

A clock tick message indicates to a node that the time has arrived to send delay updates to all its neighbors. The message consists of the type only. Note that clock ticks are not synchronized between the nodes, and the clock on different nodes may run at different rates. The body of the clock tick message is empty.

     /* The TICK message format */
    class _TICKmessage
    {
        int type; /* always 3 */
    }

Time intervals measured in ticks and all the mentioned above constants are defined in the file spf_consts.java. You should NOT submit this file.

Handling Counter Wraparounds:

Note that the number of the counter bits in the LSmessage structure is bounded by the COUNTER_WIDTH constant. This means that in this exercise you are required to cope with the counter wraparounds when comparing counter values. This counter value comparison is done according to the following rules:

The counter's value 'a' is older than the counter's value 'b' if

Getting Link Costs:

To get the link cost, your program should call the function get_link_cost. The function accepts two char arguments. The first is the router's name, and the second is the adjacent router name. The get_link_cost function should be taken from the file spf_consts.java. You must not use any assumptions on the cost function in the solution, for example knowing that the cost function always return X for a certain link CANNOT be used just because we implemented a simple cost function. In real life the the cost would be a function of physical distance, estimated RTT, up-time, estimated probability of packet loss, bandwidth, queuing delay, average traffic, etc.

The return value of get_link_cost is the cost of the link. The cost is a number in the range 1-100 if the link is UP, and 0xffffffff if the link is DOWN. Treat the function as if successive calls to the function might return different results.

Explaining The Clock Ticks:

It is easier to program systems when you think about them in an event driven manner. What drives this program are basic time ** clock ticks ** events which in turn are used to drive higher level events (e.g.event of sending a message every HELLO_INTERVAL).

When executing the program you decide on what time scale the clock ticks are, i.e. every 5 second, 10 seconds, 37 seconds, etc. (through the 'clock rate' parameter to the program). This makes the program more scalable as the relative timeliness between the different events is important. I.e. it makes sense that the LS_REFRERSH_INTERVAL is greater than the HELLO_INTERVAL (lets say 5 times longer). Otherwise it would be funny to refresh the db if you had no event in between which could teach you that something has changed
=> thus the relative timeliness between the events is of logical importance.

So why don't we just hardcode that the LS_REFRESH_INTERVAL (for example) is 50 seconds and the HELLO_INTERVAL is 5 seconds?

Imagine you had the actual time scale hard coded (not just relative to the other event but relative to real-time). Now lets say somebody installed 100 of your router (with the SPF program that you are writing embedded into the router) in their network. Imagine if this happens to be a specially slow network with very high RTTs of 10 seconds or more. If you generate a HELLO message every 5 seconds you would very fast make the network congested. On the other hand if this was a very fast network (RTT of 0.01 msec for example) then recalculating the route every 5 min could make a lot of packets be sent through bad or non existing obsolete routes.

Thus the intervals need to be not just relative to each other but also to the time scale of the network. Without supplying the time scale as a parameter then you would have to write and compile a different program for every network in which your router is going to be installed.

Additionally, supplying the time scale as a parameter also introduces greater variability among the routers as they could all be run with different time-scales. This relaxes the network oscillation problem.

So why can you supply a zero time scale which will cause the router to use the ticks of neighboring routers?
Imagine you have a router which is a very central and happens to be very busy. It makes sense to make this router run faster than its neighbors. That way if the route costs are a function of transmission rate/congestion/network load then it will be able to faster recalculate the routes and send the updates to the neighbors. Running this router with a zero parameter means it will run at least as fast as the fastest neighbor. Additionally, it will not have to generate and send TICK messages into the network which takes of a little load.

Of course you could abuse this and run all the routers with a zero parameter and nothing would move in the network. You could also try to run your car on water. But would you do that? This is one of the reasons why correctly configuring routers is an important but not easy task.

If the time scale parameter is zero, then your clock ticks advances with every tick message you get. Yes, that means you will run faster than all your neighbors.
Maybe as fast as the superposed frequency of them all. If they all happen to be very fast then you could get a crazy rate, but that means your router is not configured correctly. Having such a router makes sense if it connects between a lot of slow routers with heavy load, not if it connects a lot of very fast routers with hardly any packets to transmit.

Output:

To enable us to evaluate your program, the program should print to standard output some messages when certain events occur. The standard output should be flushed (use non buffered I/O or flush the output).

The messages we are expecting to see are:

Event Message

Starting a new clock tick TICK

received hello message HELLO FROM [neighbor_router_name]

received LS message LS FROM [node] FOR LINK [src node] [dest node] [counter] [cost] [age]

the link to a node becomes UP UP [neighbor_router_name]

the link to a node becomes DOWN DOWN [neighbor_router_name]

Every ROUTE_REFRESH_INTERVAL LSDB
[src node] [adj node] [counter] [cost] [age]
...
[src node] [adj node] [counter] [cost] [age]

Every ROUTE_REFRESH_INTERVAL ROUTE [destination node] [next hop] [cost]
ROUTE [destination node] [next hop] [cost]
...
ROUTE [destination node] [next hop] [cost]

Every ROUTE_REFRESH_INTERVAL, all LSrecs currently in the LSDB should be output. The LSrecs in the summary should be output ordered lexicographically by (orig_router_name, adj_router_name) pair. For example, if the LSDB contains LSrecs for links (A, B), (A, X), (C, B), (B, A), (C, A), the output can look like:

Note, you print INFINITY instead of 0xffffffff (infinite cost).
The routing information (i.e., ROUTE ...) should be output ordered alphabetically according to the destination node name.

Note, you print UNREACHABLE for a next hop, if there is no path to corresponding destination.

Error Handling:

When receiving a message of an illegal message type from a neighbor, you should issue an error message, but you must continue running.

In general, your program should be able to handle erroneous input. You do not have to recognize all possible input errors (e.g. there is no need to check that the graph in the configuration file is connected), but you may not crash on such an error. As a rule of a thumb, if avoiding a check for a specific error might result in "bad" behavior, test for the error. Bad behavior includes program crash, sending illegal messages to other nodes, infinite loops, sending too many messages to nodes, "forgetting" to send messages, etc.

General:

Note that the data structure and the pseudocode given above is only a general algorithm description. Feel free to use any additional/alternative data structures/functions/procedures/methods etc, according to your needs.

Don't use the example config we supplied without modifying it since if all of you execute concurrently with the same node names and ports at the same computers then...

Policies, Tips and Gradings:

Try to start to work early. A smart design is very important for a correct implementation of this exercise. Don't rush to begin coding before you're sure that you fully understand all the subtle points.
It's highly desirable that you'll write your code in a strict event-driven manner.
This exercise is not a light one; working on a big exercise with a partner can lead to unnecessary tension. DO NOT QUARREL OVER THE EXERCISE. It is practically impossible that two partners do EXACTLY the same amount of work (though the well-known horse-and-rider model should be avoided). Besides, there are more important things in this world than this exercise (e.g. the next exercise...). If you are not getting along with your partner in such a manner that it seriously hurts the writing then you can send me an email (which will be kept STRICTLY confidential). This will not hurt any of the partners, it is just intended to help you.
The grading of this exercise will be as follows:

� 85%: Correctness.

� 10%: Code structure.

� 5%: Submission.

� Up to 5 bonus points for exceptionally readable or well structured code. Up to 15 bonus points for above standard solution (e.g. a more intelligent and life-like cost function but without introducing new problems - must be well documented; a more efficient shortest path algorithm; etc.).

com1@cs.huji.ac.il

Event	Message
Starting a new clock tick	TICK
received hello message	HELLO FROM [neighbor_router_name]
received LS message	LS FROM [node] FOR LINK [src node] [dest node] [counter] [cost] [age]
the link to a node becomes UP	UP [neighbor_router_name]
the link to a node becomes DOWN	DOWN [neighbor_router_name]
Every ROUTE_REFRESH_INTERVAL	LSDB [src node] [adj node] [counter] [cost] [age] ... [src node] [adj node] [counter] [cost] [age]
Every ROUTE_REFRESH_INTERVAL	ROUTE [destination node] [next hop] [cost] ROUTE [destination node] [next hop] [cost] ... ROUTE [destination node] [next hop] [cost]