Exercise 4

How File System Structure Affects Performance

Subject: File System Structure.

Due Date: 20.5.2010

Assignment

In this exercise you will examine the effect of file system structure on performance, by comparing the iNode structure, used in UNIX-like systems, and the linked-list structure, used in the FAT32 file system. For this sake you will implement library functions that calculate the used space and the average number of accesses to disk in a typical file system, using data on file sizes collected from a real computer system.

You will then use your implementation to plot graphs that show how these parameters change when you change the block size that the file system employs, and compare the performance of the two file systems.

The Library

You will implement a C++ library with one function. The public interface of the library is ~os/lib/fileperf.hh. You should include this header file from its original location, and implement the function:

int fileSystemPerformance(unsigned int blockSize, unsigned int pointerSize, unsigned int maximalSize, storageType type, const std::vector fileSizes, outputStruct *output);

This function receives as input parameters that determine the storage method of files on disk, and a list of files that are currently in the file system. The function returns (using the 'output' parameter) the space used by the file system and the average number of disk accesses to access data on this file system. In the following, all sizes are in bytes.

Input Parameters:

blockSize: The size of a single block on disk.
pointerSize: The size of one pointer to a block on disk.
maximalSize: The size of the largest file this storage method needs to support.
type: The type of file storage to use: FS_INODE for iNode structure, FS_LINKED_LIST for linked list structure.
fileSizes: The sizes of all the files that should be 'stored' on disk.

Return Value:

The return value of the function should be 0 upon success, and -1 if any of the parameters to the functions are illegal, or if any of the files of the given fileSizes cannot be stored according to the given parameters. A file of size zero (an empty file) is legal.

Output:

The function returns its output using the pointer 'output', which is the last parameter of this function. This pointer points to a struct defined as follows:

struct outputStruct {
long unsigned int space;
double averageAccessNum;
};

When the function returns, the fields of the 'output' structure should be filled as follows:

'space' should be equal to the number of bytes on disk that are used to store the files with the given fileSizes, using the given file storage parameters. The exact calculation of the space depends on the storage type (inodes or linked list), which will be explained in the next sections. The space should be in bytes, and it should include all the bytes in blocks that are used for the storage of the files (even if the blocks are only partially used). An empty file uses only a header block.
'averageAccessNum' is the average number of block accesses it takes to access a byte of data from one of the files. The average is calculated by first calculating the average block access number for each file separately, and then averaging this number over all the non-empty files in the system.
- The average block access number for a specific file is calculated by taking the number of block accesses required to access a specific byte in the file, and then averaging over all the bytes in the file. For example: suppose the file is of size 3, and suppose that it takes one block access to access the first byte of the file, and 2 block accesses to access the second or the third byte of the file. Then the average number of accesses for this file is (1+2+2)/3=1.66666.
- The number of block accesses to a specific byte in the file is defined as the number of different blocks that need to be accessed on disk to find this byte and access it, starting from the header block of the file.
- Example for how to average over several files:Say we have files a,b and c. File a has 10 bytes, and the average access number to this file is 2.3 (this number was calculated by averaging over all the bytes in the file). File b has 20 bytes, and the average access number to this file is 1.2. This number was calculated as for file b. File c is empty - it has no average access number because it has no bytes.
  The overall average access number is calculated only on the non-empty files, and every file gets the same weight, regardless of the number of bytes it has. Therefore the average in this example is (2.3 + 1.2)/2.

Handling illegal input
The function should return -1 for any illegal input.

If the file system parameters do not allow supporting the given maximalSize, this is illegal (regardless of whether such a file exists in the list of files itself).
If any of the files in the list of files to store is more than the maximalSize, the input is considered illegal and nothing needs to be calculated.
A file of size 0 is legal.
The pointer size should not be considered as limiting the number of blocks in the system (though it would be the case in real systems), so regardless of the pointer size, you should assume the number of blocks available to the system is not bounded.

Linked List Structure

The linked list structure for each file is as follows:

The header block: Contains a fixed-size header, data from the file, and a pointer to the next data block. If there is no additional block of data, this pointer contains NULL.
Data blocks: Contain file data, and a pointer to the next block of file data. Here too, if there is no additional block of data, this pointer contains NULL.

The amount of file data in each block is the maximal possible amount, based on the block size and the other fields in that block. For the sake of this exercise, we will assume the fixed-size header has zero size.

Example

Suppose that the block size = 30, pointer size = 2, and the file size is 46. Then storing the file will cost two blocks:

The header block:
- 28 bytes of the file data,
- 2 bytes for the pointer to the next block.
One data block:
- 18 bytes of the file data,
- 10 unused bytes,
- 2 bytes for the pointer to the next block.

The space used by the file is therefore 30*2 = 60. The number of block accesses to bytes in the file will be 1 for the first 28 bytes, and 2 for the last 18 bytes of the file. Therefore the average number of block acceses for this file is (28*1+18*2)/46 = 1.39... .

iNode Structure

The iNode structure is similar to the one described in the Tirgul class:

iNodes

There are three types of blocks:

The header block: Contains a fixed-size header, several direct pointers to data blocks, and several indirect pointers.

The number of direct pointers should be the maximal possible number that can fit in the block, after the space for the header and the indirect pointers have been allocated. A direct pointer points to a data block.
There must be at least one direct pointer in the header block.
There is exactly one indirect pointer of every level in the header block. An indirect pointer of level i points to a pointer block of level i-1. You should decide how many levels are needed, based on the maximal file size that the file system should support. Note that if there are more levels, there are more indirect pointers, so there is less room for direct pointers in the header block. You should calculate the minimal number of levels that are needed based on all these considerations.
If it is not possible to store the maximal file size while keeping all the constraints, such as a single indirect pointer of each level and at least one direct pointer of level 1, then the input to the function is considered illegal and the function should return -1.

Pointer blocks: A pointer block holds pointers to other blocks. The number of pointers in the block is the maximal that can fit in the block size. A pointer block of level i, where i>=1, holds pointers to pointer blocks of level i-1. A pointer block of level 0 holds pointers to data blocks (i.e. direct pointers).
Data blocks: Contains only data.

For the sake of this exercise, we will assume the fixed-size header has zero size.

Example

Suppose that the block size = 12, the pointer size = 4, the maximal file size is 52, and we need to store a single file of size 30.

First we need to decide how many direct and indirect pointers there should be in a header block in this file system. If we use zero indirect pointers, then we will have room for 12/4 = 3 direct pointers in the header block. This allows us to use a maximum of 3 data blocks for a file, so the maximal file size would be 36. Therefore we need more than zero indirect pointers. Using one indirect pointer will be enough to store a file of size 52, therefore we will have in each header block one level 1 indirect pointer, and 2 direct pointers.

To store a file of size 30, we will need 3 data blocks. Therefore we will use the two direct pointers in the header block, and the indirect level 1 pointer as well. The indirect pointer will point to a block of direct pointers, and one of them will point to the third data block. Therefore we use 5 blocks for this file: the header block, one block of direct pointers, and three data blocks, so the space we spend for this file is 12*5 = 60 bytes.

The average number of block accesses of this file is calculated as follows: To access bytes 1-24 of the file we need to access the header block and the relevant data block, so we have 2 block accesses. To access bytes 25-30 we need to access the header block, the direct pointers block and the data block, so there are 3 block accesses. Therefore, the average access number for this file is (24*2+6*3)/30 = 2.2.

Graphs

After implementing the library function above, you will use this function to generate graphs that show the space and the average access number that will be used for a specific list of file sizes, as a function of the block size.

The file sizes are given in the file ~os/lib/fileSizes.txt. This file lists sizes of files in the recoded computer system.

Generate, using the library function you implemented, two graphs:

The used space as a function of the block size.
The average number of accesses as a function of the block size.

Each graph should have two lines - one for a linked list and the other for an iNode structure. Assume a pointer size of 4, and a maximal file size of 10^8. You should decide on the block sizes to plot: Make sure the range is large enough to show all the interesting phenomena. You can use either normal scale or log scale for each of the axes in each of the graphs. Do not forget to annotate your graphs: state the meaning of each line, the meaning and units of each axis, and whether it is normal scale or log scale. The data for the graphs should be generated by a program you will write, that uses your library function. You may use any graph software to plot the resulting graphs.

You do not need to submit the program that generated the graphs. You will submit only the text files, containing the data which your program generated. The text file for each graph should have three columns, separated by spaces, where each line contains: block size, value for linked list, and value for inode structure. The lines should be sorted by block size.

Questions

Answer the following questions regarding the graphs that you generated:

Describe, in general, the difference in the graphs of space usage between the two storage methods.
If you had to choose which storage method to use based only on the space usage, which method would you choose and why?
Describe, in general, the difference in the graphs of average access number between the two storage methods.
If you had to choose which storage method to use based on both the space usage and the average access number, which method would you choose and why?

Submission

Submit tar file named ex4.tar containing:

README file built according to the course guidelines, that includes answers to the questions above..
Source code for the library (not including fileperf.hh)
Makefile - your makefile should generate a static library file named: libfileperf.a when running 'make' with no arguments.
Images of the two graphs, named space.jpg for the space graph and access.jpg for the second graph.
Text files named space.txt and access.txt, containing the data that you used to generate the graphs.

Do not submit the program that you wrote for generating the graph data.