Matlab - advanced lesson 1

The Matlab language integrates computation, visualization, and programming in an easy-to-use environment.
In the home work you got, you learned the basics: using variables, arrays and matrices, perform mathematical operations, plotting basic graphs, and working with binary files. Now we can jump to more advanced matterial.
During this lesson we will use gene-expression analysis as a running example, at each step of the analysis we will learn the relevant Matlab commands.
Let's start... We want to define the transcriptional response to smotic stress in yeast. We will use experimets measuring the expression levels 20 min after exposure to salt (using 2-dye arrays measuring the log-ratio between the RNA level in yeast after exposure to salt and before). We assume the data is already clean and normalized.
Note that there are many Matlab tool-boxes designed especially to analyse biological data, and specifically gene-expression data. We will learn more about them in the next lessons.
Our data is a tab delimited file :

Probe_id   Name              solt1/wt1   solt2/wt2    solt3/wt3    solt4/wt4
A_1111    chr3:13000-130030       1          1.2          2             1.5
A_1112    chr3:14000-140030       1          0.4          1.1            1
A_1113    chr5:1607804-1608104    0.2        -2           -1            -1.5
A_1114    chr2:9657000-9657300    0          0.2          -0.5          0.8
...

Handeling files

Step 1: Reading the data and saving it is a structure in matlab
There are various function you can use to read and upload files. Some examples:

load = Load workspace variables from disk (as you did in the exercise)
importdata = loads data from filename into the workspace.
textscan = Read formatted data from text file or string
fopen,fget = open file & read line

Let's try this:

>> e=importdata('expression.txt')

e = 

        data: [3000x3000 double]
textdata: {3001x3002 cell}
>> e.data(:,1:4)

ans =

    1.0000    1.2000    2.0000    1.5000
    1.0000    0.4000    1.1000    1.0000
    0.2000   -2.0000   -1.0000   -1.5000
         0    0.2000   -0.5000    0.8000
>> e.textdata(1,:)

ans = 

    'Probe_id'    'Name'    'solt1/wt1'    'solt2/wt2'    'solt3/wt3'    'solt4/wt4'
>> e.textdata(2:3,1:2)

ans = 

    'A_1111'    'chr3:13000-130030'
    'A_1112'    'chr3:14000-140030'


Statistics
Step 2: Finding differentially expressed genes

Differentiall expressed genes, are genes that change under the given condition compared to the reference. 
Since the expression data is noisy, we use repeats of the experiment to filter out noise, by requireing consistency between the repeats. 
A statistical test used to determine if a gene is differential, is t-test.

There are many statistical tests ready and easy to use in Matlab. Some examples:
kstest = Single sample Kolmogorov-Smirnov goodness-of-fit hypothesis test, compared to a random sample.
ttest =  One-sample and paired-sample T-test.  0reA T-test of the hypothesis that the data in the vector X comes from a distribution with mean zero. 

[H,P]= TTEST(X,M,ALPHA,TAIL) : performs a T-test of the hypothesis that the data in X come from a distribution with mean M,  at the significance level (100*ALPHA), against the alternative hypothesis specified by TAIL
ttest2 =  Two-sample T-test, performs a T-test of the hypothesis that two independent samples, come from distributions with equal means
corr =  pairwise linear correlation coefficient between two column vectors.

Let's try this :
>> [h_down,p_down]=ttest(e.data',0,0.05,'left');
>> p_down(1:4)

ans =

    0.9964    0.9940    0.0535    0.6632

>> [h_up,p_up]=ttest(e.data',0,0.05,'right');
>> p_up(1:4)

ans =

    0.0036    0.0060    0.9465    0.3368

Note: This function requires a multiple hypothese correction for the p-value.

You can use for example an False Discovery Rate (FDR) correction

using the function mafdr(p), from the bioinformatix toolbox.


NaN (Not-a-Number) and the find function
Step 3: Dealing with missing values

Missing values in the file (due to noise in the experiment for example), will be marked as NaN. This can be problematic:
>> [h,p]=ttest([NaN,NaN,1,1])

h =

     1


p =

     0

>> [h,p]=ttest([0,0,1,1])

h =

     0


p =

    0.1817


We can filter out probes with more than 60% missing values:

>> ind = find(sum(~isnan(e.data),2)>(size(e.data,2)*0.6));
>> data_60 = a.data(ind,:);
>> size(data_60)
>> data_60 = e.data(ind,:)

data_60 =

    1.0000    1.2000    2.0000    1.5000
    1.0000    0.4000    1.1000    1.0000
    0.2000   -2.0000   -1.0000   -1.5000
         0    0.2000   -0.5000    0.8000

>> size(data_60)

ans =

     2300     4

>> Names = e.textdata(ind+1,1:2)

Now we can re-do the t-test all over again... and create a matrix of differential genes.

First try :

>> differential = data_60(h_up+h_down>0,:); 

Second try:

>> differential = data_60( [(p_down<0.1)+(p_up<0.01)]>0 , :);
>> size(differential)
ans =

     320     4


Clustering
Step 4: Clustering the expression data 

Clustering the data can help us see trends in the data, and divide it to subgroups that share a similar expression pattern.

This might be more meaningful wehn working with more complex data, such as time-searies expression data.

There are different available clustering algorithms in Matlab, that can help you both cluster and visualize the data.


Hierarchical clustering = the easiest way is to use clusterdata function:

T = clusterdata(X,cutoff)
and you have a range of parameters to play with such as the distance function to use.
You can also do this explicitly using these three functions : 
Y = pdist(X,'euclid');
Z = linkage(Y,'single');
T = cluster(Z,'cutoff',cutoff);

Kmeans clustering=  kmeans(X,k) partitions the points in the data matrix X into k clusters, by an iterative algorithm that minimizes the within-cluster distance to the centroid.  

For our example, we will perform k-means with K=3 (why 6?? good question) :

K = 3;
I = kmeans(differential,K,'Distance','correlation');


Some more about graphs
Step 5: Displaying the clustering results 

We can think of different ways to display the data, such as plotting all genes in the cluster and their average :
>> figure;
>> for i=1:K
   subplot(K,1,i);
   hold on
   for j = I'
   plot( data(j,:));
   end
   plot( mean( data(find(I==i),:) ), 'r','LineWidth',2);
   title([ 'cluster ' num2str(i)]);
   set(gca,'FontSize',12);
end
set(gcf, 'PaperPosition', [0 0 9 9], 'PaperPositionMode', 'manual', 'PaperSize', [9 9]);
print(gcf,'-dpng', '-r300', [name '.MEAN.clusteredEuc.K' num2str(K) '.png']);

See for example such an image from time-series data : here 

Or a different display, using a colormap where up-regulated genes are read and down regulated genes are green :
>> figure;
>> for i=1:K
    subplot(K,1,i);
    colormap redgreencmap;
    imagesc( data(find(I==i),:),[-max(max(data)),max(max(data))]);
    colorbar;
end

The resulting image from the same time-series data you can see here 

Other useful functions : see the  help ,
functions such as: hist, pie, area. Or 3D graphs such as the mesh and scatter3 functions.

Handeling Files and text data
Step 6: Writing the clustering results to a file

Let's go directly to the example:

fh = fopen([name 'clsuterK' num2str(K) '-res.txt'],'w');
for i=1:K
 fprintf(1,'Number of genes in cluster %d : # %d\n',i,length(find(I==i)));
 for j = [find(I==i)]'
      fprintf(fh,'\t%s%s\n',Names{j,2},sprintf('\t%d',data(j:)))
 end
end
fclose(fh);


Other Data Structures
Step 7: Annotating the genes

In our data each probe is given using its location in the genome. 
We would like to annotate each probe, by relating it to the closes gene in the genome.
We will use for this the RefGene data, but it will be mch easier if we use more complex data structures.

Cell arryas = A cell array provides a storage mechanism for dissimilar kinds of data. 
To access data in a cell array, you use the same indexing as with other matrices, but use curly braces, {}.
Structures =  are MATLAB arrays with named "data containers" called fields. The fields of a structure can contain any kind of data. 

As an example we will parse the file  RefGene.txt


fid = fopen(fileName);
i = 1;
while( i > 0)
        t1 = fgetl(fid); if ~ischar(t1), break; end
        [a0,a] = strtok(t1);
        data = strread(a,'%s');
        C{i}.name = data(1);
         C{i}.chr = str2double(data(2));
        C{i}.strand = data(3);
        C{i}.tss = str2double(data(4));
        C{i}.tse = str2double(data(5));
        C{i}.cds = str2double(data(6));
        C{i}.cde = str2double(data(7));
        C{i}.exs = strread(data{9},'%d,');
        C{i}.exe = strread(data{10},'%d,');
        i=i+1
end
save('refGene.mat','C','chrN');
fclose(fid);


Running matlab from a perl script
in your perl script write:
my $matlab_prefix = "matlab -nodisplay -nodesktop -nojvm -r ";
my $matlab_cmd = "$matlab_prefix\"plot([1:10],[1:10])\"";
system ("$matlab_cmd > /dev/null < /dev/null");