Matlab - advanced lesson 1

The Matlab language integrates computation, visualization, and programming in an easy-to-use environment.
In the home work you got, you learned the basics: using variables, arrays and matrices, perform mathematical operations, plotting basic graphs, and working with binary files. Now we can jump to more advanced matterial.
During this lesson we will use gene-expression analysis as a running example, at each step of the analysis we will learn the relevant Matlab commands.
Let's start... We want to define the transcriptional response to smotic stress in yeast. We will use experimets measuring the expression levels 20 min after exposure to salt (using 2-dye arrays measuring the log-ratio between the RNA level in yeast after exposure to salt and before). We assume the data is already clean and normalized.
Note that there are many Matlab tool-boxes designed especially to analyse biological data, and specifically gene-expression data. We will learn more about them in the next lessons.
Our data is a tab delimited file :
Probe_id   Name              solt1/wt1   solt2/wt2    solt3/wt3    solt4/wt4
A_1111    chr3:13000-130030       1          1.2          2             1.5
A_1112    chr3:14000-140030       1          0.4          1.1            1
A_1113    chr5:1607804-1608104    0.2        -2           -1            -1.5
A_1114    chr2:9657000-9657300    0          0.2          -0.5          0.8
...

Handeling files

Step 1: Reading the data and saving it is a structure in matlab
There are various function you can use to read and upload files. Some examples: Let's try this:
>> e=importdata('expression.txt')

e = 

        data: [3000x3000 double]
textdata: {3001x3002 cell}
>> e.data(:,1:4)

ans =

    1.0000    1.2000    2.0000    1.5000
    1.0000    0.4000    1.1000    1.0000
    0.2000   -2.0000   -1.0000   -1.5000
         0    0.2000   -0.5000    0.8000
>> e.textdata(1,:)

ans = 

    'Probe_id'    'Name'    'solt1/wt1'    'solt2/wt2'    'solt3/wt3'    'solt4/wt4'
>> e.textdata(2:3,1:2)

ans = 

    'A_1111'    'chr3:13000-130030'
    'A_1112'    'chr3:14000-140030'


Statistics

Step 2: Finding differentially expressed genes
Differentiall expressed genes, are genes that change under the given condition compared to the reference. Since the expression data is noisy, we use repeats of the experiment to filter out noise, by requireing consistency between the repeats. A statistical test used to determine if a gene is differential, is t-test.
There are many statistical tests ready and easy to use in Matlab. Some examples: Let's try this :

>> [h_down,p_down]=ttest(e.data',0,0.05,'left');
>> p_down(1:4)

ans =

    0.9964    0.9940    0.0535    0.6632

>> [h_up,p_up]=ttest(e.data',0,0.05,'right');
>> p_up(1:4)

ans =

    0.0036    0.0060    0.9465    0.3368

Note: This function requires a multiple hypothese correction for the p-value.
You can use for example an False Discovery Rate (FDR) correction
using the function mafdr(p), from the bioinformatix toolbox.

NaN (Not-a-Number) and the find function

Step 3: Dealing with missing values
Missing values in the file (due to noise in the experiment for example), will be marked as NaN. This can be problematic:
>> [h,p]=ttest([NaN,NaN,1,1])

h =

     1


p =

     0

>> [h,p]=ttest([0,0,1,1])

h =

     0


p =

    0.1817

We can filter out probes with more than 60% missing values:
>> ind = find(sum(~isnan(e.data),2)>(size(e.data,2)*0.6));
>> data_60 = a.data(ind,:);
>> size(data_60)
>> data_60 = e.data(ind,:)

data_60 =

    1.0000    1.2000    2.0000    1.5000
    1.0000    0.4000    1.1000    1.0000
    0.2000   -2.0000   -1.0000   -1.5000
         0    0.2000   -0.5000    0.8000

>> size(data_60)

ans =

     2300     4

>> Names = e.textdata(ind+1,1:2)

Now we can re-do the t-test all over again... and create a matrix of differential genes.
First try : >> differential = data_60(h_up+h_down>0,:); Second try: >> differential = data_60( [(p_down<0.1)+(p_up<0.01)]>0 , :); >> size(differential) ans = 320 4

Clustering

Step 4: Clustering the expression data
Clustering the data can help us see trends in the data, and divide it to subgroups that share a similar expression pattern.
This might be more meaningful wehn working with more complex data, such as time-searies expression data.
There are different available clustering algorithms in Matlab, that can help you both cluster and visualize the data.
  • Hierarchical clustering = the easiest way is to use clusterdata function:
    T = clusterdata(X,cutoff) and you have a range of parameters to play with such as the distance function to use. You can also do this explicitly using these three functions : Y = pdist(X,'euclid'); Z = linkage(Y,'single'); T = cluster(Z,'cutoff',cutoff);
  • Kmeans clustering= kmeans(X,k) partitions the points in the data matrix X into k clusters, by an iterative algorithm that minimizes the within-cluster distance to the centroid.
For our example, we will perform k-means with K=3 (why 6?? good question) : K = 3; I = kmeans(differential,K,'Distance','correlation');

Some more about graphs

Step 5: Displaying the clustering results
We can think of different ways to display the data, such as plotting all genes in the cluster and their average :
>> figure;
>> for i=1:K
   subplot(K,1,i);
   hold on
   for j = I'
   plot( data(j,:));
   end
   plot( mean( data(find(I==i),:) ), 'r','LineWidth',2);
   title([ 'cluster ' num2str(i)]);
   set(gca,'FontSize',12);
end
set(gcf, 'PaperPosition', [0 0 9 9], 'PaperPositionMode', 'manual', 'PaperSize', [9 9]);
print(gcf,'-dpng', '-r300', [name '.MEAN.clusteredEuc.K' num2str(K) '.png']);

See for example such an image from time-series data : here 

Or a different display, using a colormap where up-regulated genes are read and down regulated genes are green :
>> figure;
>> for i=1:K
    subplot(K,1,i);
    colormap redgreencmap;
    imagesc( data(find(I==i),:),[-max(max(data)),max(max(data))]);
    colorbar;
end

The resulting image from the same time-series data you can see here 
Other useful functions : see the help , functions such as: hist, pie, area. Or 3D graphs such as the mesh and scatter3 functions.

Handeling Files and text data

Step 6: Writing the clustering results to a file
Let's go directly to the example: fh = fopen([name 'clsuterK' num2str(K) '-res.txt'],'w'); for i=1:K fprintf(1,'Number of genes in cluster %d : # %d\n',i,length(find(I==i))); for j = [find(I==i)]' fprintf(fh,'\t%s%s\n',Names{j,2},sprintf('\t%d',data(j:))) end end fclose(fh);

Other Data Structures

Step 7: Annotating the genes
In our data each probe is given using its location in the genome. We would like to annotate each probe, by relating it to the closes gene in the genome. We will use for this the RefGene data, but it will be mch easier if we use more complex data structures.
  • Cell arryas = A cell array provides a storage mechanism for dissimilar kinds of data. To access data in a cell array, you use the same indexing as with other matrices, but use curly braces, {}.
  • Structures = are MATLAB arrays with named "data containers" called fields. The fields of a structure can contain any kind of data.
As an example we will parse the file RefGene.txt
fid = fopen(fileName); i = 1; while( i > 0) t1 = fgetl(fid); if ~ischar(t1), break; end [a0,a] = strtok(t1); data = strread(a,'%s'); C{i}.name = data(1); C{i}.chr = str2double(data(2)); C{i}.strand = data(3); C{i}.tss = str2double(data(4)); C{i}.tse = str2double(data(5)); C{i}.cds = str2double(data(6)); C{i}.cde = str2double(data(7)); C{i}.exs = strread(data{9},'%d,'); C{i}.exe = strread(data{10},'%d,'); i=i+1 end save('refGene.mat','C','chrN'); fclose(fid);

Running matlab from a perl script

in your perl script write:
my $matlab_prefix = "matlab -nodisplay -nodesktop -nojvm -r ";
my $matlab_cmd = "$matlab_prefix\"plot([1:10],[1:10])\"";
system ("$matlab_cmd > /dev/null < /dev/null");