Input format

1. Data file, which contains the expression values.
The data file should be tab delimited with the following format:

[name] [description] [sample_1_name] [sample_2_name] . . .
[gene1_name] [gene1_description] [expression_val] [expression_val] . . .
[gene2_name] [gene2_description] [expression_val] [expression_val] . . .
.
.


2. Labels file, which partitions the samples into classes. The labels file should be tab delimited with the following format:

[class_1_name] [sample1] [sample2] . . .
[class_2_name] [sample1] [sample2] . . .
.
.

Examples: Raw.txt (data file), ALLvsAML.class (labels file)
(Published by T. Golub et al., 'Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring', Science 286, 531-7, 1999.)
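
As an illustration of the two layouts above, here is a minimal Python sketch that reads them. It is not part of the tools; the function names and the tiny inline data are made up for the example, and it assumes the entries in the labels file are sample names that match the header of the data file.

```python
import csv
import io


def read_expression(handle):
    """Return (sample_names, {gene_name: (description, [values])})."""
    rows = csv.reader(handle, delimiter="\t")
    header = next(rows)          # [name] [description] [sample_1_name] ...
    samples = header[2:]
    genes = {}
    for row in rows:
        if not row:              # skip blank lines
            continue
        genes[row[0]] = (row[1], [float(v) for v in row[2:]])
    return samples, genes


def read_labels(handle):
    """Return {class_name: [samples]} from a tab-delimited labels file."""
    classes = {}
    for row in csv.reader(handle, delimiter="\t"):
        if row:
            classes[row[0]] = row[1:]
    return classes


# Made-up miniature input, only to show the expected shape of the files.
data_text = (
    "name\tdescription\tS1\tS2\tS3\n"
    "g1\tfirst gene\t10\t20\t40\n"
    "g2\tsecond gene\t5\t5\t5\n"
)
label_text = "ALL\tS1\tS2\nAML\tS3\n"

samples, genes = read_expression(io.StringIO(data_text))
classes = read_labels(io.StringIO(label_text))
print(samples, genes, classes)
```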

 



Normalize data

This program accepts a data file as input and normalizes the expression values of each gene. By default the normalization is semi-supervised: each expression value is divided by the gene's average expression in the first class. To normalize with respect to all samples in the data set, use the -N option.

Usage: NormalizeData [options] DataFile LabelFile OutputFile

Options

-g Geometric avg
-a Arithmetic avg
-s Scale to std = 1
-M <min> Min value to truncate with
-N Normalize based on all classes of the labeling


Example:

  NormalizeData -N -g -M 20 Raw.txt ALLvsAML.class Normalized.txt

This command line generates a log-ratio expression file in which each entry is the log (base 2) of the ratio between the expression value and the gene's geometric average over all arrays.

If the -N flag were omitted, the normalization would be based on the first class in the labels file (i.e. ALL). The minimum value given with -M is used to truncate small values.
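
The computation described above can be sketched in Python for a single gene. This is an illustration only, not the program's code: the function name is hypothetical, and it assumes that truncation at the minimum value is applied before the geometric average is taken, which the text does not specify.

```python
import math


def normalize_gene(values, reference_idx, min_value=20.0):
    """Log2 ratio of each value to the geometric mean of the reference samples."""
    # Assumption: small values are raised to min_value before averaging.
    truncated = [max(v, min_value) for v in values]
    reference = [truncated[i] for i in reference_idx]
    # Geometric average = exp(mean of logs).
    geo_mean = math.exp(sum(math.log(v) for v in reference) / len(reference))
    return [math.log2(v / geo_mean) for v in truncated]


values = [10.0, 40.0, 160.0]   # made-up expression values for one gene

# Like -N: use every sample as the reference set.
all_samples = normalize_gene(values, reference_idx=[0, 1, 2])

# Like omitting -N: use only the first class (here, samples 0 and 1).
first_class = normalize_gene(values, reference_idx=[0, 1])

print(all_samples)
print(first_class)
```

With all samples as the reference, the log ratios of a gene sum to zero; with the first class as the reference, only the first-class log ratios do.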


Score Genes

This program scores each gene with respect to a given labeling. The output file contains a report on the scores for each gene.


Usage: ScoreGene [options] DataFile LabelFile OutputFile

Options

-m Compute TNOM score
-i Compute Info score
-w Compute Wilcoxon score
-l Compute logistic score
-g Compute Gaussian overlap score
-t Compute t-test score
-n Compute n-1 fold change
-f Compute fold change
-d Print gene descriptions in report


The default options are '-d -w -g -i -n'

Example:

  ScoreGene -m -i Raw.txt ALLvsAML.class ScoreGene.txt


This command will generate a tab-delimited output file (that can be viewed in Excel, for example) with the following format:

Gene Description TNOM TNOM Value Info Info Threshold
HG3703-HT3915_s_at "Udp-Glucuronosyltransferase 1 Family, Polypeptide 1, Alt. Splice 1" 0.786565 24 0.761646 -10
Y08374_rna1_at "GP-39 cartilage protein gene extracted from H.sapiens gene encoding cartilage GP-39 protein, exon 1 and 2 (and joined CDS) " 0.459584 23 0.494184 -102.5
X99728_at "GB DEF = NDUFV3 gene, exon 3" 0.22698 22 0.0156431 356.5

Each row shows the name of the gene, its description, the TNOM score (p-value and number of errors), and the Info p-value and threshold.
For example, the probe X99728_at has TNOM p-value 0.22698 and Info p-value 0.0156431.
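
Since the report is tab delimited, it can also be filtered with a short script. The sketch below is only an illustration: the function name is hypothetical, the column positions are read off the sample header above (column 4 is taken to be the Info p-value), and the inline text stands in for ScoreGene.txt.

```python
import csv
import io

# Two rows copied from the sample report above, with tabs between columns.
report = (
    "Gene\tDescription\tTNOM\tTNOM Value\tInfo\tInfo Threshold\n"
    'HG3703-HT3915_s_at\t"Udp-Glucuronosyltransferase 1 Family, '
    'Polypeptide 1, Alt. Splice 1"\t0.786565\t24\t0.761646\t-10\n'
    'X99728_at\t"GB DEF = NDUFV3 gene, exon 3"\t0.22698\t22\t0.0156431\t356.5\n'
)


def genes_below(handle, column, cutoff):
    """Return gene names whose value in `column` is at most `cutoff`."""
    rows = csv.reader(handle, delimiter="\t")
    next(rows)                   # skip the header row
    return [row[0] for row in rows if row and float(row[column]) <= cutoff]


# Column 4 is assumed to hold the Info p-value.
significant = genes_below(io.StringIO(report), column=4, cutoff=0.05)
print(significant)
```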


Overabundance

This program computes an overabundance graph for DataFile with respect to LabelFile. For each p-value, the output shows the number of observed genes that scored this p-value, the expected number of genes, and the surprise level.


Usage: OverAbundance [options] DataFile LabelFile OutputFile

Options

-m   Compute TNOM score
-i   Compute Info score
-w   Compute Wilcoxon score
-l   Compute logistic score
-g   Compute Gaussian overlap score
-t   Compute t-test score


The default options are '-m'

Example:

 OverAbundance Normalized.txt ALLvsAML.class OverAbundance.txt

This prints (after the usual confirmation printouts) information about possible cutoffs:

MaxSurprise = 1401.63 at 0.00177662 with 508 genes
Bonferroni (95% confidence) at 7.01361e-06 with 97 genes
Max FDR (95% confidence) at 0.00177662 with 508 genes

MaxSurprise is described in the Class Discovery papers. The Bonferroni and FDR bounds come from two statistical methods for selection under multiple hypothesis testing.

The output file format is:

Score Count Accumulate Expected Std Binomial Surprise FDR
7.81421e-15 3 3 5.57075e-11 7.46375e-06 97.6225 2.10408e-05
1.34795e-13 4 7 9.60954e-10 3.09993e-05 176.713 4.90952e-05

The first column in the table is the p-value, the second is the number of genes at that p-value, and the third is the number of genes with this p-value or better. The fourth is the number of genes expected in random data to have that p-value or better, and the fifth is the standard deviation of this number. The last two columns measure surprise: the Binomial Surprise and the FDR cutoff.
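
The Expected and Std values in the sample rows, and the Bonferroni cutoff printed above, appear consistent with a simple binomial model over 7129 genes (the gene count that appears in the ClassificationGraph sample output). The Python sketch below checks that arithmetic. It is an inference from the numbers shown, not a documented formula; the function names are hypothetical, and it does not attempt to reproduce the Binomial Surprise or FDR columns.

```python
import math

N_GENES = 7129  # assumption: gene count taken from the ClassificationGraph sample output


def expected_and_std(p_value, n_genes=N_GENES):
    """Mean and standard deviation of a Binomial(n_genes, p_value) count."""
    expected = n_genes * p_value
    std = math.sqrt(n_genes * p_value * (1.0 - p_value))
    return expected, std


def bonferroni_cutoff(alpha, n_genes=N_GENES):
    """Per-gene p-value cutoff for a family-wise error rate of alpha."""
    return alpha / n_genes


expected, std = expected_and_std(7.81421e-15)   # first sample row
cutoff = bonferroni_cutoff(0.05)                # 95% confidence
print(expected, std, cutoff)
```

Under these assumptions the results agree with the first sample row (Expected 5.57075e-11, Std 7.46375e-06) and with the printed Bonferroni cutoff (7.01361e-06) to the precision shown.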

 


Classification

This program performs leave-one-out cross-validation ('jack-knife') classification of the data after features have been selected with different thresholds.

Usage: ClassificationGraph [options] DataFile LabelFile OutputFile

Options

-m Compute TNOM score
-i Compute Info score
-w Compute Wilcoxon score
-l Compute logistic score
-g Compute Gaussian overlap score
-t Compute t-test score
-n Compute n-1 fold change
-s Do not perform gene selection within iterations
-T <tst> Define test set labels
-C <classes> Labels file for Leave-Class-Out procedure
-b Classify with AdaBoost
-a Classify with Naive Bayes

The default options are '-m'

Example:

  ClassificationGraph -m -a Raw.txt ALLvsAML.txt out.txt

This will generate an output file with the following format:

1 7129 4 6 10
0.707107 4873.99 6 5 11
0.5 4873.99 6 5 11
0.353553 3511.19 6 5 11


In each row, the first column is the p-value threshold and the second is the average number of selected genes (over the LOOCV repeats). The next columns give the number of misclassifications made for each class (e.g. with p-value=1, 7129 genes were selected and the classifier misclassified 6 ALL and 4 AML samples). The last column is the total number of misclassifications.
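
As an illustration of this layout, the Python sketch below parses the sample rows and picks the threshold with the fewest total misclassifications. The function name is hypothetical and the rows are the four shown above; it makes no assumption about which per-class column belongs to which class.

```python
sample_output = """1 7129 4 6 10
0.707107 4873.99 6 5 11
0.5 4873.99 6 5 11
0.353553 3511.19 6 5 11"""


def parse_rows(text):
    """Return (threshold, avg_genes, per_class_errors, total_errors) per row."""
    parsed = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:           # skip blank lines
            continue
        threshold = float(fields[0])
        avg_genes = float(fields[1])
        per_class = [int(x) for x in fields[2:-1]]
        total = int(fields[-1])
        parsed.append((threshold, avg_genes, per_class, total))
    return parsed


rows = parse_rows(sample_output)

# The last column should equal the sum of the per-class columns.
consistent = all(sum(per_class) == total for _, _, per_class, total in rows)

# Threshold with the fewest total misclassifications (first one wins ties).
best = min(rows, key=lambda row: row[3])
print(consistent, best)
```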

Plot Top Genes

This program extracts genes from DataFile that are significant with respect to LabelFile. The selected genes have a p-value better than the given threshold. The output is a CDT file with genes sorted by significance and arrays sorted by classification confidence. The CDT file can be viewed with the TreeView or GeneXPress programs.

Usage: PlotTopGenes [options] DataFile LabelFile OutputFile

Options

-T <threshold>   Threshold for choosing genes
-F <sig>   Choose genes that pass FDR
-B <sig>   Choose genes that pass Bonferroni
-s   Scale to std = 1
-L   Print logarithms of expression values
-M <Min>   Min value to truncate with
-N   Normalize with respect to the training class
-C   Do not classify
-O   Omit non-labeled genes
-m   Compute TNOM score
-i   Compute Info score
-w   Compute Wilcoxon score
-l   Compute logistic score
-g   Compute Gaussian overlap score
-t   Compute t-test score
-n   Compute n-1 fold change
-b   Classify with AdaBoost
-a   Classify with Naive Bayes
-c   Classify with Gaussian NB


Note that this program annotates each experiment by the classification confidence given by the classifier. Thus, we can see which experiments were correctly classified in each category.

Example of a CDT file: rows are the genes and columns are the experiments. The experiments are ordered according to their labels: experiments from class type 1 on the right side, experiments from class type 2 on the left side, and unlabeled experiments in the center.

       


PCluster

This program clusters genes according to the expression correlations in each class.

Usage: PCluster [options] ExpFile LabelFile OutputFile

Options

-l Take log of expression values before clustering
-s Scale each gene to have mean 0 and variance 1 before clustering
-S Scale only the output of the process
-B Use marginal likelihood (instead of ML)
-L Use leaves to determine order (default)
-T Use trunks to determine order
-G <Genes> Work only on the genes listed in the attached file
-C <Cluster file>,<Cluster number> Output <Cluster number> Clusters of genes
-u Do not introduce empty columns between classes


Example:

PCluster Raw.txt ALLvsAML.txt out

This will generate a CDT output file, out.cdt, which describes the clusters found by the algorithm.


January, 2003