1. Data file which has the expression values.
The data file should be tab delimited with the following format:
[name] [description] [sample_1_name] [sample_2_name] . . .
[gene1_name] [gene1_description] [expression_val] [expression_val] . . .
[gene2_name] [gene2_description] [expression_val] . . .
.
.
2. Labels file which has the samples' partitions into classes.
The labels file should be tab delimited with the following format:
[class_1_name] [sample1] [sample2] . . .
[class_2_name] [sample1] [sample2] . . .
.
.
Examples: Raw.txt, ALLvsAML.class
(Published by T. Golub et al., 'Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring', Science 286, 531-7, 1999.)
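
The two input formats above can be read with a few lines of Python. This is a minimal sketch, not part of the package: the function names `read_expression_table` and `read_labels` are hypothetical, the miniature inputs are made up, and it assumes that the sample entries in the labels file are the sample names from the data file header.

```python
import csv
import io

def read_expression_table(handle):
    """Return (sample_names, genes); genes maps name -> (description, values)."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    sample_names = header[2:]  # columns 1-2 hold the gene name and description
    genes = {}
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed short lines
        name, description, values = row[0], row[1], row[2:]
        if len(values) != len(sample_names):
            raise ValueError(f"{name}: expected {len(sample_names)} values")
        genes[name] = (description, [float(v) for v in values])
    return sample_names, genes

def read_labels(handle):
    """Return {class_name: [sample names]}, one class per line."""
    classes = {}
    for line in handle:
        fields = [f.strip() for f in line.split("\t") if f.strip()]
        if fields:
            classes[fields[0]] = fields[1:]
    return classes

# Made-up miniature inputs, for illustration only.
data_text = (
    "name\tdescription\ts1\ts2\ts3\n"
    "geneA\tfirst gene\t10.5\t-3\t7\n"
    "geneB\tsecond gene\t200\t41\t12\n"
)
labels_text = "ALL\ts1\ts2\nAML\ts3\n"

samples, genes = read_expression_table(io.StringIO(data_text))
classes = read_labels(io.StringIO(labels_text))

# Assumption: every labeled sample should appear in the data header.
unknown = [s for members in classes.values() for s in members if s not in samples]
print(samples, genes["geneA"], classes, unknown)
```

Checking the labels against the header before running any of the programs is a cheap way to catch a misspelled sample name.
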
| -g | Geometric avg |
| -a | Arithmetic avg |
| -s | Scale to std = 1 |
| -M <min> | Min value to truncate with |
| -N | Normalize based on all classes of the labeling |
NormalizeData -N -g -M 20 Raw.txt ALLvsAML.class Normalized.txt
This command line generates a log-ratio expression file in which each entry is the log (base 2) of the ratio to the geometric average of all arrays.
If the -N flag were omitted, the normalization would be based on the first class in the
labels file (i.e. ALL). The minimum value (-M, here 20) is used for truncating small
values.
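
The transformation described above can be sketched for a single gene as follows. This is an illustration under stated assumptions, not the program's code: the function name `normalize_gene` and its parameters are hypothetical, truncation is assumed to raise every value below the minimum up to the minimum, and the geometric average is assumed to be taken per gene over the reference arrays.

```python
import math

def normalize_gene(values, min_value, reference_idx=None):
    """Log2 ratio of each value to the geometric mean of the reference arrays.

    reference_idx=None uses all arrays (the -N behaviour described above);
    otherwise pass the column indices of the reference class.
    min_value must be positive so that the logarithm is defined.
    """
    truncated = [max(v, min_value) for v in values]  # truncate small values
    if reference_idx is None:
        reference_idx = range(len(truncated))
    reference = [truncated[i] for i in reference_idx]
    # The log of a geometric mean is the arithmetic mean of the logs.
    log_geo_mean = sum(math.log2(v) for v in reference) / len(reference)
    return [math.log2(v) - log_geo_mean for v in truncated]

row = [5.0, 80.0, 320.0]                     # made-up expression values
print(normalize_gene(row, min_value=20))     # approximately [-2, 0, 2]
print(normalize_gene(row, 20, reference_idx=[0]))  # relative to the first array only
```

Truncating before taking logarithms keeps near-zero and negative readings from dominating the ratios.
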
This program scores each gene with respect to a given labeling. The output file contains a report on the scores for each gene.
Usage: ScoreGene [options] DataFile LabelFile OutputFile
Options
| -m | Compute TNOM score |
| -i | Compute Info score |
| -w | Compute Wilcoxon score |
| -l | Compute logistic score |
| -g | Compute Gaussian overlap score |
| -t | Compute t-test score |
| -n | Compute n-1 fold change |
| -f | Compute fold change |
| -d | Print gene descriptions in report |
Example:
ScoreGene -m -i Raw.txt ALLvsAML.class ScoreGene.txt
This command will generate a tab-delimited output file (that can be viewed in
Excel, for example) with the following format:
| Gene | Description | TNOM | TNOM Value | Info | Info Threshold |
|---|---|---|---|---|---|
| HG3703-HT3915_s_at | "Udp-Glucuronosyltransferase 1 Family, Polypeptide 1, Alt. Splice 1" | 0.786565 | 24 | 0.761646 | -10 |
| Y08374_rna1_at | "GP-39 cartilage protein gene extracted from H.sapiens gene encoding cartilage GP-39 protein, exon 1 and 2 (and joined CDS) " | 0.459584 | 23 | 0.494184 | -102.5 |
| X99728_at | "GB DEF = NDUFV3 gene, exon 3" | 0.22698 | 22 | 0.0156431 | 356.5 |
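
As a reading aid for the TNOM columns, the sketch below assumes that TNOM denotes the threshold number of misclassifications: the smallest number of samples that a single expression threshold on the gene assigns to the wrong class. It is an illustrative implementation under that assumption, not the package's code. The function name `tnom` is hypothetical, and it returns only the raw error count (presumably what the "TNOM Value" column reports), not the associated p-value.

```python
def tnom(values, labels):
    """Fewest misclassified samples over all thresholds and both directions.

    values: expression values of one gene, one per sample.
    labels: booleans, True for samples of the first class.
    """
    pairs = sorted(zip(values, labels))
    n_pos = sum(1 for _, lab in pairs if lab)
    n_neg = len(pairs) - n_pos
    # Threshold below every value: one class is predicted for all samples.
    best = min(n_pos, n_neg)
    pos_below = neg_below = 0
    i = 0
    while i < len(pairs):
        # Move the threshold past a group of tied values in one step.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1]:
                pos_below += 1
            else:
                neg_below += 1
            j += 1
        # Rule "first class above the threshold": errors are first-class
        # samples below it plus second-class samples above it.
        errors_if_pos_above = pos_below + (n_neg - neg_below)
        # The mirrored rule, "first class below the threshold".
        errors_if_pos_below = neg_below + (n_pos - pos_below)
        best = min(best, errors_if_pos_above, errors_if_pos_below)
        i = j
    return best

labels = [True, True, True, False, False, False]
print(tnom([1, 2, 3, 10, 11, 12], labels))  # classes fully separated
print(tnom([1, 2, 10, 3, 11, 12], labels))  # one sample on the wrong side
```
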
This program computes the overabundance graph of DataFile with respect to LabelFile.
The output shows, for each p-value, the number of observed genes that were scored
with this p-value, the expected number of genes, and the surprise level.
Usage: OverAbundance [options] DataFile LabelFile OutputFile
Options
| -m | Compute TNOM score |
| -i | Compute Info score |
| -w | Compute Wilcoxon score |
| -l | Compute logistic score |
| -g | Compute Gaussian overlap score |
| -t | Compute t-test score |
OverAbundance Normalized.txt ALLvsAML.class OverAbundance.txt
This command prints (after the usual confirmation printouts) information about
possible cutoffs:
MaxSurprise = 1401.63 at 0.00177662 with 508 genes
Bonferroni (95% confidence) at 7.01361e-06 with 97 genes
Max FDR (95% confidence) at 0.00177662 with 508 genes
MaxSurprise is described in the Class Discovery papers.
The Bonferroni and FDR bounds are two statistical methods for selecting genes under multiple hypothesis testing.
The output file format is:
| Score | Count | Accumulate | Expected | Std | Binomial Surprise | FDR |
|---|---|---|---|---|---|---|
| 7.81421e-15 | 3 | 3 | 5.57075e-11 | 7.46375e-06 | 97.6225 | 2.10408e-05 |
| 1.34795e-13 | 4 | 7 | 9.60954e-10 | 3.09993e-05 | 176.713 | 4.90952e-05 |
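
The sample rows above are numerically consistent with the reading sketched below. This is a reconstruction from the printed numbers, not a documented definition: it assumes n = 7129 genes in the data file (a value inferred here, not stated in this section) and a significance level of 0.05 for the "95% confidence" bounds. Under those assumptions, Expected is n times the p-value, Std is the binomial standard deviation, FDR is the threshold 0.05 * Accumulate / n, and the Bonferroni cutoff is 0.05 / n. The Binomial Surprise column is not reconstructed. All function names are hypothetical.

```python
import math

N_GENES = 7129  # assumption: inferred number of genes, not stated in the text
ALPHA = 0.05    # assumption: "95% confidence" read as a 0.05 significance level

def expected_count(p, n=N_GENES):
    """Expected number of genes scoring p or better by chance."""
    return n * p

def binomial_std(p, n=N_GENES):
    """Standard deviation of a Binomial(n, p) count."""
    return math.sqrt(n * p * (1 - p))

def fdr_threshold(accumulated, n=N_GENES, alpha=ALPHA):
    """Step-up FDR threshold for the gene ranked `accumulated`."""
    return alpha * accumulated / n

def bonferroni_threshold(n=N_GENES, alpha=ALPHA):
    """Per-gene p-value cutoff under the Bonferroni correction."""
    return alpha / n

# (Score, Accumulate) pairs copied from the two sample rows.
for p, accumulated in [(7.81421e-15, 3), (1.34795e-13, 7)]:
    print(f"{expected_count(p):.5e}  {binomial_std(p):.5e}  "
          f"{fdr_threshold(accumulated):.5e}")
print(f"Bonferroni cutoff: {bonferroni_threshold():.5e}")
```

A gene passes the FDR criterion when its p-value is below the threshold for its rank, and passes Bonferroni when its p-value is below the single fixed cutoff.
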
This program performs leave-one-out cross-validation ('jack-knife') classification of the data after features have been selected with different thresholds.
Usage: ClassificationGraph [options] DataFile LabelFile OutputFile
Options
| -m | Compute TNOM score |
| -i | Compute Info score |
| -w | Compute Wilcoxon score |
| -l | Compute logistic score |
| -g | Compute Gaussian overlap score |
| -t | Compute t-test score |
| -n | Compute n-1 fold change |
| -s | Do not perform gene selection within iterations |
| -T <tst> | Define test set labels |
| -C <classes> | Labels file for Leave-Class-Out procedure |
| -b | Classify with AdaBoost |
| -a | Classify with Naive Bayes |
ClassificationGraph -m -a Raw.txt ALLvsAML.txt out.txt
This will generate an output file with the following format:
| 1 | 7129 | 4 | 6 | 10 |
| 0.707107 | 4873.99 | 6 | 5 | 11 |
| 0.5 | 4873.99 | 6 | 5 | 11 |
| 0.353553 | 3511.19 | 6 | 5 | 11 |
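
The procedure itself, leave-one-out cross-validation with gene selection repeated inside every iteration, can be sketched as follows. This is a generic illustration, not the program's implementation: the gene score (absolute difference of class means) and the classifier (nearest centroid) are simple stand-ins for the scores and classifiers listed in the options, and all function names are hypothetical. The sketch does not reproduce the columns of the output file.

```python
def select_genes(data, labels, train, k):
    """Indices of the k genes with the largest class-mean difference (stand-in score)."""
    scored = []
    for g, row in enumerate(data):
        a = [row[i] for i in train if labels[i] == 0]
        b = [row[i] for i in train if labels[i] == 1]
        scored.append((abs(sum(a) / len(a) - sum(b) / len(b)), g))
    scored.sort(reverse=True)
    return [g for _, g in scored[:k]]

def nearest_centroid(data, labels, train, genes, test):
    """Predict the class whose centroid over the selected genes is closest."""
    distances = []
    for c in (0, 1):
        members = [i for i in train if labels[i] == c]
        d = 0.0
        for g in genes:
            centroid = sum(data[g][i] for i in members) / len(members)
            d += (data[g][test] - centroid) ** 2
        distances.append(d)
    return 0 if distances[0] <= distances[1] else 1

def loocv_errors(data, labels, k):
    """Count misclassified samples; each class needs at least two samples."""
    errors = 0
    n = len(labels)
    for held_out in range(n):
        train = [i for i in range(n) if i != held_out]
        # Selection is redone without the held-out sample, so the test
        # sample never influences which genes are used.
        genes = select_genes(data, labels, train, k)
        if nearest_centroid(data, labels, train, genes, held_out) != labels[held_out]:
            errors += 1
    return errors

# Made-up data: rows are genes, columns are samples.
data = [[1, 2, 1.5, 9, 10, 9.5],   # informative gene
        [5, 1, 4, 2, 6, 3]]        # uninformative gene
labels = [0, 0, 0, 1, 1, 1]
print(loocv_errors(data, labels, k=1))
```

Selecting genes once on the full data and only then cross-validating would leak information from the held-out sample, which is presumably why skipping selection within iterations is an explicit option (-s) rather than the default.
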
This program extracts genes from DataFile that are significant with respect to LabelFile.
The selected genes have a p-value better than the given threshold.
The output is a CDT file with genes sorted by significance, and arrays sorted by classification confidence.
The CDT file can be viewed using the Treeview or GeneXPress programs.
Usage: PlotTopGenes [options] DataFile LabelFile OutputFile
Options
| -T <threshold> |   Threshold for choosing genes |
| -F <sig> |   Choose genes that pass FDR |
| -B <sig> |   Choose genes that pass Bonferroni |
| -s |   Scale to std = 1 |
| -L |   Print logarithms of expression values |
| -M <Min> |   Min value to truncate with |
| -N |   Normalize with respect to the training classes |
| -C |   Do not classify |
| -O |   Omit non-labeled genes |
| -m |   Compute TNOM score |
| -i |   Compute Info score |
| -w |   Compute Wilcoxon score |
| -l |   Compute logistic score |
| -g |   Compute Gaussian overlap score |
| -t |   Compute t-test score |
| -n |   Compute n-1 fold change |
| -b |   Classify with AdaBoost |
| -a |   Classify with Naive Bayes |
| -c |   Classify with Gaussian NB |
Example of a CDT file: rows are genes and columns are experiments. The experiments are ordered by label: class 1 experiments on the right, class 2 experiments on the left, and unlabeled experiments in the center.
 
This program clusters genes according to the expression correlations in each class.
Usage: PCluster ExpFile LabelFile OutputFile
Options
| -l | Take log of expression values before clustering |
| -s | Scale each gene to have mean 0 and variance 1 before clustering |
| -S | Scale only the output of the process |
| -B | Use marginal likelihood (instead of ML) |
| -L | Use leaves to determine order (default) |
| -T | Use trunks to determine order |
| -G <Genes> | Work only on the genes listed in the attached file |
| -C <Cluster file>,<Cluster number> | Output <Cluster number> Clusters of genes |
| -u | Do not introduce empty columns between classes |
PCluster Raw.txt ALLvsAML.txt out
This will generate a CDT output file, out.cdt, which describes the clusters found
by the algorithm.