Vol 13, Issue 3, 503-512, March 2003
METHODS
A Classification-Based Machine Learning Approach for the Analysis of Genome-Wide Expression Data
James Lyons-Weiler1,2,
Satish Patel and
Soumyaroop Bhattacharya
Department of Biological Sciences/Graduate Program in
Biochemistry/Center for Bioinformatics and Computational Biology,
University of Massachusetts, Lowell,
Lowell, Massachusetts 01854, USA
Three important areas of data analysis for global gene expression
analysis are class discovery, class prediction, and finding
dysregulated genes (biomarkers). The clinical application of microarray
data will require marker genes whose expression patterns are
sufficiently well understood to allow accurate predictions on disease
subclass membership. Commonly used methods of analysis
include hierarchical clustering algorithms, t-, F-, and Z-tests, and
machine learning approaches. We describe an approach called the maximum
difference subset (MDSS) algorithm that combines classification
algorithms, classical statistics, and elements of machine learning and
provides a coherent framework. By integrating prediction accuracy, the
MDSS algorithm learns the critical threshold of statistical
significance (the α or P-value), eliminating the
arbitrariness of setting a threshold of statistical significance and
minimizing the effect of the normality assumptions. To reduce the false
positive rate and to increase external validity of the predictive gene
set, a jackknife step is used. This step identifies and removes genes
in the initial MDSS with low combined predictive utility. The overall
MDSS provides a prediction that is less dependent on an arbitrary study
design (sample inclusion or exclusion) and should thus have high
external validity. We demonstrate that this approach, unlike other
published methods, identifies biomarkers capable of predicting the
outcome of anthracycline-cytarabine chemotherapy in cases of acute
myeloid leukemia. By incorporating two criteria, statistical
significance and predictive utility, the approach learns the
significance level relevant for a given data set. The MDSS approach can
be used with any test and classifier operator pair.
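
The idea of learning the significance threshold from prediction accuracy can be illustrated with a minimal sketch. It is not the published MDSS implementation: the abstract does not specify the test, the classifier, or the search procedure, so the sketch assumes a two-class normal-approximation (Z-type) test, a nearest-centroid classifier, leave-one-out accuracy, and a small grid of candidate thresholds. All function names and the toy data are hypothetical, and the jackknife step is left out because its details are not given here.

```python
from statistics import NormalDist, mean, variance


def z_test_p(a, b):
    """Two-sided p-value from a normal approximation (assumed test)."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    if se == 0:
        return 1.0 if mean(a) == mean(b) else 0.0
    z = abs(mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(z))


def select_genes(data, labels, alpha):
    """Indices of genes whose two-class p-value falls below alpha."""
    first, second = sorted(set(labels))
    a = [row for row, y in zip(data, labels) if y == first]
    b = [row for row, y in zip(data, labels) if y == second]
    return [g for g in range(len(data[0]))
            if z_test_p([r[g] for r in a], [r[g] for r in b]) < alpha]


def predict(train, labels, sample, genes):
    """Nearest-centroid prediction on the selected genes (assumed classifier)."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(labels)):
        rows = [row for row, y in zip(train, labels) if y == label]
        dist = sum((sample[g] - mean(r[g] for r in rows)) ** 2 for g in genes)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loo_accuracy(data, labels, alpha):
    """Leave-one-out accuracy, reselecting genes within each fold."""
    correct = 0
    for i in range(len(data)):
        train = data[:i] + data[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        genes = select_genes(train, train_labels, alpha)
        # A fold with no selected genes counts as a misprediction.
        if genes and predict(train, train_labels, data[i], genes) == labels[i]:
            correct += 1
    return correct / len(data)


def learn_alpha(data, labels, alphas):
    """Return (alpha, accuracy) for the most accurate candidate threshold."""
    scored = [(loo_accuracy(data, labels, a), a) for a in alphas]
    best = max(acc for acc, _ in scored)
    # Ties are broken in favor of the strictest threshold (a design choice).
    return min(a for acc, a in scored if acc == best), best


# Toy data: rows are samples, columns are genes; only gene 0 separates classes.
data = [
    [1.0, 5.0, 3.0], [1.2, 4.0, 3.5], [0.9, 6.0, 2.5], [1.1, 5.5, 3.2],
    [3.0, 5.2, 3.1], [3.2, 4.5, 2.9], [2.9, 5.8, 3.4], [3.1, 4.9, 2.7],
]
labels = ["A"] * 4 + ["B"] * 4
alpha, accuracy = learn_alpha(data, labels, [0.001, 0.01, 0.05])
```

Reselecting genes inside each leave-one-out fold keeps the held-out sample from influencing the gene set it is scored against. Because the abstract states that any test and classifier pair can be used, `z_test_p` and `predict` are written as separate functions that could be swapped for other choices.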
1 Present address: Department of Pathology/Center for
Pathology Informatics/Benedum Center for Oncology Informatics,
University of Pittsburgh, Pittsburgh, Pennsylvania 15232, USA.
2 Corresponding author.
E-MAIL lyonsweilerj{at}msx.upmc.edu; FAX (412) 647-5380.
Article and publication are at
http://www.genome.org/cgi/doi/10.1101/gr.104003.
