TY - JOUR
T1 - Enriched random forests
AU - Amaratunga, Dhammika
AU - Cabrera, Javier
AU - Lee, Yung Seop
PY - 2008/9
Y1 - 2008/9
N2 - Although the random forest classification procedure works well in datasets with many features, when the number of features is huge and the percentage of truly informative features is small, such as with DNA microarray data, its performance tends to decline significantly. In such instances, the procedure can be improved by reducing the contribution of trees whose nodes are populated by non-informative features. To some extent, this can be achieved by prefiltering, but we propose a novel, yet simple, adjustment that has demonstrably superior performance: choose the eligible subsets at each node by weighted random sampling instead of simple random sampling, with the weights tilted in favor of the informative features. This results in an 'enriched random forest'. We illustrate the superior performance of this procedure in several actual microarray datasets.
UR - http://www.scopus.com/inward/record.url?scp=51749102692&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/btn356
DO - 10.1093/bioinformatics/btn356
M3 - Article
C2 - 18650208
AN - SCOPUS:51749102692
SN - 1367-4803
VL - 24
SP - 2010
EP - 2014
JO - Bioinformatics
JF - Bioinformatics
IS - 18
ER -