Data mining criteria for tree-based regression and classification

Andreas Buja, Yung Seop Lee

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

37 Scopus citations

Abstract

This paper is concerned with the construction of regression and classification trees that are more adapted to data mining applications than conventional trees. To this end, we propose new splitting criteria for growing trees. Conventional splitting criteria attempt to perform well on both sides of a split by attempting a compromise in the quality of fit between the left and the right side. By contrast, we adopt a data mining point of view by proposing criteria that search for interesting subsets of the data, as opposed to modeling all of the data equally well. The new criteria do not split based on a compromise between the left and the right bucket; they effectively pick the more interesting bucket and ignore the other. As expected, the result is often a simpler characterization of interesting subsets of the data. Less expected is that the new criteria often yield whole trees that provide more interpretable data descriptions. Surprisingly, it is a "flaw" that works to their advantage: The new criteria have an increased tendency to accept splits near the boundaries of the predictor ranges. This so-called "end-cut problem" leads to the repeated peeling of small layers of data and results in very unbalanced but highly expressive and interpretable trees.
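The contrast between a compromise criterion and a one-sided criterion can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' method: the abstract does not give the exact criteria, so the sketch assumes a regression setting, uses bucket variance as the measure of fit, and treats the lower-variance bucket as the "more interesting" one. The names `best_split`, `compromise`, and `one_sided`, the `min_bucket` parameter, and the toy data are all hypothetical.

```python
from statistics import pvariance


def best_split(xs, ys, criterion, min_bucket=2):
    """Return (threshold, score) for the split of xs that minimizes criterion.

    criterion(left_ys, right_ys) maps the two buckets of responses to a
    score; lower is better. min_bucket is an assumed minimum bucket size.
    """
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(min_bucket, len(pairs) - min_bucket + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal predictor values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = criterion(left, right)
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


def compromise(left, right):
    # Conventional style: size-weighted average of both bucket variances,
    # so a good split has to fit the left AND the right side reasonably well.
    n = len(left) + len(right)
    return (len(left) * pvariance(left) + len(right) * pvariance(right)) / n


def one_sided(left, right):
    # Assumed one-sided variant: score the split by its better bucket only
    # and ignore the other bucket entirely.
    return min(pvariance(left), pvariance(right))


# Toy data (made up): a tiny pure group at the low end of the predictor range.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 2, 0, 4, 10, 6, 12, 8]

compromise_split = best_split(xs, ys, compromise)
one_sided_split = best_split(xs, ys, one_sided)
print("compromise threshold:", compromise_split[0])  # 4.5
print("one-sided threshold:", one_sided_split[0])    # 2.5
```

On this toy data the compromise criterion splits in the middle of the range, while the one-sided criterion cuts off the two identical observations at the boundary. Repeating such cuts on the remaining data is the kind of layer-by-layer peeling the abstract attributes to the end-cut tendency.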

Original language: English
Title of host publication: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Editors: F. Provost, R. Srikant, M. Schkolnick, D. Lee
Publisher: Association for Computing Machinery (ACM)
Pages: 27-36
Number of pages: 10
ISBN (Print): 158113391X, 9781581133912
State: Published - 2001
Event: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2001), San Francisco, CA, United States
Duration: 26 Aug 2001 to 29 Aug 2001

Publication series

Name: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Conference

Conference: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2001)
Country/Territory: United States
City: San Francisco, CA
Period: 26/08/01 to 29/08/01

Keywords

  • Boston Housing data
  • CART
  • Pima Indians Diabetes data
  • Splitting criteria
