Classification and Regression Tree — CART

Classification and Regression Tree — CART

According to the graph, the following questions should be asked:

  • how old is the customer?
  • if the answer is ?above 40?, then the second question should be: ?does he/she own a real estate??
  • if the answer is ?yes? then loan can be issued to that customer.

In Machine Learning Decision Trees, the logic is very similar to the above real-world graphic. The tree is basically an upside down version of a real tree, the roots on top and the leaves at the bottom.

The above graph is an example of Gender prediction based on Height and Weight values. Height feature (an example of a predictor space) is being split into two branches based on whether the height value is below 180 cm or above 180 cm. The point where the left-hand branch and the right-hand branch are connected is called an internal node. From the internal node, if the Height value is above 180 cm, then the left-hand direction should be followed and the outcome is ?Male?. That is the dead-end, the terminal node or the leaf. If the height value is below 180 cm, then further questions should be asked in order to predict the outcome.

The main idea is to purify the predictor space by recursive binary splitting in a downward direction (from root to leaves). We split the predictor space from the best cut-point and then based on the newly derived two predictor spaces (two boxes containing observations) , find the second cut-point and continue binary splitting. This splitting process (tree pruning) is being performed in a greedy way, the best place for binary splitting is decided on each step, rather than looking ahead and picking a split that would lead to a better tree in some future step.

What is the way of finding the best place to split?

In order to find the best cut-point, different approaches are being used for regression and classification problems. The main idea here is to find the cut-point where the cost function is minimum. We?ll examine the approaches one by one.

Residual Sum of Squares ? RSS

This approach is being used in regression problems.

RSS formula

In the above formula, y is the real value whereas y_hat is the predicted one.

In order to find the cost function, the processes should be made.

  • A value is being selected as the cut-point and the predictor space is being divided into two boxes from this cut-point.
  • Per each box, the square of the difference between y and y_hat are being calculated iteratively for each observation within this box.
  • This process is being carried for all the points and the minimum RSS value is chosen among all the results. This is the best point to place the internal node, in other words the best place for the cut-point.

Gini and Entropy

These two approaches are being used in classification problems. Gini is the default approach in Python?s scikit-learn library.

Gini formula
Entropy formula

Just like the RSS approach, in order to find the cut-point where the cost function is minimum, the above formulas are being used iteratively.

Below is a part of the tree graph that I plotted via a Graphviz function during Diabetes CART project. Gini is being calculated for the root feature having 90 samples and being split into two branches from the best cut-point. Gini, having a small value means that there is a node purity. This pruning process stops on either:

  • Gini becomes zero,
  • Minimum samples split value (that is given to the model as a hyperparameter) has been reached,
  • Maximum depth (that is given to the model as a hyperparameter) has been reached.
plotted via export_graphviz from scikit-learn library


  • Decision Trees are easy to understand, implement and visualize.
  • They are robust to missing values and outliers. The only exception can be the outliers in the outcome in regression problems.
  • It is commonly believed that decision trees are a successful way of mirroring human decision-making.


  • Decision trees are prone to overfitting, especially when the tree is particularly deep. This causes the model to learn the train dataset with high accuracy instead of learning the different patterns that lay behind the dataset. High scores on train dataset would be misleading for the accuracy of the test dataset would be lower.
  • A smaller tree with fewer splits might lead to lower variance and better interpretation at the cost of a little bias. Also, bias with imbalance dataset.
  • Small changes in the dataset can lead to a different decision tree.


  1. Data Science and Machine Learning Bootcamp Lectures
  2. Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani. An Introduction to Statistical Learning : with Applications in R. New York :Springer, 2013.

No Responses

Write a response