Cost Complexity Pruning in Decision Trees

Understanding the problem of Overfitting in Decision Trees and solving it by Minimal Cost-Complexity Pruning using Scikit-Learn in Python

A Snippet of our dataset

Our aim is to predict the Species of a flower based on its Sepal Length and Width.
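Since the dataset snippet is an image in the original post, here is a minimal sketch of loading the data, assuming the classic Iris dataset bundled with scikit-learn and keeping only the two sepal features the article uses:

```python
from sklearn.datasets import load_iris

# Load Iris as a DataFrame; "target" encodes the species as 0, 1, 2.
iris = load_iris(as_frame=True)
df = iris.frame

# Keep only the two features used in the article: sepal length and width.
X = df[["sepal length (cm)", "sepal width (cm)"]]
y = df["target"]
print(X.head())
```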

We will split the dataset into two parts, Train and Test, so that we can see how our model performs on unseen data as well. (We shall use the train_test_split function from sklearn.model_selection to split the dataset.)

Splitting the Dataset into Train and Test
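The splitting code is a screenshot in the original; a sketch of it might look like the following. The test_size and random_state values are assumptions for reproducibility, not necessarily the ones the article used:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X = X[:, :2]  # sepal length and sepal width only

# Hold out 30% of the data as the test part (assumed split ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)
```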

Now, let's fit a Decision Tree to the train part and predict on both train and test. (We will use DecisionTreeClassifier from sklearn.tree for this purpose.)

Implementation of the Decision Tree
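The fitting code is also shown as an image; a minimal reconstruction, under the same assumed split, could be:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X = X[:, :2]  # sepal features only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# By default no pruning is applied: the tree grows until its leaves are pure.
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

train_acc = accuracy_score(y_train, clf.predict(X_train))
test_acc = accuracy_score(y_test, clf.predict(X_test))
print(f"Train accuracy: {train_acc:.2f}, Test accuracy: {test_acc:.2f}")
```

The exact scores depend on the split, but the train accuracy of an unpruned tree will sit well above the test accuracy, which is the overfitting gap the article describes.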

By default, the Decision Tree function doesn't perform any pruning and allows the tree to grow as much as it can. We get an accuracy score of 0.95 and 0.63 on the train and test parts respectively, as shown below. We can say that our model is Overfitting, i.e. it memorizes the train part but is not able to perform equally well on the test part.

Accuracy on train and test is 0.95 and 0.63 respectively

DecisionTreeClassifier in sklearn has a method called cost_complexity_pruning_path, which gives the effective alphas of subtrees during pruning along with the corresponding impurities. In other words, we can use these values of alpha to prune our decision tree.

Cost Complexity Pruning Path
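A sketch of calling cost_complexity_pruning_path, again assuming the Iris setup and split above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X = X[:, :2]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

clf = DecisionTreeClassifier(random_state=42)
path = clf.cost_complexity_pruning_path(X_train, y_train)

# Each alpha is the threshold at which one more subtree gets pruned away;
# impurities holds the total leaf impurity of the corresponding subtree.
ccp_alphas, impurities = path.ccp_alphas, path.impurities
print(ccp_alphas)
```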

We will take these values of alpha and pass them to the ccp_alpha parameter of our DecisionTreeClassifier. By looping over the alphas array, we will find the accuracy on both the Train and Test parts of our dataset.

Code to loop over the alphas and plot the line graph for the corresponding Train and Test accuracies
Accuracy vs. Alpha
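The loop-and-plot code is a screenshot in the original; a self-contained reconstruction under the same assumed setup could be:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X = X[:, :2]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(
    X_train, y_train
)
ccp_alphas = path.ccp_alphas

# Fit one tree per alpha and record train/test accuracy.
train_scores, test_scores = [], []
for alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    clf.fit(X_train, y_train)
    train_scores.append(clf.score(X_train, y_train))
    test_scores.append(clf.score(X_test, y_test))

plt.plot(ccp_alphas, train_scores, marker="o", label="train")
plt.plot(ccp_alphas, test_scores, marker="o", label="test")
plt.xlabel("ccp_alpha")
plt.ylabel("accuracy")
plt.legend()
plt.savefig("accuracy_vs_alpha.png")
```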

From the above plot, we can see that between alpha=0.01 and 0.02, we get the maximum test accuracy. Although our train accuracy has decreased to 0.8, our model is now more generalized and it will perform better on unseen data.

Train and Test Accuracy at α = 0.02

We can also use K-fold cross validation to evaluate our model rather than a single train-test split. It will give us a much better overview of our model's performance on unseen data.
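As a sketch, cross-validating the pruned tree with cross_val_score might look like this; ccp_alpha=0.02 is taken from the plot above and is an assumption, as is the choice of 5 folds:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X = X[:, :2]  # sepal features only

# Pruned tree with the alpha suggested by the accuracy-vs-alpha plot.
clf = DecisionTreeClassifier(random_state=42, ccp_alpha=0.02)

# 5-fold cross validation: fit on 4 folds, score on the held-out fold.
scores = cross_val_score(clf, X, y, cv=5)
print(f"Mean CV accuracy: {scores.mean():.2f} (+/- {scores.std():.2f})")
```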

If you want to understand the Math behind Cost-Complexity Pruning, click here. Check out the scikit-learn documentation of Decision trees by clicking here.

You can find the notebook on my GitHub and take a closer look at what I have done. Also, connect with me on LinkedIn and let's discuss Data!

Thanks for staying till the end! ^_^
