## Forewords

In a previous post, I wrote about the theoretical foundation of the CART algorithm with a particular focus on regression problems. I then put the theory into practice by conducting an experiment in R where some regression trees were fitted and interpreted. Continuing on the topic of CART, in this post I will write about how the algorithm can also be used to build classification trees. Since the two kinds of trees mainly differ in terms of the optimization criteria for splitting and pruning, these two aspects will be the focus of this post.

## Optimization criteria for splitting

For both regression and classification problems, CART uses a greedy and binary recursive partitioning strategy to grow a tree. However, unlike regression trees which use mean squared error as the measure for node impurity, classification trees uses a measure called Gini index.

Assuming that the outcome has \(c\) classes, and the proportion of class \(i\) in a node is denoted as \(p_i\), then the Gini index for that particular node can be defined as:

\[ \begin{align} Gini \ index &= \sum_{i = 1}^{c}p_i(1 - p_i) \\ &= \sum_{i = 1}^{c}p_i - \sum_{i = 1}^{c}p_i^2 \\ &= 1 - \sum_{i = 1}^{c}p_i^2 \end{align} \]

The value of the Gini index has an inverse association with the variance of \(p\) (i.e., \(\sum_{i = 1}^{c}(p_i - 1 / c)^2\)), and is maximized when \(p_1 = p_2 = ... = p_c = \frac{1}{c}\), minimized when \(p_i = 1\) (\(i \in \{1, 2, ..., c\}\)).

The strategy of recursive partitioning is to find a best split for each node that has the minimum weighted sum of the Gini index of the two child nodes. Suppose the number of observations in a parent node \(R_k\) is denoted as \(N_k\); those in the two child nodes \(R_{kl}\) and \(R_{kr}\) are denoted as \(N_{kl}\) and \(N_{kr}\), respectively, then the measure to evaluate the split would be^{1}:

\[ Gini(R_{kl}) \times \frac{N_{kl}}{N_k} + Gini(R_{kr}) \times \frac{N_{kr}}{N_k} \]

where \(Gini(R_{kl})\) and \(Gini(R_{kr})\) respectively represents the Gini index for the two child nodes. Note that \(N_{kl} + N_{kr} = N_k\).

## Prediction

The proportion of a class in a node is taken as the probability of that class in that specific node. Therefore, the outcome of any observation that falls into a leaf node will be predicted with the most frequent class in that node.

## Tree pruning

After a full tree has been grown, it can be pruned using cost-complexity pruning. The Gini index can be used to guide the process, but typically the misclassification error rate is used instead. Let \(m\) denote the number of leaf nodes in a sub-tree \(T\), the pruning process aims to find the best sub-tree that has the minimal \(C_{\alpha}(T)\) defined as follows:

\[ C_{\alpha}(T) = \sum_{k = 1}^{m} ErrorRate(R_k) \times N_k + \alpha m \quad (3) \]

where \(\alpha\) is the hyperparameter that controls the process.

The pruning process starts with parent nodes of leave nodes and proceeds until only the root node is left. In effect, each time the algorithm chooses one node to collapse which leads to a minimal increase in \(\sum_{k = 1}^{m} ErrorRate(R_k) \times N_k\).

## More on the optimization criteria

Why use the Gini index instead of the misclassification rate as the optimization criteria for splitting? One reason is that the Gini index is differentiable and is therefore more suitable for numerical optimization. Another reason is that the Gini index is more sensitive to changes in the node probabilities than the misclassification rate. Assume a two-class problem with 4 observations in each class, which can be denoted by \((4, 4)\). Suppose one split creates nodes \((3, 1)\) and \((1, 3)\), while another split creates nodes \((2, 4)\) and \((2, 0)\). Both splits produce a misclassification rate of 0.25, but the Gini index is lower for the second split. Since the second split produces a pure node and is therefore more preferable, the Gini index appears to be a better measure to evaluate the result of a split.

For comparison, the measure to evaluate a split in a regression tree is \(MSE(R_{kl}) \times \frac{N_{kl}}{N_k} + MSE(R_{kr}) \times \frac{N_{kr}}{N_k}\), where \(MSE\) stands for mean squared error.↩