This blog post surveys existing approaches for dealing with imbalanced data. Whenever we apply statistical classification methods to real-world problems, it is common for the data to be imbalanced: far fewer observations are labeled as one class, while the majority are labeled as the other. Examples include fraud detection, customer conversion forecasting, medical diagnosis (such as rare diseases), and text classification. We will talk about approaches that focus on




**I) evaluation metrics for the imbalance problem** first, and then move to **II) remedies for better classifying imbalanced data**, where methods generally fall into three categories: **A) changing class distributions, B) classifier-level approaches, and C) ensemble learning methods**. For more details, please check out these three papers: *"On the Class Imbalance Problem" (Guo, 2008)*, *"A New Evaluation Measure for Imbalanced Data Sets" (Weng, 2006)*, and *"Facing Imbalanced Data" (Jeni)*.

**I) Evaluation metrics.** Accuracy is the most common evaluation metric for traditional applications, but it is not suitable for evaluating imbalanced data sets. For example, if 99% of the outcomes are negative, a classifier can appear to perform well simply by putting every observation into the negative category. Click on the link for more details about the confusion matrix.
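The accuracy trap described above is easy to demonstrate. Below is a minimal illustration with made-up counts (10 positives out of 1,000 observations) and a degenerate "classifier" that always predicts the majority class:

```python
# Hypothetical illustration: with 99% negative examples, a classifier that
# predicts "negative" for everyone still scores 99% accuracy while
# catching zero positive cases.
labels = [1] * 10 + [0] * 990          # 10 positives, 990 negatives
preds = [0] * 1000                     # majority-class "classifier"

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
recall = tp / (tp + fn)

print(accuracy)  # 0.99
print(recall)    # 0.0 -- every positive case is missed
```

Accuracy looks excellent, yet recall on the minority class is zero, which is exactly why the metrics below are preferred.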

F-value is a popular evaluation metric for the imbalance problem. It combines recall and precision, which are effective metrics in the information retrieval community, where the imbalance problem also exists. F-value is high only when both recall and precision are high, and it can be adjusted by changing the parameter beta, which controls the relative importance of precision vs. recall.

- Eq 1. Accuracy = (TP + TN) / (TP + FN + FP + TN)
- Eq 2. Precision = TP / (TP + FP)
- Eq 3. Recall = TP / (TP + FN)
- Eq 4. FP rate = FP / (FP + TN)
- Eq 5. TP rate = TP / (TP + FN)
- Eq 6. F-value = (1 + beta^2) * Recall * Precision / (beta^2 * Recall + Precision)
- Eq 7. MGM = sqrt(Accuracy+ * Accuracy-), the geometric mean of the accuracy on the positive class (Accuracy+) and on the negative class (Accuracy-)
- Eq 8. MS = Accuracy+ + Accuracy-

In imbalanced data, not only is the class distribution skewed, but the misclassification costs are often uneven as well. The minority class is often more important than the majority class, and False Negative mistakes are usually more costly than False Positive mistakes.

A **cost matrix** is used when we know the costs of the problem at hand. In this case, we can use the known costs to weight the resulting confusion matrix and arrive at a meaningful performance assessment. In addition, as many know, ROC analysis and AUC (the area under the ROC curve) are commonly accepted measures of classification performance. However, the ROC curve makes it hard to compare classifiers under different misclassification costs and class distributions.
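As a toy sketch of the cost-matrix idea, we can weight each cell of the confusion matrix by its cost and sum. The counts and costs below are made-up numbers for illustration; here a false negative is assumed to be 10x as costly as a false positive:

```python
# Rows = actual class, columns = predicted class: [[TN, FP], [FN, TP]]
confusion = [[980, 10],
             [5, 5]]
# Correct decisions cost 0; a missed positive (FN) is assumed 10x as
# costly as a false alarm (FP). These costs are illustrative only.
cost = [[0, 1],
        [10, 0]]

# Element-wise product of counts and costs, summed over all cells.
total_cost = sum(confusion[i][j] * cost[i][j]
                 for i in range(2) for j in range(2))
print(total_cost)  # 10*1 + 5*10 = 60
```

Under this costing, the 5 false negatives account for most of the total cost even though they are half as numerous as the false positives.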


In the paper *A New Evaluation Measure for Imbalanced Data Sets*, Weng showed that **weighted-AUC** is a better alternative for evaluating imbalanced data sets. A false negative is worse than a false positive, so in the ideal case the learner should catch every positive example, and the best choice should be the one with the lowest FP rate at the 100% TP rate line. In the following graph, classifier A is more appropriate than classifier B because, in the higher TP rate region, classifier A has a smaller FP rate than classifier B. Weng proposed a skewed weight distribution method that allows one to compute AUC with a cost bias, called weighted-AUC. When the cost is uneven and biased towards the rare class, instead of summing up areas with equal weights, we give more weight to the areas near the top of the graph.
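One rough way to realize this idea is a trapezoidal AUC in which each strip's contribution is weighted by its TP rate, so regions near the top of the ROC graph count more. This is a sketch of the concept only: the linear weight w(tpr) = tpr and the normalization below are my illustrative choices, not the exact skewed weight distribution from Weng's paper.

```python
def weighted_auc(fpr, tpr):
    """Trapezoidal AUC with strips near TPR = 1 weighted more heavily.

    fpr, tpr: ROC curve points, sorted by increasing FPR.
    """
    num = den = 0.0
    for i in range(1, len(fpr)):
        dfpr = fpr[i] - fpr[i - 1]
        tpr_mid = 0.5 * (tpr[i] + tpr[i - 1])  # strip height (trapezoid)
        w = tpr_mid                            # heavier weight near TPR = 1
        num += w * tpr_mid * dfpr
        den += w * dfpr                        # normalize so scores stay in [0, 1]
    return num / den if den else 0.0

# Perfect classifier: reaches 100% TP rate at 0% FP rate.
print(weighted_auc([0.0, 0.0, 1.0], [0.0, 1.0, 1.0]))  # 1.0
# Random classifier (the diagonal) still scores 0.5.
print(weighted_auc([0.0, 1.0], [0.0, 1.0]))  # 0.5
```

With a constant weight this reduces to ordinary AUC; an increasing weight rewards classifiers, like classifier A above, that keep the FP rate low in the high-TP-rate region.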