
Results & Discussion


Note: Complete results and the files for running the programs can be found on the project GitHub page (link on the Materials page at the top).


Base Model: 

The confusion matrix and table below show the results from the first model iteration, which attempted to mirror the 3-layer CNN-SVM hybrid model from the Chakrabarty paper using the 5-category Kaggle dataset.

[Figure: base model confusion matrix and results table]

As can be seen, these results were highly unsuccessful: the model does not distinguish among the classes at all and instead predicts class 0 for every input. Note that the accuracy of 0.73 is largely inflated by the unbalanced dataset (discussed further below); class 0 makes up roughly 73% of the images, so a model that always predicts class 0 achieves 0.73 accuracy by default.
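For reference, here is a minimal sketch of the hybrid pipeline we attempted to mirror: a small 3-layer CNN is trained as a classifier, its convolutional features are then extracted, and a linear SVM is fit on those features. The layer sizes, 128x128 input size, and kernel choice below are illustrative assumptions, not the exact values from the paper.

```python
import torch
import torch.nn as nn
from sklearn.svm import SVC

class SmallCNN(nn.Module):
    """3-layer CNN used as a feature extractor (illustrative sizes)."""
    def __init__(self, n_classes=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # 128x128 input -> 64 channels at 16x16 after three 2x poolings
        self.classifier = nn.Linear(64 * 16 * 16, n_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

def extract_features(model, loader, device="cpu"):
    """Run images through the trained CNN and collect flattened features."""
    model.eval()
    feats, labels = [], []
    with torch.no_grad():
        for x, y in loader:
            feats.append(model.features(x.to(device)).flatten(1).cpu())
            labels.append(y)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

# After training the CNN with cross-entropy, fit the SVM on its features:
# X_train, y_train = extract_features(cnn, train_loader)
# svm = SVC(kernel="linear").fit(X_train, y_train)
# preds = svm.predict(extract_features(cnn, test_loader)[0])
```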


We hypothesized several possible issues that could be leading to these results, and our next step was to test different iterations to try to isolate and solve the problem. The complete results of each mini-study can be found on the GitHub page, with details in the ReadMe file. A summary of the iterations and results follows below.

Iteration #1: The Kaggle dataset is representative of the overall population and is thus heavily skewed toward class 0 (healthy). We hypothesized that this skewed class distribution was influencing the training of the feature-extraction CNN, in that the easiest strategy for the network is to simply predict class 0. New training batches were therefore created with an even distribution of classes in each set. The figure below shows the original distribution of the full Kaggle dataset.

[Figure: class distribution of the full Kaggle dataset]

Note that creating evenly distributed classes meant the total usable dataset was limited to the size of the smallest class (class 4), so our training set was much smaller than before.
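A minimal sketch of how such a balanced subset can be drawn, assuming the Kaggle labels sit in a pandas DataFrame with image and level columns (the file and column names are assumptions):

```python
import pandas as pd

labels = pd.read_csv("trainLabels.csv")  # columns assumed: image, level (0-4)

# Undersample every class down to the size of the smallest class (class 4),
# so training batches see an even class distribution.
n_smallest = labels["level"].value_counts().min()
balanced = (
    labels.groupby("level", group_keys=False)
          .apply(lambda g: g.sample(n=n_smallest, random_state=0))
          .sample(frac=1, random_state=0)  # shuffle the rows
          .reset_index(drop=True)
)
print(balanced["level"].value_counts())  # every class now has n_smallest rows
```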

Our results from iteration 1 (using the balanced dataset) are shown below. Again, the model was highly unsuccessful, predicting almost at random between class 0 and class 4. As the confusion matrix shows, there are very few predictions in the intermediate classes. Iteration 2 targets this issue next.

[Figure: iteration 1 confusion matrix and results]

Iteration #2: After performing iteration 1, we hypothesized that the pre-processing step removes too much information before the CNN is trained. From the confusion matrix in iteration 1, we noticed that most of the predictions fell in classes 0 and 4, so it is possible that the color spectrum plays a role in the diagnosis. Additionally, the thresholding step may have been removing fine details relevant to the intermediate classes. The pre-processing step was therefore removed for this iteration. The impact of pre-processing on the input images is shown here.

[Figure: input image before and after pre-processing]
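For context, here is a sketch of the kind of grayscale-plus-threshold pre-processing that was removed in this iteration (OpenCV; the threshold value and exact sequence of steps in our pipeline may differ):

```python
import cv2

def preprocess(path, thresh=127):
    """Grayscale conversion followed by binary thresholding.
    This discards the color spectrum and fine intensity detail,
    which iteration 2 suggests the CNN needs for classes 1-3."""
    img = cv2.imread(path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, thresh, 255, cv2.THRESH_BINARY)
    return binary

# Iteration 2 simply skips this step and feeds the raw RGB image to the CNN.
```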

By removing the pre-processing, the goal was for the CNN to perform the feature extraction on its own. The results are shown below.

[Figure: iteration 2 confusion matrix and results]

From iteration 2, we noticed the results look very similar to guessing at random among all 5 classes. Even though this fixed the issue of never predicting classes 1-3, the results are still far from ideal. Our team drew two main speculations from these results: (1) the differences between classes 0-4 are not distinct enough, or (2) the CNN is not deep enough to learn the discriminating features between classes. Both theories are tested below.

 

Nonetheless, from this iteration our team was able to conclude that the pre-processing step was limiting the CNN's ability to separate the classes, so no pre-processing was performed in subsequent iterations.

Iteration #3: We hypothesized that the differences among the classes of the dataset are not distinct enough, so a new dataset was created consisting of only class 0 and class 4, which show the most drastic differences. This test is more representative of the study performed in the Chakrabarty paper (which used only 2 classes, no DR vs. proliferative DR). Refer to the images on the Dataset tab to see the differences between classes; notice, for example, that it is very hard to visually distinguish class 0 from class 1. The results from this iteration are shown below. Note that no pre-processing was done on these images and a balanced dataset was used.

 

Notice that almost all images are predicted as class 1 here. In this iteration, class 1 corresponds to class 4 from our original study: since only classes 0 and 4 from the original dataset are used, they are relabeled as 0 and 1 respectively.
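A minimal sketch of how the two-class dataset was formed, again assuming a labels DataFrame with the column names used above (illustrative, not our exact script):

```python
import pandas as pd

labels = pd.read_csv("trainLabels.csv")  # columns assumed: image, level (0-4)

# Keep only the extremes: no DR (0) vs. proliferative DR (4),
# then relabel 4 -> 1 so the classifier sees classes {0, 1}.
binary = labels[labels["level"].isin([0, 4])].copy()
binary["level"] = (binary["level"] == 4).astype(int)
```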

[Figure: iteration 3 confusion matrix and results]

None of these iterations produced good results (a more detailed breakdown of each study is located on the GitHub page; see the ReadMe for details). However, these studies indicated to us that the 3-layer CNN is not a deep enough network to handle this dataset.

Additional Model Iteration: Adapting a State-of-the-Art Deep Learning Architecture to Replace the CNN Structure from the Literature

Our next step was to use a deeper, state-of-the-art network (ResNet) in place of our simple 3-layer CNN. The results below show ResNet applied in place of the 3-layer CNN in iteration 3 above (no pre-processing, balanced dataset, classes 0 and 4 only).
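A minimal sketch of this swap, assuming torchvision's pretrained ResNet-18 (our exact ResNet variant may differ) with its classification head removed so that it emits 512-dimensional features for the same SVM stage as before:

```python
import torch
import torch.nn as nn
from torchvision import models
from sklearn.svm import SVC

# Pretrained ResNet-18 with the classification head replaced by identity,
# so forward() returns the 512-dim pooled feature vector.
resnet = models.resnet18(pretrained=True)
resnet.fc = nn.Identity()
resnet.eval()

def resnet_features(loader, device="cpu"):
    """Collect ResNet features and labels for an entire data loader."""
    feats, labels = [], []
    with torch.no_grad():
        for x, y in loader:
            feats.append(resnet(x.to(device)).cpu())
            labels.append(y)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

# X_train, y_train = resnet_features(train_loader)
# svm = SVC(kernel="linear").fit(X_train, y_train)
```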

[Figure: ResNet + SVM confusion matrix and results]

The results did improve compared to iteration 3 of our study, but they still amount to nearly random guessing. We then hypothesized that the SVM component of the algorithm was unnecessary, since modern deep neural networks (such as ResNet) alone have shown great results on image-processing tasks. As a result, we ran another iteration, shown below: the images are fed directly to a tuned ResNet model and the network alone makes the predictions (no SVM). This model was run on the same style of dataset (no pre-processing, balanced, classes 0 and 4 only) for comparison.
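A minimal sketch of this ResNet-only variant: the pretrained network's final layer is replaced with a fresh 2-class head and the whole model is fine-tuned end-to-end with cross-entropy (the optimizer and learning rate are illustrative assumptions):

```python
import torch
import torch.nn as nn
from torchvision import models

device = "cuda" if torch.cuda.is_available() else "cpu"

# Pretrained ResNet-18 with a new 2-class head (class 0 vs. class 4).
model = models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 2)
model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_epoch(loader):
    model.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

# Predictions come straight from the network -- no SVM stage:
# preds = model(x).argmax(dim=1)
```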

[Figure: ResNet-only confusion matrix and results]

As the results show, the SVM component does seem to degrade model performance; ResNet alone classifies DR much better, reaching an accuracy of 82% on the balanced class 0/4 dataset.

Discussion

Our results above do not support the conclusion drawn by the authors (Chakrabarty and Chatterjee). The SVM-hybrid model did not perform well on a larger, more varied dataset. Our results show:

  • A deeper network is necessary to extract the minute feature differences.

  • The SVM component in the hybrid model is not necessary. A CNN alone performs better for this study.

 

Additionally, our results support the modern view that deep neural networks alone are capable of extracting the appropriate features, and thus manual feature extraction may not always be beneficial.

Though our focus for this study was simply to analyze the performance of the SVM-CNN hybrid approach, there are several future directions our team would like to explore to continue improving the model we have built.

  1. Data augmentation is likely to help with such a small dataset (when using the balanced dataset); see the sketch after this list.

  2. Class weighting would allow us to train on the entire dataset, taking advantage of all the available data without skewing the model's predictions (also sketched below).

  3. There may be differences between left/right eyes or the types of microscopes used that influence our results. Further cleaning and separation of the data might help.

  4. Our team would also like to seek more information on the scoring criteria used for this dataset. Did the same physician perform all of the labeling? If not, this may introduce bias into the dataset. Considering how hard it currently is for physicians to detect early stages of DR (classes 1 and 2), different physicians may apply different grading scales.
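As a concrete starting point for items 1 and 2 above, a sketch combining simple augmentation transforms with class-weighted cross-entropy so the full, unbalanced dataset could be used (the transforms are illustrative, and train_labels is an assumed list of the 0-4 labels for the training set):

```python
import torch
import torch.nn as nn
from torchvision import transforms

# 1. Data augmentation: random flips/rotations stretch a small balanced set.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ToTensor(),
])

# 2. Class weighting: weight each class inversely to its frequency so the
#    full skewed dataset can be used without biasing toward class 0.
#    train_labels is assumed to be a list of integer labels (0-4).
class_counts = torch.bincount(torch.tensor(train_labels), minlength=5).float()
weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = nn.CrossEntropyLoss(weight=weights)
```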
