Publication Date
Spring 2023
Degree Type
Master's Project
Degree Name
Master of Science (MS)
Department
Computer Science
First Advisor
William Andreopoulos
Second Advisor
Fabio Di Troia
Third Advisor
Carlos Rojas
Keywords
visualization, DNABERT, K-means clusters, DNA sequence tagging
Abstract
Deep neural networks have gained popularity and achieved high performance across multiple domains, such as medical decision-making, autonomous vehicles, and decision support systems. Despite this success, the internal workings of these models are opaque, and they are considered black boxes due to their nested, non-linear structure. This opacity makes it difficult to interpret the reasons behind their outputs, reducing the trust in and verifiability of the systems where these models are applied. This paper presents a systematic approach to identifying the clusters with the most misclassifications or false label annotations. For this research, we extracted the activation vectors from a deep learning model, DNABERT, and visualized them using t-SNE to understand the reasons behind the results it produces. We applied K-means in a hierarchical fashion to the activation vectors for a set of training instances. We analyzed cluster mean activation vectors to find patterns in the errors across K-means clusters. The cluster analysis revealed that the predictions were uniform, or nearly 100% identical, within clusters of similar activation vectors. We found that two clusters in which most objects belong to the same true class tend to be closer together than clusters of opposite classes. The means of objects with the same true label are closer when two clusters have the same predicted label than when they have opposite predicted labels, showing that the activation vectors reflect both predicted and true classes. We propose a heuristic to find the clusters with a high number of misclassifications or incorrect label annotations using between-cluster and within-cluster mean vector analysis. This can aid in identifying misclassifications of DNA sequences or problems with sequence tagging.
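The cluster-level analysis described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the exact heuristic, so everything below is an assumption for illustration only: the function names (`cluster_summary`, `flag_suspect_clusters`, `mean_distance`), the error-rate threshold, and the toy data are hypothetical, and cluster assignments are taken as given rather than computed with hierarchical K-means on DNABERT activations.

```python
import numpy as np

def cluster_summary(activations, cluster_ids, y_true, y_pred):
    """Per-cluster mean vector, prediction uniformity, and error rate."""
    summary = {}
    for c in np.unique(cluster_ids):
        mask = cluster_ids == c
        preds = y_pred[mask]
        # Fraction of members that share the cluster's majority predicted label.
        majority = np.bincount(preds).max() / preds.size
        summary[int(c)] = {
            "mean": activations[mask].mean(axis=0),
            "pred_uniformity": float(majority),
            "error_rate": float(np.mean(preds != y_true[mask])),
            "size": int(mask.sum()),
        }
    return summary

def flag_suspect_clusters(summary, max_error=0.5):
    """Hypothetical rule: flag clusters whose error rate exceeds max_error."""
    return sorted(c for c, s in summary.items() if s["error_rate"] > max_error)

def mean_distance(summary, a, b):
    """Euclidean distance between the mean activation vectors of two clusters."""
    return float(np.linalg.norm(summary[a]["mean"] - summary[b]["mean"]))

# Toy 2-D "activation vectors" (made up; real DNABERT activations are high-dimensional).
activations = np.array([
    [0.0, 0.0], [0.2, 0.0],   # cluster 0
    [5.0, 5.0], [5.2, 5.0],   # cluster 1
    [0.0, 5.0], [0.2, 5.0],   # cluster 2
])
cluster_ids = np.array([0, 0, 1, 1, 2, 2])
y_true = np.array([0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 0, 0])  # cluster 2 is uniformly mispredicted

summary = cluster_summary(activations, cluster_ids, y_true, y_pred)
suspects = flag_suspect_clusters(summary)
print("suspect clusters:", suspects)
print("distance between means of clusters 0 and 2:", mean_distance(summary, 0, 2))
```

In this toy setup every cluster has uniform predictions, yet one cluster disagrees entirely with its true labels, which is the kind of cluster a between-cluster and within-cluster mean comparison is meant to surface for inspection as either a misclassification or a labeling problem.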
Recommended Citation
Bhandare, Vedashree, "Visualizing classification errors and mislabeling in machine learning" (2023). Master's Projects. 1218.
DOI: https://doi.org/10.31979/etd.3pvx-z87v
https://scholarworks.sjsu.edu/etd_projects/1218