Publication Date

Spring 2023

Degree Type

Master's Project

Degree Name

Master of Science (MS)

Department

Computer Science

First Advisor

William Andreopoulos

Second Advisor

Fabio Di Troia

Third Advisor

Carlos Rojas

Keywords

visualization, DNABERT, K-means clusters, DNA sequence tagging

Abstract

Deep neural networks have gained popularity and achieved high performance across multiple domains, such as medical decision-making, autonomous vehicles, and decision support systems. Despite this achievement, the internal workings of these models are opaque, and the models are considered black boxes due to their nested and non-linear structure. This opacity makes it difficult to interpret the reasons behind their output, reducing the trust in and verifiability of the systems where these models are applied. This paper presents a systematic approach to identifying the clusters with the most misclassifications or incorrect label annotations. For this research, we extracted the activation vectors from DNABERT, a deep learning model, and visualized them using t-SNE to understand the reasons behind the results it produces. We applied K-means in a hierarchical fashion to the activation vectors for a set of training instances. We analyzed the cluster mean activation vectors to find patterns in the errors across K-means clusters. The cluster analysis revealed that the predictions were uniform, or nearly 100% identical, within clusters of similar activation vectors. Two clusters whose objects mostly belong to the same true class tend to be closer together than clusters of opposite classes. For objects with the same true label, the means are closer when the two clusters have the same predicted label than when they have opposite predicted labels, showing that the activation vectors reflect both predicted and true classes. We propose a heuristic that uses between-cluster and within-cluster mean vector analysis to find the clusters with a high number of misclassifications or incorrect label annotations. This can aid in identifying misclassifications of DNA sequences or problems with sequence tagging.
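The per-cluster analysis described in the abstract can be illustrated with a minimal Python sketch. It is not the project's implementation: the abstract does not specify the heuristic, so the threshold rule in `flag_suspect_clusters` is a hypothetical stand-in, the function names are invented for illustration, and the toy vectors, labels, and cluster assignments are made up. Cluster assignments are assumed to come from a prior K-means run on the activation vectors.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, available in Python 3.8+


def cluster_means(vectors, cluster_ids):
    """Return {cluster_id: mean activation vector} as lists of floats."""
    groups = defaultdict(list)
    for vec, cid in zip(vectors, cluster_ids):
        groups[cid].append(vec)
    return {
        cid: [sum(col) / len(members) for col in zip(*members)]
        for cid, members in groups.items()
    }


def cluster_error_rates(cluster_ids, y_true, y_pred):
    """Return {cluster_id: fraction of members where y_pred != y_true}."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for cid, t, p in zip(cluster_ids, y_true, y_pred):
        totals[cid] += 1
        errors[cid] += int(t != p)
    return {cid: errors[cid] / totals[cid] for cid in totals}


def flag_suspect_clusters(error_rates, threshold=0.5):
    """Hypothetical rule: flag clusters whose error rate meets the threshold."""
    return sorted(cid for cid, rate in error_rates.items() if rate >= threshold)


# Toy data, invented for illustration: 2-D stand-ins for activation vectors.
vectors = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0], [4.0, 0.0], [4.0, 2.0]]
cluster_ids = [0, 0, 1, 1, 2, 2]   # assumed output of a prior K-means run
y_true = [0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 0, 0]        # cluster 2 is predicted uniformly, but wrongly

means = cluster_means(vectors, cluster_ids)
rates = cluster_error_rates(cluster_ids, y_true, y_pred)
suspects = flag_suspect_clusters(rates)

# Between-cluster distances of the mean vectors.
d_02 = dist(means[0], means[2])
d_12 = dist(means[1], means[2])

print("error rates:", rates)
print("suspect clusters:", suspects)
print("distance 0-2:", d_02, "distance 1-2:", d_12)
```

In this toy setup, a cluster with a uniform but incorrect prediction is flagged by its error rate, and the distances between cluster means can then be compared across clusters with the same or opposite predicted and true labels, in the spirit of the analysis the abstract describes.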
