Publication Date

Summer 2023

Degree Type


Degree Name

Master of Science (MS)


Applied Data Science


Anand Ramasubramanian; David C. Anastasiu; Vishnu Pendyala


Thrombin is the key enzyme in the pathogenesis of most cardiovascular diseases. A new class of anticoagulation drugs, direct thrombin inhibitors (DTIs), has greatly improved patient care. Despite the minimal off-target effects and immediate action of DTIs, the risk of bleeding and pharmacokinetic issues have limited their applicability, and thrombotic complications remain a major concern. In an effort to expand the pipeline of DTIs, we developed a two-stage machine learning pipeline to identify and rank peptide sequences based on their effective thrombin inhibitory potential. The positive dataset for our model consisted of thrombin inhibitor peptides and their binding affinities (KI) curated from published literature, and the negative dataset consisted of peptides with no known thrombin inhibitory or related activity. The first stage of the model identified thrombin inhibitory sequences with an MCC of 83.6%, and the second stage, which covers the eight orders of magnitude spanned by KI, predicted the binding affinity of new sequences with a log RMSE of 1.114. These models also revealed physicochemical and structural characteristics that are hidden but unique to thrombin inhibitor peptides. Using the model, we mined more than 10 million peptides from diverse habitats and identified and ranked 86 sequences based on their predicted KI. Cluster analysis revealed that the hits comprised sequences with various levels of homology to known thrombin inhibitors. We propose that these hits offer a novel, promising set of DTI candidates, and that the classification-regression pipeline may be applicable to other systems where binding affinity data are available.
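
The two-stage design and its two reported metrics can be illustrated with a minimal Python sketch. This is not the thesis implementation: the function names, the callables `is_inhibitor` and `predict_ki`, and the use of base-10 logarithms for the log RMSE are all assumptions made for illustration. It shows a classifier filtering candidates, a regressor ranking the survivors by predicted KI (lower KI meaning tighter binding), and how MCC and an RMSE on log-transformed KI could be computed.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = inhibitor, 0 = not)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is taken as 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def log_rmse(ki_true, ki_pred):
    """RMSE on log10-transformed KI values (base 10 is an assumption here)."""
    sq_errors = [
        (math.log10(t) - math.log10(p)) ** 2 for t, p in zip(ki_true, ki_pred)
    ]
    return math.sqrt(sum(sq_errors) / len(sq_errors))


def rank_candidates(peptides, is_inhibitor, predict_ki):
    """Stage 1 keeps predicted inhibitors; stage 2 ranks them by predicted KI.

    `is_inhibitor` and `predict_ki` are hypothetical stand-ins for the trained
    classifier and regressor; lower KI is ranked first (tighter binding).
    """
    hits = [(seq, predict_ki(seq)) for seq in peptides if is_inhibitor(seq)]
    return sorted(hits, key=lambda hit: hit[1])
```

Working on the logarithm of KI keeps errors comparable across a range spanning many orders of magnitude, which is why a log-scale RMSE is a natural fit for the second stage.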