Author

Johny Xiongz

Publication Date

Fall 2025

Degree Type

Master's Project

Degree Name

Master of Science in Computer Science (MSCS)

Department

Computer Science

First Advisor

Amith Kamath Belman

Second Advisor

William Andreopoulos

Third Advisor

Shantanu Deshpande

Keywords

Machine Learning, Convolutional Layers, Multimodal, Transformer-based encoders

Abstract

This paper presents a Dual-Stream Transformer-based architecture for multimodal user verification, leveraging both keyboard and mouse dynamics to capture complementary behavioral patterns. Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, and other sequential models have shown success in modeling sequential relationships; however, they focus mainly on short-term patterns and can struggle to capture long-range dependencies. The proposed architecture employs two parallel Transformer-based encoders, each dedicated to one behavioral modality. Both streams integrate temporal convolutional layers for local feature extraction and self-attention mechanisms for modeling global temporal dependencies, allowing the system to learn both subtle and high-level behavioral representations. We introduce two implementations of the Dual-Stream architecture. The first uses late fusion, allowing the model to learn from each modality independently, while the second introduces early fusion through a dot-product fusion mechanism that lets the two streams learn from each other. Experimental results show that the late-fusion architecture effectively distinguishes genuine users from impostors, while the early-fusion variant remains under-optimized. This research highlights the potential of Transformer-based multimodal fusion as a solution for continuous and unobtrusive user authentication.
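To make the described architecture concrete, the following is a minimal PyTorch sketch of one possible reading of the dual-stream design: each stream applies a temporal convolution for local features followed by Transformer self-attention for global dependencies, and the two pooled embeddings are combined either by late fusion (concatenation) or by an element-wise product standing in for the dot-product early fusion. All layer sizes, the pooling choice, and the classifier head are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): one plausible dual-stream
# Transformer for keyboard + mouse dynamics. Dimensions are assumptions.
import torch
import torch.nn as nn

class StreamEncoder(nn.Module):
    """One behavioral stream: temporal convolution for local features,
    then Transformer self-attention for global temporal dependencies."""
    def __init__(self, feat_dim, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        # 1-D convolution over time captures short-range (local) patterns.
        self.conv = nn.Conv1d(feat_dim, d_model, kernel_size=5, padding=2)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x):  # x: (batch, time, feat_dim)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)  # (batch, time, d_model)
        h = self.encoder(h)                               # global dependencies
        return h.mean(dim=1)                              # pooled stream embedding

class DualStreamVerifier(nn.Module):
    """Keyboard and mouse streams encoded in parallel, then fused."""
    def __init__(self, kb_dim, ms_dim, d_model=64, fusion="late"):
        super().__init__()
        self.kb = StreamEncoder(kb_dim, d_model)
        self.ms = StreamEncoder(ms_dim, d_model)
        self.fusion = fusion
        in_dim = 2 * d_model if fusion == "late" else d_model
        self.head = nn.Linear(in_dim, 1)  # genuine-vs-impostor score

    def forward(self, kb_seq, ms_seq):
        zk, zm = self.kb(kb_seq), self.ms(ms_seq)
        if self.fusion == "late":
            # Late fusion: streams learned independently, joined at the end.
            z = torch.cat([zk, zm], dim=-1)
        else:
            # Early-fusion stand-in: element-wise product of the two
            # embeddings (the paper's dot-product mechanism is not
            # specified here, so this is only an assumption).
            z = zk * zm
        return torch.sigmoid(self.head(z))

# Toy usage: 50 time steps of 5 keyboard features and 8 mouse features.
model = DualStreamVerifier(kb_dim=5, ms_dim=8)
score = model(torch.randn(2, 50, 5), torch.randn(2, 50, 8))
print(score.shape)  # torch.Size([2, 1])
```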

Available for download on Saturday, December 19, 2026
