Publication Date

Fall 2025

Degree Type

Thesis

Degree Name

Master of Science (MS)

Department

Computer Engineering

Advisors

Magdalini Eirinaki; Bernardo Flores; Gheorghi Guzun

Abstract

Transcribing guitar music automatically is a complex task due to polyphonic overlap, tuning variations, and diverse playing techniques. Current transcription systems focus on identifying note pitches and timing while ignoring the performance techniques that describe how notes are played: they treat guitar recordings as generic polyphonic audio and produce MIDI-like outputs that lose important information about articulation and style. To address these challenges, we propose an end-to-end transformer model for automatic guitar transcription. The system uses a T5-based encoder-decoder architecture that processes the Constant-Q Transform (CQT) of stereo audio input. The stereo representation helps separate individual guitar parts within a mix by exploiting the spatial cues often present in studio recordings. Our model predicts detailed transcriptions that include string, fret, and timing information, along with expressive annotations for techniques such as hammer-ons, pull-offs, bends, and slides. A specialized tokenization scheme efficiently encodes this information, allowing the model to represent both the physical and expressive aspects of guitar performance. The model is trained and evaluated on a large annotated dataset of 75,579 songs totaling over 5,134 hours, covering a wide range of genres and recording conditions. Experimental results show that the proposed model improves over an established transcription baseline, achieving an overall F-measure of 0.3017, with onset and offset F-measures of 0.5180 and 0.4819, respectively. We also conduct a detailed analysis of the model’s outputs, examining common error patterns such as missed notes, pitch deviations, and timing inaccuracies.
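To make the stereo CQT front end described in the abstract concrete, the sketch below computes a per-channel Constant-Q Transform and stacks the two channels into a single model input, so spatial cues between channels are preserved. It is a minimal sketch assuming librosa; the hop length, bin count, and stacking convention are illustrative assumptions, not the configuration actually used in the thesis.

```python
# Minimal sketch: stereo CQT front end for a guitar-transcription model.
# hop_length, n_bins, and bins_per_octave are illustrative choices, not
# the thesis's actual configuration.
import librosa
import numpy as np

def stereo_cqt(path, sr=22050, hop_length=512, n_bins=192, bins_per_octave=24):
    """Load a stereo file and return log-magnitude CQTs stacked per channel."""
    y, _ = librosa.load(path, sr=sr, mono=False)  # shape: (2, n_samples)
    if y.ndim == 1:                               # fall back for mono files
        y = np.stack([y, y])
    channels = []
    for ch in y:
        c = librosa.cqt(ch, sr=sr, hop_length=hop_length,
                        n_bins=n_bins, bins_per_octave=bins_per_octave)
        channels.append(librosa.amplitude_to_db(np.abs(c), ref=np.max))
    # Shape (2, n_bins, n_frames): both channels kept separate so the
    # encoder can exploit spatial cues present in studio recordings.
    return np.stack(channels)
```

Similarly, the tokenization scheme mentioned in the abstract is not spelled out on this page, so the following is a hypothetical illustration of how string, fret, timing, and technique information might be serialized into a flat token sequence for a T5-style decoder. All field names and token formats here are invented for the example; the thesis defines its own vocabulary.

```python
# Hypothetical tokenization of one guitar note event. The event fields and
# token strings are invented for illustration only.
from dataclasses import dataclass

@dataclass
class NoteEvent:
    onset_ms: int    # onset time in milliseconds
    string: int      # 1 (high E) .. 6 (low E)
    fret: int        # 0 = open string
    technique: str   # e.g. "none", "hammer_on", "pull_off", "bend", "slide"

def tokenize(event, time_step_ms=10):
    """Quantize the onset time and emit one token per note attribute."""
    return [
        f"time_{event.onset_ms // time_step_ms}",
        f"string_{event.string}",
        f"fret_{event.fret}",
        f"tech_{event.technique}",
    ]

# Example: a hammer-on at 1.25 s on the 3rd string, 7th fret.
print(tokenize(NoteEvent(onset_ms=1250, string=3, fret=7,
                         technique="hammer_on")))
# ['time_125', 'string_3', 'fret_7', 'tech_hammer_on']
```

Encoding each attribute as its own token keeps the vocabulary small (times, strings, frets, and techniques compose freely) at the cost of longer output sequences, which is one plausible trade-off behind a scheme like the one the abstract describes.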
