Publication Date

Spring 2019

Degree Type

Master's Project

Degree Name

Master of Science (MS)

Department

Computer Science

First Advisor

Robert Chun

Second Advisor

Thomas Austin

Third Advisor

Nishad Desai

Keywords

Speaker recognition, human computer interaction, biometrics, internet of things, mel frequency cepstral coefficients

Abstract

Speaker recognition is the technique of identifying the person talking to a machine from voice features and acoustics. It has applications in Human-Computer Interaction (HCI), biometrics, security, and the Internet of Things (IoT). As hardware grows more powerful and software smarter, devices are increasingly used to interact with humans and to perform complex computations. Speaker recognition matters here because it facilitates seamless communication between humans and computers. The field of security has also seen a rise in biometrics; multiple biometric techniques now co-exist, such as iris, fingerprint, voice, and facial recognition. Voice, apart from being natural to users, provides comparable and sometimes even higher levels of security than some traditional biometric approaches. Hence, it is a widely accepted biometric technique and is constantly being studied for further improvement. This study evaluates different pre-processing, feature extraction, and machine learning techniques on audio recorded in unconstrained, natural environments to determine which combination of these works well for speaker recognition and classification. The report presents several audio pre-processing methods, such as trimming, split and merge, noise reduction, and vocal enhancement, to improve recordings obtained from real-world situations. Additionally, a text-independent approach is used, which makes the model flexible across multiple languages. Mel Frequency Cepstral Coefficients (MFCC) are extracted for each recording, along with their differentials and accelerations, to evaluate machine learning classification techniques such as kNN, Support Vector Machines, and Random Forest classifiers.
Lastly, the approaches are evaluated against existing research to determine which techniques perform well on these sets of audio recordings.
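The MFCC-plus-differentials feature pipeline mentioned in the abstract can be sketched in plain NumPy. This is a minimal illustration, not the project's actual code: the frame length (25 ms), hop (10 ms), 26 mel filters, and 13 cepstral coefficients are common textbook defaults assumed here, and the differentials are approximated by a simple frame-to-frame difference.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular mel-spaced filterbank over the FFT bins."""
    hz_to_mel = lambda f: 2595 * np.log10(1 + f / 700)
    mel_to_hz = lambda m: 700 * (10 ** (m / 2595) - 1)
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):           # rising slope
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):           # falling slope
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc(signal, sr=16000, frame_len=400, hop=160,
         n_fft=512, n_filters=26, n_ceps=13):
    """MFCCs: pre-emphasis, framing, window, power spectrum,
    mel filterbank, log, then DCT-II keeping n_ceps coefficients."""
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    n_frames = 1 + (len(sig) - frame_len) // hop
    frames = np.stack([sig[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    fb = mel_filterbank(n_filters, n_fft, sr)
    log_energies = np.log(power @ fb.T + 1e-10)
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps),
                                    (2 * n + 1) / (2 * n_filters)))
    return log_energies @ basis.T

def deltas(feats):
    """First-order differentials (frame-to-frame difference),
    zero-padded so the shape matches the input."""
    return np.vstack([feats[1:] - feats[:-1],
                      np.zeros((1, feats.shape[1]))])

# One second of a 440 Hz tone stands in for a real utterance.
audio = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
ceps = mfcc(audio)                        # (frames, 13)
d, dd = deltas(ceps), deltas(deltas(ceps))
features = np.hstack([ceps, d, dd])       # 39-dim vectors per frame
```

Frame-level vectors like `features` would then be fed to a classifier (kNN, SVM, or Random Forest in this study); in practice a library such as librosa or python_speech_features replaces the hand-rolled extraction above.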
