Abhijith G.(College of Engineering,Trivandrum), Adarsh S.(College of Engineering,Trivandrum), Akshay Prasannan(College of Engineering,Trivandrum) and Rajeev Rajan(College of Engineering ,Trivandrum)
Abstract:
The fusion of i-vector with prosodic features is used to identify the most competent voice imitator through a deep neural network framework (DNN) in this paper. This experiment is conducted by analyzing the spectral and prosodic characteristics during voice imitation. Spectral features include mel frequency cepstral features (MFCC) and modified group delay features (MODGDF). Prosodic features, computed by the Legendre polynomial approximation, are used as complementary information to the i-vector model. Proposed system evaluates the competence of artists in voice mimicking and ranks them according to the scores from a classifier based on mean opinion score (MOS). If the artist with the highest MOS is identified as rank-1 by the proposed system, a hit occurs. DNN-based classifier makes the decision based on the probability value on the nodes at the output layer. The performance is evaluated using top X-hit criteria on a mimicry dataset. Top-2 hit rate of 81.81% is obtained for fusion experiment. The experiments demonstrate the potential of i-vector framework and its fusion in competency evaluation of voice mimicking.