I have been a Ph.D. student in the Department of Electrical Communication Engineering (ECE) at the Indian Institute of Science (IISc) since August 2013. I am a member of the Speech and Audio Group headed by Prof. T. V. Sreenivas.
Prior to joining IISc for the Ph.D., I worked with Ikanos Communications India Pvt. Ltd. from July 2011 to July 2013 as a firmware engineer on VDSL2 systems. I obtained an M.E. in Signal Processing from IISc in May 2011, and a B.Tech. degree in Electronics and Communication Engineering from JNTU, Hyderabad in May 2009.
My research interests are in the areas of speech and audio processing and machine learning. Specifically, I am interested in multi-channel speech processing, speech source localization, self-localization of microphone arrays, acoustic scene analysis, and compressed sensing and sparse signal processing applied to speech and audio.
Srikanth Raj Chetupalli
Ph.D Student,
Dept. of Electrical Communication Engineering,
Indian Institute of Science, Bangalore.
email: srajATeceDOTiiscDOTernetDOTin
Abstract: Dereverberation of a moving speech source in the presence of other directional interferers is a harder problem than dereverberation of a stationary source or interference cancellation alone. We explore a joint multi-channel linear prediction (MCLP) and relative transfer function (RTF) formulation in a stochastic framework with maximum likelihood estimation. We find that combining spatial filtering under a distortionless response constraint with a time-varying complex Gaussian model for the desired source signal at a reference microphone provides better signal estimation. For a stationary source, we consider batch estimation and obtain an iterative solution. Extending to a moving source, we formulate a linear time-varying dynamic system model for the MCLP coefficients and an RTF-based online adaptive spatial filter. For tracking a desired source in the presence of interfering sources, the same formulation is used by specifying the RTF. Simulated experiments show that the proposed scheme provides better spatial selectivity and dereverberation than traditional methods, for both stationary and moving sources, even in the presence of interfering sources.
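As a minimal illustration of the distortionless-response spatial filtering mentioned above (not the paper's full joint MCLP-RTF estimator), the following numpy sketch computes the standard MVDR-form filter for one frequency bin, assuming the RTF vector and a noise-plus-interference covariance estimate are already available; all variable names and the toy numbers are placeholders.

    import numpy as np

    def distortionless_filter(rtf, noise_cov):
        # MVDR-form filter w = R^{-1} d / (d^H R^{-1} d) for one frequency bin.
        # rtf       : (M,) complex relative transfer function w.r.t. the reference mic
        # noise_cov : (M, M) complex noise/interference covariance estimate
        Rinv_d = np.linalg.solve(noise_cov, rtf)
        return Rinv_d / (rtf.conj() @ Rinv_d)

    # Toy usage: 4 mics, desired source referenced to microphone 0 (RTF = e1).
    M = 4
    rtf = np.zeros(M, dtype=complex); rtf[0] = 1.0
    noise_cov = np.eye(M, dtype=complex) + 0.1 * np.ones((M, M))
    w = distortionless_filter(rtf, noise_cov)
    print(np.abs(w.conj() @ rtf))   # distortionless constraint: w^H d = 1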
Abstract: Multi-channel linear prediction (MCLP) can model the late reverberation in the short-time Fourier transform domain using a delayed linear predictor, with the prediction residual taken as the desired early-reflection component. Traditionally, a Gaussian source model with time-dependent precision (inverse of variance) is considered for the desired signal. In this paper, we propose a Student's t-distribution model for the desired signal, realized as a Gaussian source with Gamma-distributed precision. Further, since the choice of a proper MCLP order is critical, we also incorporate a Gaussian prior on the prediction coefficients and use a higher order. We consider a batch estimation scenario and develop a variational Bayes expectation maximization (VBEM) algorithm for joint posterior inference and hyper-parameter estimation. This leads to more accurate and robust estimation of the late reverberation component and hence its cancellation, benefiting the estimation of the desired residual signal. Along with these stochastic models, we formulate multi-input single-output (MISO) and multi-input multi-output (MIMO) schemes using shared priors for the desired signal precision and the MCLP coefficients estimated at each microphone. Experiments using real room impulse responses show improved late reverberation suppression with the proposed VBEM approach over traditional methods, for different room conditions. Additionally, we obtain a sparse MCLP coefficient vector, avoiding the criticality of manually choosing the model order. The MIMO formulation is easily extended to include spatial filtering of the enhanced signals, which further improves the estimation of the desired signal.
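The Gaussian-with-Gamma-precision construction of the Student's t prior used above can be checked numerically; the sketch below (not the paper's VBEM inference) draws samples from the scale mixture and compares them against a Student's t distribution, with the degrees of freedom chosen arbitrarily.

    import numpy as np
    from scipy import stats

    # If  tau ~ Gamma(nu/2, rate=nu/2)  and  x | tau ~ N(0, 1/tau),
    # then marginally  x  follows a Student's t distribution with nu d.o.f.
    rng = np.random.default_rng(0)
    nu = 4.0
    tau = rng.gamma(shape=nu / 2, scale=2.0 / nu, size=100_000)   # rate nu/2 -> scale 2/nu
    x = rng.normal(0.0, 1.0 / np.sqrt(tau))

    # Kolmogorov-Smirnov distance to the Student's t CDF should be small.
    print(stats.kstest(x, stats.t(df=nu).cdf).statistic)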
Abstract: Blind inverse filtering using multi-channel linear prediction (MCLP) in the short-time Fourier transform (STFT) domain is an effective means to enhance reverberant speech. Traditionally, a speech power spectral density (PSD) weighted prediction error (WPE) minimization approach is used to estimate the prediction filters, independently in each frequency bin. The method is sensitive to the estimate of the desired signal PSD. In this paper, we propose an auto-encoder (AE) deep neural network (DNN) based constraint for the estimation of the desired signal PSD. An auto-encoder trained on clean speech STFT coefficients is used as a prior that non-linearly maps the estimate onto natural speech PSDs. We explore two architectures for the auto-encoder: (i) a fully-connected (FC) feed-forward network, and (ii) a recurrent long short-term memory (LSTM) network. Experiments using real room impulse responses show that the LSTM-DNN based PSD estimate performs better than traditional methods for reverberant signal enhancement.
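For reference, a minimal numpy sketch of the conventional per-bin WPE iteration that this work builds on is given below; the AE/LSTM based PSD estimate of the paper would replace the |d|^2 step. The filter order, prediction delay, and diagonal loading are illustrative choices, not the paper's settings.

    import numpy as np

    def wpe_one_bin(X, order=10, delay=3, iters=3, eps=1e-8):
        # X: (M, T) complex STFT coefficients of one frequency bin; mic 0 is the reference.
        M, T = X.shape
        d = X[0].copy()                                    # initial desired-signal estimate
        for _ in range(iters):
            lam = np.maximum(np.abs(d) ** 2, eps)          # per-frame PSD estimate (|d|^2)
            R = np.zeros((M * order, M * order), dtype=complex)
            p = np.zeros(M * order, dtype=complex)
            for t in range(delay + order, T):
                # stacked delayed frames t-delay, ..., t-delay-order+1 from all microphones
                xt = X[:, t - delay - order + 1:t - delay + 1][:, ::-1].reshape(-1)
                R += np.outer(xt, xt.conj()) / lam[t]
                p += xt * np.conj(X[0, t]) / lam[t]
            g = np.linalg.solve(R + eps * np.eye(M * order), p)
            for t in range(delay + order, T):
                xt = X[:, t - delay - order + 1:t - delay + 1][:, ::-1].reshape(-1)
                d[t] = X[0, t] - np.conj(g) @ xt           # subtract the predicted late reverb
        return d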
Abstract: Segmentation/diarization of audio recordings using a network of ad-hoc mobile arrays and the spatial information they gather is a part of acoustic scene analysis. Because the mobile devices are deployed ad hoc, synchronous recording is assumed only within each array node, with only gross feature-level synchrony across the different nodes of the network. We compute spatial features at each node in a distributed manner, without the overhead of aggregating signal data between mobile devices. The spatial features are then modeled jointly using a Dirichlet mixture model, and the posterior probabilities of the mixture components are used to derive the segmentation information. Experiments on real-life recordings in a reverberant room using a network of randomly placed mobile phones show a diarization error rate of less than 14%, even with overlapped talkers.
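The abstract does not fix the spatial feature computed at each node; purely as one hedged example, a per-frame time difference of arrival obtained by GCC-PHAT between two channels of a node could serve as such a feature, as sketched below.

    import numpy as np

    def gcc_phat_tdoa(x1, x2, fs, max_tau=None):
        # Frame-level TDOA between two channels of one node via GCC-PHAT
        # (given only as an example spatial feature; the actual feature may differ).
        n = len(x1) + len(x2)
        X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
        cross = X1 * np.conj(X2)
        cc = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n)
        max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
        cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
        return (np.argmax(np.abs(cc)) - max_shift) / fs    # delay in seconds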
Abstract: Segment clustering is a crucial step in unsupervised speaker diarization. Bottom-up approaches, such as hierarchical agglomerative clustering, are traditionally used for segment clustering. In this paper, we consider a top-down approach, in which a speaker-sensitive, low-dimensional representation of the segments (a speaker space) is obtained first, followed by Gaussian mixture model (GMM) based clustering. We explore three methods of obtaining the low-dimensional segment representation: (i) multi-dimensional scaling (MDS) based on segment-to-segment stochastic distances; (ii) traditional principal component analysis (PCA) of GMM mean super-vectors; and (iii) factor analysis (i-vectors) of GMM mean super-vectors. We find that MDS-based embeddings give a better representation and hence better diarization performance than PCA and even i-vector embeddings.
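A minimal sketch of the MDS-plus-GMM route described above, assuming the pairwise segment-to-segment stochastic distances (e.g., a symmetrized divergence between segment models) have already been computed; the toy distance matrix below is only for illustration.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def classical_mds(D, dim=3):
        # Classical MDS: embed n segments from an n x n pairwise distance matrix D.
        n = D.shape[0]
        J = np.eye(n) - np.ones((n, n)) / n
        B = -0.5 * J @ (D ** 2) @ J                    # double-centred squared distances
        w, V = np.linalg.eigh(B)
        idx = np.argsort(w)[::-1][:dim]                # keep the top-dim eigen-directions
        return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

    # Toy example standing in for segment-to-segment stochastic distances.
    rng = np.random.default_rng(1)
    pts = np.vstack([rng.normal(m, 0.2, size=(20, 3)) for m in (0.0, 2.0, 4.0)])
    D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)

    emb = classical_mds(D, dim=3)
    labels = GaussianMixture(n_components=3, random_state=0).fit_predict(emb)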
Abstract: The spatial cross-coherence function between two locations in a diffuse sound field is a function of the distance between them. Earlier approaches to microphone geometry calibration utilizing this property assumed the presence of ambient noise sources. In this paper, we consider geometry estimation using a single acoustic source (not noise) and show that late reverberation (diffuse signal) estimation using multi-channel linear prediction (MCLP) provides a computationally efficient solution. The idea is that the component of a reverberant signal corresponding to late reflections satisfies the diffuse sound field properties, which we exploit for distance estimation between microphone pairs. MCLP of short-time Fourier transform (STFT) coefficients is used to decompose each microphone signal into early and late reflection components. The cross-coherence computed between the separated late reflection components is then used for pair-wise microphone distance estimation. Multidimensional scaling (MDS) is then used to estimate the geometry of the microphones from the pair-wise distance estimates. We show that higher reverberation, though detrimental to signal estimation, can aid microphone geometry estimation. An estimated position error of less than 2 cm is achieved using the proposed approach on real microphone recordings.
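A minimal sketch of the pair-wise distance step, assuming the omnidirectional diffuse-field coherence model gamma(f) = sinc(2*pi*f*d/c); estimating the coherence from the MCLP late-reverberation components and the final MDS step are not shown, and the grid search below is only one possible fitting method.

    import numpy as np

    def distance_from_coherence(coh, freqs, c=343.0):
        # Fit the mic spacing d to the measured diffuse-field coherence using the model
        # gamma(f) = sinc(2*pi*f*d/c), written with numpy's normalized sinc convention.
        d_grid = np.arange(0.01, 1.0, 0.001)
        errs = [np.mean((coh - np.sinc(2.0 * freqs * d / c)) ** 2) for d in d_grid]
        return d_grid[int(np.argmin(errs))]

    # Toy check: synthesize the model coherence for d = 0.25 m and recover it.
    freqs = np.linspace(100.0, 4000.0, 200)
    print(distance_from_coherence(np.sinc(2.0 * freqs * 0.25 / 343.0), freqs))   # ~0.25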
Abstract: Passive sound source localization (SSL) using time-difference-of-arrival (TDOA) measurements is a non-linear inversion problem. In this paper, a data-driven approach to SSL using TDOA measurements is considered. A neural network (NN) is viewed as an architecture-constrained non-linear function, with its parameters learnt from training data. We consider a three-layer neural network with TDOA measurements between pairs of microphones as input features and the source location in Cartesian coordinates as output. Experimentally, we show that an NN trained even on noise-less TDOA measurements can achieve good performance on noisy TDOA inputs, better than the traditional spherical interpolation (SI) method. We also show that the NN trained offline on simulated TDOA measurements performs better than the SI method on real-life speech signals in a simulated enclosure.
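A hedged sketch of the data-driven idea: train a small multi-layer network on noise-less TDOAs simulated for a known microphone geometry, then test it on noisy TDOAs. The geometry, room size, and layer sizes below are placeholders, not the configuration used in the paper.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    c = 343.0
    mics = np.array([[0, 0, 0], [3, 0, 0], [0, 3, 0], [3, 3, 0], [1.5, 1.5, 2.5]], float)

    def tdoas(src):
        d = np.linalg.norm(mics - src, axis=1)
        return (d[1:] - d[0]) / c          # TDOAs of each microphone w.r.t. mic 0

    # Noise-less training data: random source positions inside a 3 m x 3 m x 2.5 m room.
    S = rng.uniform([0, 0, 0], [3, 3, 2.5], size=(20000, 3))
    T = np.array([tdoas(s) for s in S])

    net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0).fit(T, S)

    # Test on noisy TDOA measurements.
    s_true = np.array([1.0, 2.0, 1.2])
    t_noisy = tdoas(s_true) + rng.normal(0, 1e-5, size=4)
    print(net.predict(t_noisy[None])[0], "vs", s_true)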
Abstract: Speaker identification implemented on a mobile robot is a challenging problem because of the varying reverberant environments the robot encounters while in motion. The performance of a typical speaker identification system degrades significantly in reverberant environments, mainly because conventional features are not robust to changes in the reverberation conditions. In this paper, we present a non-linear filter based mel frequency cepstral coefficient (MFCC) feature extraction, which is more robust to changes in reverberation conditions. The feature extraction is a two-stage operation applied to the spectrogram of the speech signal: the first stage suppresses the frequency spread due to reverberation within each frame, and the second stage suppresses the reverberation effect across frames. Performance is evaluated using a GMM-UBM based identifier built and tested with conventional MFCC feature vectors and with the non-linear filter based MFCC feature vectors. We show that the identification accuracy of the GMM-UBM identifier with the non-linear filter based MFCC features is better than with conventional MFCC features.
Abstract: Reducing the artifacts of time-scale or pitch-scale modified speech is a classical problem, which we address using time-varying signal models. We use the AM-FM decomposition, i.e., the instantaneous amplitude (IA) and instantaneous phase (IP) of narrow-band signals, to compose the overall speech signal. To suit perceptual aspects of the scale-modified signal, we choose the AM-FM decomposition of mel-scale sub-band filtered speech. The required scale modification is then applied to the multi-band IA and IP signals individually. Time-scaling experiments show that the method preserves the spectral content, pitch, and temporal structure. Listening test results for speech and music (solo and polyphonic) signals are compared with the "phase vocoder with identity phase locking" and "harmonic-percussive separation (HP)" based time-scaling methods. The mel-sub-band AM-FM approach shows a significant improvement in reducing reverberation-like perception (also referred to as "phasiness") and in preserving the localized nature of transients, in both speech and music signals. We also show the effectiveness of the technique for non-uniform time-scale and pitch-scale modification.
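A minimal sketch of time-scaling a single (already band-pass filtered) sub-band through its AM-FM decomposition; the mel filterbank analysis and the recombination across bands used in the paper are not shown, and the interpolation scheme here is only one possible choice.

    import numpy as np
    from scipy.signal import hilbert

    def timescale_subband(x, alpha):
        # Time-scale one band-passed sub-band by a factor alpha using its AM-FM
        # decomposition: stretch the IA and the instantaneous frequency in time,
        # then re-integrate the phase so the pitch of the band is preserved.
        z = hilbert(x)
        ia = np.abs(z)                                   # instantaneous amplitude
        ip = np.unwrap(np.angle(z))                      # instantaneous phase
        inst_freq = np.diff(ip)                          # radians per sample
        n_old = np.arange(len(x))
        n_new = np.linspace(0.0, len(x) - 1.0, int(round(alpha * len(x))))
        ia_new = np.interp(n_new, n_old, ia)
        if_new = np.interp(n_new[:-1], n_old[:-1], inst_freq)
        ip_new = ip[0] + np.concatenate(([0.0], np.cumsum(if_new)))
        return ia_new * np.cos(ip_new)

    # The full scheme applies this per mel-spaced sub-band and sums the band outputs.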
Abstract: We consider the joint estimation of time-varying linear prediction (TVLP) filter coefficients and excitation signal parameters for the analysis of long speech segments. Traditional approaches to TVLP estimation only assume a linear expansion of the coefficients in a set of known basis functions. However, the excitation signal is also time-varying, which affects the estimation of the TVLP filter parameters. In this paper, we propose a Bayesian approach that incorporates the nature of the excitation signal and also adapts the regularization of the filter parameters. Since the order of the system is not known a priori, we place a Gaussian prior on the filter parameters, and the excitation signal is modeled as Gaussian with time-varying, Gamma-distributed precision. We develop an iterative algorithm for maximum-likelihood (ML) estimation of the posterior distribution of the filter parameters and the time-varying precision of the excitation signal, along with the parameters of the prior distribution. We show that the proposed method adapts to different types of excitation signals in speech, and to time-varying systems of unknown model order. The spectral modeling performance for synthetic speech-like signals, quantified using the absolute spectral difference (SPDIFF), shows that the proposed method estimates the system function more accurately than several traditional methods.
Abstract: We explore, experimentally, feature selection and optimization of stochastic model parameters for the problem of speaker spotting. Based on an initially identified segment of speech from a speaker, an iterative model refinement method is developed along with a latent variable mixture model, so that segments of the same speaker are identified in a long speech record. We find that a GMM with a moderate number of mixtures is better suited to the task than the large mixture models used in speaker identification. Similarly, a PCA based low-dimensional projection of the MFCC feature vector provides better performance. We show that about 6 seconds of initially identified speaker data is sufficient to achieve > 90% speaker segment identification performance.
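A hedged sketch of the scoring step, assuming MFCC features are already available: PCA projection, a moderate-order GMM trained on the seed segment, and segment-level log-likelihood scoring. The decision threshold and all parameter values are placeholders, and the paper's iterative model refinement is not shown.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.mixture import GaussianMixture

    def spot_speaker(feats, seed, n_components=8, pca_dim=10, seg_len=100, thresh=None):
        # feats : (n_frames, n_mfcc) MFCC matrix of the long recording (computed elsewhere)
        # seed  : frame indices of the initially identified ~6 s speaker segment
        pca = PCA(n_components=pca_dim).fit(feats)
        low = pca.transform(feats)
        gmm = GaussianMixture(n_components=n_components, covariance_type='diag',
                              random_state=0).fit(low[seed])
        # mean log-likelihood of each non-overlapping segment under the speaker model
        scores = np.array([gmm.score(low[s:s + seg_len])
                           for s in range(0, len(low) - seg_len, seg_len)])
        if thresh is None:
            thresh = np.median(scores)                  # placeholder decision rule
        return scores > thresh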
Abstract: Estimating linear prediction coefficients under a sparsity constraint on the prediction residual is a modification of the traditional minimum mean square error linear predictor formulation, which accounts for the impulsive nature of the residual signal for voiced speech. This is usually solved using 1-norm minimization. In this paper, we develop a successive approximation algorithm for estimating the linear predictor coefficients and the sparse residual signal. We illustrate the usefulness of the proposed approach using synthetic as well as real speech examples. Experimental results in a multi-pulse based analysis-synthesis framework show that the proposed approach provides better perceptual quality speech reconstruction than the orthogonal matching pursuit based algorithm, with much lower computational time than convex optimization based techniques.
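For illustration only, 1-norm linear prediction can be approximated by iteratively re-weighted least squares; the sketch below is this generic stand-in, not the successive approximation algorithm proposed in the paper.

    import numpy as np

    def sparse_lp_irls(x, order=16, iters=10, eps=1e-6):
        # Approximate 1-norm linear prediction: minimize sum |x[n] - sum_k a_k x[n-k]|
        # by iteratively re-weighted least squares.
        n = len(x)
        X = np.column_stack([x[order - k - 1:n - k - 1] for k in range(order)])
        y = x[order:]
        w = np.ones(len(y))
        for _ in range(iters):
            Xw = X * w[:, None]
            a = np.linalg.solve(X.T @ Xw + eps * np.eye(order), Xw.T @ y)
            resid = y - X @ a
            w = 1.0 / np.maximum(np.abs(resid), eps)   # 1-norm re-weighting
        return a, resid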
Abstract: Time-varying linear prediction has been studied in the context of speech signals, in which the auto-regressive (AR) coefficients of the system function are modeled as a linear combination of a set of known basis functions. Traditionally, least squares minimization is used to estimate the model parameters. Motivated by the sparse nature of the excitation signal for voiced sounds, we explore time-varying linear prediction modeling of speech signals under sparsity constraints. Parameter estimation is posed as a 0-norm minimization problem, and the re-weighted 1-norm minimization technique is used to estimate the model parameters. We show that, for sparsely excited time-varying systems, this formulation models the underlying system function better than least squares error minimization. Evaluation with synthetic and real speech examples shows that the estimated model parameters track the formant trajectories more closely than the least squares approach.
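A minimal sketch of the traditional least squares formulation referred to above, assuming a simple polynomial basis for the coefficient trajectories; the sparsity-constrained (re-weighted 1-norm) estimation of the paper would replace the least squares solve.

    import numpy as np

    def tvlp_ls(x, order=10, n_basis=4):
        # Least-squares TVLP: a_k[n] = sum_m c_{k,m} b_m[n] with the polynomial
        # basis b_m[n] = (n/N)^m; prediction x[n] ~ sum_k a_k[n] x[n-k].
        N = len(x)
        n_idx = np.arange(order, N)
        B = np.vander(n_idx / N, n_basis, increasing=True)       # (N-order, n_basis)
        past = np.column_stack([x[order - k:N - k] for k in range(1, order + 1)])
        Phi = (past[:, :, None] * B[:, None, :]).reshape(len(n_idx), order * n_basis)
        c = np.linalg.lstsq(Phi, x[n_idx], rcond=None)[0].reshape(order, n_basis)
        a_n = B @ c.T                                             # (N-order, order): a_k[n]
        return c, a_n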
Abstract: A joint analysis-synthesis framework is developed for the compressive sensing recovery of speech signals. The signal is assumed to be sparse in the residual domain, with the linear prediction filter used as the sparsifying transform. Importantly, this transform is not known a priori, since estimating the predictor filter requires knowledge of the signal itself. Two prediction filters, a comb filter for the pitch and an all-pole formant filter, are needed to induce maximum sparsity. An iterative method is proposed for estimating the prediction filters and the signal. The formant prediction filter is used as the synthesis transform, while the pitch filter is used to model the periodicity in the residual excitation signal.
Abstract: Compressive sensing (CS) signal recovery has been formulated for signals sparse in a known linear transform domain. We consider the scenario in which the transform is unknown and the goal is to estimate both the transform and the sparse signal from just the CS measurements. Specifically, we model the speech signal as the output of a time-varying AR process, as in the linear system model of speech production, with a sparse excitation. We propose an iterative algorithm to estimate both the system impulse response and the excitation signal from the CS measurements. We show that the proposed algorithm, in conjunction with a modified iterative hard thresholding, estimates the signal-adaptive transform accurately, leading to much higher quality signal reconstruction than the codebook based matching pursuit approach. The estimated time-varying transform performs better than a 256-entry codebook estimated from the original speech. Thus, we obtain near “toll quality” speech reconstruction from sub-Nyquist rate CS measurements.
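Only the basic iterative hard thresholding building block is sketched below, for a known matrix A that folds together the measurement operator and the sparsifying transform; the paper's modification and its alternation with re-estimation of the time-varying AR transform are not shown.

    import numpy as np

    def iht(y, A, k, iters=200, mu=None):
        # Plain iterative hard thresholding: recover a k-sparse vector s from y = A s.
        if mu is None:
            mu = 1.0 / np.linalg.norm(A, 2) ** 2       # step size from the spectral norm
        s = np.zeros(A.shape[1])
        for _ in range(iters):
            s = s + mu * A.T @ (y - A @ s)             # gradient step on the data fit
            idx = np.argsort(np.abs(s))[:-k]           # keep only the k largest magnitudes
            s[idx] = 0.0
        return s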
Abstract: Compressive sensing (CS) is a sensing paradigm which permits sampling of a signal at its intrinsic information rate, which can be much lower than the Nyquist rate, while guaranteeing good quality reconstruction for signals sparse in a linear transform domain. We explore the application of the CS formulation to music signals. Since music signals have both tonal and transient components, we examine several transforms, such as the discrete cosine transform (DCT), discrete wavelet transform (DWT), Fourier basis, and also non-orthogonal warped transforms, to explore the effectiveness of CS theory and the reconstruction algorithms. We show that, for a given sparsity level, the DCT, overcomplete, and warped Fourier dictionaries result in better reconstruction, with the warped Fourier dictionary giving perceptually better reconstruction. “MUSHRA” test results show that a moderate quality reconstruction is possible at about half the Nyquist sampling rate.
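A toy one-dimensional illustration of CS recovery with a DCT dictionary and orthogonal matching pursuit; the dimensions and sparsity level are arbitrary, and this is not the music-signal evaluation setup of the paper.

    import numpy as np
    from scipy.fft import idct
    from sklearn.linear_model import OrthogonalMatchingPursuit

    rng = np.random.default_rng(0)
    N, M, K = 256, 128, 10                        # signal length, measurements, sparsity

    # Signal sparse in the DCT domain: x = Psi c with K nonzero DCT coefficients.
    Psi = idct(np.eye(N), norm='ortho', axis=0)   # inverse-DCT synthesis dictionary
    c = np.zeros(N); c[rng.choice(N, K, replace=False)] = rng.normal(size=K)
    x = Psi @ c

    # Random Gaussian measurement matrix and CS measurements y = Phi x.
    Phi = rng.normal(size=(M, N)) / np.sqrt(M)
    y = Phi @ x

    # Recover the sparse coefficients from y with OMP, then synthesize the signal.
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=K, fit_intercept=False).fit(Phi @ Psi, y)
    x_hat = Psi @ omp.coef_
    print("relative error:", np.linalg.norm(x - x_hat) / np.linalg.norm(x))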