LATENT VARIABLE APPROACH TO DIARIZATION OF AUDIO RECORDINGS USING RANDOMLY PLACED MOBILE DEVICES

Srikanth Raj Chetupalli, Anirban Bhowmick and Thippur V. Sreenivas

The Problem



Experimental setup

Consider a meeting scenario with several participants carrying mobile devices, each with one or more microphones. Mobile devices placed on a table can be connected to form an ad-hoc network of microphones. We address the diarization task, i.e., "Who spoke when?", in meeting recordings that comprise multiple sources and are recorded by such an ad-hoc microphone array network.

Such a setup is characterized by (i) random placement and orientation of mobile devices, (ii) asynchronous recording across the mobile devices, (iii) synchronous two-channel recording at each mobile device, and (iv) variability across the devices with respect to the number, type, and arrangement of the microphones on the mobile device.

The Approach

"Compute spatial features at each mobile device separately, coarsely align the feature streams across the devices, and jointly model them in a stochastic formulation."

Spatial features / Directional statistics

The spatial response is computed assuming that the two microphones on the mobile device lie along a straight line with a known spacing. The steered response power with phase transform (SRP-PHAT) approach is used to compute the spatial response at a set of discrete angles; the response is then smoothed across time using recursive averaging and normalized to sum to unity. We interpret this "directional statistics" feature as a probability mass function (PMF). Directional statistics are computed independently for each mobile device.
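The feature computation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the microphone spacing, the angle grid, the frame and hop lengths, the smoothing constant alpha, and the clipping of negative responses before normalization are all choices made here for the example.

```python
import numpy as np

def directional_statistics(x, fs=16000, spacing=0.1, n_angles=19,
                           frame_len=1024, hop=512, alpha=0.8, c=343.0):
    """x: two-channel signal of shape (2, n_samples).

    Returns the angle grid (radians) and an (n_frames, n_angles) array
    whose rows are non-negative and sum to one (one PMF per frame).
    All default parameter values are assumptions for illustration.
    """
    angles = np.linspace(0.0, np.pi, n_angles)
    # Inter-microphone delay (seconds) for a far-field source at each angle.
    delays = spacing * np.cos(angles) / c
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
    # Steering matrix of shape (n_angles, n_freqs).
    steer = np.exp(-2j * np.pi * np.outer(delays, freqs))
    window = np.hanning(frame_len)
    n_frames = 1 + (x.shape[1] - frame_len) // hop
    pmf = np.empty((n_frames, n_angles))
    smoothed = np.zeros(n_angles)
    for t in range(n_frames):
        seg = x[:, t * hop: t * hop + frame_len] * window
        spec = np.fft.rfft(seg, axis=1)
        cross = spec[0] * np.conj(spec[1])
        cross /= np.abs(cross) + 1e-12          # PHAT weighting
        response = np.real(steer @ cross)       # steered response per angle
        response = np.maximum(response, 0.0)    # assumption: clip negatives
        # Recursive averaging across time.
        smoothed = alpha * smoothed + (1.0 - alpha) * response
        pmf[t] = (smoothed + 1e-12) / np.sum(smoothed + 1e-12)
    return angles, pmf
```

With this sign convention, an angle of 0 corresponds to sound reaching the first channel before the second. Each device would run this independently on its own two-channel recording.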

Coarse synchronization

The signals across the devices can be aligned coarsely using specific acoustic events, such as a clap or a tap on the table, or using network time.
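One simple way to realize event-based coarse alignment is sketched below. It assumes that the clap is the highest-energy frame in each recording and that feature streams use the same frame hop on every device; the helper names clap_frame and coarse_align are hypothetical and not taken from the original work.

```python
import numpy as np

def clap_frame(signal, frame_len=1024, hop=512):
    """Index of the frame with the largest short-time energy.

    Assumption: the alignment event (e.g., a clap) is the loudest
    event in the recording.
    """
    n_frames = 1 + (len(signal) - frame_len) // hop
    energy = np.array([np.sum(signal[t * hop: t * hop + frame_len] ** 2)
                       for t in range(n_frames)])
    return int(np.argmax(energy))

def coarse_align(features, clap_frames):
    """Trim per-device feature streams so that the clap frames coincide.

    features: list of arrays of shape (n_frames_i, dim), one per device.
    clap_frames: list of clap frame indices, one per device.
    Returns a list of arrays with a common number of frames.
    """
    start = min(clap_frames)
    # Drop leading frames so the clap lands at index `start` everywhere.
    trimmed = [f[c - start:] for f, c in zip(features, clap_frames)]
    n = min(len(f) for f in trimmed)
    return [f[:n] for f in trimmed]
```

The alignment is only accurate to within about one frame hop, which matches the "coarse" requirement; residual offsets are left to the joint model.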

Stochastic modeling

The directional statistics feature at a mobile device during a short time frame is modeled as a sample drawn from a Dirichlet distribution with source-specific parameters. The features across all the mobile devices are jointly modeled using a latent variable formulation, in which the latent variable selects the hidden (active) source. The expectation-maximization (EM) algorithm is used for parameter estimation under the maximum likelihood criterion and for posterior inference. Diarization information is obtained from the posterior probability of each source after the EM algorithm converges.
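A minimal sketch of such a latent variable model is given below, assuming that, given the active source, the features at different devices are independent Dirichlet draws. Two simplifications are made here and are not taken from the original work: the M-step uses weighted moment matching as a stand-in for the exact maximum-likelihood Dirichlet update (which has no closed form), and the posteriors are initialized with a farthest-point clustering. The function names are hypothetical.

```python
import math
import numpy as np

_lgamma = np.vectorize(math.lgamma)

def dirichlet_logpdf(x, alpha):
    """Log density of Dirichlet(alpha) at each row of x, shape (T, K)."""
    norm = math.lgamma(alpha.sum()) - _lgamma(alpha).sum()
    return norm + np.log(x) @ (alpha - 1.0)

def fit_dirichlet_mixture(x, n_sources, n_iter=50, seed=0):
    """x: (T, M, K) PMF features for T frames, M devices, K angles.

    Returns source priors (S,), Dirichlet parameters (S, M, K) and
    per-frame source posteriors (T, S).
    """
    T, M, K = x.shape
    x = np.clip(x, 1e-6, None)
    x = x / x.sum(axis=2, keepdims=True)
    rng = np.random.default_rng(seed)
    # Assumption: farthest-point initialization with hard assignments.
    flat = x.reshape(T, -1)
    centers = [flat[rng.integers(T)]]
    for _ in range(1, n_sources):
        dist = np.min([np.sum((flat - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(flat[np.argmax(dist)])
    dist = np.stack([np.sum((flat - c) ** 2, axis=1) for c in centers], axis=1)
    post = np.eye(n_sources)[np.argmin(dist, axis=1)]
    for _ in range(n_iter):
        # M-step (approximate): weighted moment matching per source and device.
        weights = post.sum(axis=0) + 1e-12
        prior = weights / weights.sum()
        alpha = np.empty((n_sources, M, K))
        for s in range(n_sources):
            w = post[:, s, None, None] / weights[s]
            mean = (w * x).sum(axis=0)
            var = (w * (x - mean) ** 2).sum(axis=0) + 1e-12
            conc = np.mean(mean * (1.0 - mean) / var, axis=1) - 1.0
            alpha[s] = np.maximum(conc, 1e-2)[:, None] * mean
        # E-step: posterior of the latent source for every frame.
        log_post = np.tile(np.log(prior), (T, 1))
        for s in range(n_sources):
            for m in range(M):
                log_post[:, s] += dirichlet_logpdf(x[:, m], alpha[s, m])
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
    return prior, alpha, post
```

A frame-level diarization decision can then be read off as post.argmax(axis=1), with the posterior itself serving as a soft speaker activity track over time.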

Real data experiments

Dataset

Experimental setup

Real-life meeting recordings are used for the evaluation. Three mobile phones are placed randomly on a table inside a reverberant enclosure. The reverberation time of the enclosure is approximately \(650\) ms. Three speakers are seated around the table. The signals are recorded at a \(48\) kHz sampling rate and down-sampled to \(16\) kHz prior to processing. The duration of the recordings varies between \(5\) and \(10\) minutes. The recordings are manually annotated at the speaker level for the purpose of evaluation (contact the authors to access the original recordings and the annotations).

Illustration and Results

Spectrogram and directional statistics for a mobile phone, and the estimated posterior speaker probability as a function of time.

Recording 1

Recording 2

Recording 3

Recording 4

Recording 5


DER performance


Meeting ID    R1    R2    R3    R4    R5   Avg.
Proposed    13.1  12.5  20.9  14.0   6.5  13.4
Oracle      11.3  10.8  20.5  13.7   5.6  12.4

Oracle performance is obtained from the ground-truth labels by assigning the speaker label of the previous segment to segments in which multiple speakers overlap. The diarization error rate (DER, in %) of the proposed method is within 2% (absolute) of the oracle DER for all five recordings, and the average DER is less than 14%.
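The oracle labeling rule can be illustrated with a short sketch. It assumes a frame-level representation in which each frame carries the set of active speakers from the annotation; this representation, the treatment of silent frames, and the function name are assumptions made for the example.

```python
def oracle_labels(active):
    """Single-speaker oracle labels from frame-level ground truth.

    active: list of sets, the annotated active speakers in each frame.
    Frames with one speaker keep that speaker; overlapped frames take
    the label of the most recent single-speaker frame; silent frames
    (an assumption here) are labeled None.
    """
    labels, previous = [], None
    for speakers in active:
        if len(speakers) == 1:
            previous = next(iter(speakers))
            labels.append(previous)
        elif len(speakers) > 1:
            labels.append(previous)   # overlap: carry the previous label
        else:
            labels.append(None)
    return labels
```

For example, the frame sequence A, A+B, B, silence, B+C would map to A, A, B, None, B under this rule.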