Consider a meeting scenario in which several participants carry mobile devices, each with one or more microphones. Mobile devices placed on a table can be connected to form an ad-hoc network of microphones. We address the diarization task, i.e., "Who spoke when?", for meeting recordings comprising multiple sources and signals recorded by such an ad-hoc microphone array network.
Such a setup is characterized by (i) random placement and orientation of mobile devices, (ii) asynchronous recording across the mobile devices, (iii) synchronous two-channel recording at each mobile device, and (iv) variability across the devices with respect to the number, type, and arrangement of the microphones on the mobile device.
The spatial response is computed assuming that the two microphones on a mobile device lie along a straight line with a known spacing. The steered response power with phase transform (SRP-PHAT) approach is used to compute the spatial response at a set of discrete angles; the response is then smoothed across time using recursive averaging and normalized to sum to unity. We interpret this "directional statistics" feature as a probability mass function (PMF). Directional statistics are computed independently for each mobile device.
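The feature extraction described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `directional_statistics`, the frame length, hop size, angle grid, smoothing factor `alpha`, microphone spacing, and speed of sound are all assumed values, and clipping negative responses to zero before normalization is one possible way to obtain a valid PMF.

```python
import numpy as np


def directional_statistics(x1, x2, fs=16000, spacing=0.1, n_angles=19,
                           frame_len=512, hop=256, alpha=0.8, c=343.0):
    """Per-frame PMF over discrete angles for one two-microphone device.

    All default parameter values are illustrative assumptions.
    Angles are measured from the axis pointing from microphone 2 to
    microphone 1, so angles below 90 degrees mean x2 lags x1.
    """
    angles = np.linspace(0.0, np.pi, n_angles)       # candidate directions (rad)
    delays = spacing * np.cos(angles) / c            # far-field delay per angle (s)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)   # bin frequencies (Hz)
    # One row of compensating phase terms per candidate angle.
    steer = np.exp(-2j * np.pi * np.outer(delays, freqs))
    window = np.hanning(frame_len)

    n_frames = 1 + (len(x1) - frame_len) // hop
    pmf = np.zeros((n_frames, n_angles))
    smoothed = np.zeros(n_angles)
    for t in range(n_frames):
        seg = slice(t * hop, t * hop + frame_len)
        X1 = np.fft.rfft(window * x1[seg])
        X2 = np.fft.rfft(window * x2[seg])
        cross = X1 * np.conj(X2)
        cross = cross / (np.abs(cross) + 1e-12)      # PHAT weighting
        response = np.real(steer @ cross)            # steered response per angle
        response = np.maximum(response, 0.0)         # assumed: drop negative values
        # Recursive averaging across time.
        smoothed = alpha * smoothed + (1.0 - alpha) * response
        # Normalize to sum to unity (small floor avoids division by zero).
        pmf[t] = (smoothed + 1e-12) / np.sum(smoothed + 1e-12)
    return angles, pmf
```

Calling `directional_statistics(left, right)` on the two channels of one device returns the angle grid and a `(frames, angles)` array whose rows each sum to one. Each device would be processed separately, in line with the independent per-device computation described above.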
The signals across the devices can be aligned coarsely using specific acoustic events such as a clap, a tap on the table, or using network time
The directional statistics feature at a mobile device during a short time frame is modeled as a sample drawn from a Dirichlet distribution with source-specific parameters. The features across all the mobile devices are jointly modeled using a latent variable formulation, with the latent variables selecting the hidden source. Expectation-maximization (EM) is used for parameter estimation under the maximum likelihood criterion and for posterior inference. Diarization information is obtained from the posterior probability of the source after convergence of the EM algorithm.
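The latent-variable model above can be illustrated with a small sketch. It rests on several assumptions that are not stated in the text: the devices are treated as conditionally independent given the active source, the Dirichlet parameters are updated by weighted moment matching (a simple stand-in for the maximum-likelihood update, which has no closed form), and the posteriors are initialized by farthest-point seeding. The names `dirichlet_logpdf`, `moment_match`, and `em_diarize` are hypothetical.

```python
import math
import numpy as np

_lgamma = np.vectorize(math.lgamma)  # element-wise log-gamma without SciPy


def dirichlet_logpdf(p, alpha):
    """Log density of Dir(alpha) at each row of p; p is (T, J), alpha is (J,)."""
    log_norm = math.lgamma(alpha.sum()) - _lgamma(alpha).sum()
    return log_norm + np.log(p) @ (alpha - 1.0)


def moment_match(p, w):
    """Weighted method-of-moments Dirichlet fit (stand-in for the ML update)."""
    w = w / (w.sum() + 1e-12)
    mean = w @ p
    var = w @ (p - mean) ** 2
    precision = np.mean(mean * (1.0 - mean) / (var + 1e-12)) - 1.0
    return np.clip(precision, 1e-2, 1e4) * mean + 1e-6


def em_diarize(features, n_sources, n_iter=30):
    """features: (T, M, J) array of PMFs for T frames, M devices, J angles."""
    T, M, J = features.shape
    p = np.clip(features, 1e-6, None)            # avoid log(0)
    p = p / p.sum(axis=2, keepdims=True)

    # Assumed initialization: farthest-point seeding, then hard assignment.
    flat = p.reshape(T, -1)
    centers = [flat[0]]
    for _ in range(n_sources - 1):
        d = np.min([np.sum((flat - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(flat[np.argmax(d)])
    d = np.stack([np.sum((flat - c) ** 2, axis=1) for c in centers], axis=1)
    post = np.eye(n_sources)[np.argmin(d, axis=1)]

    for _ in range(n_iter):
        # M-step: source priors and per-source, per-device Dirichlet parameters.
        prior = post.mean(axis=0)
        alpha = np.array([[moment_match(p[:, m], post[:, k]) for m in range(M)]
                          for k in range(n_sources)])      # shape (K, M, J)
        # E-step: posterior of the hidden source, summing log-likelihoods
        # over devices (assumed conditionally independent given the source).
        log_post = np.tile(np.log(prior + 1e-12), (T, 1))
        for k in range(n_sources):
            for m in range(M):
                log_post[:, k] += dirichlet_logpdf(p[:, m], alpha[k, m])
        log_post -= log_post.max(axis=1, keepdims=True)    # numerical stability
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
    return post, alpha, prior
```

Taking `post.argmax(axis=1)` gives a per-frame source label, which is one simple way to read off "who spoke when" from the posterior. A fixed iteration count is used here for brevity; monitoring the log-likelihood for convergence would be the more careful choice.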
Real-life meeting recordings are used for the evaluation. Three mobile phones are placed randomly on a table inside a reverberant enclosure. The reverberation time of the enclosure is approximately \(650~ms\). Three speakers are seated around the table. The signals are recorded at a \(48~kHz\) sampling rate and down-sampled to \(16~kHz\) prior to processing. The duration of the recordings varies between 5 and 10 minutes. The recordings are manually annotated at the speaker level for the purpose of evaluation (contact the authors to access the original recordings and the annotations).
Spectrogram and directional statistics for a mobile phone, and the estimated posterior speaker probability as a function of time.
Meeting ID (DER, %) | R1 | R2 | R3 | R4 | R5 | Avg. |
---|---|---|---|---|---|---|
Proposed | 13.1 | 12.5 | 20.9 | 14.0 | 6.5 | 13.4 |
Oracle | 11.3 | 10.8 | 20.5 | 13.7 | 5.6 | 12.4 |
Oracle performance is obtained from the ground-truth labels by assigning the previous segment's speaker label to segments in which multiple speakers overlap. The diarization error rate (DER) of the proposed method is within 2% (absolute) of the oracle for all five recordings, and the average DER is less than 14%.
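As a quick arithmetic check, the averages and the per-recording gap between the proposed method and the oracle can be recomputed from the table. The variable names are illustrative; the numbers are copied from the table.

```python
# DER values (%) copied from the results table.
proposed = {"R1": 13.1, "R2": 12.5, "R3": 20.9, "R4": 14.0, "R5": 6.5}
oracle = {"R1": 11.3, "R2": 10.8, "R3": 20.5, "R4": 13.7, "R5": 5.6}

avg_proposed = sum(proposed.values()) / len(proposed)
avg_oracle = sum(oracle.values()) / len(oracle)

# Absolute gap (percentage points) between proposed and oracle per recording.
gaps = {rec: proposed[rec] - oracle[rec] for rec in proposed}
max_gap = max(gaps.values())

print(f"average DER: proposed {avg_proposed:.1f}%, oracle {avg_oracle:.1f}%")
print(f"largest proposed-oracle gap: {max_gap:.1f} points")
# prints:
# average DER: proposed 13.4%, oracle 12.4%
# largest proposed-oracle gap: 1.8 points
```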