Hello! I am a first-year PhD student at the University of Edinburgh, researching spatial understanding and world representation in brains and large vision-language models (LVLMs) under the supervision of Benjamin Peters and Frank Keller. My work is funded by the UKRI AI CDT in Designing Responsible NLP. Previously, I completed a BSc. (Hons) in Computer Science & Cognitive Science at the University of Toronto as a Lester B. Pearson International scholar. I worked as a research assistant at the CoNSens Lab under the supervision of Matthias Niemeier, developing deep reinforcement learning networks to model task feature integration in sensorimotor tasks. I also conducted research at the CL&NLP group under the supervision of Gerald Penn, investigating novel methods for modelling automatic speech recognition error to enhance downstream spoken language understanding. In my spare time, I like to go on long walks, preferably by the sea. Sometimes I like to make art too.
Affiliations: EdinburghNLP, ILCC
Previously: CL&NLP group, CoNSens Lab
The predominant method for scoring the quality of automatic speech recognition (ASR) transcripts when ground-truth labels are not available is to predict the word error rate (WER) from the corresponding audio segment. We propose WAV2LEV, a novel paradigm for WER estimation which predicts the underlying sequence of Levenshtein edit operations (substitutions, deletions, insertions and matches) from which the WER can be computed. This approach offers finer-grained, token-level error estimation than previous work without compromising performance on WER estimation. To support this investigation, we present Mini-CNoiSY (Miniature Clean-Noisy Speech from YouTube), a bespoke 354-hour noisy speech corpus which ensures confidence in ground-truth labeling and captures a diverse range of noise artifacts that degrade ASR performance. Our results show that WAV2LEV achieves near state-of-the-art performance on WER estimation, with a root mean square error (RMSE) of 0.1488 and a Pearson correlation coefficient (PCC) of 89.71%, while generating predictions of ASR error that are more informative and fine-grained than those of direct WER estimators.
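To make the link between edit operations and WER concrete, here is a minimal sketch in Python. It is not the WAV2LEV model, which predicts the operations from audio. It assumes a reference transcript is available, derives one minimal Levenshtein alignment with standard dynamic programming, and computes the WER from the operation counts. The function names `edit_ops` and `wer_from_ops` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def edit_ops(ref, hyp):
    """Return one minimal sequence of Levenshtein operations turning ref into hyp.

    Illustrative helper (hypothetical name): operations are computed from a
    known reference, whereas an estimator would predict them without one.
    """
    n, m = len(ref), len(hyp)
    # dist[i][j] is the edit distance between ref[:i] and hyp[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # deletion (reference word dropped)
                dist[i][j - 1] + 1,         # insertion (extra hypothesis word)
            )

    # Backtrace from the bottom-right cell to recover the operations.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            same = ref[i - 1] == hyp[j - 1]
            if dist[i][j] == dist[i - 1][j - 1] + (0 if same else 1):
                ops.append("match" if same else "substitution")
                i -= 1
                j -= 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append("deletion")
            i -= 1
        else:
            ops.append("insertion")
            j -= 1
    ops.reverse()
    return ops


def wer_from_ops(ops):
    """Compute WER = (S + D + I) / (S + D + M) from a list of operations."""
    counts = Counter(ops)
    errors = counts["substitution"] + counts["deletion"] + counts["insertion"]
    ref_len = counts["substitution"] + counts["deletion"] + counts["match"]
    if ref_len == 0:
        raise ValueError("WER is undefined for an empty reference")
    return errors / ref_len


reference = "the cat sat on the mat".split()
hypothesis = "the cat sit on mat".split()
ops = edit_ops(reference, hypothesis)
wer = wer_from_ops(ops)
```

Because the denominator counts only operations that consume a reference word (substitutions, deletions and matches), a predicted operation sequence yields both a WER and a per-token indication of where the errors fall.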