Technical Programme

Programme at a Glance

Monday 01 July

12:00 – 13:00
Registration & Lunch

13:00 – 13:20
Welcome Message (Dr. Kate Knill)
Constance Tipper

13:20 – 14:20
Keynote A: Dr. Catherine Lai
Chair: Dr. Mengjie Qian
Constance Tipper

14:30 – 15:30
Poster Session A
LT1/LT2/Inglis Corridor

15:30 – 16:00
Tea/Coffee Break

16:00 – 17:00
Oral Session A
Chair: Dr. Erfan Loweimi
Constance Tipper

18:30 –
Social Event including Banquet Dinner in Association with Google
Robinson College

Tuesday 02 July

08:30 – 09:00
Registration

09:00 – 10:00
Keynote B: Prof. Elizabeth Stokoe
Chair: Dr. Stefano Bannò
Constance Tipper

10:00 – 11:00
Poster Session B
LT1/LT2/Inglis Corridor/Marquee

11:00 – 11:30
Tea/Coffee Break

11:30 – 12:30
Oral Session B
Chair: Dr. Brian Sun
Constance Tipper

12:30 – 13:30
Lunch

13:30 – 14:30
Poster Session C
LT1/LT2/Inglis Corridor/Marquee

14:30 – 15:30
Keynote C: Prof. Jon Barker
Chair: Dr. Simon McKnight
Constance Tipper

15:30 – 16:15
Future Plans & Farewell
Constance Tipper

Detailed Programme

Keynote A


Dr Catherine Lai


Across the prosodic dimension: Exploring spoken communication beyond text


Recent advances in machine learning have made an undeniable impact on the field of speech technology as we’ve long known it. These advances have also led to some rather bold claims: e.g., Speech-to-text (aka Automatic Speech Recognition) and Text-to-Speech synthesis are solved! What these sorts of claims often miss is that the traditional objectives of speech technologies neglect important aspects of spoken communication beyond text. For example, most machine learning oriented work on spoken language understanding still focuses on text-based methods, ignoring the fact that how we speak can change how our words are interpreted. Nevertheless, previous work has shown that speech prosody (e.g. the pitch, energy and timing characteristics of speech) can be used to signal speaker intent and affect, as well as to infer and project dialogue structure. We also know that prosody can be highly contextually variable. So, to make use of prosody in speech technology, we need to be able to model this variability and to understand what it actually does in spoken communication. In this talk, I will discuss recent work exploring prosodic variation in (English) spoken dialogue, using representation learning methods developed for speech generation and recognition. I argue that there are a lot of benefits to be had from self-supervised methods for representation learning on speech and text datasets, but we still need linguistic knowledge to actually make use of the true richness of speech.

Poster Session A

1 Label-Synchronous Neural Transducer for End-to-End ASR.  Keqi Deng, Phil Woodland

2 Impact of Investigator Speech and its removal for Alzheimer’s Dementia Detection. Marek Sviderski,  Basel Barakat,  Becky Allen

3 Imbalanced Multimodal Learning in Video Conversations. Samantha Kotey, Naomi Harte

4 Utilising Unsupervised Text-to-Speech Synthesis for Data Augmentation to Improve Accented Speech Recognition. Cong-Thanh Do, Shuhei Imai, Rama S Doddipatla, Thomas Hain

5 The effect of feeding and non-nutritive sucking on speech sound development at ages 2 to 5 years. Sam Burr

6 Exploring speech representations for proficiency assessment in language learning. Elaf Islam, Chanho Park,  Thomas Hain

7 Affinity: Dialogs to build human-robot rapport. Alina S Larson,  Matthew P Aylett

8 Phonetics and Phonology inside the Black Box. Iona Gessinger,  Erfan Amirzadeh Shams,  Julie Carson-Berndsen

9 Towards explainable speaker recognition neural networks. Yanze Xu

10 Can political science help us improve speech synthesis evaluation? Sébastien Le Maguer

11 SCORE: Self-supervised Correspondence Fine-tuning for Improved Content Representations. Amit Meghanani,  Thomas Hain 

12 Exploring Dominant Paths in CTC-like ASR Models: Unraveling the Effectiveness of Viterbi Decoding. Zeyu Zhao, Peter Bell, Ondřej Klejch

13 Multilingual Integration in Lyrics Transcription: Data, Language Conditioning, and Transliteration Augmentation. Jiawen Huang, Emmanouil Benetos

14 Real-Time Spoken Language Processing in Conversational AI: Challenges and Future Directions. Arash Ashrafzadeh, Arash Eshghi, Matthew Aylett 

15 Self-supervised models for dysarthric speech: Understanding representations through visual analysis and probing. Ariadna Sanchez, Simon King

16 How Much Context Does My Attention-Based ASR System Need? Robert J Flynn, Anton Ragni

17 Focused Discriminative Training For Streaming Automatic Speech Recognition. Adnan Haider

18 Speech Interactions Designed With Minoritised Language Speakers. Emily E Nielsen, Electra Wallington, Ondřej Klejch, Dani Raju Kalarikalayil, Nina Markl, Gavin Bailey, Thomas Reitmaier, Jennifer Pearson, Matt Jones, Peter Bell, Simon Robinson

19 MyChat: a “written speech” corpus rich in conversational features. Mai Hoang Dao, Catherine Lai, Peter Bell

20 Discovery of Grammatical Matrix Language Markers in Code-Switched Text. Olga Iakovenko, Thomas Hain

21 Behavioral evidence for higher articulation rate convergence following natural than artificial time altered speech. Jérémy Giroud, Kirsty Phillips, Jessica Lei, Matthew Davis

22 Dynamic Time Warping as an Alignment Technique for Vocal Puppetry. Eleanor Crocker

23 Bias in Sparse Subnetworks for Multilingual Automatic Speech Recognition. Ed Storey, Naomi Harte, Peter Bell

24 SALMONN: Towards Generic Hearing Abilities for Large Language Models. Guangzhi Sun

25 Speech Synthesis for Sarcasm in Low-Resource Scenarios. Zhu Li,  Yuqing Zhang

26 Zero-shot Audio Topic Reranking using Large Language Models. Mengjie Qian, Rao Ma, Adian Liusie, Erfan Loweimi, Kate Knill, Mark Gales

27 Exploring Category Specific Disorder in Chinese GAD, BD, and MDD using the Verbal Fluency Task and Functional Near-Infrared Spectroscopy. Yufei Ren

28 Investigating Listener’s Perception of Conversational Speaking Style using an Interview-Based Approach. Adaeze Adigwe, Simon King, Sarenne Wallbridge

29 Towards End-to-End Spoken Grammatical Error Correction. Stefano Bannò, Rao Ma, Mengjie Qian, Kate Knill, Mark Gales

30 Hider-Finder-Combiner: Voice Conversion and Voice Privacy through Adversarial Information Hiding.  Jacob J Webber, Oliver Watts, Gustav Eje Henter, Jennifer Williams, Simon King

31 A Comparison of Synthesis Method Impact on Listener Perception of Play-Acted Speech. Emily Lau, Brechtje Post, Kate Knill

32 A framework for flexible model combination of speech tasks. Shreyas Ramoji, Thomas Hain

33 Putting Expression in the Irish Synthetic Voice. Anna M Giovannini, Zihan Wang, Andy Murphy, Maria O’Reilly, Ailbhe Ni Chasaide, Christer Gobl

34 A New Standardized and Reproducible Benchmarking framework for Automatic Personality Perception Experiments over the SSPNet Speaker Personality Corpus. Nesreen Alareef, Evangelia Fringi, Tanaya Guha, Alessandro Vinciarelli

Oral Session A

1 The engineering behind understanding every voice. Jamie Dougherty

2 Speech Modifications for Improved Listening Experience in ADHD Adults. Lucy M Valls-Reed, Jennifer Williams

3 Variability of speech timing features across repeated recordings of non-pathological speech samples. Judith Dineley, Ewan Carr, Lauren White, Catriona Lucas, Zahia Rahman, Tian Pan, Faith Matcham, Johnny Downs, Richard Dobson, Thomas Quatieri, Nicholas Cummins 

Keynote B


Prof. Elizabeth Stokoe


How ‘conversational’ are conversational products and technologies?


Conversational products and technologies are in the headlines more than ever. But how ‘conversational’ are they? And what does ‘conversational’ actually mean? Many products leverage ‘conversation’, from communication training to assessment tools, from scripted interaction to role-play, and from chatbots to voice assistants. But do they do so in ways that strengthen, or do damage to, their domains of use? Six decades of research in conversation analysis have identified and described the constitutive practices of human social interaction across the widest range of ordinary and institutional settings. In this talk, I will address the questions of what, when, and how conversational products could and should leverage from conversation analysis.

Poster Session B



1 Automatic Speech Recognition System-Independent Word Error Rate Estimation. Chanho Park, Mingjie Chen, Thomas Hain

2 Frameworks for Assessing Privacy Risks in Affective Speech-Based Systems. Basmah M. Alsenani, Tanaya Guha, Alessandro Vinciarelli

3 Just Because We Camp, Doesn’t Mean We Should: The Ethics of Modelling Queer Voices. Atli Thor Sigurgeirsson, Eddie L. Ungless

4 Portability of Text- v Audio-based Auto-markers for Dialogic Assessment. Simon W McKnight, Stefano Bannò, Mark Gales, Siyuan Tang, Kate Knill

5 NOMAD: Unsupervised Learning of Perceptual Embeddings for Speech Enhancement and Non-Matching Reference Audio Quality Assessment. Alessandro Ragano, Jan Skoglund, Andrew Hines 

6 Enhancing ASR performance for dysarthric speech: Deep Learning approaches and dataset expansion. Leon Turner, Eugenio Donati

7 Investigating the Role of Visual Information in Audio-Visual Speech Recognition. Zhaofeng Lin,  Naomi Harte

8 Has Artificial Intelligence Rendered Language Teaching Obsolete? Zoe Handley

9 Effective Context in Self-Supervised Speech Models. Yen Meng, Hao Tang 

10 NLP and Speech Processing to study Mental Health Recovery Narratives from NEON (Narrative Experiences Online collection). Shrankhla Pandey, Sarah Morgan

11 On Robustness of Speaker Retrieval in the Wild: A Comparative Study of x-vector and ECAPA-TDNN Models. Erfan Loweimi, Mengjie Qian, Kate Knill, Mark Gales

12 Wearable Audio-Visual AI for Use in Hearing Aids. Poppy Welch, Jennifer Williams

13 Towards Multimodal Turn-taking for Naturalistic Human-Robot Interaction. Sam O’Connor Russell, Naomi Harte

14 Parameter Efficient Finetuning for Speech Emotion Recognition.  Nineli Lashkarashvili, Wen Wu, Guangzhi Sun, Phil Woodland

15 Tracking Articulatory Feature Transition Phenomena in Speech Embeddings. Patrick Cormac English, Erfan Shams Amirzadeh, Julie Carson-Berndsen, John Kelleher

16 3rd COG-MHEAR Audio-Visual Speech Enhancement Challenge (AVSEC-3). Andrea L Aldana, Ondřej Klejch, Cassia Valentini, Peter Bell

17 Research plan: Neural encoding of AI-generated speech prosody by L1 and L2 speakers. Linda Bakkouche

18 Exploring individual speaker characteristics within a forensic automatic speaker recognition system. Chenzi Xu, Vincent Hughes, Paul Foulkes,  Philip Harrison, Poppy Welch, Jessica Wormald, Finnian Kelly, David van der Vloed

19 An Audio-Based Depression Tracking Model  Using Machine Learning. Andrea Vitullo, Eugenio Donati

20 Characterizing Code-switching: Applying Linguistic Principles for Metric Assessment and Development. Jie Chi, Electra Wallington, Peter Bell

21 Leveraging Language Affinities in Wav2Vec Finetuning for Low Resourced Languages. Jeffrey Josanne Michael, Oscar Saz

22 Bootstrapping Spoken Information Retrieval for Unwritten Languages. Ondřej Klejch, Electra Wallington, Thomas Reitmaier, Emily E Nielsen, Dani Kalarikalayil Raju, Nina Markl, Gavin Bailey, Jennifer Pearson, Matt Jones, Simon Robinson, Peter Bell 

23 Investigation of PEVD-based speech enhancement in wireless acoustic sensor networks. Emilie d’Olne, Patrick A Naylor

24 Topical Emphasis synthesis with Voice Puppetry. Emelie Van De Vreken, Catherine Lai, Korin Richmond

25 SOT Triggered Neural Clustering for Speaker Attributed ASR.  Xianrui Zheng, Guangzhi Sun, Chao Zhang, Phil Woodland

26 Fast and high-quality open-source speech synthesis with Matcha-TTS. Shivam Mehta, Ruibo Tu, Jonas Beskow, Eva Szekely, Gustav Eje Henter

27 A Study of Continual Test-Time Adaptation for Speech Recognition. Xinying Wei, Mingjie Chen, Thomas Hain

28 Prosodic cues and processing of Bulgarian-English code-switched speech. Yoana Dancheva, Margreet Vogelzang, Ianthi Tsimpli

29 On the Optimization Aspects of Variational Approaches to Speech Representation Learning. Sung-Lin Yeh, Hao Tang

30 Factors influencing vowel categorisation flexibility. Stephanie Cooper, Brechtje Post

31 Effects of Voice Similarity on the Quality of Experience of Voice Assistants. Crisron Rudolf G Lucas, Andrew Hines

32 HAFFORMER: A Hierarchical Attention-Free Framework for Alzheimer’s Disease Detection From Spontaneous Speech. Zhongren Dong, Zixing Zhang, Weixiang Xu, Jing Han, Jianjun Ou,  Bjorn W Schuller

33 Liquid+sonorant epenthesis in Connemara Irish English:  bilinguals vs. monolinguals. Kate Tallon, Ailbhe Ni Chasaide

34 Investigating the Emergent Audio Classification Ability of ASR Foundation Models. Rao Ma, Adian Liusie, Mark Gales, Kate Knill

Oral Session B

1 The Phonetics of Perceived Voice Similarity: Some Implications for Voice Parades. Kirsty McDougall

2 SummaryMixing: A Linear-Complexity Alternative to Self-Attention for Speech Recognition and Understanding.  Titouan Parcollet, Rogier C van Dalen, Shucong Zhang, Sourav Bhattacharya

3 Can We Trust Explainable AI Methods on ASR?  An Evaluation on Phoneme Recognition. Xiaoliang Wu, Peter Bell, Ajitha Rajan

Poster Session C

1 Linear-Complexity Self-Supervised Learning for Speech Processing.  Shucong Zhang, Titouan Parcollet, Rogier C van Dalen, Sourav Bhattacharya

2 Linear Complexity Unified Streaming and Non-streaming Conformers for Transducer-based Speech Recognition.  Titouan Parcollet, Rogier C van Dalen, Shucong Zhang, Sourav Bhattacharya

3 Fine-tuning of Self-supervised Models Jointly with Autoencoders. Wenjie Peng, Thomas Hain

4 Analysis of self-supervised speech representations for tone languages. Opeyemi Moyinoluwa Osakuade, Simon King

5 Investigation of Spanish-accented English Pronunciation Patterns in ASR. Margot Masson, Julie Carson-Berndsen

6 CognoSpeak: an automatic, remote assessment of early cognitive decline in real-world conversational speech. Madhurananda Pahar, Fuxiang Tao, Bahman Mirheidari, Nathan Pevy, Rebecca Bright, Swapnil Gadgil, Lise Sproson, Dorota Braun, Caitlin Illingworth, Daniel Blackburn, Heidi Christensen

7 video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models. Guangzhi Sun

8 Using Structured Conversational Prompts in the Diagnosis of Dementia. Fritz Peters, Heidi Christensen

9 Crossmodal ASR Error Correction with Discrete Speech Units. Yuanchao Li, Pinzhen Chen, Peter Bell, Catherine Lai

10 Exploring Acoustic Features for Challenging Voice Anonymisation Speakers. Henry Card, Jennifer Williams

11 English for Academic Purposes (EAP) Tutors’ Knowledge, Beliefs and Practices in Relation to the Use of Artificial Intelligence (AI). Zoe Handley

12 Authentic Speaker Recognition. Qiang Huang

13 Learn and Don’t Forget: Adding a New Language to ASR Foundation Models. Mengjie Qian, Siyuan Tang, Rao Ma, Kate Knill, Mark Gales

14 Phonation based control system using laryngeal bio-impedance and machine learning. Eugenio Donati, Christos Chousidis

15 The Effect of Decoding Strategies on the Quality and Diversity of Speech Generated by Large Speech Models. Adaeze Adigwe, Zehai Tu, Simon King

16 Efficiency and (lack of) flexibility in an LSTM network model of spoken word recognition. Máté Aller, Matthew H Davis

17 Designing Conversational Assistants to Discourage Abusive Language. Tanvi Dinkar, Gavin Abercrombie, Chris Pidcock, Benedict Jones, Matthew Aylett

18 Multichannel Binaural Speech Enhancement using Complex Convolutional Transformer Networks. Vikas D Tokala, Emilie d’Olne, Mike Brookes, Simon Doclo,  Jesper Jensen, Patrick A Naylor

19 Low-dimensional Style Token Control for Hyperarticulated Speech Synthesis. Dan Wells, Miku Nishihara, Korin Richmond, Aidan Pine

20 Spoken English Learner Corpus Transcription Validation Process. Mateus Miranda

21 Spontaneous and Scripted Speech Classification for Multilingual Audio. Shahar Elisha, Mariano Beguerisse-Díaz, Emmanouil Benetos

22 Exploring Gender Disparities in Automatic Speech Recognition Technology.  Hend ElGhazaly, Bahman Mirheidari, Nafise Sadat Moosavi, Heidi Christensen

23 Word Time Prediction for Attention-Based Encoder-Decoder Models in Automatic Speech Recognition.  Dongcheng Jiang, Phil Woodland

24 Semantic Map-based Generation of Navigation Instructions. Svetlana Stoyanchev, Chengzu Li, Chao Zhang, Simone Teufel, Rama S Doddipatla

25 Leveraging data sources for foundation models in dialectal Arabic TTS. Simon Shelley, Omar Elsherief, Oscar Saz

26 Using speech graph analysis to differentiate people at clinical high risk of psychosis from healthy controls. Xinyi Liang

27 Improving Retrieval-Augmented Response Generation in Goal-Oriented Dialogue Question Answering. Norbert Braunschweiler, Abigail M Sticha, Rama S Doddipatla, Kate Knill

28 Acoustic Knowledge and Impact of Explainable Artificial Intelligence (XAI) on Audio Models Explainability. Marjan Golmaryami, Jennifer Williams

29 Accounting for vocal mismatch within an automatic speaker recognition system. Tallulah Buckley, Vincent Hughes, Paul Foulkes, Philip Harrison, Jessica Wormald, Finnian Kelly, David van der Vloed

30 Evaluating the Effectiveness of the Conformer ASR Model in Handling Spanish Morphology. Mourbi Basak

31 Muting Whisper: A Universal Acoustic Adversarial Attack on Speech Foundation Models. Vyas Raina, Rao Ma, Charles G McGhee, Kate Knill,  Mark Gales

32 Zipfian Properties of English and Finnish Documents. Martin J Tunnicliffe, Gordon Hunter

33 Low-resource Speech Recognition and Dialect Identification of Irish in a Multi-task Framework. Mengjie Qian, Liam Lonergan, Neasa Ni Chiarain, Christer Gobl,  Ailbhe Ni Chasaide

34 Technical challenges and opportunities in building an Irish-language augmentative and alternative communication (AAC) system. Ailbhe Ni Chasaide, Rían Errity, Emily Barnes, Julie Mhic Con Iomaire

Keynote C


Prof. Jon Barker


Using Machine Learning to Improve Hearing Aid Signal Processing: The Clarity and Cadenza Challenges


At least 1.5 billion people are currently living with hearing loss, and this number will increase as the global population ages.
Many of these people would benefit from hearing aids, yet only a fraction have them, and many of those who do use their devices less often than they should. A major reason for the low uptake is that hearing aids do not perform well enough in many everyday situations. Among users’ biggest complaints are that speech often remains poorly intelligible when listening in noisy situations, and that hearing aids often do not cope well with music. However, recent advances in machine learning have the potential to directly address these problems and transform the experience of hearing aid users.

In this talk, I will present two large EPSRC projects, Clarity (speech) and Cadenza (music), which are collaborations between the Universities of Sheffield, Salford, Cardiff, Leeds and Nottingham. These projects were designed to investigate the potential of machine learning for hearing aids, and also to grow the community of researchers working in this area. The projects are using a series of open challenges to achieve these goals. Clarity has been considering speech intelligibility enhancement and speech intelligibility prediction, while the more recent project, Cadenza, has been considering music enhancement through a process of source separation and hearing impairment-aware remixing. The talk will explain some of the difficulties inherent in hearing aid signal processing, how recent advances from the speech community are being applied, and the new approaches that are emerging from the latest Clarity/Cadenza challenges. It will also present the current challenges (the 3rd Clarity Enhancement Challenge and the 2nd Cadenza Challenge), which will be ongoing at the time of the UKIS meeting, with plenty of opportunity for those interested to get involved.
In this talk, I will be presenting two large EPSRC projects Clarity (speech) and Cadenza (music) that are collaborations between the Universities of Sheffield, Salford, Cardiff, Leeds and Nottingham. These projects have been designed to investigate the potential for hearing aid machine learning, and also to grow the community of researchers working in this area. The projects are using a series of open challenges to achieve these goals. Clarity has been considering speech intelligibility enhancement and speech intelligibility prediction, while the more recent project, Cadenza, has been considering music enhancement through a process of source separation and hearing impairment-aware remixing.  The talk will explain some of the difficulties inherent in hearing aid signal processing, how recent advances from the speech community are being applied, and new approaches that are emerging from the latest Clarity/Cadenza challenges. The talk will also present the current challenges (the 3rd Clarity Enhancement Challenge, and the 2nd Cadenza Challenge) which will be ongoing at the time of the UKIS meeting with plenty of opportunity for those who are interested in getting involved.