GitHub - halsay/ASR-TTS-paper-daily: Update ASR papers every day

Updated on 2026.03.08

Usage instructions: here

This page is modified from here

Table of Contents

ASR
TTS

ASR

Publish Date	Title	Authors	PDF (arXiv ID)	Code
2026-03-05	PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration	Mohammad Javad Ranjbar Kalahroodi et.al.	2603.05314	null
2026-03-05	Visual-Informed Speech Enhancement Using Attention-Based Beamforming	Chihyun Liu et.al.	2603.05270	null
2026-03-05	Beyond Word Error Rate: Auditing the Diversity Tax in Speech Recognition through Dataset Cartography	Ting-Hui Cheng et.al.	2603.05267	null
2026-03-05	Boosting ASR Robustness via Test-Time Reinforcement Learning with Audio-Text Semantic Rewards	Linghan Fang et.al.	2603.05231	null
2026-03-05	Federated Heterogeneous Language Model Optimization for Hybrid Automatic Speech Recognition	Mengze Hong et.al.	2603.04945	null
2026-03-05	Spectral dynamics reservoir computing for high-speed hardware-efficient neuromorphic processing	Jiaxuan Chen et.al.	2603.04901	null
2026-03-05	WhisperAlign: Word-Boundary-Aware ASR and WhisperX-Anchored Pyannote Diarization for Long-Form Bengali Speech	Aurchi Chowdhury et.al.	2603.04809	null
2026-03-05	When Denoising Hinders: Revisiting Zero-Shot ASR with SAM-Audio and Whisper	Akif Islam et.al.	2603.04710	null
2026-02-16	Generating Realistic, Protocol-Compliant Maritime Radio Dialogues using Self-Instruct and Low-Rank Adaptation	Gürsel Akdeniz et.al.	2603.04423	null
2026-03-04	Robust LLM-based Audio-Visual Speech Recognition with Sparse Modality Alignment and Visual Unit-Guided Refinement	Fei Su et.al.	2603.03811	null
2026-02-28	ACES: Accent Subspaces for Coupling, Explanations, and Stress-Testing in Automatic Speech Recognition	Swapnil Parekh et.al.	2603.03359	null
2026-03-03	An Investigation Into Various Approaches For Bengali Long-Form Speech Transcription and Bengali Speaker Diarization	Epshita Jahan et.al.	2603.03158	null
2026-03-03	Speech recognition assisted by large language models to command software orally -- Application to an augmented and virtual reality web app for immersive molecular graphics	Fabio Cortes Rodriguez et.al.	2603.02901	null
2026-03-04	SilentWear: an Ultra-Low Power Wearable System for EMG-based Silent Speech Recognition	Giusy Spacone et.al.	2603.02847	null
2026-03-05	Benchmarking Speech Systems for Frontline Health Conversations: The DISPLACE-M Challenge	Dhanya E et.al.	2603.02813	null
2026-03-02	GLoRIA: Gated Low-Rank Interpretable Adaptation for Dialectal ASR	Pouya Mehralian et.al.	2603.02464	null
2026-03-02	RO-N3WS: Enhancing Generalization in Low-Resource ASR with Diverse Romanian Speech Benchmarks	Alexandra Diaconu et.al.	2603.02368	null
2026-03-02	Sequence-Level Unsupervised Training in Speech Recognition: A Theoretical Study	Zijian Yang et.al.	2603.02285	null
2026-02-27	Whisper-RIR-Mega: A Paired Clean-Reverberant Speech Benchmark for ASR Robustness to Room Acoustics	Mandip Goswami et.al.	2603.02252	link
2026-02-25	Quality of Automatic Speech Recognition -- Polish Language case study -- from Wav2Vec to Scribe ElevenLabs	Marcin Pietroń et.al.	2603.02246	null
2026-03-02	VietSuperSpeech: A Large-Scale Vietnamese Conversational Speech Dataset for ASR Fine-Tuning in Chatbot, Customer Support, and Call Center Applications	Loan Do et.al.	2603.01894	null
2026-03-02	More Data, Fewer Diacritics: Scaling Arabic TTS	Ahmed Musleh et.al.	2603.01622	null
2026-03-02	The USTC-NERCSLIP Systems for the CHiME-9 MCoRec Challenge	Ya Jiang et.al.	2603.01415	null
2026-03-02	End-to-End Simultaneous Dysarthric Speech Reconstruction with Frame-Level Adaptor and Multiple Wait-k Knowledge Distillation	Minghui Wu et.al.	2603.01382	null
2026-03-02	DARS: Dysarthria-Aware Rhythm-Style Synthesis for ASR Enhancement	Minghui Wu et.al.	2603.01369	null
2026-03-03	Using Songs to Improve Kazakh Automatic Speech Recognition	Rustem Yeshpanov et.al.	2603.00961	null
2026-03-01	Towards Orthographically-Informed Evaluation of Speech Recognition Systems for Indian Languages	Kaushal Santosh Bhogale et.al.	2603.00941	null
2026-02-28	Polynomial Mixing for Efficient Self-supervised Speech Encoders	Eva Feillet et.al.	2603.00683	null
2026-02-28	Whisper-MLA: Reducing GPU Memory Consumption of ASR Models based on MHA2MLA Conversion	Sen Zhang et.al.	2603.00563	null
2026-02-16	Iterative LLM-based improvement for French Clinical Interview Transcription and Speaker Diarization	Ambre Marie et.al.	2603.00086	null
2026-02-27	Chunk-wise Attention Transducers for Fast and Accurate Streaming Speech-to-Text	Hainan Xu et.al.	2602.24245	null
2026-02-27	Dialect and Gender Bias in YouTube's Spanish Captioning System	Iris Dania Jimenez et.al.	2602.24002	null
2026-02-26	Challenges in Automatic Speech Recognition for Adults with Cognitive Impairment	Michelle Cohn et.al.	2602.23436	null
2026-02-16	Hello-Chat: Towards Realistic Social Audio Interactions	Yueran Hou et.al.	2602.23387	null
2026-02-26	Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment	Sanjid Hasan et.al.	2602.23070	null
2026-02-26	A Holistic Framework for Robust Bangla ASR and Speaker Diarization with Optimized VAD and CTC Alignment	Zarif Ishmam et.al.	2602.22935	null
2026-02-26	Efficient Dialect-Aware Modeling and Conditioning for Low-Resource Taiwanese Hakka Speech Processing	An-Ci Peng et.al.	2602.22522	null
2026-02-25	TG-ASR: Translation-Guided Learning with Parallel Gated Cross Attention for Low-Resource Automatic Speech Recognition	Cheng-Yeh Yang et.al.	2602.22039	null
2026-02-25	Robust Long-Form Bangla Speech Processing: Automatic Speech Recognition and Speaker Diarization	MD. Sagor Chowdhury et.al.	2602.21741	null
2026-03-02	Mitigating Structural Noise in Low-Resource S2TT: An Optimized Cascaded Nepali-English Pipeline with Punctuation Restoration	Tangsang Chongbang et.al.	2602.21647	null
2026-02-24	823-OLT @ BUET DL Sprint 4.0: Context-Aware Windowing for ASR and Fine-Tuned Speaker Diarization in Bengali Long Form Audio	Ratnajit Dhar et.al.	2602.21183	null
2026-02-24	Training-Free Intelligibility-Guided Observation Addition for Noisy ASR	Haoyang Li et.al.	2602.20967	null
2026-02-23	An Approach to Combining Video and Speech with Large Language Models in Human-Robot Interaction	Guanting Shen et.al.	2602.20219	null
2026-02-22	Pay Attention to CTC: Fast and Robust Pseudo-Labelling for Unified Speech Recognition	Alexandros Haliassos et.al.	2602.19316	null
2026-02-21	Whisper: Courtside Edition Enhancing ASR Performance Through LLM-Driven Context Generation	Yonathan Ron et.al.	2602.18966	null
2026-02-21	ReHear: Iterative Pseudo-Label Refinement for Semi-Supervised Speech Recognition via Audio Large Language Models	Zefang Liu et.al.	2602.18721	null
2026-02-18	Fine-Pruning: A Biologically Inspired Algorithm for Personalization of Machine Learning Models	Joseph Bingham et.al.	2602.18507	null
2026-02-19	Voice-Driven Semantic Perception for UAV-Assisted Emergency Networks	Nuno Saavedra et.al.	2602.17394	null
2026-02-13	Speech to Speech Synthesis for Voice Impersonation	Bjorn Johnson et.al.	2602.16721	null
2026-02-24	Enroll-on-Wakeup: A First Comparative Study of Target Speech Extraction for Seamless Interaction in Real Noisy Human-Machine Dialogue Scenarios	Yiming Yang et.al.	2602.15519	null
2026-02-17	Joint Enhancement and Classification using Coupled Diffusion Models of Signals and Logits	Gilad Nurko et.al.	2602.15405	null
2026-02-16	CLAP-Based Automatic Word Naming Recognition in Post-Stroke Aphasia	Yacouba Kaloga et.al.	2602.14584	null
2026-02-15	From Scarcity to Scale: A Release-Level Analysis of the Pashto Common Voice Dataset	Jandad Jahani et.al.	2602.14062	null
2026-02-15	Eureka-Audio: Triggering Audio Intelligence in Compact Language Models	Dan Zhang et.al.	2602.13954	null
2026-02-14	voice2mode: Phonation Mode Classification in Singing using Self-Supervised Speech Models	Aju Ani Justus et.al.	2602.13928	null
2026-02-03	Multimodal Consistency-Guided Reference-Free Data Selection for ASR Accent Adaptation	Ligong Lei et.al.	2602.13263	null
2026-02-13	ViMedCSS: A Vietnamese Medical Code-Switching Speech Dataset & Benchmark	Tung X. Nguyen et.al.	2602.12911	null
2026-02-13	Lamer-SSL: Layer-aware Mixture of LoRA Experts for Continual Multilingual Expansion of Self-supervised Models without Forgetting	Jing Xu et.al.	2602.12746	null
2026-02-13	PISHYAR: A Socially Intelligent Smart Cane for Indoor Social Navigation and Multimodal Human-Robot Interaction for Visually Impaired People	Mahdi Haghighat Joo et.al.	2602.12597	null
2026-02-13	Decoder-only Conformer with Modality-aware Sparse Mixtures of Experts for ASR	Jaeyoung Lee et.al.	2602.12546	null
2026-01-21	Retrieval-Augmented Self-Taught Reasoning Model with Adaptive Chain-of-Thought for ASR Named Entity Correction	Junjie An et.al.	2602.12287	null
2026-02-16	"Sorry, I Didn't Catch That": How Speech Models Miss What Matters Most	Kaitlyn Zhou et.al.	2602.12249	null
2026-02-12	Moonshine v2: Ergodic Streaming Encoder ASR for Latency-Critical Speech Applications	Manjunath Kudlur et.al.	2602.12241	null
2026-02-12	On the Sensitivity of Firing Rate-Based Federated Spiking Neural Networks to Differential Privacy	Luiz Pereira et.al.	2602.12009	null
2026-02-28	TC-BiMamba: Trans-Chunk bidirectionally within BiMamba for unified streaming and non-streaming ASR	Qingshun She et.al.	2602.11546	null
2026-02-21	Voxtral Realtime	Alexander H. Liu et.al.	2602.11298	null
2026-02-11	Self-Supervised Learning for Speaker Recognition: A study and review	Theo Lepage et.al.	2602.10829	null
2026-02-10	ViSpeechFormer: A Phonemic Approach for Vietnamese Automatic Speech Recognition	Khoa Anh Nguyen et.al.	2602.10003	null
2026-02-10	Where Are We At with Automatic Speech Recognition for the Bambara Language?	Seydou Diallo et.al.	2602.09785	null
2026-02-04	Beyond the Utterance: An Empirical Study of Very Long Context Speech Recognition	Robert Flynn et.al.	2602.09044	null
2026-02-04	Windowed SummaryMixing: An Efficient Fine-Tuning of Self-Supervised Learning Models for Low-resource Speech Recognition	Aditya Srinivas Menon et.al.	2602.09043	null
2026-02-19	Prototype-Based Disentanglement for Controllable Dysarthric Speech Synthesis	Haoshen Wang et.al.	2602.08696	null
2026-02-09	Cross-Modal Bottleneck Fusion For Noise Robust Audio-Visual Speech Recognition	Seaone Ok et.al.	2602.08293	null
2026-02-08	D-ORCA: Dialogue-Centric Optimization for Robust Audio-Visual Captioning	Changli Tang et.al.	2602.07960	null
2026-02-06	Equipping LLM with Directional Multi-Talker Speech Understanding Capabilities	Ju Lin et.al.	2602.07211	null
2026-02-05	From Hallucination to Articulation: Language Model-Driven Losses for Ultra Low-Bitrate Neural Speech Coding	Jayeon Yi et.al.	2602.06213	null
2026-02-05	Enabling Automatic Disordered Speech Recognition: An Impaired Speech Dataset in the Akan Language	Isaac Wiafe et.al.	2602.05406	null
2026-02-11	Evaluating Kubernetes Performance for GenAI Inference: From Automatic Speech Recognition to LLM Summarization	Sai Sindhur Malleni et.al.	2602.04900	null
2026-02-04	Speaker-Aware Simulation Improves Conversational Speech Recognition	Máté Gedeon et.al.	2602.04776	null
2026-03-01	Universal Robust Speech Adaptation for Cross-Domain Speech Recognition and Enhancement	Chien-Chun Wang et.al.	2602.04307	null
2026-02-04	Frontend Token Enhancement for Token-Based Speech Recognition	Takanori Ashihara et.al.	2602.04217	null
2026-02-06	Benchmarking Automatic Speech Recognition for Indian Languages in Agricultural Contexts	Chandrashekar M S et.al.	2602.03868	null
2026-02-03	Mići Princ -- A Little Boy Teaching Speech Technologies the Chakavian Dialect	Nikola Ljubešić et.al.	2602.03245	null
2026-03-02	WAXAL: A Large-Scale Multilingual African Language Speech Corpus	Abdoulaye Diack et.al.	2602.02734	null
2026-02-02	Mixture-of-Experts with Intermediate CTC Supervision for Accented Speech Recognition	Wonjun Lee et.al.	2602.01967	null
2026-02-02	BBPE16: UTF-16-based byte-level byte-pair encoding for improved multilingual speech recognition	Hyunsik Kim et.al.	2602.01717	null
2026-02-01	EmoAra: Emotion-Preserving English Speech Transcription and Cross-Lingual Translation with Arabic Text-to-Speech	Besher Hassan et.al.	2602.01170	null
2026-02-01	Adapting Where It Matters: Depth-Aware Adaptation for Efficient Multilingual Speech Recognition in Low-Resource Languages	Yang Xiao et.al.	2602.01008	null
2026-02-01	MedSpeak: A Knowledge Graph-Aided ASR Error Correction Framework for Spoken Medical QA	Yutong Song et.al.	2602.00981	null
2026-01-30	CALM: Joint Contextual Acoustic-Linguistic Modeling for Personalization of Multi-Speaker ASR	Muhammad Shakeel et.al.	2601.22792	null
2026-01-30	Streaming Speech Recognition with Decoder-Only Large Language Models and Latency Optimization	Genshun Wan et.al.	2601.22779	null
2026-01-29	Towards Robust Dysarthric Speech Recognition: LLM-Agent Post-ASR Correction Beyond WER	Xiuwen Zheng et.al.	2601.21347	null
2026-01-30	Qwen3-ASR Technical Report	Xian Shi et.al.	2601.21337	link
2026-01-28	asr_eval: Algorithms and tools for multi-reference and streaming speech recognition evaluation	Oleg Sedukhin et.al.	2601.20992	null
2026-01-30	Text-only adaptation in LLM-based ASR through text denoising	Sergio Burdisso et.al.	2601.20900	null
2026-01-28	Reducing Prompt Sensitivity in LLM-based Speech Recognition Through Learnable Projection	Sergio Burdisso et.al.	2601.20898	null
2026-01-28	A Study of Data Selection Strategies for Pre-training Self-Supervised Speech Models	Ryan Whetten et.al.	2601.20896	null
2026-01-28	SW-ASR: A Context-Aware Hybrid ASR Pipeline for Robust Single Word Speech Recognition	Manali Sharma et.al.	2601.20890	null
2026-01-27	MA-LipNet: Multi-Dimensional Attention Networks for Robust Lipreading	Matteo Rossi et.al.	2601.20881	null
2026-01-28	ASR for Affective Speech: Investigating Impact of Emotion and Speech Generative Strategy	Ya-Tse Wu et.al.	2601.20319	null
2026-01-28	Mind the Shift: Using Delta SSL Embeddings to Enhance Child ASR	Zilai Wang et.al.	2601.20142	null
2026-01-27	Do we really need Self-Attention for Streaming Automatic Speech Recognition?	Youness Dkhissi et.al.	2601.19960	null
2026-01-23	Benchmarking von ASR-Modellen im deutschen medizinischen Kontext: Eine Leistungsanalyse anhand von Anamnesegesprächen	Thomas Schuster et.al.	2601.19945	null
2026-01-08	FastWhisper: Adaptive Self-knowledge Distillation for Real-time Automatic Speech Recognition	Junseok Lee et.al.	2601.19919	null
2026-01-27	SLM-SS: Speech Language Model for Generative Speech Separation	Tianhua Li et.al.	2601.19533	null
2026-01-27	Dynamic Multi-Expert Projectors with Stabilized Routing for Multilingual Speech Recognition	Isha Pandey et.al.	2601.19451	null
2026-01-27	SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper	Alexander Polok et.al.	2601.19194	null
2026-02-02	Language Family Matters: Evaluating LLM-Based ASR Across Linguistic Boundaries	Yuchen Zhang et.al.	2601.18899	null
2026-01-29	Unheard in the Digital Age: Rethinking AI Bias and Speech Diversity	Onyedikachi Hope Amaechi-Okorie et.al.	2601.18641	null
2026-01-26	Pisets: A Robust Speech Recognition System for Lectures and Interviews	Ivan Bondarenko et.al.	2601.18415	link
2026-01-26	Noise-Robust AV-ASR Using Visual Features Both in the Whisper Encoder and Decoder	Zhengyang Li et.al.	2601.18396	null
2026-01-26	OCR-Enhanced Multimodal ASR Can Read While Listening	Junli Chen et.al.	2601.18393	null
2026-01-26	Efficient Rehearsal for Continual Learning in ASR via Singular Value Tuning	Steven Vander Eeckt et.al.	2601.18266	null
2026-01-26	VIBEVOICE-ASR Technical Report	Zhiliang Peng et.al.	2601.18184	null
2026-01-25	SpatialEmb: Extract and Encode Spatial Information for 1-Stage Multi-channel Multi-speaker ASR on Arbitrary Microphone Arrays	Yiwen Shao et.al.	2601.18037	null
2026-01-25	dLLM-ASR: A Faster Diffusion LLM-based Framework for Speech Recognition	Wenjie Tian et.al.	2601.17902	null
2026-02-28	Speech Emotion Recognition with ASR Integration	Yuanchao Li et.al.	2601.17901	null
2026-01-25	Quran-MD: A Fine-Grained Multilingual Multimodal Dataset of the Quran	Muhammad Umar Salman et.al.	2601.17880	null
2026-01-25	BanglaRobustNet: A Hybrid Denoising-Attention Architecture for Robust Bangla Speech Recognition	Md Sazzadul Islam Ridoy et.al.	2601.17679	null
2026-01-25	End-to-End Joint ASR and Speaker Role Diarization with Child-Adult Interactions	Anfeng Xu et.al.	2601.17640	link
2026-01-24	Window Size Versus Accuracy Experiments in Voice Activity Detectors	Max McKinnon et.al.	2601.17270	null
2026-01-22	Sink or SWIM: Tackling Real-Time ASR at Scale	Federico Bruzzone et.al.	2601.17097	null
2026-01-16	AI-based System for Transforming text and sound to Educational Videos	M. E. ElAlami et.al.	2601.17022	null
2026-01-21	Test-Time Adaptation for Speech Emotion Recognition	Jiaheng Dong et.al.	2601.16240	null
2026-01-20	SoundBreak: A Systematic Study of Audio-Only Adversarial Attacks on Trimodal Models	Aafiya Hussain et.al.	2601.16231	null
2026-01-22	Quantum Dimension Reduction of Hidden Markov Models	Rishi Sundar et.al.	2601.16126	null
2026-01-27	Distillation-based Layer Dropping (DLD): Effective End-to-end Framework for Dynamic Speech Networks	Abdul Hannan et.al.	2601.16117	null
2026-01-20	Lost in Transcription: How Speech-to-Text Errors Derail Code Understanding	Jayant Havare et.al.	2601.15339	null
2026-01-22	Deaf and Hard of Hearing Access to Intelligent Personal Assistants: Comparison of Voice-Based Options with an LLM-Powered Touch Interface	Paige S. DeVries et.al.	2601.15209	null
2026-01-21	Inverse-Hessian Regularization for Continual Learning in ASR	Steven Vander Eeckt et.al.	2601.14751	null
2026-01-19	Typhoon ASR Real-time: FastConformer-Transducer for Thai Automatic Speech Recognition	Warit Sirichotedumrong et.al.	2601.13044	link
2026-01-19	DUAP: Dual-task Universal Adversarial Perturbations Against Voice Control Systems	Suyang Sun et.al.	2601.12786	null
2026-01-18	SSVD-O: Parameter-Efficient Fine-Tuning with Structured SVD for Speech Recognition	Pu Wang et.al.	2601.12600	null
2026-01-18	Purification Before Fusion: Toward Mask-Free Speech Enhancement for Robust Audio-Visual Speech Recognition	Linzhi Wu et.al.	2601.12436	null
2026-01-18	CTC-DID: CTC-Based Arabic dialect identification for streaming applications	Muhammad Umar Farooq et.al.	2601.12199	null
2026-01-16	WenetSpeech-Wu: Datasets, Benchmarks, and Models for a Unified Chinese Wu Dialect Speech Processing Ecosystem	Chengyou Wang et.al.	2601.11027	null
2026-01-15	Unifying Speech Recognition, Synthesis and Conversion with Autoregressive Transformers	Runyuan Cai et.al.	2601.10770	null
2026-01-15	STEAMROLLER: A Multi-Agent System for Inclusive Automatic Speech Recognition for People who Stutter	Ziqi Xu et.al.	2601.10223	null
2025-12-23	Multi-Level Embedding Conformer Framework for Bengali Automatic Speech Recognition	Md. Nazmus Sakib et.al.	2601.09710	null
2026-01-14	Linear Complexity Self-Supervised Learning for Music Understanding with Random Quantizer	Petros Vavaroutsos et.al.	2601.09603	null
2026-01-14	Speech-Hands: A Self-Reflection Voice Agentic Approach to Speech Recognition and Audio Reasoning with Omni Perception	Zhen Wan et.al.	2601.09413	null
2026-01-14	SLAM-LLM: A Modular, Open-Source Multimodal Large Language Model Framework and Best Practice for Speech, Language, Audio and Music Processing	Ziyang Ma et.al.	2601.09385	null
2026-01-17	MCGA: A Multi-task Classical Chinese Literary Genre Audio Corpus	Yexing Du et.al.	2601.09270	link
2026-01-13	Robust CAPTCHA Using Audio Illusions in the Era of Large Language Models: from Evaluation to Advances	Ziqi Ding et.al.	2601.08516	null
2026-01-12	Towards Comprehensive Semantic Speech Embeddings for Chinese Dialects	Kalvin Chang et.al.	2601.07274	link
2026-01-11	Task Arithmetic with Support Languages for Low-Resource ASR	Emma Rafkin et.al.	2601.07038	null
2026-01-11	Categorize Early, Integrate Late: Divergent Processing Strategies in Automatic Speech Recognition	Nathan Roll et.al.	2601.06972	null
2026-01-11	Variational decomposition autoencoding improves disentanglement of latent representations	Ioannis Ziogas et.al.	2601.06844	null
2026-01-11	Doing More with Less: Data Augmentation for Sudanese Dialect Automatic Speech Recognition	Ayman Mansour et.al.	2601.06802	null
2026-01-10	QMAVIS: Long Video-Audio Understanding using Fusion of Large Multimodal Models	Zixing Lin et.al.	2601.06573	null
2026-01-09	An Intelligent AI glasses System with Multi-Agent Architecture for Real-Time Voice Processing and Task Execution	Sheng-Kai Chen et.al.	2601.06235	null
2026-01-13	GenAITEd Ghana: A First-of-Its-Kind Context-Aware and Curriculum-Aligned Conversational AI Agent for Teacher Education	Matthew Nyaaba et.al.	2601.06093	null
2026-01-09	Multimodal In-context Learning for ASR of Low-resource Languages	Zhaolin Li et.al.	2601.05707	null
2026-01-08	LLMs-Integrated Automatic Hate Speech Recognition Using Controllable Text Generation Models	Ryutaro Oshima et.al.	2601.04654	null
2026-01-08	WESR: Scaling and Evaluating Word-level Event-Speech Recognition	Chenchen Yang et.al.	2601.04508	null
2026-01-08	Latent-Level Enhancement with Flow Matching for Robust Automatic Speech Recognition	Da-Hee Yang et.al.	2601.04459	null
2026-01-14	Stuttering-Aware Automatic Speech Recognition for Indonesian Language	Fadhil Muhammad et.al.	2601.03727	null
2026-01-08	TellWhisper: Tell Whisper Who Speaks When	Yifan Hu et.al.	2601.03712	null
2026-01-06	Linear Script Representations in Speech Foundation Models Enable Zero-Shot Transliteration	Ryan Soh-Eun Shim et.al.	2601.02906	null
2026-01-06	Multi-channel multi-speaker transformer for speech recognition	Guo Yifan et.al.	2601.02688	null
2026-01-05	Dynamic Quantization Error Propagation in Encoder-Decoder ASR Quantization	Xinyu Wang et.al.	2601.02455	null
2026-01-05	VocalBridge: Latent Diffusion-Bridge Purification for Defeating Perturbation-Based Voiceprint Defenses	Maryam Abbasihafshejani et.al.	2601.02444	null
2026-01-14	MORE: Multi-Objective Adversarial Attacks on Speech Recognition	Xiaoxue Gao et.al.	2601.01852	null
2026-01-03	IO-RAE: Information-Obfuscation Reversible Adversarial Example for Audio Privacy Protection	Jiajie Zhu et.al.	2601.01239	null
2026-01-02	Improving Code-Switching Speech Recognition with TTS Data Augmentation	Yue Heng Yeo et.al.	2601.00935	null
2025-12-31	Index-ASR Technical Report	Zheshu Song et.al.	2601.00890	null
2026-01-02	Three factor delay learning rules for spiking neural networks	Luke Vassallo et.al.	2601.00668	null
2026-01-01	IKFST: IOO and KOO Algorithms for Accelerated and Precise WFST-based End-to-End Automatic Speech Recognition	Zhuoran Zhuang et.al.	2601.00160	null
2025-12-31	Learning Speech Representations with Variational Predictive Coding	Sung-Lin Yeh et.al.	2601.00100	null
2025-12-31	SLM-TTA: A Framework for Test-Time Adaptation of Generative Spoken Language Models	Yuan-Kuei Wu et.al.	2512.24739	null
2025-12-29	PROFASR-BENCH: A Benchmark for Context-Conditioned ASR in High-Stakes Professional Speech	Deepak Babu Piskala et.al.	2512.23686	link
2025-12-17	Marco-ASR: A Principled and Metric-Driven Framework for Fine-Tuning Large-Scale ASR Models for Domain Adaptation	Xuanfan Ni et.al.	2512.22165	null
2025-12-14	EEG-to-Voice Decoding of Spoken and Imagined speech Using Non-Invasive EEG	Hanbeot Park et.al.	2512.22146	null
2025-12-26	Contextual Biasing for LLM-Based ASR with Hotword Retrieval and Reinforcement Learning	YuXiang Kong et.al.	2512.21828	null
2025-12-25	Broadband tunable microwave photonic radar for simultaneous detection of human respiration, heartbeat, and speech with deep learning-based speech recognition	Lei Gao et.al.	2512.21566	null
2025-12-29	VALLR-Pin: Uncertainty-Factorized Visual Speech Recognition for Mandarin with Pinyin Guidance	Chang Sun et.al.	2512.20032	null
2025-12-22	From Speech to Subtitles: Evaluating ASR Models in Subtitling Italian Television Programs	Alessandro Lucca et.al.	2512.19161	null
2025-12-22	Enhancing Fully Formatted End-to-End Speech Recognition with Knowledge Distillation via Multi-Codebook Vector Quantization	Jian You et.al.	2512.18967	null
2025-12-20	Phoneme-based speech recognition driven by large language models and sampling marginalization	Te Ma et.al.	2512.18371	null
2025-12-20	TICL+: A Case Study On Speech In-Context Learning for Children's Speech Recognition	Haolong Zheng et.al.	2512.18263	null
2025-11-27	Supplementary Resources and Analysis for Automatic Speech Recognition Systems Trained on the Loquacious Dataset	Nick Rossenbach et.al.	2512.17915	null
2025-12-19	Peeking Into The Future For Contextual Biasing	Ramaneswaran Selvakumar et.al.	2512.17657	null
2025-12-19	When De-noising Hurts: A Systematic Study of Speech Enhancement Effects on Modern Medical ASR Systems	Sujal Chondhekar et.al.	2512.17562	null
2025-12-19	Zero-Shot Recognition of Dysarthric Speech Using Commercial Automatic Speech Recognition and Multimodal Large Language Models	Ali Alsayegh et.al.	2512.17474	null
2025-12-19	Incorporating Error Level Noise Embedding for Improving LLM-Assisted Robustness in Persian Speech Recognition	Zahra Rahmani et.al.	2512.17247	null
2025-11-04	V-Agent: An Interactive Video Search System Using Vision-Language Models	SunYoung Park et.al.	2512.16925	null
2026-01-14	Navigating the Reality Gap: Privacy-Preserving On-Device Continual Adaptation of ASR for Clinical Telephony	Darshil Chauhan et.al.	2512.16401	null
2026-01-15	TinyMyo: a Tiny Foundation Model for Flexible EMG Signal Processing at the Edge	Matteo Fasulo et.al.	2512.15729	link
2025-12-16	ComMark: Covert and Robust Black-Box Model Watermarking with Compressed Samples	Yunfei Yang et.al.	2512.15641	null
2025-12-16	Adapting Speech Language Model to Singing Voice Synthesis	Yiwen Zhao et.al.	2512.14657	null
2025-12-16	Scalable Frameworks for Real-World Audio-Visual Speech Recognition	Sungnyun Kim et.al.	2512.14083	null
2025-12-15	Reproducing and Dissecting Denoising Language Models for Speech Recognition	Dorian Koch et.al.	2512.13576	null
2025-12-18	Adaptive Edge-Cloud Inference for Speech-to-Action Systems Using ASR and Large Language Models	Mohammad Jalili Torkamani et.al.	2512.12769	null
2025-12-13	System X: A Mobile Voice-Based AI System for EMR Generation and Clinical Decision Support in Low-Resource Maternal Healthcare	Maryam Mustafa et.al.	2512.12240	null
2025-12-12	All-in-One ASR: Unifying Encoder-Decoder Models of CTC, Attention, and Transducer in Dual-Mode ASR	Takafumi Moriya et.al.	2512.11543	null
2025-12-12	The Affective Bridge: Unifying Feature Representations for Speech Deepfake Detection	Yupei Li et.al.	2512.11241	null
2025-12-11	The TCG CREST -- RKMVERI Submission for the NCIIPC Startup India AI Grand Challenge	Nikhil Raghav et.al.	2512.11009	null
2025-11-30	Benchmarking Automatic Speech Recognition Models for African Languages	Alvin Nahabwe et.al.	2512.10968	null
2025-11-30	ASR Under the Stethoscope: Evaluating Biases in Clinical Speech Recognition across Indian Languages	Subham Kumar et.al.	2512.10967	null
2025-12-11	TRIDENT: A Redundant Architecture for Caribbean-Accented Emergency Speech Triage	Elroy Galbraith et.al.	2512.10741	null
2025-12-10	Robust Speech Activity Detection in the Presence of Singing Voice	Philipp Grundhuber et.al.	2512.09713	null
2025-12-02	Enhancing Automatic Speech Recognition Through Integrated Noise Detection Architecture	Karamvir Singh et.al.	2512.08973	null
2025-12-08	A Simple Method to Enhance Pre-trained Language Models with Speech Tokens for Classification	Nicolas Calbucura et.al.	2512.07571	null
2025-12-08	Efficient ASR for Low-Resource Languages: Leveraging Cross-Lingual Unlabeled Data	Srihari Bandarupalli et.al.	2512.07277	null
2025-12-06	Sanvaad: A Multimodal Accessibility Framework for ISL Recognition and Voice-Based Interaction	Kush Revankar et.al.	2512.06485	null
2025-12-01	KidSpeak: A General Multi-purpose LLM for Kids' Speech Recognition and Screening	Rohan Sharma et.al.	2512.05994	null
2025-11-23	SyncVoice: Towards Video Dubbing with Vision-Augmented Pretrained TTS Model	Kaidi Wang et.al.	2512.05126	null
2025-12-04	Measuring the Unspoken: A Disentanglement Model and Benchmark for Psychological Analysis in the Wild	Yigui Feng et.al.	2512.04728	null
2025-12-04	Multi-Loss Learning for Speech Emotion Recognition with Energy-Adaptive Mixup and Frame-Level Attention	Cong Wang et.al.	2512.04551	null
2025-12-02	Comparing Unsupervised and Supervised Semantic Speech Tokens: A Case Study of Child ASR	Mohan Shi et.al.	2512.03301	null
2025-12-02	MAViD: A Multimodal Framework for Audio-Visual Dialogue Understanding and Generation	Youxin Pang et.al.	2512.03034	null
2025-12-02	Bangla Hate Speech Classification with Fine-tuned Transformer Models	Yalda Keivan Jafari et.al.	2512.02845	null
2025-12-02	Reasoning-Aware Multimodal Fusion for Hateful Video Detection	Shuonan Yang et.al.	2512.02743	null
2025-12-02	Hear What Matters! Text-conditioned Selective Video-to-Audio Generation	Junwon Lee et.al.	2512.02650	null
2025-12-01	See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models	Le Thien Phuc Nguyen et.al.	2512.02231	null
2026-01-19	Swivuriso: The South African Next Voices Multilingual Speech Dataset	Vukosi Marivate et.al.	2512.02201	null
2025-11-18	On the Difficulty of Token-Level Modeling of Dysfluency and Fluency Shaping Artifacts	Kashaf Gulzar et.al.	2512.02027	null
2025-12-01	MAC-SLU: Multi-Intent Automotive Cabin Spoken Language Understanding Benchmark	Yuezhang Peng et.al.	2512.01603	link
2025-12-01	ZO-ASR: Zeroth-Order Fine-Tuning of Speech Foundation Models without Back-Propagation	Yuezhang Peng et.al.	2512.01267	null
2025-11-28	OmniFusion: Simultaneous Multilingual Multimodal Translations via Modular Fusion	Sai Koneru et.al.	2512.00234	link
2025-11-28	Scaling HuBERT for African Languages: From Base to Large and XL	Antoine Caubrière et.al.	2511.23370	null
2025-11-28	HPSU: A Benchmark for Human-Level Perception in Real-World Spoken Speech Understanding	Chen Li et.al.	2511.23178	null
2025-11-28	Group-Aware Partial Model Merging for Children's Automatic Speech Recognition	Thomas Rolland et.al.	2511.23098	null
2025-11-27	Modeling Romanized Hindi and Bengali: Dataset Creation and Multilingual LLM Integration	Kanchon Gharami et.al.	2511.22769	null
2025-11-27	Do You See What I Say? Generalizable Deepfake Detection based on Visual Speech Recognition	Maheswar Bora et.al.	2511.22443	null
2025-11-27	Layover or Direct Flight: Rethinking Audio-Guided Image Segmentation	Joel Alberto Santos et.al.	2511.22025	null
2025-11-16	On the Cross-lingual Transferability of Pre-trained wav2vec2-based Models	Jonatas Grosman et.al.	2511.21704	null
2025-11-26	Multi-Reward GRPO for Stable and Prosodic Single-Codebook TTS LLMs at Scale	Yicheng Zhong et.al.	2511.21270	null
2025-11-26	ASR Error Correction in Low-Resource Burmese with Alignment-Enhanced Transformers using Phonetic Features	Ye Bhone Lin et.al.	2511.21088	null
2025-11-26	Towards Audio Token Compression in Large Audio Language Models	Saurabhchand Bhati et.al.	2511.20973	null
2025-12-24	SingingSDS: A Singing-Capable Spoken Dialogue System for Conversational Roleplay Applications	Jionghao Han et.al.	2511.20972	link
2025-11-25	Bridging the Language Gap: Synthetic Voice Diversity via Latent Mixup for Equitable Speech Recognition	Wesley Bian et.al.	2511.20534	null
2025-11-25	Mispronunciation Detection and Diagnosis Without Model Training: A Retrieval-Based Approach	Huu Tuong Tu et.al.	2511.20107	null
2025-11-25	EM2LDL: A Multilingual Speech Corpus for Mixed Emotion Recognition through Label Distribution Learning	Xingfeng Li et.al.	2511.20106	null
2025-11-25	It Hears, It Sees too: Multi-Modal LLM for Depression Detection By Integrating Visual Understanding into Audio Language Models	Xiangyu Zhao et.al.	2511.19877	null
2025-11-24	Neural Architecture Search for Quantum Autoencoders	Hibah Agha et.al.	2511.19246	null
2025-11-24	AuViRe: Audio-visual Speech Representation Reconstruction for Deepfake Temporal Localization	Christos Koutlis et.al.	2511.18993	null
2025-11-27	PrismAudio: Decomposed Chain-of-Thoughts and Multi-dimensional Rewards for Video-to-Audio Generation	Huadai Liu et.al.	2511.18833	null
2025-11-24	Context-Aware Whisper for Arabic ASR Under Linguistic Varieties	Bashar Talafha et.al.	2511.18774	null
2025-11-24	AIRHILT: A Human-in-the-Loop Testbed for Multimodal Conflict Detection in Aviation	Omar Garib et.al.	2511.18718	null
2025-11-23	A Multimodal Conversational Agent for Tabular Data Analysis	Mohammad Nour Al Awad et.al.	2511.18405	null
2025-11-21	Point of Order: Action-Aware LLM Persona Modeling for Realistic Civic Simulation	Scott Merrill et.al.	2511.17813	null
2025-11-12	Speech Recognition Model Improves Text-to-Speech Synthesis using Fine-Grained Reward	Guansu Wang et.al.	2511.17555	null
2025-11-21	Enhancing Quranic Learning: A Multimodal Deep Learning Approach for Arabic Phoneme Recognition	Ayhan Kucukmanisa et.al.	2511.17477	null
2025-11-21	Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM	Chiori Hori et.al.	2511.17335	null
2025-11-21	Investigating self-supervised representations for audio-visual deepfake detection	Dragos-Alexandru Boldisor et.al.	2511.17181	null
2026-01-19	WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue	Zachary Ellis et.al.	2511.16544	null
2025-12-03	NLP Datasets for Idiom and Figurative Language Tasks	Blake Matheny et.al.	2511.16345	null
2025-11-20	Train Short, Infer Long: Speech-LLM Enables Zero-Shot Streamable Joint ASR and Diarization on Long Audio	Mohan Shi et.al.	2511.16046	null
2025-11-19	Scriboora: Rethinking Human Pose Forecasting	Daniel Bermuth et.al.	2511.15565	null
2025-11-18	Quality-Controlled Multimodal Emotion Recognition in Conversations with Identity-Based Transfer Learning and MAMBA Fusion	Zanxu Wang et.al.	2511.14969	null
2025-11-18	Ground Truth Generation for Multilingual Historical NLP using LLMs	Clovis Gladstone et.al.	2511.14688	null
2025-12-01	IMSE: Efficient U-Net-based Speech Enhancement using Inception Depthwise Convolution and Amplitude-Aware Linear Attention	Xinxin Tang et.al.	2511.14515	null
2025-11-18	TTA: Transcribe, Translate and Alignment for Cross-lingual Speech Representation	Wei Liu et.al.	2511.14410	null
2025-11-18	AfriSpeech-MultiBench: A Verticalized Multidomain Multicountry Benchmark Suite for African Accented English ASR	Gabrial Zencha Ashungafac et.al.	2511.14255	null
2025-11-19	StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model	Yifan Yang et.al.	2511.14223	null
2025-11-18	Listen Like a Teacher: Mitigating Whisper Hallucinations using Adaptive Layer Attention and Knowledge Distillation	Kumud Tripathi et.al.	2511.14219	null
2025-11-17	Human-centric Maintenance Process Through Integration of AI, Speech, and AR	Parul Khanna et.al.	2511.13918	null
2025-11-19	Segmenting Collision Sound Sources in Egocentric Videos	Kranti Kumar Parida et.al.	2511.13863	null
2025-11-26	Passive Dementia Screening via Facial Temporal Micro-Dynamics Analysis of In-the-Wild Talking-Head Video	Filippo Cenacchi et.al.	2511.13802	null
2025-11-05	Emotion Recognition in Multi-Speaker Conversations through Speaker Identification, Knowledge Distillation, and Hierarchical Fusion	Xiao Li et.al.	2511.13731	null
2026-01-14	Toward Conversational Hungarian Speech Recognition: Introducing the BEA-Large and BEA-Dialogue Datasets	Máté Gedeon et.al.	2511.13529	null
2025-11-17	Spatial Blind Spot: Auditory Motion Perception Deficits in Audio LLMs	Zhe Sun et.al.	2511.13273	null
2025-11-17	Distinguishing Repetition Disfluency from Morphological Reduplication in Bangla ASR Transcripts: A Novel Corpus and Benchmarking Analysis	Zaara Zabeen Arpa et.al.	2511.13159	null
2025-11-16	Hi-Reco: High-Fidelity Real-Time Conversational Digital Humans	Hongbin Huang et.al.	2511.12662	null
2025-11-23	Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data	Yunxin Li et.al.	2511.12609	link
2025-11-15	How Far Do SSL Speech Models Listen for Tone? Temporal Focus of Tone Representation under Low-resource Transfer	Minu Kim et.al.	2511.12285	null
2025-11-15	Fusionista2.0: Efficiency Retrieval System for Large-Scale Datasets	Huy M. Le et.al.	2511.12255	null
2025-11-12	Tighter Truncated Rectangular Prism Approximation for RNN Robustness Verification	Xingqi Lin et.al.	2511.11699	null
2025-11-12	Beyond saliency: enhancing explanation of speech emotion recognition with expert-referenced acoustic cues	Seham Nasr et.al.	2511.11691	null
2025-11-14	Speech-Aware Long Context Pruning and Integration for Contextualized Automatic Speech Recognition	Yiming Rong et.al.	2511.11139	null
2025-11-13	TEDxTN: A Three-way Speech Translation Corpus for Code-Switched Tunisian Arabic - English	Fethi Bougares et.al.	2511.10780	null
2025-11-09	Towards Fine-Grained Code-Switch Speech Translation with Semantic Space Alignment	Yan Gao et.al.	2511.10670	null
2025-11-13	Music Flamingo: Scaling Music Understanding in Audio Language Models	Sreyan Ghosh et.al.	2511.10289	null
2025-11-12	Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages	Omnilingual ASR team et.al.	2511.09690	link
2025-11-12	End-to-end Contrastive Language-Speech Pretraining Model For Long-form Spoken Question Answering	Jiliang Hu et.al.	2511.09282	null
2025-11-12	Context-Aware Dynamic Chunking for Streaming Tibetan Speech Recognition	Chao Wang et.al.	2511.09085	null
2025-11-12	Towards Effective and Efficient Non-autoregressive decoders for Conformer and LLM-based ASR using Block-based Attention Mask	Tianzi Wang et.al.	2511.09084	null
2025-11-11	Quantizing Whisper-small: How design choices affect ASR performance	Arthur Söhler et.al.	2511.08093	null
2025-11-11	Speech Emotion Recognition with Phonation Excitation Information and Articulatory Kinematics	Ziqian Zhang et.al.	2511.07955	null
2025-11-13	SpikCommander: A High-performance Spiking Transformer with Multi-view Learning for Efficient Speech Command Recognition	Jiaqi Wang et.al.	2511.07883	null
2025-11-24	SynTTS-Commands: A Public Dataset for On-Device KWS via TTS-Synthesized Multilingual Speech	Lu Gan et.al.	2511.07821	null
2025-11-10	LiveNeRF: Efficient Face Replacement Through Neural Radiance Fields Integration	Tung Vu et.al.	2511.07552	null
2025-11-10	Enabling Automatic Self-Talk Detection via Earables	Euihyeok Lee et.al.	2511.07493	null
2025-11-11	Surgical Agent Orchestration Platform for Voice-directed Patient Data Interaction	Hyeryun Park et.al.	2511.07392	null
2025-11-10	Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models	Umberto Cappellazzo et.al.	2511.07253	link
2025-11-10	Improving Remote Patient Monitoring Systems Using a Fog-based IoT Platform with Speech Recognition	Marc Jayson Baucas et.al.	2511.07189	null
2025-11-10	E2E-VGuard: Adversarial Prevention for Production LLM-based End-To-End Speech Synthesis	Zhisheng Zhang et.al.	2511.07099	null
2025-11-10	CLiFT-ASR: A Cross-Lingual Fine-Tuning Framework for Low-Resource Taiwanese Hokkien Speech Recognition	Hung-Yang Sung et.al.	2511.06860	null
2025-11-10	MedVoiceBias: A Controlled Study of Audio LLM Behavior in Clinical Decision-Making	Zhi Rui Tam et.al.	2511.06592	null
2025-11-07	Shared Latent Representation for Joint Text-to-Audio-Visual Synthesis	Dogucan Yaman et.al.	2511.05432	null
2025-11-12	MERaLiON-SER: Robust Speech Emotion Recognition Model for English and SEA Languages	Hardik B. Sailor et.al.	2511.04914	null
2025-11-06	CantoASR: Prosody-Aware ASR-LALM Collaboration for Low-Resource Cantonese	Dazhong Chen et.al.	2511.04139	null
2025-11-06	WST: Weakly Supervised Transducer for Automatic Speech Recognition	Dongji Gao et.al.	2511.04035	null
2025-11-06	Accelerating scientific discovery with the common task framework	J. Nathan Kutz et.al.	2511.04001	null
2025-11-05	Seeing What You Say: Expressive Image Generation from Speech	Jiyoung Lee et.al.	2511.03423	null
2025-11-05	Open Source State-Of-the-Art Solution for Romanian Speech Recognition	Gabriel Pirlogeanu et.al.	2511.03361	null
2025-11-05	TASU: Text-Only Alignment for Speech Understanding	Jing Peng et.al.	2511.03310	null
2025-11-11	How to Evaluate Speech Translation with Source-Aware Neural MT Metrics	Mauro Cettolo et.al.	2511.03295	null
2025-11-04	An unscented Kalman filter method for real time input-parameter-state estimation	Marios Impraimakis et.al.	2511.02717	null
2025-11-04	Energy-Efficient Hardware Acceleration of Whisper ASR on a CGLA	Takuto Ando et.al.	2511.02269	null
2025-11-03	SeaLLMs-Audio: Large Audio-Language Models for Southeast Asia	Chaoqun Liu et.al.	2511.01670	null
2025-11-02	MULTI-Bench: A Multi-Turn Interactive Benchmark for Assessing Emotional Intelligence ability of Spoken Dialogue Models	Yayue Deng et.al.	2511.00850	null
2025-11-01	Emotion Detection in Speech Using Lightweight and Transformer-Based Models: A Comparative and Ablation Study	Lucky Onyekwelu-Udoka et.al.	2511.00402	null
2025-10-31	Reference Microphone Selection for Guided Source Separation based on the Normalized L-p Norm	Anselm Lohmann et.al.	2510.27198	null
2025-10-30	Overview of the MEDIQA-OE 2025 Shared Task on Medical Order Extraction from Doctor-Patient Consultations	Jean-Philippe Corbeil et.al.	2510.26974	null
2025-10-29	Multi-Representation Attention Framework for Underwater Bioacoustic Denoising and Recognition	Amine Razig et.al.	2510.26838	null
2025-10-29	Audio-Visual Speech Enhancement In Complex Scenarios With Separation And Dereverberation Joint Modeling	Jiarong Du et.al.	2510.26825	null
2025-10-28	Cross-Corpus Validation of Speech Emotion Recognition in Urdu using Domain-Knowledge Acoustic Features	Unzela Talpur et.al.	2510.26823	null
2025-10-28	See the Speaker: Crafting High-Resolution Talking Faces from Speech with Prior Guidance and Region Refinement	Jinting Wang et.al.	2510.26819	null
2025-10-30	HMM for short independent sequences: Multiple sequence Baum-Welch application	Margarita Cabrera-Bean et.al.	2510.26532	null
2025-10-29	Lost in Phonation: Voice Quality Variation as an Evaluation Dimension for Speech Foundation Models	Harm Lameris et.al.	2510.25577	null
2025-10-29	Learning Disentangled Speech- and Expression-Driven Blendshapes for 3D Talking Face Animation	Yuxiang Mao et.al.	2510.25234	null
2025-10-30	Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech	Pedro Corrêa et.al.	2510.25054	null
2025-10-28	POWSM: A Phonetic Open Whisper-Style Speech Foundation Model	Chin-Jou Li et.al.	2510.24992	null
2025-11-25	Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation	Inclusion AI et.al.	2510.24821	null
2025-10-28	BEST-RQ-Based Self-Supervised Learning for Whisper Domain Adaptation	Raphaël Bagat et.al.	2510.24570	null
2025-10-28	Levée d'ambiguïtés par grammaires locales	Eric G. C. Laporte et.al.	2510.24530	null
2025-10-30	Audio Signal Processing Using Time Domain Mel-Frequency Wavelet Coefficient	Rinku Sebastian et.al.	2510.24519	null
2025-10-28	Sound Source Localization for Spatial Mapping of Surgical Actions in Dynamic Scenes	Jonas Hein et.al.	2510.24332	null
2025-10-28	V-SAT: Video Subtitle Annotation Tool	Arpita Kundu et.al.	2510.24180	null
2025-10-28	RegSpeech12: A Regional Corpus of Bengali Spontaneous Speech Across Dialects	Md. Rezuwan Hassan et.al.	2510.24096	null
2025-10-28	Listening without Looking: Modality Bias in Audio-Visual Captioning	Yuchi Ishikawa et.al.	2510.24024	null
2025-10-30	TeleEgo: Benchmarking Egocentric AI Assistants in the Wild	Jiaqi Yan et.al.	2510.23981	null
2025-10-27	A Neural Model for Contextual Biasing Score Learning and Filtering	Wanting Huang et.al.	2510.23849	null
2025-11-01	RoboOmni: Proactive Robot Manipulation in Omni-modal Context	Siyin Wang et.al.	2510.23763	link
2025-10-27	LibriConvo: Simulating Conversations from Read Literature for ASR and Diarization	Máté Gedeon et.al.	2510.23320	null
2025-10-27	Arabic Little STT: Arabic Children Speech Recognition Dataset	Mouhand Alkadri et.al.	2510.23319	null
2025-10-27	A Cocktail-Party Benchmark: Multi-Modal dataset and Comparative Evaluation Results	Thai-Binh Nguyen et.al.	2510.23276	null
2025-10-29	Are ASR foundation models generalized enough to capture features of regional dialects for low-resource languages?	Tawsif Tashwar Dipto et.al.	2510.23252	null
2025-10-27	Treble10: A high-quality dataset for far-field speech recognition, dereverberation, and enhancement	Sarabeth S. Mullins et.al.	2510.23141	null
2025-10-27	Adapting Speech Foundation Models with Large Language Models for Unified Speech Recognition	Jing-Xuan Zhang et.al.	2510.22961	null
2025-10-26	EchoMind: An Interrelated Multi-level Benchmark for Evaluating Empathetic Speech Language Models	Li Zhou et.al.	2510.22758	null
2025-10-26	LRW-Persian: Lip-reading in the Wild Dataset for Persian Language	Zahra Taghizadeh et.al.	2510.22716	null
2025-10-28	Look and Tell: A Dataset for Multimodal Grounding Across Egocentric and Exocentric Views	Anna Deichler et.al.	2510.22672	null
2025-11-02	Mitigating Attention Sinks and Massive Activations in Audio-Visual Speech Recognition with LLMs	Anand et.al.	2510.22603	link
2025-10-26	A Sociophonetic Analysis of Racial Bias in Commercial ASR Systems Using the Pacific Northwest English Corpus	Michael Scott et.al.	2510.22495	null
2025-10-26	The Tonogenesis Continuum in Tibetan: A Computational Investigation	Siyu Liang et.al.	2510.22485	null
2025-10-25	M-CIF: Multi-Scale Alignment For CIF-Based Non-Autoregressive ASR	Ruixiang Mao et.al.	2510.22172	null
2025-10-23	LSF-Animation: Label-Free Speech-Driven Facial Animation via Implicit Feature Representation	Xin Lu et.al.	2510.21864	null
2025-10-24	Compressing Quaternion Convolutional Neural Networks for Audio Classification	Arshdeep Singh et.al.	2510.21388	null
2025-10-24	SindBERT, the Sailor: Charting the Seas of Turkish NLP	Raphael Scheible-Schmitt et.al.	2510.21364	null
2025-10-27	ReFESS-QI: Reference-Free Evaluation For Speech Separation With Joint Quality And Intelligibility Scoring	Ari Frummer et.al.	2510.21014	null
2025-10-22	Beyond Hearing: Learning Task-agnostic ExG Representations from Earphones via Physiology-informed Tokenization	Hyungjun Yoon et.al.	2510.20853	null
2025-10-21	Can large audio language models understand child stuttering speech? speech summarization, and source separation	Chibuzor Okocha et.al.	2510.20850	null
2025-10-23	Decoding the Ear: A Framework for Objectifying Expressiveness from Human Preference Through Efficient Alignment	Zhiyu Lin et.al.	2510.20513	null
2025-10-23	Speaking Clearly: A Simplified Whisper-Based Codec for Low-Bitrate Speech Coding	Xin Zhang et.al.	2510.20504	link
2025-10-23	SpeechAgent: An End-to-End Mobile Infrastructure for Speech Impairment Assistance	Haowei Lou et.al.	2510.20113	null
2025-10-22	Re-evaluating Minimum Bayes Risk Decoding for Automatic Speech Recognition	Yuu Jinnai et.al.	2510.19471	null
2025-10-22	Time delay embeddings to characterize the timbre of musical instruments using Topological Data Analysis: a study on synthetic and real data	Gakusei Sato et.al.	2510.19435	null
2025-10-23	FLASH Viterbi: Fast and Adaptive Viterbi Decoding for Modern Data Systems	Ziheng Deng et.al.	2510.19301	null
2025-10-22	Tibetan Language and AI: A Comprehensive Survey of Resources, Methods and Challenges	Cheng Huang et.al.	2510.19144	null
2025-11-05	StutterZero and StutterFormer: End-to-End Speech Conversion for Stuttering Transcription and Correction	Qianheng Xu et.al.	2510.18938	null
2025-10-28	RIR-Mega: a large-scale simulated room impulse response dataset for machine learning and room acoustics modeling	Mandip Goswami et.al.	2510.18917	link
2025-10-21	Adapting Language Balance in Code-Switching Speech	Enes Yavuz Ugan et.al.	2510.18724	null
2025-10-23	MLMA: Towards Multilingual ASR With Mamba-based Architectures	Mohamed Nabih Ali et.al.	2510.18684	null
2025-10-21	KrishokBondhu: A Retrieval-Augmented Voice-Based Agricultural Advisory Call Center for Bengali Farmers	Mohd Ruhul Ameen et.al.	2510.18355	null
2025-10-20	Transformer Redesign for Late Fusion of Audio-Text Features on Ultra-Low-Power Edge Hardware	Stavros Mitsis et.al.	2510.18036	null
2025-10-20	ImaGGen: Zero-Shot Generation of Co-Speech Semantic Gestures Grounded in Language and Image Input	Hendric Voss et.al.	2510.17617	null
2025-10-20	Conveying Meaning through Gestures: An Investigation into Semantic Co-Speech Gesture Generation	Hendric Voss et.al.	2510.17599	null
2025-10-19	End-to-end Listen, Look, Speak and Act	Siyin Wang et.al.	2510.16756	null
2025-10-19	Zero- and One-Shot Data Augmentation for Sentence-Level Dysarthric Speech Recognition in Constrained Scenarios	Shiyao Wang et.al.	2510.16700	null
2025-10-18	Hallucination Benchmark for Speech Foundation Models	Alkis Koudounas et.al.	2510.16567	null
2025-10-18	Interpreting the Dimensions of Speaker Embedding Space	Mark Huckvale et.al.	2510.16489	null
2025-10-18	Probing the Hidden Talent of ASR Foundation Models for L2 English Oral Assessment	Fu-An Chao et.al.	2510.16387	null
2025-10-18	MuseTok: Symbolic Music Tokenization for Generation and Semantic Understanding	Jingyue Huang et.al.	2510.16273	null
2025-10-17	SpeechLLMs for Large-scale Contextualized Zero-shot Slot Filling	Kadri Hacioglu et.al.	2510.15851	null
2025-10-17	Magnitude and Phase-based Feature Fusion Using Co-attention Mechanism for Speaker recognition	Rongfeng Su et.al.	2510.15659	null
2025-10-17	SpikeVox: Towards Energy-Efficient Speech Therapy Framework with Spike-driven Generative Language Models	Rachmad Vidya Wicaksana Putra et.al.	2510.15566	null
2025-10-17	VocalBench-DF: A Benchmark for Evaluating Speech LLM Robustness to Disfluency	Hongcheng Liu et.al.	2510.15406	null
2025-10-16	OmniMotion: Multimodal Motion Generation with Continuous Masked Autoregression	Zhe Li et.al.	2510.14954	null
2025-10-16	RLAIF-SPA: Optimizing LLM-based Emotional Speech Synthesis via RLAIF	Qing Yang et.al.	2510.14628	null
2025-10-15	Do Slides Help? Multi-modal Context for Automatic Transcription of Conference Talks	Supriti Sinhamahapatra et.al.	2510.13979	null
2025-10-15	Two Heads Are Better Than One: Audio-Visual Speech Error Correction with Dual Hypotheses	Sungnyun Kim et.al.	2510.13281	null
2025-11-13	A Critical Review of the Need for Knowledge-Centric Evaluation of Quranic Recitation	Mohammed Hilal Al-Kharusi et.al.	2510.12858	null
2025-10-14	Adaptive vector steering: A training-free, layer-wise intervention for hallucination mitigation in large audio and multimodal models	Tsung-En Lin et.al.	2510.12851	null
2025-10-11	Automatic Speech Recognition in the Modern Era: Architectures, Training, and Evaluation	Md. Nayeem et.al.	2510.12827	null
2025-10-14	Structured Sparsity and Weight-adaptive Pruning for Memory and Compute efficient Whisper models	Prasenjit K Mudi et.al.	2510.12666	null
2025-10-12	End-to-end Speech Recognition with similar length speech and text	Peng Fan et.al.	2510.10453	null
2025-10-11	End-to-end Automatic Speech Recognition and Speech Translation: Integration of Speech Foundational Models and LLMs	Nam Luu et.al.	2510.10329	null
2025-10-11	SyncLipMAE: Contrastive Masked Pretraining for Audio-Visual Talking-Face Representation	Zeyu Ling et.al.	2510.10069	null
2025-10-10	Accent-Invariant Automatic Speech Recognition via Saliency-Driven Spectrogram Masking	Mohammad Hossein Sameti et.al.	2510.09528	null
2025-10-10	WildElder: A Chinese Elderly Speech Dataset from the Wild with Fine-Grained Manual Annotations	Hui Wang et.al.	2510.09344	null
2025-10-10	Effects of automotive microphone frequency response characteristics and noise conditions on speech and ASR quality -- an experimental evaluation	Michele Buccoli et.al.	2510.09236	null
2025-10-10	FLToP CTC: Frame-Level Token Pruning via Relative Threshold for Efficient and Memory-Saving Decoding on Diverse Platforms	Atul Shree et.al.	2510.09085	null
2025-10-08	Look before Transcription: End-to-End SlideASR with Visually-Anchored Policy Optimization	Rui Hu et.al.	2510.08618	null
2025-10-01	Articulation-Informed ASR: Integrating Articulatory Features into ASR via Auxiliary Speech Inversion and Cross-Attention Fusion	Ahmed Adel Attia et.al.	2510.08585	null
2025-10-09	Pseudo2Real: Task Arithmetic for Pseudo-Label Correction in Automatic Speech Recognition	Yi-Cheng Lin et.al.	2510.08047	null
2025-10-09	Bloodroot: When Watermarking Turns Poisonous For Stealthy Backdoor	Kuan-Yu Chen et.al.	2510.07909	null
2025-10-08	How much speech data is necessary for ASR in African languages? An evaluation of data scaling in Kinyarwanda and Kikuyu	Benjamin Akera et.al.	2510.07221	null
2025-10-09	Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation	Vaibhav Srivastav et.al.	2510.06961	null
2025-10-07	Linguistically Informed Tokenization Improves ASR for Underresourced Languages	Massimo Daul et.al.	2510.06461	null
2025-10-06	How I Built ASR for Endangered Languages with a Spoken Dictionary	Christopher Bartley et.al.	2510.04832	null
2025-10-06	UniVoice: Unifying Autoregressive ASR and Flow-Matching based TTS with Large Language Models	Wenhao Guan et.al.	2510.04593	null
2025-10-06	Evaluating Self-Supervised Speech Models via Text-Based LLMS	Takashi Maekaku et.al.	2510.04463	null
2025-10-05	Probing Whisper for Dysarthric Speech in Detection and Assessment	Zhengjun Yue et.al.	2510.04219	null
2025-10-05	Drax: Speech Recognition with Discrete Flow Matching	Aviv Navon et.al.	2510.04162	link
2025-10-05	MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition	Umberto Cappellazzo et.al.	2510.04136	null
2025-10-04	Adapting Diarization-Conditioned Whisper for End-to-End Multi-Talker Speech Recognition	Martin Kocour et.al.	2510.03723	null
2025-10-04	Towards Unsupervised Speech Recognition at the Syllable-Level	Liming Wang et.al.	2510.03639	null
2025-10-04	Scaling Multi-Talker ASR with Speaker-Agnostic Activity Streams	Xiluo He et.al.	2510.03630	null
2025-10-03	Listening or Reading? Evaluating Speech Awareness in Chain-of-Thought Speech-to-Text Translation	Jacobo Romero-Díaz et.al.	2510.03115	null
2025-10-03	Revisiting Direct Speech-to-Text Translation with Speech LLMs: Better Scaling than CoT Prompting?	Oriol Pareras et.al.	2510.03093	null
2025-10-16	Transcribe, Translate, or Transliterate: An Investigation of Intermediate Representations in Spoken Language Models	Tolúlopé Ògúnrèmí et.al.	2510.02569	null
2025-09-26	KAME: Tandem Architecture for Enhancing Knowledge in Real-Time Speech-to-Speech Conversational AI	So Kuroki et.al.	2510.02327	null
2025-10-02	EvolveCaptions: Empowering DHH Users Through Real-Time Collaborative Captioning	Liang-Yuan Wu et.al.	2510.02181	null
2025-10-01	Backdoor Attacks Against Speech Language Models	Alexandrine Fortier et.al.	2510.01157	null
2025-10-01	Automatic Speech Recognition (ASR) for African Low-Resource Languages: A Systematic Literature Review	Sukairaj Hafiz Imam et.al.	2510.01145	null
2025-10-01	Spiralformer: Low Latency Encoder for Streaming Speech Recognition with Circular Layer Skipping and Early Exiting	Emiru Tsunoo et.al.	2510.00982	null
2025-09-30	IR-UWB Radar-Based Contactless Silent Speech Recognition with Attention-Enhanced Temporal Convolutional Networks	Sunghwa Lee et.al.	2509.26409	null
2025-09-30	ASR Under Noise: Exploring Robustness for Sundanese and Javanese	Salsabila Zahirah Pranida et.al.	2509.25878	null
2025-09-29	Beyond WER: Probing Whisper's Sub-token Decoder Across Diverse Language Resource Levels	Siyu Liang et.al.	2509.25516	null
2025-09-29	Confidence-Guided Error Correction for Disordered Speech Recognition	Abner Hernandez et.al.	2509.25048	null
2025-10-05	HiKE: Hierarchical Evaluation Framework for Korean-English Code-Switching Speech Recognition	Gio Paik et.al.	2509.24613	link
2025-09-29	A Text-To-Text Alignment Algorithm for Better Evaluation of Modern Speech Recognition Systems	Lasse Borgholt et.al.	2509.24478	null
2025-09-29	Code-switching Speech Recognition Under the Lens: Model- and Data-Centric Perspectives	Hexin Liu et.al.	2509.24310	null
2025-09-28	AISHELL6-whisper: A Chinese Mandarin Audio-visual Whisper Speech Dataset with Speech Recognition Baselines	Cancan Li et.al.	2509.23833	null
2025-09-28	Automatic Speech Recognition for Greek Medical Dictation	Vardis Georgilas et.al.	2509.23550	null
2025-09-30	MeanFlowSE: One-Step Generative Speech Enhancement via MeanFlow	Yike Zhu et.al.	2509.23299	null
2025-09-26	ArFake: A Multi-Dialect Benchmark and Baselines for Arabic Spoof-Speech Detection	Mohamed Maged et.al.	2509.22808	null
2025-09-26	Index-MSR: A high-efficiency multimodal fusion framework for speech recognition	Jinming Chen et.al.	2509.22744	null
2025-10-10	From Coarse to Fine: Recursive Audio-Visual Semantic Enhancement for Speech Separation	Ke Xue et.al.	2509.22425	null
2025-09-26	Decoding Deception: Understanding Automatic Speech Recognition Vulnerabilities in Evasion and Poisoning Attacks	Aravindhan G et.al.	2509.22060	null
2025-09-26	A Parallel Ultra-Low Power Silent Speech Interface based on a Wearable, Fully-dry EMG Neckband	Fiona Meier et.al.	2509.21964	null
2025-09-26	Lightweight Front-end Enhancement for Robust ASR via Frame Resampling and Sub-Band Pruning	Siyi Zhao et.al.	2509.21833	null
2025-09-26	Align2Speak: Improving TTS for Low Resource Languages via ASR-Guided Online Preference Optimization	Shehzeen Hussain et.al.	2509.21718	null
2025-09-27	i-LAVA: Insights on Low Latency Voice-2-Voice Architecture for Agents	Anupam Purwar et.al.	2509.20971	null
2025-09-25	Real-Time System for Audio-Visual Target Speech Enhancement	T. Aleksandra Ma et.al.	2509.20741	null
2025-09-25	Visual Authority and the Rhetoric of Health Misinformation: A Multimodal Analysis of Social Media Videos	Mohammad Reza Zarei et.al.	2509.20724	null
2025-09-23	Variational Low-Rank Adaptation for Personalized Impaired Speech Recognition	Niclas Pokel et.al.	2509.20397	null
2025-09-23	Data-Efficient ASR Personalization for Non-Normative Speech Using an Uncertainty-Based Phoneme Difficulty Score for Guided Sampling	Niclas Pokel et.al.	2509.20396	null
2025-09-26	MMedFD: A Real-world Healthcare Benchmark for Multi-turn Full-Duplex Automatic Speech Recognition	Hongzhao Chen et.al.	2509.19817	null
2025-09-23	Retrieval Augmented Generation based context discovery for ASR	Dimitrios Siskos et.al.	2509.19567	null
2025-09-23	SloPalSpeech: A 2,800-Hour Slovak Speech Corpus from Parliamentary Data	Erik Božík et.al.	2509.19270	null
2025-09-23	HD-PPT: Hierarchical Decoding of Content- and Prompt-Preference Tokens for Instruction-based TTS	Sihang Nie et.al.	2509.19001	null
2025-09-23	Group Relative Policy Optimization for Text-to-Speech with Large Language Models	Chang Liu et.al.	2509.18798	null
2025-09-24	M4SER: Multimodal, Multirepresentation, Multitask, and Multistrategy Learning for Speech Emotion Recognition	Jiajun He et.al.	2509.18706	null
2025-09-23	HarmoniFuse: A Component-Selective and Prompt-Adaptive Framework for Multi-Task Speech Language Modeling	Yuke Si et.al.	2509.18570	null
2025-09-23	Explore the Reinforcement Learning for the LLM based ASR and TTS system	Changfeng Gao et.al.	2509.18569	null
2025-09-24	MNV-17: A High-Quality Performative Mandarin Dataset for Nonverbal Vocalization Recognition in Speech	Jialong Mai et.al.	2509.18196	null
2025-09-22	Transformer-Encoder Trees for Efficient Multilingual Machine Translation and Speech Translation	Yiwen Guan et.al.	2509.17930	null
2025-09-22	Leveraging Audio-Visual Data to Reduce the Multilingual Gap in Self-Supervised Speech Models	María Andrea Cruz Blandón et.al.	2509.17523	null
2025-09-29	Sidon: Fast and Robust Open-Source Multilingual Speech Restoration for Large-scale Dataset Cleansing	Wataru Nakata et.al.	2509.17052	link
2025-09-20	Idiosyncratic Versus Normative Modeling of Atypical Speech Recognition: Dysarthric Case Studies	Vishnu Raja et.al.	2509.16718	null
2025-10-09	Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing	Mengqi Wang et.al.	2509.16622	null
2025-09-26	GLip: A Global-Local Integrated Progressive Framework for Robust Visual Speech Recognition	Tianyue Wang et.al.	2509.16031	null
2025-09-22	Interpreting the Role of Visemes in Audio-Visual Speech Recognition	Aristeidis Papadopoulos et.al.	2509.16023	null
2025-09-19	VOX-KRIKRI: Unifying Speech and Language through Continuous Fusion	Dimitrios Damianos et.al.	2509.15667	null
2025-09-19	Layer-wise Minimal Pair Probing Reveals Contextual Grammatical-Conceptual Hierarchy in Speech Representations	Linyang He et.al.	2509.15655	null
2025-09-19	Thinking in cocktail party: Chain-of-Thought and reinforcement learning for target speaker automatic speech recognition	Yiru Zhang et.al.	2509.15612	null
2025-09-19	Chunk Based Speech Pre-training with High Resolution Finite Scalar Quantization	Yun Tang et.al.	2509.15579	null
2025-09-19	State-of-the-Art Dysarthric Speech Recognition with MetaICL for on-the-fly Personalization	Dhruuv Agarwal et.al.	2509.15516	null
2025-09-18	Impact of Phonetics on Speaker Identity in Adversarial Voice Attack	Daniyal Kabir Dar et.al.	2509.15437	null
2025-09-18	BiRQ: Bi-Level Self-Labeling Random Quantization for Self-Supervised Speech Recognition	Liuyuan Jiang et.al.	2509.15430	null
2025-09-23	Frustratingly Easy Data Augmentation for Low-Resource ASR	Katsumi Ibaraki et.al.	2509.15373	null
2025-09-25	Speech Language Models for Under-Represented Languages: Insights from Wolof	Yaya Sy et.al.	2509.15362	null
2025-09-20	Listening, Imagining & Refining: A Heuristic Optimized ASR Correction Framework with LLMs	Yutong Liu et.al.	2509.15095	null
2025-09-18	From Who Said What to Who They Are: Modular Training-free Identity-Aware LLM Refinement of Speaker Diarization	Yu-Wen Chen et.al.	2509.15082	null
2025-09-19	From Hype to Insight: Rethinking Large Language Model Integration in Visual Speech Recognition	Rishabh Jain et.al.	2509.14880	null
2025-09-18	UMA-Split: unimodal aggregation for both English and Mandarin non-autoregressive speech recognition	Ying Fang et.al.	2509.14653	null
2025-09-17	Multi-Channel Differential ASR for Robust Wearer Speech Recognition on Smart Glasses	Yufeng Yang et.al.	2509.14430	null
2025-09-17	CS-FLEURS: A Massively Multilingual and Code-Switched Speech Dataset	Brian Yan et.al.	2509.14161	null
2025-09-25	Canary-1B-v2 & Parakeet-TDT-0.6B-v3: Efficient and High-Performance Models for Multilingual ASR and AST	Monica Sekoyan et.al.	2509.14128	null
2025-09-17	Language Conditioning Improves Accuracy of Aircraft Goal Prediction in Untowered Airspace	Sundhar Vinodh Sangeetha et.al.	2509.14063	null
2025-09-17	Conducting Mission-Critical Voice Experiments with Automated Speech Recognition and Crowdsourcing	Jan Janak et.al.	2509.13724	null
2025-09-09	On the Contribution of Lexical Features to Speech Emotion Recognition	David Combei et.al.	2509.05634	null
2025-07-23	AI Telephone Surveying: Automating Quantitative Data Collection with an AI Interviewer	Danny D. Leybzon et.al.	2507.17718	null
2025-07-23	Synthetic Voice Data for Automatic Speech Recognition in African Languages	Brian DeRenzi et.al.	2507.17578	null
2025-07-23	BoSS: Beyond-Semantic Speech	Qing Wang et.al.	2507.17563	null
2025-07-23	Application of Whisper in Clinical Practice: the Post-Stroke Speech Assessment during a Naming Task	Milena Davudova et.al.	2507.17326	null
2025-07-23	Triple X: A LLM-Based Multilingual Speech Recognition System for the INTERSPEECH2025 MLC-SLM Challenge	Miaomiao Gao et.al.	2507.17288	null
2025-07-20	Weak Supervision Techniques towards Enhanced ASR Models in Industry-level CRM Systems	Zhongsheng Wang et.al.	2507.16843	null
2025-07-15	Towards Robust Speech Recognition for Jamaican Patois Music Transcription	Jordan Madden et.al.	2507.16834	null
2025-07-22	Step-Audio 2 Technical Report	Boyong Wu et.al.	2507.16632	null
2025-07-22	An approach to measuring the performance of Automatic Speech Recognition (ASR) models in the context of Large Language Model (LLM) powered applications	Sujith Pulikodan et.al.	2507.16456	null
2025-07-21	Beyond Rate Coding: Surrogate Gradients Enable Spike Timing Learning in Spiking Neural Networks	Ziqiao Yu et.al.	2507.16043	null
2025-07-21	Mixture to Beamformed Mixture: Leveraging Beamformed Mixture as Weak-Supervision for Speech Enhancement and Noise-Robust ASR	Zhong-Qiu Wang et.al.	2507.15229	null
2025-07-21	EchoVoices: Preserving Generational Voices and Memories for Seniors and Children	Haiying Xu et.al.	2507.15221	null
2025-07-19	Adapting Whisper for Lightweight and Efficient Automatic Speech Recognition of Children for On-device Edge Applications	Satwik Dutta et.al.	2507.14451	null
2025-07-18	Open Automatic Speech Recognition Models for Classical and Modern Standard Arabic	Lilit Grigoryan et.al.	2507.13977	null
2025-07-18	Optimizing ASR for Catalan-Spanish Code-Switching: A Comparative Analysis of Methodologies	Carlos Mena et.al.	2507.13875	null
2025-07-17	Reading Between the Lines: Combining Pause Dynamics and Semantic Coherence for Automated Assessment of Thought Disorder	Feng Chen et.al.	2507.13551	null
2025-07-18	Automatically assessing oral narratives of Afrikaans and isiXhosa children	Retief Louw et.al.	2507.13205	null
2025-07-17	NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech	Maksim Borisov et.al.	2507.13155	null
2025-07-17	UniSLU: Unified Spoken Language Understanding from Heterogeneous Cross-Task Datasets	Zhichao Sheng et.al.	2507.12951	null
2025-07-17	Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine	Anastasia Kuznetsova et.al.	2507.12701	null
2025-07-16	Improving Contextual ASR via Multi-grained Fusion with Large Language Models	Shilin Zhou et.al.	2507.12252	null
2025-07-14	WhisperKit: On-device Real-time ASR with Billion-Scale Transformers	Atila Orhon et.al.	2507.10860	null
2025-07-20	Supporting SENCOTEN Language Documentation Efforts with Automatic Speech Recognition	Mengzhe Geng et.al.	2507.10827	null
2025-07-14	DQLoRA: A Lightweight Domain-Aware Denoising ASR via Adapter-guided Distillation	Yiru Yang et.al.	2507.10313	null
2025-07-13	The DKU System for Multi-Speaker Automatic Speech Recognition in MLC-SLM Challenge	Yuke Lin et.al.	2507.09499	null
2025-07-12	Can We Really Repurpose Multi-Speaker ASR Corpus for Speaker Diarization?	Shota Horiguchi et.al.	2507.09226	null
2025-07-22	Mixture of LoRA Experts with Multi-Modal and Multi-Granularity LLM Generative Error Correction for Accented Speech Recognition	Bingshen Mu et.al.	2507.09116	null
2025-07-06	A Hybrid Machine Learning Framework for Optimizing Crop Selection via Agronomic and Economic Forecasting	Niranjan Mallikarjun Sindhur et.al.	2507.08832	null
2025-07-11	The Impact of Automatic Speech Transcription on Speaker Attribution	Cristina Aggazzotti et.al.	2507.08660	null
2025-07-11	ILT-Iterative LoRA Training through Focus-Feedback-Fix for Multilingual Speech Recognition	Qingliang Meng et.al.	2507.08477	null
2025-07-10	DARAS: Dynamic Audio-Room Acoustic Synthesis for Blind Room Impulse Response Estimation	Chunxi Wang et.al.	2507.08135	null
2025-07-10	Modèle physique variationnel pour l'estimation de réponses impulsionnelles de salles [Variational physical model for the estimation of room impulse responses]	Louis Lalay et.al.	2507.08051	null
2025-07-10	Edge-ASR: Towards Low-Bit Quantization of Automatic Speech Recognition Models	Chen Feng et.al.	2507.07877	null
2025-07-10	Code-Switching in End-to-End Automatic Speech Recognition: A Systematic Literature Review	Maha Tufail Agro et.al.	2507.07741	null
2025-07-08	Deep Feed-Forward Neural Network for Bangla Isolated Speech Recognition	Dipayan Bhadra et.al.	2507.07068	null
2025-07-04	Pronunciation-Lexicon Free Training for Phoneme-based Crosslingual ASR via Joint Stochastic Approximation	Saierdaer Yusuyin et.al.	2507.06249	null
2025-07-21	VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis	Alexandre Symeonidis-Herzig et.al.	2507.06060	null
2025-07-08	How to Evaluate Automatic Speech Recognition: Comparing Different Performance and Bias Measures	Tanvina Patel et.al.	2507.05885	null
2025-07-08	ContextASR-Bench: A Massive Contextual Speech Recognition Benchmark	He Wang et.al.	2507.05727	null
2025-11-06	Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition	Zijin Gu et.al.	2507.05724	null
2025-07-07	Adaptive Slimming for Scalable and Efficient Speech Enhancement	Riccardo Miccini et.al.	2507.04879	null
2025-07-08	SHNU Multilingual Conversational Speech Recognition System for INTERSPEECH 2025 MLC-SLM Challenge	Yuxiang Mei et.al.	2507.03343	null
2025-06-26	A Unified Speech LLM for Diarization and Speech Recognition in Multilingual Conversations	Phurich Saengthong et.al.	2507.02927	null
2025-07-03	Open-Source System for Multilingual Translation and Cloned Speech Synthesis	Mateo Cámara et.al.	2507.02530	null
2025-07-03	A Cookbook for Community-driven Data Collection of Impaired Speech in LowResource Languages	Sumaya Ahmed Salihs et.al.	2507.02428	null
2025-07-03	Benchmarking Akan ASR Models Across Domain-Specific Datasets: A Comparative Evaluation of Performance, Scalability, and Adaptability	Mark Atta Mensah et.al.	2507.02407	null
2025-07-02	Adaptability of ASR Models on Low-Resource Language: A Comparative Study of Whisper and Wav2Vec-BERT on Bangla	Md Sazzadul Islam Ridoy et.al.	2507.01931	null
2025-07-02	First Steps Towards Voice Anonymization for Code-Switching Speech	Sarina Meyer et.al.	2507.01765	null
2025-07-02	PERTINENCE: Input-based Opportunistic Neural Network Dynamic Execution	Omkar Shende et.al.	2507.01695	null
2025-07-02	Learning from Random Subspace Exploration: Generalized Test-Time Augmentation with Self-supervised Distillation	Andrei Jelea et.al.	2507.01347	null
2025-07-02	AI Meets Maritime Training: Precision Analytics for Enhanced Safety and Performance	Vishakha Lall et.al.	2507.01274	null
2025-06-16	Hello Afrika: Speech Commands in Kinyarwanda	George Igwegbe et.al.	2507.01024	null
2025-07-01	MambAttention: Mamba with Multi-Head Attention for Generalizable Single-Channel Speech Enhancement	Nikolai Lund Kühne et.al.	2507.00966	null
2025-07-01	Rectifying Magnitude Neglect in Linear Attention	Qihang Fan et.al.	2507.00698	null
2025-07-01	Audio-3DVG: Unified Audio - Point Cloud Fusion for 3D Visual Grounding	Duc Cao-Dinh et.al.	2507.00669	null
2025-06-29	Research on Comprehensive Classroom Evaluation System Based on Multiple AI Models	Cong Xie et.al.	2506.23079	null
2025-06-28	Mind the Gap: Entity-Preserved Context-Aware ASR Structured Transcriptions	Duygu Altinok et.al.	2506.22858	null
2025-06-28	Boosting CTC-Based ASR Using LLM-Based Intermediate Loss Regularization	Duygu Altinok et.al.	2506.22846	null
2025-06-28	A Self-Training Approach for Whisper to Enhance Long Dysarthric Speech Recognition	Shiyao Wang et.al.	2506.22810	null
2025-06-27	Speaker Targeting via Self-Speaker Adaptation for Multi-talker ASR	Weiqing Wang et.al.	2506.22646	null
2025-06-27	Cross-lingual Data Selection Using Clip-level Acoustic Similarity for Enhancing Low-resource Automatic Speech Recognition	Shunsuke Mitsumori et.al.	2506.22194	null
2025-06-27	SAGE: Spliced-Audio Generated Data for Enhancing Foundational Models in Low-Resource Arabic-English Code-Switched Speech Recognition	Muhammad Umar Farooq et.al.	2506.22143	null
2025-06-27	Analyzing and Fine-Tuning Whisper Models for Multilingual Pilot Speech Transcription in the Cockpit	Kartheek Kumar Reddy Nareddy et.al.	2506.21990	null
2025-06-23	Adapting Foundation Speech Recognition Models to Impaired Speech: A Semantic Re-chaining Approach for Personalization of German Speech	Niclas Pokel et.al.	2506.21622	null
2025-06-16	Language-Aware Prompt Tuning for Parameter-Efficient Seamless Language Expansion in Multilingual ASR	Hongli Yang et.al.	2506.21577	null
2025-06-16	Adapting Whisper for Parameter-efficient Code-Switching Speech Recognition via Soft Prompt Tuning	Hongli Yang et.al.	2506.21576	null
2025-06-12	FormosanBench: Benchmarking Low-Resource Austronesian Languages in the Era of Large Language Models	Kaiying Kevin Lin et.al.	2506.21563	null
2025-06-11	Efficient Multilingual ASR Finetuning via LoRA Language Experts	Jiahong Li et.al.	2506.21555	null
2025-06-25	Multimodal Representation Learning and Fusion	Qihang Jin et.al.	2506.20494	null
2025-06-25	Lightweight Target-Speaker-Based Overlap Transcription for Practical Streaming ASR	Aleš Pražák et.al.	2506.20288	null
2025-06-24	Accurate, fast, cheap: Choose three. Replacing Multi-Head-Attention with Bidirectional Recurrent Attention for Long-Form ASR	Martin Ratajczak et.al.	2506.19761	null
2025-06-23	Context Biasing for Pronunciations-Orthography Mismatch in Automatic Speech Recognition	Christian Huber et.al.	2506.18703	null
2025-06-23	Evaluating Multichannel Speech Enhancement Algorithms at the Phoneme Scale Across Genders	Nasser-Eddine Monir et.al.	2506.18691	null
2025-06-23	End-to-End Spoken Grammatical Error Correction	Mengjie Qian et.al.	2506.18532	null
2025-06-28	AI-Generated Song Detection via Lyrics Transcripts	Markus Frohmann et.al.	2506.18488	null
2025-06-22	Splitformer: An improved early-exit architecture for automatic speech recognition on edge devices	Maxence Lasbordes et.al.	2506.18035	null
2025-06-21	OpusLM: A Family of Open Unified Speech Language Models	Jinchuan Tian et.al.	2506.17611	null
2025-06-27	Data Quality Issues in Multilingual Speech Datasets: The Need for Sociolinguistic Awareness and Proactive Language Planning	Mingfei Lau et.al.	2506.17525	null
2025-06-20	Breaking the Transcription Bottleneck: Fine-tuning ASR Models for Extremely Low-Resource Fieldwork Languages	Siyu Liang et.al.	2506.17459	null
2025-06-20	Simultaneous Translation with Offline Speech and LLM Models in CUNI Submission to IWSLT 2025	Dominik Macháček et.al.	2506.17077	link
2025-06-20	Instituto de Telecomunicações at IWSLT 2025: Aligning Small-Scale Speech and Language Models for Speech-to-Text Learning	Giuseppe Attanasio et.al.	2506.17019	link
2025-06-27	State-Space Models in Efficient Whispered and Multi-dialect Speech Recognition	Aref Farhadipour et.al.	2506.16969	null
2025-06-20	LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization	Daejin Jo et.al.	2506.16738	null
2025-06-19	Weight Factorization and Centralization for Continual Learning in Speech Recognition	Enes Yavuz Ugan et.al.	2506.16574	null
2025-06-19	Automatic Speech Recognition Biases in Newcastle English: an Error Analysis	Dana Serditova et.al.	2506.16558	null
2025-06-18	Exploiting Music Source Separation for Automatic Lyrics Transcription with Whisper	Jaza Syed et.al.	2506.15514	null
2025-06-18	Foundation of Affective Computing and Interaction	Changzeng Fu et.al.	2506.15497	null
2025-06-17	Thinking in Directivity: Speech Large Language Model for Multi-Talker Directional Speech Recognition	Jiamin Xie et.al.	2506.14973	null
2025-06-17	Unifying Streaming and Non-streaming Zipformer-based ASR	Bidisha Sharma et.al.	2506.14434	null
2025-06-17	Improving Practical Aspects of End-to-End Multi-Talker Speech Recognition for Online and Offline Scenarios	Aswin Shanmugam Subramanian et.al.	2506.14204	null
2025-06-17	AsyncSwitch: Asynchronous Text-Speech Adaptation for Code-Switched ASR	Tuan Nguyen et.al.	2506.14190	null
2025-06-16	A Silent Speech Decoding System from EEG and EMG with Heterogenous Electrode Configurations	Masakazu Inoue et.al.	2506.13835	null
2025-07-07	Qwen vs. Gemma Integration with Whisper: A Comparative Study in Multilingual SpeechLLM Systems	Tuan Nguyen et.al.	2506.13596	null
2025-06-16	BUT System for the MLC-SLM Challenge	Alexander Polok et.al.	2506.13414	null
2025-07-04	Bi-directional Context-Enhanced Speech Large Language Models for Multilingual Conversational ASR	Yizhou Peng et.al.	2506.13396	null
2025-07-04	NTU Speechlab LLM-Based Multilingual ASR System for Interspeech MLC-SLM Challenge 2025	Yizhou Peng et.al.	2506.13339	null
2025-06-18	Seewo's Submission to MLC-SLM: Lessons learned from Speech Reasoning Language Models	Bo Li et.al.	2506.13300	null
2025-06-15	SC-SOT: Conditioning the Decoder on Diarized Speaker Information for End-to-End Overlapped Speech Recognition	Yuta Hirano et.al.	2506.12672	null
2025-06-13	Adapting Whisper for Streaming Speech Recognition via Two-Pass Decoding	Haoran Zhou et.al.	2506.12154	null
2025-05-31	CMT-LLM: Contextual Multi-Talker ASR Utilizing Large Language Models	Jiajun He et.al.	2506.12059	null
2025-06-13	Enabling automatic transcription of child-centered audio recordings from real-world environments	Daniil Kocharov et.al.	2506.11747	null
2025-06-13	Lightweight and Robust Multi-Channel End-to-End Speech Recognition with Spherical Harmonic Transform	Xiangzhu Kong et.al.	2506.11630	null
2025-06-13	(SimPhon Speech Test): A Data-Driven Method for In Silico Design and Validation of a Phonetically Balanced Speech Test	Stefan Bleeck et.al.	2506.11620	null
2025-06-13	Machine Unlearning for Robust DNNs: Attribution-Guided Partitioning and Neuron Pruning in Noisy Environments	Deliang Jin et.al.	2506.11615	null
2025-06-12	Advances in Small-Footprint Keyword Spotting: A Comprehensive Review of Efficient Models and Algorithms	Soumen Garai et.al.	2506.11169	link
2025-06-10	ASRJam: Human-Friendly AI Speech Jamming to Prevent Automated Phone Scams	Freddie Grabovski et.al.	2506.11125	null
2025-06-09	Benchmarking Foundation Speech and Language Models for Alzheimer's Disease and Related Dementia Detection from Spontaneous Speech	Jingyu Li et.al.	2506.11119	null
2025-06-05	Customizing Speech Recognition Model with Large Language Model Feedback	Shaoshi Ling et.al.	2506.11091	null
2025-06-05	Better Pseudo-labeling with Multi-ASR Fusion and Error Correction by SpeechLLM	Jeena Prakash et.al.	2506.11089	null
2025-06-04	Improving Child Speech Recognition and Reading Mistake Detection by Using Prompts	Lingyun Gao et.al.	2506.11079	null
2025-06-02	Regularized Federated Learning for Privacy-Preserving Dysarthric and Elderly Speech Recognition	Tao Zhong et.al.	2506.11069	null
2025-05-31	PMF-CEC: Phoneme-augmented Multimodal Fusion for Context-aware ASR Error Correction with Error-specific Selective Decoding	Jiajun He et.al.	2506.11064	null
2025-06-12	Improving Named Entity Transcription with Contextual LLM-based Revision	Viet Anh Trinh et.al.	2506.10779	null
2025-06-12	FairASR: Fair Audio Contrastive Learning for Automatic Speech Recognition	Jongsuk Kim et.al.	2506.10747	null
2025-06-12	Joint ASR and Speaker Role Tagging with Serialized Output Training	Anfeng Xu et.al.	2506.10349	null
2025-06-11	Regularizing Learnable Feature Extraction for Automatic Speech Recognition	Peter Vieting et.al.	2506.09804	null
2025-06-11	OWSM-Biasing: Contextualizing Open Whisper-Style Speech Models for Automatic Speech Recognition with Dynamic Vocabulary	Yui Sudo et.al.	2506.09448	null
2025-06-10	SimClass: A Classroom Speech Dataset Generated via Game Engine Simulation For Automatic Speech Recognition Research	Ahmed Adel Attia et.al.	2506.09206	null
2025-07-11	Addressing Pitfalls in Auditing Practices of Automatic Speech Recognition Technologies: A Case Study of People with Aphasia	Katelyn Xiaoying Mei et.al.	2506.08846	link
2025-06-09	Uncovering the Functional Roles of Nonlinearity in Memory	Manuel Brenner et.al.	2506.07919	null
2025-06-09	Unified Semi-Supervised Pipeline for Automatic Speech Recognition	Nune Tadevosyan et.al.	2506.07659	null
2025-06-09	Transcript-Prompted Whisper with Dictionary-Enhanced Decoding for Japanese Speech Annotation	Rui Hu et.al.	2506.07646	null
2025-06-09	Speaker-Distinguishable CTC: Learning Speaker Distinction Using CTC for Multi-Talker Speech Recognition	Asahi Sakuma et.al.	2506.07515	null
2025-06-09	DeRAGEC: Denoising Named Entity Candidates with Synthetic Rationale for ASR Error Correction	Solee Im et.al.	2506.07510	null
2025-06-11	Towards Energy-Efficient and Low-Latency Voice-Controlled Smart Homes: A Proposal for Offline Speech Recognition and IoT Integration	Peng Huang et.al.	2506.07494	null
2025-06-08	Speech Recognition on TV Series with Video-guided Post-Correction	Haoyuan Yang et.al.	2506.07323	null
2025-06-08	Technical Report: A Practical Guide to Kaldi ASR Optimization	Mengze Hong et.al.	2506.07149	null
2025-06-07	Automatic Speech Recognition of African American English: Lexical and Contextual Effects	Hamid Mojarad et.al.	2506.06888	null
2025-06-07	Beyond Classification: Towards Speech Emotion Reasoning with Multitask AudioLLMs	Wenyu Zhang et.al.	2506.06820	null
2025-06-07	A Survey of Retentive Network	Haiqi Yang et.al.	2506.06708	null
2025-06-06	AS-ASR: A Lightweight Framework for Aphasia-Specific Automatic Speech Recognition	Chen Bao et.al.	2506.06566	null
2025-06-13	Structured State Space Model Dynamics and Parametrization for Spiking Neural Networks	Maxime Fabre et.al.	2506.06374	link
2025-06-06	Lightweight Prompt Biasing for Contextualized End-to-End ASR Systems	Bo Ren et.al.	2506.06252	null
2025-06-06	Phonetically-Augmented Discriminative Rescoring for Voice Search Error Correction	Christophe Van Gysel et.al.	2506.06117	null
2025-06-06	Diarization-Aware Multi-Speaker Automatic Speech Recognition via Large Language Models	Yuke Lin et.al.	2506.05796	null
2025-06-06	Bridging the Modality Gap: Softly Discretizing Audio Representation for LLM-based Automatic Speech Recognition	Mu Yang et.al.	2506.05706	null
2025-06-06	Low-Resource Domain Adaptation for Speech LLMs via Text-Only Fine-Tuning	Yangui Fang et.al.	2506.05671	null
2025-06-03	Auto Review: Second Stage Error Detection for Highly Accurate Information Extraction from Phone Conversations	Ayesha Qamar et.al.	2506.05400	null
2025-06-05	LLM-based phoneme-to-grapheme for phoneme-based speech recognition	Te Ma et.al.	2506.04711	null
2025-06-05	ViCocktail: Automated Multi-Modal Data Collection for Vietnamese Audio-Visual Speech Recognition	Thai-Binh Nguyen et.al.	2506.04635	null
2025-06-05	LESS: Large Language Model Enhanced Semi-Supervised Learning for Speech Foundational Models	Wen Ding et.al.	2506.04586	null
2025-06-04	Effects of Speaker Count, Duration, and Accent Diversity on Zero-Shot Accent Robustness in Low-Resource ASR	Zheng-Xin Yong et.al.	2506.04364	null
2025-06-04	MFLA: Monotonic Finite Look-ahead Attention for Streaming Speech Recognition	Yinfeng Xia et.al.	2506.03722	null
2025-06-03	A Multi-Dialectal Dataset for German Dialect ASR and Dialect-to-Standard Speech Translation	Verena Blaschke et.al.	2506.02894	null
2025-06-03	Overcoming Data Scarcity in Multi-Dialectal Arabic ASR via Whisper Fine-Tuning	Ömer Tarik Özyilmaz et.al.	2506.02627	null
2025-06-03	On the Language and Gender Biases in PSTN, VoIP and Neural Audio Codecs	Kemal Altwlkany et.al.	2506.02545	null
2025-06-03	SOVA-Bench: Benchmarking the Speech Conversation Ability for LLM-based Voice Assistant	Yixuan Hou et.al.	2506.02457	null
2025-06-03	Enhancing Lyrics Transcription on Music Mixtures with Consistency Loss	Jiawen Huang et.al.	2506.02339	null
2025-06-02	Cocktail-Party Audio-Visual Speech Recognition	Thai-Binh Nguyen et.al.	2506.02178	null
2025-06-02	HENT-SRT: Hierarchical Efficient Neural Transducer with Self-Distillation for Joint Speech Recognition and Translation	Amir Hussein et.al.	2506.02157	null
2025-06-01	Enhancing Speech Instruction Understanding and Disambiguation in Robotics via Speech Prosody	David Sasu et.al.	2506.02057	null
2025-05-31	No Audiogram: Leveraging Existing Scores for Personalized Speech Intelligibility Prediction	Haoshuai Zhou et.al.	2506.02039	null
2025-05-27	Leveraging Large Language Models in Visual Speech Recognition: Model Scaling, Context-Aware Decoding, and Iterative Polishing	Zehua Liu et.al.	2506.02012	null
2025-05-27	CNVSRC 2024: The Second Chinese Continuous Visual Speech Recognition Challenge	Zehua Liu et.al.	2506.02010	null
2025-06-02	DNCASR: End-to-End Training for Speaker-Attributed ASR	Xianrui Zheng et.al.	2506.01916	null
2025-06-02	Reasoning-Based Approach with Chain-of-Thought for Alzheimer's Detection Using Speech and Large Language Models	Chanwoo Park et.al.	2506.01683	null
2025-06-02	Self-Supervised Speech Quality Assessment (S3QA): Leveraging Speech Foundation Models for a Scalable Speech Quality Metric	Mattson Ogg et.al.	2506.01655	null
2025-06-02	Riemannian Time Warping: Multiple Sequence Alignment in Curved Spaces	Julian Richter et.al.	2506.01635	null
2025-06-02	Unsupervised Rhythm and Voice Conversion to Improve ASR on Dysarthric Speech	Karl El Hajal et.al.	2506.01618	null
2025-06-02	Analyzing the Importance of Blank for CTC-Based Knowledge Distillation	Benedikt Hilmes et.al.	2506.01503	null
2025-06-02	TalTech Systems for the Interspeech 2025 ML-SUPERB 2.0 Challenge	Tanel Alumäe et.al.	2506.01458	null
2025-06-02	Whale: Large-Scale multilingual ASR model with w2v-BERT and E-Branchformer with large speech data	Yosuke Kashiwagi et.al.	2506.01439	null
2025-06-02	Speech-to-Speech Translation Pipelines for Conversations in Low-Resource Languages	Andrei Popescu-Belis et.al.	2506.01406	null
2025-06-02	CleanS2S: Single-file Framework for Proactive Speech-to-Speech Interaction	Yudong Lu et.al.	2506.01268	null
2025-06-02	WCTC-Biasing: Retraining-free Contextual Biasing ASR with Wildcard CTC-based Keyword Spotting and Inter-layer Biasing	Yu Nakagome et.al.	2506.01263	null
2025-06-01	GigaAM: Efficient Self-Supervised Learner for Speech Recognition	Aleksandr Kutsakov et.al.	2506.01192	link
2025-06-01	What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training	Marianne de Heer Kloots et.al.	2506.00981	link
2025-06-01	Fine-Tuning ASR for Stuttered Speech: Personalized vs. Generalized Approaches	Dena Mujtaba et.al.	2506.00853	null
2025-05-31	Chain-of-Thought Training for Open E2E Spoken Dialogue Systems	Siddhant Arora et.al.	2506.00722	null
2025-05-31	Towards Temporally Explainable Dysarthric Speech Clarity Assessment	Seohyun Park et.al.	2506.00454	link
2025-05-31	DYNAC: Dynamic Vocabulary based Non-Autoregressive Contextualization for Speech Recognition	Yui Sudo et.al.	2506.00422	null
2025-05-31	Causal Structure Discovery for Error Diagnostics of Children's ASR	Vishwanath Pratap Singh et.al.	2506.00402	null
2025-05-30	Can LLMs Understand Unvoiced Speech? Exploring EMG-to-Text Conversion with LLMs	Payal Mohapatra et.al.	2506.00304	null
2025-05-30	Vedavani: A Benchmark Corpus for ASR on Vedic Sanskrit Poetry	Sujeet Kumar et.al.	2506.00145	null
2025-05-30	SwitchLingua: The First Large-Scale Multilingual and Multi-Ethnic Code-Switching Dataset	Peng Xie et.al.	2506.00087	null
2025-05-30	Running Conventional Automatic Speech Recognition on Memristor Hardware: A Simulated Approach	Nick Rossenbach et.al.	2505.24721	null
2025-06-02	MSDA: Combining Pseudo-labeling and Self-Supervision for Unsupervised Domain Adaptation in ASR	Dimitrios Damianos et.al.	2505.24656	null
2025-05-30	SuPseudo: A Pseudo-supervised Learning Method for Neural Speech Enhancement in Far-field Speech Recognition	Longjie Luo et.al.	2505.24450	null
2025-05-30	Pseudo Labels-based Neural Speech Enhancement for the AVSR Task in the MISP-Meeting Challenge	Longjie Luo et.al.	2505.24446	null
2025-06-05	Fewer Hallucinations, More Verification: A Three-Stage LLM-Based Framework for ASR Error Correction	Yangui Fang et.al.	2505.24347	null
2025-05-30	Dynamic Context-Aware Streaming Pretrained Language Model For Inverse Text Normalization	Luong Ho et.al.	2505.24229	null
2025-05-30	MOPSA: Mixture of Prompt-Experts Based Speaker Adaptation for Elderly Speech Recognition	Chengxi Deng et.al.	2505.24224	null
2025-06-03	Improving Multilingual Speech Models on ML-SUPERB 2.0: Fine-tuning with Data Augmentation and LID-Aware CTC	Qingzheng Wang et.al.	2505.24200	null
2025-05-29	BeaverTalk: Oregon State University's IWSLT 2025 Simultaneous Speech Translation System	Matthew Raffel et.al.	2505.24016	link
2025-05-29	Prompting Whisper for Improved Verbatim Transcription and End-to-end Miscue Detection	Griffin Dietz Smith et.al.	2505.23627	null
2025-05-29	Contextualized Automatic Speech Recognition with Dynamic Vocabulary Prediction and Activation	Zhennan Lin et.al.	2505.23077	null
2025-05-29	AISHELL-5: The First Open-Source In-Car Multi-Channel Multi-Speaker Speech Dataset for Automatic Speech Diarization and Recognition	Yuhang Dai et.al.	2505.23036	link
2025-05-28	NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding	Vladimir Bataev et.al.	2505.22857	null
2025-06-05	Evaluation of LLMs in Speech is Often Flawed: Test Set Contamination in Large Language Models for Speech Recognition	Yuan Tseng et.al.	2505.22251	null
2025-05-28	Advancing Hearing Assessment: An ASR-Based Frequency-Specific Speech Test for Diagnosing Presbycusis	Stefan Bleeck et.al.	2505.22231	null
2025-05-28	On-the-fly Routing for Zero-shot MoE Speaker Adaptation of Speech Foundation Models for Dysarthric Speech Recognition	Shujie HU et.al.	2505.22072	null
2025-05-28	Weakly Supervised Data Refinement and Flexible Sequence Compression for Efficient Thai LLM-based ASR	Mingchen Shao et.al.	2505.22063	null
2025-05-28	Overlap-Adaptive Hybrid Speaker Diarization and ASR-Aware Observation Addition for MISP 2025 Challenge	Shangkun Huang et.al.	2505.22013	null
2025-05-28	Leveraging LLM for Stuttering Speech: A Unified Architecture Bridging Recognition and Event Detection	Shangkun Huang et.al.	2505.22005	null
2025-05-27	GMU Systems for the IWSLT 2025 Low-Resource Speech Translation Shared Task	Chutong Meng et.al.	2505.21781	null
2025-05-27	Loquacious Set: 25,000 Hours of Transcribed and Diverse English Speech Recognition Data for Research and Commercial Use	Titouan Parcollet et.al.	2505.21578	null
2025-05-25	WhisperD: Dementia Speech Recognition and Filler Word Detection with Whisper	Emmanuel Akinrintoyo et.al.	2505.21551	null
2025-05-29	VietASR: Achieving Industry-level Vietnamese ASR with 50-hour labeled data and Large-Scale Speech Pretraining	Jianheng Zhuo et.al.	2505.21527	null
2025-05-27	Towards One-bit ASR: Extremely Low-bit Conformer Quantization Using Co-training and Stochastic Precision	Zhaoqing Li et.al.	2505.21245	null
2025-05-27	PSRB: A Comprehensive Benchmark for Evaluating Persian ASR Systems	Nima Sedghiyeh et.al.	2505.21230	null
2025-05-27	Topological Deep Learning for Speech Data	Zhiwang Yu et.al.	2505.21173	null
2025-05-27	Leveraging LLM and Self-Supervised Training Models for Speech Recognition in Chinese Dialects: A Comparative Analysis	Tianyi Xu et.al.	2505.21138	null
2025-05-27	Towards Pretraining Robust ASR Foundation Model with Acoustic-Aware Data Augmentation	Dancheng Liu et.al.	2505.20606	null
2025-05-30	The NaijaVoices Dataset: Cultivating Large-Scale, High-Quality, Culturally-Rich Speech Data for African Languages	Chris Emezue et.al.	2505.20564	null
2025-05-26	Robust fine-tuning of speech recognition models via model merging: application to disordered speech	Alexandre Ducorroy et.al.	2505.20477	null
2025-06-05	In-context Language Learning for Endangered Languages in Speech Recognition	Zhaolin Li et.al.	2505.20445	null
2025-05-26	Continuous Learning for Children's ASR: Overcoming Catastrophic Forgetting with Elastic Weight Consolidation and Synaptic Intelligence	Edem Ahadzi et.al.	2505.20216	null
2025-05-26	Exploring Generative Error Correction for Dysarthric Speech Recognition	Moreno La Quatra et.al.	2505.20163	link
2025-05-26	Mixture of LoRA Experts for Low-Resourced Multi-Accent Automatic Speech Recognition	Raphaël Bagat et.al.	2505.20006	null
2025-05-26	Novel Loss-Enhanced Universal Adversarial Patches for Sustainable Speaker Privacy	Elvir Karimov et.al.	2505.19951	null
2025-05-26	KIT's Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization	Zhaolin Li et.al.	2505.19679	null
2025-05-26	Languages in Multilingual Speech Foundation Models Align Both Phonetically and Semantically	Ryan Soh-Eun Shim et.al.	2505.19606	null
2025-05-26	Beyond Manual Transcripts: The Potential of Automated Speech Recognition Errors in Improving Alzheimer's Disease Detection	Yin-Long Liu et.al.	2505.19448	null
2025-05-25	BR-ASR: Efficient and Scalable Bias Retrieval Framework for Contextual Biasing ASR in Speech LLM	Xun Gong et.al.	2505.19179	null
2025-05-24	Building a Functional Machine Translation Corpus for Kpelle	Kweku Andoh Yamoah et.al.	2505.18905	null
2025-05-24	StandUp4AI: A New Multilingual Dataset for Humor Detection in Stand-up Comedy Videos	Valentin Barriere et.al.	2505.18903	null
2025-05-24	CHSER: A Dataset and Case Study on Generative Speech Error Correction for Child ASR	Natarajan Balaji Shankar et.al.	2505.18463	link
2025-05-23	Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities	Ziwei Zhou et.al.	2505.17862	link
2025-05-27	CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training	Zhihao Du et.al.	2505.17589	null
2025-05-23	Swedish Whispers; Leveraging a Massive Speech Corpus for Swedish Speech Recognition	Leonora Vesterbacka et.al.	2505.17538	null
2025-05-23	Speechless: Speech Instruction Training Without Speech for Low Resource Languages	Alan Dao et.al.	2505.17417	link
2025-05-23	LLM-based Generative Error Correction for Rare Words with Synthetic Data and Phonetic Context	Natsuo Yamashita et.al.	2505.17410	link
2025-06-02	An End-to-End Approach for Child Reading Assessment in the Xhosa Language	Sergio Chevtchenko et.al.	2505.17371	null
2025-05-20	From Weak Labels to Strong Results: Utilizing 5,000 Hours of Noisy Classroom Transcripts with Minimal Accurate Data	Ahmed Adel Attia et.al.	2505.17088	null
2025-05-30	Impact of Frame Rates on Speech Tokenizer: A Case Study on Mandarin and English	Haoyang Zhang et.al.	2505.17076	null
2025-05-28	An Effective Training Framework for Light-Weight Automatic Speech Recognition Models	Abdul Hannan et.al.	2505.16991	null
2025-05-22	From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition	Tianduo Wang et.al.	2505.16972	link
2025-05-22	SoccerChat: Integrating Multimodal Data for Enhanced Soccer Game Understanding	Sushant Gautam et.al.	2505.16630	null
2025-05-27	X-ARES: A Comprehensive Framework for Assessing Audio Encoder Performance	Junbo Zhang et.al.	2505.16369	link
2025-05-24	Large Language Models based ASR Error Correction for Child Conversations	Anfeng Xu et.al.	2505.16212	null
2025-05-22	Differentiable K-means for Fully-optimized Discrete Token-based ASR	Kentaro Onda et.al.	2505.16207	null
2025-05-22	Prosodically Enhanced Foreign Accent Simulation by Discrete Token-based Resynthesis Only with Native Speech Corpora	Kentaro Onda et.al.	2505.16191	null
2025-05-22	Selective Invocation for Multilingual ASR: A Cost-effective Approach Adapting to Speech Recognition Difficulty	Hongfei Xue et.al.	2505.16168	null
2025-05-21	Word Level Timestamp Generation for Automatic Speech Recognition and Translation	Ke Hu et.al.	2505.15646	link
2025-05-20	In-Context Learning Boosts Speech Recognition via Human-like Adaptation to Speakers and Language Varieties	Nathan Roll et.al.	2505.14887	null
2025-05-30	Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages	Chin-Jou Li et.al.	2505.14874	link
2025-05-20	Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits	Tiantian Feng et.al.	2505.14648	link
2025-05-20	Dual Precision Quantization for Efficient and Accurate Deep Neural Networks Inference	Tomer Gafni et.al.	2505.14638	link
2025-05-20	PersonaTAB: Predicting Personality Traits using Textual, Acoustic, and Behavioral Cues in Fully-Duplex Speech Dialogs	Sho Inoue et.al.	2505.14356	link
2025-05-21	Scaling and Enhancing LLM-based AVSR: A Sparse Mixture of Projectors Approach	Umberto Cappellazzo et.al.	2505.14336	null
2025-05-23	HausaNLP: Current Status, Challenges and Future Directions for Hausa Natural Language Processing	Shamsuddeen Hassan Muhammad et.al.	2505.14311	null
2025-05-27	The Multimodal Information Based Speech Processing (MISP) 2025 Challenge: Audio-Visual Diarization and Recognition	Ming Gao et.al.	2505.13971	null
2025-08-12	Transfer Learning from Visual Speech Recognition to Mouthing Recognition in German Sign Language	Dinh Nam Pham et.al.	2505.13784	null
2025-05-21	Multi-head Temporal Latent Attention	Keqi Deng et.al.	2505.13544	link
2025-05-21	Granary: Speech Recognition and Translation Dataset in 25 European Languages	Nithin Rao Koluguri et.al.	2505.13404	null
2025-05-19	Cross-modal Knowledge Transfer Learning as Graph Matching Based on Optimal Transport for ASR	Xugang Lu et.al.	2505.13079	null
2025-05-19	KIT's Offline Speech Translation and Instruction Following Submission for IWSLT 2025	Sai Koneru et.al.	2505.13036	null
2025-05-19	Personalized Fine-Tuning with Controllable Synthetic Speech from LLM-Generated Transcripts for Dysarthric Speech Recognition	Dominik Wagner et.al.	2505.12991	null
2025-05-19	Calm-Whisper: Reduce Whisper Hallucination On Non-Speech By Calming Crazy Heads Down	Yingzhi Wang et.al.	2505.12969	null
2025-05-16	Automatic Speech Recognition for African Low-Resource Languages: Challenges and Future Directions	Sukairaj Hafiz Imam et.al.	2505.11690	null
2025-05-16	ASR-FAIRBENCH: Measuring and Benchmarking Equity Across Speech Recognition Systems	Anand Rai et.al.	2505.11572	null
2025-05-26	LipDiffuser: Lip-to-Speech Generation with Conditional Diffusion Models	Danilo de Oliveira et.al.	2505.11391	null
2025-05-16	LegoSLM: Connecting LLM with Speech Encoder using CTC Posteriors	Rao Ma et.al.	2505.11352	null
2025-05-16	Survey of End-to-End Multi-Speaker Automatic Speech Recognition for Monaural Audio	Xinlu He et.al.	2505.10975	null
2025-05-27	Multi-Stage Speaker Diarization for Noisy Classrooms	Ali Sartaz Khan et.al.	2505.10879	link
2025-05-15	Inclusivity of AI Speech in Healthcare: A Decade Look Back	Retno Larasati et.al.	2505.10596	null
2025-05-15	Quantized Approximate Signal Processing (QASP): Towards Homomorphic Encryption for audio	Tu Duyen Nguyen et.al.	2505.10500	null
2025-05-12	Full simulation on the dynamics of auditory synaptic fusion: Strong clustering of calcium channel might be the origin of the coherent release in the auditory hair cells	Jaeyun Yoo et.al.	2505.07273	null
2025-05-09	Remote Rowhammer Attack using Adversarial Observations on Federated Learning Clients	Jinsheng Yuan et.al.	2505.06335	null
2025-05-08	Teochew-Wild: The First In-the-wild Teochew Dataset with Orthographic Annotations	Linrong Pan et.al.	2505.05056	null
2025-05-07	SwinLip: An Efficient Visual Speech Encoder for Lip Reading Using Swin Transformer	Young-Hu Park et.al.	2505.04394	null
2025-05-09	Robust Speech Recognition with Schrödinger Bridge-Based Speech Enhancement	Rauf Nasretdinov et.al.	2505.04237	null
2025-05-06	VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model	Zuwei Long et.al.	2505.03739	link
2025-05-06	Fairness of Automatic Speech Recognition in Cleft Lip and Palate Speech	Susmita Bhattacharjee et.al.	2505.03697	null
2025-05-26	SepALM: Audio Language Models Are Error Correctors for Robust Speech Separation	Zhaoxi Mu et.al.	2505.03273	null
2025-05-15	CoGenAV: Versatile Audio-Visual Representation Learning via Contrastive-Generative Synchronization	Detao Bai et.al.	2505.03186	link
2025-05-05	Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play	Yemin Shi et.al.	2505.02707	link
2025-05-08	Transforming faces into video stories -- VideoFace2.0	Branko Brkljač et.al.	2505.02060	link
2025-05-06	A Synergistic Framework of Nonlinear Acoustic Computing and Reinforcement Learning for Real-World Human-Robot Interaction	Xiaoliang Chen et.al.	2505.01998	null
2025-05-02	Transfer Learning-Based Deep Residual Learning for Speech Recognition in Clean and Noisy Environments	Noussaiba Djeffal et.al.	2505.01632	null
2025-05-01	Scaling On-Device GPU Inference for Large Generative Models	Jiuqiang Tang et.al.	2505.00232	null
2025-07-31	BERSting at the Screams: A Benchmark for Distanced, Emotional and Shouted Speech Recognition	Paige Tuttösí et.al.	2505.00059	link
2025-04-30	Retrieval-Enhanced Few-Shot Prompting for Speech Event Extraction	Máté Gedeon et.al.	2504.21372	null
2025-04-28	A Comprehensive Part-of-Speech Tagging to Standardize Central-Kurdish Language: A Research Guide for Kurdish Natural Language Processing Tasks	Shadan Shukr Sabr et.al.	2504.19645	null
2025-04-25	Kimi-Audio Technical Report	KimiTeam et.al.	2504.18425	link
2025-04-28	Augmenting Captions with Emotional Cues: An AR Interface for Real-Time Accessible Communication	Sunday David Ubur et.al.	2504.17171	null
2025-04-22	TinyML for Speech Recognition	Andrew Barovic et.al.	2504.16213	null
2025-04-22	LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale	Joya Chen et.al.	2504.16030	null
2025-04-22	Development and evaluation of a deep learning algorithm for German word recognition from lip movements	Dinh Nam Pham et.al.	2504.15792	null
2025-04-21	Chinese-LiPS: A Chinese audio-visual speech recognition dataset with Lip-reading and Presentation Slides	Jinghua Zhao et.al.	2504.15066	null
2025-04-21	StableQuant: Layer Adaptive Post-Training Quantization for Speech Foundation Models	Yeona Hong et.al.	2504.14915	null
2025-04-17	Acoustic to Articulatory Inversion of Speech; Data Driven Approaches, Challenges, Applications, and Future Scope	Leena G Pillai et.al.	2504.13308	null
2025-05-04	Dysarthria Normalization via Local Lie Group Transformations for Robust ASR	Mikhail Osipov et.al.	2504.12279	link
2025-04-03	Edge Intelligence for Wildlife Conservation: Real-Time Hornbill Call Classification Using TinyML	Kong Ka Hing et.al.	2504.12272	null
2025-04-19	Advancing Arabic Speech Recognition Through Large-Scale Weakly Supervised Learning	Mahmoud Salhab et.al.	2504.12254	null
2025-04-15	Real-Time Word-Level Temporal Segmentation in Streaming Speech Recognition	Naoto Nishida et.al.	2504.10849	null
2025-04-25	Spatial Audio Processing with Large Language Model on Wearable Devices	Ayushi Mishra et.al.	2504.08907	null
2025-04-10	From Speech to Summary: A Comprehensive Survey of Speech Summarization	Fabian Retkowski et.al.	2504.08024	null
2025-04-09	Visual-Aware Speech Recognition for Noisy Scenarios	Lakshmipathi Balaji et.al.	2504.07229	null
2025-04-09	RNN-Transducer-based Losses for Speech Recognition on Noisy Targets	Vladimir Bataev et.al.	2504.06963	link
2025-04-07	DoCIA: An Online Document-Level Context Incorporation Agent for Speech Translation	Xinglin Lyu et.al.	2504.05122	null
2025-04-06	Public speech recognition transcripts as a configuring parameter	Damien Rudaz et.al.	2504.04488	null
2025-04-06	Selective Masking Adversarial Attack on Automatic Speech Recognition Systems	Zheng Fang et.al.	2504.04394	null
2025-05-08	An Efficient GPU-based Implementation for Noise Robust Sound Source Localization	Zirui Lin et.al.	2504.03373	null
2025-04-04	A Human Digital Twin Architecture for Knowledge-based Interactions and Context-Aware Conversations	Abdul Mannan Mohammed et.al.	2504.03147	null
2025-03-26	Efficient First-Order Optimization on the Pareto Set for Multi-Objective Learning under Preference Guidance	Lisha Chen et.al.	2504.02854	null
2025-04-03	LinTO Audio and Textual Datasets to Train and Evaluate Automatic Speech Recognition in Tunisian Arabic Dialect	Hedi Naouara et.al.	2504.02604	null
2025-04-22	F5R-TTS: Improving Flow-Matching based Text-to-Speech with Group Relative Policy Optimization	Xiaohui Sun et.al.	2504.02407	null
2025-04-02	Chain of Correction for Full-text Speech Recognition with Large Language Models	Zhiyuan Tang et.al.	2504.01519	null
2025-04-01	Whispering Under the Eaves: Protecting User Privacy Against Commercial and LLM-powered Automatic Speech Recognition Systems	Weifei Jin et.al.	2504.00858	link
2025-03-31	SVLA: A Unified Speech-Vision-Language Assistant with Multimodal Reasoning and Speech Generation	Ngoc Dung Huynh et.al.	2503.24164	null
2025-04-02	TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection	Zhiming Ma et.al.	2503.24115	link
2025-03-30	The Impact of Code-switched Synthetic Data Quality is Task Dependent: Insights from MT and ASR	Injy Hamed et.al.	2503.23576	null
2025-03-30	Whisper-LM: Improving ASR Models with Language Models for Low-Resource Languages	Xabier de Zuazo et.al.	2503.23542	link
2025-03-30	Scaling Auditory Cognition via Test-Time Compute in Audio Language Models	Ting Dang et.al.	2503.23395	null
2025-04-25	Coverage-Guaranteed Speech Emotion Recognition via Calibrated Uncertainty-Adaptive Prediction Sets	Zijun Jia et.al.	2503.22712	null
2025-03-13	Enhancing Aviation Communication Transcription: Fine-Tuning Distil-Whisper with LoRA	Shokoufeh Mirzaei et.al.	2503.22692	null
2025-03-05	Qieemo: Speech Is All You Need in the Emotion Recognition in Conversations	Jinming Chen et.al.	2503.22687	null
2025-03-11	Lend a Hand: Semi Training-Free Cued Speech Recognition via MLLM-Driven Hand Modeling for Barrier-free Communication	Guanjie Huang et.al.	2503.21785	link
2025-03-27	VALLR: Visual ASR Language Model for Lip Reading	Marshall Thomas et.al.	2503.21408	null
2025-03-27	A 71.2-μW Speech Recognition Accelerator with Recurrent Spiking Neural Network	Chih-Chyau Yang et.al.	2503.21337	null
2025-03-26	Improving Speech Recognition Accuracy Using Custom Language Models with the Vosk Toolkit	Aniket Abhishek Soni et.al.	2503.21025	null
2025-03-26	FinAudio: A Benchmark for Audio Large Language Models in Financial Applications	Yupeng Cao et.al.	2503.20990	null
2025-03-26	Dolphin: A Large-Scale Automatic Speech Recognition Model for Eastern Languages	Yangyang Meng et.al.	2503.20212	link
2025-03-25	Contextual Metric Meta-Evaluation by Measuring Local Metric Accuracy	Athiya Deviyani et.al.	2503.19828	null
2025-03-25	Boosting the Transferability of Audio Adversarial Examples with Acoustic Representation Optimization	Weifei Jin et.al.	2503.19591	null
2025-03-25	Design of Seamless Multi-modal Interaction Framework for Intelligent Virtual Agents in Wearable Mixed Reality Environment	Ghazanfar Ali et.al.	2503.19334	null
2025-05-13	From S4 to Mamba: A Comprehensive Survey on Structured State Space Models	Shriyank Somvanshi et.al.	2503.18970	null
2025-03-28	Whispering in Amharic: Fine-tuning Whisper for Low-resource Language	Dawit Ketema Gete et.al.	2503.18485	null
2025-03-23	Elevating Robust Multi-Talker ASR by Decoupling Speaker Separation and Speech Recognition	Yufeng Yang et.al.	2503.17886	null
2025-03-21	Your voice is your voice: Supporting Self-expression through Speech Generation and LLMs in Augmented and Alternative Communication	Yiwen Xu et.al.	2503.17479	null
2025-03-20	SeniorTalk: A Chinese Conversation Dataset with Rich Annotations for Super-Aged Seniors	Yang Chen et.al.	2503.16578	null
2025-03-19	A Comprehensive Survey on Architectural Advances in Deep CNNs: Challenges, Applications, and Emerging Research Directions	Saddam Hussain Khan et.al.	2503.16546	null
2025-02-27	ACE, Action and Control via Explanations: A Proposal for LLMs to Provide Human-Centered Explainability for Multimodal AI Assistants	Elizabeth Anne Watkins et.al.	2503.16466	null
2025-03-19	Evaluating ASR Confidence Scores for Automated Error Detection in User-Assisted Correction Interfaces	Korbinian Kuhn et.al.	2503.15124	null
2025-03-19	Communication Access Real-Time Translation Through Collaborative Correction of Automatic Speech Recognition	Korbinian Kuhn et.al.	2503.15120	null
2025-03-07	A Causal Inference Approach for Quantifying Research Impact	Keiichi Ochiai et.al.	2503.13485	null
2025-04-19	Halving transcription time: A fast, user-friendly and GDPR-compliant workflow to create AI-assisted transcripts for content analysis	Jakob Sponholz et.al.	2503.13031	null
2025-03-04	CORDIC Is All You Need	Omkar Kokane et.al.	2503.11685	null
2025-03-14	MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens	Jeong Hun Yeo et.al.	2503.11315	link
2025-03-13	Whisper Speaker Identification: Leveraging Pre-Trained Multilingual Transformers for Robust Speaker Embeddings	Jakaria Islam Emon et.al.	2503.10446	link
2025-03-14	Proceedings of the ISCA/ITG Workshop on Diversity in Large Speech and Language Models	Sebastian Möller et.al.	2503.10298	null
2025-04-07	ValSub: Subsampling Validation Data to Mitigate Forgetting during ASR Personalization	Haaris Mehmood et.al.	2503.09906	null
2025-03-12	Quantization for OpenAI's Whisper Models: A Comparative Analysis	Allison Andreyev et.al.	2503.09905	link
2025-03-12	Everything Can Be Described in Words: A Simple Unified Multi-Modal Framework with Semantic and Temporal Alignment	Xiaowei Bi et.al.	2503.09081	null
2025-03-11	An Exhaustive Evaluation of TTS- and VC-based Data Augmentation for ASR	Sewade Ogun et.al.	2503.08954	null
2025-03-11	Prompt2LVideos: Exploring Prompts for Understanding Long-Form Multimodal Videos	Soumya Shamarao Jahagirdar et.al.	2503.08335	null
2025-03-10	Building English ASR model with regional language support	Purvi Agrawal et.al.	2503.07522	null
2025-03-30	Automatic Speech Recognition for Non-Native English: Accuracy and Disfluency Handling	Michael McGuire et.al.	2503.06924	null
2025-03-09	Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs	Umberto Cappellazzo et.al.	2503.06362	null
2025-03-08	Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations	Jeong Hun Yeo et.al.	2503.06273	link
2025-03-08	A Noise-Robust Turn-Taking System for Real-World Dialogue Robots: A Field Experiment	Koji Inoue et.al.	2503.06241	null
2025-03-06	From Voice to Safety: Language AI Powered Pilot-ATC Communication Understanding for Airport Surface Movement Collision Risk Assessment	Yutian Pang et.al.	2503.04974	null
2025-03-04	Normalization through Fine-tuning: Understanding Wav2vec 2.0 Embeddings for Phonetic Analysis	Yiming Wang et.al.	2503.04814	null
2025-03-03	Direct Speech to Speech Translation: A Review	Mohammad Sarim et.al.	2503.04799	null
2025-03-06	Self-Supervised Models for Phoneme Recognition: Applications in Children's Speech for Reading Learning	Lucas Block Medin et.al.	2503.04710	null
2025-03-07	Efficient Finetuning for Dimensional Speech Emotion Recognition in the Age of Transformers	Aneesha Sampath et.al.	2503.03756	null
2025-03-03	Fine-Tuning Whisper for Inclusive Prosodic Stress Analysis	Samuel S. Sohn et.al.	2503.02907	null
2025-03-04	Go Beyond Your Means: Unlearning with Per-Sample Gradient Orthogonalization	Aviv Shamsian et.al.	2503.02312	null
2025-03-05	Pruning Deep Neural Networks via a Combination of the Marchenko-Pastur Distribution and Regularization	Leonid Berlyand et.al.	2503.01922	null
2025-03-07	Nexus-O: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision	Che Liu et.al.	2503.01879	null
2025-03-02	Unveiling Biases while Embracing Sustainability: Assessing the Dual Challenges of Automatic Speech Recognition Systems	Ajinkya Kulkarni et.al.	2503.00907	null
2025-03-02	UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation	Alexander H. Liu et.al.	2503.00733	null
2025-02-27	LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation	Keisuke Kamahori et.al.	2502.20583	link
2025-02-27	Adapting Automatic Speech Recognition for Accented Air Traffic Control Communications	Marcus Yu Zhe Wee et.al.	2502.20311	null
2025-02-27	CleanMel: Mel-Spectrogram Enhancement for Improving Both Speech Quality and ASR	Nian Shao et.al.	2502.20040	null
2025-03-12	CS-Dialogue: A 104-Hour Dataset of Spontaneous Mandarin-English Code-Switching Dialogues for Speech Recognition	Jiaming Zhou et.al.	2502.18913	null
2025-02-26	Exploring Gender Disparities in Automatic Speech Recognition Technology	Hend ElGhazaly et.al.	2502.18434	null
2025-02-25	Silent Speech Sentence Recognition with Six-Axis Accelerometers using Conformer and CTC Algorithm	Yudong Xie et.al.	2502.17829	null
2025-02-26	Low-Rank and Sparse Model Merging for Multi-Lingual Speech Recognition and Translation	Qiuming Zhao et.al.	2502.17380	null
2025-02-25	Improving the Inclusivity of Dutch Speech Recognition by Fine-tuning Whisper on the JASMIN-CGN Corpus	Golshid Shekoufandeh et.al.	2502.17284	link
2025-02-24	Balancing Speech Understanding and Generation Using Continual Pre-training for Codec-based Speech LLM	Jiatong Shi et.al.	2502.16897	null
2025-02-22	Understanding Zero-shot Rare Word Recognition Improvements Through LLM Integration	Haoxuan Wang et.al.	2502.16142	null
2025-02-21	The Esethu Framework: Reimagining Sustainable Dataset Governance and Curation for Low-Resource Languages	Jenalea Rajab et.al.	2502.15916	null
2025-02-21	Retrieval-Augmented Speech Recognition Approach for Domain Challenges	Peng Shen et.al.	2502.15264	null
2025-02-21	Enhancing Speech Large Language Models with Prompt-Aware Mixture of Audio Encoders	Weiqiao Shan et.al.	2502.15178	null
2025-02-21	Improving Streaming Speech Recognition With Time-Shifted Contextual Attention And Dynamic Right Context Masking	Khanh Le et.al.	2502.15158	null
2025-02-20	WavRAG: Audio-Integrated Retrieval Augmented Generation for Spoken Dialogue Models	Yifu Chen et.al.	2502.14727	null
2025-02-20	SegAug: CTC-Aligned Segmented Augmentation For Robust RNN-Transducer Based Speech Recognition	Khanh Le et.al.	2502.14685	null
2025-02-20	Moshi Moshi? A Model Selection Hijacking Adversarial Attack	Riccardo Petrucci et.al.	2502.14586	null
2025-02-18	Gesture-Aware Zero-Shot Speech Recognition for Patients with Language Disorders	Seungbae Kim et.al.	2502.13983	null
2025-02-18	Benchmarking Automatic Speech Recognition coupled LLM Modules for Medical Diagnostics	Kabir Kumar et.al.	2502.13982	null
2025-02-19	Measuring the Effect of Transcription Noise on Downstream Language Understanding Tasks	Ori Shapira et.al.	2502.13645	link
2025-02-21	VLAS: Vision-Language-Action Model With Speech Instructions For Customized Robot Manipulation	Wei Zhao et.al.	2502.13508	link
2025-02-19	Adopting Whisper for Confidence Estimation	Vaibhav Aggarwal et.al.	2502.13446	null
2025-02-18	Neuro-oscillatory models of cortical speech processing	Olesia Dogonasheva et.al.	2502.12935	null
2025-02-18	Lost in Transcription, Found in Distribution Shift: Demystifying Hallucination in Speech Foundation Models	Hanin Atwany et.al.	2502.12414	null
2025-02-18	On the Robust Approximation of ASR Metrics	Abdul Waheed et.al.	2502.12408	null
2025-02-17	NaturalL2S: End-to-End High-quality Multispeaker Lip-to-Speech Synthesis with Differential Digital Signal Processing	Yifan Liang et.al.	2502.12002	null
2025-02-17	Can you pass that tool?: Implications of Indirect Speech in Physical Human-Robot Collaboration	Yan Zhang et.al.	2502.11720	null
2025-02-28	In Situ Optimization of an Optoelectronic Reservoir Computer with Digital Delayed Feedback	Fyodor Morozko et.al.	2502.11126	null
2025-04-03	DuplexMamba: Enhancing Real-time Speech Conversations with Duplex and Streaming Capabilities	Xiangyu Lu et.al.	2502.11123	link
2025-02-11	MoHAVE: Mixture of Hierarchical Audio-Visual Experts for Robust Speech Recognition	Sungnyun Kim et.al.	2502.10447	null
2025-02-14	OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models	William Chen et.al.	2502.10373	null
2025-02-14	MTLM: an Innovative Language Model Training Paradigm for ASR	Qingliang Meng et.al.	2502.10058	null
2025-02-14	A Preliminary Exploration with GPT-4o Voice Mode	Yu-Xiang Lin et.al.	2502.09940	null
2025-02-14	Microphone Array Geometry Independent Multi-Talker Distant ASR: NTT System for the DASR Task of the CHiME-8 Challenge	Naoyuki Kamo et.al.	2502.09859	null
2025-02-13	Shortcut Learning Susceptibility in Vision Classifiers	Pirzada Suhail et.al.	2502.09150	null
2025-02-13	Quantum Approaches for Dysphonia Assessment in Small Speech Datasets	Ha Tran et.al.	2502.08968	null
2025-02-12	Causal Analysis of ASR Errors for Children: Quantifying the Impact of Physiological, Cognitive, and Extrinsic Factors	Vishwanath Pratap Singh et.al.	2502.08587	null
2025-02-24	VINP: Variational Bayesian Inference with Neural Speech Prior for Joint ASR-Effective Speech Dereverberation and Blind RIR Identification	Pengyu Wang et.al.	2502.07205	link
2025-02-16	A Comparative Study of ASR Implementations in Resource-Constrained Wireless Sensor Networks for Real-Time Voice Communication	Inaam F. Qutaiba I. Ali et.al.	2502.06969	null
2025-02-19	Speech to Speech Translation with Translatotron: A State of the Art Review	Jules R. Kala et.al.	2502.05980	null
2025-02-09	Audio-Visual Representation Learning via Knowledge Distillation from Speech Foundation Models	Jing-Xuan Zhang et.al.	2502.05766	link
2025-02-07	Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance	Shehzeen Hussain et.al.	2502.05236	null
2025-02-06	Aligner-Encoders: Self-Attention Transformers Can Be Self-Transducers	Adam Stooke et.al.	2502.05232	null
2025-02-07	Evaluating Standard and Dialectal Frisian ASR: Multilingual Fine-tuning and Language Identification for Improved Low-resource Performance	Reihaneh Amooie et.al.	2502.04883	null
2025-02-07	Lightweight Operations for Visual Speech Recognition	Iason Ioannis Panagos et.al.	2502.04834	null
2025-02-06	Afrispeech-Dialog: A Benchmark Dataset for Spontaneous English Conversations in Healthcare and Beyond	Mardhiyah Sanni et.al.	2502.03945	null
2025-02-06	Rule-Based Modeling of Low-Dimensional Data with PCA and Binary Particle Swarm Optimization (BPSO) in ANFIS	Afnan Al-Ali et.al.	2502.03895	null
2025-02-05	Integrating automatic speech recognition into remote healthcare interpreting: A pilot study of its impact on interpreting quality	Shiyi Tan et.al.	2502.03381	null
2025-02-05	Leveraging Broadcast Media Subtitle Transcripts for Automatic Speech Recognition and Subtitling	Jakob Poncelet et.al.	2502.03212	link
2025-01-26	SEAL: Speech Embedding Alignment Learning for Speech Large Language Model with Retrieval-Augmented Generation	Chunyu Sun et.al.	2502.02603	null
2025-03-05	CTC-DRO: Robust Optimization for Reducing Language Disparities in Speech Recognition	Martijn Bartelds et.al.	2502.01777	null
2025-02-03	Adapter-Based Multi-Agent AVSR Extension for Pre-Trained ASR Models	Christopher Simic et.al.	2502.01709	null
2025-01-29	Privacy-Preserving Edge Speech Understanding with Tiny Foundation Models	Afsara Benazir et.al.	2502.01649	null
2025-02-03	A Differentiable Alignment Framework for Sequence-to-Sequence Modeling via Optimal Transport	Yacouba Kaloga et.al.	2502.01588	null
2025-02-11	mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition	Andrew Rouditchenko et.al.	2502.01547	link
2025-02-03	Gradient Norm-based Fine-Tuning for Backdoor Defense in Automatic Speech Recognition	Nanjun Zhou et.al.	2502.01152	null
2025-02-01	Data-Driven Mispronunciation Pattern Discovery for Robust Speech Recognition	Anna Seo Gyeong Choi et.al.	2502.00583	null
2025-02-17	Evaluation of End-to-End Continuous Spanish Lipreading in Different Data Conditions	David Gimeno-Gómez et.al.	2502.00464	link
2025-02-04	Sagalee: an Open Source Automatic Speech Recognition Dataset for Oromo Language	Turi Abu et.al.	2502.00421	link
2025-02-01	When End-to-End is Overkill: Rethinking Cascaded Speech-to-Text Translation	Anna Min et.al.	2502.00377	null
2025-02-03	SELMA: A Speech-Enabled Language Model for Virtual Assistant Interactions	Dominik Wagner et.al.	2501.19377	null
2025-01-31	Language Bias in Self-Supervised Learning For Automatic Speech Recognition	Edward Storey et.al.	2501.19321	null
2025-02-03	DyPCL: Dynamic Phoneme-level Contrastive Learning for Dysarthric Speech Recognition	Wonjun Lee et.al.	2501.19010	null
2025-01-29	Cross-lingual Embedding Clustering for Hierarchical Softmax in Low-Resource Multilingual Speech Recognition	Zhengdong Yang et.al.	2501.17615	null
2025-01-28	RDMM: Fine-Tuned LLM Models for On-Device Robotic Decision Making with Enhanced Contextual Awareness in Specific Domains	Shady Nasrat et.al.	2501.16899	link
2025-01-28	AVE Speech Dataset: A Comprehensive Benchmark for Multi-Modal Speech Recognition Integrating Audio, Visual, and Electromyographic Signals	Dongliang Zhou et.al.	2501.16780	null
2025-01-28	SCDiar: a streaming diarization system based on speaker change detection and speech recognition	Naijun Zheng et.al.	2501.16641	null
2025-01-27	Optimized Self-supervised Training with BEST-RQ for Speech Recognition	Ilja Baumann et.al.	2501.16131	null
2025-01-27	Classification Error Bound for Low Bayes Error Conditions in Machine Learning	Zijian Yang et.al.	2501.15977	null
2025-01-26	End-to-End Target Speaker Speech Recognition Using Context-Aware Attention Mechanisms for Challenging Enrollment Scenario	Mohsen Ghane et.al.	2501.15466	null
2025-01-25	The Multicultural Medical Assistant: Can LLMs Improve Medical ASR Errors Across Borders?	Ayo Adedeji et.al.	2501.15310	null
2025-01-25	Speech Translation Refinement using Large Language Models	Huaixia Dou et.al.	2501.15090	link
2025-01-25	Robust Cross-Etiology and Speaker-Independent Dysarthric Speech Recognition	Satwinder Singh et.al.	2501.14994	null
2025-02-07	Methods to Increase the Amount of Data for Speech Recognition for Low Resource Languages	Alexan Ayrapetyan et.al.	2501.14788	null
2025-01-24	FireRedASR: Open-Source Industrial-Grade Mandarin Speech Recognition Models from Encoder-Decoder to LLM Integration	Kai-Tuo Xu et.al.	2501.14350	link
2025-01-24	LoCoML: A Framework for Real-World ML Inference Pipelines	Kritin Maddireddy et.al.	2501.14165	null
2025-01-23	Integrating Persian Lip Reading in Surena-V Humanoid Robot for Human-Robot Interaction	Ali Farshian Abbasi et.al.	2501.13996	null
2025-01-18	Fanar: An Arabic-Centric Multimodal Generative AI Platform	Fanar Team et.al.	2501.13944	null
2025-01-23	Predicting Compact Phrasal Rewrites with Large Language Models for ASR Post Editing	Hao Zhang et.al.	2501.13831	null
2025-01-23	Learning-based A Posteriori Speech Presence Probability Estimation and Applications	Shuai Tao et.al.	2501.13642	null
2025-01-23	DQ-Data2vec: Decoupling Quantization for Multilingual Speech Recognition	Qijie Shao et.al.	2501.13497	null
2025-02-16	OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia	Xuelong Geng et.al.	2501.13306	link
2025-01-22	Let SSMs be ConvNets: State-space Modeling with Optimal Tensor Contractions	Yan Ru Pei et.al.	2501.13230	null
2025-01-22	FlanEC: Exploring Flan-T5 for Post-ASR Error Correction	Moreno La Quatra et.al.	2501.12979	link
2025-01-21	A Domain Adaptation Framework for Speech Recognition Systems with Only Synthetic data	Minh Tran et.al.	2501.12501	null
2025-01-21	DOTA-ME-CS: Daily Oriented Text Audio-Mandarin English-Code Switching Dataset	Yupei Li et.al.	2501.12122	null
2025-01-20	Investigation of Whisper ASR Hallucinations Induced by Non-Speech Audio	Mateusz Barański et.al.	2501.11378	null
2025-01-19	Enhancing Neural Spoken Language Recognition: An Exploration with Multilingual Datasets	Or Haim Anidjar et.al.	2501.11065	null
2025-01-18	A Benchmark of French ASR Systems Based on Error Severity	Antoine Tholly et.al.	2501.10879	null
2025-01-18	GEC-RAG: Improving Generative Error Correction via Retrieval-Augmented Generation for Automatic Speech Recognition Systems	Amin Robatian et.al.	2501.10734	null
2025-01-17	Unsupervised Rhythm and Voice Conversion of Dysarthric to Healthy Speech for ASR	Karl El Hajal et.al.	2501.10256	null
2025-01-17	Automatic Speech Recognition for Sanskrit with Transfer Learning	Bidit Sadhukhan et.al.	2501.10024	null
2025-01-21	PIER: A Novel Metric for Evaluating What Matters in Code-Switching	Enes Yavuz Ugan et.al.	2501.09512	null
2025-01-16	Teaching Wav2Vec2 the Language of the Brain	Tobias Fiedler et.al.	2501.09459	link
2025-01-16	Delayed Fusion: Integrating Large Language Models into First-Pass Decoding in End-to-end Speech Recognition	Takaaki Hori et.al.	2501.09258	null
2025-01-17	persoDA: Personalized Data Augmentation for Personalized ASR	Pablo Peso Parada et.al.	2501.09113	null
2025-01-20	A Non-autoregressive Model for Joint STT and TTS	Vishal Sunder et.al.	2501.09104	null
2025-01-13	Discrimination loss vs. SRT: A model-based approach towards harmonizing speech test interpretations	Mareike Buhl et.al.	2501.08921	null
2025-01-15	Adapting Whisper for Regional Dialects: Enhancing Public Services for Vulnerable Populations in the United Kingdom	Melissa Torgbi et.al.	2501.08502	null
2025-01-14	Selective Attention Merging for low resource tasks: A case study of Child ASR	Natarajan Balaji Shankar et.al.	2501.08468	link
2025-01-14	Loudspeaker Beamforming to Enhance Speech Recognition Performance of Voice Driven Applications	Dimme de Groot et.al.	2501.08104	null
2025-01-17	Joint Automatic Speech Recognition And Structure Learning For Better Speech Understanding	Jiliang Hu et.al.	2501.07329	link
2025-01-13	Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model	Ziyang Ma et.al.	2501.07246	null
2025-01-13	AdaCS: Adaptive Normalization for Enhanced Code-Switching ASR	The Chuong Chu et.al.	2501.07102	link
2025-01-11	Discrete Speech Unit Extraction via Independent Component Analysis	Tomohiko Nakamura et.al.	2501.06562	link
2025-01-11	A Survey on Spoken Italian Datasets and Corpora	Marco Giordano et.al.	2501.06557	null
2025-01-11	Speech Recognition for Automatically Assessing Afrikaans and isiXhosa Preschool Oral Narratives	Christiaan Jacobs et.al.	2501.06478	null
2025-01-10	TTS-Transducer: End-to-End Speech Synthesis with Neural Transducer	Vladimir Bataev et.al.	2501.06320	null
2025-01-10	Contextual ASR Error Handling with LLMs Augmentation for Goal-Oriented Conversational AI	Yuya Asano et.al.	2501.06129	null
2025-02-19	Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding	Fabian David Schmidt et.al.	2501.06117	link
2025-01-10	Benchmarking Rotary Position Embeddings for Automatic Speech Recognition	Shucong Zhang et.al.	2501.06051	null
2025-01-19	Comparing Self-Supervised Learning Models Pre-Trained on Human Speech and Animal Vocalizations for Bioacoustics Processing	Eklavya Sarkar et.al.	2501.05987	link
2025-01-10	Universal-2-TF: Robust All-Neural Text Formatting for ASR	Yash Khare et.al.	2501.05948	null
2025-01-09	Right Label Context in End-to-End Training of Time-Synchronous ASR Models	Tina Raissi et.al.	2501.04521	null
2025-01-08	Phone-purity Guided Discrete Tokens for Dysarthric Speech Recognition	Huimeng Wang et.al.	2501.04379	null
2025-01-08	LipGen: Viseme-Guided Lip Video Generation for Enhancing Visual Speech Recognition	Bowen Hao et.al.	2501.04204	null
2025-01-03	Listening and Seeing Again: Generative Error Correction for Audio-Visual Speech Recognition	Rui Liu et.al.	2501.04038	link
2025-01-07	Universal Speaker Embedding Free Target Speaker Extraction and Personal Voice Activity Detection	Bang Zeng et.al.	2501.03612	null
2025-01-14	Towards a Generalizable Speech Marker for Parkinson's Disease Diagnosis	Maksim Siniukov et.al.	2501.03581	null
2025-01-07	Deep Learning for Pathological Speech: A Survey	Shakeel A. Sheikh et.al.	2501.03536	null
2025-01-01	Breaking Through the Spike: Spike Window Decoding for Accelerated and Precise Automatic Speech Recognition	Wei Zhang et.al.	2501.03257	null
2025-01-08	Samba-ASR: State-Of-The-Art Speech Recognition Leveraging Structured State-Space Models	Syed Abdul Gaffar Shakhadri et.al.	2501.02832	null
2025-01-05	Reducing the Gap Between Pretrained Speech Enhancement and Recognition Models Using a Real Speech-Trained Bridging Module	Zhongjian Cui et.al.	2501.02452	null
2025-01-03	Improving Transducer-Based Spoken Language Understanding with Self-Conditioned CTC and Knowledge Transfer	Vishal Sunder et.al.	2501.01936	null
2025-01-11	Advancing Singlish Understanding: Bridging the Gap with Datasets and Multimodal Models	Bin Wang et.al.	2501.01034	link
2025-01-01	Incremental Dialogue Management: Survey, Discussion, and Implications for HRI	Casey Kennington et.al.	2501.00953	null
2025-01-01	Large Language Models Are Read/Write Policy-Makers for Simultaneous Generation	Shoutao Guo et.al.	2501.00868	link
2025-01-01	Automatic Text Pronunciation Correlation Generation and Application for Contextual Biasing	Gaofeng Cheng et.al.	2501.00804	null
2024-12-31	Fotheidil: an Automatic Transcription System for the Irish Language	Liam Lonergan et.al.	2501.00509	null
2024-12-31	Whisper Turns Stronger: Augmenting Wav2Vec 2.0 for Superior ASR in Low-Resource Languages	Or Haim Anidjar et.al.	2501.00425	null
2025-01-06	Takeaways from Applying LLM Capabilities to Multiple Conversational Avatars in a VR Pilot Study	Mykola Maslych et.al.	2501.00168	null
2024-12-30	DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition	Alexander Polok et.al.	2501.00114	link
2024-12-25	Speech Recognition With LLMs Adapted to Disordered Speech Using Reinforcement Learning	Chirag Nagpal et.al.	2501.00039	null
2024-12-27	Enhancing Whisper's Accuracy and Speed for Indian Languages through Prompt-Tuning and Tokenization	Kumud Tripathi et.al.	2412.19785	null
2024-12-26	Towards a Single ASR Model That Generalizes to Disordered Speech	Jimmy Tobin et.al.	2412.19315	null
2024-12-26	Enhancing Audiovisual Speech Recognition through Bifocal Preference Optimization	Yihan Wu et.al.	2412.19005	link
2024-12-25	Structured Speaker-Deficiency Adaptation of Foundation Models for Dysarthric and Elderly Speech Recognition	Shujie Hu et.al.	2412.18832	null
2024-12-30	Zero-resource Speech Translation and Recognition with LLMs	Karel Mundnich et.al.	2412.18566	null
2025-01-09	Trading Devil RL: Backdoor attack via Stock market, Bayesian Optimization and Reinforcement Learning	Orson Mengara et.al.	2412.17908	null
2024-12-09	Ensemble Machine Learning Model for Inner Speech Recognition: A Subject-Specific Investigation	Shahamat Mustavi Tasin et.al.	2412.17824	null
2024-12-23	Investigating Prosodic Signatures via Speech Pre-Trained Models for Audio Deepfake Source Attribution	Orchid Chetia Phukan et.al.	2412.17796	null
2024-12-23	UME: Upcycling Mixture-of-Experts for Scalable and Efficient Automatic Speech Recognition	Li Fu et.al.	2412.17507	null
2024-12-23	Deep Learning in Proteomics Informatics: Applications, Challenges, and Future Directions	Yindan Luo et.al.	2412.17349	null
2025-01-17	Uncovering the Visual Contribution in Audio-Visual Speech Recognition	Zhaofeng Lin et.al.	2412.17129	null
2025-01-05	Adapting Whisper for Code-Switching through Encoding Refining and Language-Aware Decoding	Jiahui Zhao et.al.	2412.16507	null
2025-01-03	Speech Retrieval-Augmented Generation without Automatic Speech Recognition	Do June Min et.al.	2412.16500	null
2024-12-21	Enhancing Multilingual ASR for Unseen Languages via Language Embedding Modeling	Shao-Syuan Huang et.al.	2412.16474	null
2024-12-21	Transducer-Llama: Integrating LLMs into Streamable Transducer-based Speech Recognition	Keqi Deng et.al.	2412.16464	null
2025-01-19	MathSpeech: Leveraging Small LMs for Accurate Conversion in Mathematical Speech-to-Formula	Sieun Hyeon et.al.	2412.15655	link
2024-12-20	TouchASP: Elastic Automatic Speech Perception that Everyone Can Touch	Xingchen Song et.al.	2412.15622	null
2024-12-19	Transcribing and Translating, Fast and Slow: Joint Speech Translation and Recognition	Niko Moritz et.al.	2412.15415	null
2024-12-23	LAMA-UT: Language Agnostic Multilingual ASR through Orthography Unification and Language-Specific Transliteration	Sangmin Lee et.al.	2412.15299	null
2025-01-09	CAMEL: Cross-Attention Enhanced Mixture-of-Experts and Language Bias for Code-Switching Speech Recognition	He Wang et.al.	2412.12760	null
2024-12-24	Streaming Keyword Spotting Boosted by Cross-layer Discrimination Consistency	Yu Xi et.al.	2412.12635	null
2024-12-11	Greek2MathTex: A Greek Speech-to-Text Framework for LaTeX Equations Generation	Evangelia Gkritzali et.al.	2412.12167	null
2024-12-09	Harnessing Transfer Learning from Swahili: Advancing Solutions for Comorian Dialects	Naira Abdou Mohamed et.al.	2412.12143	null
2024-12-17	Speak & Improve Corpus 2025: an L2 English Speech Corpus for Language Assessment and Feedback	Kate Knill et.al.	2412.11986	null
2024-12-17	Speak & Improve Challenge 2025: Tasks and Baseline Systems	Mengjie Qian et.al.	2412.11985	null
2024-12-20	MERaLiON-SpeechEncoder: Towards a Speech Foundation Model for Singapore and Beyond	Muhammad Huzaifah et.al.	2412.11538	null
2024-12-15	Transliterated Zero-Shot Domain Adaptation for Automatic Speech Recognition	Han Zhu et.al.	2412.11185	null
2024-12-14	Robust Recognition of Persian Isolated Digits in Speech using Deep Neural Network	Ali Nasr-Esfahani et.al.	2412.10857	null
2024-12-14	Efficient Adaptation of Multilingual Models for Japanese ASR	Mark Bajo et.al.	2412.10705	link
2025-01-16	MERaLiON-AudioLLM: Bridging Audio and Language with Large Language Models	Yingxu He et.al.	2412.09818	null
2024-11-26	Enhancing Code-Switching ASR Leveraging Non-Peaky CTC Loss and Deep Language Posterior Injection	Tzu-Ting Yang et.al.	2412.08651	null
2024-12-11	Bilevel Joint Unsupervised and Supervised Training for Automatic Speech Recognition	Xiaodong Cui et.al.	2412.08548	null
2024-12-10	Style-agnostic evaluation of ASR using multiple reference transcripts	Quinten McNamara et.al.	2412.07937	null
2024-12-09	Effective Text Adaptation for LLM-based ASR through Soft Prompt Fine-Tuning	Yingyi Ma et.al.	2412.06967	null
2024-12-09	Not All Errors Are Equal: Investigation of Speech Recognition Errors in Alzheimer's Disease Detection	Jiawen Kang et.al.	2412.06332	null
2024-12-09	Leveraging Prompt Learning and Pause Encoding for Alzheimer's Disease Detection	Yin-Long Liu et.al.	2412.06259	null
2024-12-07	SQ-Whisper: Speaker-Querying based Whisper Model for Target-Speaker ASR	Pengcheng Guo et.al.	2412.05589	link
2024-12-06	Adaptive Dropout for Pruning Conformers	Yotaro Kubo et.al.	2412.04836	null
2024-12-05	Comprehensive Audio Query Handling System with Integrated Expert Models and Contextual Understanding	Vakada Naveen et.al.	2412.03980	null
2024-12-05	Speech Recognition-based Feature Extraction for Enhanced Automatic Severity Classification in Dysarthric Speech	Yerin Choi et.al.	2412.03784	null
2024-12-04	ASR-EC Benchmark: Evaluating Large Language Models on Chinese ASR Error Correction	Victor Junqiu Wei et.al.	2412.03075	null
2024-12-03	GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot	Aohan Zeng et.al.	2412.02612	link
2024-12-01	Late fusion ensembles for speech recognition on diverse input audio representations	Marin Jezidžić et.al.	2412.01861	null
2024-12-01	Automating Feedback Analysis in Surgical Training: Detection, Categorization, and Assessment	Firdavs Nasriddinov et.al.	2412.00760	link
2024-12-04	A Comparative Study of LLM-based ASR and Whisper in Low Resource and Code Switching Scenario	Zheshu Song et.al.	2412.00721	null
2024-11-30	Sample adaptive data augmentation with progressive scheduling	Hongxuan Lu et.al.	2412.00415	null
2024-11-30	Empowering the Deaf and Hard of Hearing Community: Enhancing Video Captions Using Large Language Models	Nadeen Fathallah et.al.	2412.00342	null
2024-11-24	High-precision medical speech recognition through synthetic data and semantic correction: UNITED-MEDASR	Sourav Banerjee et.al.	2412.00055	null
2024-11-29	Memristive Nanowire Network for Energy Efficient Audio Classification: Pre-Processing-Free Reservoir Computing with Reduced Latency	Akshaya Rajesh et.al.	2411.19611	null
2024-11-28	ArEEG_Words: Dataset for Envisioned Speech Recognition using EEG for Arabic Words	Hazem Darwish et.al.	2411.18888	null
2024-11-20	Towards Advanced Speech Signal Processing: A Statistical Perspective on Convolution-Based Architectures and its Applications	Nirmal Joshua Kapu et.al.	2411.18636	null
2024-11-27	EEG-Based Analysis of Brain Responses in Multi-Modal Human-Robot Interaction: Modulating Engagement	Suzanne Oliver et.al.	2411.18587	null
2024-11-27	AMPS: ASR with Multimodal Paraphrase Supervision	Amruta Parulekar et.al.	2411.18368	null
2024-11-27	Continual Learning in Machine Speech Chain Using Gradient Episodic Memory	Geoffrey Tyndall et.al.	2411.18320	null
2024-11-27	Aligning Pre-trained Models for Spoken Language Translation	Šimon Sedláček et.al.	2411.18294	null
2024-11-27	Efficient Nonlinear Function Approximation in Analog Resistive Crossbars for Recurrent Neural Networks	Junyi Yang et.al.	2411.18271	null
2025-01-05	How to Learn a New Language? An Efficient Solution for Self-Supervised Learning Models Unseen Languages Adaption in Low-Resource Scenario	Shih-Heng Wang et.al.	2411.18217	null
2025-01-15	MSA-ASR: Efficient Multilingual Speaker Attribution with frozen ASR Models	Thai-Binh Nguyen et.al.	2411.18152	null
2024-11-27	SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation	Wenyi Yu et.al.	2411.18138	null
2024-11-27	Fusion of Discrete Representations and Self-Augmented Representations for Multilingual Automatic Speech Recognition	Shih-heng Wang et.al.	2411.18107	null
2024-11-26	Disentangled-Transformer: An Explainable End-to-End Automatic Speech Recognition Model with Speech Content-Context Separation	Pu Wang et.al.	2411.17846	null
2024-12-02	Scaling Speech-Text Pre-training with Synthetic Interleaved Data	Aohan Zeng et.al.	2411.17607	null
2024-11-26	Towards Maximum Likelihood Training for Transducer-based Streaming Speech Recognition	Hyeonseung Lee et.al.	2411.17537	null
2024-11-26	Comparative Analysis of ASR Methods for Speech Deepfake Detection	Davide Salvi et.al.	2411.17349	null
2024-11-26	k2SSL: A Faster and Better Framework for Self-Supervised Speech Representation Learning	Yifan Yang et.al.	2411.17100	link
2024-11-22	TSkips: Efficiency Through Explicit Temporal Delay Connections in Spiking Neural Networks	Prajna G. Malettira et.al.	2411.16711	null
2024-11-22	Transforming NLU with Babylon: A Case Study in Development of Real-time, Edge-Efficient, Multi-Intent Translation System for Automated Drive-Thru Ordering	Mostafa Varzaneh et.al.	2411.15372	null
2024-11-20	From Statistical Methods to Pre-Trained Models; A Survey on Automatic Speech Recognition for Resource Scarce Urdu Language	Muhammad Sharif et.al.	2411.14493	null
2024-11-26	Tiny-Align: Bridging Automatic Speech Recognition and Large Language Model on the Edge	Ruiyang Qin et.al.	2411.13766	null
2024-11-18	A Novel Speech Analysis and Correction Tool for Arabic-Speaking Children	Lamia Berriche et.al.	2411.13592	null
2024-11-26	WavChat: A Survey of Spoken Dialogue Models	Shengpeng Ji et.al.	2411.13577	link
2024-11-20	CAFE A Novel Code switching Dataset for Algerian Dialect French and English	Houssam Eddine-Othman Lachemat et.al.	2411.13424	null
2024-11-20	Hard-Synth: Synthesizing Diverse Hard Samples for ASR using Zero-Shot TTS and LLM	Jiawei Yu et.al.	2411.13159	null
2024-11-19	Whisper Finetuning on Nepali Language	Sanjay Rijal et.al.	2411.12587	null
2024-11-27	Inter-linguistic Phonetic Composition (IPC): A Theoretical and Computational Approach to Enhance Second Language Pronunciation	Jisang Park et.al.	2411.10927	null
2024-11-16	BanglaDialecto: An End-to-End AI-Powered Regional Speech Standardization	Md. Nazmus Sadat Samin et.al.	2411.10879	link
2024-12-08	Interactive Cycle Model -- The Linkage Combination among Automatic Speech Recognition, Large Language Models and Smart Glasses	Libo Wang et.al.	2411.10362	link
2024-11-15	Systolic Arrays and Structured Pruning Co-design for Efficient Transformers in Edge Systems	Pedro Palacios et.al.	2411.10285	null
2024-11-15	DiMoDif: Discourse Modality-information Differentiation for Audio-visual Deepfake Detection and Localization	Christos Koutlis et.al.	2411.10193	null
2024-11-15	XLSR-Mamba: A Dual-Column Bidirectional State Space Model for Spoofing Attack Detection	Yang Xiao et.al.	2411.10027	null
2024-11-14	Everyone deserves their voice to be heard: Analyzing Predictive Gender Bias in ASR Models Applied to Dutch Speech Data	Rik Raes et.al.	2411.09431	null
2024-11-14	Transferable Adversarial Attacks against ASR	Xiaoxue Gao et.al.	2411.09220	null
2024-10-28	Multilingual Standalone Trustworthy Voice-Based Social Network for Disaster Situations	Majid Behravan et.al.	2411.08889	null
2024-11-11	Mamba-based Decoder-Only Approach with Bidirectional Speech Modeling for Speech Recognition	Yoshiki Masuyama et.al.	2411.06968	link
2024-12-28	DCF-DS: Deep Cascade Fusion of Diarization and Separation for Speech Recognition under Realistic Single-Channel Conditions	Shu-Tong Niu et.al.	2411.06667	null
2024-11-10	CTC-Assisted LLM-Based Contextual ASR	Guanrou Yang et.al.	2411.06437	link
2024-12-04	Dialectal Coverage And Generalization in Arabic Speech Recognition	Amirbek Djanibekov et.al.	2411.05872	link
2024-11-07	Sentiment Analysis of Spanish Political Party Tweets Using Pre-trained Language Models	Chuqiao Song et.al.	2411.04862	null
2024-11-07	Multistage Fine-tuning Strategies for Automatic Speech Recognition in Low-resource Languages	Leena G Pillai et.al.	2411.04573	null
2024-11-04	Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs	Alexandros Haliassos et.al.	2411.02256	link
2024-11-03	SPES: Spectrogram Perturbation for Explainable Speech-to-Text Generation	Dennis Fucci et.al.	2411.01710	null
2024-11-08	Enhancing AAC Software for Dysarthric Speakers in e-Health Settings: An Evaluation Using TORGO	Macarious Hui et.al.	2411.00980	null
2024-11-04	Optimizing Contextual Speech Recognition Using Vector Quantization for Efficient Retrieval	Nikolaos Flemotomos et.al.	2411.00664	null
2024-10-31	IO Transformer: Evaluating SwinV2-Based Reward Models for Computer Vision	Maxwell Meyer et.al.	2411.00252	null
2024-10-31	Speech is More Than Words: Do Speech-to-Text Translation Systems Leverage Prosody?	Ioannis Tsiamas et.al.	2410.24019	null
2024-10-30	Augmenting Polish Automatic Speech Recognition System With Synthetic Data	Łukasz Bondaruk et.al.	2410.22903	null
2024-10-30	Run-Time Adaptation of Neural Beamforming for Robust Speech Dereverberation and Denoising	Yoto Fujita et.al.	2410.22805	null
2024-10-29	Joint Beamforming and Speaker-Attributed ASR for Real Distant-Microphone Meeting Transcription	Can Cui et.al.	2410.21849	null
2024-10-28	Asynchronous Tool Usage for Real-Time Agents	Antonio A. Ginart et.al.	2410.21620	null
2024-10-27	Using Confidence Scores to Improve Eyes-free Detection of Speech Recognition Errors	Sadia Nowrin et.al.	2410.20564	null
2024-10-27	Improving Speech-based Emotion Recognition with Contextual Utterance Analysis and LLMs	Enshi Zhang et.al.	2410.20334	null
2024-11-04	emg2qwerty: A Large Dataset with Baselines for Touch Typing using Surface Electromyography	Viswanath Sivakumar et.al.	2410.20081	link
2024-10-25	A Survey on Speech Large Language Models	Jing Peng et.al.	2410.18908	null
2024-10-24	We Augmented Whisper With kNN and You Won't Believe What Came Next	Maya K. Nachesa et.al.	2410.18850	null
2024-10-24	STTATTS: Unified Speech-To-Text And Text-To-Speech Model	Hawau Olamide Toyin et.al.	2410.18607	link
2024-10-24	Evaluating and Improving Automatic Speech Recognition Systems for Korean Meteorological Experts	ChaeHun Park et.al.	2410.18444	null
2024-10-24	Contextual Biasing to Improve Domain-specific Custom Vocabulary Audio Transcription without Explicit Fine-Tuning of Whisper Model	Vishakha Lall et.al.	2410.18363	null
2024-10-23	ELAICHI: Enhancing Low-resource TTS by Addressing Infrequent and Low-frequency Character Bigrams	Srija Anand et.al.	2410.17901	null
2024-10-23	VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning	Yifan Peng et.al.	2410.17485	null
2024-10-22	mmWave-Whisper: Phone Call Eavesdropping and Transcription Using Millimeter-Wave Radar	Suryoday Basak et.al.	2410.17457	null
2024-10-22	Improving Automatic Speech Recognition with Decoder-Centric Regularisation in Encoder-Decoder Models	Alexander Polok et.al.	2410.17437	null
2024-12-11	VoiceBench: Benchmarking LLM-Based Voice Assistants	Yiming Chen et.al.	2410.17196	link
2024-10-22	Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap	Guanrou Yang et.al.	2410.16726	null
2024-10-22	DENOASR: Debiasing ASRs through Selective Denoising	Anand Kumar Rai et.al.	2410.16712	null
2024-10-21	AlignVSR: Audio-Visual Cross-Modal Alignment for Visual Speech Recognition	Zehua Liu et.al.	2410.16438	link
2024-10-19	End-to-End Transformer-based Automatic Speech Recognition for Northern Kurdish: A Pioneering Approach	Abdulhady Abas Abdullah et.al.	2410.16330	null
2024-10-21	Acoustic Model Optimization over Multiple Data Sources: Merging and Valuation	Victor Junqiu Wei et.al.	2410.15620	null
2024-10-21	Interventional Speech Noise Injection for ASR Generalizable Spoken Language Understanding	Yeonjoon Jung et.al.	2410.15609	null
2024-10-22	Moonshine: Speech Recognition for Live Transcription and Voice Commands	Nat Jeffries et.al.	2410.15608	link
2024-10-20	Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant	Alan Dao et.al.	2410.15316	link
2024-10-19	Enhancing Multimodal Sentiment Analysis for Missing Modality through Self-Distillation and Unified Modality Cross-Attention	Yuzhe Weng et.al.	2410.15029	link
2024-10-18	AC-Mix: Self-Supervised Adaptation for Low-Resource Automatic Speech Recognition using Agnostic Contrastive Mixup	Carlos Carvalho et.al.	2410.14910	null
2024-10-09	A two-stage transliteration approach to improve performance of a multilingual ASR	Rohit Kumar et.al.	2410.14709	null
2024-10-17	Parameter-efficient Adaptation of Multilingual Multimodal Models for Low-resource ASR	Abhishek Gupta et.al.	2410.13445	null
2024-10-17	Computational Approaches to Arabic-English Code-Switching	Caroline Sabty et.al.	2410.13318	null
2024-10-17	Roadmap towards Superhuman Speech Understanding using Large Language Models	Fan Bu et.al.	2410.13268	null
2024-10-17	Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation	Sreyan Ghosh et.al.	2410.13198	null
2024-10-17	EH-MAM: Easy-to-Hard Masked Acoustic Modeling for Self-Supervised Speech Representation Learning	Ashish Seth et.al.	2410.13179	link
2024-10-17	Deep Learning-based Software Engineering: Progress, Challenges, and Opportunities	Xiangping Chen et.al.	2410.13110	null
2024-10-07	Automatic Screening for Children with Speech Disorder using Automatic Speech Recognition: Opportunities and Challenges	Dancheng Liu et.al.	2410.11865	null
2024-10-15	A Framework for Adapting Human-Robot Interaction to Diverse User Groups	Theresa Pekarek Rosin et.al.	2410.11377	link
2024-10-15	Investigation of Speaker Representation for Target-Speaker Speech Processing	Takanori Ashihara et.al.	2410.11243	null
2024-10-14	Character-aware audio-visual subtitling in context	Jaesung Huh et.al.	2410.11068	null
2024-10-14	In-Materia Speech Recognition	Mohamadreza Zolfagharinejad et.al.	2410.10434	null
2024-10-13	State of NLP in Kenya: A Survey	Cynthia Jayne Amol et.al.	2410.09948	null
2024-10-12	SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs	Wenxi Chen et.al.	2410.09503	link
2024-10-12	Automatic Speech Recognition with BERT and CTC Transformers: A Review	Noussaiba Djeffal et.al.	2410.09456	null
2024-10-11	UniGlyph: A Seven-Segment Script for Universal Language Representation	G. V. Bency Sherin et.al.	2410.08974	null
2024-10-14	Enhancing Indonesian Automatic Speech Recognition: Evaluating Multilingual Models with Diverse Speech Variabilities	Aulia Adila et.al.	2410.08828	null
2024-10-10	Full-Rank No More: Low-Rank Weight Training for Modern Speech Recognition Models	Adriana Fernandez-Lopez et.al.	2410.07771	null
2024-10-18	Advocating Character Error Rate for Multilingual ASR Evaluation	Thennal D K et.al.	2410.07400	null
2024-10-08	The USTC-NERCSLIP Systems for the CHiME-8 MMCSG Challenge	Ya Jiang et.al.	2410.05986	null
2024-10-07	Incorporating Talker Identity Aids With Improving Speech Recognition in Adversarial Environments	Sagarika Alavilli et.al.	2410.05423	null
2024-10-05	The OCON model: an old but gold solution for distributable supervised classification	Stefano Giacomelli et.al.	2410.05320	link
2024-10-07	Enhancing Job Interview Preparation Through Immersive Experiences Using Photorealistic, AI-powered Metahuman Avatars	Navid Ashrafi et.al.	2410.05131	null
2024-10-13	CR-CTC: Consistency regularization on CTC for improved speech recognition	Zengwei Yao et.al.	2410.05101	link
2024-10-06	Punctuation Prediction for Polish Texts using Transformers	Jakub Pokrywka et.al.	2410.04621	null
2024-10-06	Casablanca: Data and Models for Multidialectal Arabic Speech Recognition	Bashar Talafha et.al.	2410.04527	null
2024-10-05	Efficient and Robust Long-Form Speech Recognition with Hybrid H3-Conformer	Tomoki Honda et.al.	2410.04159	link
2024-10-05	The OCON model: an old but green solution for distributable supervised classification for acoustic monitoring in smart cities	Stefano Giacomelli et.al.	2410.04098	null
2024-10-05	Enhancement of Dysarthric Speech Reconstruction by Contrastive Learning	Keshvari Fatemeh et.al.	2410.04092	null
2024-10-04	Reverb: Open-Source ASR and Diarization from Rev	Nishchal Bhandari et.al.	2410.03930	null
2024-10-13	Self-Powered LLM Modality Expansion for Large Speech-Text Models	Tengfei Yu et.al.	2410.03798	link
2024-10-02	SeeSay: An Assistive Device for the Visually Impaired Using Retrieval Augmented Generation	Melody Yu et.al.	2410.03771	null
2024-10-02	Efficient Streaming LLM for Speech Recognition	Junteng Jia et.al.	2410.03752	null
2024-10-01	Recent Advances in Speech Language Models: A Survey	Wenqian Cui et.al.	2410.03751	null
2024-10-04	Multi-Dialect Vietnamese: Task, Dataset, Baseline Models and Challenges	Nguyen Van Dinh et.al.	2410.03458	link
2024-10-04	Team MTS @ AutoMin 2021: An Overview of Existing Summarization Approaches and Comparison to Unsupervised Summarization Techniques	Olga Iakovenko et.al.	2410.03412	null
2024-10-03	Three-in-One: Fast and Accurate Transducer for Hybrid-Autoregressive ASR	Hainan Xu et.al.	2410.02597	null
2024-10-04	Convolutional Variational Autoencoders for Spectrogram Compression in Automatic Speech Recognition	Olga Iakovenko et.al.	2410.02560	null
2024-10-03	Algorithms For Automatic Accentuation And Transcription Of Russian Texts In Speech Recognition Systems	Olga Iakovenko et.al.	2410.02538	null
2024-10-03	A Pilot Study of Applying Sequence-to-Sequence Voice Conversion to Evaluate the Intelligibility of L2 Speech Using a Native Speaker's Shadowings	Haopeng Geng et.al.	2410.02239	null
2024-09-27	A GEN AI Framework for Medical Note Generation	Hui Yi Leong et.al.	2410.01841	null
2024-10-02	Spoken Grammar Assessment Using LLM	Sunil Kumar Kopparapu et.al.	2410.01579	null
2024-10-01	MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages	Marco Gaido et.al.	2410.01036	link
2024-10-01	Automatic Speech Recognition for the Ika Language	Uchenna Nzenwata et.al.	2410.00940	null
2024-10-04	VHASR: A Multimodal Speech Recognition System With Vision Hotwords	Jiliang Hu et.al.	2410.00822	link
2024-10-01	End-to-End Speech Recognition with Pre-trained Masked Language Model	Yosuke Higuchi et.al.	2410.00528	link
2024-09-30	Mamba for Streaming ASR Combined with Unimodal Aggregation	Ying Fang et.al.	2410.00070	link
2024-10-02	Moshi: a speech-text foundation model for real-time dialogue	Alexandre Défossez et.al.	2410.00037	link
2024-09-30	Boosting Hybrid Autoregressive Transducer-based ASR with Internal Acoustic Model Training and Dual Blank Thresholding	Takafumi Moriya et.al.	2409.20313	null
2024-09-30	Alignment-Free Training for Transducer-based Multi-Talker ASR	Takafumi Moriya et.al.	2409.20301	null
2024-09-30	AfriHuBERT: A self-supervised speech representation model for African languages	Jesujoba O. Alabi et.al.	2409.20201	null
2024-09-30	Predictive Speech Recognition and End-of-Utterance Detection Towards Spoken Dialog Systems	Oswald Zink et.al.	2409.19990	null
2024-09-30	HDMoLE: Mixture of LoRA Experts with Hierarchical Routing and Dynamic Thresholds for Fine-Tuning LLM-based ASR Models	Bingshen Mu et.al.	2409.19878	null
2024-09-29	Fine-Tuning Automatic Speech Recognition for People with Parkinson's: An Effective Strategy for Enhancing Speech Technology Accessibility	Xiuwen Zheng et.al.	2409.19818	null
2024-09-29	Efficient Long-Form Speech Recognition for General Speech In-Context Learning	Hao Yen et.al.	2409.19757	null
2024-09-29	Quantitative Analysis of Audio-Visual Tasks: An Information-Theoretic Perspective	Chen Chen et.al.	2409.19575	null
2024-09-29	CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought	Yexing Du et.al.	2409.19510	link
2024-09-28	Advanced Clustering Techniques for Speech Signal Enhancement: A Review and Metanalysis of Fuzzy C-Means, K-Means, and Kernel Fuzzy C-Means Methods	Abdulhady Abas Abdullah et.al.	2409.19448	null
2024-09-27	Speech-Mamba: Long-Context Speech Recognition with Selective State Spaces Models	Xiaoxue Gao et.al.	2409.18654	null
2024-09-30	ChildMandarin: A Comprehensive Mandarin Speech Dataset for Young Children Aged 3-5	Jiaming Zhou et.al.	2409.18584	null
2024-09-27	Improving Multilingual ASR in the Wild Using Simple N-best Re-ranking	Brian Yan et.al.	2409.18428	link
2024-09-26	Unveiling the Role of Pretraining in Direct Speech Translation	Belen Alastruey et.al.	2409.18044	null
2024-09-26	Are Transformers in Pre-trained LM A Good ASR Encoder? An Empirical Study	Keyu An et.al.	2409.17750	null
2024-09-26	Paraformer-v2: An improved non-autoregressive transformer for noise-robust speech recognition	Keyu An et.al.	2409.17746	null
2024-09-26	Deep CLAS: Deep Contextual Listen, Attend and Spell	Shifu Xiong et.al.	2409.17603	null
2024-11-08	How to Connect Speech Foundation Models and Large Language Models? What Matters and What Does Not	Francesco Verdini et.al.	2409.17044	null
2024-09-25	MT2KD: Towards A General-Purpose Encoder for Speech, Speaker, and Audio Events	Xiaoyu Yang et.al.	2409.17010	null
2024-09-25	Weighted Cross-entropy for Low-Resource Languages in Multilingual Speech Recognition	Andrés Piñeiro-Martín et.al.	2409.16954	link
2024-09-27	Semi-Supervised Cognitive State Classification from Speech with Multi-View Pseudo-Labeling	Yuanchao Li et.al.	2409.16937	link
2024-09-25	Speech Recognition Rescoring with Large Speech-Text Foundation Models	Prashanth Gurunath Shivakumar et.al.	2409.16654	null
2024-09-24	Spelling Correction through Rewriting of Non-Autoregressive ASR Lattices	Leonid Velikovich et.al.	2409.16469	null
2024-09-24	Revisiting Acoustic Features for Robust ASR	Muhammad A. Shah et.al.	2409.16399	null
2024-09-10	How Redundant Is the Transformer Stack in Speech Representation Models?	Teresa Dorszewski et.al.	2409.16302	null
2024-09-24	Bridging Speech and Text: Enhancing ASR with Pinyin-to-Character Pre-training in LLMs	Yang Yuhang et.al.	2409.16005	null
2024-10-31	Boosting Code-Switching ASR with Mixture of Experts Enhanced Speech-Conditioned LLM	Fengrun Zhang et.al.	2409.15905	null
2024-09-24	WeSep: A Scalable and Flexible Toolkit Towards Generalizable Target Speaker Extraction	Shuai Wang et.al.	2409.15799	link
2024-09-24	Hypothesis Clustering and Merging: Novel MultiTalker Speech Recognition with Speaker Tokens	Yosuke Kashiwagi et.al.	2409.15732	null
2024-09-23	Revise, Reason, and Recognize: LLM-Based Emotion Recognition via Emotion-Specific Prompts and ASR Error Correction	Yuanchao Li et.al.	2409.15551	link
2024-09-17	A Joint Spectro-Temporal Relational Thinking Based Acoustic Modeling Framework	Zheng Nan et.al.	2409.15357	null
2024-09-11	Contextualization of ASR with LLM using phonetic retrieval-based augmentation	Zhihong Lei et.al.	2409.15353	null
2024-09-10	A Large Dataset of Spontaneous Speech with the Accent Spoken in São Paulo for Automatic Speech Recognition Evaluation	Rodrigo Lima et.al.	2409.15350	null
2024-09-13	CPT-Boosted Wav2vec2.0: Towards Noise Robust Speech Recognition for Classroom Environments	Ahmed Adel Attia et.al.	2409.14494	null
2024-09-21	Strong Alone, Stronger Together: Synergizing Modality-Binding Foundation Models with Optimal Transport for Non-Verbal Emotion Recognition	Orchid Chetia Phukan et.al.	2409.14221	null
2024-09-21	MultiMed: Multilingual Medical Speech Recognition via Attention Encoder Decoder	Khai Le-Duc et.al.	2409.14074	link
2024-09-20	Time and Tokens: Benchmarking End-to-End Speech Dysfluency Detection	Xuanru Zhou et.al.	2409.13582	null
2024-09-20	LM-assisted keyword biasing with Aho-Corasick algorithm for Transducer-based ASR	Iuliia Thorbecke et.al.	2409.13514	null
2024-10-07	Fast Streaming Transducer ASR Prototyping via Knowledge Distillation with Whisper	Iuliia Thorbecke et.al.	2409.13499	null
2024-09-20	A Multimodal Dense Retrieval Approach for Speech-Based Open-Domain Question Answering	Georgios Sidiropoulos et.al.	2409.13483	null
2024-09-20	Large Language Model Should Understand Pinyin for Chinese ASR Error Correction	Yuang Li et.al.	2409.13262	null
2024-09-19	Personalized Speech Recognition for Children with Test-Time Adaptation	Zhonghao Shi et.al.	2409.13095	null
2024-09-19	Enhancing Synthetic Training Data for Speech Commands: From ASR-Based Filtering to Domain Adaptation in SSL Latent Space	Sebastião Quintas et.al.	2409.12745	null
2024-09-19	Hidden in Plain Sound: Environmental Backdoor Poisoning Attacks on Whisper, and Mitigations	Jonatan Bartolini et.al.	2409.12553	null
2024-09-19	Disentangling Speakers in Multi-Talker Speech Recognition with Speaker-Aware CTC	Jiawen Kang et.al.	2409.12388	null
2024-09-19	Channel-Aware Domain-Adaptive Generative Adversarial Network for Robust Speech Recognition	Chien-Chun Wang et.al.	2409.12386	link
2024-09-19	Robust Audiovisual Speech Recognition Models with Mixture-of-Experts	Yihan Wu et.al.	2409.12370	null
2024-09-18	META-CAT: Speaker-Informed Speech Embeddings via Meta Information Concatenation for Multi-talker ASR	Jinhan Wang et.al.	2409.12352	null
2024-09-18	Large Language Models Are Strong Audio-Visual Speech Recognition Learners	Umberto Cappellazzo et.al.	2409.12319	null
2024-09-19	WeHelp: A Shared Autonomy System for Wheelchair Users	Abulikemu Abuduweili et.al.	2409.12159	link
2024-09-18	ASR Benchmarking: Need for a More Representative Conversational Dataset	Gaurav Maheshwari et.al.	2409.12042	link
2024-09-18	M2R-Whisper: Multi-stage and Multi-scale Retrieval Augmentation for Enhancing Whisper	Jiaming Zhou et.al.	2409.11889	null
2024-09-19	Simulating Native Speaker Shadowing for Nonnative Speech Assessment with Latent Speech Representations	Haopeng Geng et.al.	2409.11742	null
2024-09-17	Chain-of-Thought Prompting for Speech Translation	Ke Hu et.al.	2409.11538	null
2024-09-17	M-BEST-RQ: A Multi-Channel Speech Foundation Model for Smart Glasses	Yufeng Yang et.al.	2409.11494	null
2024-09-17	Bio-Inspired Mamba: Temporal Locality and Bioplausible Learning in Selective State Space Models	Jiahao Qin et.al.	2409.11263	null
2024-09-17	WER We Stand: Benchmarking Urdu ASR Models	Samee Arif et.al.	2409.11252	null
2024-09-17	Ideal-LLM: Integrating Dual Encoders and Language-Adapted LLM for Multilingual Speech-to-Text	Hongfei Xue et.al.	2409.11214	null
2024-09-17	Zero Shot Text to Speech Augmentation for Automatic Speech Recognition on Low-Resource Accented Speech Corpora	Francesco Nespoli et.al.	2409.11107	null
2024-09-17	Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models	Potsawee Manakul et.al.	2409.10999	null
2024-09-17	Speech Recognition for Analysis of Police Radio Communication	Tejes Srivastava et.al.	2409.10858	null
2024-09-16	An Efficient Self-Learning Framework For Interactive Spoken Dialog Systems	Hitesh Tulsiani et.al.	2409.10515	null
2024-09-16	Meta-Whisper: Speech-Based Meta-ICL for ASR on Low-Resource Languages	Ming-Hao Hsu et.al.	2409.10429	null
2024-09-16	Voice control interface for surgical robot assistants	Ana Davila et.al.	2409.10225	null
2024-09-17	Augmenting Automatic Speech Recognition Models with Disfluency Detection	Robin Amann et.al.	2409.10177	null
2024-09-16	Optimizing Dysarthria Wake-Up Word Spotting: An End-to-End Approach for SLT 2024 LRDWWS Challenge	Shuiyun Liu et.al.	2409.10076	null
2024-09-16	A Study on Zero-shot Non-intrusive Speech Assessment using Large Language Models	Ryandhimas E. Zezario et.al.	2409.09914	null
2024-09-17	Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition	Chao-Han Huck Yang et.al.	2409.09785	null
2024-09-14	ASR Error Correction using Large Language Models	Rao Ma et.al.	2409.09554	null
2024-09-14	M$^{3}$V: A multi-modal multi-view approach for Device-Directed Speech Detection	Anna Wang et.al.	2409.09284	null
2024-09-13	Multi-modal Speech Transformer Decoders: When Do Multiple Modalities Improve Accuracy?	Yiwen Guan et.al.	2409.09221	null
2024-09-13	Learnings from curating a trustworthy, well-annotated, and useful dataset of disordered English speech	Pan-Pan Jiang et.al.	2409.09190	null
2024-09-13	Clean Label Attacks against SLU Systems	Henry Li Xinyuan et.al.	2409.08985	null
2024-09-13	Exploring the Impact of Data Quantity on ASR in Extremely Low-resource Languages	Yao-Fei Cheng et.al.	2409.08872	null
2024-09-13	Exploring SSL Discrete Tokens for Multilingual ASR	Mingyu Cui et.al.	2409.08805	null
2024-09-13	NEST-RQ: Next Token Prediction for Speech Self-Supervised Pre-Training	Minglun Han et.al.	2409.08680	null
2024-09-13	LA-RAG: Enhancing LLM-based ASR Accuracy with Retrieval-Augmented Generation	Shaojun Li et.al.	2409.08597	null
2024-09-13	Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions	Lingwei Meng et.al.	2409.08596	null
2024-09-12	Faster Speech-LLaMA Inference with Multi-token Prediction	Desh Raj et.al.	2409.08148	null
2024-09-12	WhisperNER: Unified Open Named Entity and Speech Recognition	Gil Ayache et.al.	2409.08107	null
2024-10-06	The Faetar Benchmark: Speech Recognition in a Very Under-Resourced Language	Michael Ong et.al.	2409.08103	null
2024-09-12	Auto-Landmark: Acoustic Landmark Dataset and Open-Source Toolkit for Landmark Extraction	Xiangyu Zhang et.al.	2409.07969	null
2024-09-12	Detecting and Defending Against Adversarial Attacks on Automatic Speech Recognition via Diffusion Models	Nikolai L. Kühne et.al.	2409.07936	link
2024-09-12	Full-text Error Correction for Chinese Speech Recognition with Large Language Model	Zhiyuan Tang et.al.	2409.07790	null
2024-09-11	Rethinking Mamba in Speech Processing by Self-Supervised Models	Xiangyu Zhang et.al.	2409.07273	null
2024-09-11	ManaTTS Persian: a recipe for creating TTS datasets for lower resource languages	Mahta Fetrat Qharabagh et.al.	2409.07259	null
2024-09-11	Enhancing CTC-Based Visual Speech Recognition	Hendrik Laux et.al.	2409.07210	null
2024-09-11	Linear Time Complexity Conformers with SummaryMixing for Streaming Speech Recognition	Titouan Parcollet et.al.	2409.07165	link
2024-09-10	An Effective Context-Balanced Adaptation Approach for Long-Tailed Speech Recognition	Yi-Cheng Wang et.al.	2409.06468	null
2024-09-10	Keyword-Aware ASR Error Augmentation for Robust Dialogue State Tracking	Jihyun Lee et.al.	2409.06263	null
2024-09-10	Advancing Topic Segmentation of Broadcasted Speech with Multilingual Semantic Embeddings	Sakshi Deo Shukla et.al.	2409.06222	link
2024-09-09	Retrieval Augmented Correction of Named Entity Speech Recognition Errors	Ernest Pusateri et.al.	2409.06062	null
2024-09-09	Consensus-based Distributed Quantum Kernel Learning for Speech Recognition	Kuan-Cheng Chen et.al.	2409.05770	null
2024-09-09	A Toolkit for Joint Speaker Diarization and Identification with Application to Speaker-Attributed ASR	Giovanni Morrone et.al.	2409.05750	null
2024-09-11	Evaluation of real-time transcriptions using end-to-end ASR models	Carlos Arriaga et.al.	2409.05674	null
2024-09-09	Longer is (Not Necessarily) Stronger: Punctuated Long-Sequence Training for Enhanced Speech Recognition and Translation	Nithin Rao Koluguri et.al.	2409.05601	null
2024-09-09	An investigation of modularity for noise robustness in conformer-based ASR	Louise Coppieters de Gibson et.al.	2409.05589	null
2025-08-27	Leveraging Content and Acoustic Representations for Speech Emotion Recognition	Soumya Dutta et.al.	2409.05566	null
2024-09-09	NTT Multi-Speaker ASR System for the DASR Task of CHiME-8 Challenge	Naoyuki Kamo et.al.	2409.05554	null
2024-09-09	Findings of the 2024 Mandarin Stuttering Event Detection and Automatic Speech Recognition Challenge	Hongfei Xue et.al.	2409.05430	null
2024-09-08	Exploring WavLM Back-ends for Speech Spoofing and Deepfake Detection	Theophile Stourbe et.al.	2409.05032	null
2024-09-04	Probing self-attention in self-supervised speech models for cross-linguistic differences	Sai Gopinath et.al.	2409.03115	null
2024-09-04	Quantification of stylistic differences in human- and ASR-produced transcripts of African American English	Annika Heuser et.al.	2409.03059	null
2024-09-04	Efficient Extraction of Noise-Robust Discrete Units from Self-Supervised Speech Models	Jakob Poncelet et.al.	2409.02565	null
2024-09-04	Parameter estimation of hidden Markov models: comparison of EM and quasi-Newton methods with a new hybrid algorithm	Sidonie Foulon et.al.	2409.02477	null
2024-09-04	What is lost in Normalization? Exploring Pitfalls in Multilingual ASR Model Evaluations	Kavya Manohar et.al.	2409.02449	null
2024-09-05	Temporal Order Preserved Optimal Transport-based Cross-modal Knowledge Transfer Learning for ASR	Xugang Lu et.al.	2409.02239	null
2024-08-19	Toward Large-scale Spiking Neural Networks: A Comprehensive Survey and Future Directions	Yangfan Hu et.al.	2409.02111	null
2024-09-05	Enhancing Code-Switching Speech Recognition with LID-Based Collaborative Mixture of Experts Model	Hukai Huang et.al.	2409.02050	null
2024-09-03	The USTC-NERCSLIP Systems for the CHiME-8 NOTSOFAR-1 Challenge	Shutong Niu et.al.	2409.02041	null
2024-09-03	Reassessing Noise Augmentation Methods in the Context of Adversarial Speech	Karla Pizzi et.al.	2409.01813	null
2024-09-24	VoxHakka: A Dialectally Diverse Multi-speaker Text-to-Speech System for Taiwanese Hakka	Li-Wei Chen et.al.	2409.01548	null
2024-09-02	Resource-Efficient Adaptation of Speech Foundation Models for Multi-Speaker ASR	Weiqing Wang et.al.	2409.01438	null
2024-09-23	Refined Statistical Bounds for Classification Error Mismatches with Constrained Bayes Error	Zijian Yang et.al.	2409.01309	null
2024-09-02	A Framework for Synthetic Audio Conversations Generation using Large Language Models	Kaung Myat Kyaw et.al.	2409.00946	null
2024-09-11	Serialized Speech Information Guidance with Overlapped Encoding Separation for Multi-Speaker Automatic Speech Recognition	Hao Shi et.al.	2409.00815	null
2024-09-01	Comparing Discrete and Continuous Space LLMs for Speech Recognition	Yaoxun Xu et.al.	2409.00800	null
2024-09-11	DCIM-AVSR: Efficient Audio-Visual Speech Recognition via Dual Conformer Interaction Module	Xinyu Wang et.al.	2409.00481	null
2024-08-31	Progressive Residual Extraction based Pre-training for Speech Representation Learning	Tianrui Wang et.al.	2409.00387	null
2024-09-08	ProGRes: Prompted Generative Rescoring on ASR n-Best	Ada Defne Tur et.al.	2409.00217	link
2024-08-30	Developing an End-to-End Framework for Predicting the Social Communication Severity Scores of Children with Autism Spectrum Disorder	Jihyun Mun et.al.	2409.00158	null
2024-08-30	Speaker Tagging Correction With Non-Autoregressive Language Models	Grigor Kirakosyan et.al.	2409.00151	null
2024-08-30	Advancing Multi-talker ASR Performance with Large Language Models	Mohan Shi et.al.	2408.17431	null
2024-08-30	Generative Modeling Perspective for Control and Reasoning in Robotics	Takuma Yoneda et.al.	2408.17041	null
2024-08-29	CrisperWhisper: Accurate Timestamps on Verbatim Speech Transcriptions	Laurin Wagner et.al.	2408.16589	link
2024-08-29	Human-Inspired Audio-Visual Speech Recognition: Spike Activity, Cueing Interaction and Causal Processing	Qianhui Liu et.al.	2408.16564	null
2024-08-29	Measuring the Accuracy of Automatic Speech Recognition Solutions	Korbinian Kuhn et.al.	2408.16287	link
2024-08-29	Revisit Micro-batch Clipping: Adaptive Data Pruning via Gradient Manipulation	Lun Wang et.al.	2408.16204	null
2024-08-29	Benchmarking Japanese Speech Recognition on ASR-LLM Setups with Multi-Pass Augmented Generative Error Correction	Yuka Ko et.al.	2408.16180	null
2024-08-28	Beyond Levenshtein: Leveraging Multiple Algorithms for Robust Word Error Rate Computations And Granular Error Classifications	Korbinian Kuhn et.al.	2408.15616	link
2024-08-28	Whisper-PMFA: Partial Multi-Scale Feature Aggregation for Speaker Verification using Whisper Models	Yiyang Zhao et.al.	2408.15585	null
2024-08-27	Speech Recognition Transformers: Topological-lingualism Perspective	Shruti Singh et.al.	2408.14991	null
2024-08-27	Literary and Colloquial Dialect Identification for Tamil using Acoustic Features	M. Nanmalar et.al.	2408.14887	null
2024-09-06	MEDSAGE: Enhancing Robustness of Medical Dialogue Summarization to ASR Errors with LLM-generated Synthetic Dialogues	Kuluhan Binici et.al.	2408.14418	null
2024-08-26	Self-supervised Speech Representations Still Struggle with African American Vernacular English	Kalvin Chang et.al.	2408.14262	link
2024-08-26	Automatic recognition and detection of aphasic natural speech	Mara Barberis et.al.	2408.14082	null
2024-08-28	Research Advances and New Paradigms for Biology-inspired Spiking Neural Networks	Tianyu Zheng et.al.	2408.13996	null
2024-08-25	Literary and Colloquial Tamil Dialect Identification	M. Nanmalar et.al.	2408.13739	null
2024-08-24	Studying the Effect of Audio Filters in Pre-Trained Models for Environmental Sound Classification	Aditya Dawn et.al.	2408.13644	null
2024-09-18	NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks	He Huang et.al.	2408.13106	link
2024-08-23	Focused Discriminative Training For Streaming CTC-Trained Automatic Speech Recognition Models	Adnan Haider et.al.	2408.13008	null
2024-08-22	Towards measuring fairness in speech recognition: Fair-Speech dataset	Irina-Elena Veliche et.al.	2408.12734	null
2024-08-22	WhisperMask: A Noise Suppressive Mask-Type Microphone for Whisper Speech	Hirotaka Hiraki et.al.	2408.12500	null
2024-08-22	Positional Description for Numerical Normalization	Deepanshu Gupta et.al.	2408.12430	null
2024-08-22	Developing vocal system impaired patient-aimed voice quality assessment approach using ASR representation-included multiple features	Shaoxiang Dang et.al.	2408.12279	null
2024-08-21	The State of Commercial Automatic French Legal Speech Recognition Systems and their Impact on Court Reporters et al	Nicolas Garneau et.al.	2408.11940	null
2024-08-19	Parameter-Efficient Transfer Learning under Federated Learning for Automatic Speech Recognition	Xuan Kan et.al.	2408.11873	null
2024-08-13	Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation	Yinghao Aaron Li et.al.	2408.11849	null
2024-08-21	Approaching Deep Learning through the Spectral Dynamics of Weights	David Yunis et.al.	2408.11804	link
2024-08-21	Improving Speech Recognition Error Prediction for Modern and Off-the-shelf Speech Recognizers	Prashant Serai et.al.	2408.11258	null
2024-08-20	XCB: an effective contextual biasing approach to bias cross-lingual phrases in speech recognition	Xucheng Wan et.al.	2408.10524	null
2024-08-19	Recording for Eyes, Not Echoing to Ears: Contextualized Spoken-to-Written Conversion of ASR Transcripts	Jiaqing Liu et.al.	2408.09688	null
2024-08-18	A Transcription Prompt-based Efficient Audio Large Language Model for Robust Speech Recognition	Yangze Li et.al.	2408.09491	null
2024-08-17	Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition	Samuele Cornell et.al.	2408.09215	link
2024-08-15	Enhancing Large Language Model-based Speech Recognition by Contextualization for Rare and Ambiguous Words	Kento Nozawa et.al.	2408.08027	null
2024-08-14	SER Evals: In-domain and Out-of-domain Benchmarking for Speech Emotion Recognition	Mohamed Osman et.al.	2408.07851	link
2024-08-14	DPSNN: Spiking Neural Network for Low-Latency Streaming Speech Enhancement	Tao Sun et.al.	2408.07388	null
2024-08-16	MathBridge: A Large Corpus Dataset for Translating Spoken Mathematical Expressions into LaTeX Formulas for Improved Readability	Kyudan Jung et.al.	2408.07081	null
2024-08-12	Cross-Lingual Conversational Speech Summarization with Large Language Models	Max Nelson et.al.	2408.06484	null
2024-08-12	Audio Enhancement for Computer Audition -- An Iterative Training Paradigm Using Sample Importance	Manuel Milling et.al.	2408.06264	null
2024-08-12	Enhancing Dialogue Speech Recognition with Robust Contextual Awareness via Noise Representation Learning	Wonjun Lee et.al.	2408.06043	null
2024-08-11	LI-TTA: Language Informed Test-Time Adaptation for Automatic Speech Recognition	Eunseop Yoon et.al.	2408.05769	null
2024-08-11	VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing	Chunyu Qiang et.al.	2408.05758	null
2024-08-10	Improving Whisper's Recognition Performance for Under-Represented Language Kazakh Leveraging Unpaired Speech and Text	Jinpeng Li et.al.	2408.05554	null
2024-08-09	MooER: LLM-based Speech Recognition and Translation Models from Moore Threads	Junhao Xu et.al.	2408.05101	link
2024-08-08	HydraFormer: One Encoder For All Subsampling Rates	Yaoxun Xu et.al.	2408.04325	link
2024-08-08	Preserving spoken content in voice anonymisation with character-level vocoder conditioning	Michele Panariello et.al.	2408.04306	link
2024-08-08	wav2graph: A Framework for Supervised Learning Knowledge Graph from Speech	Khai Le-Duc et.al.	2408.04174	link
2024-08-07	Speaker Adaptation for Quantised End-to-End ASR Models	Qiuming Zhao et.al.	2408.03979	null
2024-08-06	ASR-enhanced Multimodal Representation Learning for Cross-Domain Product Retrieval	Ruixiang Zhao et.al.	2408.02978	null
2024-08-06	Self-Supervised Learning for Multi-Channel Neural Transducer	Atsushi Kojima et.al.	2408.02945	null
2024-08-05	Clustering and Mining Accented Speech for Inclusive and Fair Speech Recognition	Jaeyoung Kim et.al.	2408.02582	null
2024-09-12	The NPU-ASLP System Description for Visual Speech Recognition in CNVSRC 2024	He Wang et.al.	2408.02369	link
2024-08-05	StreamVoice+: Evolving into End-to-end Streaming Zero-shot Voice Conversion	Zhichao Wang et.al.	2408.02178	null
2024-08-03	ALIF: Low-Cost Adversarial Audio Attacks on Black-Box Speech Platforms using Linguistic Features	Peng Cheng et.al.	2408.01808	link
2024-08-01	SynesLM: A Unified Approach for Audio-visual Speech Recognition and Translation via Language Model and Synthetic Data	Yichen Lu et.al.	2408.00624	link
2024-08-01	Sentence-wise Speech Summarization: Task, Datasets, and End-to-End Modeling with LM Knowledge Distillation	Kohei Matsuura et.al.	2408.00205	null
2024-07-18	Framework for Curating Speech Datasets and Evaluating ASR Systems: A Case Study for Polish	Michał Junczyk et.al.	2408.00005	link
2024-07-18	Handling Numeric Expressions in Automatic Speech Recognition	Christian Huber et.al.	2408.00004	null
2024-08-15	The Llama 3 Herd of Models	Abhimanyu Dubey et.al.	2407.21783	null
2024-07-31	On the Problem of Text-To-Speech Model Selection for Synthetic Data Generation in Automatic Speech Recognition	Nick Rossenbach et.al.	2407.21476	null
2024-07-31	Towards interfacing large language models with ASR systems using confidence measures and prompting	Maryam Naderi et.al.	2407.21414	null
2024-07-30	Self-Supervised Models in Automatic Whispered Speech Recognition	Aref Farhadipour et.al.	2407.21211	null
2024-07-28	ELP-Adapters: Parameter Efficient Adapter Tuning for Various Speech Processing Tasks	Nakamasa Inoue et.al.	2407.21066	null
2024-07-26	Improving noisy student training for low-resource languages in End-to-End ASR using CycleGAN and inter-domain losses	Chia-Yu Li et.al.	2407.21061	null
2024-07-10	Dynamic Encoder Size Based on Data-Driven Layer-wise Pruning for Speech Recognition	Jingjing Xu et.al.	2407.18930	null
2024-08-07	Dynamic Language Group-Based MoE: Enhancing Code-Switching Speech Recognition with Hierarchical Routing	Hukai Huang et.al.	2407.18581	link
2024-07-29	Speech Bandwidth Expansion Via High Fidelity Generative Adversarial Networks	Mahmoud Salhab et.al.	2407.18571	null
2024-07-26	Enhancing Dysarthric Speech Recognition for Unseen Speakers via Prototype-Based Adaptation	Shiyao Wang et.al.	2407.18461	link
2024-07-08	Analyzing Speech Unit Selection for Textless Speech-to-Speech Translation	Jarod Duret et.al.	2407.18332	null
2024-07-25	On the Effect of Purely Synthetic Training Data for Different Automatic Speech Recognition Architectures	Nick Rossenbach et.al.	2407.17997	null
2024-07-25	Improving Domain-Specific ASR with LLM-Generated Contextual Descriptions	Jiwon Suh et.al.	2407.17874	null
2024-07-25	Scaling A Simple Approach to Zero-Shot Speech Recognition	Jinming Zhao et.al.	2407.17852	link
2024-07-24	Coupling Speech Encoders with Downstream Text Models	Ciprian Chelba et.al.	2407.17605	null
2024-07-30	Toward Automated Detection of Biased Social Signals from the Content of Clinical Conversations	Feng Chen et.al.	2407.17477	null
2024-07-10	Explaining Spectrograms in Machine Learning: A Study on Neural Networks for Speech Classification	Jesin James et.al.	2407.17416	null
2024-07-24	A Comparative Analysis of Bilingual and Trilingual Wav2Vec Models for Automatic Speech Recognition in Multilingual Oral History Archives	Jan Lehečka et.al.	2407.17160	null
2024-07-23	Quantifying the Role of Textual Predictability in Automatic Speech Recognition	Sean Robertson et.al.	2407.16537	null
2024-07-23	The CHiME-8 DASR Challenge for Generalizable and Array Agnostic Distant Automatic Speech Recognition and Diarization	Samuele Cornell et.al.	2407.16447	null
2024-07-23	Evolutionary Prompt Design for LLM-Based Post-ASR Error Correction	Rithik Sachdev et.al.	2407.16370	link
2024-07-22	dMel: Speech Tokenization made Simple	He Bai et.al.	2407.15835	null
2024-07-22	Robustness of Speech Separation Models for Similar-pitch Speakers	Bunlong Lay et.al.	2407.15749	null
2024-07-22	SELM: Enhancing Speech Emotion Recognition for Out-of-Domain Scenarios	Hazim Bukhari et.al.	2407.15300	null
2024-08-24	Trading Devil Final: Backdoor attack via Stock market and Bayesian Optimization	Orson Mengara et.al.	2407.14573	null
2024-07-07	Morse Code-Enabled Speech Recognition for Individuals with Visual and Hearing Impairments	Ritabrata Roy Choudhury et.al.	2407.14525	null
2024-07-19	GE2E-AC: Generalized End-to-End Loss Training for Accent Classification	Chihiro Watanabe et.al.	2407.14021	null
2024-07-19	Reexamining Racial Disparities in Automatic Speech Recognition Performance: The Role of Confounding by Provenance	Changye Li et.al.	2407.13982	null
2024-07-22	Self-supervised ASR Models and Features For Dysarthric and Elderly Speech Recognition	Shujie Hu et.al.	2407.13782	null
2024-07-18	Robust ASR Error Correction with Conservative Data Filtering	Takuma Udagawa et.al.	2407.13300	null
2024-07-18	Low-Resourced Speech Recognition for Iu Mien Language via Weakly-Supervised Phoneme-based Multilingual Pre-training	Lukuan Dong et.al.	2407.13292	null
2024-07-18	How Private is Low-Frequency Speech Audio in the Wild? An Analysis of Verbal Intelligibility by Humans and Machines	Ailin Liu et.al.	2407.13266	null
2024-07-18	A light-weight and efficient punctuation and word casing prediction model for on-device streaming ASR	Jian You et.al.	2407.13142	null
2024-06-29	Error Correction by Paying Attention to Both Acoustic and Confidence References for Automatic Speech Recognition	Yuchun Shu et.al.	2407.12817	null
2024-07-17	Morphosyntactic Analysis for CHILDES	Houjun Liu et.al.	2407.12389	null
2024-07-17	Adaptive Cascading Network for Continual Test-Time Adaptation	Kien X. Nguyen et.al.	2407.12240	null
2024-07-16	Identifying Speakers in Dialogue Transcripts: A Text-based Approach Using Pretrained Language Models	Minh Nguyen et.al.	2407.12094	link
2024-06-29	A Quality-Aware Voltage Overscaling Framework to Improve the Energy Efficiency and Lifetime of TPUs based on Statistical Error Modeling	Alireza Senobari et.al.	2407.12029	null
2024-06-28	TreeSeg: Hierarchical Topic Segmentation of Large Transcripts	Dimitrios C. Gklezakos et.al.	2407.12028	null
2024-05-31	Open the Data! Chuvash Datasets	Nikolay Plotnikov et.al.	2407.11982	null
2024-07-17	Vibravox: A Dataset of French Speech Captured with Body-conduction Audio Sensors	Julien Hauret et.al.	2407.11828	link
2024-07-16	Investigating the Effect of Label Topology and Training Criterion on ASR Performance and Alignment Quality	Tina Raissi et.al.	2407.11641	null
2024-07-16	The VoicePrivacy 2022 Challenge: Progress and Perspectives in Voice Anonymisation	Michele Panariello et.al.	2407.11516	null
2024-07-16	Beyond Binary: Multiclass Paraphasia Detection with Generative Pretrained Transformers and End-to-End Models	Matthew Perez et.al.	2407.11345	null
2024-07-15	Leave No Knowledge Behind During Knowledge Distillation: Towards Practical and Effective Knowledge Distillation for Code-Switching ASR Using Realistic Data	Liang-Hsuan Tseng et.al.	2407.10603	null
2024-07-14	Improving Neural Biasing for Contextual Speech Recognition by Early Context Injection and Text Perturbation	Ruizhe Huang et.al.	2407.10303	null
2024-07-14	CUSIDE-T: Chunking, Simulating Future and Decoding for Transducer based Streaming ASR	Wenbo Zhao et.al.	2407.10255	null
2024-07-14	Textless Dependency Parsing by Labeled Sequence Prediction	Shunsuke Kando et.al.	2407.10118	link
2024-07-14	Whisper-SV: Adapting Whisper for Low-data-resource Speaker Verification	Li Zhang et.al.	2407.10048	null
2024-07-13	Text-Based Detection of On-Hold Scripts in Contact Center Calls	Dmitrii Galimzianov et.al.	2407.09849	link
2024-08-24	Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System	Lingwei Meng et.al.	2407.09817	link
2024-07-13	A Streaming Multi-Channel End-to-End Speech Recognition System with Realistic Evaluations	Xiangzhu Kong et.al.	2407.09807	link
2024-07-13	Speech Slytherin: Examining the Performance and Efficiency of Mamba for Speech Separation, Recognition, and Synthesis	Xilin Jiang et.al.	2407.09732	link
2024-07-10	Evaluating Voice Command Pipelines for Drone Control: From STT and LLM to Direct Classification and Siamese Networks	Lucca Emmanuel Pineli Simões et.al.	2407.08658	null
2024-08-12	Tamil Language Computing: the Present and the Future	Kengatharaiyer Sarveswaran et.al.	2407.08618	null
2024-07-10	HebDB: a Weakly Supervised Dataset for Hebrew Speech Processing	Arnon Turetzky et.al.	2407.07566	null
2024-07-09	Tailored Design of Audio-Visual Speech Recognition Models using Branchformers	David Gimeno-Gómez et.al.	2407.06606	link
2024-07-08	Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation	Mengzhe Geng et.al.	2407.06310	null
2024-07-09	CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens	Zhihao Du et.al.	2407.05407	null
2024-07-10	Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition	Ye Bai et.al.	2407.04675	null
2024-07-05	Multitaper mel-spectrograms for keyword spotting	Douglas Baptista de Souza et.al.	2407.04662	null
2024-07-05	Pretraining End-to-End Keyword Search with Automatically Discovered Acoustic Units	Bolaji Yusuf et.al.	2407.04652	link
2024-07-05	Speculative Speech Recognition by Audio-Prefixed Low-Rank Adaptation of Language Models	Bolaji Yusuf et.al.	2407.04641	null
2024-07-05	Written Term Detection Improves Spoken Term Detection	Bolaji Yusuf et.al.	2407.04601	link
2024-07-09	Performance Analysis of Speech Encoders for Low-Resource SLU and ASR in Tunisian Dialect	Salima Mdhaffar et.al.	2407.04533	link
2024-07-05	Controlling Whisper: Universal Acoustic Adversarial Attacks to Control Speech Foundation Models	Vyas Raina et.al.	2407.04482	null
2024-07-05	XLSR-Transducer: Streaming ASR for Self-Supervised Pretrained Models	Shashi Kumar et.al.	2407.04439	null
2024-07-05	Romanization Encoding For Multilingual ASR	Wen Ding et.al.	2407.04368	null
2024-07-05	LearnerVoice: A Dataset of Non-Native English Learners' Spontaneous Speech	Haechan Kim et.al.	2407.04280	null
2024-07-05	Semi-supervised Learning for Code-Switching ASR with Large Language Model Filter	Yu Xi et.al.	2407.04219	null
2024-07-11	FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs	Keyu An et.al.	2407.04051	link
2024-07-04	Improving Accented Speech Recognition using Data Augmentation based on Unsupervised Text-to-Speech Synthesis	Cong-Thanh Do et.al.	2407.04047	null
2024-07-04	Serialized Output Training by Learned Dominance	Ying Shi et.al.	2407.03966	null
2024-07-04	Finetuning End-to-End Models for Estonian Conversational Spoken Language Translation	Tiia Sildam et.al.	2407.03809	null
2024-07-04	Improving Self-supervised Pre-training using Accent-Specific Codebooks	Darshan Prabhu et.al.	2407.03734	link
2024-07-24	Multi-Convformer: Extending Conformer with Multiple Convolution Kernels	Darshan Prabhu et.al.	2407.03718	link
2024-07-04	Learning Video Temporal Dynamics with Cross-Modal Attention for Robust Audio-Visual Speech Recognition	Sungnyun Kim et.al.	2407.03563	null
2024-07-03	Codec-ASR: Training Performant Automatic Speech Recognition Systems with Discrete Speech Representations	Kunal Dhawan et.al.	2407.03495	null
2024-07-03	Advanced Framework for Animal Sound Classification With Features Optimization	Qiang Yang et.al.	2407.03440	null
2024-07-03	Qifusion-Net: Layer-adapted Stream/Non-stream Model for End-to-End Multi-Accent Speech Recognition	Jinming Chen et.al.	2407.03026	null
2024-07-02	Towards the Next Frontier in Speech Representation Learning Using Disentanglement	Varun Krishna et.al.	2407.02543	null
2024-07-02	The USTC-NERCSLIP Systems for The ICMC-ASR Challenge	Minghui Wu et.al.	2407.02052	null
2024-07-02	Pinyin Regularization in Error Correction for Chinese Speech Recognition with Large Language Models	Zhiyuan Tang et.al.	2407.01909	link
2024-06-30	Less Forgetting for Better Generalization: Exploring Continual-learning Fine-tuning Methods for Speech Self-supervised Representations	Salah Zaiem et.al.	2407.00756	null
2024-06-29	When Robots Get Chatty: Grounding Multimodal Human-Robot Conversation and Collaboration	Philipp Allgeuer et.al.	2407.00518	null
2024-07-18	Open-Source Conversational AI with SpeechBrain 1.0	Mirco Ravanelli et.al.	2407.00463	null
2024-06-28	SAML: Speaker Adaptive Mixture of LoRA Experts for End-to-End ASR	Qiuming Zhao et.al.	2406.19706	null
2024-06-28	Less is More: Accurate Speech Recognition & Translation without Web-Scale Data	Krishna C. Puvvada et.al.	2406.19674	null
2024-06-27	Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects	Orevaoghene Ahia et.al.	2406.19564	link
2024-06-27	Tradition or Innovation: A Comparison of Modern ASR Methods for Forced Alignment	Rotem Rousso et.al.	2406.19363	null
2024-06-27	Zero-Query Adversarial Attack on Black-box Automatic Speech Recognition Systems	Zheng Fang et.al.	2406.19311	null
2024-06-27	Applying LLMs for Rescoring N-best ASR Hypotheses of Casual Conversations: Effects of Domain Adaptation and Context Carry-over	Atsunori Ogawa et.al.	2406.18972	null
2024-06-27	Enhanced ASR Robustness to Packet Loss with a Front-End Adaptation Network	Yehoshua Dissen et.al.	2406.18928	null
2024-06-27	Streaming Decoder-Only Automatic Speech Recognition with Discrete Speech Units: A Pilot Study	Peikun Chen et.al.	2406.18862	link
2024-06-26	Dynamic Data Pruning for Automatic Speech Recognition	Qiao Xiao et.al.	2406.18373	null
2024-06-26	MSR-86K: An Evolving, Multilingual Corpus with 86,300 Hours of Transcribed Audio for Speech Recognition Research	Song Li et.al.	2406.18301	null
2024-06-26	Automatic Speech Recognition for Hindi	Anish Saha et.al.	2406.18135	null
2024-07-12	ArzEn-LLM: Code-Switched Egyptian Arabic-English Translation and Speech Recognition Using LLMs	Ahmed Heakl et.al.	2406.18120	link
2024-06-26	SC-MoE: Switch Conformer Mixture of Experts for Unified Streaming and Non-streaming Code-Switching ASR	Shuaishuai Ye et.al.	2406.18021	null
2024-06-25	Sequential Editing for Lifelong Training of Speech Recognition Models	Devang Kulshreshtha et.al.	2406.17935	null
2024-06-25	FASA: a Flexible and Automatic Speech Aligner for Extracting High-quality Aligned Children Speech Data	Dancheng Liu et.al.	2406.17926	link
2024-06-25	Automatic speech recognition for the Nepali language using CNN, bidirectional LSTM and ResNet	Manish Dhakal et.al.	2406.17825	link
2024-06-25	Towards Building an End-to-End Multilingual Automatic Lyrics Transcription Model	Jiawen Huang et.al.	2406.17618	link
2024-06-25	MSRS: Training Multimodal Speech Recognition Models from Scratch with Sparse Mask Optimization	Adriana Fernandez-Lopez et.al.	2406.17614	null
2024-06-25	A Comprehensive Solution to Connect Speech Encoder and Large Language Model for ASR	Van Tung Pham et.al.	2406.17272	null
2024-06-24	Investigating Confidence Estimation Measures for Speaker Diarization	Anurag Chowdhury et.al.	2406.17124	null
2024-06-24	Exploring the Capability of Mamba in Speech Applications	Koichi Miyazaki et.al.	2406.16808	null
2024-06-24	Blending LLMs into Cascaded Speech Translation: KIT's Offline Speech Translation System for IWSLT 2024	Sai Koneru et.al.	2406.16777	null
2024-06-23	Contextualized End-to-end Automatic Speech Recognition with Intermediate Biasing Loss	Muhammad Shakeel et.al.	2406.16120	null
2024-08-01	Decoder-only Architecture for Streaming End-to-end Speech Recognition	Emiru Tsunoo et.al.	2406.16107	null
2024-06-22	Acoustic Feature Mixup for Balanced Multi-aspect Pronunciation Assessment	Heejin Do et.al.	2406.15723	null
2024-06-21	PI-Whisper: An Adaptive and Incremental ASR Framework for Diverse and Evolving Speaker Characteristics	Amir Nassereldine et.al.	2406.15668	null
2024-06-21	Perception of Phonological Assimilation by Neural Speech Recognition Models	Charlotte Pouw et.al.	2406.15265	null
2024-06-21	InterBiasing: Boost Unseen Word Recognition through Biasing Intermediate Predictions	Yu Nakagome et.al.	2406.14890	null
2024-06-20	An Adapter-Based Unified Model for Multiple Spoken Language Processing Tasks	Varsha Suresh et.al.	2406.14747	null
2024-06-21	DASB - Discrete Audio and Speech Benchmark	Pooneh Mousavi et.al.	2406.14294	null
2024-06-20	Intelligent Interface: Enhancing Lecture Engagement with Didactic Activity Summaries	Anna Wróblewska et.al.	2406.14266	null
2024-06-19	Joint vs Sequential Speaker-Role Detection and Automatic Speech Recognition for Air-traffic Control	Alexander Blatt et.al.	2406.13842	null
2024-06-19	ManWav: The First Manchu ASR Model	Jean Seo et.al.	2406.13502	null
2024-06-24	Children's Speech Recognition through Discrete Token Enhancement	Vrunda N. Sukhadia et.al.	2406.13431	null
2024-06-17	Self-Train Before You Transcribe	Robert Flynn et.al.	2406.12937	link
2024-06-16	Automatic Speech Recognition for Biomedical Data in Bengali Language	Shariar Kabir et.al.	2406.12931	null
2024-06-18	Bridging the Gap: Integrating Pre-trained Speech Enhancement and Recognition Models for Robust Speech Recognition	Kuan-Chen Wang et.al.	2406.12699	null
2024-06-18	Transcribe, Align and Segment: Creating speech datasets for low-resource languages	Taras Sereda et.al.	2406.12674	null
2024-06-18	Growing Trees on Sounds: Assessing Strategies for End-to-End Dependency Parsing of Speech	Adrien Pupier et.al.	2406.12621	link
2024-06-18	Rapid Language Adaptation for Multilingual E2E Speech Recognition Using Encoder Prompting	Yosuke Kashiwagi et.al.	2406.12611	null
2024-06-18	Unsupervised Online Continual Learning for Automatic Speech Recognition	Steven Vander Eeckt et.al.	2406.12503	link
2024-06-18	Performant ASR Models for Medical Entities in Accented Speech	Tejumade Afonja et.al.	2406.12387	null
2024-06-18	Finding Task-specific Subnetworks in Multi-task Spoken Language Understanding Model	Hayato Futami et.al.	2406.12317	null
2024-06-18	SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization	Young Jin Ahn et.al.	2406.12233	link
2024-06-17	GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement	Yifan Yang et.al.	2406.11546	link
2024-06-16	Continual Test-time Adaptation for End-to-end Speech Recognition on Noisy Speech	Guan-Ting Lin et.al.	2406.11064	null
2024-06-16	NAST: Noise Aware Speech Tokenization for Speech Language Models	Shoval Messica et.al.	2406.11037	link
2024-06-16	Large Language Models for Dysfluency Detection in Stuttered Speech	Dominik Wagner et.al.	2406.11025	null
2024-06-16	Outlier Reduction with Gated Attention for Improved Post-training Quantization in Large Sequence-to-sequence Speech Foundation Models	Dominik Wagner et.al.	2406.11022	null
2024-06-16	Optimized Speculative Sampling for GPU Hardware Accelerators	Dominik Wagner et.al.	2406.11016	null
2024-06-16	CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving	Bhavani Shankar et.al.	2406.10993	null
2024-06-16	Imperceptible Rhythm Backdoor Attacks: Exploring Rhythm Transformation for Embedding Undetectable Vulnerabilities on Speech Recognition	Wenhan Yao et.al.	2406.10932	null
2024-06-15	Speech Emotion Recognition Using CNN and Its Use Case in Digital Healthcare	Nishargo Nigar et.al.	2406.10741	null
2024-06-21	Trading Devil: Robust backdoor attack via Stochastic investment models and Bayesian approach	Orson Mengara et.al.	2406.10719	null
2024-08-06	Double Multi-Head Attention Multimodal System for Odyssey 2024 Speech Emotion Recognition Challenge	Federico Costa et.al.	2406.10598	null
2024-06-14	CNVSRC 2023: The First Chinese Continuous Visual Speech Recognition Challenge	Chen Chen et.al.	2406.10313	null
2024-06-12	Improving child speech recognition with augmented child-like speech	Yuanyuan Zhang et.al.	2406.10284	null
2024-06-14	Inclusive ASR for Disfluent Speech: Cascaded Large-Scale Self-Supervised Learning with Targeted Fine-Tuning and Data Augmentation	Dena Mujtaba et.al.	2406.10177	null
2024-06-14	On the Evaluation of Speech Foundation Models for Spoken Language Understanding	Siddhant Arora et.al.	2406.10083	null
2024-06-14	Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation	Andrew Rouditchenko et.al.	2406.10082	link
2024-06-14	Simul-Whisper: Attention-Guided Streaming Whisper with Truncation Detection	Haoyu Wang et.al.	2406.10052	link
2024-06-14	ROAR: Reinforcing Original to Augmented Data Ratio Dynamics for Wav2Vec2.0 Based ASR	Vishwanath Pratap Singh et.al.	2406.09999	null
2024-06-14	An efficient text augmentation approach for contextualized Mandarin speech recognition	Naijun Zheng et.al.	2406.09950	null
2024-06-14	Perceiver-Prompt: Flexible Speaker Adaptation in Whisper for Chinese Disordered Speech Recognition	Yicong Jiang et.al.	2406.09873	null
2024-06-14	MMM: Multi-Layer Multi-Residual Multi-Stream Discrete Speech Representation from Self-supervised Learning Model	Jiatong Shi et.al.	2406.09869	null
2024-06-14	Optimizing Byte-level Representation for End-to-end ASR	Roger Hsiao et.al.	2406.09676	null
2024-06-14	Learning Language Structures through Grounding	Freda Shi et.al.	2406.09662	null
2024-06-13	Multi-Modal Retrieval For Large Language Model Based Speech Recognition	Jari Kolehmainen et.al.	2406.09618	null
2024-06-13	Speech ReaLLM -- Real-time Streaming Speech Recognition with Multimodal LLMs by Teaching the Flow of Time	Frank Seide et.al.	2406.09569	null
2024-06-13	The Second DISPLACE Challenge: DIarization of SPeaker and LAnguage in Conversational Environments	Shareef Babu Kalluri et.al.	2406.09494	null
2024-06-12	Comparative Analysis of Personalized Voice Activity Detection Systems: Assessing Real-World Effectiveness	Satyam Kumar et.al.	2406.09443	null
2024-04-13	SGPRS: Seamless GPU Partitioning Real-Time Scheduler for Periodic Deep Learning Workloads	Amir Fakhim Babaei et.al.	2406.09425	null
2024-06-13	Language Complexity and Speech Recognition Accuracy: Orthographic Complexity Hurts, Phonological Complexity Doesn't	Chihiro Taguchi et.al.	2406.09202	link
2024-06-13	LASER: Learning by Aligning Self-supervised Representations of Speech for Improving Content-related Tasks	Amit Meghanani et.al.	2406.09153	link
2024-06-13	Transcription-Free Fine-Tuning of Speech Separation Models for Noisy and Reverberant Multi-Speaker Automatic Speech Recognition	William Ravenscroft et.al.	2406.08914	null
2024-06-13	AdaPTwin: Low-Cost Adaptive Compression of Product Twins in Transformers	Emil Biju et.al.	2406.08904	null
2024-06-12	ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets	Jiatong Shi et.al.	2406.08641	null
2024-06-12	Neural Blind Source Separation and Diarization for Distant Speech Recognition	Yoshiaki Bando et.al.	2406.08396	null
2025-01-10	Towards Unsupervised Speech Recognition Without Pronunciation Models	Junrui Ni et.al.	2406.08380	null
2024-06-12	Speech Emotion Recognition with ASR Transcripts: A Comprehensive Study on Word Error Rate and Fusion Techniques	Yuanchao Li et.al.	2406.08353	link
2024-06-13	Refining Self-Supervised Learnt Speech Representation using Brain Activations	Hengyu Li et.al.	2406.08266	null
2024-06-12	Transformer-based Model for ASR N-Best Rescoring and Rewriting	Iwen E. Kang et.al.	2406.08207	null
2024-06-12	Audio-conditioned phonemic and prosodic annotation for building text-to-speech models from unlabeled speech data	Yuma Shirahata et.al.	2406.08111	null
2024-06-14	Can Large Language Models Understand Spatial Audio?	Changli Tang et.al.	2406.07914	null
2024-06-12	Guiding Frame-Level CTC Alignments Using Self-knowledge Distillation	Eungbeom Kim et.al.	2406.07909	null
2024-06-12	DualVC 3: Leveraging Language Model Generated Pseudo Context for End-to-end Low Latency Streaming Voice Conversion	Ziqian Ning et.al.	2406.07846	null
2024-06-12	Dual-Pipeline with Low-Rank Adaptation for New Language Integration in Multilingual ASR	Yerbolat Khassanov et.al.	2406.07842	null
2024-06-12	PRoDeliberation: Parallel Robust Deliberation for End-to-End Spoken Language Understanding	Trang Le et.al.	2406.07823	null
2024-06-12	PolySpeech: Exploring Unified Multitask Speech Models for Competitiveness with Single-task Models	Runyan Yang et.al.	2406.07801	null
2024-06-11	The Interspeech 2024 Challenge on Speech Processing Using Discrete Units	Xuankai Chang et.al.	2406.07725	null
2024-06-11	Tag and correct: high precision post-editing approach to correction of speech recognition errors	Tomasz Ziętkiewicz et.al.	2406.07589	null
2024-06-11	AS-70: A Mandarin stuttered speech dataset for automatic speech recognition and stuttering event detection	Rong Gong et.al.	2406.07256	null
2024-06-11	Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter	Andrei Andrusenko et.al.	2406.07096	null
2024-07-29	Spoken Language Corpora Augmentation with Domain-Specific Voice-Cloned Speech	Mateusz Czyżnikiewicz et.al.	2406.07090	null
2024-06-11	Reading Miscue Detection in Primary School through Automatic Speech Recognition	Lingyun Gao et.al.	2406.07060	null
2024-06-10	Synthetic Query Generation using Large Language Models for Virtual Assistants	Sonal Sannigrahi et.al.	2406.06729	null
2024-06-13	ASTRA: Aligning Speech and Text Representations for Asr without Sampling	Neeraj Gaur et.al.	2406.06664	null
2024-06-07	LoRA-Whisper: Parameter-Efficient and Extensible Multilingual ASR	Zheshu Song et.al.	2406.06619	null
2024-06-25	Discrete Multimodal Transformers with a Pretrained Large Language Model for Mixed-Supervision Speech Processing	Viet Anh Trinh et.al.	2406.06582	null
2024-06-10	A Parameter-efficient Language Extension Framework for Multilingual ASR	Wei Liu et.al.	2406.06329	null
2024-06-10	Prompting Large Language Models with Audio for General-Purpose Speech Summarization	Wonjune Kang et.al.	2406.05968	link
2024-07-18	Do Prompts Really Prompt? Exploring the Prompt Understanding Capability of Whisper	Chih-Kai Yang et.al.	2406.05806	null
2024-07-20	Optimizing Multi-Stuttered Speech Classification: Leveraging Whisper's Encoder for Efficient Parameter Reduction in Automated Assessment	Huma Ameer et.al.	2406.05784	null
2024-06-09	MS-HuBERT: Mitigating Pre-training and Inference Mismatch in Masked Language Modelling methods for learning Speech Representations	Hemant Yadav et.al.	2406.05661	null
2024-06-07	LLM-based speaker diarization correction: A generalizable approach	Georgios Efstathiadis et.al.	2406.04927	link
2024-07-02	Speaker-Smoothed kNN Speaker Adaptation for End-to-End ASR	Shaojun Li et.al.	2406.04791	null
2024-06-07	Pitch-Aware RNN-T for Mandarin Chinese Mispronunciation Detection and Diagnosis	Xintong Wang et.al.	2406.04595	null
2024-06-06	Flexible Multichannel Speech Enhancement for Noise-Robust Frontend	Ante Jukić et.al.	2406.04552	null
2024-06-06	Label-Synchronous Neural Transducer for E2E Simultaneous Speech Translation	Keqi Deng et.al.	2406.04541	link
2024-06-06	To Distill or Not to Distill? On the Robustness of Robust Knowledge Distillation	Abdul Waheed et.al.	2406.04512	null
2024-06-06	LipGER: Visually-Conditioned Generative Error Correction for Robust Automatic Speech Recognition	Sreyan Ghosh et.al.	2406.04432	link
2024-06-06	Beyond Performance Plateaus: A Comprehensive Study on Scalability in Speech Enhancement	Wangyou Zhang et.al.	2406.04269	link
2024-07-02	Hypernetworks for Personalizing ASR to Atypical Speech	Max Müller-Eberstein et.al.	2406.04240	null
2024-06-06	Helsinki Speech Challenge 2024	Martin Ludvigsen et.al.	2406.04123	null
2024-06-06	BLSP-Emo: Towards Empathetic Large Speech-Language Models	Chen Wang et.al.	2406.03872	link
2024-06-14	Improving Zero-Shot Chinese-English Code-Switching ASR with kNN-CTC and Gated Monolingual Datastores	Jiaming Zhou et.al.	2406.03814	null
2024-06-06	Speed of Light Exact Greedy Decoding for RNN-T Speech Recognition Models on GPU	Daniel Galvez et.al.	2406.03791	null
2024-06-11	Enhancing CTC-based speech recognition with diverse modeling units	Shiyi Han et.al.	2406.03274	null
2024-06-05	Error-preserving Automatic Speech Recognition of Young English Learners' Language	Janick Michot et.al.	2406.03235	link
2024-06-05	StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning	Shaolei Zhang et.al.	2406.03049	link
2024-06-05	4D ASR: Joint Beam Search Integrating CTC, Attention, Transducer, and Mask Predict Decoders	Yui Sudo et.al.	2406.02950	null
2024-06-15	Task Arithmetic can Mitigate Synthetic-to-Real Gap in Automatic Speech Recognition	Hsuan Su et.al.	2406.02925	null
2024-06-11	Text Injection for Neural Contextual Biasing	Zhong Meng et.al.	2406.02921	null
2024-06-04	Keyword-Guided Adaptation of Automatic Speech Recognition	Aviv Shamsian et.al.	2406.02649	null
2024-05-03	Combining X-Vectors and Bayesian Batch Active Learning: Two-Stage Active Learning Pipeline for Speech Recognition	Ognjen Kundacina et.al.	2406.02566	null
2024-05-02	Sequence-to-sequence models in peer-to-peer learning: A practical application	Robert Šajina et.al.	2406.02565	null
2024-04-29	A cost minimization approach to fix the vocabulary size in a tokenizer for an End-to-End ASR system	Sunil Kumar Kopparapu et.al.	2406.02563	null
2024-04-24	Gated Low-rank Adaptation for personalized Code-Switching Automatic Speech Recognition on the low-spec devices	Gwantae Kim et.al.	2406.02562	null
2024-04-23	Breaking Walls: Pioneering Automatic Speech Recognition for Central Kurdish: End-to-End Transformer Paradigm	Abdulhady Abas Abdullah et.al.	2406.02561	null
2024-07-18	Less Peaky and More Accurate CTC Forced Alignment by Label Priors	Ruizhe Huang et.al.	2406.02560	link
2024-03-27	PhoWhisper: Automatic Speech Recognition for Vietnamese	Thanh-Thien Le et.al.	2406.02555	link
2024-06-04	Whistle: Data-Efficient Multilingual and Crosslingual Speech Recognition via Weakly Phonetic Supervision	Saierdaer Yusuyin et.al.	2406.02166	link
2024-06-05	Efficiently Train ASR Models that Memorize Less and Perform Better with Per-core Clipping	Lun Wang et.al.	2406.02004	null
2024-06-03	Enabling ASR for Low-Resource Languages: A Comprehensive Dataset Creation Approach	Ara Yeroyan et.al.	2406.01446	null
2024-06-03	Compute-Efficient Medical Image Classification with Softmax-Free Transformers and Sequence Normalization	Firas Khader et.al.	2406.01314	null
2024-06-02	YODAS: Youtube-Oriented Dataset for Audio and Speech	Xinjian Li et.al.	2406.00899	null
2024-06-01	Wav2Prompt: End-to-End Speech Prompt Generation and Tuning For LLM in Zero and Few-shot Learning	Keqi Deng et.al.	2406.00522	null
2024-05-27	ViSpeR: Multilingual Audio-Visual Speech Recognition	Sanath Narayan et.al.	2406.00038	null
2024-05-14	Sonos Voice Control Bias Assessment Dataset: A Methodology for Demographic Bias Assessment in Voice Assistants	Chloé Sekkat et.al.	2405.19342	null
2024-05-31	Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities	Vicky Zayats et.al.	2405.18669	null
2024-05-28	Augmented Conversation with Embedded Speech-Driven On-the-Fly Referencing in AR	Shivesh Jadon et.al.	2405.18537	null
2024-05-28	Intelligent Clinical Documentation: Harnessing Generative AI for Patient-Centric Clinical Note Generation	Anjanava Biswas et.al.	2405.18346	null
2024-05-28	NUTS, NARS, and Speech	D. van der Sluis et.al.	2405.17874	null
2024-05-28	TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation	Chenyang Le et.al.	2405.17809	null
2024-05-27	Federating Dynamic Models using Early-Exit Architectures for Automatic Speech Recognition on Heterogeneous Clients	Mohamed Nabih Ali et.al.	2405.17376	null
2024-05-27	"Pass the butter": A study on desktop-classic multitasking robotic arm based on advanced YOLOv7 and BERT	Haohua Que et.al.	2405.17250	null
2024-05-27	A Variance-Preserving Interpolation Approach for Diffusion Models with Applications to Single Channel Speech Enhancement and Recognition	Zilu Guo et.al.	2405.16952	link
2024-05-24	Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition	Zijin Gu et.al.	2405.15216	null
2024-05-23	Contrastive and Consistency Learning for Neural Noisy-Channel Model in Spoken Language Understanding	Suyoung Kim et.al.	2405.15097	link
2024-06-02	Let's Fuse Step by Step: A Generative Fusion Decoding Algorithm with LLMs for Multi-modal Text Recognition	Chan-Jan Hsu et.al.	2405.14259	link
2024-05-23	Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models	Yuchen Hu et.al.	2405.14161	link
2024-05-23	A Survey on Vision-Language-Action Models for Embodied AI	Yueen Ma et.al.	2405.14093	null
2024-05-22	ST-Gait++: Leveraging spatio-temporal convolutions for gait-based emotion recognition on videos	Maria Luísa Lima et.al.	2405.13903	null
2024-09-12	Joint Optimization of Streaming and Non-Streaming Automatic Speech Recognition with Multi-Decoder and Knowledge Distillation	Muhammad Shakeel et.al.	2405.13514	null
2024-05-22	A Near-Real-Time Processing Ego Speech Filtering Pipeline Designed for Speech Interruption During Human-Robot Interaction	Yue Li et.al.	2405.13477	null
2024-05-22	You don't understand me!: Comparing ASR results for L1 and L2 speakers of Swedish	Ronald Cumbal et.al.	2405.13379	null
2024-05-22	Contextualized Automatic Speech Recognition with Dynamic Vocabulary	Yui Sudo et.al.	2405.13344	null
2024-05-28	FairLENS: Assessing Fairness in Law Enforcement Speech Recognition	Yicheng Wang et.al.	2405.13166	null
2024-05-21	Non-autoregressive real-time Accent Conversion model with voice cloning	Vladimir Nechaev et.al.	2405.13162	null
2024-05-15	Continued Pretraining for Domain Adaptation of Wav2vec2.0 in Automatic Speech Recognition for Elementary Math Classroom Settings	Ahmed Adel Attia et.al.	2405.13018	null
2024-05-12	Large Language Models for Education: A Survey	Hanyi Xu et.al.	2405.13001	null
2024-03-14	Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer	Maxime Burchi et.al.	2405.12983	null
2024-05-21	Could a Computer Architect Understand our Brain?	Valentin Puente-Varona et.al.	2405.12815	null
2024-07-01	Mamba in Speech: Towards an Alternative to Self-Attention	Xiangyu Zhang et.al.	2405.12609	null
2024-05-20	Continuous Sign Language Recognition with Adapted Conformer via Unsupervised Pretraining	Neena Aloysius et.al.	2405.12018	null
2024-05-21	Acoustic modeling for Overlapping Speech Recognition: JHU Chime-5 Challenge System	Vimal Manohar et.al.	2405.11078	null
2024-05-16	Listen Again and Choose the Right Answer: A New Paradigm for Automatic Speech Recognition with Large Language Models	Yuchen Hu et.al.	2405.10025	null
2024-05-15	No More Mumbles: Enhancing Robot Intelligibility through Speech Adaptation	Qiaoqiao Ren et.al.	2405.09708	link
2024-05-15	Towards Evaluating the Robustness of Automatic Speech Recognition Systems via Audio Style Transfer	Weifei Jin et.al.	2405.09470	null
2024-05-14	Investigating the 'Autoencoder Behavior' in Speech Self-Supervised Models: a focus on HuBERT's Pretraining	Valentin Vielzeuf et.al.	2405.08402	null
2024-05-31	SpeechVerse: A Large-scale Generalizable Audio Language Model	Nilaksh Das et.al.	2405.08295	null
2024-06-07	Rene: A Pre-trained Multi-modal Architecture for Auscultation of Respiratory Diseases	Pengfei Zhang et.al.	2405.07442	link
2024-05-12	SoccerNet-Echoes: A Soccer Game Audio Commentary Dataset	Sushant Gautam et.al.	2405.07354	link
2024-07-22	DP-DyLoRA: Fine-Tuning Transformer-Based Models On-Device under Differentially Private Federated Learning using Dynamic Low-Rank Adaptation	Jie Xu et.al.	2405.06368	null
2024-05-10	Lost in Transcription: Identifying and Quantifying the Accuracy Biases of Automatic Speech Recognition Systems Against Disfluent Speech	Dena Mujtaba et.al.	2405.06150	null
2024-07-17	Muting Whisper: A Universal Acoustic Adversarial Attack on Speech Foundation Models	Vyas Raina et.al.	2405.06134	link
2024-05-09	The RoyalFlush Automatic Speech Diarization and Recognition System for In-Car Multi-Channel Automatic Speech Recognition Challenge	Jingguang Tian et.al.	2405.05498	null
2024-05-07	Open Implementation and Study of BEST-RQ for Speech Processing	Ryan Whetten et.al.	2405.04296	link
2024-05-06	Whispy: Adapting STT Whisper Models to Real-Time Environments	Antonio Bevilacqua et.al.	2405.03484	null
2024-05-06	MMGER: Multi-modal and Multi-granularity Generative Error Correction with LLM for Joint Accent and Speech Recognition	Bingshen Mu et.al.	2405.03152	null
2024-05-11	Analysis about Theoretical Foundations for Method to Enhancing ASR Performance using OCR Word Frequency Differences	Kyudan Jung et.al.	2405.02995	null
2024-05-04	Mixat: A Data Set of Bilingual Emirati-English Speech	Maryam Al Ali et.al.	2405.02578	link
2024-05-06	Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets	Xuelong Geng et.al.	2405.02132	null
2024-05-01	Efficient Sample-Specific Encoder Perturbations	Yassir Fathullah et.al.	2405.01601	null
2024-05-02	Low-resource speech recognition and dialect identification of Irish in a multi-task framework	Liam Lonergan et.al.	2405.01293	null
2024-05-02	Improving Membership Inference in ASR Model Auditing with Perturbed Loss Features	Francisco Teixeira et.al.	2405.01207	null
2024-05-02	Deep Learning Models in Speech Recognition: Measuring GPU Energy Consumption, Impact of Noise and Model Quantization for Edge Deployment	Aditya Chakravarty et.al.	2405.01004	link
2024-05-02	Efficient Compression of Multitask Multilingual Speech Models	Thomas Palmeira Ferraz et.al.	2405.00966	null
2024-05-01	Active Learning with Task Adaptation Pre-training for Speech Emotion Recognition	Dongyuan Li et.al.	2405.00307	null
2024-07-24	Confides: A Visual Analytics Solution for Automated Speech Recognition Analysis and Exploration	Sunwoo Ha et.al.	2405.00223	null
2024-05-09	Does Whisper understand Swiss German? An automatic, qualitative, and human evaluation	Eyal Liron Dolev et.al.	2404.19310	null
2024-04-30	EfficientASR: Speech Recognition Network Compression via Attention Redundancy and Chunk-Level FFN Optimization	Jianzong Wang et.al.	2404.19214	null
2024-04-29	Towards Dog Bark Decoding: Leveraging Human Speech Processing for Automated Bark Classification	Artem Abzaliev et.al.	2404.18739	null
2024-04-26	Child Speech Recognition in Human-Robot Interaction: Problem Solved?	Ruben Janssens et.al.	2404.17394	null
2024-04-26	Automatic Speech Recognition System-Independent Word Error Rate Estimation	Chanho Park et.al.	2404.16743	null
2024-04-26	Developing Acoustic Models for Automatic Speech Recognition in Swedish	Giampiero Salvi et.al.	2404.16547	null
2024-04-25	U2++ MoE: Scaling 4.7x parameters with minimal impact on RTF	Xingchen Song et.al.	2404.16407	null
2024-04-24	Mamba-360: Survey of State Space Models as Transformer Alternative for Long Sequence Modelling: Methods, Applications, and Challenges	Badri Narayana Patro et.al.	2404.16112	link
2024-04-23	Killkan: The Automatic Speech Recognition Dataset for Kichwa with Morphosyntactic Information	Chihiro Taguchi et.al.	2404.15501	link
2024-04-18	Artificial Neural Networks to Recognize Speakers Division from Continuous Bengali Speech	Hasmot Ali et.al.	2404.15168	null
2024-04-23	Rethinking Processing Distortions: Disentangling the Impact of Speech Enhancement Errors on Speech Recognition Performance	Tsubasa Ochiai et.al.	2404.14860	null
2024-04-22	Assessment of Sign Language-Based versus Touch-Based Input for Deaf Users Interacting with Intelligent Personal Assistants	Nina Tran et.al.	2404.14605	null
2024-04-22	Exploring neural oscillations during speech perception via surrogate gradient spiking neural networks	Alexandre Bittar et.al.	2404.14024	null
2024-04-20	Semantically Corrected Amharic Automatic Speech Recognition	Samuael Adnew et.al.	2404.13362	link
2024-04-19	Learn2Talk: 3D Talking Face Learns from 2D Talking Face	Yixiang Zhuang et.al.	2404.12888	null
2024-04-19	Efficient infusion of self-supervised representations in Automatic Speech Recognition	Darshan Prabhu et.al.	2404.12628	null
2024-04-16	Teaching a Multilingual Large Language Model to Understand Multilingual Speech via Multi-Instructional Training	Pavel Denisov et.al.	2404.10922	link
2024-04-16	Anatomy of Industrial Scale Multilingual ASR	Francis McCann Ramirez et.al.	2404.09841	null
2024-04-15	Resilience of Large Language Models for Noisy Instructions	Bin Wang et.al.	2404.09754	null
2024-04-12	Comparing Apples to Oranges: LLM-powered Multimodal Intention Prediction in an Object Categorization Task	Hassan Ali et.al.	2404.08424	null
2024-07-26	Automatic Speech Recognition Advancements for Indigenous Languages of the Americas	Monica Romero et.al.	2404.08368	null
2024-04-10	An inclusive review on deep learning techniques and their scope in handwriting recognition	Sukhdeep Singh et.al.	2404.08011	null
2024-04-12	An Effective Automated Speaking Assessment Approach to Mitigating Data Scarcity and Imbalanced Distribution	Tien-Hong Lo et.al.	2404.07575	null
2024-04-12	Conformer-1: Robust ASR via Large-Scale Semisupervised Bootstrapping	Kevin Zhang et.al.	2404.07341	null
2024-03-31	Houston we have a Divergence: A Subgroup Performance Analysis of ASR Models	Alkis Koudounas et.al.	2404.07226	null
2024-04-10	The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge	Yiwei Guo et.al.	2404.06079	null
2024-05-28	VietMed: A Dataset and Benchmark for Automatic Speech Recognition of Vietnamese in the Medical Domain	Khai Le-Duc et.al.	2404.05659	link
2024-04-07	Safeguarding Voice Privacy: Harnessing Near-Ultrasonic Interference To Protect Against Unauthorized Audio Recording	Forrest McKee et.al.	2404.04769	null
2024-04-04	Transducers with Pronunciation-aware Embeddings for Automatic Speech Recognition	Hainan Xu et.al.	2404.04295	null
2024-04-03	Mai Ho'omāuna i ka 'Ai: Language Models Improve Automatic Speech Recognition in Hawaiian	Kaavya Chaparala et.al.	2404.03073	null
2024-04-03	CMULAB: An Open-Source Framework for Training and Deployment of Natural Language Processing Models	Zaid Sheikh et.al.	2404.02408	link
2024-04-02	BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory Speech Recognition	Alexandros Haliassos et.al.	2404.02098	link
2024-04-02	Noise Masking Attacks and Defenses for Pretrained Speech Models	Matthew Jagielski et.al.	2404.02052	null
2024-04-02	Kallaama: A Transcribed Speech Dataset about Agriculture in the Three Most Widely Spoken Languages in Senegal	Elodie Gauthier et.al.	2404.01991	link
2024-04-02	Transfer Learning from Whisper for Microscopic Intelligibility Prediction	Paul Best et.al.	2404.01737	null
2024-07-22	ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models	Thibaut Thonet et.al.	2403.20262	link
2024-03-28	Multi-Stage Multi-Modal Pre-Training for Automatic Speech Recognition	Yash Jain et.al.	2403.19822	null
2024-03-25	Hierarchical Recurrent Adapters for Efficient Multi-Task Adaptation of Large Speech Models	Tsendsuren Munkhdalai et.al.	2403.19709	null
2024-03-29	Emotion Neural Transducer for Fine-Grained Speech Emotion Recognition	Siyuan Shen et.al.	2403.19224	null
2024-03-28	LV-CTC: Non-autoregressive ASR with CTC and latent variable models	Yuya Fujita et.al.	2403.19207	null
2024-03-04	JEP-KD: Joint-Embedding Predictive Architecture Based Knowledge Distillation for Visual Speech Recognition	Chang Sun et.al.	2403.18843	null
2024-06-04	PhysicsAssistant: An LLM-Powered Interactive Learning Robot for Physics Lab Investigations	Ehsan Latif et.al.	2403.18721	null
2024-03-27	ZAEBUC-Spoken: A Multilingual Multidialectal Arabic-English Speech Corpus	Injy Hamed et.al.	2403.18182	null
2024-04-11	DANCER: Entity Description Augmented Named Entity Corrector for Automatic Speech Recognition	Yi-Cheng Wang et.al.	2403.17645	null
2024-03-26	Extracting Biomedical Entities from Noisy Audio Transcripts	Nima Ebadi et.al.	2403.17363	null
2024-03-25	Grammatical vs Spelling Error Correction: An Investigation into the Responsiveness of Transformer-based Language Models using BART and MarianMT	Rohit Raju et.al.	2403.16655	null
2024-03-22	Privacy-Preserving End-to-End Spoken Language Understanding	Yinggui Wang et.al.	2403.15510	null
2024-03-20	Isometric Neural Machine Translation using Phoneme Count Ratio Reward-based Reinforcement Learning	Shivam Ratnakant Mhaskar et.al.	2403.15469	null
2024-07-21	Artificial Intelligence for Cochlear Implants: Review of Strategies, Challenges, and Perspectives	Billel Essaid et.al.	2403.15442	null
2024-03-26	A Multimodal Approach to Device-Directed Speech Detection with Large Language Models	Dominik Wagner et.al.	2403.14438	null
2024-03-21	XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception	HyoJung Han et.al.	2403.14402	null
2024-06-04	M$^3$AV: A Multimodal, Multigenre, and Multipurpose Audio-Visual Academic Lecture Dataset	Zhe Chen et.al.	2403.14168	null
2024-03-20	Open Access NAO (OAN): a ROS2-based software framework for HRI applications with the NAO robot	Antonio Bono et.al.	2403.13960	null
2024-03-20	BanglaNum -- A Public Dataset for Bengali Digit Recognition from Speech	Mir Sayeed Mohammad et.al.	2403.13465	null
2024-03-20	Advanced Long-Content Speech Recognition With Factorized Neural Transducer	Xun Gong et.al.	2403.13423	null
2024-03-21	FlowerFormer: Empowering Neural Architecture Encoding using a Flow-aware Graph Transformer	Dongyeong Hwang et.al.	2403.12821	link
2024-03-19	Real-time Speech Extraction Using Spatially Regularized Independent Low-rank Matrix Analysis and Rank-constrained Spatial Covariance Matrix Estimation	Yuto Ishikawa et.al.	2403.12477	null
2024-03-18	Multimodal Human-Autonomous Agents Interaction Using Pre-Trained Language and Visual Foundation Models	Linus Nwankwo et.al.	2403.12273	null
2024-03-18	AdaMER-CTC: Connectionist Temporal Classification with Adaptive Maximum Entropy Regularization for Automatic Speech Recognition	SooHwan Eom et.al.	2403.11578	null
2024-03-16	Energy-Based Models with Applications to Speech and Language Processing	Zhijian Ou et.al.	2403.10961	null
2024-03-16	Initial Decoding with Minimally Augmented Language Model for Improved Lattice Rescoring in Low Resource ASR	Savitha Murthy et.al.	2403.10937	null
2024-03-15	Neural Networks Hear You Loud And Clear: Hearing Loss Compensation Using Deep Neural Networks	Peter Leer et.al.	2403.10420	null
2024-03-14	SpokeN-100: A Cross-Lingual Benchmarking Dataset for The Classification of Spoken Numbers in Different Languages	René Groh et.al.	2403.09753	link
2024-03-15	More than words: Advancements and challenges in speech recognition for singing	Anna Kruspe et.al.	2403.09298	null
2024-05-21	Skipformer: A Skip-and-Recover Strategy for Efficient Speech Recognition	Wenjing Zhu et.al.	2403.08258	null
2024-03-13	SpeechColab Leaderboard: An Open-Source Platform for Automatic Speech Recognition Evaluation	Jiayu Du et.al.	2403.08196	link
2024-03-13	Automatic Speech Recognition (ASR) for the Diagnosis of pronunciation of Speech Sound Disorders in Korean children	Taekyung Ahn et.al.	2403.08187	null
2024-03-12	Gujarati-English Code-Switching Speech Recognition using ensemble prediction of spoken language	Yash Sharma et.al.	2403.08011	null
2024-03-11	The evaluation of a code-switched Sepedi-English automatic speech recognition system	Amanda Phaladi et.al.	2403.07947	null
2024-03-08	Speech Robust Bench: A Robustness Benchmark For Speech Recognition	Muhammad A. Shah et.al.	2403.07937	null
2024-03-12	Beyond the Labels: Unveiling Text-Dependency in Paralinguistic Speech Recognition Datasets	Jan Pešán et.al.	2403.07767	null
2024-03-11	Real-Time Multimodal Cognitive Assistant for Emergency Medical Services	Keshara Weerasinghe et.al.	2403.06734	link
2024-03-11	Towards Decoupling Frontend Enhancement and Backend Recognition in Monaural Robust ASR	Yufeng Yang et.al.	2403.06387	null
2024-03-10	SCORE: Self-supervised Correspondence Fine-tuning for Improved Content Representations	Amit Meghanani et.al.	2403.06260	link
2025-11-04	Aligning Speech to Languages to Enhance Code-switching Speech Recognition	Hexin Liu et.al.	2403.05887	null
2024-03-02	A Cross-Modal Approach to Silent Speech with LLM-Enhanced Recognition	Tyler Benster et.al.	2403.05583	link
2024-03-07	Classist Tools: Social Class Correlates with Performance in NLP	Amanda Cercas Curry et.al.	2403.04445	null
2024-05-30	A New Benchmark for Evaluating Automatic Speech Recognition in the Arabic Call Domain	Qusai Abo Obaidah et.al.	2403.04280	null
2024-03-07	A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition	Yusheng Dai et.al.	2403.04245	link
2024-03-06	RADIA -- Radio Advertisement Detection with Intelligent Analytics	Jorge Álvarez et.al.	2403.03538	null
2024-03-13	Non-verbal information in spontaneous speech -- towards a new framework of analysis	Tirza Biron et.al.	2403.03522	null
2024-03-05	AIx Speed: Playback Speed Optimization Using Listening Comprehension of Speech Recognition Models	Kazuki Kawamura et.al.	2403.02938	null
2024-03-04	PixIT: Joint Training of Speaker Diarization and Speech Separation from Real-world Multi-speaker Recordings	Joonas Kalda et.al.	2403.02288	link
2024-03-04	What has LeBenchmark Learnt about French Syntax?	Zdravko Dugonjić et.al.	2403.02173	null
2024-12-05	EMOVOME: A Dataset for Emotion Recognition in Spontaneous Real-Life Speech	Lucía Gómez-Zaragozá et.al.	2403.02167	null
2024-03-04	SA-SOT: Speaker-Aware Serialized Output Training for Multi-Talker ASR	Zhiyun Fan et.al.	2403.02010	null
2024-03-04	Language and Speech Technology for Central Kurdish Varieties	Sina Ahmadi et.al.	2403.01983	link
2024-03-03	A Closer Look at Wav2Vec2 Embeddings for On-Device Single-Channel Speech Enhancement	Ravi Shankar et.al.	2403.01369	null
2024-04-18	Automatic Speech Recognition using Advanced Deep Learning Approaches: A survey	Hamza Kheddar et.al.	2403.01255	null
2024-03-01	Post-decoder Biasing for End-to-End Speech Recognition of Multi-turn Medical Interview	Heyang Liu et.al.	2403.00370	null
2024-02-29	Probing the Information Encoded in Neural-based Acoustic Models of Automatic Speech Recognition Systems	Quentin Raymondaud et.al.	2402.19443	null
2024-02-29	Inappropriate Pause Detection In Dysarthric Speech Using Large-Scale Speech Recognition	Jeehyun Lee et.al.	2402.18923	null
2024-06-04	Exploration of Adapter for Noise Robust Automatic Speech Recognition	Hao Shi et.al.	2402.18275	null
2024-06-19	Twists, Humps, and Pebbles: Multilingual Speech Recognition Models Exhibit Gender Performance Gaps	Giuseppe Attanasio et.al.	2402.17954	link
2024-02-27	An Effective Mixture-Of-Experts Approach For Code-Switching Speech Recognition Leveraging Encoder Disentanglement	Tzu-Ting Yang et.al.	2402.17189	null
2024-02-27	Extreme Encoder Output Frame Rate Reduction: Improving Computational Latencies of Large End-to-End Models	Rohit Prabhavalkar et.al.	2402.17184	null
2024-04-01	ArEEG_Chars: Dataset for Envisioned Speech Recognition using EEG for Arabic Characters	Hazem Darwish et.al.	2402.15733	null
2024-05-14	Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing	Jeong Hun Yeo et.al.	2402.15151	link
2024-02-22	Efficient data selection employing Semantic Similarity-based Graph Structures for model training	Roxana Petcu et.al.	2402.14888	null
2024-02-22	Wizard of Oz Experimentation for Language Technology Applications: Challenges and Tools	Stephan Schlögl et.al.	2402.14563	null
2024-02-22	HINT: High-quality INPainting Transformer with Mask-Aware Encoding and Enhanced Attention	Shuang Chen et.al.	2402.14185	link
2024-02-21	An Augmented Lagrangian Method for Training Recurrent Neural Networks	Yue Wang et.al.	2402.13687	null
2024-02-22	Mel-FullSubNet: Mel-Spectrogram Enhancement for Improving Both Speech Quality and ASR	Rui Zhou et.al.	2402.13511	null
2024-02-20	How do Hyenas deal with Human Speech? Speech Recognition and Translation with ConfHyena	Marco Gaido et.al.	2402.13208	link
2024-02-20	Not All Weights Are Created Equal: Enhancing Energy Efficiency in On-Device Streaming Speech Recognition	Yang Li et.al.	2402.13076	null
2024-02-20	Comparison of Conventional Hybrid and CTC/Attention Decoders for Continuous Visual Speech Recognition	David Gimeno-Gómez et.al.	2402.13004	null
2024-06-16	OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification	Yifan Peng et.al.	2402.12654	null
2024-02-19	Multimodal Emotion Recognition from Raw Audio with Sinc-convolution	Xiaohui Zhang et.al.	2402.11954	null
2024-02-18	Ain't Misbehavin' -- Using LLMs to Generate Expressive Robot Behavior in Conversations with the Tabletop Robot Haru	Zining Wang et.al.	2402.11571	null
2024-02-18	Cross-Attention Fusion of Visual and Geometric Features for Large Vocabulary Arabic Lipreading	Samar Daou et.al.	2402.11520	null
2024-01-04	AntiDeepFake: AI for Deep Fake Speech Recognition	Enkhtogtokh Togootogtokh et.al.	2402.10218	null
2024-02-15	A cross-talk robust multichannel VAD model for multiparty agent interactions trained using synthetic re-recordings	Hyewon Han et.al.	2402.09797	null
2024-02-14	Listening to Multi-talker Conversations: Modular and End-to-end Perspectives	Desh Raj et.al.	2402.08932	null
2024-02-14	UniEnc-CASSNAT: An Encoder-only Non-autoregressive ASR for Speech SSL Models	Ruchao Fan et.al.	2402.08898	null
2024-02-13	An Embarrassingly Simple Approach for LLM with Strong ASR Capacity	Ziyang Ma et.al.	2402.08846	link
2024-02-13	Syllable based DNN-HMM Cantonese Speech to Text System	Timothy Wong et.al.	2402.08788	null
2024-05-03	Careless Whisper: Speech-to-Text Hallucination Harms	Allison Koenecke et.al.	2402.08021	link
2024-07-26	AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension	Qian Yang et.al.	2402.07729	link
2024-02-12	The Sound of Healthcare: Improving Medical Transcription ASR Accuracy with Large Language Models	Ayo Adedeji et.al.	2402.07658	null
2024-02-12	The Balancing Act: Unmasking and Alleviating ASR Biases in Portuguese	Ajinkya Kulkarni et.al.	2402.07513	null
2024-02-13	SALAD: Smart AI Language Assistant Daily	Ragib Amin Nihal et.al.	2402.07431	null
2024-02-11	Does ChatGPT and Whisper Make Humanoid Robots More Relatable?	Xiaohui Chen et.al.	2402.07095	null
2024-02-10	DeepCover: Advancing RNN Test Coverage and Online Error Prediction using State Machine Extraction	Pouria Golshanrad et.al.	2402.06966	link
2024-02-13	CochCeps-Augment: A Novel Self-Supervised Contrastive Learning Using Cochlear Cepstrum-based Masking for Speech Emotion Recognition	Ioannis Ziogas et.al.	2402.06923	null
2024-02-09	Self-consistent context aware conformer transducer for speech recognition	Konstantin Kolokolov et.al.	2402.06592	null
2024-02-08	Unified Speech-Text Pretraining for Spoken Dialog Modeling	Heeseung Kim et.al.	2402.05706	null
2024-02-08	It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition	Chen Chen et.al.	2402.05457	null
2024-02-07	Progressive unsupervised domain adaptation for ASR using ensemble models and multi-stage training	Rehan Ahmad et.al.	2402.04805	null
2024-05-28	REBORN: Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR	Liang-Hsuan Tseng et.al.	2402.03988	link
2024-02-05	Resolving Transcription Ambiguity in Spanish: A Hybrid Acoustic-Lexical System for Punctuation Restoration	Xiliang Zhu et.al.	2402.03519	null
2024-02-05	A Comprehensive Study of the Current State-of-the-Art in Nepali Automatic Speech Recognition Systems	Rupak Raj Ghimire et.al.	2402.03050	null
2024-02-03	Predicting positive transfer for improved low-resource speech recognition using acoustic pseudo-tokens	Nay San et.al.	2402.02302	null
2024-02-02	Digits micro-model for accurate and secure transactions	Chirag Chhablani et.al.	2402.01931	null
2024-02-02	Whispering in Norwegian: Navigating Orthographic and Dialectic Challenges	Per E Kummervold et.al.	2402.01917	null
2024-02-01	Introduction to speech recognition	Gabriel Dauphin et.al.	2402.01778	null
2024-02-02	Streaming Sequence Transduction through Dynamic Compression	Weiting Tan et.al.	2402.01172	link
2024-02-05	AccentFold: A Journey through African Accents for Zero-Shot ASR Adaptation to Target Accents	Abraham Toluwase Owodunni et.al.	2402.01152	null
2024-02-01	Prosody in Cascade and Direct Speech-to-Text Translation: a case study on Korean Wh-Phrases	Giulio Zhou et.al.	2402.00632	null
2024-01-31	Exploring the limits of decoder-only models trained on public speech recognition corpora	Ankit Gupta et.al.	2402.00235	null
2024-01-31	SpeechComposer: Unifying Multiple Speech Tasks with Prompt Composition	Yihan Wu et.al.	2401.18045	null
2024-02-08	Computation and Parameter Efficient Multi-Modal Fusion Transformer for Cued Speech Recognition	Lei Liu et.al.	2401.17604	null
2024-06-16	OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer	Yifan Peng et.al.	2401.16658	null
2024-01-28	Phoneme-Based Proactive Anti-Eavesdropping with Controlled Recording Privilege	Peng Huang et.al.	2401.15704	null
2024-01-28	On Speaker Attribution with SURT	Desh Raj et.al.	2401.15676	link
2024-01-28	Byte Pair Encoding Is All You Need For Automatic Bengali Speech Recognition	Ahnaf Mozib Samin et.al.	2401.15532	null
2024-01-27	Towards Event Extraction from Speech with Contextual Clues	Jingqi Kang et.al.	2401.15385	link
2024-01-26	Comparison of parameters of vowel sounds of russian and english languages	V. I. Fedoseev et.al.	2401.14890	null
2024-01-26	Toward Practical Automatic Speech Recognition and Post-Processing: a Call for Explainable Error Benchmark Guideline	Seonmin Koo et.al.	2401.14625	null
2024-01-25	TDFNet: An Efficient Audio-Visual Speech Separation Model with Top-down Fusion	Samuel Pegg et.al.	2401.14185	link
2024-01-24	CNN architecture extraction on edge GPU	Peter Horvath et.al.	2401.13575	null
2024-03-18	SpeechDPR: End-to-End Spoken Passage Retrieval for Open-Domain Spoken Question Answering	Chyi-Jiunn Lin et.al.	2401.13463	null
2024-05-28	MF-AED-AEC: Speech Emotion Recognition by Leveraging Multimodal Fusion, Asr Error Detection, and Asr Error Correction	Jiajun He et.al.	2401.13260	null
2024-01-23	Locality enhanced dynamic biasing and sampling strategies for contextual ASR	Md Asif Jalal et.al.	2401.13146	null
2024-01-23	Multilingual and Fully Non-Autoregressive ASR with Large Language Model Fusion: A Comprehensive Study	W. Ronny Huang et.al.	2401.12789	null
2024-01-22	Consistency Based Unsupervised Self-training For ASR Personalisation	Jisi Zhang et.al.	2401.12085	null
2024-01-22	Lightweight Protection for Privacy in Offloaded Speech Understanding	Dongqi Cai et.al.	2401.11983	null
2024-01-22	Keep Decoding Parallel with Effective Knowledge Distillation from Language Models to End-to-end Speech Recognisers	Michael Hentschel et.al.	2401.11700	null
2024-06-06	Using Large Language Model for End-to-End Chinese ASR and NER	Yuang Li et.al.	2401.11382	null
2024-02-02	Word-Level ASR Quality Estimation for Efficient Corpus Sampling and Post-Editing through Analyzing Attentions of a Reference-Free Metric	Golara Javadi et.al.	2401.11268	link
2024-01-20	ConceptThread: Visualizing Threaded Concepts in MOOC Videos	Zhiguang Zhou et.al.	2401.11132	null
2024-01-19	Contextualized Automatic Speech Recognition with Attention-Based Bias Phrase Boosted Beam Search	Yui Sudo et.al.	2401.10449	null
2024-01-19	Investigating Training Strategies and Model Robustness of Low-Rank Adaptation for Language Modeling in Speech Recognition	Yu Yu et.al.	2401.10447	null
2024-01-19	Large Language Models are Efficient Learners of Noise-Robust Speech Recognition	Yuchen Hu et.al.	2401.10446	link
2024-01-18	AGADIR: Towards Array-Geometry Agnostic Directional Speech Recognition	Ju Lin et.al.	2401.10411	null
2024-01-18	Communication-Efficient Personalized Federated Learning for Speech-to-Text Tasks	Yichao Du et.al.	2401.10070	null
2024-07-18	Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation	Minsu Kim et.al.	2401.09802	null
2024-07-02	SlideAVSR: A Dataset of Paper Explanation Videos for Audio-Visual Speech Recognition	Hao Wang et.al.	2401.09759	null
2024-01-12	Transcending Controlled Environments: Assessing the Transferability of ASR-Robust NLU Models to Real-World Applications	Hania Khan et.al.	2401.09354	null
2024-01-17	On Speech Pre-emphasis as a Simple and Inexpensive Method to Boost Speech Enhancement	Iván López-Espejo et.al.	2401.09315	null
2024-01-17	Two-pass Endpoint Detection for Speech Recognition	Anirudh Raju et.al.	2401.08916	null
2024-01-16	NOTSOFAR-1 Challenge: New Datasets, Baseline, and Tasks for Distant Meeting Transcription	Alon Vinnikov et.al.	2401.08887	null
2024-01-16	Improving ASR Contextual Biasing with Guided Attention	Jiyang Tang et.al.	2401.08835	null
2024-01-16	Revisiting Self-supervised Learning of Speech Representation from a Mutual Information Perspective	Alexander H. Liu et.al.	2401.08833	null
2024-03-01	Multi-Input Multi-Output Target-Speaker Voice Activity Detection For Unified, Flexible, and Robust Audio-Visual Speaker Diarization	Ming Cheng et.al.	2401.08052	null
2024-01-15	Machine Perceptual Quality: Evaluating the Impact of Severe Lossy Compression on Audio and Image Models	Dan Jacobellis et.al.	2401.07957	link
2024-07-24	Cascaded Cross-Modal Transformer for Audio-Textual Classification	Nicolae-Catalin Ristea et.al.	2401.07575	link
2024-01-15	SeMaScore: a new evaluation metric for automatic speech recognition tasks	Zitha Sasindran et.al.	2401.07506	null
2024-01-14	Promptformer: Prompted Conformer Transducer for ASR	Sergio Duarte-Torres et.al.	2401.07360	null
2024-01-13	Joint Unsupervised and Supervised Training for Automatic Speech Recognition via Bilevel Optimization	A F M Saif et.al.	2401.06980	link
2024-01-12	XLS-R Deep Learning Model for Multilingual ASR on Low-Resource Languages: Indonesian, Javanese, and Sundanese	Panji Arisaputra et.al.	2401.06832	null
2024-02-29	The NPU-ASLP-LiAuto System Description for Visual Speech Recognition in CNVSRC 2023	He Wang et.al.	2401.06788	link
2024-01-15	Dynamic Behaviour of Connectionist Speech Recognition with Strong Latency Constraints	Giampiero Salvi et.al.	2401.06588	null
2024-01-12	LCB-net: Long-Context Biasing for Audio-Visual Speech Recognition	Fan Yu et.al.	2401.06390	link
2024-01-11	End to end Hindi to English speech conversion using Bark, mBART and a finetuned XLSR Wav2Vec2	Aniket Tathe et.al.	2401.06183	null
2024-01-11	UCorrect: An Unsupervised Framework for Automatic Speech Recognition Error Correction	Jiaxin Guo et.al.	2401.05689	null
2024-01-10	Useful Blunders: Can Automated Speech Recognition Errors Improve Downstream Dementia Classification?	Changye Li et.al.	2401.05551	null
2024-01-10	Towards Online Sign Language Recognition and Translation	Ronglai Zuo et.al.	2401.05336	link
2024-07-17	Continuously Learning New Words in Automatic Speech Recognition	Christian Huber et.al.	2401.04482	null
2024-01-08	High-precision Voice Search Query Correction via Retrievable Speech-text Embedings	Christopher Li et.al.	2401.04235	null
2024-07-22	Cross-Speaker Encoding Network for Multi-Talker Speech Recognition	Jiawen Kang et.al.	2401.04152	link
2024-01-08	Exploratory Evaluation of Speech Content Masking	Jennifer Williams et.al.	2401.03936	null
2024-03-07	An audio-quality-based multi-strategy approach for target speaker extraction in the MISP 2023 Challenge	Runduo Han et.al.	2401.03697	null
2024-06-10	LUPET: Incorporating Hierarchical Information Path into Multilingual ASR	Wei Liu et.al.	2401.03689	null
2024-01-08	BS-PLCNet: Band-split Packet Loss Concealment Network with Multi-task Learning Framework and Multi-discriminators	Zihan Zhang et.al.	2401.03687	null
2024-07-22	DiarizationLM: Speaker Diarization Post-Processing with Large Language Models	Quan Wang et.al.	2401.03506	link
2024-02-21	ICMC-ASR: The ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition Challenge	He Wang et.al.	2401.03473	null
2024-01-07	Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation	Qiushi Zhu et.al.	2401.03468	link
2024-04-08	MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition	He Wang et.al.	2401.03424	null
2024-01-06	TeLeS: Temporal Lexeme Similarity Score to Estimate Confidence in End-to-End ASR	Nagarathna Ravi et.al.	2401.03251	link
2024-01-06	Part-of-Speech Tagger for Bodo Language using Deep Learning approach	Dhrubajyoti Pathak et.al.	2401.03175	null
2024-01-05	Towards ASR Robust Spoken Language Understanding Through In-Context Learning With Word Confusion Networks	Kevin Everson et.al.	2401.02921	null
2024-01-05	Nonlinear functional regression by functional deep neural network with kernel embedding	Zhongjie Shi et.al.	2401.02890	null
2024-01-05	A unified multichannel far-field speech recognition system: combining neural beamforming with attention based end-to-end model	Dongdi Zhao et.al.	2401.02673	null
2024-01-04	Task Oriented Dialogue as a Catalyst for Self-Supervised Automatic Speech Recognition	David M. Chan et.al.	2401.02417	link
2024-01-04	CTC Blank Triggered Dynamic Layer-Skipping for Efficient CTC-based Speech Recognition	Junfeng Hou et.al.	2401.02046	null
2024-01-03	Hallucinations in Neural Automatic Speech Recognition: Identifying Errors and Hallucinatory Models	Rita Frieske et.al.	2401.01572	null
2024-06-04	The Art of Deception: Robust Backdoor Attack using Dynamic Stacking of Triggers	Orson Mengara et.al.	2401.01537	null
2024-01-01	Enhancing Pre-trained ASR System Fine-tuning for Dysarthric Speech Recognition using Adversarial Data Augmentation	Huimeng Wang et.al.	2401.00662	null
2024-05-02	Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition	Vahid Noroozi et.al.	2312.17279	null
2023-12-26	The NUS-HLT System for ICASSP2024 ICMC-ASR Grand Challenge	Meng Ge et.al.	2312.16002	null
2023-12-26	Towards Probing Contact Center Large Language Models	Varun Nathan et.al.	2312.15922	null
2023-12-24	Exploring data augmentation in bias mitigation against non-native-accented speech	Yuanyuan Zhang et.al.	2312.15499	null
2023-12-22	BLSTM-Based Confidence Estimation for End-to-End Speech Recognition	Atsunori Ogawa et.al.	2312.14609	null
2024-02-09	Multimodal Attention Merging for Improved Speech Recognition and Audio Event Classification	Anirudh S. Sundar et.al.	2312.14378	null
2024-07-22	Multi-Sentence Grounding for Long-term Instructional Video	Zeqian Li et.al.	2312.14055	null
2023-12-21	BANSpEmo: A Bangla Emotional Speech Recognition Dataset	Md Gulzar Hussain et.al.	2312.14020	null
2023-12-21	Self-Supervised Adaptive AV Fusion Module for Pre-Trained ASR Models	Christopher Simic et.al.	2312.13873	null
2024-02-03	kNN-CTC: Enhancing ASR via Retrieval of CTC Pseudo Labels	Jiaming Zhou et.al.	2312.13560	link
2025-01-14	On the Effectiveness of ASR Representations in Real-world Noisy Speech Emotion Recognition	Xiaohan Shi et.al.	2311.07093	null
2023-11-20	Decoupling and Interacting Multi-Task Learning Network for Joint Speech and Accent Recognition	Qijie Shao et.al.	2311.07062	null
2023-11-02	An analysis of large speech models-based representations for speech emotion recognition	Adrian Bogdan Stânea et.al.	2311.00394	null
2024-01-29	Generative Speech Recognition Error Correction with Large Language Models and Task-Activating Prompting	Chao-Han Huck Yang et.al.	2309.15649	null
2023-08-09	Federated Representation Learning for Automatic Speech Recognition	Guruprasad V Ramesh et.al.	2308.02013	null
2023-07-07	Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition	Guinan Li et.al.	2307.02909	null
2023-05-30	HyperConformer: Multi-head HyperMixer for Efficient Speech Recognition	Florian Mai et.al.	2305.18281	null
2023-04-24	A vector quantized masked autoencoder for speech emotion recognition	Samir Sadok et.al.	2304.11117	null
2023-03-06	DWFormer: Dynamic Window transFormer for Speech Emotion Recognition	Shuaiqi Chen et.al.	2303.01694	null
2024-11-08	Pre-Finetuning for Few-Shot Emotional Speech Recognition	Maximillian Chen et.al.	2302.12921	null
2023-03-07	A Sidecar Separator Can Convert a Single-Talker Speech Recognition System to a Multi-Talker One	Lingwei Meng et.al.	2302.09908	null
2022-11-16	Improving Children's Speech Recognition by Fine-tuning Self-supervised Adult Speech Representations	Renee Lu et.al.	2211.07769	null
2022-10-27	Pretrained audio neural networks for Speech emotion recognition in Portuguese	Marcelo Matheus Gauy et.al.	2210.14716	null
2022-04-07	What can predictive speech coders learn from speaker recognizers?	Marcos Faundez-Zanuy et.al.	2204.02400	null
2022-03-18	Speaker Adaptation Using Spectro-Temporal Deep Features for Dysarthric and Elderly Speech Recognition	Mengzhe Geng et.al.	2202.10290	null
2022-02-03	Visualizing Automatic Speech Recognition -- Means for a Better Understanding?	Karla Markert et.al.	2202.00673	null
2022-01-31	Discovering Phonetic Inventories with Crosslingual Automatic Speech Recognition	Piotr Żelasko et.al.	2201.11207	null
2021-12-22	Voice Quality and Pitch Features in Transformer-Based Speech Recognition	Guillermo Cámbara et.al.	2112.11391	null
2022-05-03	Speech Pattern based Black-box Model Watermarking for Automatic Speech Recognition	Haozhe Chen et.al.	2110.09814	null
2021-11-05	Towards efficient end-to-end speech recognition with biologically-inspired neural networks	Thomas Bohnstingl et.al.	2110.02743	null
2025-02-06	Comparison of Self-Supervised Speech Pre-Training Methods on Flemish Dutch	Jakob Poncelet et.al.	2109.14357	null
2021-07-27	Differentiable Allophone Graphs for Language-Universal Speech Recognition	Brian Yan et.al.	2107.11628	null
2021-07-06	Arabic Code-Switching Speech Recognition using Monolingual Data	Ahmed Ali et.al.	2107.01573	null
2021-07-05	Supervised Contrastive Learning for Accented Speech Recognition	Tao Han et.al.	2107.00921	null
2021-07-05	Combining Frame-Synchronous and Label-Synchronous Systems for Speech Recognition	Qiujia Li et.al.	2107.00764	null
2022-03-22	Unsupervised Automatic Speech Recognition: A Review	Hanan Aldarmaki et.al.	2106.04897	null
2021-10-05	Non-autoregressive Mandarin-English Code-switching Speech Recognition	Shun-Po Chuang et.al.	2104.02258	null
2021-02-16	Thank you for Attention: A survey on Attention-based Artificial Neural Networks for Automatic Speech Recognition	Priyabrata Karmakar et.al.	2102.07259	null
2021-02-01	BCN2BRNO: ASR System Fusion for Albayzin 2020 Speech to Text Challenge	Martin Kocour et.al.	2101.12729	null
2021-09-14	Multi-task Language Modeling for Improving Speech Recognition of Rare Words	Chao-Han Huck Yang et.al.	2011.11715	null
2020-11-13	The CUHK-TUDELFT System for The SLT 2021 Children Speech Recognition Challenge	Si-Ioi Ng et.al.	2011.06239	null
2020-11-10	Data Augmentation For Children's Speech Recognition -- The "Ethiopian" System For The SLT 2021 Children Speech Recognition Challenge	Guoguo Chen et.al.	2011.04547	null
2020-11-10	Gated Recurrent Fusion with Joint Training Framework for Robust End-to-End Speech Recognition	Cunhang Fan et.al.	2011.04249	null
2021-09-20	TutorNet: Towards Flexible Knowledge Distillation for End-to-End Speech Recognition	Ji Won Yoon et.al.	2008.00671	null
2020-10-06	CTC-Segmentation of Large Corpora for German End-to-end Speech Recognition	Ludwig Kürzinger et.al.	2007.09127	null
2020-06-04	The NTNU System at the Interspeech 2020 Non-Native Children's Speech ASR Challenge	Tien-Hong Lo et.al.	2005.08433	null
2020-04-20	How to Teach DNNs to Pay Attention to the Visual Modality in Speech Recognition	George Sterpu et.al.	2004.08250	null
2022-09-28	The Effect of Silence Feature in Dimensional Speech Emotion Recognition	Bagus Tris Atmaja et.al.	2003.01277	null
2020-03-02	A Density Ratio Approach to Language Model Fusion in End-To-End Automatic Speech Recognition	Erik McDermott et.al.	2002.11268	null
2020-01-08	Domain Adaptation via Teacher-Student Learning for End-to-End Speech Recognition	Zhong Meng et.al.	2001.01798	null
2020-01-08	Character-Aware Attention-Based End-to-End Speech Recognition	Zhong Meng et.al.	2001.01795	null
2023-05-23	Leveraging End-to-End Speech Recognition with Neural Architecture Search	Ahmed Baruwa et.al.	1912.05946	null
2019-11-21	On using 2D sequence-to-sequence models for speech recognition	Parnia Bahar et.al.	1911.08888	null
2019-11-13	Recurrent Neural Network Transducer for Audio-Visual Speech Recognition	Takaki Makino et.al.	1911.04890	null
2019-10-15	VAIS ASR: Building a conversational speech recognition system using language model combination	Quang Minh Nguyen et.al.	1910.05603	null
2020-05-08	Self-Training for End-to-End Speech Recognition	Jacob Kahn et.al.	1909.09116	null
2020-03-17	Advancing Speech Recognition With No Speech Or With Noisy Speech	Gautam Krishna et.al.	1906.08871	null
2019-04-26	Phonetically-Oriented Word Error Alignment for Speech Recognition Error Analysis in Speech Translation	Nicholas Ruiz et.al.	1904.11024	null
2019-07-10	End-to-End Visual Speech Recognition for Small-Scale Datasets	Stavros Petridis et.al.	1904.01954	null
2020-01-01	A Convolutional Neural Network model based on Neutrosophy for Noisy Speech Recognition	Elyas Rashno et.al.	1901.10629	null
2018-11-20	Analysis of DNN Speech Signal Enhancement for Robust Speaker Recognition	Ondrej Novotny et.al.	1811.07629	null
2018-11-13	Reinforcement Learning Based Speech Enhancement for Robust Speech Recognition	Yih-Liang Shen et.al.	1811.04224	null
2023-05-15	End-to-end Audiovisual Speech Activity Detection with Bimodal Recurrent Neural Models	Fei Tao et.al.	1809.04553	null
2018-09-13	Isolated and Ensemble Audio Preprocessing Methods for Detecting Adversarial Examples against Automatic Speech Recognition	Krishan Rajaratnam et.al.	1809.04397	null
2018-07-04	Exploring End-to-End Techniques for Low-Resource Speech Recognition	Vladimir Bataev et.al.	1807.00868	null
2018-05-29	Automatic context window composition for distant speech recognition	Mirco Ravanelli et.al.	1805.10498	null
2022-03-17	Curriculum Learning for Speech Emotion Recognition from Crowdsourced Labels	Reza Lotfian et.al.	1805.10339	null
2018-04-27	End-to-End Multimodal Speech Recognition	Shruti Palaskar et.al.	1804.09713	null
2018-10-17	Deep Long Short-Term Memory Adaptive Beamforming Networks For Multichannel Robust Speech Recognition	Zhong Meng et.al.	1711.08016	null
2019-05-01	Unsupervised Adaptation with Domain Separation Networks for Robust Speech Recognition	Zhong Meng et.al.	1711.08010	null
2018-02-23	BridgeNets: Student-Teacher Transfer Learning Based on Recursive Neural Networks and its Application to Distant Speech Recognition	Jaeyoung Kim et.al.	1710.10224	null
2018-06-29	Combining Multiple Views for Visual Speech Recognition	Marina Zimmermann et.al.	1710.07168	null
2018-04-26	Visual speech recognition: aligning terminologies for better understanding	Helen L Bear et.al.	1710.01292	null
2018-04-26	Resolution limits on visual speech recognition	Helen L. Bear et.al.	1710.01073	null
2017-09-01	Leveraging Deep Neural Network Activation Entropy to cope with Unseen Data in Speech Recognition	Vikramjit Mitra et.al.	1708.09516	null
2018-12-06	Single-Channel Multi-talker Speech Recognition with Permutation Invariant Training	Yanmin Qian et.al.	1707.06527	null
2017-11-16	Rank-1 Constrained Multichannel Wiener Filter for Speech Recognition in Noisy Environments	Ziteng Wang et.al.	1707.00201	null
2017-04-27	Towards Estimating the Upper Bound of Visual-Speech Recognition: The Visual Lip-Reading Feasibility Database	Adriana Fernandez-Lopez et.al.	1704.08028	null
2016-12-07	Invariant Representations for Noisy Speech Recognition	Dmitriy Serdyuk et.al.	1612.01928	null
2017-08-08	Robust coherence-based spectral enhancement for speech recognition in adverse real-world environments	Hendrik Barfuss et.al.	1604.03393	null
2015-09-25	Noise-Robust ASR for the third 'CHiME' Challenge Exploiting Time-Frequency Masking based Multi-Channel Speech Enhancement and Recurrent Neural Network	Zaihu Pang et.al.	1509.07211	null
2014-09-05	Visual Speech Recognition	Ahmad B. A. Hassanat et.al.	1409.1411	null
2014-02-12	Modified SPLICE and its Extension to Non-Stereo Data for Noise Robust Speech Recognition	D. S. Pavan Kumar et.al.	1307.4048	null

(back to top)

TTS

Publish Date	Title	Authors	PDF	Code
2026-03-05	Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection	Junchuan Zhao et.al.	2603.05373	null
2026-03-05	Measuring the Redundancy of Decoder Layers in SpeechLLMs	Adel Moumen et.al.	2603.05121	null
2026-03-04	ZeSTA: Zero-Shot TTS Augmentation with Domain-Conditioned Training for Data-Efficient Personalized Speech Synthesis	Youngwon Choi et.al.	2603.04219	null
2026-03-04	VietNormalizer: An Open-Source, Dependency-Free Python Library for Vietnamese Text Normalization in TTS and NLP Applications	Hung Vu Nguyen et.al.	2603.04145	null
2026-03-02	More Data, Fewer Diacritics: Scaling Arabic TTS	Ahmed Musleh et.al.	2603.01622	null
2026-03-02	End-to-End Simultaneous Dysarthric Speech Reconstruction with Frame-Level Adaptor and Multiple Wait-k Knowledge Distillation	Minghui Wu et.al.	2603.01382	null
2026-03-02	DARS: Dysarthria-Aware Rhythm-Style Synthesis for ASR Enhancement	Minghui Wu et.al.	2603.01369	null
2026-03-01	S-VoCAL: A Dataset and Evaluation Framework for Inferring Speaking Voice Character Attributes in Literature	Abigail Berthe-Pardo et.al.	2603.00958	null
2026-02-26	Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems	Siyuan Liu et.al.	2602.23266	null
2026-02-26	TADA: A Generative Framework for Speech Modeling via Text-Acoustic Dual Alignment	Trung Dang et.al.	2602.23068	null
2026-03-03	Scalable Multilingual Multimodal Machine Translation with Speech-Text Fusion	Yexing Du et.al.	2602.21646	null
2026-02-25	The Design Space of Tri-Modal Masked Diffusion Models	Louis Bethune et.al.	2602.21472	null
2026-02-23	Can You Tell It's AI? Human Perception of Synthetic Voices in Vishing Scenarios	Zoha Hayat Bhatti et.al.	2602.20061	null
2026-02-23	CTC-TTS: LLM-based dual-streaming text-to-speech with CTC alignment	Hanwen Liu et.al.	2602.19574	null
2026-02-19	CC-G2PnP: Streaming Grapheme-to-Phoneme and prosody with Conformer-CTC for unsegmented languages	Yuma Shirahata et.al.	2602.17157	null
2026-02-13	Speech to Speech Synthesis for Voice Impersonation	Bjorn Johnson et.al.	2602.16721	null
2026-02-18	How to Label Resynthesized Audio: The Dual Role of Neural Audio Codecs in Audio Deepfake Detection	Yixuan Xiao et.al.	2602.16343	null
2026-02-17	LLM-to-Speech: A Synthetic Data Pipeline for Training Dialectal Text-to-Speech Models	Ahmed Khaled Khamis et.al.	2602.15675	null
2026-03-03	UniTAF: A Modular Framework for Joint Text-to-Speech and Audio-to-Face Modeling	Qiangong Zhou et.al.	2602.15651	null
2026-02-16	Disentangling Pitch and Creak for Speaker Identity Preservation in Speech Synthesis	Frederik Rautenberg et.al.	2602.14686	null
2026-02-16	Probing Human Articulatory Constraints in End-to-End TTS with Reverse and Mismatched Speech-Text Directions	Parth Khadse et.al.	2602.14664	null
2026-02-14	ELEAT-SAGA: Early & Late Integration with Evading Alternating Training for Spoof-Robust Speaker Verification	Amro Asali et.al.	2602.13761	null
2026-02-13	PISHYAR: A Socially Intelligent Smart Cane for Indoor Social Navigation and Multimodal Human-Robot Interaction for Visually Impaired People	Mahdi Haghighat Joo et.al.	2602.12597	null
2026-02-16	"Sorry, I Didn't Catch That": How Speech Models Miss What Matters Most	Kaitlyn Zhou et.al.	2602.12249	null
2026-02-19	When Audio-LLMs Don't Listen: A Cross-Linguistic Study of Modality Arbitration	Jayadev Billa et.al.	2602.11488	null
2026-02-12	SLD-L2S: Hierarchical Subspace Latent Diffusion for High-Fidelity Lip to Speech Synthesis	Yifan Liang et.al.	2602.11477	null
2026-02-11	Calliope: A TTS-based Narrated E-book Creator Ensuring Exact Synchronization, Privacy, and Layout Fidelity	Hugo L. Hammer et.al.	2602.10735	null
2026-02-10	Emotion-Coherent Speech Data Augmentation and Self-Supervised Contrastive Style Training for Enhancing Kids's Story Speech Synthesis	Raymond Chung et.al.	2602.10164	null
2026-02-10	Covo-Audio Technical Report	Wenfu Wang et.al.	2602.09823	null
2026-02-10	TVTSyn: Content-Synchronous Time-Varying Timbre for Streaming Voice Conversion and Anonymization	Waris Quamer et.al.	2602.09389	null
2026-02-03	DSFlow: Dual Supervision and Step-Aware Architecture for One-Step Flow Matching Speech Synthesis	Bin Lin et.al.	2602.09041	null
2026-02-19	Prototype-Based Disentanglement for Controllable Dysarthric Speech Synthesis	Haoshen Wang et.al.	2602.08696	null
2026-02-08	SoulX-Singer: Towards High-Quality Zero-Shot Singing Voice Synthesis	Jiale Qian et.al.	2602.07803	null
2026-01-14	PersonaPlex: Voice and Role Control for Full Duplex Conversational Speech Models	Rajarshi Roy et.al.	2602.06053	null
2026-02-05	ARCHI-TTS: A flow-matching-based Text-to-Speech Model with Self-supervised Semantic Aligner and Accelerated Inference	Chunyat Wu et.al.	2602.05207	null
2026-02-04	HoliAntiSpoof: Audio LLM for Holistic Speech Anti-Spoofing	Xuenan Xu et.al.	2602.04535	null
2026-02-04	PFluxTTS: Hybrid Flow-Matching TTS with Robust Cross-Lingual Voice Cloning and Inference-Time Model Fusion	Vikentii Pankov et.al.	2602.04160	null
2026-02-03	CoCoEmo: Composable and Controllable Human-Like Emotional TTS via Activation Steering	Siyi Wang et.al.	2602.03420	null
2026-03-02	WAXAL: A Large-Scale Multilingual African Language Speech Corpus	Abdoulaye Diack et.al.	2602.02734	null
2026-02-01	VividVoice: A Unified Framework for Scene-Aware Visually-Driven Speech Synthesis	Chengyuan Ma et.al.	2602.02591	null
2026-02-02	LipSody: Lip-to-Speech Synthesis with Enhanced Prosody Consistency	Jaejun Lee et.al.	2602.01908	null
2026-02-01	EmoAra: Emotion-Preserving English Speech Transcription and Cross-Lingual Translation with Arabic Text-to-Speech	Besher Hassan et.al.	2602.01170	null
2026-02-01	Bias in the Ear of the Listener: Assessing Sensitivity in Audio Language Models Across Linguistic, Demographic, and Positional Variations	Sheng-Lun Wei et.al.	2602.01030	null
2026-01-31	Edit Content, Preserve Acoustics: Imperceptible Text-Based Speech Editing via Self-Consistency Rewards	Yong Ren et.al.	2602.00560	null
2026-01-30	Multi-Speaker Conversational Audio Deepfake: Taxonomy, Dataset and Pilot Study	Alabi Ahmed et.al.	2602.00295	null
2026-01-30	Now You Hear Me: Audio Narrative Attacks Against Large Audio-Language Models	Ye Yu et.al.	2601.23255	null
2026-01-30	EmoShift: Lightweight Activation Steering for Enhanced Emotion-Aware Speech Synthesis	Li Zhou et.al.	2601.22873	null
2026-01-30	Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability	Yong Ren et.al.	2601.22661	null
2026-01-29	Speech Quality-Based Localization of Low-Quality Speech and Text-to-Speech Synthesis Artefacts	Michael Kuhlmann et.al.	2601.21886	null
2026-01-28	Audio Deepfake Detection in the Age of Advanced Text-to-Speech models	Robin Singh et.al.	2601.20510	null
2026-01-28	Erasing Your Voice Before It's Heard: Training-free Speaker Unlearning for Zero-shot Text-to-Speech	Myungjin Lee et.al.	2601.20481	null
2026-01-29	Unit-Based Agent for Semi-Cascaded Full-Duplex Dialogue Systems	Haoyuan Yu et.al.	2601.20230	null
2026-01-27	T-Mimi: A Transformer-based Mimi Decoder for Real-Time On-Phone TTS	Haibin Wu et.al.	2601.20094	null
2026-01-27	Phonological Tokenizer: Prosody-Aware Phonetic Token via Multi-Objective Fine-Tuning with Differentiable K-Means	Kentaro Onda et.al.	2601.19781	null
2026-01-26	Neural Multi-Speaker Voice Cloning for Nepali in Low-Resource Settings	Aayush M. Shrestha et.al.	2601.18694	null
2026-01-26	UrgentMOS: Unified Multi-Metric and Preference Learning for Robust Speech Quality Assessment	Wei Wang et.al.	2601.18438	null
2026-01-25	Quran-MD: A Fine-Grained Multilingual Multimodal Dataset of the Quran	Muhammad Umar Salman et.al.	2601.17880	null
2026-01-23	SonoEdit: Null-Space Constrained Knowledge Editing for Pronunciation Correction in LLM-Based TTS	Ayush Pratap Singh et.al.	2601.17086	null
2026-01-16	AI-based System for Transforming text and sound to Educational Videos	M. E. ElAlami et.al.	2601.17022	null
2026-01-16	ES4R: Speech Encoding Based on Prepositive Affective Modeling for Empathetic Response Generation	Zhuoyue Gao et.al.	2601.16225	null
2026-01-22	Timbre-Aware LLM-based Direct Speech-to-Speech Translation Extendable to Multiple Language Pairs	Lalaram Arya et.al.	2601.16023	null
2026-01-22	Qwen3-TTS Technical Report	Hangrui Hu et.al.	2601.15621	link
2026-01-22	DeepASMR: LLM-Based Zero-Shot ASMR Speech Generation for Anyone of Any Voice	Leying Zhang et.al.	2601.15596	null
2026-01-20	Prosody-Guided Harmonic Attention for Phase-Coherent Neural Vocoding in the Complex Spectrum	Mohammed Salah Al-Radhi et.al.	2601.14472	null
2026-01-28	Quantifying Speaker Embedding Phonological Rule Interactions in Accented Speech Synthesis	Thanathai Lertpetchpun et.al.	2601.14417	null
2026-01-20	Habibi: Laying the Open-Source Foundation of Unified-Dialectal Arabic Speech Synthesis	Yushen Chen et.al.	2601.13802	null
2026-01-19	Lombard Speech Synthesis for Any Voice with Controllable Style Embeddings	Seymanur Akti et.al.	2601.12966	null
2026-01-18	A Unified Neural Codec Language Model for Selective Editable Text to Speech Generation	Hanchen Pei et.al.	2601.12480	null
2026-01-18	ParaMETA: Towards Learning Disentangled Paralinguistic Speaking Styles Representations from Speech	Haowei Lou et.al.	2601.12289	null
2026-01-18	Confidence-based Filtering for Speech Dataset Curation with Generative Speech Enhancement Using Discrete Tokens	Kazuki Yamauchi et.al.	2601.12254	null
2026-01-16	WenetSpeech-Wu: Datasets, Benchmarks, and Models for a Unified Chinese Wu Dialect Speech Processing Ecosystem	Chengyou Wang et.al.	2601.11027	null
2026-01-15	Unifying Speech Recognition, Synthesis and Conversion with Autoregressive Transformers	Runyuan Cai et.al.	2601.10770	null
2026-01-20	VoiceSculptor: Your Voice, Designed By You	Jingbin Hu et.al.	2601.10629	null
2026-01-15	STEAMROLLER: A Multi-Agent System for Inclusive Automatic Speech Recognition for People who Stutter	Ziqi Xu et.al.	2601.10223	null
2026-01-13	Decoding Order Matters in Autoregressive Speech Synthesis	Minghui Zhao et.al.	2601.08450	null
2026-01-13	Detecting Mental Manipulation in Speech via Synthetic Multi-Speaker Dialogue	Run Chen et.al.	2601.08342	null
2026-03-02	FOCAL: A Novel Benchmarking Technique for Multi-modal Agents	Anupam Purwar et.al.	2601.07367	null
2026-02-05	ESDD2: Environment-Aware Speech and Sound Deepfake Detection Challenge Evaluation Plan	Xueping Zhang et.al.	2601.07303	null
2026-01-10	Lightweight Resolution-Aware Audio Deepfake Detection via Cross-Scale Attention and Consistency Learning	K. A. Shahriar et.al.	2601.06560	null
2026-01-09	Pantagruel: Unified Self-Supervised Encoders for French Text and Speech	Phuong-Hang Le et.al.	2601.05911	null
2026-01-14	Afri-MCQA: Multimodal Cultural Question Answering for African Languages	Atnafu Lambebo Tonja et.al.	2601.05699	null
2026-01-09	SPAM: Style Prompt Adherence Metric for Prompt-based TTS	Chanhee Cho et.al.	2601.05554	null
2026-01-08	CosyEdit: Unlocking End-to-End Speech Editing Capability from Zero-Shot Text-to-Speech Models	Junyang Chen et.al.	2601.05329	null
2026-01-08	FlexiVoice: Enabling Flexible Style Control in Zero-Shot TTS with Natural Language Instructions	Dekun Chen et.al.	2601.04656	null
2026-01-08	LLMs-Integrated Automatic Hate Speech Recognition Using Controllable Text Generation Models	Ryutaro Oshima et.al.	2601.04654	null
2026-01-09	IndexTTS 2.5 Technical Report	Yunpei Li et.al.	2601.03888	null
2026-01-14	Stuttering-Aware Automatic Speech Recognition for Indonesian Language	Fadhil Muhammad et.al.	2601.03727	null
2026-01-07	Domain Adaptation of the Pyannote Diarization Pipeline for Conversational Indonesian Audio	Muhammad Daffa'i Rafi Prasetyo et.al.	2601.03684	null
2026-01-07	ReStyle-TTS: Relative and Continuous Style Control for Zero-Shot Speech Synthesis	Haitao Li et.al.	2601.03632	null
2026-01-06	Tigrinya Number Verbalization: Rules, Algorithm, and Implementation	Fitsum Gaim et.al.	2601.03403	null
2026-01-06	Segment-Aware Conditioning for Training-Free Intra-Utterance Emotion and Duration Control in Text-to-Speech	Qifan Liang et.al.	2601.03170	null
2026-01-24	XLSR-MamBo: Scaling the Hybrid Mamba-Attention Backbone for Audio Deepfake Detection	Kwok-Ho Ng et.al.	2601.02944	null
2026-01-06	Vulnerabilities of Audio-Based Biometric Authentication Systems Against Deepfake Speech Synthesis	Mengze Hong et.al.	2601.02914	null
2026-01-06	Vclip: Face-based Speaker Generation by Face-voice Association Learning	Yao Shi et.al.	2601.02753	null
2026-01-05	VocalBridge: Latent Diffusion-Bridge Purification for Defeating Perturbation-Based Voiceprint Defenses	Maryam Abbasihafshejani et.al.	2601.02444	null
2026-01-05	Towards Prosodically Informed Mizo TTS without Explicit Tone Markings	Abhijit Mohanta et.al.	2601.02073	null
2026-01-08	MM-Sonate: Multimodal Controllable Audio-Video Generation with Zero-Shot Voice Cloning	Chunyu Qiang et.al.	2601.01568	null
2026-01-04	OV-InstructTTS: Towards Open-Vocabulary Instruct Text-to-Speech	Yong Ren et.al.	2601.01459	null
2026-01-02	Improving Code-Switching Speech Recognition with TTS Data Augmentation	Yue Heng Yeo et.al.	2601.00935	null
2026-01-01	DepFlow: Disentangled Speech Generation to Mitigate Semantic Bias in Depression Detection	Yuxin Li et.al.	2601.00303	null
2025-12-29	AI4Reading: Chinese Audiobook Interpretation System Based on Multi-Agent Collaboration	Minjiang Huang et.al.	2512.23300	null
2025-12-27	ManchuTTS: Towards High-Quality Manchu Speech Synthesis via Flow Matching and Hierarchical Text Representation	Suhua Wang et.al.	2512.22491	null
2025-12-25	Zero-Shot to Zero-Lies: Detecting Bengali Deepfake Audio through Transfer Learning	Most. Sharmin Sultana Samu et.al.	2512.21702	null
2026-01-20	Fun-Audio-Chat Technical Report	Tongyi Fun Team et.al.	2512.20156	link
2025-12-21	Smark: A Watermark for Text-to-Speech Diffusion Models via Discrete Wavelet Transform	Yichuan Zhang et.al.	2512.18791	null
2025-12-21	Task Vector in TTS: Toward Emotionally Expressive Dialectal Speech Synthesis	Pengchao Feng et.al.	2512.18699	null
2025-12-19	Training Text-to-Speech Model with Purely Synthetic Data: Feasibility, Sensitivity, and Generalization Capability	Tingxiao Zhou et.al.	2512.17356	null
2025-12-19	Robust TTS Training via Self-Purifying Flow Matching for the WildSpoof 2026 TTS Track	June Young Yi et.al.	2512.17293	null
2025-12-24	Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs	Sara Papi et.al.	2512.16378	link
2025-12-16	Adapting Speech Language Model to Singing Voice Synthesis	Yiwen Zhao et.al.	2512.14657	null
2025-12-16	GLM-TTS Technical Report	Jiayan Cui et.al.	2512.14291	link
2025-12-18	A stylometric analysis of speaker attribution from speech transcripts	Cristina Aggazzotti et.al.	2512.13667	null
2025-12-15	Reproducing and Dissecting Denoising Language Models for Speech Recognition	Dorian Koch et.al.	2512.13576	null
2026-01-04	DisCo-Speech: Controllable Zero-Shot Speech Generation with A Disentangled Speech Codec	Tao Li et.al.	2512.13251	null
2025-12-11	CompanionCast: A Multi-Agent Conversational AI Framework with Spatial Audio for Social Co-Viewing Experiences	Yiyang Wang et.al.	2512.10918	null
2025-12-10	DMP-TTS: Disentangled multi-modal Prompting for Controllable Text-to-Speech with Chained Guidance	Kang Yin et.al.	2512.09504	null
2025-12-09	LG Uplus System with Multi-Speaker IDs and Discriminator-based Sub-Judges for the WildSpoof Challenge	Jinyoung Park et.al.	2512.09000	null
2025-12-08	Beyond Unified Models: A Service-Oriented Approach to Low Latency, Context Aware Phonemization for Real Time TTS	Mahta Fetrat et.al.	2512.08006	link
2025-12-06	Sanvaad: A Multimodal Accessibility Framework for ISL Recognition and Voice-Based Interaction	Kush Revankar et.al.	2512.06485	null
2025-12-05	SEA-SafeguardBench: Evaluating AI Safety in SEA Languages and Cultures	Panuthep Tasawong et.al.	2512.05501	null
2025-11-23	SyncVoice: Towards Video Dubbing with Vision-Augmented Pretrained TTS Model	Kaidi Wang et.al.	2512.05126	null
2025-12-04	HiPPO: Exploring A Novel Hierarchical Pronunciation Assessment Approach for Spoken Languages	Bi-Cheng Yan et.al.	2512.04964	null
2025-12-04	M3-TTS: Multi-modal DiT Alignment & Mel-latent for Zero-shot High-fidelity Speech Synthesis	Xiaopeng Wang et.al.	2512.04720	null
2026-01-26	RRPO: Robust Reward Policy Optimization for LLM-based Emotional TTS	Cong Wang et.al.	2512.04552	null
2025-12-02	How to DP-fy Your Data: A Practical Guide to Generating Synthetic Data With Differential Privacy	Natalia Ponomareva et.al.	2512.03238	null
2025-12-02	BOOM: Beyond Only One Modality KIT's Multimodal Multilingual Lecture Companion	Sai Koneru et.al.	2512.02817	null
2025-12-02	Hear What Matters! Text-conditioned Selective Video-to-Audio Generation	Junwon Lee et.al.	2512.02650	null
2025-12-02	Spoken Conversational Agents with Large Language Models	Chao-Han Huck Yang et.al.	2512.02593	null
2025-12-01	MCAT: Scaling Many-to-Many Speech-to-Text Translation with MLLMs to 70 Languages	Yexing Du et.al.	2512.01512	null
2025-12-01	fMRI2GES: Co-speech Gesture Reconstruction from fMRI Signal with Dual Brain Decoding Alignment	Chunzheng Zhu et.al.	2512.01189	null
2025-11-30	Arabic TTS with FastPitch: Reproducible Baselines, Adversarial Training, and Oversmoothing Analysis	Lars Nippert et.al.	2512.00937	null
2025-12-03	STCTS: Generative Semantic Compression for Ultra-Low Bitrate Speech via Explicit Text-Prosody-Timbre Decomposition	Siyu Wang et.al.	2512.00451	null
2025-11-28	OmniFusion: Simultaneous Multilingual Multimodal Translations via Modular Fusion	Sai Koneru et.al.	2512.00234	link
2025-11-28	CoordSpeaker: Exploiting Gesture Captioning for Coordinated Caption-Empowered Co-Speech Gesture Generation	Fengyi Fang et.al.	2511.22863	null
2025-11-27	PURE Codec: Progressive Unfolding of Residual Entropy for Speech Codec Learning	Jiatong Shi et.al.	2511.22687	null
2025-11-27	Joint Speech and Text Training for LLM-Based End-to-End Spoken Dialogue State Tracking	Katia Vendrame et.al.	2511.22503	null
2025-11-27	GLA-Grad++: An Improved Griffin-Lim Guided Diffusion Model for Speech Synthesis	Teysir Baoueb et.al.	2511.22293	null
2025-11-27	VSpeechLM: A Visual Speech Language Model for Visual Text-to-Speech Task	Yuyue Wang et.al.	2511.22229	null
2025-11-27	Layover or Direct Flight: Rethinking Audio-Guided Image Segmentation	Joel Alberto Santos et.al.	2511.22025	null
2025-11-26	Advancing Marine Bioacoustics with Deep Generative Models: A Hybrid Augmentation Strategy for Southern Resident Killer Whale Detection	Bruno Padovese et.al.	2511.21872	null
2025-12-05	Decoding inner speech with an end-to-end brain-to-text neural interface	Yizi Zhang et.al.	2511.21740	null
2025-11-26	Voice, Bias, and Coreference: An Interpretability Study of Gender in Speech Translation	Lina Conti et.al.	2511.21517	null
2025-11-26	Multi-Reward GRPO for Stable and Prosodic Single-Codebook TTS LLMs at Scale	Yicheng Zhong et.al.	2511.21270	null
2025-11-26	RosettaSpeech: Zero-Shot Speech-to-Speech Translation from Monolingual Data	Zhisheng Zheng et.al.	2511.20974	null
2025-12-24	SingingSDS: A Singing-Capable Spoken Dialogue System for Conversational Roleplay Applications	Jionghao Han et.al.	2511.20972	link
2025-11-25	Continual Audio Deepfake Detection via Universal Adversarial Perturbation	Wangjie Li et.al.	2511.19974	null
2025-11-25	It Hears, It Sees too: Multi-Modal LLM for Depression Detection By Integrating Visual Understanding into Audio Language Models	Xiangyu Zhao et.al.	2511.19877	null
2025-11-24	Dynamic Multi-Species Bird Soundscape Generation with Acoustic Patterning and 3D Spatialization	Ellie L. Zhang et.al.	2511.19275	null
2025-11-24	Context-Aware Whisper for Arabic ASR Under Linguistic Varieties	Bashar Talafha et.al.	2511.18774	null
2025-12-03	First Deep Learning Approach to Hammering Acoustics for Stem Stability Assessment in Total Hip Arthroplasty	Dongqi Zhu et.al.	2511.18725	null
2025-11-24	AIRHILT: A Human-in-the-Loop Testbed for Multimodal Conflict Detection in Aviation	Omar Garib et.al.	2511.18718	null
2025-11-23	InstructAudio: Unified speech and music generation with natural language instruction	Chunyu Qiang et.al.	2511.18487	null
2025-11-23	A Multimodal Conversational Agent for Tabular Data Analysis	Mohammad Nour Al Awad et.al.	2511.18405	null
2025-11-22	A superpersuasive autonomous policy debating system	Allen Roush et.al.	2511.17854	null
2025-11-12	Speech Recognition Model Improves Text-to-Speech Synthesis using Fine-Grained Reward	Guansu Wang et.al.	2511.17555	null
2025-11-21	AI in Music and Sound: Pedagogical Reflections, Post-Structuralist Approaches and Creative Outcomes in Seminar Practice	Guilherme Coelho et.al.	2511.17425	null
2025-11-21	Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM	Chiori Hori et.al.	2511.17335	null
2025-11-20	Revisiting Audio-language Pretraining for Learning General-purpose Audio Representation	Wei-Cheng Tseng et.al.	2511.16757	null
2025-11-20	Codec2Vec: Self-Supervised Speech Representation Learning Using Neural Speech Codecs	Wei-Cheng Tseng et.al.	2511.16639	null
2025-11-20	SceneGuard: Training-Time Voice Protection with Scene-Consistent Audible Background Noise	Rui Sang et.al.	2511.16114	null
2025-11-26	Step-Audio-R1 Technical Report	Fei Tian et.al.	2511.15848	link
2025-11-24	PresentCoach: Dual-Agent Presentation Coaching through Exemplars and Interactive Feedback	Sirui Chen et.al.	2511.15253	null
2025-11-18	Quality-Controlled Multimodal Emotion Recognition in Conversations with Identity-Based Transfer Learning and MAMBA Fusion	Zanxu Wang et.al.	2511.14969	null
2025-11-18	Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech	Nam-Gyu Kim et.al.	2511.14824	null
2025-11-06	The Impact of Prosodic Segmentation on Speech Synthesis of Spontaneous Speech	Julio Cesar Galdino et.al.	2511.14779	null
2025-11-18	Ground Truth Generation for Multilingual Historical NLP using LLMs	Clovis Gladstone et.al.	2511.14688	null
2025-11-18	TTA: Transcribe, Translate and Alignment for Cross-lingual Speech Representation	Wei Liu et.al.	2511.14410	null
2025-11-19	StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model	Yifan Yang et.al.	2511.14223	null
2025-11-20	FxSearcher: gradient-free text-driven audio transformation	Hojoon Ki et.al.	2511.14138	null
2025-11-17	Human-centric Maintenance Process Through Integration of AI, Speech, and AR	Parul Khanna et.al.	2511.13918	null
2025-11-26	Passive Dementia Screening via Facial Temporal Micro-Dynamics Analysis of In-the-Wild Talking-Head Video	Filippo Cenacchi et.al.	2511.13802	null
2025-11-17	Computational Measurement of Political Positions: A Review of Text-Based Ideal Point Estimation Algorithms	Patrick Parschan et.al.	2511.13238	null
2025-11-24	FoleyBench: A Benchmark For Video-to-Audio Models	Satvik Dixit et.al.	2511.13219	null
2025-11-17	Distinguishing Repetition Disfluency from Morphological Reduplication in Bangla ASR Transcripts: A Novel Corpus and Benchmarking Analysis	Zaara Zabeen Arpa et.al.	2511.13159	null
2025-11-17	A Smart-Glasses for Emergency Medical Services via Multimodal Multitask Learning	Liuyi Jin et.al.	2511.13078	null
2025-11-16	Improving Direct Persian-English Speech-to-Speech Translation with Discrete Units and Synthetic Parallel Data	Sina Rashidi et.al.	2511.12690	null
2025-11-16	Hi-Reco: High-Fidelity Real-Time Conversational Digital Humans	Hongbin Huang et.al.	2511.12662	null
2025-11-23	Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data	Yunxin Li et.al.	2511.12609	link
2025-11-16	DenseAnnotate: Enabling Scalable Dense Caption Collection for Images and 3D Scenes via Spoken Descriptions	Xiaoyu Lin et.al.	2511.12452	null
2025-11-15	VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing	Zhisheng Zheng et.al.	2511.12347	null
2025-11-15	Fusionista2.0: Efficiency Retrieval System for Large-Scale Datasets	Huy M. Le et.al.	2511.12255	null
2025-10-27	TimeStampEval: A Simple LLM Eval and a Little Fuzzy Matching Trick to Improve Search Accuracy	James McCammon et.al.	2511.11594	null
2025-11-14	Language-Aided State Estimation	Yuki Miyoshi et.al.	2511.11285	null
2025-11-14	CLARITY: Contextual Linguistic Adaptation and Accent Retrieval for Dual-Bias Mitigation in Text-to-Speech Generation	Crystal Min Hui Poon et.al.	2511.11104	null
2025-11-14	Synthetic Voices, Real Threats: Evaluating Large Text-to-Speech Models in Generating Harmful Audio	Guangke Chen et.al.	2511.10913	null
2025-11-13	Curved Worlds, Clear Boundaries: Generalizing Speech Deepfake Detection using Hyperbolic and Spherical Geometry Spaces	Farhan Sheth et.al.	2511.10793	null
2025-11-12	Do AI Voices Learn Social Nuances? A Case of Politeness and Speech Rate	Eyal Rabin et.al.	2511.10693	null
2025-11-12	StyleBreak: Revealing Alignment Vulnerabilities in Large Audio-Language Models via Style-Aware Audio Jailbreak	Hongyi Li et.al.	2511.10692	null
2025-11-09	Towards Fine-Grained Code-Switch Speech Translation with Semantic Space Alignment	Yan Gao et.al.	2511.10670	null
2025-11-13	VocalNet-M2: Advancing Low-Latency Spoken Language Modeling via Integrated Multi-Codebook Tokenization and Multi-Token Prediction	Yuhao Wang et.al.	2511.10232	null
2025-11-14	Speech-Audio Compositional Attacks on Multimodal LLMs and Their Mitigation with SALMONN-Guard	Yudong Yang et.al.	2511.10222	null
2025-11-13	FabasedVC: Enhancing Voice Conversion with Text Modality Fusion and Phoneme-Level SSL Features	Wenyu Wang et.al.	2511.10112	null
2025-11-13	Time-Layer Adaptive Alignment for Speaker Similarity in Flow-Matching Based Zero-Shot TTS	Haoyu Li et.al.	2511.09995	null
2025-11-12	End-to-end Contrastive Language-Speech Pretraining Model For Long-form Spoken Question Answering	Jiliang Hu et.al.	2511.09282	null
2025-11-12	POTSA: A Cross-Lingual Speech Alignment Framework for Low Resource Speech-to-Text Translation	Xuanchen Li et.al.	2511.09232	null
2025-11-01	Retrieval-Augmented Generation of Pediatric Speech-Language Pathology vignettes: A Proof-of-Concept Study	Yilan Liu et.al.	2511.08600	null
2025-11-11	ParliaBench: An Evaluation and Benchmarking Framework for LLM-Generated Parliamentary Speech	Marios Koniaris et.al.	2511.08247	null
2025-11-11	State of the Art in Text Classification for South Slavic Languages: Fine-Tuning or Prompting?	Taja Kuzman Pungeršek et.al.	2511.07989	null
2025-11-30	SpeechJudge: Towards Human-Level Judgment for Speech Naturalness	Xueyao Zhang et.al.	2511.07931	null
2025-11-24	SynTTS-Commands: A Public Dataset for On-Device KWS via TTS-Synthesized Multilingual Speech	Lu Gan et.al.	2511.07821	link
2025-11-10	Conditional Diffusion as Latent Constraints for Controllable Symbolic Music Generation	Matteo Pettenó et.al.	2511.07156	null
2025-11-10	Generating Novel and Realistic Speakers for Voice Conversion	Meiying Melissa Chen et.al.	2511.07135	null
2025-11-10	On the Joint Minimization of Regularization Loss Functions in Deep Variational Bayesian Methods for Attribute-Controlled Symbolic Music Generation	Matteo Pettenó et.al.	2511.07118	null
2025-11-10	E2E-VGuard: Adversarial Prevention for Production LLM-based End-To-End Speech Synthesis	Zhisheng Zhang et.al.	2511.07099	null
2025-11-10	MedVoiceBias: A Controlled Study of Audio LLM Behavior in Clinical Decision-Making	Zhi Rui Tam et.al.	2511.06592	null
2025-11-09	SAR-LM: Symbolic Audio Reasoning with Large Language Models	Termeh Taheri et.al.	2511.06483	null
2025-11-18	TalkSketch: Multimodal Generative AI for Real-time Sketch Ideation with Speech	Weiyan Shi et.al.	2511.05817	null
2025-11-07	Shared Latent Representation for Joint Text-to-Audio-Visual Synthesis	Dogucan Yaman et.al.	2511.05432	null
2025-11-07	Synthesizing speech with selected perceptual voice qualities - A case study with creaky voice	Frederik Rautenberg et.al.	2511.05143	null
2025-11-06	PromptSep: Generative Audio Separation via Multimodal Prompting	Yutong Wen et.al.	2511.04623	null
2025-11-19	Step-Audio-EditX Technical Report	Chao Yan et.al.	2511.03601	link
2025-11-05	Seeing What You Say: Expressive Image Generation from Speech	Jiyoung Lee et.al.	2511.03423	null
2025-11-05	TASU: Text-Only Alignment for Speech Understanding	Jing Peng et.al.	2511.03310	null
2025-11-11	How to Evaluate Speech Translation with Source-Aware Neural MT Metrics	Mauro Cettolo et.al.	2511.03295	null
2025-11-05	PolyNorm: Few-Shot LLM-Based Text Normalization for Text-to-Speech	Michel Wong et.al.	2511.03080	null
2025-11-04	Augmenting Open-Vocabulary Dysarthric Speech Assessment with Human Perceptual Supervision	Kaimeng Jia et.al.	2511.02270	null
2025-11-03	Toward Objective and Interpretable Prosody Evaluation in Text-to-Speech: A Linguistically Motivated Approach	Cedric Chan et.al.	2511.02104	null
2025-11-03	SeaLLMs-Audio: Large Audio-Language Models for Southeast Asia	Chaoqun Liu et.al.	2511.01670	null
2025-11-03	Speech-DRAME: A Framework for Human-Aligned Benchmarks in Speech Role-Play	Jiatong Shi et.al.	2511.01261	null
2025-11-28	LongCat-Flash-Omni Technical Report	Meituan LongCat Team et.al.	2511.00279	null
2025-10-31	Reconstructing Unseen Sentences from Speech-related Biosignals for Open-vocabulary Neural Communication	Deok-Seon Kim et.al.	2510.27247	null
2025-10-30	UniTok-Audio: A Unified Audio Generation Framework via Generative Modeling on Discrete Codec Tokens	Chengwei Liu et.al.	2510.26372	null
2025-10-30	SP-MCQA: Evaluating Intelligibility of TTS Beyond the Word Level	Hitomi Jin Ling Tee et.al.	2510.26190	null
2025-10-30	ALMGuard: Safety Shortcuts and Where to Find Them as Guardrails for Audio-Language Models	Weifei Jin et.al.	2510.26096	null
2025-10-27	SFMS-ALR: Script-First Multilingual Speech Synthesis with Adaptive Locale Resolution	Dharma Teja Donepudi et.al.	2510.25178	null
2025-10-29	Explainable Disentanglement on Discrete Speech Representations for Noise-Robust ASR	Shreyas Gopal et.al.	2510.25150	null
2025-10-30	Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech	Pedro Corrêa et.al.	2510.25054	null
2025-10-28	POWSM: A Phonetic Open Whisper-Style Speech Foundation Model	Chin-Jou Li et.al.	2510.24992	null
2025-11-25	Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation	Inclusion AI et.al.	2510.24821	null
2025-11-28	STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence	Zihan Liu et.al.	2510.24693	link
2025-10-28	Levée d'ambiguïtés par grammaires locales	Eric G. C. Laporte et.al.	2510.24530	null
2025-10-28	Bayesian Speech synthesizers Can Learn from Multiple Teachers	Ziyang Zhang et.al.	2510.24372	null
2025-10-28	Abjad AI at NADI 2025: CATT-Whisper: Multimodal Diacritic Restoration Using Text and Speech Representations	Ahmad Ghannam et.al.	2510.24247	null
2025-10-28	V-SAT: Video Subtitle Annotation Tool	Arpita Kundu et.al.	2510.24180	null
2025-10-30	TeleEgo: Benchmarking Egocentric AI Assistants in the Wild	Jiaqi Yan et.al.	2510.23981	null
2025-10-28	emg2speech: synthesizing speech from electromyography using self-supervised speech models	Harshavardhana T. Gowda et.al.	2510.23969	null
2025-10-27	AfriMTEB and AfriE5: Benchmarking and Adapting Text Embedding Models for African Languages	Kosei Uemura et.al.	2510.23896	null
2025-11-01	RoboOmni: Proactive Robot Manipulation in Omni-modal Context	Siyin Wang et.al.	2510.23763	link
2025-10-28	SoulX-Podcast: Towards Realistic Long-form Podcasts with Dialectal and Paralinguistic Diversity	Hanke Xie et.al.	2510.23541	null
2025-10-29	Are ASR foundation models generalized enough to capture features of regional dialects for low-resource languages?	Tawsif Tashwar Dipto et.al.	2510.23252	null
2025-10-27	Flexing in 73 Languages: A Single Small Model for Multilingual Inflection	Tomáš Sourada et.al.	2510.23114	null
2025-10-27	Adapting Speech Foundation Models with Large Language Models for Unified Speech Recognition	Jing-Xuan Zhang et.al.	2510.22961	null
2025-10-30	DiffRhythm 2: Efficient and High Fidelity Song Generation via Block Flow Matching	Yuepeng Jiang et.al.	2510.22950	null
2025-10-26	UltraVoice: Scaling Fine-Grained Style-Controlled Speech Conversations for Spoken Dialogue Models	Wenming Tu et.al.	2510.22588	link
2025-10-25	M-CIF: Multi-Scale Alignment For CIF-Based Non-Autoregressive ASR	Ruixiang Mao et.al.	2510.22172	null
2025-10-23	GuitarFlow: Realistic Electric Guitar Synthesis From Tablatures via Flow Matching and Style Transfer	Jackson Loth et.al.	2510.21872	null
2025-10-24	Foley Control: Aligning a Frozen Latent Text-to-Audio Model to Video	Ciara Rowles et.al.	2510.21581	null
2025-10-24	SindBERT, the Sailor: Charting the Seas of Turkish NLP	Raphael Scheible-Schmitt et.al.	2510.21364	null
2025-10-30	Elementary, My Dear Watson: Non-Invasive Neural Keyword Spotting in the LibriBrain Dataset	Gereon Elvers et.al.	2510.21038	null
2025-10-27	ReFESS-QI: Reference-Free Evaluation For Speech Separation With Joint Quality And Intelligibility Scoring	Ari Frummer et.al.	2510.21014	null
2025-11-13	Can Current Detectors Catch Face-to-Voice Deepfake Attacks?	Nguyen Linh Bao Nguyen et.al.	2510.21004	null
2025-10-22	Data-Centric Lessons To Improve Speech-Language Pretraining	Vishaal Udandarao et.al.	2510.20860	null
2025-10-23	CantoNLU: A benchmark for Cantonese natural language understanding	Junghyun Min et.al.	2510.20670	null
2025-10-23	Speaking Clearly: A Simplified Whisper-Based Codec for Low-Bitrate Speech Coding	Xin Zhang et.al.	2510.20504	null
2025-10-23	Vox-Evaluator: Enhancing Stability and Fidelity for Zero-shot TTS with A Multi-Level Evaluator	Hualei Wang et.al.	2510.20210	null
2025-10-23	Are Stereotypes Leading LLMs' Zero-Shot Stance Detection?	Anthony Dubreuil et.al.	2510.20154	null
2025-10-23	SpeechAgent: An End-to-End Mobile Infrastructure for Speech Impairment Assistance	Haowei Lou et.al.	2510.20113	null
2025-10-22	OmniMotion-X: Versatile Multimodal Whole-Body Motion Generation	Guowei Xu et.al.	2510.19789	null
2025-10-23	Adapting Multilingual Models to Code-Mixed Tasks via Model Merging	Prashant Kodali et.al.	2510.19782	null
2025-10-22	Style Attack Disguise: When Fonts Become a Camouflage for Adversarial Intent	Yangshijie Zhang et.al.	2510.19641	null
2025-10-22	Re-evaluating Minimum Bayes Risk Decoding for Automatic Speech Recognition	Yuu Jinnai et.al.	2510.19471	null
2025-10-22	EchoFake: A Replay-Aware Dataset for Practical Speech Deepfake Detection	Tong Zhang et.al.	2510.19414	null
2025-10-22	SONAR-SLT: Multilingual Sign Language Translation via Language-Agnostic Sentence Embedding Supervision	Yasser Hamidullah et.al.	2510.19398	null
2025-10-22	M3-SLU: Evaluating Speaker-Attributed Reasoning in Multimodal Large Language Models	Yejin Kwon et.al.	2510.19358	null
2025-10-22	Modeling Turn-Taking with Semantically Informed Gestures	Varsha Suresh et.al.	2510.19350	null
2025-10-22	Slot Filling as a Reasoning Task for SpeechLLMs	Kadri Hacioglu et.al.	2510.19326	null
2025-10-21	Steering Autoregressive Music Generation with Recursive Feature Machines	Daniel Zhao et.al.	2510.19127	null
2025-11-07	Re:Member: Emotional Question Generation from Personal Memories	Zackary Rackauckas et.al.	2510.19030	null
2025-11-05	StutterZero and StutterFormer: End-to-End Speech Conversion for Stuttering Transcription and Correction	Qianheng Xu et.al.	2510.18938	null
2025-10-21	ProLAP: Probabilistic Language-Audio Pre-Training	Toranosuke Manabe et.al.	2510.18423	null
2025-10-21	KrishokBondhu: A Retrieval-Augmented Voice-Based Agricultural Advisory Call Center for Bengali Farmers	Mohd Ruhul Ameen et.al.	2510.18355	null
2025-10-21	ParaStyleTTS: Toward Efficient and Robust Paralinguistic Style Control for Expressive Text-to-Speech Generation	Haowei Lou et.al.	2510.18308	link
2025-10-20	SARSteer: Safeguarding Large Audio Language Models via Safe-Ablated Refusal Steering	Weilin Lin et.al.	2510.17633	null
2025-10-20	ImaGGen: Zero-Shot Generation of Co-Speech Semantic Gestures Grounded in Language and Image Input	Hendric Voss et.al.	2510.17617	null
2025-10-20	Addressing Antisocial Behavior in Multi-Party Dialogs Through Multimodal Representation Learning	Hajar Bakarou et.al.	2510.17289	null
2025-10-19	Investigating Safety Vulnerabilities of Large Audio-Language Models Under Speaker Emotional Variations	Bo-Han Feng et.al.	2510.16893	link
2025-12-14	SAC: Neural Speech Codec with Semantic-Acoustic Dual-Stream Quantization	Wenxi Chen et.al.	2510.16841	link
2025-10-19	End-to-end Listen, Look, Speak and Act	Siyin Wang et.al.	2510.16756	null
2025-10-19	U-Codec: Ultra Low Frame-rate Neural Speech Codec for Fast High-fidelity Speech Generation	Xusheng Yang et.al.	2510.16718	null
2025-10-19	Zero- and One-Shot Data Augmentation for Sentence-Level Dysarthric Speech Recognition in Constrained Scenarios	Shiyao Wang et.al.	2510.16700	null
2025-10-18	Edge-Based Speech Transcription and Synthesis for Kinyarwanda and Swahili Languages	Pacome Simon Mbonimpa et.al.	2510.16497	null
2025-10-18	Probing the Hidden Talent of ASR Foundation Models for L2 English Oral Assessment	Fu-An Chao et.al.	2510.16387	null
2025-10-17	AsyncVoice Agent: Real-Time Explanation for LLM Planning and Reasoning	Yueqian Lin et.al.	2510.16156	null
2025-10-17	Leveraging LLMs for Context-Aware Implicit Textual and Multimodal Hate Speech Detection	Joshua Wolfe Brook et.al.	2510.15685	null
2025-10-17	SpikeVox: Towards Energy-Efficient Speech Therapy Framework with Spike-driven Generative Language Models	Rachmad Vidya Wicaksana Putra et.al.	2510.15566	null
2025-10-17	Extending Audio Context for Long-Form Understanding in Large Audio-Language Models	Yuatyong Chaichana et.al.	2510.15231	null
2025-10-17	LongCat-Audio-Codec: An Audio Tokenizer and Detokenizer Solution Designed for Speech Large Language Models	Xiaohan Zhao et.al.	2510.15227	null
2025-10-16	OmniMotion: Multimodal Motion Generation with Continuous Masked Autoregression	Zhe Li et.al.	2510.14954	null
2025-10-16	TASLA: Text-Aligned Speech Tokens with Multiple Layer-Aggregation	Ming-Hao Hsu et.al.	2510.14934	null
2025-10-16	TRI-DEP: A Trimodal Comparative Study for Depression Detection Using Speech, Text, and EEG	Annisaa Fitri Nurfidausi et.al.	2510.14922	null
2025-10-16	RLAIF-SPA: Optimizing LLM-based Emotional Speech Synthesis via RLAIF	Qing Yang et.al.	2510.14628	null
2025-10-15	Closing the Gap Between Text and Speech Understanding in LLMs	Santiago Cuervo et.al.	2510.13632	null
2025-10-15	Mismatch Aware Guidance for Robust Emotion Control in Auto-Regressive TTS Models	Yizhou Peng et.al.	2510.13293	null
2025-10-23	Continuous-Token Diffusion for Speaker-Referenced TTS in Multimodal LLMs	Xinlu He et.al.	2510.12995	null
2025-10-15	DiSTAR: Diffusion over a Scalable Token Autoregressive Representation for Speech Generation	Yakun Song et.al.	2510.12210	null
2025-10-14	Understanding the Modality Gap: An Empirical Study on the Speech-Text Alignment Mechanism of Large Speech Language Models	Bajian Xiang et.al.	2510.12116	null
2025-10-13	BridgeCode: A Dual Speech Representation Paradigm for Autoregressive Zero-Shot Text-to-Speech Synthesis	Jingyuan Xing et.al.	2510.11646	null
2025-10-13	Perturbation Self-Supervised Representations for Cross-Lingual Emotion TTS: Stage-Wise Modeling of Emotion and Speaker	Cheng Gong et.al.	2510.11124	null
2025-10-14	ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis	Mohammad Javad Ranjbar Kalahroodi et.al.	2510.10774	null
2025-10-12	End-to-end Speech Recognition with similar length speech and text	Peng Fan et.al.	2510.10453	null
2025-10-17	MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations	Wenxiang Guo et.al.	2510.10396	link
2025-10-10	O_O-VC: Synthetic Data-Driven One-to-One Alignment for Any-to-Any Voice Conversion	Huu Tuong Tu et.al.	2510.09061	link
2025-10-09	DialoSpeech: Dual-Speaker Dialogue Generation with LLM and Flow Matching	Hanke Xie et.al.	2510.08373	null
2025-10-09	IntMeanFlow: Few-step Speech Generation with Integral Velocity Distillation	Wei Wang et.al.	2510.07979	null
2025-10-09	Standard-to-Dialect Transfer Trends Differ across Text and Speech: A Case Study on Intent and Topic Classification in German Dialects	Verena Blaschke et.al.	2510.07890	null
2025-10-08	Making Machines Sound Sarcastic: LLM-Enhanced and Retrieval-Guided Sarcastic Speech Synthesis	Zhu Li et.al.	2510.07096	null
2025-10-08	Towards Responsible Evaluation for Text-to-Speech	Yifan Yang et.al.	2510.06927	null
2025-10-08	XLSR-Kanformer: A KAN-Intergrated model for Synthetic Speech Detection	Phuong Tuan Dat et.al.	2510.06706	null
2025-10-07	EverydayMMQA: A Multilingual and Multimodal Framework for Culturally Grounded Spoken Visual QA	Firoj Alam et.al.	2510.06371	null
2025-10-08	TokenChain: A Discrete Speech Chain via Semantic Token Modeling	Mingxuan Wang et.al.	2510.06201	null
2025-10-07	Latent Speech-Text Transformer	Yen-Ju Lu et.al.	2510.06195	null
2025-10-07	ECTSpeech: Enhancing Efficient Speech Synthesis via Easy Consistency Tuning	Tao Zhu et.al.	2510.05984	null
2025-10-07	Data-efficient Targeted Token-level Preference Optimization for LLM-based Text-to-Speech	Rikuto Kotoge et.al.	2510.05799	null
2025-10-07	Sparse deepfake detection promotes better disentanglement	Antoine Teissier et.al.	2510.05696	null
2025-10-09	Paper2Video: Automatic Video Generation from Scientific Papers	Zeyu Zhu et.al.	2510.05096	link
2025-10-06	Speak, Edit, Repeat: High-Fidelity Voice Editing and Zero-Shot TTS with Cross-Attentive Mamba	Baher Mohammad et.al.	2510.04738	null
2025-11-20	UniVoice: Unifying Autoregressive ASR and Flow-Matching based TTS with Large Language Models	Wenhao Guan et.al.	2510.04593	link
2025-10-07	Synthetic Audio Forensics Evaluation (SAFE) Challenge	Kirill Trapeznikov et.al.	2510.03387	null
2025-10-03	Flamed-TTS: Flow Matching Attention-Free Models for Efficient Generating and Dynamic Pacing Zero-shot Text-to-Speech	Hieu-Nghia Huynh-Nguyen et.al.	2510.02848	null
2025-09-26	KAME: Tandem Architecture for Enhancing Knowledge in Real-Time Speech-to-Speech Conversational AI	So Kuroki et.al.	2510.02327	null
2025-09-24	SpeechCT-CLIP: Distilling Text-Image Knowledge to Speech for Voice-Native Multimodal CT Analysis	Lukas Buess et.al.	2510.02322	null
2025-10-02	Emotional Text-To-Speech Based on Mutual-Information-Guided Emotion-Timbre Disentanglement	Jianing Yang et.al.	2510.01722	null
2025-09-30	BatonVoice: An Operationalist Framework for Enhancing Controllable Speech Synthesis with Linguistic Intelligence from LLMs	Yue Wang et.al.	2509.26514	link
2025-09-30	Optimizing Speech Language Models for Acoustic Consistency	Morteza Rohanian et.al.	2509.26276	null
2025-09-30	HiStyle: Hierarchical Style Embedding Predictor for Text-Prompt-Guided Controllable Speech Synthesis	Ziyu Zhang et.al.	2509.25842	null
2025-09-30	LTA-L2S: Lexical Tone-Aware Lip-to-Speech Synthesis for Mandarin with Cross-Lingual Transfer Learning	Kang Yang et.al.	2509.25670	null
2025-09-29	Emotion-Aligned Generation in Diffusion Text to Speech Models via Preference-Guided Optimization	Jiacheng Shi et.al.	2509.25416	null
2025-09-29	MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech	Chengyao Wang et.al.	2509.25131	link
2025-09-30	VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning	Xin Cheng et.al.	2509.24773	null
2025-09-29	VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning	Yixuan Zhou et.al.	2509.24650	null
2025-09-29	Word-Level Emotional Expression Control in Zero-Shot Text-to-Speech Synthesis	Tianrui Wang et.al.	2509.24629	null
2025-09-29	ISSE: An Instruction-Guided Speech Style Editing Dataset And Benchmark	Yun Chen et.al.	2509.24570	null
2025-09-29	UniFlow-Audio: Unified Flow Matching for Audio Generation from Omni-Modalities	Xuenan Xu et.al.	2509.24391	link
2025-09-28	Generalizable Speech Deepfake Detection via Information Bottleneck Enhanced Adversarial Alignment	Pu Huang et.al.	2509.23618	null
2025-09-27	BFA: Real-time Multilingual Text-to-speech Forced Alignment	Abdul Rehman et.al.	2509.23147	null
2025-09-26	ArFake: A Multi-Dialect Benchmark and Baselines for Arabic Spoof-Speech Detection	Mohamed Maged et.al.	2509.22808	null
2025-09-25	DiaMoE-TTS: A Unified IPA-Based Dialect TTS Framework with Mixture-of-Experts and Parameter-Efficient Zero-Shot Adaptation	Ziqi Chen et.al.	2509.22727	null
2025-09-26	Semantic-VAE: Semantic-Alignment Latent Representation for Better Speech Synthesis	Zhikang Niu et.al.	2509.22167	null
2025-09-26	Speaker Anonymisation for Speech-based Suicide Risk Detection	Ziyun Cui et.al.	2509.22148	null
2025-09-26	Comprehend and Talk: Text to Speech Synthesis via Dual Language Modeling	Junjie Cao et.al.	2509.22062	null
2025-09-26	Align2Speak: Improving TTS for Low Resource Languages via ASR-Guided Online Preference Optimization	Shehzeen Hussain et.al.	2509.21718	null
2025-09-25	UniSS: Unified Expressive Speech-to-Speech Translation with Your Voice	Sitong Cheng et.al.	2509.21144	null
2025-09-27	i-LAVA: Insights on Low Latency Voice-2-Voice Architecture for Agents	Anupam Purwar et.al.	2509.20971	null
2025-09-26	SPADE: Structured Pruning and Adaptive Distillation for Efficient LLM-TTS	Tan Dat Nguyen et.al.	2509.20802	null
2025-09-24	Objective Evaluation of Prosody and Intelligibility in Speech Synthesis via Conditional Prediction of Discrete Tokens	Ismail Rasim Ulgen et.al.	2509.20485	null
2025-09-20	Beyond Global Emotion: Fine-Grained Emotional Speech Synthesis with Dynamic Word-Level Modulation	Sirui Wang et.al.	2509.20378	null
2025-09-24	OLaPh: Optimal Language Phonemizer	Johannes Wirth et.al.	2509.20086	null
2025-09-25	Measuring Prosody Diversity in Zero-Shot TTS: A New Metric, Benchmark, and Exploration	Yifan Yang et.al.	2509.19928	null
2025-09-24	CoMelSinger: Discrete Token-Based Zero-Shot Singing Synthesis With Structured Melody Control and Guidance	Junchuan Zhao et.al.	2509.19883	null
2025-09-24	Eliminating stability hallucinations in llm-based tts models via attention guidance	ShiMing Wang et.al.	2509.19852	null
2025-09-24	Efficient Speech Watermarking for Speech Synthesis via Progressive Knowledge Distillation	Yang Cui et.al.	2509.19812	null
2025-09-24	PART: Progressive Alignment Representation Training for Multilingual Speech-To-Text with LLMs	Pei Zhang et.al.	2509.19745	null
2025-09-24	Selective Classifier-free Guidance for Zero-shot Text-to-speech	John Zheng et.al.	2509.19668	null
2025-09-23	HD-PPT: Hierarchical Decoding of Content- and Prompt-Preference Tokens for Instruction-based TTS	Sihang Nie et.al.	2509.19001	null
2025-09-23	Direct Preference Optimization for Speech Autoregressive Diffusion Models	Zhijun Liu et.al.	2509.18928	null
2025-09-23	Group Relative Policy Optimization for Text-to-Speech with Large Language Models	Chang Liu et.al.	2509.18798	null
2025-09-23	Explore the Reinforcement Learning for the LLM based ASR and TTS system	Changfeng Gao et.al.	2509.18569	null
2025-09-23	No Verifiable Reward for Prosody: Toward Preference-Guided Prosody Learning in TTS	Seungyoun Shin et.al.	2509.18531	null
2025-10-13	Discrete-Time Diffusion-Like Models for Speech Synthesis	Xiaozhou Tan et.al.	2509.18470	null
2025-09-22	TMD-TTS: A Unified Tibetan Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation	Yutong Liu et.al.	2509.18060	null
2025-09-22	Nord-Parl-TTS: Finnish and Swedish TTS Dataset from Parliament Speech	Zirui Li et.al.	2509.17988	null
2025-09-22	Audiobook-CC: Controllable Long-context Speech Generation for Multicast Audiobook	Min Liu et.al.	2509.17516	null
2025-09-29	Sidon: Fast and Robust Open-Source Multilingual Speech Restoration for Large-scale Dataset Cleansing	Wataru Nakata et.al.	2509.17052	link
2025-09-21	Bridging the gap between training and inference in LM-based TTS models	Ruonan Zhang et.al.	2509.17021	null
2025-09-21	MBCodec: Thorough disentangle for high-fidelity audio compression	Ruonan Zhang et.al.	2509.17006	null
2025-09-19	Fed-PISA: Federated Voice Cloning via Personalized Identity-Style Adaptation	Qi Wang et.al.	2509.16010	null
2025-09-19	VoXtream: Full-Stream Text-to-Speech with Extremely Low Latency	Nikita Torgashov et.al.	2509.15969	link
2025-09-19	Deep Dubbing: End-to-End Auto-Audiobook System with Text-to-Timbre and Context-Aware Instruct-TTS	Ziqi Dai et.al.	2509.15845	null
2025-09-19	LibriTTS-VI: A Public Corpus and Novel Methods for Efficient Voice Impression Control	Junki Ohmura et.al.	2509.15626	null
2025-09-19	Beyond Video-to-SFX: Video to Audio Synthesis with Environmentally Aware Speech	Xinlei Niu et.al.	2509.15492	null
2025-09-18	A Novel Semantic Compression Approach for Ultra-low Bandwidth Voice Communication	Ryan Collette et.al.	2509.15462	null
2025-09-23	Frustratingly Easy Data Augmentation for Low-Resource ASR	Katsumi Ibaraki et.al.	2509.15373	null
2025-09-18	Emotion-Aware Speech Generation with Character-Specific Voices for Comics	Zhiwen Qian et.al.	2509.15253	null
2025-09-18	Real-Time Streaming Mel Vocoding with Generative Flow Matching	Simon Welker et.al.	2509.15085	null
2025-09-18	MELA-TTS: Joint transformer-diffusion model with representation alignment for speech synthesis	Keyu An et.al.	2509.14784	null
2025-09-19	DAIEN-TTS: Disentangled Audio Infilling for Environment-Aware Text-to-Speech Synthesis	Ye-Xin Lu et.al.	2509.14684	null
2025-09-18	Stochastic Clock Attention for Aligning Continuous and Ordered Sequences	Hyungjoon Soh et.al.	2509.14678	null
2025-09-20	Cross-Lingual F5-TTS: Towards Language-Agnostic Voice Cloning and Speech Synthesis	Qingyu Liu et.al.	2509.14579	null
2025-09-17	SpeechOp: Inference-Time Task Composition for Generative Speech Processing	Justin Lovelace et.al.	2509.14298	null
2025-10-01	SpeechWeave: Diverse Multilingual Synthetic Text & Audio Data Generation Pipeline for Training Text to Speech Models	Karan Dua et.al.	2509.14270	null
2025-09-17	CS-FLEURS: A Massively Multilingual and Code-Switched Speech Dataset	Brian Yan et.al.	2509.14161	null
2025-09-22	Do You Hear What I Mean? Quantifying the Instruction-Perception Gap in Instruction-Guided Expressive Text-To-Speech Systems	Yi-Cheng Lin et.al.	2509.13989	null
2025-10-15	MSR-Codec: A Low-Bitrate Multi-Stream Residual Codec for High-Fidelity Speech Generation with Information Disentanglement	Jingyu Li et.al.	2509.13068	null
2025-09-16	A Lightweight Pipeline for Noisy Speech Voice Cloning and Accurate Lip Sync Synthesis	Javeria Amir et.al.	2509.12831	null
2025-10-16	Preservation of Language Understanding Capabilities in Speech-aware Large Language Models	Marek Kubis et.al.	2509.12171	null
2025-09-29	FuseCodec: Semantic-Contextual Fusion and Supervision for Neural Codecs	Md Mubtasim Ahasan et.al.	2509.11425	null
2025-09-14	Length-Aware Rotary Position Embedding for Text-Speech Alignment	Hyeongju Kim et.al.	2509.11084	null
2025-09-12	WhisTLE: Deeply Supervised, Text-Only Domain Adaptation for Pretrained Speech Recognition Transformers	Akshat Pandey et.al.	2509.10452	null
2025-09-12	Towards Data Drift Monitoring for Speech Deepfake Detection in the context of MLOps	Xin Wang et.al.	2509.10086	null
2025-09-11	DiTReducio: A Training-Free Acceleration for DiT-Based TTS via Progressive Calibration	Yanru Huo et.al.	2509.09748	null
2025-09-12	DiFlow-TTS: Discrete Flow Matching with Factorized Speech Tokens for Low-Latency Zero-Shot Text-To-Speech	Ngoc-Son Nguyen et.al.	2509.09631	null
2025-09-11	HISPASpoof: A New Dataset For Spanish Speech Forensics	Maria Risques et.al.	2509.09155	null
2025-09-29	Streaming Sequence-to-Sequence Learning with Delayed Streams Modeling	Neil Zeghidour et.al.	2509.08753	null
2025-09-09	ParCzech4Speech: A New Speech Corpus Derived from Czech Parliamentary Data	Vladislav Stankov et.al.	2509.06675	null
2025-08-19	Integrating Feedback Loss from Bi-modal Sarcasm Detector for Sarcastic Speech Synthesis	Zhu Li et.al.	2508.13028	null
2025-10-07	EmoSSLSphere: Multilingual Emotional Speech Synthesis with Spherical Vectors and Discrete Speech Tokens	Joonyong Park et.al.	2508.11273	null
2025-08-08	UniTalker: Conversational Speech-Visual Synthesis	Yifan Hu et.al.	2508.04585	null
2025-08-29	Parallel GPT: Harmonizing the Independence and Interdependence of Acoustic and Semantic Information for Zero-Shot Text-to-Speech	Jingyuan Xing et.al.	2508.04141	null
2025-07-23	AI Telephone Surveying: Automating Quantitative Data Collection with an AI Interviewer	Danny D. Leybzon et.al.	2507.17718	null
2025-07-23	BoSS: Beyond-Semantic Speech	Qing Wang et.al.	2507.17563	null
2025-07-22	SplitMeanFlow: Interval Splitting Consistency in Few-Step Generative Modeling	Yi Guo et.al.	2507.16884	null
2025-07-15	Evaluating Speech-to-Text x LLM x Text-to-Speech Combinations for AI Interview Systems	Nima Yazdani et.al.	2507.16835	null
2025-07-21	A2TTS: TTS for Low Resource Indian Languages	Ayush Singh Bhadoriya et.al.	2507.15272	null
2025-07-21	EchoVoices: Preserving Generational Voices and Memories for Seniors and Children	Haiying Xu et.al.	2507.15221	null
2025-07-22	Hear Your Code Fail, Voice-Assisted Debugging for Python	Sayed Mahbub Hasan Amiri et.al.	2507.15007	null
2025-07-20	DMOSpeech 2: Reinforcement Learning for Duration Prediction in Metric-Optimized Speech Synthesis	Yinghao Aaron Li et.al.	2507.14988	null
2025-07-17	A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models	Kirill Borodin et.al.	2507.13563	null
2025-07-17	NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech	Maksim Borisov et.al.	2507.13155	null
2025-07-17	Intelligent Virtual Sonographer (IVS): Enhancing Physician-Robot-Patient Communication	Tianyu Song et.al.	2507.13052	null
2025-07-17	Enkidu: Universal Frequential Perturbation for Real-Time Audio Privacy Protection against Voice Deepfakes	Zhou Feng et.al.	2507.12932	null
2025-07-16	Quantize More, Lose Less: Autoregressive Generation from Residually Quantized Speech Representations	Yichen Han et.al.	2507.12197	null
2025-07-16	EME-TTS: Unlocking the Emphasis and Emotion Link in Speech Synthesis	Haoxun Li et.al.	2507.12015	null
2025-07-15	Towards Scalable AASIST: Refining Graph Attention for Speech Deepfake Detection	Ivan Viakhirev et.al.	2507.11777	null
2025-07-15	P.808 Multilingual Speech Enhancement Testing: Approach and Results of URGENT 2025 Challenge	Marvin Sach et.al.	2507.11306	null
2025-07-20	Supporting SENCOTEN Language Documentation Efforts with Automatic Speech Recognition	Mengzhe Geng et.al.	2507.10827	null
2025-07-14	An Empirical Evaluation of AI-Powered Non-Player Characters' Perceived Realism and Performance in Virtual Reality Environments	Mikko Korkiakoski et.al.	2507.10469	null
2025-07-12	ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching	Han Zhu et.al.	2507.09318	null
2025-07-12	Voice Conversion for Lombard Speaking Style with Implicit and Explicit Acoustic Feature Conditioning	Dominika Woszczyk et.al.	2507.09310	null
2025-07-12	ClaritySpeech: Dementia Obfuscation in Speech	Dominika Woszczyk et.al.	2507.09282	null
2025-07-11	SemAlignVC: Enhancing zero-shot timbre conversion using semantic alignment	Shivam Mehta et.al.	2507.09070	null
2025-07-11	Exploiting Leaderboards for Large-Scale Distribution of Malicious Models	Anshuman Suri et.al.	2507.08983	null
2025-07-06	A Hybrid Machine Learning Framework for Optimizing Crop Selection via Agronomic and Economic Forecasting	Niranjan Mallikarjun Sindhur et.al.	2507.08832	null
2025-07-11	Unlocking Speech Instruction Data Potential with Query Rewriting	Yonghua Hei et.al.	2507.08603	null
2025-07-11	MIDI-VALLE: Improving Expressive Piano Performance Synthesis Through Neural Codec Language Modelling	Jingjing Tang et.al.	2507.08530	null
2025-07-11	Active Learning for Text-to-Speech Synthesis with Informative Sample Collection	Kentaro Seki et.al.	2507.08319	null
2025-07-05	RepeaTTS: Towards Feature Discovery through Repeated Fine-Tuning	Atli Sigurgeirsson et.al.	2507.08012	null
2025-07-10	SecureSpeech: Prompt-based Speaker and Content Protection	Belinda Soh Hui Hui et.al.	2507.07799	null
2025-07-09	Learning Japanese with Jouzu: Interaction Outcomes with Stylized Dialogue Fictional Agents	Zackary Rackauckas et.al.	2507.06483	null
2025-07-08	Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis	Xintong Hu et.al.	2507.06116	null
2025-07-08	Differentiable Reward Optimization for LLM based TTS system	Changfeng Gao et.al.	2507.05911	null
2025-07-08	OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model	Chen Wang et.al.	2507.05177	null
2025-07-07	Multi-Step Prediction and Control of Hierarchical Emotion Distribution in Text-to-Speech Synthesis	Sho Inoue et.al.	2507.04598	null
2025-07-06	TTS-CtrlNet: Time varying emotion aligned text-to-speech generation with ControlNet	Jaeseok Jeong et.al.	2507.04349	null
2025-07-05	PresentAgent: Multimodal Agent for Presentation Video Generation	Jingwei Shi et.al.	2507.04036	null
2025-07-08	Prosody Labeling with Phoneme-BERT and Speech Foundation Models	Tomoki Koriyama et.al.	2507.03912	null
2025-07-05	Traceable TTS: Toward Watermark-Free TTS with Strong Traceability	Yuxiang Zhao et.al.	2507.03887	null
2025-07-14	DeepGesture: A conversational gesture synthesis system based on emotions and semantics	Thanh Hoang-Minh et.al.	2507.03147	null
2025-07-03	Open-Source System for Multilingual Translation and Cloned Speech Synthesis	Mateo Cámara et.al.	2507.02530	null
2025-07-03	JoyTTS: LLM-based Spoken Chatbot With Voice Cloning	Fangru Zhou et.al.	2507.02380	null
2025-07-02	Analyzing and Improving Speaker Similarity Assessment for Speech Synthesis	Marc-André Carbonneau et.al.	2507.02176	null
2025-07-08	Pronunciation Editing for Finnish Speech using Phonetic Posteriorgrams	Zirui Li et.al.	2507.02115	null
2025-07-02	A Dataset for Automatic Assessment of TTS Quality in Spanish	Alejandro Sosa Welford et.al.	2507.01805	null
2025-07-02	Voice Conversion for Likability Control via Automated Rating of Speech Synthesis Corpora	Hitoshi Suda et.al.	2507.01356	null
2025-07-08	SpeechAccentLLM: A Unified Framework for Foreign Accent Conversion and Text to Speech	Zhuangfei Cheng et.al.	2507.01348	null
2025-07-02	Multi-interaction TTS toward professional recording reproduction	Hiroki Kanagawa et.al.	2507.00808	null
2025-07-18	MuteSwap: Visual-informed Silent Video Identity Conversion	Yifan Liu et.al.	2507.00498	null
2025-06-30	Collecting, Curating, and Annotating Good Quality Speech deepfake dataset for Famous Figures: Process and Challenges	Hashim Ali et.al.	2507.00324	null
2025-06-30	Investigating Stochastic Methods for Prosody Modeling in Speech Synthesis	Paul Mayer et.al.	2507.00227	null
2025-06-30	JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching	Mingi Kwon et.al.	2506.23552	null
2025-06-29	You Sound a Little Tense: L2 Tailored Clear TTS Using Durational Vowel Properties	Paige Tuttösí et.al.	2506.23367	null
2025-06-27	Evaluating Pointing Gestures for Target Selection in Human-Robot Collaboration	Noora Sassali et.al.	2506.22116	null
2025-06-27	Robust and Efficient Autoregressive Speech Synthesis with Dynamic Chunk-wise Prediction Policy	Bohan Li et.al.	2506.22023	null
2025-06-23	IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech	Siyi Zhou et.al.	2506.21619	null
2025-06-27	A Multi-Stage Framework for Multimodal Controllable Speech Synthesis	Rui Niu et.al.	2506.20945	null
2025-06-25	An Exploration of ECAPA-TDNN and x-vector Speaker Representations in Zero-shot Multi-speaker TTS	Marie Kunešová et.al.	2506.20190	null
2025-06-24	TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems	Christoph Minixhofer et.al.	2506.19441	null
2025-06-23	Selecting N-lowest scores for training MOS prediction models	Yuto Kondo et.al.	2506.18326	null
2025-06-23	Rethinking Mean Opinion Scores in Speech Quality Assessment: Aggregation through Quantized Distribution Fitting	Yuto Kondo et.al.	2506.18307	null
2025-07-15	JIS: A Speech Corpus of Japanese Idol Speakers with Various Speaking Styles	Yuto Kondo et.al.	2506.18296	null
2025-06-21	OpusLM: A Family of Open Unified Speech Language Models	Jinchuan Tian et.al.	2506.17611	null
2025-06-20	RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching	Hyun Joon Park et.al.	2506.16741	null
2025-06-20	LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization	Daejin Jo et.al.	2506.16738	null
2025-06-20	V-CASS: Vision-context-aware Expressive Speech Synthesis for Enhancing User Understanding of Videos	Qixin Wang et.al.	2506.16716	null
2025-06-19	Streaming Non-Autoregressive Model for Accent Conversion and Pronunciation Improvement	Tuan-Nam Nguyen et.al.	2506.16580	null
2025-06-19	InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems	Kexin Huang et.al.	2506.16381	link
2025-06-19	Optimizing Multilingual Text-To-Speech with Accents & Emotions	Pranav Pawar et.al.	2506.16310	null
2025-06-18	TTSOps: A Closed-Loop Corpus Optimization Framework for Training Multi-Speaker TTS Models from Dark Data	Kentaro Seki et.al.	2506.15614	null
2025-06-18	PredGen: Accelerated Inference of Large Language Models through Input-Time Speculation for Real-Time Speech Interaction	Shufan Li et.al.	2506.15556	null
2025-06-18	EmojiVoice: Towards long-term controllable expressivity in robot speech	Paige Tuttösí et.al.	2506.15085	null
2025-06-18	An accurate and revised version of optical character recognition-based speech synthesis using LabVIEW	Prateek Mehta et.al.	2506.15029	null
2025-06-17	Investigation of Zero-shot Text-to-Speech Models for Enhancing Short-Utterance Speaker Verification	Yiyang Zhao et.al.	2506.14226	null
2025-06-17	Pushing the Performance of Synthetic Speech Detection with Kolmogorov-Arnold Networks and Self-Supervised Learning Models	Tuan Dat Phuong et.al.	2506.14153	link
2025-06-16	EmoNews: A Spoken Dialogue System for Expressive News Conversations	Ryuki Matsuura et.al.	2506.13894	link
2025-07-08	Multimodal Integration Challenges in Emotionally Expressive Child Avatars for Training Applications	Pegah Salehi et.al.	2506.13477	null
2025-06-20	ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching	Han Zhu et.al.	2506.13053	link
2025-06-14	StreamMel: Real-Time Zero-shot Text-to-Speech via Interleaved Continuous Autoregressive Modeling	Hui Wang et.al.	2506.12570	null
2025-06-14	Phonikud: Hebrew Grapheme-to-Phoneme Conversion for Real-Time Text-to-Speech	Yakov Kolani et.al.	2506.12311	null
2025-07-08	S2ST-Omni: An Efficient Multilingual Speech-to-Speech Translation Framework via Seamless Speech-Text Alignment and Progressive Fine-tuning	Yu Pan et.al.	2506.11160	null
2025-06-16	A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data	Cheng-Kang Chou et.al.	2506.11130	null
2025-06-10	GUIRoboTron-Speech: Towards Automated GUI Agents Based on Speech Instructions	Wenkang Han et.al.	2506.11127	null
2025-06-10	ASRJam: Human-Friendly AI Speech Jamming to Prevent Automated Phone Scams	Freddie Grabovski et.al.	2506.11125	null
2025-06-05	Intelligibility of Text-to-Speech Systems for Mathematical Expressions	Sujoy Roychowdhury et.al.	2506.11086	null
2025-06-12	Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMs	Hayato Futami et.al.	2506.10299	null
2025-07-10	UmbraTTS: Adapting Text-to-Speech to Environmental Contexts with Flow Matching	Neta Glazer et.al.	2506.09874	null
2025-06-15	EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection	Christoph Schuhmann et.al.	2506.09827	null
2025-06-11	OmniDRCA: Parallel Speech-Text Foundation Model via Dual-Resolution Speech Representations and Contrastive Alignment	Chao-Hong Tan et.al.	2506.09349	link
2025-06-11	Ming-Omni: A Unified Multimodal Model for Perception and Generation	Inclusion AI et.al.	2506.09344	link
2025-06-13	Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model	Ailin Huang et.al.	2506.08967	null
2025-06-10	A Review on Score-based Generative Models for Audio Applications	Ge Zhu et.al.	2506.08457	null
2025-06-09	Seeing Voices: Generating A-Roll Video from Audio with Mirage	Aditi Sundararaman et.al.	2506.08279	null
2025-06-09	Transcript-Prompted Whisper with Dictionary-Enhanced Decoding for Japanese Speech Annotation	Rui Hu et.al.	2506.07646	null
2025-06-07	SynHate: Detecting Hate Speech in Synthetic Deepfake Audio	Rishabh Ranjan et.al.	2506.06772	null
2025-06-09	Voice Impression Control in Zero-Shot TTS	Keinichi Fujita et.al.	2506.05688	null
2025-05-28	Speaking images. A novel framework for the automated self-description of artworks	Valentine Bernasconi et.al.	2506.05368	null
2025-06-05	Grapheme-Coherent Phonemic and Prosodic Annotation of Speech by Implicit and Explicit Grapheme Conditioning	Hien Ohnaka et.al.	2506.04527	null
2025-06-04	Can we reconstruct a dysarthric voice with the large speech model Parler TTS?	Ariadna Sanchez et.al.	2506.04397	null
2025-06-04	HiFiTTS-2: A Large-Scale High Bandwidth Speech Dataset	Ryan Langman et.al.	2506.04152	null
2025-07-23	UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation	Jinting Wang et.al.	2506.04134	null
2025-06-04	A Novel Data Augmentation Approach for Automatic Speaking Assessment on Opinion Expressions	Chung-Chun Wang et.al.	2506.04077	null
2025-06-04	Kinship in Speech: Leveraging Linguistic Relatedness for Zero-Shot TTS in Indian Languages	Utkarsh Pathak et.al.	2506.03884	null
2025-06-04	Mark My Words: A Robust Multilingual Model for Punctuation in Text and Speech Transcripts	Sidharth Pulipaka et.al.	2506.03793	null
2025-06-04	Comparative Analysis of Fast and High-Fidelity Neural Vocoders for Low-Latency Streaming Synthesis in Resource-Constrained Environments	Reo Yoneyama et.al.	2506.03554	null
2025-06-04	BitTTS: Highly Compact Text-to-Speech Using 1.58-bit Quantization and Weight Indexing	Masaya Kawamura et.al.	2506.03515	null
2025-06-03	Controllable Text-to-Speech Synthesis with Masked-Autoencoded Style-Rich Representation	Yongqi Wang et.al.	2506.02997	null
2025-06-03	Towards a Japanese Full-duplex Spoken Dialogue System	Atsumoto Ohashi et.al.	2506.02979	null
2025-06-03	CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech	Helin Wang et.al.	2506.02863	null
2025-06-03	Prompt-Unseen-Emotion: Zero-shot Expressive Speech Synthesis with Prompt-LLM Contextual Knowledge for Mixed Emotions	Xiaoxue Gao et.al.	2506.02742	null
2025-06-03	Trusted Fake Audio Detection Based on Dirichlet Distribution	Chi Ding et.al.	2506.02401	null
2025-06-02	SALF-MOS: Speaker Agnostic Latent Features Downsampled for MOS Prediction	Saurabh Agrawal et.al.	2506.02082	null
2025-06-02	Speech-to-Speech Translation Pipelines for Conversations in Low-Resource Languages	Andrei Popescu-Belis et.al.	2506.01406	null
2025-06-02	Zero-Shot Text-to-Speech for Vietnamese	Thi Vu et.al.	2506.01322	null
2025-06-02	CleanS2S: Single-file Framework for Proactive Speech-to-Speech Interaction	Yudong Lu et.al.	2506.01268	null
2025-06-02	WCTC-Biasing: Retraining-free Contextual Biasing ASR with Wildcard CTC-based Keyword Spotting and Inter-layer Biasing	Yu Nakagome et.al.	2506.01263	null
2025-06-01	DS-TTS: Zero-Shot Speaker Style Adaptation from Voice Clips via Dynamic Dual-Style Feature Modulation	Ming Meng et.al.	2506.01020	null
2025-06-01	Counterfactual Activation Editing for Post-hoc Prosody and Mispronunciation Correction in TTS Models	Kyowoon Lee et.al.	2506.00832	null
2025-05-31	Chain-of-Thought Training for Open E2E Spoken Dialogue Systems	Siddhant Arora et.al.	2506.00722	null
2025-05-30	Werewolf: A Straightforward Game Framework with TTS for Improved User Engagement	Qihui Fan et.al.	2506.00160	null
2025-05-30	SwitchLingua: The First Large-Scale Multilingual and Multi-Ethnic Code-Switching Dataset	Peng Xie et.al.	2506.00087	null
2025-05-30	Speech Token Prediction via Compressed-to-fine Language Modeling for Speech Generation	Wenrui Liu et.al.	2505.24496	null
2025-05-30	DS-Codec: Dual-Stage Training with Mirror-to-NonMirror Architecture Switching for Speech Codec	Peijie Chen et.al.	2505.24314	null
2025-05-29	Can Emotion Fool Anti-spoofing?	Aurosweta Mahapatra et.al.	2505.23962	null
2025-05-29	Few-Shot Speech Deepfake Detection Adaptation with Gaussian Processes	Neta Glazer et.al.	2505.23619	link
2025-05-29	EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge	Ruskin Raj Manku et.al.	2505.23009	link
2025-05-29	LLM-Synth4KWS: Scalable Automatic Generation and Synthesis of Confusable Data for Custom Keyword Spotting	Pai Zhu et.al.	2505.22995	null
2025-05-28	BinauralFlow: A Causal and Streamable Approach for High-Quality Binaural Speech Synthesis with Flow Matching Models	Susan Liang et.al.	2505.22865	null
2025-05-28	Tell me Habibi, is it Real or Fake?	Kartik Kuckreja et.al.	2505.22581	null
2025-05-28	A Linguistically Motivated Analysis of Intonational Phrasing in Text-to-Speech Systems: Revealing Gaps in Syntactic Sensitivity	Charlotte Pouw et.al.	2505.22236	null
2025-06-29	Spotlight-TTS: Spotlighting the Style via Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech	Nam-Gyu Kim et.al.	2505.20868	null
2025-05-26	ArVoice: A Multi-Speaker Dataset for Arabic Speech Synthesis	Hawau Olamide Toyin et.al.	2505.20506	null
2025-06-04	Accelerating Flow-Matching-Based Text-to-Speech via Empirically Pruned Step Sampling	Qixi Zheng et.al.	2505.19931	null
2025-05-26	DiEmo-TTS: Disentangled Emotion Representations via Self-Supervised Distillation for Cross-Speaker Emotion Transfer in Text-to-Speech	Deok-Hyeon Cho et.al.	2505.19687	null
2025-05-26	KIT's Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization	Zhaolin Li et.al.	2505.19679	null
2025-06-02	Zero-Shot Streaming Text to Speech Synthesis with Transducer and Auto-Regressive Modeling	Haiyang Sun et.al.	2505.19669	null
2025-05-30	Accelerating Diffusion-based Text-to-Speech Model Training with Dual Modality Alignment	Jeongsoo Choi et.al.	2505.19595	link
2025-05-26	GSA-TTS: Toward Zero-Shot Speech Synthesis based on Gradual Style Adaptor	Seokgi Lee et.al.	2505.19384	null
2025-05-25	SpeakStream: Streaming Text-to-Speech with Interleaved Data	Richard He Bai et.al.	2505.19206	null
2025-05-25	CloneShield: A Framework for Universal Perturbation Against Zero-Shot Voice Cloning	Renyuan Li et.al.	2505.19119	null
2025-05-27	Revival with Voice: Multi-modal Controllable Text-to-Speech Synthesis	Minsu Kim et.al.	2505.18972	null
2025-05-27	RASMALAI: Resources for Adaptive Speech Modeling in Indian Languages with Accents and Intonations	Ashwin Sankar et.al.	2505.18609	null
2025-05-24	MPE-TTS: Customized Emotion Zero-Shot Text-To-Speech Using Multi-Modal Prompt	Zhichao Wu et.al.	2505.18453	null
2025-05-27	CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training	Zhihao Du et.al.	2505.17589	null
2025-05-23	What You Read Isn't What You Hear: Linguistic Sensitivity in Deepfake Speech Detection	Binh Nguyen et.al.	2505.17513	null
2025-05-23	UniTTS: An end-to-end TTS system without decoupling of acoustic and semantic information	Rui Wang et.al.	2505.17426	link
2025-05-23	Speechless: Speech Instruction Training Without Speech for Low Resource Languages	Alan Dao et.al.	2505.17417	link
2025-05-22	Benchmarking Expressive Japanese Character Text-to-Speech with VITS and Style-BERT-VITS2	Zackary Rackauckas et.al.	2505.17320	null
2025-05-21	Voicing Personas: Rewriting Persona Descriptions into Style Prompts for Controllable Text-to-Speech	Yejin Lee et.al.	2505.17093	null
2025-06-13	Impact of Frame Rates on Speech Tokenizer: A Case Study on Mandarin and English	Haoyang Zhang et.al.	2505.17076	null
2025-05-22	From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition	Tianduo Wang et.al.	2505.16972	link
2025-05-21	MIKU-PAL: An Automated and Standardized Multi-Modal Method for Speech Paralinguistic and Affect Labeling	Yifan Cheng et.al.	2505.15772	null
2025-05-21	Segmentation-Variant Codebooks for Preservation of Paralinguistic and Prosodic Information	Nicholas Sanders et.al.	2505.15667	null
2025-05-21	Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models	Zirui Song et.al.	2505.15406	link
2025-05-21	Prosody-Adaptable Audio Codecs for Zero-Shot Voice Conversion via In-Context Learning	Junchuan Zhao et.al.	2505.15402	null
2025-06-03	Accelerating Autoregressive Speech Synthesis Inference With Speech Speculative Decoding	Zijian Lin et.al.	2505.15380	null
2025-05-20	Pairwise Evaluation of Accent Similarity in Speech Synthesis	Jinzuomu Zhong et.al.	2505.14410	null
2025-05-20	FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation	Yutong Liu et.al.	2505.14351	null
2025-05-21	AudioJailbreak: Jailbreak Attacks against End-to-End Large Audio-Language Models	Guangke Chen et.al.	2505.14103	null
2025-05-20	SeamlessEdit: Background Noise Aware Zero-Shot Speech Editing with in-Context Enhancement	Kuan-Yu Chen et.al.	2505.14066	null
2025-05-22	Improving Noise Robustness of LLM-based Zero-shot TTS via Discrete Acoustic Token Denoising	Ye-Xin Lu et.al.	2505.13830	null
2025-05-29	Articulatory Feature Prediction from Surface EMG during Speech Production	Jihwan Lee et.al.	2505.13814	null
2025-05-19	Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space	Zhengrui Ma et.al.	2505.13181	link
2025-05-19	OZSpeech: One-step Zero-shot Speech Synthesis with Learned-Prior-Conditioned Flow Matching	Hieu-Nghia Huynh-Nguyen et.al.	2505.12800	null
2025-05-19	RoVo: Robust Voice Protection Against Unauthorized Speech Synthesis with Embedding-Level Perturbations	Seungmin Kim et.al.	2505.12686	null
2025-05-19	Chain-Talker: Chain Understanding and Rendering for Empathetic Conversational Speech Synthesis	Yifan Hu et.al.	2505.12597	link
2025-05-18	Shallow Flow Matching for Coarse-to-Fine Text-to-Speech Synthesis	Dong Yang et.al.	2505.12226	null
2025-05-16	Audio Turing Test: Benchmarking the Human-likeness of Large Language Model-based Text-to-Speech Systems in Chinese	Xihuai Wang et.al.	2505.11200	null
2025-05-16	BanglaFake: Constructing and Evaluating a Specialized Bengali Deepfake Audio Dataset	Istiaq Ahmed Fahad et.al.	2505.10885	link
2025-05-15	UDDETTS: Unifying Discrete and Dimensional Emotions for Controllable Emotional Text-to-Speech	Jiaxuan Liu et.al.	2505.10599	null
2025-05-14	DPN-GAN: Inducing Periodic Activations in Generative Adversarial Networks for High-Fidelity Audio Synthesis	Zeeshan Ahmad et.al.	2505.09091	null
2025-05-13	Investigating self-supervised features for expressive, multilingual voice conversion	Álvaro Martín-Cortinas et.al.	2505.08278	null
2025-05-12	MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder	Bowen Zhang et.al.	2505.07916	null
2025-05-13	Lightweight End-to-end Text-to-speech Synthesis for low resource on-device applications	Biel Tura Vecino et.al.	2505.07701	null
2025-05-10	VTutor: An Animated Pedagogical Agent SDK that Provide Real Time Multi-Model Feedback	Eason Chen et.al.	2505.06676	null
2025-05-10	Bridging the Gap: An Intermediate Language for Enhanced and Cost-Effective Grapheme-to-Phoneme Conversion with Homographs with Multiple Pronunciations Disambiguation	Abbas Bertina et.al.	2505.06599	null
2025-05-15	FlexSpeech: Towards Stable, Controllable and Expressive Text-to-Speech	Linhan Ma et.al.	2505.05159	null
2025-05-08	Teochew-Wild: The First In-the-wild Teochew Dataset with Orthographic Annotations	Linrong Pan et.al.	2505.05056	null
2025-05-08	A Multi-Agent AI Framework for Immersive Audiobook Production through Spatial Audio and Neural Narration	Shaja Arul Selvamani et.al.	2505.04885	null
2025-06-06	Advancing Zero-shot Text-to-Speech Intelligibility across Diverse Domains via Preference Alignment	Xueyao Zhang et.al.	2505.04113	null
2025-05-06	VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model	Zuwei Long et.al.	2505.03739	link
2025-05-13	SonicRAG: High Fidelity Sound Effects Synthesis Based on Retrieval Augmented Generation	Yu-Ren Guo et.al.	2505.03244	null
2025-05-05	Generating Narrated Lecture Videos from Slides with Synchronized Highlights	Alexander Holmberg et.al.	2505.02966	null
2025-05-05	Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play	Yemin Shi et.al.	2505.02707	link
2025-05-05	LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis	Qingkai Fang et.al.	2505.02625	link
2025-04-30	Sadeed: Advancing Arabic Diacritization Through Small Language Model	Zeina Aldallal et.al.	2504.21635	null
2025-04-29	AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation	Jeongsoo Choi et.al.	2504.20629	null
2025-05-28	ClonEval: An Open Voice Cloning Benchmark	Iwona Christop et.al.	2504.20581	link
2025-05-02	Towards Flow-Matching-based TTS without Classifier-Free Guidance	Yuzhe Liang et.al.	2504.20334	null
2025-04-27	Generative Adversarial Network based Voice Conversion: Techniques, Challenges, and Recent Advancements	Sandipan Dhar et.al.	2504.19197	null
2025-04-27	Muyan-TTS: A Trainable Text-to-Speech Model Optimized for Podcast Scenarios with a $50K Budget	Xin Li et.al.	2504.19146	link
2025-04-22	FADEL: Uncertainty-aware Fake Audio Detection with Evidential Deep Learning	Ju Yeon Kang et.al.	2504.15663	null
2025-04-22	A Multi-Agent Framework for Automated Qinqiang Opera Script Generation Using Large Language Models	Gengxian Cao et.al.	2504.15552	null
2025-04-21	SOLIDO: A Robust Watermarking Method for Speech Synthesis via Low-Rank Adaptation	Yue Li et.al.	2504.15035	null
2025-04-20	DialogueAgents: A Hybrid Agent-Based Speech Synthesis Framework for Multi-Party Dialogue	Xiang Li et.al.	2504.14482	link
2025-04-18	ChatNekoHacker: Real-Time Fan Engagement with Conversational Agents	Takuya Sera et.al.	2504.13793	null
2025-04-18	Collective Learning Mechanism based Optimal Transport Generative Adversarial Network for Non-parallel Voice Conversion	Sandipan Dhar et.al.	2504.13791	null
2025-04-22	EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting	Guanrou Yang et.al.	2504.12867	null
2025-05-28	GOAT-TTS: Expressive and Realistic Speech Generation via A Dual-Branch LLM	Yaodong Song et.al.	2504.12339	null
2025-04-15	Dopamine Audiobook: A Training-free MLLM Agent for Emotional and Human-like Audiobook Generation	Yan Rong et.al.	2504.11002	null
2025-04-15	Generalized Audio Deepfake Detection Using Frame-level Latent Information Entropy	Botao Zhao et.al.	2504.10819	null
2025-04-14	Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis	Yifan Yang et.al.	2504.10352	null
2025-04-14	AutoStyle-TTS: Retrieval-Augmented Generation based Automatic Style Matching Text-to-Speech Synthesis	Dan Luo et.al.	2504.10309	null
2025-04-14	SafeSpeech: Robust and Universal Voice Protection Against Malicious Speech Synthesis	Zhisheng Zhang et.al.	2504.09839	link
2025-04-12	AMNet: An Acoustic Model Network for Enhanced Mandarin Speech Synthesis	Yubing Cao et.al.	2504.09225	null
2025-04-11	Generalized Multilingual Text-to-Speech Generation with Language-Aware Style Adaptation	Haowei Lou et.al.	2504.08274	null
2025-04-10	Empowering Global Voices: A Data-Efficient, Phoneme-Tone Adaptive Approach to High-Fidelity Speech Synthesis	Yizhong Geng et.al.	2504.07858	null
2025-05-16	SlimSpeech: Lightweight and Efficient Text-to-Speech with Slim Rectified Flow	Kaidi Wang et.al.	2504.07776	null
2025-04-08	AVENet: Disentangling Features by Approximating Average Features for Voice Conversion	Wenyu Wang et.al.	2504.05833	null
2025-04-07	SpeakEasy: Enhancing Text-to-Speech Interactions for Expressive Content Creation	Stephen Brade et.al.	2504.05106	null
2025-04-04	RWKVTTS: Yet another TTS based on RWKV-7	Lin Yueyu et.al.	2504.03289	link
2025-04-22	F5R-TTS: Improving Flow-Matching based Text-to-Speech with Group Relative Policy Optimization	Xiaohui Sun et.al.	2504.02407	null
2025-04-03	VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models	Kim Sung-Bin et.al.	2504.02386	null
2025-04-02	TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection	Zhiming Ma et.al.	2503.24115	link
2025-03-31	SpeechDialogueFactory: Generating High-Quality Speech Dialogue Data to Accelerate Your Speech-LLM Development	Minghan Wang et.al.	2503.23848	link
2025-03-30	Speculative End-Turn Detector for Efficient Speech Chatbot Assistant	Hyunjong Ok et.al.	2503.23439	null
2025-05-16	SupertonicTTS: Towards Highly Scalable and Efficient Text-to-Speech System	Hyeongju Kim et.al.	2503.23108	null
2025-03-26	Dual Audio-Centric Modality Coupling for Talking Head Generation	Ao Fu et.al.	2503.22728	null
2025-03-28	DeepAudio-V1: Towards Multi-Modal Multi-Stage End-to-End Video to Speech and Audio Generation	Haomin Zhang et.al.	2503.22265	null
2025-03-26	Text-Driven Voice Conversion via Latent State-Space Modeling	Wen Li et.al.	2503.20999	null
2025-05-26	FireRedTTS-1S: An Upgraded Streamable Foundation Text-to-Speech System	Hao-Han Guo et.al.	2503.20499	null
2025-03-21	Your voice is your voice: Supporting Self-expression through Speech Generation and LLMs in Augmented and Alternative Communication	Yiwen Xu et.al.	2503.17479	null
2025-03-21	From Faces to Voices: Learning Hierarchical Representations for High-quality Video-to-Speech	Ji-Hoon Kim et.al.	2503.16956	null
2025-03-20	WaveFM: A High-Fidelity and Efficient Vocoder Based on Flow Matching	Tianze Luo et.al.	2503.16689	link
2025-03-10	VocalEyes: Enhancing Environmental Perception for the Visually Impaired through Vision-Language Models and Distance-Aware Object Detection	Kunal Chavan et.al.	2503.16488	null
2025-01-22	Development of an Inclusive Educational Platform Using Open Technologies and Machine Learning: A Case Study on Accessibility Enhancement	Jimi Togni et.al.	2503.15501	null
2025-01-14	AI-Powered Assistive Technologies for Visual Impairment	Prudhvi Naayini et.al.	2503.15494	null
2025-03-19	MoonCast: High-Quality Zero-Shot Podcast Generation	Zeqian Ju et.al.	2503.14345	link
2025-03-26	InnerSelf: Designing Self-Deepfaked Voice for Emotional Well-being	Guang Dai et.al.	2503.14257	null
2025-03-14	MAVFlow: Preserving Paralinguistic Elements with Conditional Flow Matching for Zero-Shot AV2AV Multilingual Translation	Sungwoo Cho et.al.	2503.11026	null
2025-03-11	An Exhaustive Evaluation of TTS- and VC-based Data Augmentation for ASR	Sewade Ogun et.al.	2503.08954	null
2025-03-07	DiVISe: Direct Visual-Input Speech Synthesis Preserving Speaker Characteristics And Intelligibility	Yifan Liu et.al.	2503.05223	link
2025-03-03	Direct Speech to Speech Translation: A Review	Mohammad Sarim et.al.	2503.04799	null
2025-03-06	LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM	Sambal Shikhar et.al.	2503.04724	null
2025-03-06	Scaling Rich Style-Prompted Text-to-Speech Datasets	Anuj Diwan et.al.	2503.04713	link
2025-03-05	Good practices for evaluation of synthesized speech	Erica Cooper et.al.	2503.03250	null
2025-03-04	InSerter: Speech Instruction Following with Unsupervised Interleaved Pre-training	Dingdong Wang et.al.	2503.02769	null
2025-03-03	Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens	Xinsheng Wang et.al.	2503.01710	link
2025-03-03	Voice Cloning for Dysarthric Speech Synthesis: Addressing Data Scarcity in Speech-Language Pathology	Birger Moell et.al.	2503.01266	null
2025-03-02	UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation	Alexander H. Liu et.al.	2503.00733	null
2025-03-01	PodAgent: A Comprehensive Framework for Podcast Generation	Yujia Xiao et.al.	2503.00455	link
2025-03-12	Telephone Surveys Meet Conversational AI: Evaluating a LLM-Based Telephone Survey System at Scale	Max M. Lang et.al.	2502.20140	null
2025-02-27	DiffCSS: Diverse and Expressive Conversational Speech Synthesis with Diffusion Models	Weihao Wu et.al.	2502.19924	null
2025-03-28	MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis	Ziyue Jiang et.al.	2502.18924	null
2025-03-08	Clip-TTS: Contrastive Text-content and Mel-spectrogram, A High-Quality Text-to-Speech Method based on Contextual Semantic Understanding	Tianyun Liu et.al.	2502.18889	null
2025-02-24	Balancing Speech Understanding and Generation Using Continual Pre-training for Codec-based Speech LLM	Jiatong Shi et.al.	2502.16897	null
2025-02-18	AV-Flow: Transforming Text to Audio-Visual Human-like Interactions	Aggelina Chatziagapi et.al.	2502.13133	null
2025-02-18	High-Fidelity Music Vocoder using Neural Audio Codecs	Luca A. Lanzendörfer et.al.	2502.12759	null
2025-02-18	A Survey on Bridging EEG Signals and Generative AI: From Image and Text to Beyond	Shreya Shukla et.al.	2502.12048	null
2025-02-17	NaturalL2S: End-to-End High-quality Multispeaker Lip-to-Speech Synthesis with Differential Digital Signal Processing	Yifan Liang et.al.	2502.12002	null
2025-02-16	FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching	Hui Wang et.al.	2502.11128	null
2025-02-16	SyncSpeech: Low-Latency and Efficient Dual-Stream Text-to-Speech based on Temporal Masked Transformer	Zhengyan Sheng et.al.	2502.11094	null
2025-02-14	VocalCrypt: Novel Active Defense Against Deepfake Voice Based on Masking Effect	Qingyuan Fei et.al.	2502.10329	null
2025-02-13	TokenSynth: A Token-based Neural Synthesizer for Instrument Cloning and Text-to-Instrument	Kyungsu Kim et.al.	2502.08939	link
2025-04-24	ASVspoof 5: Design, Collection and Validation of Resources for Spoofing, Deepfake, and Adversarial Attack Detection Using Crowdsourced Speech	Xin Wang et.al.	2502.08857	null
2025-02-11	LoRP-TTS: Low-Rank Personalized Text-To-Speech	Łukasz Bondaruk et.al.	2502.07562	null
2025-02-11	Advanced Zero-Shot Text-to-Speech for Background Removal and Preservation with Controllable Masked Speech Prediction	Leying Zhang et.al.	2502.07345	null
2025-02-11	Vevo: Controllable Zero-Shot Voice Imitation with Self-Supervised Disentanglement	Xueyao Zhang et.al.	2502.07243	null
2025-02-10	Synthetic Audio Helps for Cognitive State Tasks	Adil Soubki et.al.	2502.06922	link
2025-02-19	Speech to Speech Translation with Translatotron: A State of the Art Review	Jules R. Kala et.al.	2502.05980	null
2025-02-09	Non-invasive electromyographic speech neuroprosthesis: a geometric perspective	Harshavardhana T. Gowda et.al.	2502.05762	null
2025-02-09	BnTTS: Few-Shot Speaker Adaptation in Low-Resource Setting	Mohammad Jahid Ibna Basher et.al.	2502.05729	null
2025-02-08	Gender Bias in Instruction-Guided Speech Synthesis Models	Chun-Yi Kuan et.al.	2502.05649	null
2025-02-08	IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System	Wei Deng et.al.	2502.05512	link
2025-02-22	Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis	Zhen Ye et.al.	2502.04128	link
2025-02-05	Metis: A Foundation Speech Generation Model with Masked Generative Pre-training	Yuancheng Wang et.al.	2502.03128	link
2025-02-05	Fine-grained Preference Optimization Improves Zero-shot Text-to-Speech	Jixun Yao et.al.	2502.02950	null
2025-02-04	Developing multilingual speech synthesis system for Ojibwe, Mi'kmaq, and Maliseet	Shenran Wang et.al.	2502.02703	link
2025-02-04	Streaming Speaker Change Detection and Gender Classification for Transducer-Based Multi-Talker Speech Translation	Peidong Wang et.al.	2502.02683	null
2025-02-13	Continuous Autoregressive Modeling with Stochastic Monotonic Alignment for Speech Synthesis	Weiwei Lin et.al.	2502.01084	null
2025-02-02	EmoTalkingGaussian: Continuous Emotion-conditioned Talking Head Synthesis	Junuk Cha et.al.	2502.00654	null
2025-01-31	VisualSpeech: Enhance Prosody with Visual Context in TTS	Shumin Que et.al.	2501.19258	null
2025-01-29	BreezyVoice: Adapting TTS for Taiwanese Mandarin with Enhanced Polyphone Disambiguation -- Challenges and Insights	Chan-Jan Hsu et.al.	2501.17790	null
2025-01-28	Compact Neural TTS Voices for Accessibility	Kunal Jain et.al.	2501.17332	null
2025-02-11	Overview of the Amphion Toolkit (v0.2)	Jiaqi Li et.al.	2501.15442	link
2025-01-24	Characteristic-Specific Partial Fine-Tuning for Efficient Emotion and Speaker Adaptation in Codec Language Text-to-Speech Models	Tianrui Wang et.al.	2501.14273	null
2025-01-24	Generalizable Audio Deepfake Detection via Latent Space Refinement and Augmentation	Wen Huang et.al.	2501.14240	null
2025-01-24	LoCoML: A Framework for Real-World ML Inference Pipelines	Kritin Maddireddy et.al.	2501.14165	null
2025-01-23	Generative Data Augmentation Challenge: Zero-Shot Speech Synthesis for Personalized Speech Enhancement	Jae-Sung Bae et.al.	2501.13372	null
2025-01-21	A Domain Adaptation Framework for Speech Recognition Systems with Only Synthetic data	Minh Tran et.al.	2501.12501	null
2025-01-20	A Non-autoregressive Model for Joint STT and TTS	Vishal Sunder et.al.	2501.09104	null
2025-01-15	Speech Synthesis along Perceptual Voice Quality Dimensions	Frederik Rautenberg et.al.	2501.08791	null
2025-01-15	Adaptive Data Augmentation with NaturalSpeech3 for Far-field Speaker Verification	Li Zhang et.al.	2501.08691	null
2025-01-15	Towards Lightweight and Stable Zero-shot TTS with Self-distilled Representation Disentanglement	Qianniu Chen et.al.	2501.08566	null
2025-03-17	CodecFake+: A Large-Scale Neural Audio Codec-Based Deepfake Speech Dataset	Xuanjun Chen et.al.	2501.08238	null
2025-01-13	Exploring the encoding of linguistic representations in the Fully-Connected Layer of generative CNNs for Speech	Bruno Ferenc Šegedin et.al.	2501.07726	null
2025-01-19	MathReader: Text-to-Speech for Mathematical Documents	Sieun Hyeon et.al.	2501.07088	link
2025-01-11	Retrieval-Augmented Dialogue Knowledge Aggregation for Expressive Conversational Speech Synthesis	Rui Liu et.al.	2501.06467	link
2025-01-10	TTS-Transducer: End-to-End Speech Synthesis with Neural Transducer	Vladimir Bataev et.al.	2501.06320	null
2025-01-10	MinMo: A Multimodal Large Language Model for Seamless Voice Interaction	Qian Chen et.al.	2501.06282	null
2025-01-10	PROEMO: Prompt-Driven Text-to-Speech Synthesis Based on Emotion and Intensity Control	Shaozuo Zhang et.al.	2501.06276	null
2025-06-03	Low-Resource Text-to-Speech Synthesis Using Noise-Augmented Training of ForwardTacotron	Kishor Kayyar Lakshminarayana et.al.	2501.05976	null
2025-01-10	MARS6: A Small and Robust Hierarchical-Codec Text-to-Speech Model	Matthew Baas et.al.	2501.05787	null
2025-01-09	Probing Speaker-specific Features in Speaker Representations	Aemon Yat Fei Chiu et.al.	2501.05310	null
2025-01-09	JELLY: Joint Emotion Recognition and Context Reasoning with LLMs for Conversational Speech Synthesis	Jun-Hyeok Cha et.al.	2501.04904	null
2025-01-08	Cued Speech Generation Leveraging a Pre-trained Audiovisual Text-to-Speech Model	Sanjana Sankar et.al.	2501.04799	null
2025-01-08	FleSpeech: Flexibly Controllable Speech Generation with Various Prompts	Hanzhao Li et.al.	2501.04644	null
2025-02-23	OpenOmni: Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-Time Self-Aware Emotional Speech Synthesis	Run Luo et.al.	2501.04561	link
2025-01-08	DrawSpeech: Expressive Speech Synthesis Using Prosodic Sketches as Control Conditions	Weidong Chen et.al.	2501.04256	null
2025-01-07	NeuroIncept Decoder for High-Fidelity Speech Reconstruction from Neural Activity	Owais Mujtaba Khanday et.al.	2501.03757	link
2025-01-02	FaceSpeak: Expressive and High-Quality Speech Synthesis from Human Portraits of Different Styles	Tian-Hao Zhang et.al.	2501.03181	null
2025-01-02	RingFormer: A Neural Vocoder with Ring Attention and Convolution-Augmented Transformer	Seongho Hong et.al.	2501.01182	link
2025-01-02	Disambiguation of Chinese Polyphones in an End-to-End Framework with Semantic Features Extracted by Pre-trained BERT	Dongyang Dai et.al.	2501.01102	null
2025-01-06	Takeaways from Applying LLM Capabilities to Multiple Conversational Avatars in a VR Pilot Study	Mykola Maslych et.al.	2501.00168	null
2024-12-16	SECodec: Structural Entropy-based Compressive Speech Representation Codec for Speech Language Models	Linqin Wang et.al.	2501.00018	null
2024-12-28	Stable-TTS: Stable Speaker-Adaptive Text-to-Speech Synthesis via Prosody Prompting	Wooseok Han et.al.	2412.20155	null
2024-12-28	CrossSpeech++: Cross-lingual Speech Synthesis with Decoupled Language and Speaker Generation	Ji-Hoon Kim et.al.	2412.20048	null
2024-12-26	VoiceDiT: Dual-Condition Diffusion Transformer for Environment-Aware Speech Synthesis	Jaemin Jung et.al.	2412.19259	null
2024-12-26	"I've Heard of You!": Generate Spoken Named Entity Recognition Data for Unseen Entities	Jiawei Yu et.al.	2412.19102	null
2024-12-26	Indonesian-English Code-Switching Speech Synthesizer Utilizing Multilingual STEN-TTS and Bert LID	Ahmad Alfani Handoyo et.al.	2412.19043	null
2025-01-23	Advancing NAM-to-Speech Conversion with Novel Methods and the MultiNAM Dataset	Neil Shah et.al.	2412.18839	null
2025-01-17	MRI2Speech: Speech Synthesis from Articulatory Movements Recorded by Real-time MRI	Neil Shah et.al.	2412.18836	null
2024-12-25	Intra- and Inter-modal Context Interaction Modeling for Conversational Speech Synthesis	Zhenqi Jia et.al.	2412.18733	null
2024-12-24	GenPod: Constructive News Framing in AI-Generated Podcasts More Effectively Reduces Negative Emotions Than Non-Constructive Framing	Wen Ku et.al.	2412.18300	null
2025-03-27	VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music	Jiatong Shi et.al.	2412.17667	link
2024-12-22	Why Do Speech Language Models Fail to Generate Semantically Coherent Outputs? A Modality Evolving Perspective	Hankun Wang et.al.	2412.17048	null
2024-12-22	Incremental Disentanglement for Environment-Aware Zero-Shot Text-to-Speech Synthesis	Ye-Xin Lu et.al.	2412.16977	null
2025-09-18	KALL-E: Autoregressive Speech Synthesis with Next-Distribution Prediction	Kangxiang Xia et.al.	2412.16846	null
2024-12-23	Interleaved Speech-Text Language Models are Simple Streaming Text to Speech Synthesizers	Yifan Yang et.al.	2412.16102	null
2024-12-19	Scale This, Not That: Investigating Key Dataset Attributes for Efficient Speech Enhancement Scaling	Leying Zhang et.al.	2412.14890	null
2024-12-17	Deep Speech Synthesis from Multimodal Articulatory Representations	Peter Wu et.al.	2412.13387	null
2024-12-17	Synthetic Speech Classification: IEEE Signal Processing Cup 2022 challenge	Mahieyin Rahmun et.al.	2412.13279	link
2024-12-17	Enhancing Naturalness in LLM-Generated Utterances through Disfluency Insertion	Syed Zohaib Hassan et.al.	2412.12710	null
2024-12-17	Phoneme-Level Feature Discrepancies: A Key to Detecting Sophisticated Speech Deepfakes	Kuiyuan Zhang et.al.	2412.12619	null
2025-01-10	Hierarchical Control of Emotion Rendering in Speech Synthesis	Sho Inoue et.al.	2412.12498	link
2024-12-19	ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis	Xiangheng He et.al.	2412.11795	null
2024-12-16	Region-Based Optimization in Continual Learning for Audio Deepfake Detection	Yujie Chen et.al.	2412.11551	link
2025-01-15	Multi-modal and Multi-scale Spatial Environment Understanding for Immersive Visual Text-to-Speech	Rui Liu et.al.	2412.11409	link
2024-12-16	Efficient Generative Modeling with Residual Vector Quantization-Based Tokens	Jaehyeon Kim et.al.	2412.10208	null
2024-12-25	CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models	Zhihao Du et.al.	2412.10117	link
2024-12-13	AMuSeD: An Attentive Deep Neural Network for Multimodal Sarcasm Detection Incorporating Bi-modal Data Augmentation	Xiyuan Gao et.al.	2412.10103	null
2024-12-13	CSSinger: End-to-End Chunkwise Streaming Singing Voice Synthesis System Based on Conditional Variational Autoencoder	Jianwei Cui et.al.	2412.08918	null
2024-12-11	Multimodal Latent Language Modeling with Next-Token Diffusion	Yutao Sun et.al.	2412.08635	link
2024-12-11	Zero-Shot Mono-to-Binaural Speech Synthesis	Alon Levkovitch et.al.	2412.08356	null
2024-12-11	A Unified Model For Voice and Accent Conversion In Speech and Singing using Self-Supervised Learning and Feature Extraction	Sowmya Cheripally et.al.	2412.08312	null
2024-12-11	A Preliminary Analysis of Automatic Word and Syllable Prominence Detection in Non-Native Speech With Text-to-Speech Prosody Embeddings	Anindita Mondal et.al.	2412.08283	null
2024-12-11	LatentSpeech: Latent Diffusion for Text-To-Speech Generation	Haowei Lou et.al.	2412.08117	null
2024-12-11	Aligner-Guided Training Paradigm: Advancing Text-to-Speech Models with Aligner Guided Duration	Haowei Lou et.al.	2412.08112	null
2024-12-09	Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey	Tianxin Xie et.al.	2412.06602	link
2024-12-12	EmoSpeech: A Corpus of Emotionally Rich and Contextually Detailed Speech Annotations	Weizhen Bian et.al.	2412.06581	null
2024-12-01	Text Is Not All You Need: Multimodal Prompting Helps LLMs Understand Humor	Ashwin Baluja et.al.	2412.05315	null
2024-12-04	DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling for Text-to-Speech with Diverse and Controllable Styles	Jiaxuan Liu et.al.	2412.03388	null
2024-12-05	Analytic Study of Text-Free Speech Synthesis for Raw Audio using a Self-Supervised Learning Model	Joonyong Park et.al.	2412.03074	null
2024-12-03	GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot	Aohan Zeng et.al.	2412.02612	link
2024-11-19	A Context-Based Numerical Format Prediction for a Text-To-Speech System	Yaser Darwesh et.al.	2412.00028	null
2024-11-27	Continual Learning in Machine Speech Chain Using Gradient Episodic Memory	Geoffrey Tyndall et.al.	2411.18320	null
2024-11-27	SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation	Wenyi Yu et.al.	2411.18138	null
2024-11-26	Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis	Akshita Gupta et.al.	2411.17690	null
2024-11-22	VQalAttent: a Transparent Speech Generation Pipeline based on Transformer-learned VQ-VAE Latent Space	Armani Rodriguez et.al.	2411.14642	null
2024-11-26	WavChat: A Survey of Spoken Dialogue Models	Shengpeng Ji et.al.	2411.13577	link
2024-12-02	I2TTS: Image-indicated Immersive Text-to-speech Synthesis with Spatial Perception	Jiawei Zhang et.al.	2411.13314	null
2024-11-20	Hard-Synth: Synthesizing Diverse Hard Samples for ASR using Zero-Shot TTS and LLM	Jiawei Yu et.al.	2411.13159	null
2024-12-15	Rethinking MUSHRA: Addressing Modern Challenges in Text-to-Speech Evaluation	Praveen Srinivasa Varadhan et.al.	2411.12719	null
2024-11-19	Leveraging Virtual Reality and AI Tutoring for Language Learning: A Case Study of a Virtual Campus Environment with OpenAI GPT Integration with Unity 3D	Adithya TG et.al.	2411.12619	null
2024-11-18	ESTVocoder: An Excitation-Spectral-Transformed Neural Vocoder Conditioned on Mel Spectrogram	Xiao-Hang Jiang et.al.	2411.11258	null
2024-11-18	SAMOS: A Neural MOS Prediction Model Leveraging Semantic Representations and Acoustic Features	Yu-Fei Shi et.al.	2411.11232	null
2024-11-15	SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers	Joseph Liu et.al.	2411.10510	link
2024-11-14	Robust AI-Synthesized Speech Detection Using Feature Decomposition Learning and Synthesizer Feature Augmentation	Kuiyuan Zhang et.al.	2411.09167	null
2024-11-14	Evaluating Synthetic Command Attacks on Smart Voice Assistants	Zhengxian He et.al.	2411.08316	null
2024-11-12	Improving Grapheme-to-Phoneme Conversion through In-Context Knowledge Retrieval with Large Language Models	Dongrui Han et.al.	2411.07563	null
2024-11-11	Enhancing Accessibility in Special Libraries: A Study on AI-Powered Assistive Technologies for Patrons with Disabilities	Snehasish Paul et.al.	2411.06970	null
2024-12-04	Debatts: Zero-Shot Debating Text-to-Speech Synthesis	Yiqiao Huang et.al.	2411.06540	null
2024-11-07	CUIfy the XR: An Open-Source Package to Embed LLM-powered Conversational Agents in XR	Kadir Burak Buldu et.al.	2411.04671	null
2024-11-04	EmoSphere++: Emotion-Controllable Zero-Shot Text-to-Speech via Emotion-Adaptive Spherical Vector	Deok-Hyeon Cho et.al.	2411.02625	link
2024-11-04	Complete reconstruction of the tongue contour through acoustic to articulatory inversion using real-time MRI data	Sofiane Azzouz et.al.	2411.02037	null
2024-11-09	Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis	Shijia Liao et.al.	2411.01156	link
2024-10-31	Speech is More Than Words: Do Speech-to-Text Translation Systems Leverage Prosody?	Ioannis Tsiamas et.al.	2410.24019	null
2024-10-30	Lina-Speech: Gated Linear Attention is a Fast and Parameter-Efficient Learner for text-to-speech synthesis	Théodor Lemerle et.al.	2410.23320	link
2024-10-30	Augmenting Polish Automatic Speech Recognition System With Synthetic Data	Łukasz Bondaruk et.al.	2410.22903	null
2024-10-29	Very Attentive Tacotron: Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech	Eric Battenberg et.al.	2410.22179	link
2024-10-29	Fast and High-Quality Auto-Regressive Speech Synthesis via Speculative Decoding	Bohan Li et.al.	2410.21951	null
2024-10-29	RDSinger: Reference-based Diffusion Network for Singing Voice Synthesis	Kehan Sui et.al.	2410.21641	null
2024-10-28	Asynchronous Tool Usage for Real-Time Agents	Antonio A. Ginart et.al.	2410.21620	null
2024-10-28	Enhancing TTS Stability in Hebrew using Discrete Semantic Units	Ella Zeldes et.al.	2410.21502	null
2024-10-28	Mitigating Unauthorized Speech Synthesis for Voice Protection	Zhisheng Zhang et.al.	2410.20742	link
2024-10-27	Get Large Language Models Ready to Speak: A Late-fusion Approach for Speech Generation	Maohao Shen et.al.	2410.20336	null
2024-10-24	Making Social Platforms Accessible: Emotion-Aware Speech Generation with Integrated Text Analysis	Suparna De et.al.	2410.19199	null
2024-10-24	STTATTS: Unified Speech-To-Text And Text-To-Speech Model	Hawau Olamide Toyin et.al.	2410.18607	link
2024-10-24	Evaluating and Improving Automatic Speech Recognition Systems for Korean Meteorological Experts	ChaeHun Park et.al.	2410.18444	null
2024-10-23	ELAICHI: Enhancing Low-resource TTS by Addressing Infrequent and Low-frequency Character Bigrams	Srija Anand et.al.	2410.17901	null
2024-10-22	Continuous Speech Tokenizer in Text To Speech	Yixing Li et.al.	2410.17081	null
2024-10-22	Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap	Guanrou Yang et.al.	2410.16726	null
2024-10-21	Continuous Speech Synthesis using per-token Latent Diffusion	Arnon Turetzky et.al.	2410.16048	null
2024-10-18	A Unified Framework for Collecting Text-to-Speech Synthesis Datasets for 22 Indian Languages	Sujitha Sathiyamoorthy et.al.	2410.14197	null
2024-12-23	Multi-Source Spatial Knowledge Understanding for Immersive Visual Text-to-Speech	Shuwei He et.al.	2410.14101	link
2024-10-17	Accelerating Codec-based Speech Synthesis with Multi-Token Prediction and Speculative Decoding	Tan Dat Nguyen et.al.	2410.13839	null
2024-10-17	Enhancing Crowdsourced Audio for Text-to-Speech Models	José Giraldo et.al.	2410.13357	null
2024-10-17	DART: Disentanglement of Accent and Speaker Representation in Multispeaker Text-to-Speech	Jan Melechovsky et.al.	2410.13342	null
2024-10-17	DurIAN-E 2: Duration Informed Attention Network with Adaptive Variational Autoencoder and Adversarial Learning for Expressive Text-to-Speech Synthesis	Yu Gu et.al.	2410.13288	null
2024-10-17	Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation	Sreyan Ghosh et.al.	2410.13198	null
2024-10-16	ERVQ: Enhanced Residual Vector Quantization with Intra-and-Inter-Codebook Optimization for Neural Audio Codecs	Rui-Chen Zheng et.al.	2410.12359	null
2024-10-16	Beyond Oversmoothing: Evaluating DDPM and MSE for Scalable Speech Synthesis in ASR	Christoph Minixhofer et.al.	2410.12279	null
2024-10-14	IsoChronoMeter: A simple and effective isochronic translation evaluation metric	Nikolai Rozanov et.al.	2410.11127	null
2024-10-14	DMDSpeech: Distilled Diffusion Model Surpassing The Teacher in Zero-shot Speech Synthesis via Direct Metric Optimization	Yinghao Aaron Li et.al.	2410.11097	null
2024-10-14	Everyday Speech in the Indian Subcontinent	Utkarsh Pathak et.al.	2410.10508	null
2024-10-12	Emphasis Rendering for Conversational Text-to-Speech with Multi-modal Multi-scale Context Modeling	Rui Liu et.al.	2410.09524	null
2024-10-10	Unsupervised Data Validation Methods for Efficient Model Training	Yurii Paniv et.al.	2410.07880	null
2024-10-15	F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching	Yushen Chen et.al.	2410.06885	link
2024-10-09	Efficient training strategies for natural sounding speech synthesis and speaker adaptation based on FastPitch	Teodora Răgman et.al.	2410.06787	null
2024-10-09	Bahasa Harmony: A Comprehensive Dataset for Bahasa Text-to-Speech Synthesis with Discrete Codec Modeling of EnGen-TTS	Onkar Kishor Susladkar et.al.	2410.06608	null
2024-10-09	Can DeepFake Speech be Reliably Detected?	Hongbin Liu et.al.	2410.06572	null
2024-10-07	SegINR: Segment-wise Implicit Neural Representation for Sequence Alignment in Neural Text-to-Speech	Minchan Kim et.al.	2410.04690	null
2024-10-06	HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis	Yuto Nishimura et.al.	2410.04380	null
2024-10-10	SONAR: A Synthetic AI-Audio Detection Framework and Benchmark	Xiang Li et.al.	2410.04324	link
2024-10-05	Adversarial Attacks and Robust Defenses in Speaker Embedding based Zero-Shot Text-to-Speech System	Ze Li et.al.	2410.04017	null
2024-10-01	Recent Advances in Speech Language Models: A Survey	Wenqian Cui et.al.	2410.03751	null
2024-09-30	Accent conversion using discrete units with parallel data synthesized from controllable accented TTS	Tuan Nam Nguyen et.al.	2410.03734	null
2024-09-28	FluentEditor+: Text-based Speech Editing by Modeling Local Hierarchical Acoustic Smoothness and Global Prosody Consistency	Rui Liu et.al.	2410.03719	null
2024-10-04	Generative Semantic Communication for Text-to-Speech Synthesis	Jiahao Zheng et.al.	2410.03459	null
2024-10-04	Textless Streaming Speech-to-Speech Translation using Semantic Speech Tokens	Jinzheng Zhao et.al.	2410.03298	null
2024-10-04	Narrative Player: Reviving Data Narratives with Visuals	Zekai Shao et.al.	2410.03268	null
2024-10-04	MultiVerse: Efficient and Expressive Zero-Shot Multi-Task Text-to-Speech	Taejun Bak et.al.	2410.03192	null
2024-10-07	Algorithms For Automatic Accentuation And Transcription Of Russian Texts In Speech Recognition Systems	Olga Iakovenko et.al.	2410.02538	null
2024-10-01	Augmentation through Laundering Attacks for Audio Spoof Detection	Hashim Ali et.al.	2410.01108	null
2024-10-01	Zero-Shot Text-to-Speech from Continuous Text Streams	Trung Dang et.al.	2410.00767	null
2024-10-01	EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control	Haozhe Chen et.al.	2410.00316	link
2024-10-02	Moshi: a speech-text foundation model for real-time dialogue	Alexandre Défossez et.al.	2410.00037	link
2024-09-30	Word-wise intonation model for cross-language TTS systems	Tomilov A. A. et.al.	2409.20374	null
2024-09-29	Quantitative Analysis of Audio-Visual Tasks: An Information-Theoretic Perspective	Chen Chen et.al.	2409.19575	null
2024-09-27	Audio-Based Linguistic Feature Extraction for Enhancing Multi-lingual and Low-Resource Text-to-Speech	Youngjae Kim et.al.	2409.18622	null
2024-09-27	EmoPro: A Prompt Selection Strategy for Emotional Expression in LM-based Speech Synthesis	Haoyu Wang et.al.	2409.18512	null
2024-09-26	Description-based Controllable Text-to-Speech with Cross-Lingual Voice Control	Ryuichi Yamamoto et.al.	2409.17452	null
2024-09-25	Exploring synthetic data for cross-speaker style transfer in style representation based TTS	Lucas H. Ueda et.al.	2409.17364	null
2024-09-18	SpoofCeleb: Speech Deepfake Detection and SASV In The Wild	Jee-weon Jung et.al.	2409.17285	null
2024-09-25	Emotional Dimension Control in Language Model-Based Text-to-Speech: Spanning a Broad Spectrum of Human Emotions	Kun Zhou et.al.	2409.16681	null
2024-09-25	Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation	Siyin Wang et.al.	2409.16644	null
2024-09-24	FastTalker: Jointly Generating Speech and Conversational Gestures from Text	Zixin Guo et.al.	2409.16404	null
2024-09-24	Beyond Text-to-Text: An Overview of Multimodal and Generative Artificial Intelligence for Education Using Topic Modeling	Ville Heilala et.al.	2409.16376	null
2024-09-24	Facial Expression-Enhanced TTS: Combining Face Representation and Emotion Intensity for Adaptive Speech	Yunji Chu et.al.	2409.16203	null
2024-09-24	NanoVoice: Efficient Speaker-Adaptive Text-to-Speech for Multiple Speakers	Nohil Park et.al.	2409.15760	null
2024-09-24	VoiceGuider: Enhancing Out-of-Domain Performance in Parameter-Efficient Speaker-Adaptive Text-to-Speech via Autoguidance	Jiheum Yeom et.al.	2409.15759	null
2024-09-24	StyleFusion TTS: Multimodal Style-control and Enhanced Feature Fusion for Zero-shot Text-to-speech Synthesis	Zhiyong Chen et.al.	2409.15741	null
2024-09-04	Real-time Robotics Situation Awareness for Accident Prevention in Industry	Juan M. Deniz et.al.	2409.15305	null
2024-11-28	A Comprehensive Survey with Critical Analysis for Deepfake Speech Detection	Lam Pham et.al.	2409.15180	null
2024-09-23	HiFi-Glot: Neural Formant Synthesis with Differentiable Resonant Filters	Lauri Juvela et.al.	2409.14823	null
2024-09-23	LlamaPartialSpoof: An LLM-Driven Fake Speech Dataset Simulating Disinformation Generation	Hieu-Thi Luong et.al.	2409.14743	null
2024-09-20	Zero-shot Cross-lingual Voice Transfer for TTS	Fadi Biadsy et.al.	2409.13910	null
2024-09-20	On the Feasibility of Fully AI-automated Vishing Attacks	João Figueiredo et.al.	2409.13793	null
2024-09-24	Enhancing Kurdish Text-to-Speech with Native Corpus Training: A High-Quality WaveGlow Vocoder Approach	Abdulhady Abas Abdullah et.al.	2409.13734	null
2024-09-20	Audio Codec Augmentation for Robust Collaborative Watermarking of Speech Synthesis	Lauri Juvela et.al.	2409.13382	link
2024-09-19	Enhancing Synthetic Training Data for Speech Commands: From ASR-Based Filtering to Domain Adaptation in SSL Latent Space	Sebastião Quintas et.al.	2409.12745	null
2024-09-19	NDVQ: Robust Neural Audio Codec with Normal Distribution-Based Vector Quantization	Zhikang Niu et.al.	2409.12717	null
2024-09-19	Preference Alignment Improves Language Model-Based TTS	Jinchuan Tian et.al.	2409.12403	null
2024-09-10	Prosodic Parameter Manipulation in TTS generated speech for Controlled Speech Generation	Podakanti Satyajith Chary et.al.	2409.12176	null
2024-09-18	Low Frame-rate Speech Codec: a Codec Designed for Fast High-quality Speech LLM Training and Inference	Edresson Casanova et.al.	2409.12117	null
2024-09-18	Exploring an Inter-Pausal Unit (IPU) based Approach for Indic End-to-End TTS Systems	Anusha Prakash et.al.	2409.11915	null
2024-09-18	Mixture of Experts Fusion for Fake Audio Detection Using Frozen wav2vec 2.0	Zhiyong Wang et.al.	2409.11909	null
2024-09-18	DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech	Xin Qi et.al.	2409.11835	null
2024-09-18	Speaking from Coarse to Fine: Improving Neural Codec Language Model via Multi-Scale Speech Coding and Generation	Haohan Guo et.al.	2409.11630	null
2024-09-17	SpMis: An Investigation of Synthetic Spoken Misinformation Detection	Peizhuo Liu et.al.	2409.11308	null
2024-09-19	The Art of Storytelling: Multi-Agent Generative AI for Dynamic Multimodal Narratives	Samee Arif et.al.	2409.11261	link
2024-09-17	Zero Shot Text to Speech Augmentation for Automatic Speech Recognition on Low-Resource Accented Speech Corpora	Francesco Nespoli et.al.	2409.11107	null
2024-09-17	Single-stage TTS with Masked Audio Token Modeling and Semantic Knowledge Distillation	Gerard I. Gállego et.al.	2409.11003	null
2024-09-17	Enhancing Multilingual Speech Generation and Recognition Abilities in LLMs with Constructed Code-switched Data	Jing Xu et.al.	2409.10969	null
2024-09-16	Emo-DPO: Controllable Emotional Speech Synthesis through Direct Preference Optimization	Xiaoxue Gao et.al.	2409.10157	null
2024-09-16	StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion	Yinghao Aaron Li et.al.	2409.10058	null
2024-09-15	Acquiring Pronunciation Knowledge from Transcribed Speech Audio via Multi-task Learning	Siqi Sun et.al.	2409.09891	null
2025-01-13	MacST: Multi-Accent Speech Synthesis via Text Transliteration for Accent Conversion	Sho Inoue et.al.	2409.09352	null
2024-09-14	E1 TTS: Simple and Fast Non-Autoregressive TTS	Zhijun Liu et.al.	2409.09351	null
2024-09-14	Improving Robustness of Diffusion-Based Zero-Shot Speech Synthesis via Stable Formant Generation	Changjin Han et.al.	2409.09311	null
2024-09-14	SafeEar: Content Privacy-Preserving Audio Deepfake Detection	Xinfeng Li et.al.	2409.09272	link
2024-09-13	AccentBox: Towards High-Fidelity Zero-Shot Accent Generation	Jinzuomu Zhong et.al.	2409.09098	null
2024-09-17	HLTCOE JHU Submission to the Voice Privacy Challenge 2024	Henry Li Xinyuan et.al.	2409.08913	null
2024-09-13	Text-To-Speech Synthesis In The Wild	Jee-weon Jung et.al.	2409.08711	null
2024-09-13	LLM-Powered Grapheme-to-Phoneme Conversion: Benchmark and Case Study	Mahta Fetrat Qharabagh et.al.	2409.08554	null
2024-09-14	Exploring Accessibility Trends and Challenges in Mobile App Development: A Study of Stack Overflow Questions	Amila Indika et.al.	2409.07945	null
2024-09-12	Full-text Error Correction for Chinese Speech Recognition with Large Language Model	Zhiyuan Tang et.al.	2409.07790	null
2025-01-03	SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis	Helin Wang et.al.	2409.07556	link
2024-09-11	D-CAPTCHA++: A Study of Resilience of Deepfake CAPTCHA under Transferable Imperceptible Adversarial Attack	Hong-Hanh Nguyen-Le et.al.	2409.07390	null
2024-09-11	Cross-Dialect Text-To-Speech in Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level BERT	Kazuki Yamauchi et.al.	2409.07265	null
2024-09-11	Zero-Shot Text-to-Speech as Golden Speech Generator: A Systematic Framework and its Applicability in Automatic Pronunciation Assessment	Tien-Hong Lo et.al.	2409.07151	null
2024-09-11	The VoiceMOS Challenge 2024: Beyond Speech Quality Prediction	Wen-Chin Huang et.al.	2409.07001	null
2024-09-10	Enhancing Emotional Text-to-Speech Controllability with Natural Language Guidance through Contrastive Learning and Diffusion Models	Xin Jing et.al.	2409.06451	null
2024-09-26	What happens to diffusion model likelihood when your model is conditional?	Mattias Cross et.al.	2409.06364	null
2024-09-10	VoiceWukong: Benchmarking Deepfake Voice Detection	Ziwei Yan et.al.	2409.06348	null
2024-09-10	AS-Speech: Adaptive Style For Speech Synthesis	Zhipeng Li et.al.	2409.05730	null
2024-10-07	IndicVoices-R: Unlocking a Massive Multilingual Multi-speaker Speech Corpus for Scaling Indian TTS	Ashwin Sankar et.al.	2409.05356	link
2024-09-10	Disentangling the Prosody and Semantic Information with Pre-trained Model for In-Context Learning based Zero-Shot Voice Conversion	Zhengyang Chen et.al.	2409.05004	null
2024-09-01	Sample-Efficient Diffusion for Text-To-Speech Synthesis	Justin Lovelace et.al.	2409.03717	link
2024-09-10	LAST: Language Model Aware Speech Tokenization	Arnon Turetzky et.al.	2409.03701	null
2024-09-05	FireRedTTS: A Foundation Text-To-Speech Framework for Industry-Level Generative Speech Applications	Hao-Han Guo et.al.	2409.03283	null
2024-09-04	Training Universal Vocoders with Feature Smoothing-Based Augmentation Methods for High-Quality TTS Systems	Jeongmin Liu et.al.	2409.02517	null
2024-09-04	Fast, High-Quality and Parameter-Efficient Articulatory Synthesis using Differentiable DSP	Yisi Liu et.al.	2409.02451	null
2024-09-11	vec2wav 2.0: Advancing Voice Conversion via Discrete Token Vocoders	Yiwei Guo et.al.	2409.01995	null
2024-10-02	VoxHakka: A Dialectally Diverse Multi-speaker Text-to-Speech System for Taiwanese Hakka	Li-Wei Chen et.al.	2409.01548	null
2024-09-02	A multilingual training strategy for low resource Text to Speech	Asma Amalas et.al.	2409.01217	null
2024-09-02	A Framework for Synthetic Audio Conversations Generation using Large Language Models	Kaung Myat Kyaw et.al.	2409.00946	null
2024-09-02	SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model Based Text-to-Speech Synthesis	Haohan Guo et.al.	2409.00933	link
2024-10-11	MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer	Yuancheng Wang et.al.	2409.00750	null
2024-08-30	SelectTTS: Synthesizing Anyone's Voice via Discrete Unit-Based Frame Selection	Ismail Rasim Ulgen et.al.	2408.17432	null
2024-08-30	AASIST3: KAN-Enhanced AASIST Speech Deepfake Detection using SSL Features and Additional Regularization for the ASVspoof 2024 Challenge	Kirill Borodin et.al.	2408.17352	null
2024-09-19	Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model	Zhen Ye et.al.	2408.17175	link
2024-08-30	Utilizing Speaker Profiles for Impersonation Audio Detection	Hao Gu et.al.	2408.17009	null
2024-08-30	Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming	Zhifei Xie et.al.	2408.16725	link
2024-08-29	RAVE for Speech: Efficient Voice Conversion at High Sampling Rates	Anders R. Bargum et.al.	2408.16546	null
2024-08-29	Enabling Beam Search for Language Model-Based Text-to-Speech Synthesis	Zehai Tu et.al.	2408.16373	null
2024-08-28	Multi-modal Adversarial Training for Zero-Shot Voice Cloning	John Janiczek et.al.	2408.15916	null
2024-08-29	Easy, Interpretable, Effective: openSMILE for voice deepfake detection	Octavian Pascu et.al.	2408.15775	null
2024-08-28	VoxInstruct: Expressive Human Instruction-to-Speech Generation with Unified Multilingual Codec Language Modelling	Yixuan Zhou et.al.	2408.15676	link
2024-08-27	Literary and Colloquial Dialect Identification for Tamil using Acoustic Features	M. Nanmalar et.al.	2408.14887	null
2024-08-28	VoiceTailor: Lightweight Plug-In Adapter for Diffusion-Based Personalized Text-to-Speech	Heeseung Kim et.al.	2408.14739	null
2024-08-27	StyleSpeech: Parameter-efficient Fine Tuning for Pre-trained Controllable Text-to-Speech	Haowei Lo
