halsay/ASR-TTS-paper-daily

Updated on 2026.03.08

Usage instructions: here

This page is modified from here

Table of Contents
  1. ASR
  2. TTS

ASR

Publish Date Title Authors PDF Code
2026-03-05 PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration Mohammad Javad Ranjbar Kalahroodi et.al. 2603.05314 null
2026-03-05 Visual-Informed Speech Enhancement Using Attention-Based Beamforming Chihyun Liu et.al. 2603.05270 null
2026-03-05 Beyond Word Error Rate: Auditing the Diversity Tax in Speech Recognition through Dataset Cartography Ting-Hui Cheng et.al. 2603.05267 null
2026-03-05 Boosting ASR Robustness via Test-Time Reinforcement Learning with Audio-Text Semantic Rewards Linghan Fang et.al. 2603.05231 null
2026-03-05 Federated Heterogeneous Language Model Optimization for Hybrid Automatic Speech Recognition Mengze Hong et.al. 2603.04945 null
2026-03-05 Spectral dynamics reservoir computing for high-speed hardware-efficient neuromorphic processing Jiaxuan Chen et.al. 2603.04901 null
2026-03-05 WhisperAlign: Word-Boundary-Aware ASR and WhisperX-Anchored Pyannote Diarization for Long-Form Bengali Speech Aurchi Chowdhury et.al. 2603.04809 null
2026-03-05 When Denoising Hinders: Revisiting Zero-Shot ASR with SAM-Audio and Whisper Akif Islam et.al. 2603.04710 null
2026-02-16 Generating Realistic, Protocol-Compliant Maritime Radio Dialogues using Self-Instruct and Low-Rank Adaptation Gürsel Akdeniz et.al. 2603.04423 null
2026-03-04 Robust LLM-based Audio-Visual Speech Recognition with Sparse Modality Alignment and Visual Unit-Guided Refinement Fei Su et.al. 2603.03811 null
2026-02-28 ACES: Accent Subspaces for Coupling, Explanations, and Stress-Testing in Automatic Speech Recognition Swapnil Parekh et.al. 2603.03359 null
2026-03-03 An Investigation Into Various Approaches For Bengali Long-Form Speech Transcription and Bengali Speaker Diarization Epshita Jahan et.al. 2603.03158 null
2026-03-03 Speech recognition assisted by large language models to command software orally -- Application to an augmented and virtual reality web app for immersive molecular graphics Fabio Cortes Rodriguez et.al. 2603.02901 null
2026-03-04 SilentWear: an Ultra-Low Power Wearable System for EMG-based Silent Speech Recognition Giusy Spacone et.al. 2603.02847 null
2026-03-05 Benchmarking Speech Systems for Frontline Health Conversations: The DISPLACE-M Challenge Dhanya E et.al. 2603.02813 null
2026-03-02 GLoRIA: Gated Low-Rank Interpretable Adaptation for Dialectal ASR Pouya Mehralian et.al. 2603.02464 null
2026-03-02 RO-N3WS: Enhancing Generalization in Low-Resource ASR with Diverse Romanian Speech Benchmarks Alexandra Diaconu et.al. 2603.02368 null
2026-03-02 Sequence-Level Unsupervised Training in Speech Recognition: A Theoretical Study Zijian Yang et.al. 2603.02285 null
2026-02-27 Whisper-RIR-Mega: A Paired Clean-Reverberant Speech Benchmark for ASR Robustness to Room Acoustics Mandip Goswami et.al. 2603.02252 link
2026-02-25 Quality of Automatic Speech Recognition -- Polish Language case study -- from Wav2Vec to Scribe ElevenLabs Marcin Pietroń et.al. 2603.02246 null
2026-03-02 VietSuperSpeech: A Large-Scale Vietnamese Conversational Speech Dataset for ASR Fine-Tuning in Chatbot, Customer Support, and Call Center Applications Loan Do et.al. 2603.01894 null
2026-03-02 More Data, Fewer Diacritics: Scaling Arabic TTS Ahmed Musleh et.al. 2603.01622 null
2026-03-02 The USTC-NERCSLIP Systems for the CHiME-9 MCoRec Challenge Ya Jiang et.al. 2603.01415 null
2026-03-02 End-to-End Simultaneous Dysarthric Speech Reconstruction with Frame-Level Adaptor and Multiple Wait-k Knowledge Distillation Minghui Wu et.al. 2603.01382 null
2026-03-02 DARS: Dysarthria-Aware Rhythm-Style Synthesis for ASR Enhancement Minghui Wu et.al. 2603.01369 null
2026-03-03 Using Songs to Improve Kazakh Automatic Speech Recognition Rustem Yeshpanov et.al. 2603.00961 null
2026-03-01 Towards Orthographically-Informed Evaluation of Speech Recognition Systems for Indian Languages Kaushal Santosh Bhogale et.al. 2603.00941 null
2026-02-28 Polynomial Mixing for Efficient Self-supervised Speech Encoders Eva Feillet et.al. 2603.00683 null
2026-02-28 Whisper-MLA: Reducing GPU Memory Consumption of ASR Models based on MHA2MLA Conversion Sen Zhang et.al. 2603.00563 null
2026-02-16 Iterative LLM-based improvement for French Clinical Interview Transcription and Speaker Diarization Ambre Marie et.al. 2603.00086 null
2026-02-27 Chunk-wise Attention Transducers for Fast and Accurate Streaming Speech-to-Text Hainan Xu et.al. 2602.24245 null
2026-02-27 Dialect and Gender Bias in YouTube's Spanish Captioning System Iris Dania Jimenez et.al. 2602.24002 null
2026-02-26 Challenges in Automatic Speech Recognition for Adults with Cognitive Impairment Michelle Cohn et.al. 2602.23436 null
2026-02-16 Hello-Chat: Towards Realistic Social Audio Interactions Yueran Hou et.al. 2602.23387 null
2026-02-26 Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment Sanjid Hasan et.al. 2602.23070 null
2026-02-26 A Holistic Framework for Robust Bangla ASR and Speaker Diarization with Optimized VAD and CTC Alignment Zarif Ishmam et.al. 2602.22935 null
2026-02-26 Efficient Dialect-Aware Modeling and Conditioning for Low-Resource Taiwanese Hakka Speech Processing An-Ci Peng et.al. 2602.22522 null
2026-02-25 TG-ASR: Translation-Guided Learning with Parallel Gated Cross Attention for Low-Resource Automatic Speech Recognition Cheng-Yeh Yang et.al. 2602.22039 null
2026-02-25 Robust Long-Form Bangla Speech Processing: Automatic Speech Recognition and Speaker Diarization MD. Sagor Chowdhury et.al. 2602.21741 null
2026-03-02 Mitigating Structural Noise in Low-Resource S2TT: An Optimized Cascaded Nepali-English Pipeline with Punctuation Restoration Tangsang Chongbang et.al. 2602.21647 null
2026-02-24 823-OLT @ BUET DL Sprint 4.0: Context-Aware Windowing for ASR and Fine-Tuned Speaker Diarization in Bengali Long Form Audio Ratnajit Dhar et.al. 2602.21183 null
2026-02-24 Training-Free Intelligibility-Guided Observation Addition for Noisy ASR Haoyang Li et.al. 2602.20967 null
2026-02-23 An Approach to Combining Video and Speech with Large Language Models in Human-Robot Interaction Guanting Shen et.al. 2602.20219 null
2026-02-22 Pay Attention to CTC: Fast and Robust Pseudo-Labelling for Unified Speech Recognition Alexandros Haliassos et.al. 2602.19316 null
2026-02-21 Whisper: Courtside Edition Enhancing ASR Performance Through LLM-Driven Context Generation Yonathan Ron et.al. 2602.18966 null
2026-02-21 ReHear: Iterative Pseudo-Label Refinement for Semi-Supervised Speech Recognition via Audio Large Language Models Zefang Liu et.al. 2602.18721 null
2026-02-18 Fine-Pruning: A Biologically Inspired Algorithm for Personalization of Machine Learning Models Joseph Bingham et.al. 2602.18507 null
2026-02-19 Voice-Driven Semantic Perception for UAV-Assisted Emergency Networks Nuno Saavedra et.al. 2602.17394 null
2026-02-13 Speech to Speech Synthesis for Voice Impersonation Bjorn Johnson et.al. 2602.16721 null
2026-02-24 Enroll-on-Wakeup: A First Comparative Study of Target Speech Extraction for Seamless Interaction in Real Noisy Human-Machine Dialogue Scenarios Yiming Yang et.al. 2602.15519 null
2026-02-17 Joint Enhancement and Classification using Coupled Diffusion Models of Signals and Logits Gilad Nurko et.al. 2602.15405 null
2026-02-16 CLAP-Based Automatic Word Naming Recognition in Post-Stroke Aphasia Yacouba Kaloga et.al. 2602.14584 null
2026-02-15 From Scarcity to Scale: A Release-Level Analysis of the Pashto Common Voice Dataset Jandad Jahani et.al. 2602.14062 null
2026-02-15 Eureka-Audio: Triggering Audio Intelligence in Compact Language Models Dan Zhang et.al. 2602.13954 null
2026-02-14 voice2mode: Phonation Mode Classification in Singing using Self-Supervised Speech Models Aju Ani Justus et.al. 2602.13928 null
2026-02-03 Multimodal Consistency-Guided Reference-Free Data Selection for ASR Accent Adaptation Ligong Lei et.al. 2602.13263 null
2026-02-13 ViMedCSS: A Vietnamese Medical Code-Switching Speech Dataset & Benchmark Tung X. Nguyen et.al. 2602.12911 null
2026-02-13 Lamer-SSL: Layer-aware Mixture of LoRA Experts for Continual Multilingual Expansion of Self-supervised Models without Forgetting Jing Xu et.al. 2602.12746 null
2026-02-13 PISHYAR: A Socially Intelligent Smart Cane for Indoor Social Navigation and Multimodal Human-Robot Interaction for Visually Impaired People Mahdi Haghighat Joo et.al. 2602.12597 null
2026-02-13 Decoder-only Conformer with Modality-aware Sparse Mixtures of Experts for ASR Jaeyoung Lee et.al. 2602.12546 null
2026-01-21 Retrieval-Augmented Self-Taught Reasoning Model with Adaptive Chain-of-Thought for ASR Named Entity Correction Junjie An et.al. 2602.12287 null
2026-02-16 "Sorry, I Didn't Catch That": How Speech Models Miss What Matters Most Kaitlyn Zhou et.al. 2602.12249 null
2026-02-12 Moonshine v2: Ergodic Streaming Encoder ASR for Latency-Critical Speech Applications Manjunath Kudlur et.al. 2602.12241 null
2026-02-12 On the Sensitivity of Firing Rate-Based Federated Spiking Neural Networks to Differential Privacy Luiz Pereira et.al. 2602.12009 null
2026-02-28 TC-BiMamba: Trans-Chunk bidirectionally within BiMamba for unified streaming and non-streaming ASR Qingshun She et.al. 2602.11546 null
2026-02-21 Voxtral Realtime Alexander H. Liu et.al. 2602.11298 null
2026-02-11 Self-Supervised Learning for Speaker Recognition: A study and review Theo Lepage et.al. 2602.10829 null
2026-02-10 ViSpeechFormer: A Phonemic Approach for Vietnamese Automatic Speech Recognition Khoa Anh Nguyen et.al. 2602.10003 null
2026-02-10 Where Are We At with Automatic Speech Recognition for the Bambara Language? Seydou Diallo et.al. 2602.09785 null
2026-02-04 Beyond the Utterance: An Empirical Study of Very Long Context Speech Recognition Robert Flynn et.al. 2602.09044 null
2026-02-04 Windowed SummaryMixing: An Efficient Fine-Tuning of Self-Supervised Learning Models for Low-resource Speech Recognition Aditya Srinivas Menon et.al. 2602.09043 null
2026-02-19 Prototype-Based Disentanglement for Controllable Dysarthric Speech Synthesis Haoshen Wang et.al. 2602.08696 null
2026-02-09 Cross-Modal Bottleneck Fusion For Noise Robust Audio-Visual Speech Recognition Seaone Ok et.al. 2602.08293 null
2026-02-08 D-ORCA: Dialogue-Centric Optimization for Robust Audio-Visual Captioning Changli Tang et.al. 2602.07960 null
2026-02-06 Equipping LLM with Directional Multi-Talker Speech Understanding Capabilities Ju Lin et.al. 2602.07211 null
2026-02-05 From Hallucination to Articulation: Language Model-Driven Losses for Ultra Low-Bitrate Neural Speech Coding Jayeon Yi et.al. 2602.06213 null
2026-02-05 Enabling Automatic Disordered Speech Recognition: An Impaired Speech Dataset in the Akan Language Isaac Wiafe et.al. 2602.05406 null
2026-02-11 Evaluating Kubernetes Performance for GenAI Inference: From Automatic Speech Recognition to LLM Summarization Sai Sindhur Malleni et.al. 2602.04900 null
2026-02-04 Speaker-Aware Simulation Improves Conversational Speech Recognition Máté Gedeon et.al. 2602.04776 null
2026-03-01 Universal Robust Speech Adaptation for Cross-Domain Speech Recognition and Enhancement Chien-Chun Wang et.al. 2602.04307 null
2026-02-04 Frontend Token Enhancement for Token-Based Speech Recognition Takanori Ashihara et.al. 2602.04217 null
2026-02-06 Benchmarking Automatic Speech Recognition for Indian Languages in Agricultural Contexts Chandrashekar M S et.al. 2602.03868 null
2026-02-03 Mići Princ -- A Little Boy Teaching Speech Technologies the Chakavian Dialect Nikola Ljubešić et.al. 2602.03245 null
2026-03-02 WAXAL: A Large-Scale Multilingual African Language Speech Corpus Abdoulaye Diack et.al. 2602.02734 null
2026-02-02 Mixture-of-Experts with Intermediate CTC Supervision for Accented Speech Recognition Wonjun Lee et.al. 2602.01967 null
2026-02-02 BBPE16: UTF-16-based byte-level byte-pair encoding for improved multilingual speech recognition Hyunsik Kim et.al. 2602.01717 null
2026-02-01 EmoAra: Emotion-Preserving English Speech Transcription and Cross-Lingual Translation with Arabic Text-to-Speech Besher Hassan et.al. 2602.01170 null
2026-02-01 Adapting Where It Matters: Depth-Aware Adaptation for Efficient Multilingual Speech Recognition in Low-Resource Languages Yang Xiao et.al. 2602.01008 null
2026-02-01 MedSpeak: A Knowledge Graph-Aided ASR Error Correction Framework for Spoken Medical QA Yutong Song et.al. 2602.00981 null
2026-01-30 CALM: Joint Contextual Acoustic-Linguistic Modeling for Personalization of Multi-Speaker ASR Muhammad Shakeel et.al. 2601.22792 null
2026-01-30 Streaming Speech Recognition with Decoder-Only Large Language Models and Latency Optimization Genshun Wan et.al. 2601.22779 null
2026-01-29 Towards Robust Dysarthric Speech Recognition: LLM-Agent Post-ASR Correction Beyond WER Xiuwen Zheng et.al. 2601.21347 null
2026-01-30 Qwen3-ASR Technical Report Xian Shi et.al. 2601.21337 link
2026-01-28 asr_eval: Algorithms and tools for multi-reference and streaming speech recognition evaluation Oleg Sedukhin et.al. 2601.20992 null
2026-01-30 Text-only adaptation in LLM-based ASR through text denoising Sergio Burdisso et.al. 2601.20900 null
2026-01-28 Reducing Prompt Sensitivity in LLM-based Speech Recognition Through Learnable Projection Sergio Burdisso et.al. 2601.20898 null
2026-01-28 A Study of Data Selection Strategies for Pre-training Self-Supervised Speech Models Ryan Whetten et.al. 2601.20896 null
2026-01-28 SW-ASR: A Context-Aware Hybrid ASR Pipeline for Robust Single Word Speech Recognition Manali Sharma et.al. 2601.20890 null
2026-01-27 MA-LipNet: Multi-Dimensional Attention Networks for Robust Lipreading Matteo Rossi et.al. 2601.20881 null
2026-01-28 ASR for Affective Speech: Investigating Impact of Emotion and Speech Generative Strategy Ya-Tse Wu et.al. 2601.20319 null
2026-01-28 Mind the Shift: Using Delta SSL Embeddings to Enhance Child ASR Zilai Wang et.al. 2601.20142 null
2026-01-27 Do we really need Self-Attention for Streaming Automatic Speech Recognition? Youness Dkhissi et.al. 2601.19960 null
2026-01-23 Benchmarking von ASR-Modellen im deutschen medizinischen Kontext: Eine Leistungsanalyse anhand von Anamnesegesprächen Thomas Schuster et.al. 2601.19945 null
2026-01-08 FastWhisper: Adaptive Self-knowledge Distillation for Real-time Automatic Speech Recognition Junseok Lee et.al. 2601.19919 null
2026-01-27 SLM-SS: Speech Language Model for Generative Speech Separation Tianhua Li et.al. 2601.19533 null
2026-01-27 Dynamic Multi-Expert Projectors with Stabilized Routing for Multilingual Speech Recognition Isha Pandey et.al. 2601.19451 null
2026-01-27 SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper Alexander Polok et.al. 2601.19194 null
2026-02-02 Language Family Matters: Evaluating LLM-Based ASR Across Linguistic Boundaries Yuchen Zhang et.al. 2601.18899 null
2026-01-29 Unheard in the Digital Age: Rethinking AI Bias and Speech Diversity Onyedikachi Hope Amaechi-Okorie et.al. 2601.18641 null
2026-01-26 Pisets: A Robust Speech Recognition System for Lectures and Interviews Ivan Bondarenko et.al. 2601.18415 link
2026-01-26 Noise-Robust AV-ASR Using Visual Features Both in the Whisper Encoder and Decoder Zhengyang Li et.al. 2601.18396 null
2026-01-26 OCR-Enhanced Multimodal ASR Can Read While Listening Junli Chen et.al. 2601.18393 null
2026-01-26 Efficient Rehearsal for Continual Learning in ASR via Singular Value Tuning Steven Vander Eeckt et.al. 2601.18266 null
2026-01-26 VIBEVOICE-ASR Technical Report Zhiliang Peng et.al. 2601.18184 null
2026-01-25 SpatialEmb: Extract and Encode Spatial Information for 1-Stage Multi-channel Multi-speaker ASR on Arbitrary Microphone Arrays Yiwen Shao et.al. 2601.18037 null
2026-01-25 dLLM-ASR: A Faster Diffusion LLM-based Framework for Speech Recognition Wenjie Tian et.al. 2601.17902 null
2026-02-28 Speech Emotion Recognition with ASR Integration Yuanchao Li et.al. 2601.17901 null
2026-01-25 Quran-MD: A Fine-Grained Multilingual Multimodal Dataset of the Quran Muhammad Umar Salman et.al. 2601.17880 null
2026-01-25 BanglaRobustNet: A Hybrid Denoising-Attention Architecture for Robust Bangla Speech Recognition Md Sazzadul Islam Ridoy et.al. 2601.17679 null
2026-01-25 End-to-End Joint ASR and Speaker Role Diarization with Child-Adult Interactions Anfeng Xu et.al. 2601.17640 link
2026-01-24 Window Size Versus Accuracy Experiments in Voice Activity Detectors Max McKinnon et.al. 2601.17270 null
2026-01-22 Sink or SWIM: Tackling Real-Time ASR at Scale Federico Bruzzone et.al. 2601.17097 null
2026-01-16 AI-based System for Transforming text and sound to Educational Videos M. E. ElAlami et.al. 2601.17022 null
2026-01-21 Test-Time Adaptation for Speech Emotion Recognition Jiaheng Dong et.al. 2601.16240 null
2026-01-20 SoundBreak: A Systematic Study of Audio-Only Adversarial Attacks on Trimodal Models Aafiya Hussain et.al. 2601.16231 null
2026-01-22 Quantum Dimension Reduction of Hidden Markov Models Rishi Sundar et.al. 2601.16126 null
2026-01-27 Distillation-based Layer Dropping (DLD): Effective End-to-end Framework for Dynamic Speech Networks Abdul Hannan et.al. 2601.16117 null
2026-01-20 Lost in Transcription: How Speech-to-Text Errors Derail Code Understanding Jayant Havare et.al. 2601.15339 null
2026-01-22 Deaf and Hard of Hearing Access to Intelligent Personal Assistants: Comparison of Voice-Based Options with an LLM-Powered Touch Interface Paige S. DeVries et.al. 2601.15209 null
2026-01-21 Inverse-Hessian Regularization for Continual Learning in ASR Steven Vander Eeckt et.al. 2601.14751 null
2026-01-19 Typhoon ASR Real-time: FastConformer-Transducer for Thai Automatic Speech Recognition Warit Sirichotedumrong et.al. 2601.13044 link
2026-01-19 DUAP: Dual-task Universal Adversarial Perturbations Against Voice Control Systems Suyang Sun et.al. 2601.12786 null
2026-01-18 SSVD-O: Parameter-Efficient Fine-Tuning with Structured SVD for Speech Recognition Pu Wang et.al. 2601.12600 null
2026-01-18 Purification Before Fusion: Toward Mask-Free Speech Enhancement for Robust Audio-Visual Speech Recognition Linzhi Wu et.al. 2601.12436 null
2026-01-18 CTC-DID: CTC-Based Arabic dialect identification for streaming applications Muhammad Umar Farooq et.al. 2601.12199 null
2026-01-16 WenetSpeech-Wu: Datasets, Benchmarks, and Models for a Unified Chinese Wu Dialect Speech Processing Ecosystem Chengyou Wang et.al. 2601.11027 null
2026-01-15 Unifying Speech Recognition, Synthesis and Conversion with Autoregressive Transformers Runyuan Cai et.al. 2601.10770 null
2026-01-15 STEAMROLLER: A Multi-Agent System for Inclusive Automatic Speech Recognition for People who Stutter Ziqi Xu et.al. 2601.10223 null
2025-12-23 Multi-Level Embedding Conformer Framework for Bengali Automatic Speech Recognition Md. Nazmus Sakib et.al. 2601.09710 null
2026-01-14 Linear Complexity Self-Supervised Learning for Music Understanding with Random Quantizer Petros Vavaroutsos et.al. 2601.09603 null
2026-01-14 Speech-Hands: A Self-Reflection Voice Agentic Approach to Speech Recognition and Audio Reasoning with Omni Perception Zhen Wan et.al. 2601.09413 null
2026-01-14 SLAM-LLM: A Modular, Open-Source Multimodal Large Language Model Framework and Best Practice for Speech, Language, Audio and Music Processing Ziyang Ma et.al. 2601.09385 null
2026-01-17 MCGA: A Multi-task Classical Chinese Literary Genre Audio Corpus Yexing Du et.al. 2601.09270 link
2026-01-13 Robust CAPTCHA Using Audio Illusions in the Era of Large Language Models: from Evaluation to Advances Ziqi Ding et.al. 2601.08516 null
2026-01-12 Towards Comprehensive Semantic Speech Embeddings for Chinese Dialects Kalvin Chang et.al. 2601.07274 link
2026-01-11 Task Arithmetic with Support Languages for Low-Resource ASR Emma Rafkin et.al. 2601.07038 null
2026-01-11 Categorize Early, Integrate Late: Divergent Processing Strategies in Automatic Speech Recognition Nathan Roll et.al. 2601.06972 null
2026-01-11 Variational decomposition autoencoding improves disentanglement of latent representations Ioannis Ziogas et.al. 2601.06844 null
2026-01-11 Doing More with Less: Data Augmentation for Sudanese Dialect Automatic Speech Recognition Ayman Mansour et.al. 2601.06802 null
2026-01-10 QMAVIS: Long Video-Audio Understanding using Fusion of Large Multimodal Models Zixing Lin et.al. 2601.06573 null
2026-01-09 An Intelligent AI glasses System with Multi-Agent Architecture for Real-Time Voice Processing and Task Execution Sheng-Kai Chen et.al. 2601.06235 null
2026-01-13 GenAITEd Ghana: A First-of-Its-Kind Context-Aware and Curriculum-Aligned Conversational AI Agent for Teacher Education Matthew Nyaaba et.al. 2601.06093 null
2026-01-09 Multimodal In-context Learning for ASR of Low-resource Languages Zhaolin Li et.al. 2601.05707 null
2026-01-08 LLMs-Integrated Automatic Hate Speech Recognition Using Controllable Text Generation Models Ryutaro Oshima et.al. 2601.04654 null
2026-01-08 WESR: Scaling and Evaluating Word-level Event-Speech Recognition Chenchen Yang et.al. 2601.04508 null
2026-01-08 Latent-Level Enhancement with Flow Matching for Robust Automatic Speech Recognition Da-Hee Yang et.al. 2601.04459 null
2026-01-14 Stuttering-Aware Automatic Speech Recognition for Indonesian Language Fadhil Muhammad et.al. 2601.03727 null
2026-01-08 TellWhisper: Tell Whisper Who Speaks When Yifan Hu et.al. 2601.03712 null
2026-01-06 Linear Script Representations in Speech Foundation Models Enable Zero-Shot Transliteration Ryan Soh-Eun Shim et.al. 2601.02906 null
2026-01-06 Multi-channel multi-speaker transformer for speech recognition Guo Yifan et.al. 2601.02688 null
2026-01-05 Dynamic Quantization Error Propagation in Encoder-Decoder ASR Quantization Xinyu Wang et.al. 2601.02455 null
2026-01-05 VocalBridge: Latent Diffusion-Bridge Purification for Defeating Perturbation-Based Voiceprint Defenses Maryam Abbasihafshejani et.al. 2601.02444 null
2026-01-14 MORE: Multi-Objective Adversarial Attacks on Speech Recognition Xiaoxue Gao et.al. 2601.01852 null
2026-01-03 IO-RAE: Information-Obfuscation Reversible Adversarial Example for Audio Privacy Protection Jiajie Zhu et.al. 2601.01239 null
2026-01-02 Improving Code-Switching Speech Recognition with TTS Data Augmentation Yue Heng Yeo et.al. 2601.00935 null
2025-12-31 Index-ASR Technical Report Zheshu Song et.al. 2601.00890 null
2026-01-02 Three factor delay learning rules for spiking neural networks Luke Vassallo et.al. 2601.00668 null
2026-01-01 IKFST: IOO and KOO Algorithms for Accelerated and Precise WFST-based End-to-End Automatic Speech Recognition Zhuoran Zhuang et.al. 2601.00160 null
2025-12-31 Learning Speech Representations with Variational Predictive Coding Sung-Lin Yeh et.al. 2601.00100 null
2025-12-31 SLM-TTA: A Framework for Test-Time Adaptation of Generative Spoken Language Models Yuan-Kuei Wu et.al. 2512.24739 null
2025-12-29 PROFASR-BENCH: A Benchmark for Context-Conditioned ASR in High-Stakes Professional Speech Deepak Babu Piskala et.al. 2512.23686 link
2025-12-17 Marco-ASR: A Principled and Metric-Driven Framework for Fine-Tuning Large-Scale ASR Models for Domain Adaptation Xuanfan Ni et.al. 2512.22165 null
2025-12-14 EEG-to-Voice Decoding of Spoken and Imagined speech Using Non-Invasive EEG Hanbeot Park et.al. 2512.22146 null
2025-12-26 Contextual Biasing for LLM-Based ASR with Hotword Retrieval and Reinforcement Learning YuXiang Kong et.al. 2512.21828 null
2025-12-25 Broadband tunable microwave photonic radar for simultaneous detection of human respiration, heartbeat, and speech with deep learning-based speech recognition Lei Gao et.al. 2512.21566 null
2025-12-29 VALLR-Pin: Uncertainty-Factorized Visual Speech Recognition for Mandarin with Pinyin Guidance Chang Sun et.al. 2512.20032 null
2025-12-22 From Speech to Subtitles: Evaluating ASR Models in Subtitling Italian Television Programs Alessandro Lucca et.al. 2512.19161 null
2025-12-22 Enhancing Fully Formatted End-to-End Speech Recognition with Knowledge Distillation via Multi-Codebook Vector Quantization Jian You et.al. 2512.18967 null
2025-12-20 Phoneme-based speech recognition driven by large language models and sampling marginalization Te Ma et.al. 2512.18371 null
2025-12-20 TICL+: A Case Study On Speech In-Context Learning for Children's Speech Recognition Haolong Zheng et.al. 2512.18263 null
2025-11-27 Supplementary Resources and Analysis for Automatic Speech Recognition Systems Trained on the Loquacious Dataset Nick Rossenbach et.al. 2512.17915 null
2025-12-19 Peeking Into The Future For Contextual Biasing Ramaneswaran Selvakumar et.al. 2512.17657 null
2025-12-19 When De-noising Hurts: A Systematic Study of Speech Enhancement Effects on Modern Medical ASR Systems Sujal Chondhekar et.al. 2512.17562 null
2025-12-19 Zero-Shot Recognition of Dysarthric Speech Using Commercial Automatic Speech Recognition and Multimodal Large Language Models Ali Alsayegh et.al. 2512.17474 null
2025-12-19 Incorporating Error Level Noise Embedding for Improving LLM-Assisted Robustness in Persian Speech Recognition Zahra Rahmani et.al. 2512.17247 null
2025-11-04 V-Agent: An Interactive Video Search System Using Vision-Language Models SunYoung Park et.al. 2512.16925 null
2026-01-14 Navigating the Reality Gap: Privacy-Preserving On-Device Continual Adaptation of ASR for Clinical Telephony Darshil Chauhan et.al. 2512.16401 null
2026-01-15 TinyMyo: a Tiny Foundation Model for Flexible EMG Signal Processing at the Edge Matteo Fasulo et.al. 2512.15729 link
2025-12-16 ComMark: Covert and Robust Black-Box Model Watermarking with Compressed Samples Yunfei Yang et.al. 2512.15641 null
2025-12-16 Adapting Speech Language Model to Singing Voice Synthesis Yiwen Zhao et.al. 2512.14657 null
2025-12-16 Scalable Frameworks for Real-World Audio-Visual Speech Recognition Sungnyun Kim et.al. 2512.14083 null
2025-12-15 Reproducing and Dissecting Denoising Language Models for Speech Recognition Dorian Koch et.al. 2512.13576 null
2025-12-18 Adaptive Edge-Cloud Inference for Speech-to-Action Systems Using ASR and Large Language Models Mohammad Jalili Torkamani et.al. 2512.12769 null
2025-12-13 System X: A Mobile Voice-Based AI System for EMR Generation and Clinical Decision Support in Low-Resource Maternal Healthcare Maryam Mustafa et.al. 2512.12240 null
2025-12-12 All-in-One ASR: Unifying Encoder-Decoder Models of CTC, Attention, and Transducer in Dual-Mode ASR Takafumi Moriya et.al. 2512.11543 null
2025-12-12 The Affective Bridge: Unifying Feature Representations for Speech Deepfake Detection Yupei Li et.al. 2512.11241 null
2025-12-11 The TCG CREST -- RKMVERI Submission for the NCIIPC Startup India AI Grand Challenge Nikhil Raghav et.al. 2512.11009 null
2025-11-30 Benchmarking Automatic Speech Recognition Models for African Languages Alvin Nahabwe et.al. 2512.10968 null
2025-11-30 ASR Under the Stethoscope: Evaluating Biases in Clinical Speech Recognition across Indian Languages Subham Kumar et.al. 2512.10967 null
2025-12-11 TRIDENT: A Redundant Architecture for Caribbean-Accented Emergency Speech Triage Elroy Galbraith et.al. 2512.10741 null
2025-12-10 Robust Speech Activity Detection in the Presence of Singing Voice Philipp Grundhuber et.al. 2512.09713 null
2025-12-02 Enhancing Automatic Speech Recognition Through Integrated Noise Detection Architecture Karamvir Singh et.al. 2512.08973 null
2025-12-08 A Simple Method to Enhance Pre-trained Language Models with Speech Tokens for Classification Nicolas Calbucura et.al. 2512.07571 null
2025-12-08 Efficient ASR for Low-Resource Languages: Leveraging Cross-Lingual Unlabeled Data Srihari Bandarupalli et.al. 2512.07277 null
2025-12-06 Sanvaad: A Multimodal Accessibility Framework for ISL Recognition and Voice-Based Interaction Kush Revankar et.al. 2512.06485 null
2025-12-01 KidSpeak: A General Multi-purpose LLM for Kids' Speech Recognition and Screening Rohan Sharma et.al. 2512.05994 null
2025-11-23 SyncVoice: Towards Video Dubbing with Vision-Augmented Pretrained TTS Model Kaidi Wang et.al. 2512.05126 null
2025-12-04 Measuring the Unspoken: A Disentanglement Model and Benchmark for Psychological Analysis in the Wild Yigui Feng et.al. 2512.04728 null
2025-12-04 Multi-Loss Learning for Speech Emotion Recognition with Energy-Adaptive Mixup and Frame-Level Attention Cong Wang et.al. 2512.04551 null
2025-12-02 Comparing Unsupervised and Supervised Semantic Speech Tokens: A Case Study of Child ASR Mohan Shi et.al. 2512.03301 null
2025-12-02 MAViD: A Multimodal Framework for Audio-Visual Dialogue Understanding and Generation Youxin Pang et.al. 2512.03034 null
2025-12-02 Bangla Hate Speech Classification with Fine-tuned Transformer Models Yalda Keivan Jafari et.al. 2512.02845 null
2025-12-02 Reasoning-Aware Multimodal Fusion for Hateful Video Detection Shuonan Yang et.al. 2512.02743 null
2025-12-02 Hear What Matters! Text-conditioned Selective Video-to-Audio Generation Junwon Lee et.al. 2512.02650 null
2025-12-01 See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models Le Thien Phuc Nguyen et.al. 2512.02231 null
2026-01-19 Swivuriso: The South African Next Voices Multilingual Speech Dataset Vukosi Marivate et.al. 2512.02201 null
2025-11-18 On the Difficulty of Token-Level Modeling of Dysfluency and Fluency Shaping Artifacts Kashaf Gulzar et.al. 2512.02027 null
2025-12-01 MAC-SLU: Multi-Intent Automotive Cabin Spoken Language Understanding Benchmark Yuezhang Peng et.al. 2512.01603 link
2025-12-01 ZO-ASR: Zeroth-Order Fine-Tuning of Speech Foundation Models without Back-Propagation Yuezhang Peng et.al. 2512.01267 null
2025-11-28 OmniFusion: Simultaneous Multilingual Multimodal Translations via Modular Fusion Sai Koneru et.al. 2512.00234 link
2025-11-28 Scaling HuBERT for African Languages: From Base to Large and XL Antoine Caubrière et.al. 2511.23370 null
2025-11-28 HPSU: A Benchmark for Human-Level Perception in Real-World Spoken Speech Understanding Chen Li et.al. 2511.23178 null
2025-11-28 Group-Aware Partial Model Merging for Children's Automatic Speech Recognition Thomas Rolland et.al. 2511.23098 null
2025-11-27 Modeling Romanized Hindi and Bengali: Dataset Creation and Multilingual LLM Integration Kanchon Gharami et.al. 2511.22769 null
2025-11-27 Do You See What I Say? Generalizable Deepfake Detection based on Visual Speech Recognition Maheswar Bora et.al. 2511.22443 null
2025-11-27 Layover or Direct Flight: Rethinking Audio-Guided Image Segmentation Joel Alberto Santos et.al. 2511.22025 null
2025-11-16 On the Cross-lingual Transferability of Pre-trained wav2vec2-based Models Jonatas Grosman et.al. 2511.21704 null
2025-11-26 Multi-Reward GRPO for Stable and Prosodic Single-Codebook TTS LLMs at Scale Yicheng Zhong et.al. 2511.21270 null
2025-11-26 ASR Error Correction in Low-Resource Burmese with Alignment-Enhanced Transformers using Phonetic Features Ye Bhone Lin et.al. 2511.21088 null
2025-11-26 Towards Audio Token Compression in Large Audio Language Models Saurabhchand Bhati et.al. 2511.20973 null
2025-12-24 SingingSDS: A Singing-Capable Spoken Dialogue System for Conversational Roleplay Applications Jionghao Han et.al. 2511.20972 link
2025-11-25 Bridging the Language Gap: Synthetic Voice Diversity via Latent Mixup for Equitable Speech Recognition Wesley Bian et.al. 2511.20534 null
2025-11-25 Mispronunciation Detection and Diagnosis Without Model Training: A Retrieval-Based Approach Huu Tuong Tu et.al. 2511.20107 null
2025-11-25 EM2LDL: A Multilingual Speech Corpus for Mixed Emotion Recognition through Label Distribution Learning Xingfeng Li et.al. 2511.20106 null
2025-11-25 It Hears, It Sees too: Multi-Modal LLM for Depression Detection By Integrating Visual Understanding into Audio Language Models Xiangyu Zhao et.al. 2511.19877 null
2025-11-24 Neural Architecture Search for Quantum Autoencoders Hibah Agha et.al. 2511.19246 null
2025-11-24 AuViRe: Audio-visual Speech Representation Reconstruction for Deepfake Temporal Localization Christos Koutlis et.al. 2511.18993 null
2025-11-27 PrismAudio: Decomposed Chain-of-Thoughts and Multi-dimensional Rewards for Video-to-Audio Generation Huadai Liu et.al. 2511.18833 null
2025-11-24 Context-Aware Whisper for Arabic ASR Under Linguistic Varieties Bashar Talafha et.al. 2511.18774 null
2025-11-24 AIRHILT: A Human-in-the-Loop Testbed for Multimodal Conflict Detection in Aviation Omar Garib et.al. 2511.18718 null
2025-11-23 A Multimodal Conversational Agent for Tabular Data Analysis Mohammad Nour Al Awad et.al. 2511.18405 null
2025-11-21 Point of Order: Action-Aware LLM Persona Modeling for Realistic Civic Simulation Scott Merrill et.al. 2511.17813 null
2025-11-12 Speech Recognition Model Improves Text-to-Speech Synthesis using Fine-Grained Reward Guansu Wang et.al. 2511.17555 null
2025-11-21 Enhancing Quranic Learning: A Multimodal Deep Learning Approach for Arabic Phoneme Recognition Ayhan Kucukmanisa et.al. 2511.17477 null
2025-11-21 Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM Chiori Hori et.al. 2511.17335 null
2025-11-21 Investigating self-supervised representations for audio-visual deepfake detection Dragos-Alexandru Boldisor et.al. 2511.17181 null
2026-01-19 WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue Zachary Ellis et.al. 2511.16544 null
2025-12-03 NLP Datasets for Idiom and Figurative Language Tasks Blake Matheny et.al. 2511.16345 null
2025-11-20 Train Short, Infer Long: Speech-LLM Enables Zero-Shot Streamable Joint ASR and Diarization on Long Audio Mohan Shi et.al. 2511.16046 null
2025-11-19 Scriboora: Rethinking Human Pose Forecasting Daniel Bermuth et.al. 2511.15565 null
2025-11-18 Quality-Controlled Multimodal Emotion Recognition in Conversations with Identity-Based Transfer Learning and MAMBA Fusion Zanxu Wang et.al. 2511.14969 null
2025-11-18 Ground Truth Generation for Multilingual Historical NLP using LLMs Clovis Gladstone et.al. 2511.14688 null
2025-12-01 IMSE: Efficient U-Net-based Speech Enhancement using Inception Depthwise Convolution and Amplitude-Aware Linear Attention Xinxin Tang et.al. 2511.14515 null
2025-11-18 TTA: Transcribe, Translate and Alignment for Cross-lingual Speech Representation Wei Liu et.al. 2511.14410 null
2025-11-18 AfriSpeech-MultiBench: A Verticalized Multidomain Multicountry Benchmark Suite for African Accented English ASR Gabrial Zencha Ashungafac et.al. 2511.14255 null
2025-11-19 StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model Yifan Yang et.al. 2511.14223 null
2025-11-18 Listen Like a Teacher: Mitigating Whisper Hallucinations using Adaptive Layer Attention and Knowledge Distillation Kumud Tripathi et.al. 2511.14219 null
2025-11-17 Human-centric Maintenance Process Through Integration of AI, Speech, and AR Parul Khanna et.al. 2511.13918 null
2025-11-19 Segmenting Collision Sound Sources in Egocentric Videos Kranti Kumar Parida et.al. 2511.13863 null
2025-11-26 Passive Dementia Screening via Facial Temporal Micro-Dynamics Analysis of In-the-Wild Talking-Head Video Filippo Cenacchi et.al. 2511.13802 null
2025-11-05 Emotion Recognition in Multi-Speaker Conversations through Speaker Identification, Knowledge Distillation, and Hierarchical Fusion Xiao Li et.al. 2511.13731 null
2026-01-14 Toward Conversational Hungarian Speech Recognition: Introducing the BEA-Large and BEA-Dialogue Datasets Máté Gedeon et.al. 2511.13529 null
2025-11-17 Spatial Blind Spot: Auditory Motion Perception Deficits in Audio LLMs Zhe Sun et.al. 2511.13273 null
2025-11-17 Distinguishing Repetition Disfluency from Morphological Reduplication in Bangla ASR Transcripts: A Novel Corpus and Benchmarking Analysis Zaara Zabeen Arpa et.al. 2511.13159 null
2025-11-16 Hi-Reco: High-Fidelity Real-Time Conversational Digital Humans Hongbin Huang et.al. 2511.12662 null
2025-11-23 Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data Yunxin Li et.al. 2511.12609 link
2025-11-15 How Far Do SSL Speech Models Listen for Tone? Temporal Focus of Tone Representation under Low-resource Transfer Minu Kim et.al. 2511.12285 null
2025-11-15 Fusionista2.0: Efficiency Retrieval System for Large-Scale Datasets Huy M. Le et.al. 2511.12255 null
2025-11-12 Tighter Truncated Rectangular Prism Approximation for RNN Robustness Verification Xingqi Lin et.al. 2511.11699 null
2025-11-12 Beyond saliency: enhancing explanation of speech emotion recognition with expert-referenced acoustic cues Seham Nasr et.al. 2511.11691 null
2025-11-14 Speech-Aware Long Context Pruning and Integration for Contextualized Automatic Speech Recognition Yiming Rong et.al. 2511.11139 null
2025-11-13 TEDxTN: A Three-way Speech Translation Corpus for Code-Switched Tunisian Arabic - English Fethi Bougares et.al. 2511.10780 null
2025-11-09 Towards Fine-Grained Code-Switch Speech Translation with Semantic Space Alignment Yan Gao et.al. 2511.10670 null
2025-11-13 Music Flamingo: Scaling Music Understanding in Audio Language Models Sreyan Ghosh et.al. 2511.10289 null
2025-11-12 Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages Omnilingual ASR team et.al. 2511.09690 link
2025-11-12 End-to-end Contrastive Language-Speech Pretraining Model For Long-form Spoken Question Answering Jiliang Hu et.al. 2511.09282 null
2025-11-12 Context-Aware Dynamic Chunking for Streaming Tibetan Speech Recognition Chao Wang et.al. 2511.09085 null
2025-11-12 Towards Effective and Efficient Non-autoregressive decoders for Conformer and LLM-based ASR using Block-based Attention Mask Tianzi Wang et.al. 2511.09084 null
2025-11-11 Quantizing Whisper-small: How design choices affect ASR performance Arthur Söhler et.al. 2511.08093 null
2025-11-11 Speech Emotion Recognition with Phonation Excitation Information and Articulatory Kinematics Ziqian Zhang et.al. 2511.07955 null
2025-11-13 SpikCommander: A High-performance Spiking Transformer with Multi-view Learning for Efficient Speech Command Recognition Jiaqi Wang et.al. 2511.07883 null
2025-11-24 SynTTS-Commands: A Public Dataset for On-Device KWS via TTS-Synthesized Multilingual Speech Lu Gan et.al. 2511.07821 null
2025-11-10 LiveNeRF: Efficient Face Replacement Through Neural Radiance Fields Integration Tung Vu et.al. 2511.07552 null
2025-11-10 Enabling Automatic Self-Talk Detection via Earables Euihyeok Lee et.al. 2511.07493 null
2025-11-11 Surgical Agent Orchestration Platform for Voice-directed Patient Data Interaction Hyeryun Park et.al. 2511.07392 null
2025-11-10 Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models Umberto Cappellazzo et.al. 2511.07253 link
2025-11-10 Improving Remote Patient Monitoring Systems Using a Fog-based IoT Platform with Speech Recognition Marc Jayson Baucas et.al. 2511.07189 null
2025-11-10 E2E-VGuard: Adversarial Prevention for Production LLM-based End-To-End Speech Synthesis Zhisheng Zhang et.al. 2511.07099 null
2025-11-10 CLiFT-ASR: A Cross-Lingual Fine-Tuning Framework for Low-Resource Taiwanese Hokkien Speech Recognition Hung-Yang Sung et.al. 2511.06860 null
2025-11-10 MedVoiceBias: A Controlled Study of Audio LLM Behavior in Clinical Decision-Making Zhi Rui Tam et.al. 2511.06592 null
2025-11-07 Shared Latent Representation for Joint Text-to-Audio-Visual Synthesis Dogucan Yaman et.al. 2511.05432 null
2025-11-12 MERaLiON-SER: Robust Speech Emotion Recognition Model for English and SEA Languages Hardik B. Sailor et.al. 2511.04914 null
2025-11-06 CantoASR: Prosody-Aware ASR-LALM Collaboration for Low-Resource Cantonese Dazhong Chen et.al. 2511.04139 null
2025-11-06 WST: Weakly Supervised Transducer for Automatic Speech Recognition Dongji Gao et.al. 2511.04035 null
2025-11-06 Accelerating scientific discovery with the common task framework J. Nathan Kutz et.al. 2511.04001 null
2025-11-05 Seeing What You Say: Expressive Image Generation from Speech Jiyoung Lee et.al. 2511.03423 null
2025-11-05 Open Source State-Of-the-Art Solution for Romanian Speech Recognition Gabriel Pirlogeanu et.al. 2511.03361 null
2025-11-05 TASU: Text-Only Alignment for Speech Understanding Jing Peng et.al. 2511.03310 null
2025-11-11 How to Evaluate Speech Translation with Source-Aware Neural MT Metrics Mauro Cettolo et.al. 2511.03295 null
2025-11-04 An unscented Kalman filter method for real time input-parameter-state estimation Marios Impraimakis et.al. 2511.02717 null
2025-11-04 Energy-Efficient Hardware Acceleration of Whisper ASR on a CGLA Takuto Ando et.al. 2511.02269 null
2025-11-03 SeaLLMs-Audio: Large Audio-Language Models for Southeast Asia Chaoqun Liu et.al. 2511.01670 null
2025-11-02 MULTI-Bench: A Multi-Turn Interactive Benchmark for Assessing Emotional Intelligence ability of Spoken Dialogue Models Yayue Deng et.al. 2511.00850 null
2025-11-01 Emotion Detection in Speech Using Lightweight and Transformer-Based Models: A Comparative and Ablation Study Lucky Onyekwelu-Udoka et.al. 2511.00402 null
2025-10-31 Reference Microphone Selection for Guided Source Separation based on the Normalized L-p Norm Anselm Lohmann et.al. 2510.27198 null
2025-10-30 Overview of the MEDIQA-OE 2025 Shared Task on Medical Order Extraction from Doctor-Patient Consultations Jean-Philippe Corbeil et.al. 2510.26974 null
2025-10-29 Multi-Representation Attention Framework for Underwater Bioacoustic Denoising and Recognition Amine Razig et.al. 2510.26838 null
2025-10-29 Audio-Visual Speech Enhancement In Complex Scenarios With Separation And Dereverberation Joint Modeling Jiarong Du et.al. 2510.26825 null
2025-10-28 Cross-Corpus Validation of Speech Emotion Recognition in Urdu using Domain-Knowledge Acoustic Features Unzela Talpur et.al. 2510.26823 null
2025-10-28 See the Speaker: Crafting High-Resolution Talking Faces from Speech with Prior Guidance and Region Refinement Jinting Wang et.al. 2510.26819 null
2025-10-30 HMM for short independent sequences: Multiple sequence Baum-Welch application Margarita Cabrera-Bean et.al. 2510.26532 null
2025-10-29 Lost in Phonation: Voice Quality Variation as an Evaluation Dimension for Speech Foundation Models Harm Lameris et.al. 2510.25577 null
2025-10-29 Learning Disentangled Speech- and Expression-Driven Blendshapes for 3D Talking Face Animation Yuxiang Mao et.al. 2510.25234 null
2025-10-30 Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech Pedro Corrêa et.al. 2510.25054 null
2025-10-28 POWSM: A Phonetic Open Whisper-Style Speech Foundation Model Chin-Jou Li et.al. 2510.24992 null
2025-11-25 Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation Inclusion AI et.al. 2510.24821 null
2025-10-28 BEST-RQ-Based Self-Supervised Learning for Whisper Domain Adaptation Raphaël Bagat et.al. 2510.24570 null
2025-10-28 Levée d'ambiguïtés par grammaires locales Eric G. C. Laporte et.al. 2510.24530 null
2025-10-30 Audio Signal Processing Using Time Domain Mel-Frequency Wavelet Coefficient Rinku Sebastian et.al. 2510.24519 null
2025-10-28 Sound Source Localization for Spatial Mapping of Surgical Actions in Dynamic Scenes Jonas Hein et.al. 2510.24332 null
2025-10-28 V-SAT: Video Subtitle Annotation Tool Arpita Kundu et.al. 2510.24180 null
2025-10-28 RegSpeech12: A Regional Corpus of Bengali Spontaneous Speech Across Dialects Md. Rezuwan Hassan et.al. 2510.24096 null
2025-10-28 Listening without Looking: Modality Bias in Audio-Visual Captioning Yuchi Ishikawa et.al. 2510.24024 null
2025-10-30 TeleEgo: Benchmarking Egocentric AI Assistants in the Wild Jiaqi Yan et.al. 2510.23981 null
2025-10-27 A Neural Model for Contextual Biasing Score Learning and Filtering Wanting Huang et.al. 2510.23849 null
2025-11-01 RoboOmni: Proactive Robot Manipulation in Omni-modal Context Siyin Wang et.al. 2510.23763 link
2025-10-27 LibriConvo: Simulating Conversations from Read Literature for ASR and Diarization Máté Gedeon et.al. 2510.23320 null
2025-10-27 Arabic Little STT: Arabic Children Speech Recognition Dataset Mouhand Alkadri et.al. 2510.23319 null
2025-10-27 A Cocktail-Party Benchmark: Multi-Modal dataset and Comparative Evaluation Results Thai-Binh Nguyen et.al. 2510.23276 null
2025-10-29 Are ASR foundation models generalized enough to capture features of regional dialects for low-resource languages? Tawsif Tashwar Dipto et.al. 2510.23252 null
2025-10-27 Treble10: A high-quality dataset for far-field speech recognition, dereverberation, and enhancement Sarabeth S. Mullins et.al. 2510.23141 null
2025-10-27 Adapting Speech Foundation Models with Large Language Models for Unified Speech Recognition Jing-Xuan Zhang et.al. 2510.22961 null
2025-10-26 EchoMind: An Interrelated Multi-level Benchmark for Evaluating Empathetic Speech Language Models Li Zhou et.al. 2510.22758 null
2025-10-26 LRW-Persian: Lip-reading in the Wild Dataset for Persian Language Zahra Taghizadeh et.al. 2510.22716 null
2025-10-28 Look and Tell: A Dataset for Multimodal Grounding Across Egocentric and Exocentric Views Anna Deichler et.al. 2510.22672 null
2025-11-02 Mitigating Attention Sinks and Massive Activations in Audio-Visual Speech Recognition with LLMs Anand et.al. 2510.22603 link
2025-10-26 A Sociophonetic Analysis of Racial Bias in Commercial ASR Systems Using the Pacific Northwest English Corpus Michael Scott et.al. 2510.22495 null
2025-10-26 The Tonogenesis Continuum in Tibetan: A Computational Investigation Siyu Liang et.al. 2510.22485 null
2025-10-25 M-CIF: Multi-Scale Alignment For CIF-Based Non-Autoregressive ASR Ruixiang Mao et.al. 2510.22172 null
2025-10-23 LSF-Animation: Label-Free Speech-Driven Facial Animation via Implicit Feature Representation Xin Lu et.al. 2510.21864 null
2025-10-24 Compressing Quaternion Convolutional Neural Networks for Audio Classification Arshdeep Singh et.al. 2510.21388 null
2025-10-24 SindBERT, the Sailor: Charting the Seas of Turkish NLP Raphael Scheible-Schmitt et.al. 2510.21364 null
2025-10-27 ReFESS-QI: Reference-Free Evaluation For Speech Separation With Joint Quality And Intelligibility Scoring Ari Frummer et.al. 2510.21014 null
2025-10-22 Beyond Hearing: Learning Task-agnostic ExG Representations from Earphones via Physiology-informed Tokenization Hyungjun Yoon et.al. 2510.20853 null
2025-10-21 Can large audio language models understand child stuttering speech? speech summarization, and source separation Chibuzor Okocha et.al. 2510.20850 null
2025-10-23 Decoding the Ear: A Framework for Objectifying Expressiveness from Human Preference Through Efficient Alignment Zhiyu Lin et.al. 2510.20513 null
2025-10-23 Speaking Clearly: A Simplified Whisper-Based Codec for Low-Bitrate Speech Coding Xin Zhang et.al. 2510.20504 link
2025-10-23 SpeechAgent: An End-to-End Mobile Infrastructure for Speech Impairment Assistance Haowei Lou et.al. 2510.20113 null
2025-10-22 Re-evaluating Minimum Bayes Risk Decoding for Automatic Speech Recognition Yuu Jinnai et.al. 2510.19471 null
2025-10-22 Time delay embeddings to characterize the timbre of musical instruments using Topological Data Analysis: a study on synthetic and real data Gakusei Sato et.al. 2510.19435 null
2025-10-23 FLASH Viterbi: Fast and Adaptive Viterbi Decoding for Modern Data Systems Ziheng Deng et.al. 2510.19301 null
2025-10-22 Tibetan Language and AI: A Comprehensive Survey of Resources, Methods and Challenges Cheng Huang et.al. 2510.19144 null
2025-11-05 StutterZero and StutterFormer: End-to-End Speech Conversion for Stuttering Transcription and Correction Qianheng Xu et.al. 2510.18938 null
2025-10-28 RIR-Mega: a large-scale simulated room impulse response dataset for machine learning and room acoustics modeling Mandip Goswami et.al. 2510.18917 link
2025-10-21 Adapting Language Balance in Code-Switching Speech Enes Yavuz Ugan et.al. 2510.18724 null
2025-10-23 MLMA: Towards Multilingual ASR With Mamba-based Architectures Mohamed Nabih Ali et.al. 2510.18684 null
2025-10-21 KrishokBondhu: A Retrieval-Augmented Voice-Based Agricultural Advisory Call Center for Bengali Farmers Mohd Ruhul Ameen et.al. 2510.18355 null
2025-10-20 Transformer Redesign for Late Fusion of Audio-Text Features on Ultra-Low-Power Edge Hardware Stavros Mitsis et.al. 2510.18036 null
2025-10-20 ImaGGen: Zero-Shot Generation of Co-Speech Semantic Gestures Grounded in Language and Image Input Hendric Voss et.al. 2510.17617 null
2025-10-20 Conveying Meaning through Gestures: An Investigation into Semantic Co-Speech Gesture Generation Hendric Voss et.al. 2510.17599 null
2025-10-19 End-to-end Listen, Look, Speak and Act Siyin Wang et.al. 2510.16756 null
2025-10-19 Zero- and One-Shot Data Augmentation for Sentence-Level Dysarthric Speech Recognition in Constrained Scenarios Shiyao Wang et.al. 2510.16700 null
2025-10-18 Hallucination Benchmark for Speech Foundation Models Alkis Koudounas et.al. 2510.16567 null
2025-10-18 Interpreting the Dimensions of Speaker Embedding Space Mark Huckvale et.al. 2510.16489 null
2025-10-18 Probing the Hidden Talent of ASR Foundation Models for L2 English Oral Assessment Fu-An Chao et.al. 2510.16387 null
2025-10-18 MuseTok: Symbolic Music Tokenization for Generation and Semantic Understanding Jingyue Huang et.al. 2510.16273 null
2025-10-17 SpeechLLMs for Large-scale Contextualized Zero-shot Slot Filling Kadri Hacioglu et.al. 2510.15851 null
2025-10-17 Magnitude and Phase-based Feature Fusion Using Co-attention Mechanism for Speaker recognition Rongfeng Su et.al. 2510.15659 null
2025-10-17 SpikeVox: Towards Energy-Efficient Speech Therapy Framework with Spike-driven Generative Language Models Rachmad Vidya Wicaksana Putra et.al. 2510.15566 null
2025-10-17 VocalBench-DF: A Benchmark for Evaluating Speech LLM Robustness to Disfluency Hongcheng Liu et.al. 2510.15406 null
2025-10-16 OmniMotion: Multimodal Motion Generation with Continuous Masked Autoregression Zhe Li et.al. 2510.14954 null
2025-10-16 RLAIF-SPA: Optimizing LLM-based Emotional Speech Synthesis via RLAIF Qing Yang et.al. 2510.14628 null
2025-10-15 Do Slides Help? Multi-modal Context for Automatic Transcription of Conference Talks Supriti Sinhamahapatra et.al. 2510.13979 null
2025-10-15 Two Heads Are Better Than One: Audio-Visual Speech Error Correction with Dual Hypotheses Sungnyun Kim et.al. 2510.13281 null
2025-11-13 A Critical Review of the Need for Knowledge-Centric Evaluation of Quranic Recitation Mohammed Hilal Al-Kharusi et.al. 2510.12858 null
2025-10-14 Adaptive vector steering: A training-free, layer-wise intervention for hallucination mitigation in large audio and multimodal models Tsung-En Lin et.al. 2510.12851 null
2025-10-11 Automatic Speech Recognition in the Modern Era: Architectures, Training, and Evaluation Md. Nayeem et.al. 2510.12827 null
2025-10-14 Structured Sparsity and Weight-adaptive Pruning for Memory and Compute efficient Whisper models Prasenjit K Mudi et.al. 2510.12666 null
2025-10-12 End-to-end Speech Recognition with similar length speech and text Peng Fan et.al. 2510.10453 null
2025-10-11 End-to-end Automatic Speech Recognition and Speech Translation: Integration of Speech Foundational Models and LLMs Nam Luu et.al. 2510.10329 null
2025-10-11 SyncLipMAE: Contrastive Masked Pretraining for Audio-Visual Talking-Face Representation Zeyu Ling et.al. 2510.10069 null
2025-10-10 Accent-Invariant Automatic Speech Recognition via Saliency-Driven Spectrogram Masking Mohammad Hossein Sameti et.al. 2510.09528 null
2025-10-10 WildElder: A Chinese Elderly Speech Dataset from the Wild with Fine-Grained Manual Annotations Hui Wang et.al. 2510.09344 null
2025-10-10 Effects of automotive microphone frequency response characteristics and noise conditions on speech and ASR quality -- an experimental evaluation Michele Buccoli et.al. 2510.09236 null
2025-10-10 FLToP CTC: Frame-Level Token Pruning via Relative Threshold for Efficient and Memory-Saving Decoding on Diverse Platforms Atul Shree et.al. 2510.09085 null
2025-10-08 Look before Transcription: End-to-End SlideASR with Visually-Anchored Policy Optimization Rui Hu et.al. 2510.08618 null
2025-10-01 Articulation-Informed ASR: Integrating Articulatory Features into ASR via Auxiliary Speech Inversion and Cross-Attention Fusion Ahmed Adel Attia et.al. 2510.08585 null
2025-10-09 Pseudo2Real: Task Arithmetic for Pseudo-Label Correction in Automatic Speech Recognition Yi-Cheng Lin et.al. 2510.08047 null
2025-10-09 Bloodroot: When Watermarking Turns Poisonous For Stealthy Backdoor Kuan-Yu Chen et.al. 2510.07909 null
2025-10-08 How much speech data is necessary for ASR in African languages? An evaluation of data scaling in Kinyarwanda and Kikuyu Benjamin Akera et.al. 2510.07221 null
2025-10-09 Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation Vaibhav Srivastav et.al. 2510.06961 null
2025-10-07 Linguistically Informed Tokenization Improves ASR for Underresourced Languages Massimo Daul et.al. 2510.06461 null
2025-10-06 How I Built ASR for Endangered Languages with a Spoken Dictionary Christopher Bartley et.al. 2510.04832 null
2025-10-06 UniVoice: Unifying Autoregressive ASR and Flow-Matching based TTS with Large Language Models Wenhao Guan et.al. 2510.04593 null
2025-10-06 Evaluating Self-Supervised Speech Models via Text-Based LLMS Takashi Maekaku et.al. 2510.04463 null
2025-10-05 Probing Whisper for Dysarthric Speech in Detection and Assessment Zhengjun Yue et.al. 2510.04219 null
2025-10-05 Drax: Speech Recognition with Discrete Flow Matching Aviv Navon et.al. 2510.04162 link
2025-10-05 MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition Umberto Cappellazzo et.al. 2510.04136 null
2025-10-04 Adapting Diarization-Conditioned Whisper for End-to-End Multi-Talker Speech Recognition Martin Kocour et.al. 2510.03723 null
2025-10-04 Towards Unsupervised Speech Recognition at the Syllable-Level Liming Wang et.al. 2510.03639 null
2025-10-04 Scaling Multi-Talker ASR with Speaker-Agnostic Activity Streams Xiluo He et.al. 2510.03630 null
2025-10-03 Listening or Reading? Evaluating Speech Awareness in Chain-of-Thought Speech-to-Text Translation Jacobo Romero-Díaz et.al. 2510.03115 null
2025-10-03 Revisiting Direct Speech-to-Text Translation with Speech LLMs: Better Scaling than CoT Prompting? Oriol Pareras et.al. 2510.03093 null
2025-10-16 Transcribe, Translate, or Transliterate: An Investigation of Intermediate Representations in Spoken Language Models Tolúlopé Ògúnrèmí et.al. 2510.02569 null
2025-09-26 KAME: Tandem Architecture for Enhancing Knowledge in Real-Time Speech-to-Speech Conversational AI So Kuroki et.al. 2510.02327 null
2025-10-02 EvolveCaptions: Empowering DHH Users Through Real-Time Collaborative Captioning Liang-Yuan Wu et.al. 2510.02181 null
2025-10-01 Backdoor Attacks Against Speech Language Models Alexandrine Fortier et.al. 2510.01157 null
2025-10-01 Automatic Speech Recognition (ASR) for African Low-Resource Languages: A Systematic Literature Review Sukairaj Hafiz Imam et.al. 2510.01145 null
2025-10-01 Spiralformer: Low Latency Encoder for Streaming Speech Recognition with Circular Layer Skipping and Early Exiting Emiru Tsunoo et.al. 2510.00982 null
2025-09-30 IR-UWB Radar-Based Contactless Silent Speech Recognition with Attention-Enhanced Temporal Convolutional Networks Sunghwa Lee et.al. 2509.26409 null
2025-09-30 ASR Under Noise: Exploring Robustness for Sundanese and Javanese Salsabila Zahirah Pranida et.al. 2509.25878 null
2025-09-29 Beyond WER: Probing Whisper's Sub-token Decoder Across Diverse Language Resource Levels Siyu Liang et.al. 2509.25516 null
2025-09-29 Confidence-Guided Error Correction for Disordered Speech Recognition Abner Hernandez et.al. 2509.25048 null
2025-10-05 HiKE: Hierarchical Evaluation Framework for Korean-English Code-Switching Speech Recognition Gio Paik et.al. 2509.24613 link
2025-09-29 A Text-To-Text Alignment Algorithm for Better Evaluation of Modern Speech Recognition Systems Lasse Borgholt et.al. 2509.24478 null
2025-09-29 Code-switching Speech Recognition Under the Lens: Model- and Data-Centric Perspectives Hexin Liu et.al. 2509.24310 null
2025-09-28 AISHELL6-whisper: A Chinese Mandarin Audio-visual Whisper Speech Dataset with Speech Recognition Baselines Cancan Li et.al. 2509.23833 null
2025-09-28 Automatic Speech Recognition for Greek Medical Dictation Vardis Georgilas et.al. 2509.23550 null
2025-09-30 MeanFlowSE: One-Step Generative Speech Enhancement via MeanFlow Yike Zhu et.al. 2509.23299 null
2025-09-26 ArFake: A Multi-Dialect Benchmark and Baselines for Arabic Spoof-Speech Detection Mohamed Maged et.al. 2509.22808 null
2025-09-26 Index-MSR: A high-efficiency multimodal fusion framework for speech recognition Jinming Chen et.al. 2509.22744 null
2025-10-10 From Coarse to Fine: Recursive Audio-Visual Semantic Enhancement for Speech Separation Ke Xue et.al. 2509.22425 null
2025-09-26 Decoding Deception: Understanding Automatic Speech Recognition Vulnerabilities in Evasion and Poisoning Attacks Aravindhan G et.al. 2509.22060 null
2025-09-26 A Parallel Ultra-Low Power Silent Speech Interface based on a Wearable, Fully-dry EMG Neckband Fiona Meier et.al. 2509.21964 null
2025-09-26 Lightweight Front-end Enhancement for Robust ASR via Frame Resampling and Sub-Band Pruning Siyi Zhao et.al. 2509.21833 null
2025-09-26 Align2Speak: Improving TTS for Low Resource Languages via ASR-Guided Online Preference Optimization Shehzeen Hussain et.al. 2509.21718 null
2025-09-27 i-LAVA: Insights on Low Latency Voice-2-Voice Architecture for Agents Anupam Purwar et.al. 2509.20971 null
2025-09-25 Real-Time System for Audio-Visual Target Speech Enhancement T. Aleksandra Ma et.al. 2509.20741 null
2025-09-25 Visual Authority and the Rhetoric of Health Misinformation: A Multimodal Analysis of Social Media Videos Mohammad Reza Zarei et.al. 2509.20724 null
2025-09-23 Variational Low-Rank Adaptation for Personalized Impaired Speech Recognition Niclas Pokel et.al. 2509.20397 null
2025-09-23 Data-Efficient ASR Personalization for Non-Normative Speech Using an Uncertainty-Based Phoneme Difficulty Score for Guided Sampling Niclas Pokel et.al. 2509.20396 null
2025-09-26 MMedFD: A Real-world Healthcare Benchmark for Multi-turn Full-Duplex Automatic Speech Recognition Hongzhao Chen et.al. 2509.19817 null
2025-09-23 Retrieval Augmented Generation based context discovery for ASR Dimitrios Siskos et.al. 2509.19567 null
2025-09-23 SloPalSpeech: A 2,800-Hour Slovak Speech Corpus from Parliamentary Data Erik Božík et.al. 2509.19270 null
2025-09-23 HD-PPT: Hierarchical Decoding of Content- and Prompt-Preference Tokens for Instruction-based TTS Sihang Nie et.al. 2509.19001 null
2025-09-23 Group Relative Policy Optimization for Text-to-Speech with Large Language Models Chang Liu et.al. 2509.18798 null
2025-09-24 M4SER: Multimodal, Multirepresentation, Multitask, and Multistrategy Learning for Speech Emotion Recognition Jiajun He et.al. 2509.18706 null
2025-09-23 HarmoniFuse: A Component-Selective and Prompt-Adaptive Framework for Multi-Task Speech Language Modeling Yuke Si et.al. 2509.18570 null
2025-09-23 Explore the Reinforcement Learning for the LLM based ASR and TTS system Changfeng Gao et.al. 2509.18569 null
2025-09-24 MNV-17: A High-Quality Performative Mandarin Dataset for Nonverbal Vocalization Recognition in Speech Jialong Mai et.al. 2509.18196 null
2025-09-22 Transformer-Encoder Trees for Efficient Multilingual Machine Translation and Speech Translation Yiwen Guan et.al. 2509.17930 null
2025-09-22 Leveraging Audio-Visual Data to Reduce the Multilingual Gap in Self-Supervised Speech Models María Andrea Cruz Blandón et.al. 2509.17523 null
2025-09-29 Sidon: Fast and Robust Open-Source Multilingual Speech Restoration for Large-scale Dataset Cleansing Wataru Nakata et.al. 2509.17052 link
2025-09-20 Idiosyncratic Versus Normative Modeling of Atypical Speech Recognition: Dysarthric Case Studies Vishnu Raja et.al. 2509.16718 null
2025-10-09 Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing Mengqi Wang et.al. 2509.16622 null
2025-09-26 GLip: A Global-Local Integrated Progressive Framework for Robust Visual Speech Recognition Tianyue Wang et.al. 2509.16031 null
2025-09-22 Interpreting the Role of Visemes in Audio-Visual Speech Recognition Aristeidis Papadopoulos et.al. 2509.16023 null
2025-09-19 VOX-KRIKRI: Unifying Speech and Language through Continuous Fusion Dimitrios Damianos et.al. 2509.15667 null
2025-09-19 Layer-wise Minimal Pair Probing Reveals Contextual Grammatical-Conceptual Hierarchy in Speech Representations Linyang He et.al. 2509.15655 null
2025-09-19 Thinking in cocktail party: Chain-of-Thought and reinforcement learning for target speaker automatic speech recognition Yiru Zhang et.al. 2509.15612 null
2025-09-19 Chunk Based Speech Pre-training with High Resolution Finite Scalar Quantization Yun Tang et.al. 2509.15579 null
2025-09-19 State-of-the-Art Dysarthric Speech Recognition with MetaICL for on-the-fly Personalization Dhruuv Agarwal et.al. 2509.15516 null
2025-09-18 Impact of Phonetics on Speaker Identity in Adversarial Voice Attack Daniyal Kabir Dar et.al. 2509.15437 null
2025-09-18 BiRQ: Bi-Level Self-Labeling Random Quantization for Self-Supervised Speech Recognition Liuyuan Jiang et.al. 2509.15430 null
2025-09-23 Frustratingly Easy Data Augmentation for Low-Resource ASR Katsumi Ibaraki et.al. 2509.15373 null
2025-09-25 Speech Language Models for Under-Represented Languages: Insights from Wolof Yaya Sy et.al. 2509.15362 null
2025-09-20 Listening, Imagining & Refining: A Heuristic Optimized ASR Correction Framework with LLMs Yutong Liu et.al. 2509.15095 null
2025-09-18 From Who Said What to Who They Are: Modular Training-free Identity-Aware LLM Refinement of Speaker Diarization Yu-Wen Chen et.al. 2509.15082 null
2025-09-19 From Hype to Insight: Rethinking Large Language Model Integration in Visual Speech Recognition Rishabh Jain et.al. 2509.14880 null
2025-09-18 UMA-Split: unimodal aggregation for both English and Mandarin non-autoregressive speech recognition Ying Fang et.al. 2509.14653 null
2025-09-17 Multi-Channel Differential ASR for Robust Wearer Speech Recognition on Smart Glasses Yufeng Yang et.al. 2509.14430 null
2025-09-17 CS-FLEURS: A Massively Multilingual and Code-Switched Speech Dataset Brian Yan et.al. 2509.14161 null
2025-09-25 Canary-1B-v2 & Parakeet-TDT-0.6B-v3: Efficient and High-Performance Models for Multilingual ASR and AST Monica Sekoyan et.al. 2509.14128 null
2025-09-17 Language Conditioning Improves Accuracy of Aircraft Goal Prediction in Untowered Airspace Sundhar Vinodh Sangeetha et.al. 2509.14063 null
2025-09-17 Conducting Mission-Critical Voice Experiments with Automated Speech Recognition and Crowdsourcing Jan Janak et.al. 2509.13724 null
2025-09-09 On the Contribution of Lexical Features to Speech Emotion Recognition David Combei et.al. 2509.05634 null
2025-07-23 AI Telephone Surveying: Automating Quantitative Data Collection with an AI Interviewer Danny D. Leybzon et.al. 2507.17718 null
2025-07-23 Synthetic Voice Data for Automatic Speech Recognition in African Languages Brian DeRenzi et.al. 2507.17578 null
2025-07-23 BoSS: Beyond-Semantic Speech Qing Wang et.al. 2507.17563 null
2025-07-23 Application of Whisper in Clinical Practice: the Post-Stroke Speech Assessment during a Naming Task Milena Davudova et.al. 2507.17326 null
2025-07-23 Triple X: A LLM-Based Multilingual Speech Recognition System for the INTERSPEECH2025 MLC-SLM Challenge Miaomiao Gao et.al. 2507.17288 null
2025-07-20 Weak Supervision Techniques towards Enhanced ASR Models in Industry-level CRM Systems Zhongsheng Wang et.al. 2507.16843 null
2025-07-15 Towards Robust Speech Recognition for Jamaican Patois Music Transcription Jordan Madden et.al. 2507.16834 null
2025-07-22 Step-Audio 2 Technical Report Boyong Wu et.al. 2507.16632 null
2025-07-22 An approach to measuring the performance of Automatic Speech Recognition (ASR) models in the context of Large Language Model (LLM) powered applications Sujith Pulikodan et.al. 2507.16456 null
2025-07-21 Beyond Rate Coding: Surrogate Gradients Enable Spike Timing Learning in Spiking Neural Networks Ziqiao Yu et.al. 2507.16043 null
2025-07-21 Mixture to Beamformed Mixture: Leveraging Beamformed Mixture as Weak-Supervision for Speech Enhancement and Noise-Robust ASR Zhong-Qiu Wang et.al. 2507.15229 null
2025-07-21 EchoVoices: Preserving Generational Voices and Memories for Seniors and Children Haiying Xu et.al. 2507.15221 null
2025-07-19 Adapting Whisper for Lightweight and Efficient Automatic Speech Recognition of Children for On-device Edge Applications Satwik Dutta et.al. 2507.14451 null
2025-07-18 Open Automatic Speech Recognition Models for Classical and Modern Standard Arabic Lilit Grigoryan et.al. 2507.13977 null
2025-07-18 Optimizing ASR for Catalan-Spanish Code-Switching: A Comparative Analysis of Methodologies Carlos Mena et.al. 2507.13875 null
2025-07-17 Reading Between the Lines: Combining Pause Dynamics and Semantic Coherence for Automated Assessment of Thought Disorder Feng Chen et.al. 2507.13551 null
2025-07-18 Automatically assessing oral narratives of Afrikaans and isiXhosa children Retief Louw et.al. 2507.13205 null
2025-07-17 NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech Maksim Borisov et.al. 2507.13155 null
2025-07-17 UniSLU: Unified Spoken Language Understanding from Heterogeneous Cross-Task Datasets Zhichao Sheng et.al. 2507.12951 null
2025-07-17 Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine Anastasia Kuznetsova et.al. 2507.12701 null
2025-07-16 Improving Contextual ASR via Multi-grained Fusion with Large Language Models Shilin Zhou et.al. 2507.12252 null
2025-07-14 WhisperKit: On-device Real-time ASR with Billion-Scale Transformers Atila Orhon et.al. 2507.10860 null
2025-07-20 Supporting SENCOTEN Language Documentation Efforts with Automatic Speech Recognition Mengzhe Geng et.al. 2507.10827 null
2025-07-14 DQLoRA: A Lightweight Domain-Aware Denoising ASR via Adapter-guided Distillation Yiru Yang et.al. 2507.10313 null
2025-07-13 The DKU System for Multi-Speaker Automatic Speech Recognition in MLC-SLM Challenge Yuke Lin et.al. 2507.09499 null
2025-07-12 Can We Really Repurpose Multi-Speaker ASR Corpus for Speaker Diarization? Shota Horiguchi et.al. 2507.09226 null
2025-07-22 Mixture of LoRA Experts with Multi-Modal and Multi-Granularity LLM Generative Error Correction for Accented Speech Recognition Bingshen Mu et.al. 2507.09116 null
2025-07-06 A Hybrid Machine Learning Framework for Optimizing Crop Selection via Agronomic and Economic Forecasting Niranjan Mallikarjun Sindhur et.al. 2507.08832 null
2025-07-11 The Impact of Automatic Speech Transcription on Speaker Attribution Cristina Aggazzotti et.al. 2507.08660 null
2025-07-11 ILT-Iterative LoRA Training through Focus-Feedback-Fix for Multilingual Speech Recognition Qingliang Meng et.al. 2507.08477 null
2025-07-10 DARAS: Dynamic Audio-Room Acoustic Synthesis for Blind Room Impulse Response Estimation Chunxi Wang et.al. 2507.08135 null
2025-07-10 Modèle physique variationnel pour l'estimation de réponses impulsionnelles de salles Louis Lalay et.al. 2507.08051 null
2025-07-10 Edge-ASR: Towards Low-Bit Quantization of Automatic Speech Recognition Models Chen Feng et.al. 2507.07877 null
2025-07-10 Code-Switching in End-to-End Automatic Speech Recognition: A Systematic Literature Review Maha Tufail Agro et.al. 2507.07741 null
2025-07-08 Deep Feed-Forward Neural Network for Bangla Isolated Speech Recognition Dipayan Bhadra et.al. 2507.07068 null
2025-07-04 Pronunciation-Lexicon Free Training for Phoneme-based Crosslingual ASR via Joint Stochastic Approximation Saierdaer Yusuyin et.al. 2507.06249 null
2025-07-21 VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis Alexandre Symeonidis-Herzig et.al. 2507.06060 null
2025-07-08 How to Evaluate Automatic Speech Recognition: Comparing Different Performance and Bias Measures Tanvina Patel et.al. 2507.05885 null
2025-07-08 ContextASR-Bench: A Massive Contextual Speech Recognition Benchmark He Wang et.al. 2507.05727 null
2025-11-06 Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition Zijin Gu et.al. 2507.05724 null
2025-07-07 Adaptive Slimming for Scalable and Efficient Speech Enhancement Riccardo Miccini et.al. 2507.04879 null
2025-07-08 SHNU Multilingual Conversational Speech Recognition System for INTERSPEECH 2025 MLC-SLM Challenge Yuxiang Mei et.al. 2507.03343 null
2025-06-26 A Unified Speech LLM for Diarization and Speech Recognition in Multilingual Conversations Phurich Saengthong et.al. 2507.02927 null
2025-07-03 Open-Source System for Multilingual Translation and Cloned Speech Synthesis Mateo Cámara et.al. 2507.02530 null
2025-07-03 A Cookbook for Community-driven Data Collection of Impaired Speech in LowResource Languages Sumaya Ahmed Salihs et.al. 2507.02428 null
2025-07-03 Benchmarking Akan ASR Models Across Domain-Specific Datasets: A Comparative Evaluation of Performance, Scalability, and Adaptability Mark Atta Mensah et.al. 2507.02407 null
2025-07-02 Adaptability of ASR Models on Low-Resource Language: A Comparative Study of Whisper and Wav2Vec-BERT on Bangla Md Sazzadul Islam Ridoy et.al. 2507.01931 null
2025-07-02 First Steps Towards Voice Anonymization for Code-Switching Speech Sarina Meyer et.al. 2507.01765 null
2025-07-02 PERTINENCE: Input-based Opportunistic Neural Network Dynamic Execution Omkar Shende et.al. 2507.01695 null
2025-07-02 Learning from Random Subspace Exploration: Generalized Test-Time Augmentation with Self-supervised Distillation Andrei Jelea et.al. 2507.01347 null
2025-07-02 AI Meets Maritime Training: Precision Analytics for Enhanced Safety and Performance Vishakha Lall et.al. 2507.01274 null
2025-06-16 Hello Afrika: Speech Commands in Kinyarwanda George Igwegbe et.al. 2507.01024 null
2025-07-01 MambAttention: Mamba with Multi-Head Attention for Generalizable Single-Channel Speech Enhancement Nikolai Lund Kühne et.al. 2507.00966 null
2025-07-01 Rectifying Magnitude Neglect in Linear Attention Qihang Fan et.al. 2507.00698 null
2025-07-01 Audio-3DVG: Unified Audio - Point Cloud Fusion for 3D Visual Grounding Duc Cao-Dinh et.al. 2507.00669 null
2025-06-29 Research on Comprehensive Classroom Evaluation System Based on Multiple AI Models Cong Xie et.al. 2506.23079 null
2025-06-28 Mind the Gap: Entity-Preserved Context-Aware ASR Structured Transcriptions Duygu Altinok et.al. 2506.22858 null
2025-06-28 Boosting CTC-Based ASR Using LLM-Based Intermediate Loss Regularization Duygu Altinok et.al. 2506.22846 null
2025-06-28 A Self-Training Approach for Whisper to Enhance Long Dysarthric Speech Recognition Shiyao Wang et.al. 2506.22810 null
2025-06-27 Speaker Targeting via Self-Speaker Adaptation for Multi-talker ASR Weiqing Wang et.al. 2506.22646 null
2025-06-27 Cross-lingual Data Selection Using Clip-level Acoustic Similarity for Enhancing Low-resource Automatic Speech Recognition Shunsuke Mitsumori et.al. 2506.22194 null
2025-06-27 SAGE: Spliced-Audio Generated Data for Enhancing Foundational Models in Low-Resource Arabic-English Code-Switched Speech Recognition Muhammad Umar Farooq et.al. 2506.22143 null
2025-06-27 Analyzing and Fine-Tuning Whisper Models for Multilingual Pilot Speech Transcription in the Cockpit Kartheek Kumar Reddy Nareddy et.al. 2506.21990 null
2025-06-23 Adapting Foundation Speech Recognition Models to Impaired Speech: A Semantic Re-chaining Approach for Personalization of German Speech Niclas Pokel et.al. 2506.21622 null
2025-06-16 Language-Aware Prompt Tuning for Parameter-Efficient Seamless Language Expansion in Multilingual ASR Hongli Yang et.al. 2506.21577 null
2025-06-16 Adapting Whisper for Parameter-efficient Code-Switching Speech Recognition via Soft Prompt Tuning Hongli Yang et.al. 2506.21576 null
2025-06-12 FormosanBench: Benchmarking Low-Resource Austronesian Languages in the Era of Large Language Models Kaiying Kevin Lin et.al. 2506.21563 null
2025-06-11 Efficient Multilingual ASR Finetuning via LoRA Language Experts Jiahong Li et.al. 2506.21555 null
2025-06-25 Multimodal Representation Learning and Fusion Qihang Jin et.al. 2506.20494 null
2025-06-25 Lightweight Target-Speaker-Based Overlap Transcription for Practical Streaming ASR Aleš Pražák et.al. 2506.20288 null
2025-06-24 Accurate, fast, cheap: Choose three. Replacing Multi-Head-Attention with Bidirectional Recurrent Attention for Long-Form ASR Martin Ratajczak et.al. 2506.19761 null
2025-06-23 Context Biasing for Pronunciations-Orthography Mismatch in Automatic Speech Recognition Christian Huber et.al. 2506.18703 null
2025-06-23 Evaluating Multichannel Speech Enhancement Algorithms at the Phoneme Scale Across Genders Nasser-Eddine Monir et.al. 2506.18691 null
2025-06-23 End-to-End Spoken Grammatical Error Correction Mengjie Qian et.al. 2506.18532 null
2025-06-28 AI-Generated Song Detection via Lyrics Transcripts Markus Frohmann et.al. 2506.18488 null
2025-06-22 Splitformer: An improved early-exit architecture for automatic speech recognition on edge devices Maxence Lasbordes et.al. 2506.18035 null
2025-06-21 OpusLM: A Family of Open Unified Speech Language Models Jinchuan Tian et.al. 2506.17611 null
2025-06-27 Data Quality Issues in Multilingual Speech Datasets: The Need for Sociolinguistic Awareness and Proactive Language Planning Mingfei Lau et.al. 2506.17525 null
2025-06-20 Breaking the Transcription Bottleneck: Fine-tuning ASR Models for Extremely Low-Resource Fieldwork Languages Siyu Liang et.al. 2506.17459 null
2025-06-20 Simultaneous Translation with Offline Speech and LLM Models in CUNI Submission to IWSLT 2025 Dominik Macháček et.al. 2506.17077 link
2025-06-20 Instituto de Telecomunicações at IWSLT 2025: Aligning Small-Scale Speech and Language Models for Speech-to-Text Learning Giuseppe Attanasio et.al. 2506.17019 link
2025-06-27 State-Space Models in Efficient Whispered and Multi-dialect Speech Recognition Aref Farhadipour et.al. 2506.16969 null
2025-06-20 LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization Daejin Jo et.al. 2506.16738 null
2025-06-19 Weight Factorization and Centralization for Continual Learning in Speech Recognition Enes Yavuz Ugan et.al. 2506.16574 null
2025-06-19 Automatic Speech Recognition Biases in Newcastle English: an Error Analysis Dana Serditova et.al. 2506.16558 null
2025-06-18 Exploiting Music Source Separation for Automatic Lyrics Transcription with Whisper Jaza Syed et.al. 2506.15514 null
2025-06-18 Foundation of Affective Computing and Interaction Changzeng Fu et.al. 2506.15497 null
2025-06-17 Thinking in Directivity: Speech Large Language Model for Multi-Talker Directional Speech Recognition Jiamin Xie et.al. 2506.14973 null
2025-06-17 Unifying Streaming and Non-streaming Zipformer-based ASR Bidisha Sharma et.al. 2506.14434 null
2025-06-17 Improving Practical Aspects of End-to-End Multi-Talker Speech Recognition for Online and Offline Scenarios Aswin Shanmugam Subramanian et.al. 2506.14204 null
2025-06-17 AsyncSwitch: Asynchronous Text-Speech Adaptation for Code-Switched ASR Tuan Nguyen et.al. 2506.14190 null
2025-06-16 A Silent Speech Decoding System from EEG and EMG with Heterogenous Electrode Configurations Masakazu Inoue et.al. 2506.13835 null
2025-07-07 Qwen vs. Gemma Integration with Whisper: A Comparative Study in Multilingual SpeechLLM Systems Tuan Nguyen et.al. 2506.13596 null
2025-06-16 BUT System for the MLC-SLM Challenge Alexander Polok et.al. 2506.13414 null
2025-07-04 Bi-directional Context-Enhanced Speech Large Language Models for Multilingual Conversational ASR Yizhou Peng et.al. 2506.13396 null
2025-07-04 NTU Speechlab LLM-Based Multilingual ASR System for Interspeech MLC-SLM Challenge 2025 Yizhou Peng et.al. 2506.13339 null
2025-06-18 Seewo's Submission to MLC-SLM: Lessons learned from Speech Reasoning Language Models Bo Li et.al. 2506.13300 null
2025-06-15 SC-SOT: Conditioning the Decoder on Diarized Speaker Information for End-to-End Overlapped Speech Recognition Yuta Hirano et.al. 2506.12672 null
2025-06-13 Adapting Whisper for Streaming Speech Recognition via Two-Pass Decoding Haoran Zhou et.al. 2506.12154 null
2025-05-31 CMT-LLM: Contextual Multi-Talker ASR Utilizing Large Language Models Jiajun He et.al. 2506.12059 null
2025-06-13 Enabling automatic transcription of child-centered audio recordings from real-world environments Daniil Kocharov et.al. 2506.11747 null
2025-06-13 Lightweight and Robust Multi-Channel End-to-End Speech Recognition with Spherical Harmonic Transform Xiangzhu Kong et.al. 2506.11630 null
2025-06-13 (SimPhon Speech Test): A Data-Driven Method for In Silico Design and Validation of a Phonetically Balanced Speech Test Stefan Bleeck et.al. 2506.11620 null
2025-06-13 Machine Unlearning for Robust DNNs: Attribution-Guided Partitioning and Neuron Pruning in Noisy Environments Deliang Jin et.al. 2506.11615 null
2025-06-12 Advances in Small-Footprint Keyword Spotting: A Comprehensive Review of Efficient Models and Algorithms Soumen Garai et.al. 2506.11169 link
2025-06-10 ASRJam: Human-Friendly AI Speech Jamming to Prevent Automated Phone Scams Freddie Grabovski et.al. 2506.11125 null
2025-06-09 Benchmarking Foundation Speech and Language Models for Alzheimer's Disease and Related Dementia Detection from Spontaneous Speech Jingyu Li et.al. 2506.11119 null
2025-06-05 Customizing Speech Recognition Model with Large Language Model Feedback Shaoshi Ling et.al. 2506.11091 null
2025-06-05 Better Pseudo-labeling with Multi-ASR Fusion and Error Correction by SpeechLLM Jeena Prakash et.al. 2506.11089 null
2025-06-04 Improving Child Speech Recognition and Reading Mistake Detection by Using Prompts Lingyun Gao et.al. 2506.11079 null
2025-06-02 Regularized Federated Learning for Privacy-Preserving Dysarthric and Elderly Speech Recognition Tao Zhong et.al. 2506.11069 null
2025-05-31 PMF-CEC: Phoneme-augmented Multimodal Fusion for Context-aware ASR Error Correction with Error-specific Selective Decoding Jiajun He et.al. 2506.11064 null
2025-06-12 Improving Named Entity Transcription with Contextual LLM-based Revision Viet Anh Trinh et.al. 2506.10779 null
2025-06-12 FairASR: Fair Audio Contrastive Learning for Automatic Speech Recognition Jongsuk Kim et.al. 2506.10747 null
2025-06-12 Joint ASR and Speaker Role Tagging with Serialized Output Training Anfeng Xu et.al. 2506.10349 null
2025-06-11 Regularizing Learnable Feature Extraction for Automatic Speech Recognition Peter Vieting et.al. 2506.09804 null
2025-06-11 OWSM-Biasing: Contextualizing Open Whisper-Style Speech Models for Automatic Speech Recognition with Dynamic Vocabulary Yui Sudo et.al. 2506.09448 null
2025-06-10 SimClass: A Classroom Speech Dataset Generated via Game Engine Simulation For Automatic Speech Recognition Research Ahmed Adel Attia et.al. 2506.09206 null
2025-07-11 Addressing Pitfalls in Auditing Practices of Automatic Speech Recognition Technologies: A Case Study of People with Aphasia Katelyn Xiaoying Mei et.al. 2506.08846 link
2025-06-09 Uncovering the Functional Roles of Nonlinearity in Memory Manuel Brenner et.al. 2506.07919 null
2025-06-09 Unified Semi-Supervised Pipeline for Automatic Speech Recognition Nune Tadevosyan et.al. 2506.07659 null
2025-06-09 Transcript-Prompted Whisper with Dictionary-Enhanced Decoding for Japanese Speech Annotation Rui Hu et.al. 2506.07646 null
2025-06-09 Speaker-Distinguishable CTC: Learning Speaker Distinction Using CTC for Multi-Talker Speech Recognition Asahi Sakuma et.al. 2506.07515 null
2025-06-09 DeRAGEC: Denoising Named Entity Candidates with Synthetic Rationale for ASR Error Correction Solee Im et.al. 2506.07510 null
2025-06-11 Towards Energy-Efficient and Low-Latency Voice-Controlled Smart Homes: A Proposal for Offline Speech Recognition and IoT Integration Peng Huang et.al. 2506.07494 null
2025-06-08 Speech Recognition on TV Series with Video-guided Post-Correction Haoyuan Yang et.al. 2506.07323 null
2025-06-08 Technical Report: A Practical Guide to Kaldi ASR Optimization Mengze Hong et.al. 2506.07149 null
2025-06-07 Automatic Speech Recognition of African American English: Lexical and Contextual Effects Hamid Mojarad et.al. 2506.06888 null
2025-06-07 Beyond Classification: Towards Speech Emotion Reasoning with Multitask AudioLLMs Wenyu Zhang et.al. 2506.06820 null
2025-06-07 A Survey of Retentive Network Haiqi Yang et.al. 2506.06708 null
2025-06-06 AS-ASR: A Lightweight Framework for Aphasia-Specific Automatic Speech Recognition Chen Bao et.al. 2506.06566 null
2025-06-13 Structured State Space Model Dynamics and Parametrization for Spiking Neural Networks Maxime Fabre et.al. 2506.06374 link
2025-06-06 Lightweight Prompt Biasing for Contextualized End-to-End ASR Systems Bo Ren et.al. 2506.06252 null
2025-06-06 Phonetically-Augmented Discriminative Rescoring for Voice Search Error Correction Christophe Van Gysel et.al. 2506.06117 null
2025-06-06 Diarization-Aware Multi-Speaker Automatic Speech Recognition via Large Language Models Yuke Lin et.al. 2506.05796 null
2025-06-06 Bridging the Modality Gap: Softly Discretizing Audio Representation for LLM-based Automatic Speech Recognition Mu Yang et.al. 2506.05706 null
2025-06-06 Low-Resource Domain Adaptation for Speech LLMs via Text-Only Fine-Tuning Yangui Fang et.al. 2506.05671 null
2025-06-03 Auto Review: Second Stage Error Detection for Highly Accurate Information Extraction from Phone Conversations Ayesha Qamar et.al. 2506.05400 null
2025-06-05 LLM-based phoneme-to-grapheme for phoneme-based speech recognition Te Ma et.al. 2506.04711 null
2025-06-05 ViCocktail: Automated Multi-Modal Data Collection for Vietnamese Audio-Visual Speech Recognition Thai-Binh Nguyen et.al. 2506.04635 null
2025-06-05 LESS: Large Language Model Enhanced Semi-Supervised Learning for Speech Foundational Models Wen Ding et.al. 2506.04586 null
2025-06-04 Effects of Speaker Count, Duration, and Accent Diversity on Zero-Shot Accent Robustness in Low-Resource ASR Zheng-Xin Yong et.al. 2506.04364 null
2025-06-04 MFLA: Monotonic Finite Look-ahead Attention for Streaming Speech Recognition Yinfeng Xia et.al. 2506.03722 null
2025-06-03 A Multi-Dialectal Dataset for German Dialect ASR and Dialect-to-Standard Speech Translation Verena Blaschke et.al. 2506.02894 null
2025-06-03 Overcoming Data Scarcity in Multi-Dialectal Arabic ASR via Whisper Fine-Tuning Ömer Tarik Özyilmaz et.al. 2506.02627 null
2025-06-03 On the Language and Gender Biases in PSTN, VoIP and Neural Audio Codecs Kemal Altwlkany et.al. 2506.02545 null
2025-06-03 SOVA-Bench: Benchmarking the Speech Conversation Ability for LLM-based Voice Assistant Yixuan Hou et.al. 2506.02457 null
2025-06-03 Enhancing Lyrics Transcription on Music Mixtures with Consistency Loss Jiawen Huang et.al. 2506.02339 null
2025-06-02 Cocktail-Party Audio-Visual Speech Recognition Thai-Binh Nguyen et.al. 2506.02178 null
2025-06-02 HENT-SRT: Hierarchical Efficient Neural Transducer with Self-Distillation for Joint Speech Recognition and Translation Amir Hussein et.al. 2506.02157 null
2025-06-01 Enhancing Speech Instruction Understanding and Disambiguation in Robotics via Speech Prosody David Sasu et.al. 2506.02057 null
2025-05-31 No Audiogram: Leveraging Existing Scores for Personalized Speech Intelligibility Prediction Haoshuai Zhou et.al. 2506.02039 null
2025-05-27 Leveraging Large Language Models in Visual Speech Recognition: Model Scaling, Context-Aware Decoding, and Iterative Polishing Zehua Liu et.al. 2506.02012 null
2025-05-27 CNVSRC 2024: The Second Chinese Continuous Visual Speech Recognition Challenge Zehua Liu et.al. 2506.02010 null
2025-06-02 DNCASR: End-to-End Training for Speaker-Attributed ASR Xianrui Zheng et.al. 2506.01916 null
2025-06-02 Reasoning-Based Approach with Chain-of-Thought for Alzheimer's Detection Using Speech and Large Language Models Chanwoo Park et.al. 2506.01683 null
2025-06-02 Self-Supervised Speech Quality Assessment (S3QA): Leveraging Speech Foundation Models for a Scalable Speech Quality Metric Mattson Ogg et.al. 2506.01655 null
2025-06-02 Riemannian Time Warping: Multiple Sequence Alignment in Curved Spaces Julian Richter et.al. 2506.01635 null
2025-06-02 Unsupervised Rhythm and Voice Conversion to Improve ASR on Dysarthric Speech Karl El Hajal et.al. 2506.01618 null
2025-06-02 Analyzing the Importance of Blank for CTC-Based Knowledge Distillation Benedikt Hilmes et.al. 2506.01503 null
2025-06-02 TalTech Systems for the Interspeech 2025 ML-SUPERB 2.0 Challenge Tanel Alumäe et.al. 2506.01458 null
2025-06-02 Whale: Large-Scale multilingual ASR model with w2v-BERT and E-Branchformer with large speech data Yosuke Kashiwagi et.al. 2506.01439 null
2025-06-02 Speech-to-Speech Translation Pipelines for Conversations in Low-Resource Languages Andrei Popescu-Belis et.al. 2506.01406 null
2025-06-02 CleanS2S: Single-file Framework for Proactive Speech-to-Speech Interaction Yudong Lu et.al. 2506.01268 null
2025-06-02 WCTC-Biasing: Retraining-free Contextual Biasing ASR with Wildcard CTC-based Keyword Spotting and Inter-layer Biasing Yu Nakagome et.al. 2506.01263 null
2025-06-01 GigaAM: Efficient Self-Supervised Learner for Speech Recognition Aleksandr Kutsakov et.al. 2506.01192 link
2025-06-01 What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training Marianne de Heer Kloots et.al. 2506.00981 link
2025-06-01 Fine-Tuning ASR for Stuttered Speech: Personalized vs. Generalized Approaches Dena Mujtaba et.al. 2506.00853 null
2025-05-31 Chain-of-Thought Training for Open E2E Spoken Dialogue Systems Siddhant Arora et.al. 2506.00722 null
2025-05-31 Towards Temporally Explainable Dysarthric Speech Clarity Assessment Seohyun Park et.al. 2506.00454 link
2025-05-31 DYNAC: Dynamic Vocabulary based Non-Autoregressive Contextualization for Speech Recognition Yui Sudo et.al. 2506.00422 null
2025-05-31 Causal Structure Discovery for Error Diagnostics of Children's ASR Vishwanath Pratap Singh et.al. 2506.00402 null
2025-05-30 Can LLMs Understand Unvoiced Speech? Exploring EMG-to-Text Conversion with LLMs Payal Mohapatra et.al. 2506.00304 null
2025-05-30 Vedavani: A Benchmark Corpus for ASR on Vedic Sanskrit Poetry Sujeet Kumar et.al. 2506.00145 null
2025-05-30 SwitchLingua: The First Large-Scale Multilingual and Multi-Ethnic Code-Switching Dataset Peng Xie et.al. 2506.00087 null
2025-05-30 Running Conventional Automatic Speech Recognition on Memristor Hardware: A Simulated Approach Nick Rossenbach et.al. 2505.24721 null
2025-06-02 MSDA: Combining Pseudo-labeling and Self-Supervision for Unsupervised Domain Adaptation in ASR Dimitrios Damianos et.al. 2505.24656 null
2025-05-30 SuPseudo: A Pseudo-supervised Learning Method for Neural Speech Enhancement in Far-field Speech Recognition Longjie Luo et.al. 2505.24450 null
2025-05-30 Pseudo Labels-based Neural Speech Enhancement for the AVSR Task in the MISP-Meeting Challenge Longjie Luo et.al. 2505.24446 null
2025-06-05 Fewer Hallucinations, More Verification: A Three-Stage LLM-Based Framework for ASR Error Correction Yangui Fang et.al. 2505.24347 null
2025-05-30 Dynamic Context-Aware Streaming Pretrained Language Model For Inverse Text Normalization Luong Ho et.al. 2505.24229 null
2025-05-30 MOPSA: Mixture of Prompt-Experts Based Speaker Adaptation for Elderly Speech Recognition Chengxi Deng et.al. 2505.24224 null
2025-06-03 Improving Multilingual Speech Models on ML-SUPERB 2.0: Fine-tuning with Data Augmentation and LID-Aware CTC Qingzheng Wang et.al. 2505.24200 null
2025-05-29 BeaverTalk: Oregon State University's IWSLT 2025 Simultaneous Speech Translation System Matthew Raffel et.al. 2505.24016 link
2025-05-29 Prompting Whisper for Improved Verbatim Transcription and End-to-end Miscue Detection Griffin Dietz Smith et.al. 2505.23627 null
2025-05-29 Contextualized Automatic Speech Recognition with Dynamic Vocabulary Prediction and Activation Zhennan Lin et.al. 2505.23077 null
2025-05-29 AISHELL-5: The First Open-Source In-Car Multi-Channel Multi-Speaker Speech Dataset for Automatic Speech Diarization and Recognition Yuhang Dai et.al. 2505.23036 link
2025-05-28 NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding Vladimir Bataev et.al. 2505.22857 null
2025-06-05 Evaluation of LLMs in Speech is Often Flawed: Test Set Contamination in Large Language Models for Speech Recognition Yuan Tseng et.al. 2505.22251 null
2025-05-28 Advancing Hearing Assessment: An ASR-Based Frequency-Specific Speech Test for Diagnosing Presbycusis Stefan Bleeck et.al. 2505.22231 null
2025-05-28 On-the-fly Routing for Zero-shot MoE Speaker Adaptation of Speech Foundation Models for Dysarthric Speech Recognition Shujie HU et.al. 2505.22072 null
2025-05-28 Weakly Supervised Data Refinement and Flexible Sequence Compression for Efficient Thai LLM-based ASR Mingchen Shao et.al. 2505.22063 null
2025-05-28 Overlap-Adaptive Hybrid Speaker Diarization and ASR-Aware Observation Addition for MISP 2025 Challenge Shangkun Huang et.al. 2505.22013 null
2025-05-28 Leveraging LLM for Stuttering Speech: A Unified Architecture Bridging Recognition and Event Detection Shangkun Huang et.al. 2505.22005 null
2025-05-27 GMU Systems for the IWSLT 2025 Low-Resource Speech Translation Shared Task Chutong Meng et.al. 2505.21781 null
2025-05-27 Loquacious Set: 25,000 Hours of Transcribed and Diverse English Speech Recognition Data for Research and Commercial Use Titouan Parcollet et.al. 2505.21578 null
2025-05-25 WhisperD: Dementia Speech Recognition and Filler Word Detection with Whisper Emmanuel Akinrintoyo et.al. 2505.21551 null
2025-05-29 VietASR: Achieving Industry-level Vietnamese ASR with 50-hour labeled data and Large-Scale Speech Pretraining Jianheng Zhuo et.al. 2505.21527 null
2025-05-27 Towards One-bit ASR: Extremely Low-bit Conformer Quantization Using Co-training and Stochastic Precision Zhaoqing Li et.al. 2505.21245 null
2025-05-27 PSRB: A Comprehensive Benchmark for Evaluating Persian ASR Systems Nima Sedghiyeh et.al. 2505.21230 null
2025-05-27 Topological Deep Learning for Speech Data Zhiwang Yu et.al. 2505.21173 null
2025-05-27 Leveraging LLM and Self-Supervised Training Models for Speech Recognition in Chinese Dialects: A Comparative Analysis Tianyi Xu et.al. 2505.21138 null
2025-05-27 Towards Pretraining Robust ASR Foundation Model with Acoustic-Aware Data Augmentation Dancheng Liu et.al. 2505.20606 null
2025-05-30 The NaijaVoices Dataset: Cultivating Large-Scale, High-Quality, Culturally-Rich Speech Data for African Languages Chris Emezue et.al. 2505.20564 null
2025-05-26 Robust fine-tuning of speech recognition models via model merging: application to disordered speech Alexandre Ducorroy et.al. 2505.20477 null
2025-06-05 In-context Language Learning for Endangered Languages in Speech Recognition Zhaolin Li et.al. 2505.20445 null
2025-05-26 Continuous Learning for Children's ASR: Overcoming Catastrophic Forgetting with Elastic Weight Consolidation and Synaptic Intelligence Edem Ahadzi et.al. 2505.20216 null
2025-05-26 Exploring Generative Error Correction for Dysarthric Speech Recognition Moreno La Quatra et.al. 2505.20163 link
2025-05-26 Mixture of LoRA Experts for Low-Resourced Multi-Accent Automatic Speech Recognition Raphaël Bagat et.al. 2505.20006 null
2025-05-26 Novel Loss-Enhanced Universal Adversarial Patches for Sustainable Speaker Privacy Elvir Karimov et.al. 2505.19951 null
2025-05-26 KIT's Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization Zhaolin Li et.al. 2505.19679 null
2025-05-26 Languages in Multilingual Speech Foundation Models Align Both Phonetically and Semantically Ryan Soh-Eun Shim et.al. 2505.19606 null
2025-05-26 Beyond Manual Transcripts: The Potential of Automated Speech Recognition Errors in Improving Alzheimer's Disease Detection Yin-Long Liu et.al. 2505.19448 null
2025-05-25 BR-ASR: Efficient and Scalable Bias Retrieval Framework for Contextual Biasing ASR in Speech LLM Xun Gong et.al. 2505.19179 null
2025-05-24 Building a Functional Machine Translation Corpus for Kpelle Kweku Andoh Yamoah et.al. 2505.18905 null
2025-05-24 StandUp4AI: A New Multilingual Dataset for Humor Detection in Stand-up Comedy Videos Valentin Barriere et.al. 2505.18903 null
2025-05-24 CHSER: A Dataset and Case Study on Generative Speech Error Correction for Child ASR Natarajan Balaji Shankar et.al. 2505.18463 link
2025-05-23 Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities Ziwei Zhou et.al. 2505.17862 link
2025-05-27 CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training Zhihao Du et.al. 2505.17589 null
2025-05-23 Swedish Whispers; Leveraging a Massive Speech Corpus for Swedish Speech Recognition Leonora Vesterbacka et.al. 2505.17538 null
2025-05-23 Speechless: Speech Instruction Training Without Speech for Low Resource Languages Alan Dao et.al. 2505.17417 link
2025-05-23 LLM-based Generative Error Correction for Rare Words with Synthetic Data and Phonetic Context Natsuo Yamashita et.al. 2505.17410 link
2025-06-02 An End-to-End Approach for Child Reading Assessment in the Xhosa Language Sergio Chevtchenko et.al. 2505.17371 null
2025-05-20 From Weak Labels to Strong Results: Utilizing 5,000 Hours of Noisy Classroom Transcripts with Minimal Accurate Data Ahmed Adel Attia et.al. 2505.17088 null
2025-05-30 Impact of Frame Rates on Speech Tokenizer: A Case Study on Mandarin and English Haoyang Zhang et.al. 2505.17076 null
2025-05-28 An Effective Training Framework for Light-Weight Automatic Speech Recognition Models Abdul Hannan et.al. 2505.16991 null
2025-05-22 From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition Tianduo Wang et.al. 2505.16972 link
2025-05-22 SoccerChat: Integrating Multimodal Data for Enhanced Soccer Game Understanding Sushant Gautam et.al. 2505.16630 null
2025-05-27 X-ARES: A Comprehensive Framework for Assessing Audio Encoder Performance Junbo Zhang et.al. 2505.16369 link
2025-05-24 Large Language Models based ASR Error Correction for Child Conversations Anfeng Xu et.al. 2505.16212 null
2025-05-22 Differentiable K-means for Fully-optimized Discrete Token-based ASR Kentaro Onda et.al. 2505.16207 null
2025-05-22 Prosodically Enhanced Foreign Accent Simulation by Discrete Token-based Resynthesis Only with Native Speech Corpora Kentaro Onda et.al. 2505.16191 null
2025-05-22 Selective Invocation for Multilingual ASR: A Cost-effective Approach Adapting to Speech Recognition Difficulty Hongfei Xue et.al. 2505.16168 null
2025-05-21 Word Level Timestamp Generation for Automatic Speech Recognition and Translation Ke Hu et.al. 2505.15646 link
2025-05-20 In-Context Learning Boosts Speech Recognition via Human-like Adaptation to Speakers and Language Varieties Nathan Roll et.al. 2505.14887 null
2025-05-30 Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages Chin-Jou Li et.al. 2505.14874 link
2025-05-20 Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits Tiantian Feng et.al. 2505.14648 link
2025-05-20 Dual Precision Quantization for Efficient and Accurate Deep Neural Networks Inference Tomer Gafni et.al. 2505.14638 link
2025-05-20 PersonaTAB: Predicting Personality Traits using Textual, Acoustic, and Behavioral Cues in Fully-Duplex Speech Dialogs Sho Inoue et.al. 2505.14356 link
2025-05-21 Scaling and Enhancing LLM-based AVSR: A Sparse Mixture of Projectors Approach Umberto Cappellazzo et.al. 2505.14336 null
2025-05-23 HausaNLP: Current Status, Challenges and Future Directions for Hausa Natural Language Processing Shamsuddeen Hassan Muhammad et.al. 2505.14311 null
2025-05-27 The Multimodal Information Based Speech Processing (MISP) 2025 Challenge: Audio-Visual Diarization and Recognition Ming Gao et.al. 2505.13971 null
2025-08-12 Transfer Learning from Visual Speech Recognition to Mouthing Recognition in German Sign Language Dinh Nam Pham et.al. 2505.13784 null
2025-05-21 Multi-head Temporal Latent Attention Keqi Deng et.al. 2505.13544 link
2025-05-21 Granary: Speech Recognition and Translation Dataset in 25 European Languages Nithin Rao Koluguri et.al. 2505.13404 null
2025-05-19 Cross-modal Knowledge Transfer Learning as Graph Matching Based on Optimal Transport for ASR Xugang Lu et.al. 2505.13079 null
2025-05-19 KIT's Offline Speech Translation and Instruction Following Submission for IWSLT 2025 Sai Koneru et.al. 2505.13036 null
2025-05-19 Personalized Fine-Tuning with Controllable Synthetic Speech from LLM-Generated Transcripts for Dysarthric Speech Recognition Dominik Wagner et.al. 2505.12991 null
2025-05-19 Calm-Whisper: Reduce Whisper Hallucination On Non-Speech By Calming Crazy Heads Down Yingzhi Wang et.al. 2505.12969 null
2025-05-16 Automatic Speech Recognition for African Low-Resource Languages: Challenges and Future Directions Sukairaj Hafiz Imam et.al. 2505.11690 null
2025-05-16 ASR-FAIRBENCH: Measuring and Benchmarking Equity Across Speech Recognition Systems Anand Rai et.al. 2505.11572 null
2025-05-26 LipDiffuser: Lip-to-Speech Generation with Conditional Diffusion Models Danilo de Oliveira et.al. 2505.11391 null
2025-05-16 LegoSLM: Connecting LLM with Speech Encoder using CTC Posteriors Rao Ma et.al. 2505.11352 null
2025-05-16 Survey of End-to-End Multi-Speaker Automatic Speech Recognition for Monaural Audio Xinlu He et.al. 2505.10975 null
2025-05-27 Multi-Stage Speaker Diarization for Noisy Classrooms Ali Sartaz Khan et.al. 2505.10879 link
2025-05-15 Inclusivity of AI Speech in Healthcare: A Decade Look Back Retno Larasati et.al. 2505.10596 null
2025-05-15 Quantized Approximate Signal Processing (QASP): Towards Homomorphic Encryption for audio Tu Duyen Nguyen et.al. 2505.10500 null
2025-05-12 Full simulation on the dynamics of auditory synaptic fusion: Strong clustering of calcium channel might be the origin of the coherent release in the auditory hair cells Jaeyun Yoo et.al. 2505.07273 null
2025-05-09 Remote Rowhammer Attack using Adversarial Observations on Federated Learning Clients Jinsheng Yuan et.al. 2505.06335 null
2025-05-08 Teochew-Wild: The First In-the-wild Teochew Dataset with Orthographic Annotations Linrong Pan et.al. 2505.05056 null
2025-05-07 SwinLip: An Efficient Visual Speech Encoder for Lip Reading Using Swin Transformer Young-Hu Park et.al. 2505.04394 null
2025-05-09 Robust Speech Recognition with Schrödinger Bridge-Based Speech Enhancement Rauf Nasretdinov et.al. 2505.04237 null
2025-05-06 VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model Zuwei Long et.al. 2505.03739 link
2025-05-06 Fairness of Automatic Speech Recognition in Cleft Lip and Palate Speech Susmita Bhattacharjee et.al. 2505.03697 null
2025-05-26 SepALM: Audio Language Models Are Error Correctors for Robust Speech Separation Zhaoxi Mu et.al. 2505.03273 null
2025-05-15 CoGenAV: Versatile Audio-Visual Representation Learning via Contrastive-Generative Synchronization Detao Bai et.al. 2505.03186 link
2025-05-05 Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play Yemin Shi et.al. 2505.02707 link
2025-05-08 Transforming faces into video stories -- VideoFace2.0 Branko Brkljač et.al. 2505.02060 link
2025-05-06 A Synergistic Framework of Nonlinear Acoustic Computing and Reinforcement Learning for Real-World Human-Robot Interaction Xiaoliang Chen et.al. 2505.01998 null
2025-05-02 Transfer Learning-Based Deep Residual Learning for Speech Recognition in Clean and Noisy Environments Noussaiba Djeffal et.al. 2505.01632 null
2025-05-01 Scaling On-Device GPU Inference for Large Generative Models Jiuqiang Tang et.al. 2505.00232 null
2025-07-31 BERSting at the Screams: A Benchmark for Distanced, Emotional and Shouted Speech Recognition Paige Tuttösí et.al. 2505.00059 link
2025-04-30 Retrieval-Enhanced Few-Shot Prompting for Speech Event Extraction Máté Gedeon et.al. 2504.21372 null
2025-04-28 A Comprehensive Part-of-Speech Tagging to Standardize Central-Kurdish Language: A Research Guide for Kurdish Natural Language Processing Tasks Shadan Shukr Sabr et.al. 2504.19645 null
2025-04-25 Kimi-Audio Technical Report KimiTeam et.al. 2504.18425 link
2025-04-28 Augmenting Captions with Emotional Cues: An AR Interface for Real-Time Accessible Communication Sunday David Ubur et.al. 2504.17171 null
2025-04-22 TinyML for Speech Recognition Andrew Barovic et.al. 2504.16213 null
2025-04-22 LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale Joya Chen et.al. 2504.16030 null
2025-04-22 Development and evaluation of a deep learning algorithm for German word recognition from lip movements Dinh Nam Pham et.al. 2504.15792 null
2025-04-21 Chinese-LiPS: A Chinese audio-visual speech recognition dataset with Lip-reading and Presentation Slides Jinghua Zhao et.al. 2504.15066 null
2025-04-21 StableQuant: Layer Adaptive Post-Training Quantization for Speech Foundation Models Yeona Hong et.al. 2504.14915 null
2025-04-17 Acoustic to Articulatory Inversion of Speech; Data Driven Approaches, Challenges, Applications, and Future Scope Leena G Pillai et.al. 2504.13308 null
2025-05-04 Dysarthria Normalization via Local Lie Group Transformations for Robust ASR Mikhail Osipov et.al. 2504.12279 link
2025-04-03 Edge Intelligence for Wildlife Conservation: Real-Time Hornbill Call Classification Using TinyML Kong Ka Hing et.al. 2504.12272 null
2025-04-19 Advancing Arabic Speech Recognition Through Large-Scale Weakly Supervised Learning Mahmoud Salhab et.al. 2504.12254 null
2025-04-15 Real-Time Word-Level Temporal Segmentation in Streaming Speech Recognition Naoto Nishida et.al. 2504.10849 null
2025-04-25 Spatial Audio Processing with Large Language Model on Wearable Devices Ayushi Mishra et.al. 2504.08907 null
2025-04-10 From Speech to Summary: A Comprehensive Survey of Speech Summarization Fabian Retkowski et.al. 2504.08024 null
2025-04-09 Visual-Aware Speech Recognition for Noisy Scenarios Lakshmipathi Balaji et.al. 2504.07229 null
2025-04-09 RNN-Transducer-based Losses for Speech Recognition on Noisy Targets Vladimir Bataev et.al. 2504.06963 link
2025-04-07 DoCIA: An Online Document-Level Context Incorporation Agent for Speech Translation Xinglin Lyu et.al. 2504.05122 null
2025-04-06 Public speech recognition transcripts as a configuring parameter Damien Rudaz et.al. 2504.04488 null
2025-04-06 Selective Masking Adversarial Attack on Automatic Speech Recognition Systems Zheng Fang et.al. 2504.04394 null
2025-05-08 An Efficient GPU-based Implementation for Noise Robust Sound Source Localization Zirui Lin et.al. 2504.03373 null
2025-04-04 A Human Digital Twin Architecture for Knowledge-based Interactions and Context-Aware Conversations Abdul Mannan Mohammed et.al. 2504.03147 null
2025-03-26 Efficient First-Order Optimization on the Pareto Set for Multi-Objective Learning under Preference Guidance Lisha Chen et.al. 2504.02854 null
2025-04-03 LinTO Audio and Textual Datasets to Train and Evaluate Automatic Speech Recognition in Tunisian Arabic Dialect Hedi Naouara et.al. 2504.02604 null
2025-04-22 F5R-TTS: Improving Flow-Matching based Text-to-Speech with Group Relative Policy Optimization Xiaohui Sun et.al. 2504.02407 null
2025-04-02 Chain of Correction for Full-text Speech Recognition with Large Language Models Zhiyuan Tang et.al. 2504.01519 null
2025-04-01 Whispering Under the Eaves: Protecting User Privacy Against Commercial and LLM-powered Automatic Speech Recognition Systems Weifei Jin et.al. 2504.00858 link
2025-03-31 SVLA: A Unified Speech-Vision-Language Assistant with Multimodal Reasoning and Speech Generation Ngoc Dung Huynh et.al. 2503.24164 null
2025-04-02 TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection Zhiming Ma et.al. 2503.24115 link
2025-03-30 The Impact of Code-switched Synthetic Data Quality is Task Dependent: Insights from MT and ASR Injy Hamed et.al. 2503.23576 null
2025-03-30 Whisper-LM: Improving ASR Models with Language Models for Low-Resource Languages Xabier de Zuazo et.al. 2503.23542 link
2025-03-30 Scaling Auditory Cognition via Test-Time Compute in Audio Language Models Ting Dang et.al. 2503.23395 null
2025-04-25 Coverage-Guaranteed Speech Emotion Recognition via Calibrated Uncertainty-Adaptive Prediction Sets Zijun Jia et.al. 2503.22712 null
2025-03-13 Enhancing Aviation Communication Transcription: Fine-Tuning Distil-Whisper with LoRA Shokoufeh Mirzaei et.al. 2503.22692 null
2025-03-05 Qieemo: Speech Is All You Need in the Emotion Recognition in Conversations Jinming Chen et.al. 2503.22687 null
2025-03-11 Lend a Hand: Semi Training-Free Cued Speech Recognition via MLLM-Driven Hand Modeling for Barrier-free Communication Guanjie Huang et.al. 2503.21785 link
2025-03-27 VALLR: Visual ASR Language Model for Lip Reading Marshall Thomas et.al. 2503.21408 null
2025-03-27 A 71.2-$μ$W Speech Recognition Accelerator with Recurrent Spiking Neural Network Chih-Chyau Yang et.al. 2503.21337 null
2025-03-26 Improving Speech Recognition Accuracy Using Custom Language Models with the Vosk Toolkit Aniket Abhishek Soni et.al. 2503.21025 null
2025-03-26 FinAudio: A Benchmark for Audio Large Language Models in Financial Applications Yupeng Cao et.al. 2503.20990 null
2025-03-26 Dolphin: A Large-Scale Automatic Speech Recognition Model for Eastern Languages Yangyang Meng et.al. 2503.20212 link
2025-03-25 Contextual Metric Meta-Evaluation by Measuring Local Metric Accuracy Athiya Deviyani et.al. 2503.19828 null
2025-03-25 Boosting the Transferability of Audio Adversarial Examples with Acoustic Representation Optimization Weifei Jin et.al. 2503.19591 null
2025-03-25 Design of Seamless Multi-modal Interaction Framework for Intelligent Virtual Agents in Wearable Mixed Reality Environment Ghazanfar Ali et.al. 2503.19334 null
2025-05-13 From S4 to Mamba: A Comprehensive Survey on Structured State Space Models Shriyank Somvanshi et.al. 2503.18970 null
2025-03-28 Whispering in Amharic: Fine-tuning Whisper for Low-resource Language Dawit Ketema Gete et.al. 2503.18485 null
2025-03-23 Elevating Robust Multi-Talker ASR by Decoupling Speaker Separation and Speech Recognition Yufeng Yang et.al. 2503.17886 null
2025-03-21 Your voice is your voice: Supporting Self-expression through Speech Generation and LLMs in Augmented and Alternative Communication Yiwen Xu et.al. 2503.17479 null
2025-03-20 SeniorTalk: A Chinese Conversation Dataset with Rich Annotations for Super-Aged Seniors Yang Chen et.al. 2503.16578 null
2025-03-19 A Comprehensive Survey on Architectural Advances in Deep CNNs: Challenges, Applications, and Emerging Research Directions Saddam Hussain Khan et.al. 2503.16546 null
2025-02-27 ACE, Action and Control via Explanations: A Proposal for LLMs to Provide Human-Centered Explainability for Multimodal AI Assistants Elizabeth Anne Watkins et.al. 2503.16466 null
2025-03-19 Evaluating ASR Confidence Scores for Automated Error Detection in User-Assisted Correction Interfaces Korbinian Kuhn et.al. 2503.15124 null
2025-03-19 Communication Access Real-Time Translation Through Collaborative Correction of Automatic Speech Recognition Korbinian Kuhn et.al. 2503.15120 null
2025-03-07 A Causal Inference Approach for Quantifying Research Impact Keiichi Ochiai et.al. 2503.13485 null
2025-04-19 Halving transcription time: A fast, user-friendly and GDPR-compliant workflow to create AI-assisted transcripts for content analysis Jakob Sponholz et.al. 2503.13031 null
2025-03-04 CORDIC Is All You Need Omkar Kokane et.al. 2503.11685 null
2025-03-14 MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens Jeong Hun Yeo et.al. 2503.11315 link
2025-03-13 Whisper Speaker Identification: Leveraging Pre-Trained Multilingual Transformers for Robust Speaker Embeddings Jakaria Islam Emon et.al. 2503.10446 link
2025-03-14 Proceedings of the ISCA/ITG Workshop on Diversity in Large Speech and Language Models Sebastian Möller et.al. 2503.10298 null
2025-04-07 ValSub: Subsampling Validation Data to Mitigate Forgetting during ASR Personalization Haaris Mehmood et.al. 2503.09906 null
2025-03-12 Quantization for OpenAI's Whisper Models: A Comparative Analysis Allison Andreyev et.al. 2503.09905 link
2025-03-12 Everything Can Be Described in Words: A Simple Unified Multi-Modal Framework with Semantic and Temporal Alignment Xiaowei Bi et.al. 2503.09081 null
2025-03-11 An Exhaustive Evaluation of TTS- and VC-based Data Augmentation for ASR Sewade Ogun et.al. 2503.08954 null
2025-03-11 Prompt2LVideos: Exploring Prompts for Understanding Long-Form Multimodal Videos Soumya Shamarao Jahagirdar et.al. 2503.08335 null
2025-03-10 Building English ASR model with regional language support Purvi Agrawal et.al. 2503.07522 null
2025-03-30 Automatic Speech Recognition for Non-Native English: Accuracy and Disfluency Handling Michael McGuire et.al. 2503.06924 null
2025-03-09 Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs Umberto Cappellazzo et.al. 2503.06362 null
2025-03-08 Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations Jeong Hun Yeo et.al. 2503.06273 link
2025-03-08 A Noise-Robust Turn-Taking System for Real-World Dialogue Robots: A Field Experiment Koji Inoue et.al. 2503.06241 null
2025-03-06 From Voice to Safety: Language AI Powered Pilot-ATC Communication Understanding for Airport Surface Movement Collision Risk Assessment Yutian Pang et.al. 2503.04974 null
2025-03-04 Normalization through Fine-tuning: Understanding Wav2vec 2.0 Embeddings for Phonetic Analysis Yiming Wang et.al. 2503.04814 null
2025-03-03 Direct Speech to Speech Translation: A Review Mohammad Sarim et.al. 2503.04799 null
2025-03-06 Self-Supervised Models for Phoneme Recognition: Applications in Children's Speech for Reading Learning Lucas Block Medin et.al. 2503.04710 null
2025-03-07 Efficient Finetuning for Dimensional Speech Emotion Recognition in the Age of Transformers Aneesha Sampath et.al. 2503.03756 null
2025-03-03 Fine-Tuning Whisper for Inclusive Prosodic Stress Analysis Samuel S. Sohn et.al. 2503.02907 null
2025-03-04 Go Beyond Your Means: Unlearning with Per-Sample Gradient Orthogonalization Aviv Shamsian et.al. 2503.02312 null
2025-03-05 Pruning Deep Neural Networks via a Combination of the Marchenko-Pastur Distribution and Regularization Leonid Berlyand et.al. 2503.01922 null
2025-03-07 Nexus-O: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision Che Liu et.al. 2503.01879 null
2025-03-02 Unveiling Biases while Embracing Sustainability: Assessing the Dual Challenges of Automatic Speech Recognition Systems Ajinkya Kulkarni et.al. 2503.00907 null
2025-03-02 UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation Alexander H. Liu et.al. 2503.00733 null
2025-02-27 LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation Keisuke Kamahori et.al. 2502.20583 link
2025-02-27 Adapting Automatic Speech Recognition for Accented Air Traffic Control Communications Marcus Yu Zhe Wee et.al. 2502.20311 null
2025-02-27 CleanMel: Mel-Spectrogram Enhancement for Improving Both Speech Quality and ASR Nian Shao et.al. 2502.20040 null
2025-03-12 CS-Dialogue: A 104-Hour Dataset of Spontaneous Mandarin-English Code-Switching Dialogues for Speech Recognition Jiaming Zhou et.al. 2502.18913 null
2025-02-26 Exploring Gender Disparities in Automatic Speech Recognition Technology Hend ElGhazaly et.al. 2502.18434 null
2025-02-25 Silent Speech Sentence Recognition with Six-Axis Accelerometers using Conformer and CTC Algorithm Yudong Xie et.al. 2502.17829 null
2025-02-26 Low-Rank and Sparse Model Merging for Multi-Lingual Speech Recognition and Translation Qiuming Zhao et.al. 2502.17380 null
2025-02-25 Improving the Inclusivity of Dutch Speech Recognition by Fine-tuning Whisper on the JASMIN-CGN Corpus Golshid Shekoufandeh et.al. 2502.17284 link
2025-02-24 Balancing Speech Understanding and Generation Using Continual Pre-training for Codec-based Speech LLM Jiatong Shi et.al. 2502.16897 null
2025-02-22 Understanding Zero-shot Rare Word Recognition Improvements Through LLM Integration Haoxuan Wang et.al. 2502.16142 null
2025-02-21 The Esethu Framework: Reimagining Sustainable Dataset Governance and Curation for Low-Resource Languages Jenalea Rajab et.al. 2502.15916 null
2025-02-21 Retrieval-Augmented Speech Recognition Approach for Domain Challenges Peng Shen et.al. 2502.15264 null
2025-02-21 Enhancing Speech Large Language Models with Prompt-Aware Mixture of Audio Encoders Weiqiao Shan et.al. 2502.15178 null
2025-02-21 Improving Streaming Speech Recognition With Time-Shifted Contextual Attention And Dynamic Right Context Masking Khanh Le et.al. 2502.15158 null
2025-02-20 WavRAG: Audio-Integrated Retrieval Augmented Generation for Spoken Dialogue Models Yifu Chen et.al. 2502.14727 null
2025-02-20 SegAug: CTC-Aligned Segmented Augmentation For Robust RNN-Transducer Based Speech Recognition Khanh Le et.al. 2502.14685 null
2025-02-20 Moshi Moshi? A Model Selection Hijacking Adversarial Attack Riccardo Petrucci et.al. 2502.14586 null
2025-02-18 Gesture-Aware Zero-Shot Speech Recognition for Patients with Language Disorders Seungbae Kim et.al. 2502.13983 null
2025-02-18 Benchmarking Automatic Speech Recognition coupled LLM Modules for Medical Diagnostics Kabir Kumar et.al. 2502.13982 null
2025-02-19 Measuring the Effect of Transcription Noise on Downstream Language Understanding Tasks Ori Shapira et.al. 2502.13645 link
2025-02-21 VLAS: Vision-Language-Action Model With Speech Instructions For Customized Robot Manipulation Wei Zhao et.al. 2502.13508 link
2025-02-19 Adopting Whisper for Confidence Estimation Vaibhav Aggarwal et.al. 2502.13446 null
2025-02-18 Neuro-oscillatory models of cortical speech processing Olesia Dogonasheva et.al. 2502.12935 null
2025-02-18 Lost in Transcription, Found in Distribution Shift: Demystifying Hallucination in Speech Foundation Models Hanin Atwany et.al. 2502.12414 null
2025-02-18 On the Robust Approximation of ASR Metrics Abdul Waheed et.al. 2502.12408 null
2025-02-17 NaturalL2S: End-to-End High-quality Multispeaker Lip-to-Speech Synthesis with Differential Digital Signal Processing Yifan Liang et.al. 2502.12002 null
2025-02-17 Can you pass that tool?: Implications of Indirect Speech in Physical Human-Robot Collaboration Yan Zhang et.al. 2502.11720 null
2025-02-28 In Situ Optimization of an Optoelectronic Reservoir Computer with Digital Delayed Feedback Fyodor Morozko et.al. 2502.11126 null
2025-04-03 DuplexMamba: Enhancing Real-time Speech Conversations with Duplex and Streaming Capabilities Xiangyu Lu et.al. 2502.11123 link
2025-02-11 MoHAVE: Mixture of Hierarchical Audio-Visual Experts for Robust Speech Recognition Sungnyun Kim et.al. 2502.10447 null
2025-02-14 OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models William Chen et.al. 2502.10373 null
2025-02-14 MTLM: an Innovative Language Model Training Paradigm for ASR Qingliang Meng et.al. 2502.10058 null
2025-02-14 A Preliminary Exploration with GPT-4o Voice Mode Yu-Xiang Lin et.al. 2502.09940 null
2025-02-14 Microphone Array Geometry Independent Multi-Talker Distant ASR: NTT System for the DASR Task of the CHiME-8 Challenge Naoyuki Kamo et.al. 2502.09859 null
2025-02-13 Shortcut Learning Susceptibility in Vision Classifiers Pirzada Suhail et.al. 2502.09150 null
2025-02-13 Quantum Approaches for Dysphonia Assessment in Small Speech Datasets Ha Tran et.al. 2502.08968 null
2025-02-12 Causal Analysis of ASR Errors for Children: Quantifying the Impact of Physiological, Cognitive, and Extrinsic Factors Vishwanath Pratap Singh et.al. 2502.08587 null
2025-02-24 VINP: Variational Bayesian Inference with Neural Speech Prior for Joint ASR-Effective Speech Dereverberation and Blind RIR Identification Pengyu Wang et.al. 2502.07205 link
2025-02-16 A Comparative Study of ASR Implementations in Resource-Constrained Wireless Sensor Networks for Real-Time Voice Communication Inaam F. Qutaiba I. Ali et.al. 2502.06969 null
2025-02-19 Speech to Speech Translation with Translatotron: A State of the Art Review Jules R. Kala et.al. 2502.05980 null
2025-02-09 Audio-Visual Representation Learning via Knowledge Distillation from Speech Foundation Models Jing-Xuan Zhang et.al. 2502.05766 link
2025-02-07 Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance Shehzeen Hussain et.al. 2502.05236 null
2025-02-06 Aligner-Encoders: Self-Attention Transformers Can Be Self-Transducers Adam Stooke et.al. 2502.05232 null
2025-02-07 Evaluating Standard and Dialectal Frisian ASR: Multilingual Fine-tuning and Language Identification for Improved Low-resource Performance Reihaneh Amooie et.al. 2502.04883 null
2025-02-07 Lightweight Operations for Visual Speech Recognition Iason Ioannis Panagos et.al. 2502.04834 null
2025-02-06 Afrispeech-Dialog: A Benchmark Dataset for Spontaneous English Conversations in Healthcare and Beyond Mardhiyah Sanni et.al. 2502.03945 null
2025-02-06 Rule-Based Modeling of Low-Dimensional Data with PCA and Binary Particle Swarm Optimization (BPSO) in ANFIS Afnan Al-Ali et.al. 2502.03895 null
2025-02-05 Integrating automatic speech recognition into remote healthcare interpreting: A pilot study of its impact on interpreting quality Shiyi Tan et.al. 2502.03381 null
2025-02-05 Leveraging Broadcast Media Subtitle Transcripts for Automatic Speech Recognition and Subtitling Jakob Poncelet et.al. 2502.03212 link
2025-01-26 SEAL: Speech Embedding Alignment Learning for Speech Large Language Model with Retrieval-Augmented Generation Chunyu Sun et.al. 2502.02603 null
2025-03-05 CTC-DRO: Robust Optimization for Reducing Language Disparities in Speech Recognition Martijn Bartelds et.al. 2502.01777 null
2025-02-03 Adapter-Based Multi-Agent AVSR Extension for Pre-Trained ASR Models Christopher Simic et.al. 2502.01709 null
2025-01-29 Privacy-Preserving Edge Speech Understanding with Tiny Foundation Models Afsara Benazir et.al. 2502.01649 null
2025-02-03 A Differentiable Alignment Framework for Sequence-to-Sequence Modeling via Optimal Transport Yacouba Kaloga et.al. 2502.01588 null
2025-02-11 mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition Andrew Rouditchenko et.al. 2502.01547 link
2025-02-03 Gradient Norm-based Fine-Tuning for Backdoor Defense in Automatic Speech Recognition Nanjun Zhou et.al. 2502.01152 null
2025-02-01 Data-Driven Mispronunciation Pattern Discovery for Robust Speech Recognition Anna Seo Gyeong Choi et.al. 2502.00583 null
2025-02-17 Evaluation of End-to-End Continuous Spanish Lipreading in Different Data Conditions David Gimeno-Gómez et.al. 2502.00464 link
2025-02-04 Sagalee: an Open Source Automatic Speech Recognition Dataset for Oromo Language Turi Abu et.al. 2502.00421 link
2025-02-01 When End-to-End is Overkill: Rethinking Cascaded Speech-to-Text Translation Anna Min et.al. 2502.00377 null
2025-02-03 SELMA: A Speech-Enabled Language Model for Virtual Assistant Interactions Dominik Wagner et.al. 2501.19377 null
2025-01-31 Language Bias in Self-Supervised Learning For Automatic Speech Recognition Edward Storey et.al. 2501.19321 null
2025-02-03 DyPCL: Dynamic Phoneme-level Contrastive Learning for Dysarthric Speech Recognition Wonjun Lee et.al. 2501.19010 null
2025-01-29 Cross-lingual Embedding Clustering for Hierarchical Softmax in Low-Resource Multilingual Speech Recognition Zhengdong Yang et.al. 2501.17615 null
2025-01-28 RDMM: Fine-Tuned LLM Models for On-Device Robotic Decision Making with Enhanced Contextual Awareness in Specific Domains Shady Nasrat et.al. 2501.16899 link
2025-01-28 AVE Speech Dataset: A Comprehensive Benchmark for Multi-Modal Speech Recognition Integrating Audio, Visual, and Electromyographic Signals Dongliang Zhou et.al. 2501.16780 null
2025-01-28 SCDiar: a streaming diarization system based on speaker change detection and speech recognition Naijun Zheng et.al. 2501.16641 null
2025-01-27 Optimized Self-supervised Training with BEST-RQ for Speech Recognition Ilja Baumann et.al. 2501.16131 null
2025-01-27 Classification Error Bound for Low Bayes Error Conditions in Machine Learning Zijian Yang et.al. 2501.15977 null
2025-01-26 End-to-End Target Speaker Speech Recognition Using Context-Aware Attention Mechanisms for Challenging Enrollment Scenario Mohsen Ghane et.al. 2501.15466 null
2025-01-25 The Multicultural Medical Assistant: Can LLMs Improve Medical ASR Errors Across Borders? Ayo Adedeji et.al. 2501.15310 null
2025-01-25 Speech Translation Refinement using Large Language Models Huaixia Dou et.al. 2501.15090 link
2025-01-25 Robust Cross-Etiology and Speaker-Independent Dysarthric Speech Recognition Satwinder Singh et.al. 2501.14994 null
2025-02-07 Methods to Increase the Amount of Data for Speech Recognition for Low Resource Languages Alexan Ayrapetyan et.al. 2501.14788 null
2025-01-24 FireRedASR: Open-Source Industrial-Grade Mandarin Speech Recognition Models from Encoder-Decoder to LLM Integration Kai-Tuo Xu et.al. 2501.14350 link
2025-01-24 LoCoML: A Framework for Real-World ML Inference Pipelines Kritin Maddireddy et.al. 2501.14165 null
2025-01-23 Integrating Persian Lip Reading in Surena-V Humanoid Robot for Human-Robot Interaction Ali Farshian Abbasi et.al. 2501.13996 null
2025-01-18 Fanar: An Arabic-Centric Multimodal Generative AI Platform Fanar Team et.al. 2501.13944 null
2025-01-23 Predicting Compact Phrasal Rewrites with Large Language Models for ASR Post Editing Hao Zhang et.al. 2501.13831 null
2025-01-23 Learning-based A Posteriori Speech Presence Probability Estimation and Applications Shuai Tao et.al. 2501.13642 null
2025-01-23 DQ-Data2vec: Decoupling Quantization for Multilingual Speech Recognition Qijie Shao et.al. 2501.13497 null
2025-02-16 OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia Xuelong Geng et.al. 2501.13306 link
2025-01-22 Let SSMs be ConvNets: State-space Modeling with Optimal Tensor Contractions Yan Ru Pei et.al. 2501.13230 null
2025-01-22 FlanEC: Exploring Flan-T5 for Post-ASR Error Correction Moreno La Quatra et.al. 2501.12979 link
2025-01-21 A Domain Adaptation Framework for Speech Recognition Systems with Only Synthetic data Minh Tran et.al. 2501.12501 null
2025-01-21 DOTA-ME-CS: Daily Oriented Text Audio-Mandarin English-Code Switching Dataset Yupei Li et.al. 2501.12122 null
2025-01-20 Investigation of Whisper ASR Hallucinations Induced by Non-Speech Audio Mateusz Barański et.al. 2501.11378 null
2025-01-19 Enhancing Neural Spoken Language Recognition: An Exploration with Multilingual Datasets Or Haim Anidjar et.al. 2501.11065 null
2025-01-18 A Benchmark of French ASR Systems Based on Error Severity Antoine Tholly et.al. 2501.10879 null
2025-01-18 GEC-RAG: Improving Generative Error Correction via Retrieval-Augmented Generation for Automatic Speech Recognition Systems Amin Robatian et.al. 2501.10734 null
2025-01-17 Unsupervised Rhythm and Voice Conversion of Dysarthric to Healthy Speech for ASR Karl El Hajal et.al. 2501.10256 null
2025-01-17 Automatic Speech Recognition for Sanskrit with Transfer Learning Bidit Sadhukhan et.al. 2501.10024 null
2025-01-21 PIER: A Novel Metric for Evaluating What Matters in Code-Switching Enes Yavuz Ugan et.al. 2501.09512 null
2025-01-16 Teaching Wav2Vec2 the Language of the Brain Tobias Fiedler et.al. 2501.09459 link
2025-01-16 Delayed Fusion: Integrating Large Language Models into First-Pass Decoding in End-to-end Speech Recognition Takaaki Hori et.al. 2501.09258 null
2025-01-17 persoDA: Personalized Data Augmentation for Personalized ASR Pablo Peso Parada et.al. 2501.09113 null
2025-01-20 A Non-autoregressive Model for Joint STT and TTS Vishal Sunder et.al. 2501.09104 null
2025-01-13 Discrimination loss vs. SRT: A model-based approach towards harmonizing speech test interpretations Mareike Buhl et.al. 2501.08921 null
2025-01-15 Adapting Whisper for Regional Dialects: Enhancing Public Services for Vulnerable Populations in the United Kingdom Melissa Torgbi et.al. 2501.08502 null
2025-01-14 Selective Attention Merging for low resource tasks: A case study of Child ASR Natarajan Balaji Shankar et.al. 2501.08468 link
2025-01-14 Loudspeaker Beamforming to Enhance Speech Recognition Performance of Voice Driven Applications Dimme de Groot et.al. 2501.08104 null
2025-01-17 Joint Automatic Speech Recognition And Structure Learning For Better Speech Understanding Jiliang Hu et.al. 2501.07329 link
2025-01-13 Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model Ziyang Ma et.al. 2501.07246 null
2025-01-13 AdaCS: Adaptive Normalization for Enhanced Code-Switching ASR The Chuong Chu et.al. 2501.07102 link
2025-01-11 Discrete Speech Unit Extraction via Independent Component Analysis Tomohiko Nakamura et.al. 2501.06562 link
2025-01-11 A Survey on Spoken Italian Datasets and Corpora Marco Giordano et.al. 2501.06557 null
2025-01-11 Speech Recognition for Automatically Assessing Afrikaans and isiXhosa Preschool Oral Narratives Christiaan Jacobs et.al. 2501.06478 null
2025-01-10 TTS-Transducer: End-to-End Speech Synthesis with Neural Transducer Vladimir Bataev et.al. 2501.06320 null
2025-01-10 Contextual ASR Error Handling with LLMs Augmentation for Goal-Oriented Conversational AI Yuya Asano et.al. 2501.06129 null
2025-02-19 Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding Fabian David Schmidt et.al. 2501.06117 link
2025-01-10 Benchmarking Rotary Position Embeddings for Automatic Speech Recognition Shucong Zhang et.al. 2501.06051 null
2025-01-19 Comparing Self-Supervised Learning Models Pre-Trained on Human Speech and Animal Vocalizations for Bioacoustics Processing Eklavya Sarkar et.al. 2501.05987 link
2025-01-10 Universal-2-TF: Robust All-Neural Text Formatting for ASR Yash Khare et.al. 2501.05948 null
2025-01-09 Right Label Context in End-to-End Training of Time-Synchronous ASR Models Tina Raissi et.al. 2501.04521 null
2025-01-08 Phone-purity Guided Discrete Tokens for Dysarthric Speech Recognition Huimeng Wang et.al. 2501.04379 null
2025-01-08 LipGen: Viseme-Guided Lip Video Generation for Enhancing Visual Speech Recognition Bowen Hao et.al. 2501.04204 null
2025-01-03 Listening and Seeing Again: Generative Error Correction for Audio-Visual Speech Recognition Rui Liu et.al. 2501.04038 link
2025-01-07 Universal Speaker Embedding Free Target Speaker Extraction and Personal Voice Activity Detection Bang Zeng et.al. 2501.03612 null
2025-01-14 Towards a Generalizable Speech Marker for Parkinson's Disease Diagnosis Maksim Siniukov et.al. 2501.03581 null
2025-01-07 Deep Learning for Pathological Speech: A Survey Shakeel A. Sheikh et.al. 2501.03536 null
2025-01-01 Breaking Through the Spike: Spike Window Decoding for Accelerated and Precise Automatic Speech Recognition Wei Zhang et.al. 2501.03257 null
2025-01-08 Samba-ASR: State-Of-The-Art Speech Recognition Leveraging Structured State-Space Models Syed Abdul Gaffar Shakhadri et.al. 2501.02832 null
2025-01-05 Reducing the Gap Between Pretrained Speech Enhancement and Recognition Models Using a Real Speech-Trained Bridging Module Zhongjian Cui et.al. 2501.02452 null
2025-01-03 Improving Transducer-Based Spoken Language Understanding with Self-Conditioned CTC and Knowledge Transfer Vishal Sunder et.al. 2501.01936 null
2025-01-11 Advancing Singlish Understanding: Bridging the Gap with Datasets and Multimodal Models Bin Wang et.al. 2501.01034 link
2025-01-01 Incremental Dialogue Management: Survey, Discussion, and Implications for HRI Casey Kennington et.al. 2501.00953 null
2025-01-01 Large Language Models Are Read/Write Policy-Makers for Simultaneous Generation Shoutao Guo et.al. 2501.00868 link
2025-01-01 Automatic Text Pronunciation Correlation Generation and Application for Contextual Biasing Gaofeng Cheng et.al. 2501.00804 null
2024-12-31 Fotheidil: an Automatic Transcription System for the Irish Language Liam Lonergan et.al. 2501.00509 null
2024-12-31 Whisper Turns Stronger: Augmenting Wav2Vec 2.0 for Superior ASR in Low-Resource Languages Or Haim Anidjar et.al. 2501.00425 null
2025-01-06 Takeaways from Applying LLM Capabilities to Multiple Conversational Avatars in a VR Pilot Study Mykola Maslych et.al. 2501.00168 null
2024-12-30 DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition Alexander Polok et.al. 2501.00114 link
2024-12-25 Speech Recognition With LLMs Adapted to Disordered Speech Using Reinforcement Learning Chirag Nagpal et.al. 2501.00039 null
2024-12-27 Enhancing Whisper's Accuracy and Speed for Indian Languages through Prompt-Tuning and Tokenization Kumud Tripathi et.al. 2412.19785 null
2024-12-26 Towards a Single ASR Model That Generalizes to Disordered Speech Jimmy Tobin et.al. 2412.19315 null
2024-12-26 Enhancing Audiovisual Speech Recognition through Bifocal Preference Optimization Yihan Wu et.al. 2412.19005 link
2024-12-25 Structured Speaker-Deficiency Adaptation of Foundation Models for Dysarthric and Elderly Speech Recognition Shujie Hu et.al. 2412.18832 null
2024-12-30 Zero-resource Speech Translation and Recognition with LLMs Karel Mundnich et.al. 2412.18566 null
2025-01-09 Trading Devil RL: Backdoor attack via Stock market, Bayesian Optimization and Reinforcement Learning Orson Mengara et.al. 2412.17908 null
2024-12-09 Ensemble Machine Learning Model for Inner Speech Recognition: A Subject-Specific Investigation Shahamat Mustavi Tasin et.al. 2412.17824 null
2024-12-23 Investigating Prosodic Signatures via Speech Pre-Trained Models for Audio Deepfake Source Attribution Orchid Chetia Phukan et.al. 2412.17796 null
2024-12-23 UME: Upcycling Mixture-of-Experts for Scalable and Efficient Automatic Speech Recognition Li Fu et.al. 2412.17507 null
2024-12-23 Deep Learning in Proteomics Informatics: Applications, Challenges, and Future Directions Yindan Luo et.al. 2412.17349 null
2025-01-17 Uncovering the Visual Contribution in Audio-Visual Speech Recognition Zhaofeng Lin et.al. 2412.17129 null
2025-01-05 Adapting Whisper for Code-Switching through Encoding Refining and Language-Aware Decoding Jiahui Zhao et.al. 2412.16507 null
2025-01-03 Speech Retrieval-Augmented Generation without Automatic Speech Recognition Do June Min et.al. 2412.16500 null
2024-12-21 Enhancing Multilingual ASR for Unseen Languages via Language Embedding Modeling Shao-Syuan Huang et.al. 2412.16474 null
2024-12-21 Transducer-Llama: Integrating LLMs into Streamable Transducer-based Speech Recognition Keqi Deng et.al. 2412.16464 null
2025-01-19 MathSpeech: Leveraging Small LMs for Accurate Conversion in Mathematical Speech-to-Formula Sieun Hyeon et.al. 2412.15655 link
2024-12-20 TouchASP: Elastic Automatic Speech Perception that Everyone Can Touch Xingchen Song et.al. 2412.15622 null
2024-12-19 Transcribing and Translating, Fast and Slow: Joint Speech Translation and Recognition Niko Moritz et.al. 2412.15415 null
2024-12-23 LAMA-UT: Language Agnostic Multilingual ASR through Orthography Unification and Language-Specific Transliteration Sangmin Lee et.al. 2412.15299 null
2025-01-09 CAMEL: Cross-Attention Enhanced Mixture-of-Experts and Language Bias for Code-Switching Speech Recognition He Wang et.al. 2412.12760 null
2024-12-24 Streaming Keyword Spotting Boosted by Cross-layer Discrimination Consistency Yu Xi et.al. 2412.12635 null
2024-12-11 Greek2MathTex: A Greek Speech-to-Text Framework for LaTeX Equations Generation Evangelia Gkritzali et.al. 2412.12167 null
2024-12-09 Harnessing Transfer Learning from Swahili: Advancing Solutions for Comorian Dialects Naira Abdou Mohamed et.al. 2412.12143 null
2024-12-17 Speak & Improve Corpus 2025: an L2 English Speech Corpus for Language Assessment and Feedback Kate Knill et.al. 2412.11986 null
2024-12-17 Speak & Improve Challenge 2025: Tasks and Baseline Systems Mengjie Qian et.al. 2412.11985 null
2024-12-20 MERaLiON-SpeechEncoder: Towards a Speech Foundation Model for Singapore and Beyond Muhammad Huzaifah et.al. 2412.11538 null
2024-12-15 Transliterated Zero-Shot Domain Adaptation for Automatic Speech Recognition Han Zhu et.al. 2412.11185 null
2024-12-14 Robust Recognition of Persian Isolated Digits in Speech using Deep Neural Network Ali Nasr-Esfahani et.al. 2412.10857 null
2024-12-14 Efficient Adaptation of Multilingual Models for Japanese ASR Mark Bajo et.al. 2412.10705 link
2025-01-16 MERaLiON-AudioLLM: Bridging Audio and Language with Large Language Models Yingxu He et.al. 2412.09818 null
2024-11-26 Enhancing Code-Switching ASR Leveraging Non-Peaky CTC Loss and Deep Language Posterior Injection Tzu-Ting Yang et.al. 2412.08651 null
2024-12-11 Bilevel Joint Unsupervised and Supervised Training for Automatic Speech Recognition Xiaodong Cui et.al. 2412.08548 null
2024-12-10 Style-agnostic evaluation of ASR using multiple reference transcripts Quinten McNamara et.al. 2412.07937 null
2024-12-09 Effective Text Adaptation for LLM-based ASR through Soft Prompt Fine-Tuning Yingyi Ma et.al. 2412.06967 null
2024-12-09 Not All Errors Are Equal: Investigation of Speech Recognition Errors in Alzheimer's Disease Detection Jiawen Kang et.al. 2412.06332 null
2024-12-09 Leveraging Prompt Learning and Pause Encoding for Alzheimer's Disease Detection Yin-Long Liu et.al. 2412.06259 null
2024-12-07 SQ-Whisper: Speaker-Querying based Whisper Model for Target-Speaker ASR Pengcheng Guo et.al. 2412.05589 link
2024-12-06 Adaptive Dropout for Pruning Conformers Yotaro Kubo et.al. 2412.04836 null
2024-12-05 Comprehensive Audio Query Handling System with Integrated Expert Models and Contextual Understanding Vakada Naveen et.al. 2412.03980 null
2024-12-05 Speech Recognition-based Feature Extraction for Enhanced Automatic Severity Classification in Dysarthric Speech Yerin Choi et.al. 2412.03784 null
2024-12-04 ASR-EC Benchmark: Evaluating Large Language Models on Chinese ASR Error Correction Victor Junqiu Wei et.al. 2412.03075 null
2024-12-03 GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot Aohan Zeng et.al. 2412.02612 link
2024-12-01 Late fusion ensembles for speech recognition on diverse input audio representations Marin Jezidžić et.al. 2412.01861 null
2024-12-01 Automating Feedback Analysis in Surgical Training: Detection, Categorization, and Assessment Firdavs Nasriddinov et.al. 2412.00760 link
2024-12-04 A Comparative Study of LLM-based ASR and Whisper in Low Resource and Code Switching Scenario Zheshu Song et.al. 2412.00721 null
2024-11-30 Sample adaptive data augmentation with progressive scheduling Hongxuan Lu et.al. 2412.00415 null
2024-11-30 Empowering the Deaf and Hard of Hearing Community: Enhancing Video Captions Using Large Language Models Nadeen Fathallah et.al. 2412.00342 null
2024-11-24 High-precision medical speech recognition through synthetic data and semantic correction: UNITED-MEDASR Sourav Banerjee et.al. 2412.00055 null
2024-11-29 Memristive Nanowire Network for Energy Efficient Audio Classification: Pre-Processing-Free Reservoir Computing with Reduced Latency Akshaya Rajesh et.al. 2411.19611 null
2024-11-28 ArEEG_Words: Dataset for Envisioned Speech Recognition using EEG for Arabic Words Hazem Darwish et.al. 2411.18888 null
2024-11-20 Towards Advanced Speech Signal Processing: A Statistical Perspective on Convolution-Based Architectures and its Applications Nirmal Joshua Kapu et.al. 2411.18636 null
2024-11-27 EEG-Based Analysis of Brain Responses in Multi-Modal Human-Robot Interaction: Modulating Engagement Suzanne Oliver et.al. 2411.18587 null
2024-11-27 AMPS: ASR with Multimodal Paraphrase Supervision Amruta Parulekar et.al. 2411.18368 null
2024-11-27 Continual Learning in Machine Speech Chain Using Gradient Episodic Memory Geoffrey Tyndall et.al. 2411.18320 null
2024-11-27 Aligning Pre-trained Models for Spoken Language Translation Šimon Sedláček et.al. 2411.18294 null
2024-11-27 Efficient Nonlinear Function Approximation in Analog Resistive Crossbars for Recurrent Neural Networks Junyi Yang et.al. 2411.18271 null
2025-01-05 How to Learn a New Language? An Efficient Solution for Self-Supervised Learning Models Unseen Languages Adaption in Low-Resource Scenario Shih-Heng Wang et.al. 2411.18217 null
2025-01-15 MSA-ASR: Efficient Multilingual Speaker Attribution with frozen ASR Models Thai-Binh Nguyen et.al. 2411.18152 null
2024-11-27 SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation Wenyi Yu et.al. 2411.18138 null
2024-11-27 Fusion of Discrete Representations and Self-Augmented Representations for Multilingual Automatic Speech Recognition Shih-heng Wang et.al. 2411.18107 null
2024-11-26 Disentangled-Transformer: An Explainable End-to-End Automatic Speech Recognition Model with Speech Content-Context Separation Pu Wang et.al. 2411.17846 null
2024-12-02 Scaling Speech-Text Pre-training with Synthetic Interleaved Data Aohan Zeng et.al. 2411.17607 null
2024-11-26 Towards Maximum Likelihood Training for Transducer-based Streaming Speech Recognition Hyeonseung Lee et.al. 2411.17537 null
2024-11-26 Comparative Analysis of ASR Methods for Speech Deepfake Detection Davide Salvi et.al. 2411.17349 null
2024-11-26 k2SSL: A Faster and Better Framework for Self-Supervised Speech Representation Learning Yifan Yang et.al. 2411.17100 link
2024-11-22 TSkips: Efficiency Through Explicit Temporal Delay Connections in Spiking Neural Networks Prajna G. Malettira et.al. 2411.16711 null
2024-11-22 Transforming NLU with Babylon: A Case Study in Development of Real-time, Edge-Efficient, Multi-Intent Translation System for Automated Drive-Thru Ordering Mostafa Varzaneh et.al. 2411.15372 null
2024-11-20 From Statistical Methods to Pre-Trained Models; A Survey on Automatic Speech Recognition for Resource Scarce Urdu Language Muhammad Sharif et.al. 2411.14493 null
2024-11-26 Tiny-Align: Bridging Automatic Speech Recognition and Large Language Model on the Edge Ruiyang Qin et.al. 2411.13766 null
2024-11-18 A Novel Speech Analysis and Correction Tool for Arabic-Speaking Children Lamia Berriche et.al. 2411.13592 null
2024-11-26 WavChat: A Survey of Spoken Dialogue Models Shengpeng Ji et.al. 2411.13577 link
2024-11-20 CAFE A Novel Code switching Dataset for Algerian Dialect French and English Houssam Eddine-Othman Lachemat et.al. 2411.13424 null
2024-11-20 Hard-Synth: Synthesizing Diverse Hard Samples for ASR using Zero-Shot TTS and LLM Jiawei Yu et.al. 2411.13159 null
2024-11-19 Whisper Finetuning on Nepali Language Sanjay Rijal et.al. 2411.12587 null
2024-11-27 Inter-linguistic Phonetic Composition (IPC): A Theoretical and Computational Approach to Enhance Second Language Pronunciation Jisang Park et.al. 2411.10927 null
2024-11-16 BanglaDialecto: An End-to-End AI-Powered Regional Speech Standardization Md. Nazmus Sadat Samin et.al. 2411.10879 link
2024-12-08 Interactive Cycle Model -- The Linkage Combination among Automatic Speech Recognition, Large Language Models and Smart Glasses Libo Wang et.al. 2411.10362 link
2024-11-15 Systolic Arrays and Structured Pruning Co-design for Efficient Transformers in Edge Systems Pedro Palacios et.al. 2411.10285 null
2024-11-15 DiMoDif: Discourse Modality-information Differentiation for Audio-visual Deepfake Detection and Localization Christos Koutlis et.al. 2411.10193 null
2024-11-15 XLSR-Mamba: A Dual-Column Bidirectional State Space Model for Spoofing Attack Detection Yang Xiao et.al. 2411.10027 null
2024-11-14 Everyone deserves their voice to be heard: Analyzing Predictive Gender Bias in ASR Models Applied to Dutch Speech Data Rik Raes et.al. 2411.09431 null
2024-11-14 Transferable Adversarial Attacks against ASR Xiaoxue Gao et.al. 2411.09220 null
2024-10-28 Multilingual Standalone Trustworthy Voice-Based Social Network for Disaster Situations Majid Behravan et.al. 2411.08889 null
2024-11-11 Mamba-based Decoder-Only Approach with Bidirectional Speech Modeling for Speech Recognition Yoshiki Masuyama et.al. 2411.06968 link
2024-12-28 DCF-DS: Deep Cascade Fusion of Diarization and Separation for Speech Recognition under Realistic Single-Channel Conditions Shu-Tong Niu et.al. 2411.06667 null
2024-11-10 CTC-Assisted LLM-Based Contextual ASR Guanrou Yang et.al. 2411.06437 link
2024-12-04 Dialectal Coverage And Generalization in Arabic Speech Recognition Amirbek Djanibekov et.al. 2411.05872 link
2024-11-07 Sentiment Analysis of Spanish Political Party Tweets Using Pre-trained Language Models Chuqiao Song et.al. 2411.04862 null
2024-11-07 Multistage Fine-tuning Strategies for Automatic Speech Recognition in Low-resource Languages Leena G Pillai et.al. 2411.04573 null
2024-11-04 Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs Alexandros Haliassos et.al. 2411.02256 link
2024-11-03 SPES: Spectrogram Perturbation for Explainable Speech-to-Text Generation Dennis Fucci et.al. 2411.01710 null
2024-11-08 Enhancing AAC Software for Dysarthric Speakers in e-Health Settings: An Evaluation Using TORGO Macarious Hui et.al. 2411.00980 null
2024-11-04 Optimizing Contextual Speech Recognition Using Vector Quantization for Efficient Retrieval Nikolaos Flemotomos et.al. 2411.00664 null
2024-10-31 IO Transformer: Evaluating SwinV2-Based Reward Models for Computer Vision Maxwell Meyer et.al. 2411.00252 null
2024-10-31 Speech is More Than Words: Do Speech-to-Text Translation Systems Leverage Prosody? Ioannis Tsiamas et.al. 2410.24019 null
2024-10-30 Augmenting Polish Automatic Speech Recognition System With Synthetic Data Łukasz Bondaruk et.al. 2410.22903 null
2024-10-30 Run-Time Adaptation of Neural Beamforming for Robust Speech Dereverberation and Denoising Yoto Fujita et.al. 2410.22805 null
2024-10-29 Joint Beamforming and Speaker-Attributed ASR for Real Distant-Microphone Meeting Transcription Can Cui et.al. 2410.21849 null
2024-10-28 Asynchronous Tool Usage for Real-Time Agents Antonio A. Ginart et.al. 2410.21620 null
2024-10-27 Using Confidence Scores to Improve Eyes-free Detection of Speech Recognition Errors Sadia Nowrin et.al. 2410.20564 null
2024-10-27 Improving Speech-based Emotion Recognition with Contextual Utterance Analysis and LLMs Enshi Zhang et.al. 2410.20334 null
2024-11-04 emg2qwerty: A Large Dataset with Baselines for Touch Typing using Surface Electromyography Viswanath Sivakumar et.al. 2410.20081 link
2024-10-25 A Survey on Speech Large Language Models Jing Peng et.al. 2410.18908 null
2024-10-24 We Augmented Whisper With kNN and You Won't Believe What Came Next Maya K. Nachesa et.al. 2410.18850 null
2024-10-24 STTATTS: Unified Speech-To-Text And Text-To-Speech Model Hawau Olamide Toyin et.al. 2410.18607 link
2024-10-24 Evaluating and Improving Automatic Speech Recognition Systems for Korean Meteorological Experts ChaeHun Park et.al. 2410.18444 null
2024-10-24 Contextual Biasing to Improve Domain-specific Custom Vocabulary Audio Transcription without Explicit Fine-Tuning of Whisper Model Vishakha Lall et.al. 2410.18363 null
2024-10-23 ELAICHI: Enhancing Low-resource TTS by Addressing Infrequent and Low-frequency Character Bigrams Srija Anand et.al. 2410.17901 null
2024-10-23 VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning Yifan Peng et.al. 2410.17485 null
2024-10-22 mmWave-Whisper: Phone Call Eavesdropping and Transcription Using Millimeter-Wave Radar Suryoday Basak et.al. 2410.17457 null
2024-10-22 Improving Automatic Speech Recognition with Decoder-Centric Regularisation in Encoder-Decoder Models Alexander Polok et.al. 2410.17437 null
2024-12-11 VoiceBench: Benchmarking LLM-Based Voice Assistants Yiming Chen et.al. 2410.17196 link
2024-10-22 Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap Guanrou Yang et.al. 2410.16726 null
2024-10-22 DENOASR: Debiasing ASRs through Selective Denoising Anand Kumar Rai et.al. 2410.16712 null
2024-10-21 AlignVSR: Audio-Visual Cross-Modal Alignment for Visual Speech Recognition Zehua Liu et.al. 2410.16438 link
2024-10-19 End-to-End Transformer-based Automatic Speech Recognition for Northern Kurdish: A Pioneering Approach Abdulhady Abas Abdullah et.al. 2410.16330 null
2024-10-21 Acoustic Model Optimization over Multiple Data Sources: Merging and Valuation Victor Junqiu Wei et.al. 2410.15620 null
2024-10-21 Interventional Speech Noise Injection for ASR Generalizable Spoken Language Understanding Yeonjoon Jung et.al. 2410.15609 null
2024-10-22 Moonshine: Speech Recognition for Live Transcription and Voice Commands Nat Jeffries et.al. 2410.15608 link
2024-10-20 Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant Alan Dao et.al. 2410.15316 link
2024-10-19 Enhancing Multimodal Sentiment Analysis for Missing Modality through Self-Distillation and Unified Modality Cross-Attention Yuzhe Weng et.al. 2410.15029 link
2024-10-18 AC-Mix: Self-Supervised Adaptation for Low-Resource Automatic Speech Recognition using Agnostic Contrastive Mixup Carlos Carvalho et.al. 2410.14910 null
2024-10-09 A two-stage transliteration approach to improve performance of a multilingual ASR Rohit Kumar et.al. 2410.14709 null
2024-10-17 Parameter-efficient Adaptation of Multilingual Multimodal Models for Low-resource ASR Abhishek Gupta et.al. 2410.13445 null
2024-10-17 Computational Approaches to Arabic-English Code-Switching Caroline Sabty et.al. 2410.13318 null
2024-10-17 Roadmap towards Superhuman Speech Understanding using Large Language Models Fan Bu et.al. 2410.13268 null
2024-10-17 Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation Sreyan Ghosh et.al. 2410.13198 null
2024-10-17 EH-MAM: Easy-to-Hard Masked Acoustic Modeling for Self-Supervised Speech Representation Learning Ashish Seth et.al. 2410.13179 link
2024-10-17 Deep Learning-based Software Engineering: Progress, Challenges, and Opportunities Xiangping Chen et.al. 2410.13110 null
2024-10-07 Automatic Screening for Children with Speech Disorder using Automatic Speech Recognition: Opportunities and Challenges Dancheng Liu et.al. 2410.11865 null
2024-10-15 A Framework for Adapting Human-Robot Interaction to Diverse User Groups Theresa Pekarek Rosin et.al. 2410.11377 link
2024-10-15 Investigation of Speaker Representation for Target-Speaker Speech Processing Takanori Ashihara et.al. 2410.11243 null
2024-10-14 Character-aware audio-visual subtitling in context Jaesung Huh et.al. 2410.11068 null
2024-10-14 In-Materia Speech Recognition Mohamadreza Zolfagharinejad et.al. 2410.10434 null
2024-10-13 State of NLP in Kenya: A Survey Cynthia Jayne Amol et.al. 2410.09948 null
2024-10-12 SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs Wenxi Chen et.al. 2410.09503 link
2024-10-12 Automatic Speech Recognition with BERT and CTC Transformers: A Review Noussaiba Djeffal et.al. 2410.09456 null
2024-10-11 UniGlyph: A Seven-Segment Script for Universal Language Representation G. V. Bency Sherin et.al. 2410.08974 null
2024-10-14 Enhancing Indonesian Automatic Speech Recognition: Evaluating Multilingual Models with Diverse Speech Variabilities Aulia Adila et.al. 2410.08828 null
2024-10-10 Full-Rank No More: Low-Rank Weight Training for Modern Speech Recognition Models Adriana Fernandez-Lopez et.al. 2410.07771 null
2024-10-18 Advocating Character Error Rate for Multilingual ASR Evaluation Thennal D K et.al. 2410.07400 null
2024-10-08 The USTC-NERCSLIP Systems for the CHiME-8 MMCSG Challenge Ya Jiang et.al. 2410.05986 null
2024-10-07 Incorporating Talker Identity Aids With Improving Speech Recognition in Adversarial Environments Sagarika Alavilli et.al. 2410.05423 null
2024-10-05 The OCON model: an old but gold solution for distributable supervised classification Stefano Giacomelli et.al. 2410.05320 link
2024-10-07 Enhancing Job Interview Preparation Through Immersive Experiences Using Photorealistic, AI-powered Metahuman Avatars Navid Ashrafi et.al. 2410.05131 null
2024-10-13 CR-CTC: Consistency regularization on CTC for improved speech recognition Zengwei Yao et.al. 2410.05101 link
2024-10-06 Punctuation Prediction for Polish Texts using Transformers Jakub Pokrywka et.al. 2410.04621 null
2024-10-06 Casablanca: Data and Models for Multidialectal Arabic Speech Recognition Bashar Talafha et.al. 2410.04527 null
2024-10-05 Efficient and Robust Long-Form Speech Recognition with Hybrid H3-Conformer Tomoki Honda et.al. 2410.04159 link
2024-10-05 The OCON model: an old but green solution for distributable supervised classification for acoustic monitoring in smart cities Stefano Giacomelli et.al. 2410.04098 null
2024-10-05 Enhancement of Dysarthric Speech Reconstruction by Contrastive Learning Keshvari Fatemeh et.al. 2410.04092 null
2024-10-04 Reverb: Open-Source ASR and Diarization from Rev Nishchal Bhandari et.al. 2410.03930 null
2024-10-13 Self-Powered LLM Modality Expansion for Large Speech-Text Models Tengfei Yu et.al. 2410.03798 link
2024-10-02 SeeSay: An Assistive Device for the Visually Impaired Using Retrieval Augmented Generation Melody Yu et.al. 2410.03771 null
2024-10-02 Efficient Streaming LLM for Speech Recognition Junteng Jia et.al. 2410.03752 null
2024-10-01 Recent Advances in Speech Language Models: A Survey Wenqian Cui et.al. 2410.03751 null
2024-10-04 Multi-Dialect Vietnamese: Task, Dataset, Baseline Models and Challenges Nguyen Van Dinh et.al. 2410.03458 link
2024-10-04 Team MTS @ AutoMin 2021: An Overview of Existing Summarization Approaches and Comparison to Unsupervised Summarization Techniques Olga Iakovenko et.al. 2410.03412 null
2024-10-03 Three-in-One: Fast and Accurate Transducer for Hybrid-Autoregressive ASR Hainan Xu et.al. 2410.02597 null
2024-10-04 Convolutional Variational Autoencoders for Spectrogram Compression in Automatic Speech Recognition Olga Iakovenko et.al. 2410.02560 null
2024-10-03 Algorithms For Automatic Accentuation And Transcription Of Russian Texts In Speech Recognition Systems Olga Iakovenko et.al. 2410.02538 null
2024-10-03 A Pilot Study of Applying Sequence-to-Sequence Voice Conversion to Evaluate the Intelligibility of L2 Speech Using a Native Speaker's Shadowings Haopeng Geng et.al. 2410.02239 null
2024-09-27 A GEN AI Framework for Medical Note Generation Hui Yi Leong et.al. 2410.01841 null
2024-10-02 Spoken Grammar Assessment Using LLM Sunil Kumar Kopparapu et.al. 2410.01579 null
2024-10-01 MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages Marco Gaido et.al. 2410.01036 link
2024-10-01 Automatic Speech Recognition for the Ika Language Uchenna Nzenwata et.al. 2410.00940 null
2024-10-04 VHASR: A Multimodal Speech Recognition System With Vision Hotwords Jiliang Hu et.al. 2410.00822 link
2024-10-01 End-to-End Speech Recognition with Pre-trained Masked Language Model Yosuke Higuchi et.al. 2410.00528 link
2024-09-30 Mamba for Streaming ASR Combined with Unimodal Aggregation Ying Fang et.al. 2410.00070 link
2024-10-02 Moshi: a speech-text foundation model for real-time dialogue Alexandre Défossez et.al. 2410.00037 link
2024-09-30 Boosting Hybrid Autoregressive Transducer-based ASR with Internal Acoustic Model Training and Dual Blank Thresholding Takafumi Moriya et.al. 2409.20313 null
2024-09-30 Alignment-Free Training for Transducer-based Multi-Talker ASR Takafumi Moriya et.al. 2409.20301 null
2024-09-30 AfriHuBERT: A self-supervised speech representation model for African languages Jesujoba O. Alabi et.al. 2409.20201 null
2024-09-30 Predictive Speech Recognition and End-of-Utterance Detection Towards Spoken Dialog Systems Oswald Zink et.al. 2409.19990 null
2024-09-30 HDMoLE: Mixture of LoRA Experts with Hierarchical Routing and Dynamic Thresholds for Fine-Tuning LLM-based ASR Models Bingshen Mu et.al. 2409.19878 null
2024-09-29 Fine-Tuning Automatic Speech Recognition for People with Parkinson's: An Effective Strategy for Enhancing Speech Technology Accessibility Xiuwen Zheng et.al. 2409.19818 null
2024-09-29 Efficient Long-Form Speech Recognition for General Speech In-Context Learning Hao Yen et.al. 2409.19757 null
2024-09-29 Quantitative Analysis of Audio-Visual Tasks: An Information-Theoretic Perspective Chen Chen et.al. 2409.19575 null
2024-09-29 CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought Yexing Du et.al. 2409.19510 link
2024-09-28 Advanced Clustering Techniques for Speech Signal Enhancement: A Review and Metanalysis of Fuzzy C-Means, K-Means, and Kernel Fuzzy C-Means Methods Abdulhady Abas Abdullah et.al. 2409.19448 null
2024-09-27 Speech-Mamba: Long-Context Speech Recognition with Selective State Spaces Models Xiaoxue Gao et.al. 2409.18654 null
2024-09-30 ChildMandarin: A Comprehensive Mandarin Speech Dataset for Young Children Aged 3-5 Jiaming Zhou et.al. 2409.18584 null
2024-09-27 Improving Multilingual ASR in the Wild Using Simple N-best Re-ranking Brian Yan et.al. 2409.18428 link
2024-09-26 Unveiling the Role of Pretraining in Direct Speech Translation Belen Alastruey et.al. 2409.18044 null
2024-09-26 Are Transformers in Pre-trained LM A Good ASR Encoder? An Empirical Study Keyu An et.al. 2409.17750 null
2024-09-26 Paraformer-v2: An improved non-autoregressive transformer for noise-robust speech recognition Keyu An et.al. 2409.17746 null
2024-09-26 Deep CLAS: Deep Contextual Listen, Attend and Spell Shifu Xiong et.al. 2409.17603 null
2024-11-08 How to Connect Speech Foundation Models and Large Language Models? What Matters and What Does Not Francesco Verdini et.al. 2409.17044 null
2024-09-25 MT2KD: Towards A General-Purpose Encoder for Speech, Speaker, and Audio Events Xiaoyu Yang et.al. 2409.17010 null
2024-09-25 Weighted Cross-entropy for Low-Resource Languages in Multilingual Speech Recognition Andrés Piñeiro-Martín et.al. 2409.16954 link
2024-09-27 Semi-Supervised Cognitive State Classification from Speech with Multi-View Pseudo-Labeling Yuanchao Li et.al. 2409.16937 link
2024-09-25 Speech Recognition Rescoring with Large Speech-Text Foundation Models Prashanth Gurunath Shivakumar et.al. 2409.16654 null
2024-09-24 Spelling Correction through Rewriting of Non-Autoregressive ASR Lattices Leonid Velikovich et.al. 2409.16469 null
2024-09-24 Revisiting Acoustic Features for Robust ASR Muhammad A. Shah et.al. 2409.16399 null
2024-09-10 How Redundant Is the Transformer Stack in Speech Representation Models? Teresa Dorszewski et.al. 2409.16302 null
2024-09-24 Bridging Speech and Text: Enhancing ASR with Pinyin-to-Character Pre-training in LLMs Yang Yuhang et.al. 2409.16005 null
2024-10-31 Boosting Code-Switching ASR with Mixture of Experts Enhanced Speech-Conditioned LLM Fengrun Zhang et.al. 2409.15905 null
2024-09-24 WeSep: A Scalable and Flexible Toolkit Towards Generalizable Target Speaker Extraction Shuai Wang et.al. 2409.15799 link
2024-09-24 Hypothesis Clustering and Merging: Novel MultiTalker Speech Recognition with Speaker Tokens Yosuke Kashiwagi et.al. 2409.15732 null
2024-09-23 Revise, Reason, and Recognize: LLM-Based Emotion Recognition via Emotion-Specific Prompts and ASR Error Correction Yuanchao Li et.al. 2409.15551 link
2024-09-17 A Joint Spectro-Temporal Relational Thinking Based Acoustic Modeling Framework Zheng Nan et.al. 2409.15357 null
2024-09-11 Contextualization of ASR with LLM using phonetic retrieval-based augmentation Zhihong Lei et.al. 2409.15353 null
2024-09-10 A Large Dataset of Spontaneous Speech with the Accent Spoken in São Paulo for Automatic Speech Recognition Evaluation Rodrigo Lima et.al. 2409.15350 null
2024-09-13 CPT-Boosted Wav2vec2.0: Towards Noise Robust Speech Recognition for Classroom Environments Ahmed Adel Attia et.al. 2409.14494 null
2024-09-21 Strong Alone, Stronger Together: Synergizing Modality-Binding Foundation Models with Optimal Transport for Non-Verbal Emotion Recognition Orchid Chetia Phukan et.al. 2409.14221 null
2024-09-21 MultiMed: Multilingual Medical Speech Recognition via Attention Encoder Decoder Khai Le-Duc et.al. 2409.14074 link
2024-09-20 Time and Tokens: Benchmarking End-to-End Speech Dysfluency Detection Xuanru Zhou et.al. 2409.13582 null
2024-09-20 LM-assisted keyword biasing with Aho-Corasick algorithm for Transducer-based ASR Iuliia Thorbecke et.al. 2409.13514 null
2024-10-07 Fast Streaming Transducer ASR Prototyping via Knowledge Distillation with Whisper Iuliia Thorbecke et.al. 2409.13499 null
2024-09-20 A Multimodal Dense Retrieval Approach for Speech-Based Open-Domain Question Answering Georgios Sidiropoulos et.al. 2409.13483 null
2024-09-20 Large Language Model Should Understand Pinyin for Chinese ASR Error Correction Yuang Li et.al. 2409.13262 null
2024-09-19 Personalized Speech Recognition for Children with Test-Time Adaptation Zhonghao Shi et.al. 2409.13095 null
2024-09-19 Enhancing Synthetic Training Data for Speech Commands: From ASR-Based Filtering to Domain Adaptation in SSL Latent Space Sebastião Quintas et.al. 2409.12745 null
2024-09-19 Hidden in Plain Sound: Environmental Backdoor Poisoning Attacks on Whisper, and Mitigations Jonatan Bartolini et.al. 2409.12553 null
2024-09-19 Disentangling Speakers in Multi-Talker Speech Recognition with Speaker-Aware CTC Jiawen Kang et.al. 2409.12388 null
2024-09-19 Channel-Aware Domain-Adaptive Generative Adversarial Network for Robust Speech Recognition Chien-Chun Wang et.al. 2409.12386 link
2024-09-19 Robust Audiovisual Speech Recognition Models with Mixture-of-Experts Yihan Wu et.al. 2409.12370 null
2024-09-18 META-CAT: Speaker-Informed Speech Embeddings via Meta Information Concatenation for Multi-talker ASR Jinhan Wang et.al. 2409.12352 null
2024-09-18 Large Language Models Are Strong Audio-Visual Speech Recognition Learners Umberto Cappellazzo et.al. 2409.12319 null
2024-09-19 WeHelp: A Shared Autonomy System for Wheelchair Users Abulikemu Abuduweili et.al. 2409.12159 link
2024-09-18 ASR Benchmarking: Need for a More Representative Conversational Dataset Gaurav Maheshwari et.al. 2409.12042 link
2024-09-18 M2R-Whisper: Multi-stage and Multi-scale Retrieval Augmentation for Enhancing Whisper Jiaming Zhou et.al. 2409.11889 null
2024-09-19 Simulating Native Speaker Shadowing for Nonnative Speech Assessment with Latent Speech Representations Haopeng Geng et.al. 2409.11742 null
2024-09-17 Chain-of-Thought Prompting for Speech Translation Ke Hu et.al. 2409.11538 null
2024-09-17 M-BEST-RQ: A Multi-Channel Speech Foundation Model for Smart Glasses Yufeng Yang et.al. 2409.11494 null
2024-09-17 Bio-Inspired Mamba: Temporal Locality and Bioplausible Learning in Selective State Space Models Jiahao Qin et.al. 2409.11263 null
2024-09-17 WER We Stand: Benchmarking Urdu ASR Models Samee Arif et.al. 2409.11252 null
2024-09-17 Ideal-LLM: Integrating Dual Encoders and Language-Adapted LLM for Multilingual Speech-to-Text Hongfei Xue et.al. 2409.11214 null
2024-09-17 Zero Shot Text to Speech Augmentation for Automatic Speech Recognition on Low-Resource Accented Speech Corpora Francesco Nespoli et.al. 2409.11107 null
2024-09-17 Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models Potsawee Manakul et.al. 2409.10999 null
2024-09-17 Speech Recognition for Analysis of Police Radio Communication Tejes Srivastava et.al. 2409.10858 null
2024-09-16 An Efficient Self-Learning Framework For Interactive Spoken Dialog Systems Hitesh Tulsiani et.al. 2409.10515 null
2024-09-16 Meta-Whisper: Speech-Based Meta-ICL for ASR on Low-Resource Languages Ming-Hao Hsu et.al. 2409.10429 null
2024-09-16 Voice control interface for surgical robot assistants Ana Davila et.al. 2409.10225 null
2024-09-17 Augmenting Automatic Speech Recognition Models with Disfluency Detection Robin Amann et.al. 2409.10177 null
2024-09-16 Optimizing Dysarthria Wake-Up Word Spotting: An End-to-End Approach for SLT 2024 LRDWWS Challenge Shuiyun Liu et.al. 2409.10076 null
2024-09-16 A Study on Zero-shot Non-intrusive Speech Assessment using Large Language Models Ryandhimas E. Zezario et.al. 2409.09914 null
2024-09-17 Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition Chao-Han Huck Yang et.al. 2409.09785 null
2024-09-14 ASR Error Correction using Large Language Models Rao Ma et.al. 2409.09554 null
2024-09-14 M$^{3}$V: A multi-modal multi-view approach for Device-Directed Speech Detection Anna Wang et.al. 2409.09284 null
2024-09-13 Multi-modal Speech Transformer Decoders: When Do Multiple Modalities Improve Accuracy? Yiwen Guan et.al. 2409.09221 null
2024-09-13 Learnings from curating a trustworthy, well-annotated, and useful dataset of disordered English speech Pan-Pan Jiang et.al. 2409.09190 null
2024-09-13 Clean Label Attacks against SLU Systems Henry Li Xinyuan et.al. 2409.08985 null
2024-09-13 Exploring the Impact of Data Quantity on ASR in Extremely Low-resource Languages Yao-Fei Cheng et.al. 2409.08872 null
2024-09-13 Exploring SSL Discrete Tokens for Multilingual ASR Mingyu Cui et.al. 2409.08805 null
2024-09-13 NEST-RQ: Next Token Prediction for Speech Self-Supervised Pre-Training Minglun Han et.al. 2409.08680 null
2024-09-13 LA-RAG: Enhancing LLM-based ASR Accuracy with Retrieval-Augmented Generation Shaojun Li et.al. 2409.08597 null
2024-09-13 Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions Lingwei Meng et.al. 2409.08596 null
2024-09-12 Faster Speech-LLaMA Inference with Multi-token Prediction Desh Raj et.al. 2409.08148 null
2024-09-12 WhisperNER: Unified Open Named Entity and Speech Recognition Gil Ayache et.al. 2409.08107 null
2024-10-06 The Faetar Benchmark: Speech Recognition in a Very Under-Resourced Language Michael Ong et.al. 2409.08103 null
2024-09-12 Auto-Landmark: Acoustic Landmark Dataset and Open-Source Toolkit for Landmark Extraction Xiangyu Zhang et.al. 2409.07969 null
2024-09-12 Detecting and Defending Against Adversarial Attacks on Automatic Speech Recognition via Diffusion Models Nikolai L. Kühne et.al. 2409.07936 link
2024-09-12 Full-text Error Correction for Chinese Speech Recognition with Large Language Model Zhiyuan Tang et.al. 2409.07790 null
2024-09-11 Rethinking Mamba in Speech Processing by Self-Supervised Models Xiangyu Zhang et.al. 2409.07273 null
2024-09-11 ManaTTS Persian: a recipe for creating TTS datasets for lower resource languages Mahta Fetrat Qharabagh et.al. 2409.07259 null
2024-09-11 Enhancing CTC-Based Visual Speech Recognition Hendrik Laux et.al. 2409.07210 null
2024-09-11 Linear Time Complexity Conformers with SummaryMixing for Streaming Speech Recognition Titouan Parcollet et.al. 2409.07165 link
2024-09-10 An Effective Context-Balanced Adaptation Approach for Long-Tailed Speech Recognition Yi-Cheng Wang et.al. 2409.06468 null
2024-09-10 Keyword-Aware ASR Error Augmentation for Robust Dialogue State Tracking Jihyun Lee et.al. 2409.06263 null
2024-09-10 Advancing Topic Segmentation of Broadcasted Speech with Multilingual Semantic Embeddings Sakshi Deo Shukla et.al. 2409.06222 link
2024-09-09 Retrieval Augmented Correction of Named Entity Speech Recognition Errors Ernest Pusateri et.al. 2409.06062 null
2024-09-09 Consensus-based Distributed Quantum Kernel Learning for Speech Recognition Kuan-Cheng Chen et.al. 2409.05770 null
2024-09-09 A Toolkit for Joint Speaker Diarization and Identification with Application to Speaker-Attributed ASR Giovanni Morrone et.al. 2409.05750 null
2024-09-11 Evaluation of real-time transcriptions using end-to-end ASR models Carlos Arriaga et.al. 2409.05674 null
2024-09-09 Longer is (Not Necessarily) Stronger: Punctuated Long-Sequence Training for Enhanced Speech Recognition and Translation Nithin Rao Koluguri et.al. 2409.05601 null
2024-09-09 An investigation of modularity for noise robustness in conformer-based ASR Louise Coppieters de Gibson et.al. 2409.05589 null
2025-08-27 Leveraging Content and Acoustic Representations for Speech Emotion Recognition Soumya Dutta et.al. 2409.05566 null
2024-09-09 NTT Multi-Speaker ASR System for the DASR Task of CHiME-8 Challenge Naoyuki Kamo et.al. 2409.05554 null
2024-09-09 Findings of the 2024 Mandarin Stuttering Event Detection and Automatic Speech Recognition Challenge Hongfei Xue et.al. 2409.05430 null
2024-09-08 Exploring WavLM Back-ends for Speech Spoofing and Deepfake Detection Theophile Stourbe et.al. 2409.05032 null
2024-09-04 Probing self-attention in self-supervised speech models for cross-linguistic differences Sai Gopinath et.al. 2409.03115 null
2024-09-04 Quantification of stylistic differences in human- and ASR-produced transcripts of African American English Annika Heuser et.al. 2409.03059 null
2024-09-04 Efficient Extraction of Noise-Robust Discrete Units from Self-Supervised Speech Models Jakob Poncelet et.al. 2409.02565 null
2024-09-04 Parameter estimation of hidden Markov models: comparison of EM and quasi-Newton methods with a new hybrid algorithm Sidonie Foulon et.al. 2409.02477 null
2024-09-04 What is lost in Normalization? Exploring Pitfalls in Multilingual ASR Model Evaluations Kavya Manohar et.al. 2409.02449 null
2024-09-05 Temporal Order Preserved Optimal Transport-based Cross-modal Knowledge Transfer Learning for ASR Xugang Lu et.al. 2409.02239 null
2024-08-19 Toward Large-scale Spiking Neural Networks: A Comprehensive Survey and Future Directions Yangfan Hu et.al. 2409.02111 null
2024-09-05 Enhancing Code-Switching Speech Recognition with LID-Based Collaborative Mixture of Experts Model Hukai Huang et.al. 2409.02050 null
2024-09-03 The USTC-NERCSLIP Systems for the CHiME-8 NOTSOFAR-1 Challenge Shutong Niu et.al. 2409.02041 null
2024-09-03 Reassessing Noise Augmentation Methods in the Context of Adversarial Speech Karla Pizzi et.al. 2409.01813 null
2024-09-24 VoxHakka: A Dialectally Diverse Multi-speaker Text-to-Speech System for Taiwanese Hakka Li-Wei Chen et.al. 2409.01548 null
2024-09-02 Resource-Efficient Adaptation of Speech Foundation Models for Multi-Speaker ASR Weiqing Wang et.al. 2409.01438 null
2024-09-23 Refined Statistical Bounds for Classification Error Mismatches with Constrained Bayes Error Zijian Yang et.al. 2409.01309 null
2024-09-02 A Framework for Synthetic Audio Conversations Generation using Large Language Models Kaung Myat Kyaw et.al. 2409.00946 null
2024-09-11 Serialized Speech Information Guidance with Overlapped Encoding Separation for Multi-Speaker Automatic Speech Recognition Hao Shi et.al. 2409.00815 null
2024-09-01 Comparing Discrete and Continuous Space LLMs for Speech Recognition Yaoxun Xu et.al. 2409.00800 null
2024-09-11 DCIM-AVSR: Efficient Audio-Visual Speech Recognition via Dual Conformer Interaction Module Xinyu Wang et.al. 2409.00481 null
2024-08-31 Progressive Residual Extraction based Pre-training for Speech Representation Learning Tianrui Wang et.al. 2409.00387 null
2024-09-08 ProGRes: Prompted Generative Rescoring on ASR n-Best Ada Defne Tur et.al. 2409.00217 link
2024-08-30 Developing an End-to-End Framework for Predicting the Social Communication Severity Scores of Children with Autism Spectrum Disorder Jihyun Mun et.al. 2409.00158 null
2024-08-30 Speaker Tagging Correction With Non-Autoregressive Language Models Grigor Kirakosyan et.al. 2409.00151 null
2024-08-30 Advancing Multi-talker ASR Performance with Large Language Models Mohan Shi et.al. 2408.17431 null
2024-08-30 Generative Modeling Perspective for Control and Reasoning in Robotics Takuma Yoneda et.al. 2408.17041 null
2024-08-29 CrisperWhisper: Accurate Timestamps on Verbatim Speech Transcriptions Laurin Wagner et.al. 2408.16589 link
2024-08-29 Human-Inspired Audio-Visual Speech Recognition: Spike Activity, Cueing Interaction and Causal Processing Qianhui Liu et.al. 2408.16564 null
2024-08-29 Measuring the Accuracy of Automatic Speech Recognition Solutions Korbinian Kuhn et.al. 2408.16287 link
2024-08-29 Revisit Micro-batch Clipping: Adaptive Data Pruning via Gradient Manipulation Lun Wang et.al. 2408.16204 null
2024-08-29 Benchmarking Japanese Speech Recognition on ASR-LLM Setups with Multi-Pass Augmented Generative Error Correction Yuka Ko et.al. 2408.16180 null
2024-08-28 Beyond Levenshtein: Leveraging Multiple Algorithms for Robust Word Error Rate Computations And Granular Error Classifications Korbinian Kuhn et.al. 2408.15616 link
2024-08-28 Whisper-PMFA: Partial Multi-Scale Feature Aggregation for Speaker Verification using Whisper Models Yiyang Zhao et.al. 2408.15585 null
2024-08-27 Speech Recognition Transformers: Topological-lingualism Perspective Shruti Singh et.al. 2408.14991 null
2024-08-27 Literary and Colloquial Dialect Identification for Tamil using Acoustic Features M. Nanmalar et.al. 2408.14887 null
2024-09-06 MEDSAGE: Enhancing Robustness of Medical Dialogue Summarization to ASR Errors with LLM-generated Synthetic Dialogues Kuluhan Binici et.al. 2408.14418 null
2024-08-26 Self-supervised Speech Representations Still Struggle with African American Vernacular English Kalvin Chang et.al. 2408.14262 link
2024-08-26 Automatic recognition and detection of aphasic natural speech Mara Barberis et.al. 2408.14082 null
2024-08-28 Research Advances and New Paradigms for Biology-inspired Spiking Neural Networks Tianyu Zheng et.al. 2408.13996 null
2024-08-25 Literary and Colloquial Tamil Dialect Identification M. Nanmalar et.al. 2408.13739 null
2024-08-24 Studying the Effect of Audio Filters in Pre-Trained Models for Environmental Sound Classification Aditya Dawn et.al. 2408.13644 null
2024-09-18 NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks He Huang et.al. 2408.13106 link
2024-08-23 Focused Discriminative Training For Streaming CTC-Trained Automatic Speech Recognition Models Adnan Haider et.al. 2408.13008 null
2024-08-22 Towards measuring fairness in speech recognition: Fair-Speech dataset Irina-Elena Veliche et.al. 2408.12734 null
2024-08-22 WhisperMask: A Noise Suppressive Mask-Type Microphone for Whisper Speech Hirotaka Hiraki et.al. 2408.12500 null
2024-08-22 Positional Description for Numerical Normalization Deepanshu Gupta et.al. 2408.12430 null
2024-08-22 Developing vocal system impaired patient-aimed voice quality assessment approach using ASR representation-included multiple features Shaoxiang Dang et.al. 2408.12279 null
2024-08-21 The State of Commercial Automatic French Legal Speech Recognition Systems and their Impact on Court Reporters et al Nicolad Garneau et.al. 2408.11940 null
2024-08-19 Parameter-Efficient Transfer Learning under Federated Learning for Automatic Speech Recognition Xuan Kan et.al. 2408.11873 null
2024-08-13 Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation Yinghao Aaron Li et.al. 2408.11849 null
2024-08-21 Approaching Deep Learning through the Spectral Dynamics of Weights David Yunis et.al. 2408.11804 link
2024-08-21 Improving Speech Recognition Error Prediction for Modern and Off-the-shelf Speech Recognizers Prashant Serai et.al. 2408.11258 null
2024-08-20 XCB: an effective contextual biasing approach to bias cross-lingual phrases in speech recognition Xucheng Wan et.al. 2408.10524 null
2024-08-19 Recording for Eyes, Not Echoing to Ears: Contextualized Spoken-to-Written Conversion of ASR Transcripts Jiaqing Liu et.al. 2408.09688 null
2024-08-18 A Transcription Prompt-based Efficient Audio Large Language Model for Robust Speech Recognition Yangze Li et.al. 2408.09491 null
2024-08-17 Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition Samuele Cornell et.al. 2408.09215 link
2024-08-15 Enhancing Large Language Model-based Speech Recognition by Contextualization for Rare and Ambiguous Words Kento Nozawa et.al. 2408.08027 null
2024-08-14 SER Evals: In-domain and Out-of-domain Benchmarking for Speech Emotion Recognition Mohamed Osman et.al. 2408.07851 link
2024-08-14 DPSNN: Spiking Neural Network for Low-Latency Streaming Speech Enhancement Tao Sun et.al. 2408.07388 null
2024-08-16 MathBridge: A Large Corpus Dataset for Translating Spoken Mathematical Expressions into $LaTeX$ Formulas for Improved Readability Kyudan Jung et.al. 2408.07081 null
2024-08-12 Cross-Lingual Conversational Speech Summarization with Large Language Models Max Nelson et.al. 2408.06484 null
2024-08-12 Audio Enhancement for Computer Audition -- An Iterative Training Paradigm Using Sample Importance Manuel Milling et.al. 2408.06264 null
2024-08-12 Enhancing Dialogue Speech Recognition with Robust Contextual Awareness via Noise Representation Learning Wonjun Lee et.al. 2408.06043 null
2024-08-11 LI-TTA: Language Informed Test-Time Adaptation for Automatic Speech Recognition Eunseop Yoon et.al. 2408.05769 null
2024-08-11 VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing Chunyu Qiang et.al. 2408.05758 null
2024-08-10 Improving Whisper's Recognition Performance for Under-Represented Language Kazakh Leveraging Unpaired Speech and Text Jinpeng Li et.al. 2408.05554 null
2024-08-09 MooER: LLM-based Speech Recognition and Translation Models from Moore Threads Junhao Xu et.al. 2408.05101 link
2024-08-08 HydraFormer: One Encoder For All Subsampling Rates Yaoxun Xu et.al. 2408.04325 link
2024-08-08 Preserving spoken content in voice anonymisation with character-level vocoder conditioning Michele Panariello et.al. 2408.04306 link
2024-08-08 wav2graph: A Framework for Supervised Learning Knowledge Graph from Speech Khai Le-Duc et.al. 2408.04174 link
2024-08-07 Speaker Adaptation for Quantised End-to-End ASR Models Qiuming Zhao et.al. 2408.03979 null
2024-08-06 ASR-enhanced Multimodal Representation Learning for Cross-Domain Product Retrieval Ruixiang Zhao et.al. 2408.02978 null
2024-08-06 Self-Supervised Learning for Multi-Channel Neural Transducer Atsushi Kojima et.al. 2408.02945 null
2024-08-05 Clustering and Mining Accented Speech for Inclusive and Fair Speech Recognition Jaeyoung Kim et.al. 2408.02582 null
2024-09-12 The NPU-ASLP System Description for Visual Speech Recognition in CNVSRC 2024 He Wang et.al. 2408.02369 link
2024-08-05 StreamVoice+: Evolving into End-to-end Streaming Zero-shot Voice Conversion Zhichao Wang et.al. 2408.02178 null
2024-08-03 ALIF: Low-Cost Adversarial Audio Attacks on Black-Box Speech Platforms using Linguistic Features Peng Cheng et.al. 2408.01808 link
2024-08-01 SynesLM: A Unified Approach for Audio-visual Speech Recognition and Translation via Language Model and Synthetic Data Yichen Lu et.al. 2408.00624 link
2024-08-01 Sentence-wise Speech Summarization: Task, Datasets, and End-to-End Modeling with LM Knowledge Distillation Kohei Matsuura et.al. 2408.00205 null
2024-07-18 Framework for Curating Speech Datasets and Evaluating ASR Systems: A Case Study for Polish Michał Junczyk et.al. 2408.00005 link
2024-07-18 Handling Numeric Expressions in Automatic Speech Recognition Christian Huber et.al. 2408.00004 null
2024-08-15 The Llama 3 Herd of Models Abhimanyu Dubey et.al. 2407.21783 null
2024-07-31 On the Problem of Text-To-Speech Model Selection for Synthetic Data Generation in Automatic Speech Recognition Nick Rossenbach et.al. 2407.21476 null
2024-07-31 Towards interfacing large language models with ASR systems using confidence measures and prompting Maryam Naderi et.al. 2407.21414 null
2024-07-30 Self-Supervised Models in Automatic Whispered Speech Recognition Aref Farhadipour et.al. 2407.21211 null
2024-07-28 ELP-Adapters: Parameter Efficient Adapter Tuning for Various Speech Processing Tasks Nakamasa Inoue et.al. 2407.21066 null
2024-07-26 Improving noisy student training for low-resource languages in End-to-End ASR using CycleGAN and inter-domain losses Chia-Yu Li et.al. 2407.21061 null
2024-07-10 Dynamic Encoder Size Based on Data-Driven Layer-wise Pruning for Speech Recognition Jingjing Xu et.al. 2407.18930 null
2024-08-07 Dynamic Language Group-Based MoE: Enhancing Code-Switching Speech Recognition with Hierarchical Routing Hukai Huang et.al. 2407.18581 link
2024-07-29 Speech Bandwidth Expansion Via High Fidelity Generative Adversarial Networks Mahmoud Salhab et.al. 2407.18571 null
2024-07-26 Enhancing Dysarthric Speech Recognition for Unseen Speakers via Prototype-Based Adaptation Shiyao Wang et.al. 2407.18461 link
2024-07-08 Analyzing Speech Unit Selection for Textless Speech-to-Speech Translation Jarod Duret et.al. 2407.18332 null
2024-07-25 On the Effect of Purely Synthetic Training Data for Different Automatic Speech Recognition Architectures Nick Rossenbach et.al. 2407.17997 null
2024-07-25 Improving Domain-Specific ASR with LLM-Generated Contextual Descriptions Jiwon Suh et.al. 2407.17874 null
2024-07-25 Scaling A Simple Approach to Zero-Shot Speech Recognition Jinming Zhao et.al. 2407.17852 link
2024-07-24 Coupling Speech Encoders with Downstream Text Models Ciprian Chelba et.al. 2407.17605 null
2024-07-30 Toward Automated Detection of Biased Social Signals from the Content of Clinical Conversations Feng Chen et.al. 2407.17477 null
2024-07-10 Explaining Spectrograms in Machine Learning: A Study on Neural Networks for Speech Classification Jesin James et.al. 2407.17416 null
2024-07-24 A Comparative Analysis of Bilingual and Trilingual Wav2Vec Models for Automatic Speech Recognition in Multilingual Oral History Archives Jan Lehečka et.al. 2407.17160 null
2024-07-23 Quantifying the Role of Textual Predictability in Automatic Speech Recognition Sean Robertson et.al. 2407.16537 null
2024-07-23 The CHiME-8 DASR Challenge for Generalizable and Array Agnostic Distant Automatic Speech Recognition and Diarization Samuele Cornell et.al. 2407.16447 null
2024-07-23 Evolutionary Prompt Design for LLM-Based Post-ASR Error Correction Rithik Sachdev et.al. 2407.16370 link
2024-07-22 dMel: Speech Tokenization made Simple He Bai et.al. 2407.15835 null
2024-07-22 Robustness of Speech Separation Models for Similar-pitch Speakers Bunlong Lay et.al. 2407.15749 null
2024-07-22 SELM: Enhancing Speech Emotion Recognition for Out-of-Domain Scenarios Hazim Bukhari et.al. 2407.15300 null
2024-08-24 Trading Devil Final: Backdoor attack via Stock market and Bayesian Optimization Orson Mengara et.al. 2407.14573 null
2024-07-07 Morse Code-Enabled Speech Recognition for Individuals with Visual and Hearing Impairments Ritabrata Roy Choudhury et.al. 2407.14525 null
2024-07-19 GE2E-AC: Generalized End-to-End Loss Training for Accent Classification Chihiro Watanabe et.al. 2407.14021 null
2024-07-19 Reexamining Racial Disparities in Automatic Speech Recognition Performance: The Role of Confounding by Provenance Changye Li et.al. 2407.13982 null
2024-07-22 Self-supervised ASR Models and Features For Dysarthric and Elderly Speech Recognition Shujie Hu et.al. 2407.13782 null
2024-07-18 Robust ASR Error Correction with Conservative Data Filtering Takuma Udagawa et.al. 2407.13300 null
2024-07-18 Low-Resourced Speech Recognition for Iu Mien Language via Weakly-Supervised Phoneme-based Multilingual Pre-training Lukuan Dong et.al. 2407.13292 null
2024-07-18 How Private is Low-Frequency Speech Audio in the Wild? An Analysis of Verbal Intelligibility by Humans and Machines Ailin Liu et.al. 2407.13266 null
2024-07-18 A light-weight and efficient punctuation and word casing prediction model for on-device streaming ASR Jian You et.al. 2407.13142 null
2024-06-29 Error Correction by Paying Attention to Both Acoustic and Confidence References for Automatic Speech Recognition Yuchun Shu et.al. 2407.12817 null
2024-07-17 Morphosyntactic Analysis for CHILDES Houjun Liu et.al. 2407.12389 null
2024-07-17 Adaptive Cascading Network for Continual Test-Time Adaptation Kien X. Nguyen et.al. 2407.12240 null
2024-07-16 Identifying Speakers in Dialogue Transcripts: A Text-based Approach Using Pretrained Language Models Minh Nguyen et.al. 2407.12094 link
2024-06-29 A Quality-Aware Voltage Overscaling Framework to Improve the Energy Efficiency and Lifetime of TPUs based on Statistical Error Modeling Alireza Senobari et.al. 2407.12029 null
2024-06-28 TreeSeg: Hierarchical Topic Segmentation of Large Transcripts Dimitrios C. Gklezakos et.al. 2407.12028 null
2024-05-31 Open the Data! Chuvash Datasets Nikolay Plotnikov et.al. 2407.11982 null
2024-07-17 Vibravox: A Dataset of French Speech Captured with Body-conduction Audio Sensors Julien Hauret et.al. 2407.11828 link
2024-07-16 Investigating the Effect of Label Topology and Training Criterion on ASR Performance and Alignment Quality Tina Raissi et.al. 2407.11641 null
2024-07-16 The VoicePrivacy 2022 Challenge: Progress and Perspectives in Voice Anonymisation Michele Panariello et.al. 2407.11516 null
2024-07-16 Beyond Binary: Multiclass Paraphasia Detection with Generative Pretrained Transformers and End-to-End Models Matthew Perez et.al. 2407.11345 null
2024-07-15 Leave No Knowledge Behind During Knowledge Distillation: Towards Practical and Effective Knowledge Distillation for Code-Switching ASR Using Realistic Data Liang-Hsuan Tseng et.al. 2407.10603 null
2024-07-14 Improving Neural Biasing for Contextual Speech Recognition by Early Context Injection and Text Perturbation Ruizhe Huang et.al. 2407.10303 null
2024-07-14 CUSIDE-T: Chunking, Simulating Future and Decoding for Transducer based Streaming ASR Wenbo Zhao et.al. 2407.10255 null
2024-07-14 Textless Dependency Parsing by Labeled Sequence Prediction Shunsuke Kando et.al. 2407.10118 link
2024-07-14 Whisper-SV: Adapting Whisper for Low-data-resource Speaker Verification Li Zhang et.al. 2407.10048 null
2024-07-13 Text-Based Detection of On-Hold Scripts in Contact Center Calls Dmitrii Galimzianov et.al. 2407.09849 link
2024-08-24 Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System Lingwei Meng et.al. 2407.09817 link
2024-07-13 A Streaming Multi-Channel End-to-End Speech Recognition System with Realistic Evaluations Xiangzhu Kong et.al. 2407.09807 link
2024-07-13 Speech Slytherin: Examining the Performance and Efficiency of Mamba for Speech Separation, Recognition, and Synthesis Xilin Jiang et.al. 2407.09732 link
2024-07-10 Evaluating Voice Command Pipelines for Drone Control: From STT and LLM to Direct Classification and Siamese Networks Lucca Emmanuel Pineli Simões et.al. 2407.08658 null
2024-08-12 Tamil Language Computing: the Present and the Future Kengatharaiyer Sarveswaran et.al. 2407.08618 null
2024-07-10 HebDB: a Weakly Supervised Dataset for Hebrew Speech Processing Arnon Turetzky et.al. 2407.07566 null
2024-07-09 Tailored Design of Audio-Visual Speech Recognition Models using Branchformers David Gimeno-Gómez et.al. 2407.06606 link
2024-07-08 Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation Mengzhe Geng et.al. 2407.06310 null
2024-07-09 CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens Zhihao Du et.al. 2407.05407 null
2024-07-10 Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition Ye Bai et.al. 2407.04675 null
2024-07-05 Multitaper mel-spectrograms for keyword spotting Douglas Baptista de Souza et.al. 2407.04662 null
2024-07-05 Pretraining End-to-End Keyword Search with Automatically Discovered Acoustic Units Bolaji Yusuf et.al. 2407.04652 link
2024-07-05 Speculative Speech Recognition by Audio-Prefixed Low-Rank Adaptation of Language Models Bolaji Yusuf et.al. 2407.04641 null
2024-07-05 Written Term Detection Improves Spoken Term Detection Bolaji Yusuf et.al. 2407.04601 link
2024-07-09 Performance Analysis of Speech Encoders for Low-Resource SLU and ASR in Tunisian Dialect Salima Mdhaffar et.al. 2407.04533 link
2024-07-05 Controlling Whisper: Universal Acoustic Adversarial Attacks to Control Speech Foundation Models Vyas Raina et.al. 2407.04482 null
2024-07-05 XLSR-Transducer: Streaming ASR for Self-Supervised Pretrained Models Shashi Kumar et.al. 2407.04439 null
2024-07-05 Romanization Encoding For Multilingual ASR Wen Ding et.al. 2407.04368 null
2024-07-05 LearnerVoice: A Dataset of Non-Native English Learners' Spontaneous Speech Haechan Kim et.al. 2407.04280 null
2024-07-05 Semi-supervised Learning for Code-Switching ASR with Large Language Model Filter Yu Xi et.al. 2407.04219 null
2024-07-11 FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs Keyu An et.al. 2407.04051 link
2024-07-04 Improving Accented Speech Recognition using Data Augmentation based on Unsupervised Text-to-Speech Synthesis Cong-Thanh Do et.al. 2407.04047 null
2024-07-04 Serialized Output Training by Learned Dominance Ying Shi et.al. 2407.03966 null
2024-07-04 Finetuning End-to-End Models for Estonian Conversational Spoken Language Translation Tiia Sildam et.al. 2407.03809 null
2024-07-04 Improving Self-supervised Pre-training using Accent-Specific Codebooks Darshan Prabhu et.al. 2407.03734 link
2024-07-24 Multi-Convformer: Extending Conformer with Multiple Convolution Kernels Darshan Prabhu et.al. 2407.03718 link
2024-07-04 Learning Video Temporal Dynamics with Cross-Modal Attention for Robust Audio-Visual Speech Recognition Sungnyun Kim et.al. 2407.03563 null
2024-07-03 Codec-ASR: Training Performant Automatic Speech Recognition Systems with Discrete Speech Representations Kunal Dhawan et.al. 2407.03495 null
2024-07-03 Advanced Framework for Animal Sound Classification With Features Optimization Qiang Yang et.al. 2407.03440 null
2024-07-03 Qifusion-Net: Layer-adapted Stream/Non-stream Model for End-to-End Multi-Accent Speech Recognition Jinming Chen et.al. 2407.03026 null
2024-07-02 Towards the Next Frontier in Speech Representation Learning Using Disentanglement Varun Krishna et.al. 2407.02543 null
2024-07-02 The USTC-NERCSLIP Systems for The ICMC-ASR Challenge Minghui Wu et.al. 2407.02052 null
2024-07-02 Pinyin Regularization in Error Correction for Chinese Speech Recognition with Large Language Models Zhiyuan Tang et.al. 2407.01909 link
2024-06-30 Less Forgetting for Better Generalization: Exploring Continual-learning Fine-tuning Methods for Speech Self-supervised Representations Salah Zaiem et.al. 2407.00756 null
2024-06-29 When Robots Get Chatty: Grounding Multimodal Human-Robot Conversation and Collaboration Philipp Allgeuer et.al. 2407.00518 null
2024-07-18 Open-Source Conversational AI with SpeechBrain 1.0 Mirco Ravanelli et.al. 2407.00463 null
2024-06-28 SAML: Speaker Adaptive Mixture of LoRA Experts for End-to-End ASR Qiuming Zhao et.al. 2406.19706 null
2024-06-28 Less is More: Accurate Speech Recognition & Translation without Web-Scale Data Krishna C. Puvvada et.al. 2406.19674 null
2024-06-27 Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects Orevaoghene Ahia et.al. 2406.19564 link
2024-06-27 Tradition or Innovation: A Comparison of Modern ASR Methods for Forced Alignment Rotem Rousso et.al. 2406.19363 null
2024-06-27 Zero-Query Adversarial Attack on Black-box Automatic Speech Recognition Systems Zheng Fang et.al. 2406.19311 null
2024-06-27 Applying LLMs for Rescoring N-best ASR Hypotheses of Casual Conversations: Effects of Domain Adaptation and Context Carry-over Atsunori Ogawa et.al. 2406.18972 null
2024-06-27 Enhanced ASR Robustness to Packet Loss with a Front-End Adaptation Network Yehoshua Dissen et.al. 2406.18928 null
2024-06-27 Streaming Decoder-Only Automatic Speech Recognition with Discrete Speech Units: A Pilot Study Peikun Chen et.al. 2406.18862 link
2024-06-26 Dynamic Data Pruning for Automatic Speech Recognition Qiao Xiao et.al. 2406.18373 null
2024-06-26 MSR-86K: An Evolving, Multilingual Corpus with 86,300 Hours of Transcribed Audio for Speech Recognition Research Song Li et.al. 2406.18301 null
2024-06-26 Automatic Speech Recognition for Hindi Anish Saha et.al. 2406.18135 null
2024-07-12 ArzEn-LLM: Code-Switched Egyptian Arabic-English Translation and Speech Recognition Using LLMs Ahmed Heakl et.al. 2406.18120 link
2024-06-26 SC-MoE: Switch Conformer Mixture of Experts for Unified Streaming and Non-streaming Code-Switching ASR Shuaishuai Ye et.al. 2406.18021 null
2024-06-25 Sequential Editing for Lifelong Training of Speech Recognition Models Devang Kulshreshtha et.al. 2406.17935 null
2024-06-25 FASA: a Flexible and Automatic Speech Aligner for Extracting High-quality Aligned Children Speech Data Dancheng Liu et.al. 2406.17926 link
2024-06-25 Automatic speech recognition for the Nepali language using CNN, bidirectional LSTM and ResNet Manish Dhakal et.al. 2406.17825 link
2024-06-25 Towards Building an End-to-End Multilingual Automatic Lyrics Transcription Model Jiawen Huang et.al. 2406.17618 link
2024-06-25 MSRS: Training Multimodal Speech Recognition Models from Scratch with Sparse Mask Optimization Adriana Fernandez-Lopez et.al. 2406.17614 null
2024-06-25 A Comprehensive Solution to Connect Speech Encoder and Large Language Model for ASR Van Tung Pham et.al. 2406.17272 null
2024-06-24 Investigating Confidence Estimation Measures for Speaker Diarization Anurag Chowdhury et.al. 2406.17124 null
2024-06-24 Exploring the Capability of Mamba in Speech Applications Koichi Miyazaki et.al. 2406.16808 null
2024-06-24 Blending LLMs into Cascaded Speech Translation: KIT's Offline Speech Translation System for IWSLT 2024 Sai Koneru et.al. 2406.16777 null
2024-06-23 Contextualized End-to-end Automatic Speech Recognition with Intermediate Biasing Loss Muhammad Shakeel et.al. 2406.16120 null
2024-08-01 Decoder-only Architecture for Streaming End-to-end Speech Recognition Emiru Tsunoo et.al. 2406.16107 null
2024-06-22 Acoustic Feature Mixup for Balanced Multi-aspect Pronunciation Assessment Heejin Do et.al. 2406.15723 null
2024-06-21 PI-Whisper: An Adaptive and Incremental ASR Framework for Diverse and Evolving Speaker Characteristics Amir Nassereldine et.al. 2406.15668 null
2024-06-21 Perception of Phonological Assimilation by Neural Speech Recognition Models Charlotte Pouw et.al. 2406.15265 null
2024-06-21 InterBiasing: Boost Unseen Word Recognition through Biasing Intermediate Predictions Yu Nakagome et.al. 2406.14890 null
2024-06-20 An Adapter-Based Unified Model for Multiple Spoken Language Processing Tasks Varsha Suresh et.al. 2406.14747 null
2024-06-21 DASB - Discrete Audio and Speech Benchmark Pooneh Mousavi et.al. 2406.14294 null
2024-06-20 Intelligent Interface: Enhancing Lecture Engagement with Didactic Activity Summaries Anna Wróblewska et.al. 2406.14266 null
2024-06-19 Joint vs Sequential Speaker-Role Detection and Automatic Speech Recognition for Air-traffic Control Alexander Blatt et.al. 2406.13842 null
2024-06-19 ManWav: The First Manchu ASR Model Jean Seo et.al. 2406.13502 null
2024-06-24 Children's Speech Recognition through Discrete Token Enhancement Vrunda N. Sukhadia et.al. 2406.13431 null
2024-06-17 Self-Train Before You Transcribe Robert Flynn et.al. 2406.12937 link
2024-06-16 Automatic Speech Recognition for Biomedical Data in Bengali Language Shariar Kabir et.al. 2406.12931 null
2024-06-18 Bridging the Gap: Integrating Pre-trained Speech Enhancement and Recognition Models for Robust Speech Recognition Kuan-Chen Wang et.al. 2406.12699 null
2024-06-18 Transcribe, Align and Segment: Creating speech datasets for low-resource languages Taras Sereda et.al. 2406.12674 null
2024-06-18 Growing Trees on Sounds: Assessing Strategies for End-to-End Dependency Parsing of Speech Adrien Pupier et.al. 2406.12621 link
2024-06-18 Rapid Language Adaptation for Multilingual E2E Speech Recognition Using Encoder Prompting Yosuke Kashiwagi et.al. 2406.12611 null
2024-06-18 Unsupervised Online Continual Learning for Automatic Speech Recognition Steven Vander Eeckt et.al. 2406.12503 link
2024-06-18 Performant ASR Models for Medical Entities in Accented Speech Tejumade Afonja et.al. 2406.12387 null
2024-06-18 Finding Task-specific Subnetworks in Multi-task Spoken Language Understanding Model Hayato Futami et.al. 2406.12317 null
2024-06-18 SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization Young Jin Ahn et.al. 2406.12233 link
2024-06-17 GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement Yifan Yang et.al. 2406.11546 link
2024-06-16 Continual Test-time Adaptation for End-to-end Speech Recognition on Noisy Speech Guan-Ting Lin et.al. 2406.11064 null
2024-06-16 NAST: Noise Aware Speech Tokenization for Speech Language Models Shoval Messica et.al. 2406.11037 link
2024-06-16 Large Language Models for Dysfluency Detection in Stuttered Speech Dominik Wagner et.al. 2406.11025 null
2024-06-16 Outlier Reduction with Gated Attention for Improved Post-training Quantization in Large Sequence-to-sequence Speech Foundation Models Dominik Wagner et.al. 2406.11022 null
2024-06-16 Optimized Speculative Sampling for GPU Hardware Accelerators Dominik Wagner et.al. 2406.11016 null
2024-06-16 CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving Bhavani Shankar et.al. 2406.10993 null
2024-06-16 Imperceptible Rhythm Backdoor Attacks: Exploring Rhythm Transformation for Embedding Undetectable Vulnerabilities on Speech Recognition Wenhan Yao et.al. 2406.10932 null
2024-06-15 Speech Emotion Recognition Using CNN and Its Use Case in Digital Healthcare Nishargo Nigar et.al. 2406.10741 null
2024-06-21 Trading Devil: Robust backdoor attack via Stochastic investment models and Bayesian approach Orson Mengara et.al. 2406.10719 null
2024-08-06 Double Multi-Head Attention Multimodal System for Odyssey 2024 Speech Emotion Recognition Challenge Federico Costa et.al. 2406.10598 null
2024-06-14 CNVSRC 2023: The First Chinese Continuous Visual Speech Recognition Challenge Chen Chen et.al. 2406.10313 null
2024-06-12 Improving child speech recognition with augmented child-like speech Yuanyuan Zhang et.al. 2406.10284 null
2024-06-14 Inclusive ASR for Disfluent Speech: Cascaded Large-Scale Self-Supervised Learning with Targeted Fine-Tuning and Data Augmentation Dena Mujtaba et.al. 2406.10177 null
2024-06-14 On the Evaluation of Speech Foundation Models for Spoken Language Understanding Siddhant Arora et.al. 2406.10083 null
2024-06-14 Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation Andrew Rouditchenko et.al. 2406.10082 link
2024-06-14 Simul-Whisper: Attention-Guided Streaming Whisper with Truncation Detection Haoyu Wang et.al. 2406.10052 link
2024-06-14 ROAR: Reinforcing Original to Augmented Data Ratio Dynamics for Wav2Vec2.0 Based ASR Vishwanath Pratap Singh et.al. 2406.09999 null
2024-06-14 An efficient text augmentation approach for contextualized Mandarin speech recognition Naijun Zheng et.al. 2406.09950 null
2024-06-14 Perceiver-Prompt: Flexible Speaker Adaptation in Whisper for Chinese Disordered Speech Recognition Yicong Jiang et.al. 2406.09873 null
2024-06-14 MMM: Multi-Layer Multi-Residual Multi-Stream Discrete Speech Representation from Self-supervised Learning Model Jiatong Shi et.al. 2406.09869 null
2024-06-14 Optimizing Byte-level Representation for End-to-end ASR Roger Hsiao et.al. 2406.09676 null
2024-06-14 Learning Language Structures through Grounding Freda Shi et.al. 2406.09662 null
2024-06-13 Multi-Modal Retrieval For Large Language Model Based Speech Recognition Jari Kolehmainen et.al. 2406.09618 null
2024-06-13 Speech ReaLLM -- Real-time Streaming Speech Recognition with Multimodal LLMs by Teaching the Flow of Time Frank Seide et.al. 2406.09569 null
2024-06-13 The Second DISPLACE Challenge: DIarization of SPeaker and LAnguage in Conversational Environments Shareef Babu Kalluri et.al. 2406.09494 null
2024-06-12 Comparative Analysis of Personalized Voice Activity Detection Systems: Assessing Real-World Effectiveness Satyam Kumar et.al. 2406.09443 null
2024-04-13 SGPRS: Seamless GPU Partitioning Real-Time Scheduler for Periodic Deep Learning Workloads Amir Fakhim Babaei et.al. 2406.09425 null
2024-06-13 Language Complexity and Speech Recognition Accuracy: Orthographic Complexity Hurts, Phonological Complexity Doesn't Chihiro Taguchi et.al. 2406.09202 link
2024-06-13 LASER: Learning by Aligning Self-supervised Representations of Speech for Improving Content-related Tasks Amit Meghanani et.al. 2406.09153 link
2024-06-13 Transcription-Free Fine-Tuning of Speech Separation Models for Noisy and Reverberant Multi-Speaker Automatic Speech Recognition William Ravenscroft et.al. 2406.08914 null
2024-06-13 AdaPTwin: Low-Cost Adaptive Compression of Product Twins in Transformers Emil Biju et.al. 2406.08904 null
2024-06-12 ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets Jiatong Shi et.al. 2406.08641 null
2024-06-12 Neural Blind Source Separation and Diarization for Distant Speech Recognition Yoshiaki Bando et.al. 2406.08396 null
2025-01-10 Towards Unsupervised Speech Recognition Without Pronunciation Models Junrui Ni et.al. 2406.08380 null
2024-06-12 Speech Emotion Recognition with ASR Transcripts: A Comprehensive Study on Word Error Rate and Fusion Techniques Yuanchao Li et.al. 2406.08353 link
2024-06-13 Refining Self-Supervised Learnt Speech Representation using Brain Activations Hengyu Li et.al. 2406.08266 null
2024-06-12 Transformer-based Model for ASR N-Best Rescoring and Rewriting Iwen E. Kang et.al. 2406.08207 null
2024-06-12 Audio-conditioned phonemic and prosodic annotation for building text-to-speech models from unlabeled speech data Yuma Shirahata et.al. 2406.08111 null
2024-06-14 Can Large Language Models Understand Spatial Audio? Changli Tang et.al. 2406.07914 null
2024-06-12 Guiding Frame-Level CTC Alignments Using Self-knowledge Distillation Eungbeom Kim et.al. 2406.07909 null
2024-06-12 DualVC 3: Leveraging Language Model Generated Pseudo Context for End-to-end Low Latency Streaming Voice Conversion Ziqian Ning et.al. 2406.07846 null
2024-06-12 Dual-Pipeline with Low-Rank Adaptation for New Language Integration in Multilingual ASR Yerbolat Khassanov et.al. 2406.07842 null
2024-06-12 PRoDeliberation: Parallel Robust Deliberation for End-to-End Spoken Language Understanding Trang Le et.al. 2406.07823 null
2024-06-12 PolySpeech: Exploring Unified Multitask Speech Models for Competitiveness with Single-task Models Runyan Yang et.al. 2406.07801 null
2024-06-11 The Interspeech 2024 Challenge on Speech Processing Using Discrete Units Xuankai Chang et.al. 2406.07725 null
2024-06-11 Tag and correct: high precision post-editing approach to correction of speech recognition errors Tomasz Ziętkiewicz et.al. 2406.07589 null
2024-06-11 AS-70: A Mandarin stuttered speech dataset for automatic speech recognition and stuttering event detection Rong Gong et.al. 2406.07256 null
2024-06-11 Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter Andrei Andrusenko et.al. 2406.07096 null
2024-07-29 Spoken Language Corpora Augmentation with Domain-Specific Voice-Cloned Speech Mateusz Czyżnikiewicz et.al. 2406.07090 null
2024-06-11 Reading Miscue Detection in Primary School through Automatic Speech Recognition Lingyun Gao et.al. 2406.07060 null
2024-06-10 Synthetic Query Generation using Large Language Models for Virtual Assistants Sonal Sannigrahi et.al. 2406.06729 null
2024-06-13 ASTRA: Aligning Speech and Text Representations for Asr without Sampling Neeraj Gaur et.al. 2406.06664 null
2024-06-07 LoRA-Whisper: Parameter-Efficient and Extensible Multilingual ASR Zheshu Song et.al. 2406.06619 null
2024-06-25 Discrete Multimodal Transformers with a Pretrained Large Language Model for Mixed-Supervision Speech Processing Viet Anh Trinh et.al. 2406.06582 null
2024-06-10 A Parameter-efficient Language Extension Framework for Multilingual ASR Wei Liu et.al. 2406.06329 null
2024-06-10 Prompting Large Language Models with Audio for General-Purpose Speech Summarization Wonjune Kang et.al. 2406.05968 link
2024-07-18 Do Prompts Really Prompt? Exploring the Prompt Understanding Capability of Whisper Chih-Kai Yang et.al. 2406.05806 null
2024-07-20 Optimizing Multi-Stuttered Speech Classification: Leveraging Whisper's Encoder for Efficient Parameter Reduction in Automated Assessment Huma Ameer et.al. 2406.05784 null
2024-06-09 MS-HuBERT: Mitigating Pre-training and Inference Mismatch in Masked Language Modelling methods for learning Speech Representations Hemant Yadav et.al. 2406.05661 null
2024-06-07 LLM-based speaker diarization correction: A generalizable approach Georgios Efstathiadis et.al. 2406.04927 link
2024-07-02 Speaker-Smoothed kNN Speaker Adaptation for End-to-End ASR Shaojun Li et.al. 2406.04791 null
2024-06-07 Pitch-Aware RNN-T for Mandarin Chinese Mispronunciation Detection and Diagnosis Xintong Wang et.al. 2406.04595 null
2024-06-06 Flexible Multichannel Speech Enhancement for Noise-Robust Frontend Ante Jukić et.al. 2406.04552 null
2024-06-06 Label-Synchronous Neural Transducer for E2E Simultaneous Speech Translation Keqi Deng et.al. 2406.04541 link
2024-06-06 To Distill or Not to Distill? On the Robustness of Robust Knowledge Distillation Abdul Waheed et.al. 2406.04512 null
2024-06-06 LipGER: Visually-Conditioned Generative Error Correction for Robust Automatic Speech Recognition Sreyan Ghosh et.al. 2406.04432 link
2024-06-06 Beyond Performance Plateaus: A Comprehensive Study on Scalability in Speech Enhancement Wangyou Zhang et.al. 2406.04269 link
2024-07-02 Hypernetworks for Personalizing ASR to Atypical Speech Max Müller-Eberstein et.al. 2406.04240 null
2024-06-06 Helsinki Speech Challenge 2024 Martin Ludvigsen et.al. 2406.04123 null
2024-06-06 BLSP-Emo: Towards Empathetic Large Speech-Language Models Chen Wang et.al. 2406.03872 link
2024-06-14 Improving Zero-Shot Chinese-English Code-Switching ASR with kNN-CTC and Gated Monolingual Datastores Jiaming Zhou et.al. 2406.03814 null
2024-06-06 Speed of Light Exact Greedy Decoding for RNN-T Speech Recognition Models on GPU Daniel Galvez et.al. 2406.03791 null
2024-06-11 Enhancing CTC-based speech recognition with diverse modeling units Shiyi Han et.al. 2406.03274 null
2024-06-05 Error-preserving Automatic Speech Recognition of Young English Learners' Language Janick Michot et.al. 2406.03235 link
2024-06-05 StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning Shaolei Zhang et.al. 2406.03049 link
2024-06-05 4D ASR: Joint Beam Search Integrating CTC, Attention, Transducer, and Mask Predict Decoders Yui Sudo et.al. 2406.02950 null
2024-06-15 Task Arithmetic can Mitigate Synthetic-to-Real Gap in Automatic Speech Recognition Hsuan Su et.al. 2406.02925 null
2024-06-11 Text Injection for Neural Contextual Biasing Zhong Meng et.al. 2406.02921 null
2024-06-04 Keyword-Guided Adaptation of Automatic Speech Recognition Aviv Shamsian et.al. 2406.02649 null
2024-05-03 Combining X-Vectors and Bayesian Batch Active Learning: Two-Stage Active Learning Pipeline for Speech Recognition Ognjen Kundacina et.al. 2406.02566 null
2024-05-02 Sequence-to-sequence models in peer-to-peer learning: A practical application Robert Šajina et.al. 2406.02565 null
2024-04-29 A cost minimization approach to fix the vocabulary size in a tokenizer for an End-to-End ASR system Sunil Kumar Kopparapu et.al. 2406.02563 null
2024-04-24 Gated Low-rank Adaptation for personalized Code-Switching Automatic Speech Recognition on the low-spec devices Gwantae Kim et.al. 2406.02562 null
2024-04-23 Breaking Walls: Pioneering Automatic Speech Recognition for Central Kurdish: End-to-End Transformer Paradigm Abdulhady Abas Abdullah et.al. 2406.02561 null
2024-07-18 Less Peaky and More Accurate CTC Forced Alignment by Label Priors Ruizhe Huang et.al. 2406.02560 link
2024-03-27 PhoWhisper: Automatic Speech Recognition for Vietnamese Thanh-Thien Le et.al. 2406.02555 link
2024-06-04 Whistle: Data-Efficient Multilingual and Crosslingual Speech Recognition via Weakly Phonetic Supervision Saierdaer Yusuyin et.al. 2406.02166 link
2024-06-05 Efficiently Train ASR Models that Memorize Less and Perform Better with Per-core Clipping Lun Wang et.al. 2406.02004 null
2024-06-03 Enabling ASR for Low-Resource Languages: A Comprehensive Dataset Creation Approach Ara Yeroyan et.al. 2406.01446 null
2024-06-03 Compute-Efficient Medical Image Classification with Softmax-Free Transformers and Sequence Normalization Firas Khader et.al. 2406.01314 null
2024-06-02 YODAS: Youtube-Oriented Dataset for Audio and Speech Xinjian Li et.al. 2406.00899 null
2024-06-01 Wav2Prompt: End-to-End Speech Prompt Generation and Tuning For LLM in Zero and Few-shot Learning Keqi Deng et.al. 2406.00522 null
2024-05-27 ViSpeR: Multilingual Audio-Visual Speech Recognition Sanath Narayan et.al. 2406.00038 null
2024-05-14 Sonos Voice Control Bias Assessment Dataset: A Methodology for Demographic Bias Assessment in Voice Assistants Chloé Sekkat et.al. 2405.19342 null
2024-05-31 Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities Vicky Zayats et.al. 2405.18669 null
2024-05-28 Augmented Conversation with Embedded Speech-Driven On-the-Fly Referencing in AR Shivesh Jadon et.al. 2405.18537 null
2024-05-28 Intelligent Clinical Documentation: Harnessing Generative AI for Patient-Centric Clinical Note Generation Anjanava Biswas et.al. 2405.18346 null
2024-05-28 NUTS, NARS, and Speech D. van der Sluis et.al. 2405.17874 null
2024-05-28 TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation Chenyang Le et.al. 2405.17809 null
2024-05-27 Federating Dynamic Models using Early-Exit Architectures for Automatic Speech Recognition on Heterogeneous Clients Mohamed Nabih Ali et.al. 2405.17376 null
2024-05-27 "Pass the butter": A study on desktop-classic multitasking robotic arm based on advanced YOLOv7 and BERT Haohua Que et.al. 2405.17250 null
2024-05-27 A Variance-Preserving Interpolation Approach for Diffusion Models with Applications to Single Channel Speech Enhancement and Recognition Zilu Guo et.al. 2405.16952 link
2024-05-24 Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition Zijin Gu et.al. 2405.15216 null
2024-05-23 Contrastive and Consistency Learning for Neural Noisy-Channel Model in Spoken Language Understanding Suyoung Kim et.al. 2405.15097 link
2024-06-02 Let's Fuse Step by Step: A Generative Fusion Decoding Algorithm with LLMs for Multi-modal Text Recognition Chan-Jan Hsu et.al. 2405.14259 link
2024-05-23 Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models Yuchen Hu et.al. 2405.14161 link
2024-05-23 A Survey on Vision-Language-Action Models for Embodied AI Yueen Ma et.al. 2405.14093 null
2024-05-22 ST-Gait++: Leveraging spatio-temporal convolutions for gait-based emotion recognition on videos Maria Luísa Lima et.al. 2405.13903 null
2024-09-12 Joint Optimization of Streaming and Non-Streaming Automatic Speech Recognition with Multi-Decoder and Knowledge Distillation Muhammad Shakeel et.al. 2405.13514 null
2024-05-22 A Near-Real-Time Processing Ego Speech Filtering Pipeline Designed for Speech Interruption During Human-Robot Interaction Yue Li et.al. 2405.13477 null
2024-05-22 You don't understand me!: Comparing ASR results for L1 and L2 speakers of Swedish Ronald Cumbal et.al. 2405.13379 null
2024-05-22 Contextualized Automatic Speech Recognition with Dynamic Vocabulary Yui Sudo et.al. 2405.13344 null
2024-05-28 FairLENS: Assessing Fairness in Law Enforcement Speech Recognition Yicheng Wang et.al. 2405.13166 null
2024-05-21 Non-autoregressive real-time Accent Conversion model with voice cloning Vladimir Nechaev et.al. 2405.13162 null
2024-05-15 Continued Pretraining for Domain Adaptation of Wav2vec2.0 in Automatic Speech Recognition for Elementary Math Classroom Settings Ahmed Adel Attia et.al. 2405.13018 null
2024-05-12 Large Language Models for Education: A Survey Hanyi Xu et.al. 2405.13001 null
2024-03-14 Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer Maxime Burchi et.al. 2405.12983 null
2024-05-21 Could a Computer Architect Understand our Brain? Valentin Puente-Varona et.al. 2405.12815 null
2024-07-01 Mamba in Speech: Towards an Alternative to Self-Attention Xiangyu Zhang et.al. 2405.12609 null
2024-05-20 Continuous Sign Language Recognition with Adapted Conformer via Unsupervised Pretraining Neena Aloysius et.al. 2405.12018 null
2024-05-21 Acoustic modeling for Overlapping Speech Recognition: JHU Chime-5 Challenge System Vimal Manohar et.al. 2405.11078 null
2024-05-16 Listen Again and Choose the Right Answer: A New Paradigm for Automatic Speech Recognition with Large Language Models Yuchen Hu et.al. 2405.10025 null
2024-05-15 No More Mumbles: Enhancing Robot Intelligibility through Speech Adaptation Qiaoqiao Ren et.al. 2405.09708 link
2024-05-15 Towards Evaluating the Robustness of Automatic Speech Recognition Systems via Audio Style Transfer Weifei Jin et.al. 2405.09470 null
2024-05-14 Investigating the 'Autoencoder Behavior' in Speech Self-Supervised Models: a focus on HuBERT's Pretraining Valentin Vielzeuf et.al. 2405.08402 null
2024-05-31 SpeechVerse: A Large-scale Generalizable Audio Language Model Nilaksh Das et.al. 2405.08295 null
2024-06-07 Rene: A Pre-trained Multi-modal Architecture for Auscultation of Respiratory Diseases Pengfei Zhang et.al. 2405.07442 link
2024-05-12 SoccerNet-Echoes: A Soccer Game Audio Commentary Dataset Sushant Gautam et.al. 2405.07354 link
2024-07-22 DP-DyLoRA: Fine-Tuning Transformer-Based Models On-Device under Differentially Private Federated Learning using Dynamic Low-Rank Adaptation Jie Xu et.al. 2405.06368 null
2024-05-10 Lost in Transcription: Identifying and Quantifying the Accuracy Biases of Automatic Speech Recognition Systems Against Disfluent Speech Dena Mujtaba et.al. 2405.06150 null
2024-07-17 Muting Whisper: A Universal Acoustic Adversarial Attack on Speech Foundation Models Vyas Raina et.al. 2405.06134 link
2024-05-09 The RoyalFlush Automatic Speech Diarization and Recognition System for In-Car Multi-Channel Automatic Speech Recognition Challenge Jingguang Tian et.al. 2405.05498 null
2024-05-07 Open Implementation and Study of BEST-RQ for Speech Processing Ryan Whetten et.al. 2405.04296 link
2024-05-06 Whispy: Adapting STT Whisper Models to Real-Time Environments Antonio Bevilacqua et.al. 2405.03484 null
2024-05-06 MMGER: Multi-modal and Multi-granularity Generative Error Correction with LLM for Joint Accent and Speech Recognition Bingshen Mu et.al. 2405.03152 null
2024-05-11 Analysis about Theoretical Foundations for Method to Enhancing ASR Performance using OCR Word Frequency Differences Kyudan Jung et.al. 2405.02995 null
2024-05-04 Mixat: A Data Set of Bilingual Emirati-English Speech Maryam Al Ali et.al. 2405.02578 link
2024-05-06 Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets Xuelong Geng et.al. 2405.02132 null
2024-05-01 Efficient Sample-Specific Encoder Perturbations Yassir Fathullah et.al. 2405.01601 null
2024-05-02 Low-resource speech recognition and dialect identification of Irish in a multi-task framework Liam Lonergan et.al. 2405.01293 null
2024-05-02 Improving Membership Inference in ASR Model Auditing with Perturbed Loss Features Francisco Teixeira et.al. 2405.01207 null
2024-05-02 Deep Learning Models in Speech Recognition: Measuring GPU Energy Consumption, Impact of Noise and Model Quantization for Edge Deployment Aditya Chakravarty et.al. 2405.01004 link
2024-05-02 Efficient Compression of Multitask Multilingual Speech Models Thomas Palmeira Ferraz et.al. 2405.00966 null
2024-05-01 Active Learning with Task Adaptation Pre-training for Speech Emotion Recognition Dongyuan Li et.al. 2405.00307 null
2024-07-24 Confides: A Visual Analytics Solution for Automated Speech Recognition Analysis and Exploration Sunwoo Ha et.al. 2405.00223 null
2024-05-09 Does Whisper understand Swiss German? An automatic, qualitative, and human evaluation Eyal Liron Dolev et.al. 2404.19310 null
2024-04-30 EfficientASR: Speech Recognition Network Compression via Attention Redundancy and Chunk-Level FFN Optimization Jianzong Wang et.al. 2404.19214 null
2024-04-29 Towards Dog Bark Decoding: Leveraging Human Speech Processing for Automated Bark Classification Artem Abzaliev et.al. 2404.18739 null
2024-04-26 Child Speech Recognition in Human-Robot Interaction: Problem Solved? Ruben Janssens et.al. 2404.17394 null
2024-04-26 Automatic Speech Recognition System-Independent Word Error Rate Estimation Chanho Park et.al. 2404.16743 null
2024-04-26 Developing Acoustic Models for Automatic Speech Recognition in Swedish Giampiero Salvi et.al. 2404.16547 null
2024-04-25 U2++ MoE: Scaling 4.7x parameters with minimal impact on RTF Xingchen Song et.al. 2404.16407 null
2024-04-24 Mamba-360: Survey of State Space Models as Transformer Alternative for Long Sequence Modelling: Methods, Applications, and Challenges Badri Narayana Patro et.al. 2404.16112 link
2024-04-23 Killkan: The Automatic Speech Recognition Dataset for Kichwa with Morphosyntactic Information Chihiro Taguchi et.al. 2404.15501 link
2024-04-18 Artificial Neural Networks to Recognize Speakers Division from Continuous Bengali Speech Hasmot Ali et.al. 2404.15168 null
2024-04-23 Rethinking Processing Distortions: Disentangling the Impact of Speech Enhancement Errors on Speech Recognition Performance Tsubasa Ochiai et.al. 2404.14860 null
2024-04-22 Assessment of Sign Language-Based versus Touch-Based Input for Deaf Users Interacting with Intelligent Personal Assistants Nina Tran et.al. 2404.14605 null
2024-04-22 Exploring neural oscillations during speech perception via surrogate gradient spiking neural networks Alexandre Bittar et.al. 2404.14024 null
2024-04-20 Semantically Corrected Amharic Automatic Speech Recognition Samuael Adnew et.al. 2404.13362 link
2024-04-19 Learn2Talk: 3D Talking Face Learns from 2D Talking Face Yixiang Zhuang et.al. 2404.12888 null
2024-04-19 Efficient infusion of self-supervised representations in Automatic Speech Recognition Darshan Prabhu et.al. 2404.12628 null
2024-04-16 Teaching a Multilingual Large Language Model to Understand Multilingual Speech via Multi-Instructional Training Pavel Denisov et.al. 2404.10922 link
2024-04-16 Anatomy of Industrial Scale Multilingual ASR Francis McCann Ramirez et.al. 2404.09841 null
2024-04-15 Resilience of Large Language Models for Noisy Instructions Bin Wang et.al. 2404.09754 null
2024-04-12 Comparing Apples to Oranges: LLM-powered Multimodal Intention Prediction in an Object Categorization Task Hassan Ali et.al. 2404.08424 null
2024-07-26 Automatic Speech Recognition Advancements for Indigenous Languages of the Americas Monica Romero et.al. 2404.08368 null
2024-04-10 An inclusive review on deep learning techniques and their scope in handwriting recognition Sukhdeep Singh et.al. 2404.08011 null
2024-04-12 An Effective Automated Speaking Assessment Approach to Mitigating Data Scarcity and Imbalanced Distribution Tien-Hong Lo et.al. 2404.07575 null
2024-04-12 Conformer-1: Robust ASR via Large-Scale Semisupervised Bootstrapping Kevin Zhang et.al. 2404.07341 null
2024-03-31 Houston we have a Divergence: A Subgroup Performance Analysis of ASR Models Alkis Koudounas et.al. 2404.07226 null
2024-04-10 The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge Yiwei Guo et.al. 2404.06079 null
2024-05-28 VietMed: A Dataset and Benchmark for Automatic Speech Recognition of Vietnamese in the Medical Domain Khai Le-Duc et.al. 2404.05659 link
2024-04-07 Safeguarding Voice Privacy: Harnessing Near-Ultrasonic Interference To Protect Against Unauthorized Audio Recording Forrest McKee et.al. 2404.04769 null
2024-04-04 Transducers with Pronunciation-aware Embeddings for Automatic Speech Recognition Hainan Xu et.al. 2404.04295 null
2024-04-03 Mai Ho'omāuna i ka 'Ai: Language Models Improve Automatic Speech Recognition in Hawaiian Kaavya Chaparala et.al. 2404.03073 null
2024-04-03 CMULAB: An Open-Source Framework for Training and Deployment of Natural Language Processing Models Zaid Sheikh et.al. 2404.02408 link
2024-04-02 BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory Speech Recognition Alexandros Haliassos et.al. 2404.02098 link
2024-04-02 Noise Masking Attacks and Defenses for Pretrained Speech Models Matthew Jagielski et.al. 2404.02052 null
2024-04-02 Kallaama: A Transcribed Speech Dataset about Agriculture in the Three Most Widely Spoken Languages in Senegal Elodie Gauthier et.al. 2404.01991 link
2024-04-02 Transfer Learning from Whisper for Microscopic Intelligibility Prediction Paul Best et.al. 2404.01737 null
2024-07-22 ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models Thibaut Thonet et.al. 2403.20262 link
2024-03-28 Multi-Stage Multi-Modal Pre-Training for Automatic Speech Recognition Yash Jain et.al. 2403.19822 null
2024-03-25 Hierarchical Recurrent Adapters for Efficient Multi-Task Adaptation of Large Speech Models Tsendsuren Munkhdalai et.al. 2403.19709 null
2024-03-29 Emotion Neural Transducer for Fine-Grained Speech Emotion Recognition Siyuan Shen et.al. 2403.19224 null
2024-03-28 LV-CTC: Non-autoregressive ASR with CTC and latent variable models Yuya Fujita et.al. 2403.19207 null
2024-03-04 JEP-KD: Joint-Embedding Predictive Architecture Based Knowledge Distillation for Visual Speech Recognition Chang Sun et.al. 2403.18843 null
2024-06-04 PhysicsAssistant: An LLM-Powered Interactive Learning Robot for Physics Lab Investigations Ehsan Latif et.al. 2403.18721 null
2024-03-27 ZAEBUC-Spoken: A Multilingual Multidialectal Arabic-English Speech Corpus Injy Hamed et.al. 2403.18182 null
2024-04-11 DANCER: Entity Description Augmented Named Entity Corrector for Automatic Speech Recognition Yi-Cheng Wang et.al. 2403.17645 null
2024-03-26 Extracting Biomedical Entities from Noisy Audio Transcripts Nima Ebadi et.al. 2403.17363 null
2024-03-25 Grammatical vs Spelling Error Correction: An Investigation into the Responsiveness of Transformer-based Language Models using BART and MarianMT Rohit Raju et.al. 2403.16655 null
2024-03-22 Privacy-Preserving End-to-End Spoken Language Understanding Yinggui Wang et.al. 2403.15510 null
2024-03-20 Isometric Neural Machine Translation using Phoneme Count Ratio Reward-based Reinforcement Learning Shivam Ratnakant Mhaskar et.al. 2403.15469 null
2024-07-21 Artificial Intelligence for Cochlear Implants: Review of Strategies, Challenges, and Perspectives Billel Essaid et.al. 2403.15442 null
2024-03-26 A Multimodal Approach to Device-Directed Speech Detection with Large Language Models Dominik Wagner et.al. 2403.14438 null
2024-03-21 XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception HyoJung Han et.al. 2403.14402 null
2024-06-04 M$^3$AV: A Multimodal, Multigenre, and Multipurpose Audio-Visual Academic Lecture Dataset Zhe Chen et.al. 2403.14168 null
2024-03-20 Open Access NAO (OAN): a ROS2-based software framework for HRI applications with the NAO robot Antonio Bono et.al. 2403.13960 null
2024-03-20 BanglaNum -- A Public Dataset for Bengali Digit Recognition from Speech Mir Sayeed Mohammad et.al. 2403.13465 null
2024-03-20 Advanced Long-Content Speech Recognition With Factorized Neural Transducer Xun Gong et.al. 2403.13423 null
2024-03-21 FlowerFormer: Empowering Neural Architecture Encoding using a Flow-aware Graph Transformer Dongyeong Hwang et.al. 2403.12821 link
2024-03-19 Real-time Speech Extraction Using Spatially Regularized Independent Low-rank Matrix Analysis and Rank-constrained Spatial Covariance Matrix Estimation Yuto Ishikawa et.al. 2403.12477 null
2024-03-18 Multimodal Human-Autonomous Agents Interaction Using Pre-Trained Language and Visual Foundation Models Linus Nwankwo et.al. 2403.12273 null
2024-03-18 AdaMER-CTC: Connectionist Temporal Classification with Adaptive Maximum Entropy Regularization for Automatic Speech Recognition SooHwan Eom et.al. 2403.11578 null
2024-03-16 Energy-Based Models with Applications to Speech and Language Processing Zhijian Ou et.al. 2403.10961 null
2024-03-16 Initial Decoding with Minimally Augmented Language Model for Improved Lattice Rescoring in Low Resource ASR Savitha Murthy et.al. 2403.10937 null
2024-03-15 Neural Networks Hear You Loud And Clear: Hearing Loss Compensation Using Deep Neural Networks Peter Leer et.al. 2403.10420 null
2024-03-14 SpokeN-100: A Cross-Lingual Benchmarking Dataset for The Classification of Spoken Numbers in Different Languages René Groh et.al. 2403.09753 link
2024-03-15 More than words: Advancements and challenges in speech recognition for singing Anna Kruspe et.al. 2403.09298 null
2024-05-21 Skipformer: A Skip-and-Recover Strategy for Efficient Speech Recognition Wenjing Zhu et.al. 2403.08258 null
2024-03-13 SpeechColab Leaderboard: An Open-Source Platform for Automatic Speech Recognition Evaluation Jiayu Du et.al. 2403.08196 link
2024-03-13 Automatic Speech Recognition (ASR) for the Diagnosis of pronunciation of Speech Sound Disorders in Korean children Taekyung Ahn et.al. 2403.08187 null
2024-03-12 Gujarati-English Code-Switching Speech Recognition using ensemble prediction of spoken language Yash Sharma et.al. 2403.08011 null
2024-03-11 The evaluation of a code-switched Sepedi-English automatic speech recognition system Amanda Phaladi et.al. 2403.07947 null
2024-03-08 Speech Robust Bench: A Robustness Benchmark For Speech Recognition Muhammad A. Shah et.al. 2403.07937 null
2024-03-12 Beyond the Labels: Unveiling Text-Dependency in Paralinguistic Speech Recognition Datasets Jan Pešán et.al. 2403.07767 null
2024-03-11 Real-Time Multimodal Cognitive Assistant for Emergency Medical Services Keshara Weerasinghe et.al. 2403.06734 link
2024-03-11 Towards Decoupling Frontend Enhancement and Backend Recognition in Monaural Robust ASR Yufeng Yang et.al. 2403.06387 null
2024-03-10 SCORE: Self-supervised Correspondence Fine-tuning for Improved Content Representations Amit Meghanani et.al. 2403.06260 link
2025-11-04 Aligning Speech to Languages to Enhance Code-switching Speech Recognition Hexin Liu et.al. 2403.05887 null
2024-03-02 A Cross-Modal Approach to Silent Speech with LLM-Enhanced Recognition Tyler Benster et.al. 2403.05583 link
2024-03-07 Classist Tools: Social Class Correlates with Performance in NLP Amanda Cercas Curry et.al. 2403.04445 null
2024-05-30 A New Benchmark for Evaluating Automatic Speech Recognition in the Arabic Call Domain Qusai Abo Obaidah et.al. 2403.04280 null
2024-03-07 A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition Yusheng Dai et.al. 2403.04245 link
2024-03-06 RADIA -- Radio Advertisement Detection with Intelligent Analytics Jorge Álvarez et.al. 2403.03538 null
2024-03-13 Non-verbal information in spontaneous speech -- towards a new framework of analysis Tirza Biron et.al. 2403.03522 null
2024-03-05 AIx Speed: Playback Speed Optimization Using Listening Comprehension of Speech Recognition Models Kazuki Kawamura et.al. 2403.02938 null
2024-03-04 PixIT: Joint Training of Speaker Diarization and Speech Separation from Real-world Multi-speaker Recordings Joonas Kalda et.al. 2403.02288 link
2024-03-04 What has LeBenchmark Learnt about French Syntax? Zdravko Dugonjić et.al. 2403.02173 null
2024-12-05 EMOVOME: A Dataset for Emotion Recognition in Spontaneous Real-Life Speech Lucía Gómez-Zaragozá et.al. 2403.02167 null
2024-03-04 SA-SOT: Speaker-Aware Serialized Output Training for Multi-Talker ASR Zhiyun Fan et.al. 2403.02010 null
2024-03-04 Language and Speech Technology for Central Kurdish Varieties Sina Ahmadi et.al. 2403.01983 link
2024-03-03 A Closer Look at Wav2Vec2 Embeddings for On-Device Single-Channel Speech Enhancement Ravi Shankar et.al. 2403.01369 null
2024-04-18 Automatic Speech Recognition using Advanced Deep Learning Approaches: A survey Hamza Kheddar et.al. 2403.01255 null
2024-03-01 Post-decoder Biasing for End-to-End Speech Recognition of Multi-turn Medical Interview Heyang Liu et.al. 2403.00370 null
2024-02-29 Probing the Information Encoded in Neural-based Acoustic Models of Automatic Speech Recognition Systems Quentin Raymondaud et.al. 2402.19443 null
2024-02-29 Inappropriate Pause Detection In Dysarthric Speech Using Large-Scale Speech Recognition Jeehyun Lee et.al. 2402.18923 null
2024-06-04 Exploration of Adapter for Noise Robust Automatic Speech Recognition Hao Shi et.al. 2402.18275 null
2024-06-19 Twists, Humps, and Pebbles: Multilingual Speech Recognition Models Exhibit Gender Performance Gaps Giuseppe Attanasio et.al. 2402.17954 link
2024-02-27 An Effective Mixture-Of-Experts Approach For Code-Switching Speech Recognition Leveraging Encoder Disentanglement Tzu-Ting Yang et.al. 2402.17189 null
2024-02-27 Extreme Encoder Output Frame Rate Reduction: Improving Computational Latencies of Large End-to-End Models Rohit Prabhavalkar et.al. 2402.17184 null
2024-04-01 ArEEG_Chars: Dataset for Envisioned Speech Recognition using EEG for Arabic Characters Hazem Darwish et.al. 2402.15733 null
2024-05-14 Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing Jeong Hun Yeo et.al. 2402.15151 link
2024-02-22 Efficient data selection employing Semantic Similarity-based Graph Structures for model training Roxana Petcu et.al. 2402.14888 null
2024-02-22 Wizard of Oz Experimentation for Language Technology Applications: Challenges and Tools Stephan Schlögl et.al. 2402.14563 null
2024-02-22 HINT: High-quality INPainting Transformer with Mask-Aware Encoding and Enhanced Attention Shuang Chen et.al. 2402.14185 link
2024-02-21 An Augmented Lagrangian Method for Training Recurrent Neural Networks Yue Wang et.al. 2402.13687 null
2024-02-22 Mel-FullSubNet: Mel-Spectrogram Enhancement for Improving Both Speech Quality and ASR Rui Zhou et.al. 2402.13511 null
2024-02-20 How do Hyenas deal with Human Speech? Speech Recognition and Translation with ConfHyena Marco Gaido et.al. 2402.13208 link
2024-02-20 Not All Weights Are Created Equal: Enhancing Energy Efficiency in On-Device Streaming Speech Recognition Yang Li et.al. 2402.13076 null
2024-02-20 Comparison of Conventional Hybrid and CTC/Attention Decoders for Continuous Visual Speech Recognition David Gimeno-Gómez et.al. 2402.13004 null
2024-06-16 OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification Yifan Peng et.al. 2402.12654 null
2024-02-19 Multimodal Emotion Recognition from Raw Audio with Sinc-convolution Xiaohui Zhang et.al. 2402.11954 null
2024-02-18 Ain't Misbehavin' -- Using LLMs to Generate Expressive Robot Behavior in Conversations with the Tabletop Robot Haru Zining Wang et.al. 2402.11571 null
2024-02-18 Cross-Attention Fusion of Visual and Geometric Features for Large Vocabulary Arabic Lipreading Samar Daou et.al. 2402.11520 null
2024-01-04 AntiDeepFake: AI for Deep Fake Speech Recognition Enkhtogtokh Togootogtokh et.al. 2402.10218 null
2024-02-15 A cross-talk robust multichannel VAD model for multiparty agent interactions trained using synthetic re-recordings Hyewon Han et.al. 2402.09797 null
2024-02-14 Listening to Multi-talker Conversations: Modular and End-to-end Perspectives Desh Raj et.al. 2402.08932 null
2024-02-14 UniEnc-CASSNAT: An Encoder-only Non-autoregressive ASR for Speech SSL Models Ruchao Fan et.al. 2402.08898 null
2024-02-13 An Embarrassingly Simple Approach for LLM with Strong ASR Capacity Ziyang Ma et.al. 2402.08846 link
2024-02-13 Syllable based DNN-HMM Cantonese Speech to Text System Timothy Wong et.al. 2402.08788 null
2024-05-03 Careless Whisper: Speech-to-Text Hallucination Harms Allison Koenecke et.al. 2402.08021 link
2024-07-26 AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension Qian Yang et.al. 2402.07729 link
2024-02-12 The Sound of Healthcare: Improving Medical Transcription ASR Accuracy with Large Language Models Ayo Adedeji et.al. 2402.07658 null
2024-02-12 The Balancing Act: Unmasking and Alleviating ASR Biases in Portuguese Ajinkya Kulkarni et.al. 2402.07513 null
2024-02-13 SALAD: Smart AI Language Assistant Daily Ragib Amin Nihal et.al. 2402.07431 null
2024-02-11 Does ChatGPT and Whisper Make Humanoid Robots More Relatable? Xiaohui Chen et.al. 2402.07095 null
2024-02-10 DeepCover: Advancing RNN Test Coverage and Online Error Prediction using State Machine Extraction Pouria Golshanrad et.al. 2402.06966 link
2024-02-13 CochCeps-Augment: A Novel Self-Supervised Contrastive Learning Using Cochlear Cepstrum-based Masking for Speech Emotion Recognition Ioannis Ziogas et.al. 2402.06923 null
2024-02-09 Self-consistent context aware conformer transducer for speech recognition Konstantin Kolokolov et.al. 2402.06592 null
2024-02-08 Unified Speech-Text Pretraining for Spoken Dialog Modeling Heeseung Kim et.al. 2402.05706 null
2024-02-08 It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition Chen Chen et.al. 2402.05457 null
2024-02-07 Progressive unsupervised domain adaptation for ASR using ensemble models and multi-stage training Rehan Ahmad et.al. 2402.04805 null
2024-05-28 REBORN: Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR Liang-Hsuan Tseng et.al. 2402.03988 link
2024-02-05 Resolving Transcription Ambiguity in Spanish: A Hybrid Acoustic-Lexical System for Punctuation Restoration Xiliang Zhu et.al. 2402.03519 null
2024-02-05 A Comprehensive Study of the Current State-of-the-Art in Nepali Automatic Speech Recognition Systems Rupak Raj Ghimire et.al. 2402.03050 null
2024-02-03 Predicting positive transfer for improved low-resource speech recognition using acoustic pseudo-tokens Nay San et.al. 2402.02302 null
2024-02-02 Digits micro-model for accurate and secure transactions Chirag Chhablani et.al. 2402.01931 null
2024-02-02 Whispering in Norwegian: Navigating Orthographic and Dialectic Challenges Per E Kummervold et.al. 2402.01917 null
2024-02-01 Introduction to speech recognition Gabriel Dauphin et.al. 2402.01778 null
2024-02-02 Streaming Sequence Transduction through Dynamic Compression Weiting Tan et.al. 2402.01172 link
2024-02-05 AccentFold: A Journey through African Accents for Zero-Shot ASR Adaptation to Target Accents Abraham Toluwase Owodunni et.al. 2402.01152 null
2024-02-01 Prosody in Cascade and Direct Speech-to-Text Translation: a case study on Korean Wh-Phrases Giulio Zhou et.al. 2402.00632 null
2024-01-31 Exploring the limits of decoder-only models trained on public speech recognition corpora Ankit Gupta et.al. 2402.00235 null
2024-01-31 SpeechComposer: Unifying Multiple Speech Tasks with Prompt Composition Yihan Wu et.al. 2401.18045 null
2024-02-08 Computation and Parameter Efficient Multi-Modal Fusion Transformer for Cued Speech Recognition Lei Liu et.al. 2401.17604 null
2024-06-16 OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer Yifan Peng et.al. 2401.16658 null
2024-01-28 Phoneme-Based Proactive Anti-Eavesdropping with Controlled Recording Privilege Peng Huang et.al. 2401.15704 null
2024-01-28 On Speaker Attribution with SURT Desh Raj et.al. 2401.15676 link
2024-01-28 Byte Pair Encoding Is All You Need For Automatic Bengali Speech Recognition Ahnaf Mozib Samin et.al. 2401.15532 null
2024-01-27 Towards Event Extraction from Speech with Contextual Clues Jingqi Kang et.al. 2401.15385 link
2024-01-26 Comparison of parameters of vowel sounds of russian and english languages V. I. Fedoseev et.al. 2401.14890 null
2024-01-26 Toward Practical Automatic Speech Recognition and Post-Processing: a Call for Explainable Error Benchmark Guideline Seonmin Koo et.al. 2401.14625 null
2024-01-25 TDFNet: An Efficient Audio-Visual Speech Separation Model with Top-down Fusion Samuel Pegg et.al. 2401.14185 link
2024-01-24 CNN architecture extraction on edge GPU Peter Horvath et.al. 2401.13575 null
2024-03-18 SpeechDPR: End-to-End Spoken Passage Retrieval for Open-Domain Spoken Question Answering Chyi-Jiunn Lin et.al. 2401.13463 null
2024-05-28 MF-AED-AEC: Speech Emotion Recognition by Leveraging Multimodal Fusion, Asr Error Detection, and Asr Error Correction Jiajun He et.al. 2401.13260 null
2024-01-23 Locality enhanced dynamic biasing and sampling strategies for contextual ASR Md Asif Jalal et.al. 2401.13146 null
2024-01-23 Multilingual and Fully Non-Autoregressive ASR with Large Language Model Fusion: A Comprehensive Study W. Ronny Huang et.al. 2401.12789 null
2024-01-22 Consistency Based Unsupervised Self-training For ASR Personalisation Jisi Zhang et.al. 2401.12085 null
2024-01-22 Lightweight Protection for Privacy in Offloaded Speech Understanding Dongqi Cai et.al. 2401.11983 null
2024-01-22 Keep Decoding Parallel with Effective Knowledge Distillation from Language Models to End-to-end Speech Recognisers Michael Hentschel et.al. 2401.11700 null
2024-06-06 Using Large Language Model for End-to-End Chinese ASR and NER Yuang Li et.al. 2401.11382 null
2024-02-02 Word-Level ASR Quality Estimation for Efficient Corpus Sampling and Post-Editing through Analyzing Attentions of a Reference-Free Metric Golara Javadi et.al. 2401.11268 link
2024-01-20 ConceptThread: Visualizing Threaded Concepts in MOOC Videos Zhiguang Zhou et.al. 2401.11132 null
2024-01-19 Contextualized Automatic Speech Recognition with Attention-Based Bias Phrase Boosted Beam Search Yui Sudo et.al. 2401.10449 null
2024-01-19 Investigating Training Strategies and Model Robustness of Low-Rank Adaptation for Language Modeling in Speech Recognition Yu Yu et.al. 2401.10447 null
2024-01-19 Large Language Models are Efficient Learners of Noise-Robust Speech Recognition Yuchen Hu et.al. 2401.10446 link
2024-01-18 AGADIR: Towards Array-Geometry Agnostic Directional Speech Recognition Ju Lin et.al. 2401.10411 null
2024-01-18 Communication-Efficient Personalized Federated Learning for Speech-to-Text Tasks Yichao Du et.al. 2401.10070 null
2024-07-18 Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation Minsu Kim et.al. 2401.09802 null
2024-07-02 SlideAVSR: A Dataset of Paper Explanation Videos for Audio-Visual Speech Recognition Hao Wang et.al. 2401.09759 null
2024-01-12 Transcending Controlled Environments Assessing the Transferability of ASRRobust NLU Models to Real-World Applications Hania Khan et.al. 2401.09354 null
2024-01-17 On Speech Pre-emphasis as a Simple and Inexpensive Method to Boost Speech Enhancement Iván López-Espejo et.al. 2401.09315 null
2024-01-17 Two-pass Endpoint Detection for Speech Recognition Anirudh Raju et.al. 2401.08916 null
2024-01-16 NOTSOFAR-1 Challenge: New Datasets, Baseline, and Tasks for Distant Meeting Transcription Alon Vinnikov et.al. 2401.08887 null
2024-01-16 Improving ASR Contextual Biasing with Guided Attention Jiyang Tang et.al. 2401.08835 null
2024-01-16 Revisiting Self-supervised Learning of Speech Representation from a Mutual Information Perspective Alexander H. Liu et.al. 2401.08833 null
2024-03-01 Multi-Input Multi-Output Target-Speaker Voice Activity Detection For Unified, Flexible, and Robust Audio-Visual Speaker Diarization Ming Cheng et.al. 2401.08052 null
2024-01-15 Machine Perceptual Quality: Evaluating the Impact of Severe Lossy Compression on Audio and Image Models Dan Jacobellis et.al. 2401.07957 link
2024-07-24 Cascaded Cross-Modal Transformer for Audio-Textual Classification Nicolae-Catalin Ristea et.al. 2401.07575 link
2024-01-15 SeMaScore : a new evaluation metric for automatic speech recognition tasks Zitha Sasindran et.al. 2401.07506 null
2024-01-14 Promptformer: Prompted Conformer Transducer for ASR Sergio Duarte-Torres et.al. 2401.07360 null
2024-01-13 Joint Unsupervised and Supervised Training for Automatic Speech Recognition via Bilevel Optimization A F M Saif et.al. 2401.06980 link
2024-01-12 XLS-R Deep Learning Model for Multilingual ASR on Low-Resource Languages: Indonesian, Javanese, and Sundanese Panji Arisaputra et.al. 2401.06832 null
2024-02-29 The NPU-ASLP-LiAuto System Description for Visual Speech Recognition in CNVSRC 2023 He Wang et.al. 2401.06788 link
2024-01-15 Dynamic Behaviour of Connectionist Speech Recognition with Strong Latency Constraints Giampiero Salvi et.al. 2401.06588 null
2024-01-12 LCB-net: Long-Context Biasing for Audio-Visual Speech Recognition Fan Yu et.al. 2401.06390 link
2024-01-11 End to end Hindi to English speech conversion using Bark, mBART and a finetuned XLSR Wav2Vec2 Aniket Tathe et.al. 2401.06183 null
2024-01-11 UCorrect: An Unsupervised Framework for Automatic Speech Recognition Error Correction Jiaxin Guo et.al. 2401.05689 null
2024-01-10 Useful Blunders: Can Automated Speech Recognition Errors Improve Downstream Dementia Classification? Changye Li et.al. 2401.05551 null
2024-01-10 Towards Online Sign Language Recognition and Translation Ronglai Zuo et.al. 2401.05336 link
2024-07-17 Continuously Learning New Words in Automatic Speech Recognition Christian Huber et.al. 2401.04482 null
2024-01-08 High-precision Voice Search Query Correction via Retrievable Speech-text Embedings Christopher Li et.al. 2401.04235 null
2024-07-22 Cross-Speaker Encoding Network for Multi-Talker Speech Recognition Jiawen Kang et.al. 2401.04152 link
2024-01-08 Exploratory Evaluation of Speech Content Masking Jennifer Williams et.al. 2401.03936 null
2024-03-07 An audio-quality-based multi-strategy approach for target speaker extraction in the MISP 2023 Challenge Runduo Han et.al. 2401.03697 null
2024-06-10 LUPET: Incorporating Hierarchical Information Path into Multilingual ASR Wei Liu et.al. 2401.03689 null
2024-01-08 BS-PLCNet: Band-split Packet Loss Concealment Network with Multi-task Learning Framework and Multi-discriminators Zihan Zhang et.al. 2401.03687 null
2024-07-22 DiarizationLM: Speaker Diarization Post-Processing with Large Language Models Quan Wang et.al. 2401.03506 link
2024-02-21 ICMC-ASR: The ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition Challenge He Wang et.al. 2401.03473 null
2024-01-07 Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation Qiushi Zhu et.al. 2401.03468 link
2024-04-08 MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition He Wang et.al. 2401.03424 null
2024-01-06 TeLeS: Temporal Lexeme Similarity Score to Estimate Confidence in End-to-End ASR Nagarathna Ravi et.al. 2401.03251 link
2024-01-06 Part-of-Speech Tagger for Bodo Language using Deep Learning approach Dhrubajyoti Pathak et.al. 2401.03175 null
2024-01-05 Towards ASR Robust Spoken Language Understanding Through In-Context Learning With Word Confusion Networks Kevin Everson et.al. 2401.02921 null
2024-01-05 Nonlinear functional regression by functional deep neural network with kernel embedding Zhongjie Shi et.al. 2401.02890 null
2024-01-05 A unified multichannel far-field speech recognition system: combining neural beamforming with attention based end-to-end model Dongdi Zhao et.al. 2401.02673 null
2024-01-04 Task Oriented Dialogue as a Catalyst for Self-Supervised Automatic Speech Recognition David M. Chan et.al. 2401.02417 link
2024-01-04 CTC Blank Triggered Dynamic Layer-Skipping for Efficient CTC-based Speech Recognition Junfeng Hou et.al. 2401.02046 null
2024-01-03 Hallucinations in Neural Automatic Speech Recognition: Identifying Errors and Hallucinatory Models Rita Frieske et.al. 2401.01572 null
2024-06-04 The Art of Deception: Robust Backdoor Attack using Dynamic Stacking of Triggers Orson Mengara et.al. 2401.01537 null
2024-01-01 Enhancing Pre-trained ASR System Fine-tuning for Dysarthric Speech Recognition using Adversarial Data Augmentation Huimeng Wang et.al. 2401.00662 null
2024-05-02 Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition Vahid Noroozi et.al. 2312.17279 null
2023-12-26 The NUS-HLT System for ICASSP2024 ICMC-ASR Grand Challenge Meng Ge et.al. 2312.16002 null
2023-12-26 Towards Probing Contact Center Large Language Models Varun Nathan et.al. 2312.15922 null
2023-12-24 Exploring data augmentation in bias mitigation against non-native-accented speech Yuanyuan Zhang et.al. 2312.15499 null
2023-12-22 BLSTM-Based Confidence Estimation for End-to-End Speech Recognition Atsunori Ogawa et.al. 2312.14609 null
2024-02-09 Multimodal Attention Merging for Improved Speech Recognition and Audio Event Classification Anirudh S. Sundar et.al. 2312.14378 null
2024-07-22 Multi-Sentence Grounding for Long-term Instructional Video Zeqian Li et.al. 2312.14055 null
2023-12-21 BANSpEmo: A Bangla Emotional Speech Recognition Dataset Md Gulzar Hussain et.al. 2312.14020 null
2023-12-21 Self-Supervised Adaptive AV Fusion Module for Pre-Trained ASR Models Christopher Simic et.al. 2312.13873 null
2024-02-03 kNN-CTC: Enhancing ASR via Retrieval of CTC Pseudo Labels Jiaming Zhou et.al. 2312.13560 link
2025-01-14 On the Effectiveness of ASR Representations in Real-world Noisy Speech Emotion Recognition Xiaohan Shi et.al. 2311.07093 null
2023-11-20 Decoupling and Interacting Multi-Task Learning Network for Joint Speech and Accent Recognition Qijie Shao et.al. 2311.07062 null
2023-11-02 An analysis of large speech models-based representations for speech emotion recognition Adrian Bogdan Stânea et.al. 2311.00394 null
2024-01-29 Generative Speech Recognition Error Correction with Large Language Models and Task-Activating Prompting Chao-Han Huck Yang et.al. 2309.15649 null
2023-08-09 Federated Representation Learning for Automatic Speech Recognition Guruprasad V Ramesh et.al. 2308.02013 null
2023-07-07 Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition Guinan Li et.al. 2307.02909 null
2023-05-30 HyperConformer: Multi-head HyperMixer for Efficient Speech Recognition Florian Mai et.al. 2305.18281 null
2023-04-24 A vector quantized masked autoencoder for speech emotion recognition Samir Sadok et.al. 2304.11117 null
2023-03-06 DWFormer: Dynamic Window transFormer for Speech Emotion Recognition Shuaiqi Chen et.al. 2303.01694 null
2024-11-08 Pre-Finetuning for Few-Shot Emotional Speech Recognition Maximillian Chen et.al. 2302.12921 null
2023-03-07 A Sidecar Separator Can Convert a Single-Talker Speech Recognition System to a Multi-Talker One Lingwei Meng et.al. 2302.09908 null
2022-11-16 Improving Children's Speech Recognition by Fine-tuning Self-supervised Adult Speech Representations Renee Lu et.al. 2211.07769 null
2022-10-27 Pretrained audio neural networks for Speech emotion recognition in Portuguese Marcelo Matheus Gauy et.al. 2210.14716 null
2022-04-07 What can predictive speech coders learn from speaker recognizers? Marcos Faundez-Zanuy et.al. 2204.02400 null
2022-03-18 Speaker Adaptation Using Spectro-Temporal Deep Features for Dysarthric and Elderly Speech Recognition Mengzhe Geng et.al. 2202.10290 null
2022-02-03 Visualizing Automatic Speech Recognition -- Means for a Better Understanding? Karla Markert et.al. 2202.00673 null
2022-01-31 Discovering Phonetic Inventories with Crosslingual Automatic Speech Recognition Piotr Żelasko et.al. 2201.11207 null
2021-12-22 Voice Quality and Pitch Features in Transformer-Based Speech Recognition Guillermo Cámbara et.al. 2112.11391 null
2022-05-03 Speech Pattern based Black-box Model Watermarking for Automatic Speech Recognition Haozhe Chen et.al. 2110.09814 null
2021-11-05 Towards efficient end-to-end speech recognition with biologically-inspired neural networks Thomas Bohnstingl et.al. 2110.02743 null
2025-02-06 Comparison of Self-Supervised Speech Pre-Training Methods on Flemish Dutch Jakob Poncelet et.al. 2109.14357 null
2021-07-27 Differentiable Allophone Graphs for Language-Universal Speech Recognition Brian Yan et.al. 2107.11628 null
2021-07-06 Arabic Code-Switching Speech Recognition using Monolingual Data Ahmed Ali et.al. 2107.01573 null
2021-07-05 Supervised Contrastive Learning for Accented Speech Recognition Tao Han et.al. 2107.00921 null
2021-07-05 Combining Frame-Synchronous and Label-Synchronous Systems for Speech Recognition Qiujia Li et.al. 2107.00764 null
2022-03-22 Unsupervised Automatic Speech Recognition: A Review Hanan Aldarmaki et.al. 2106.04897 null
2021-10-05 Non-autoregressive Mandarin-English Code-switching Speech Recognition Shun-Po Chuang et.al. 2104.02258 null
2021-02-16 Thank you for Attention: A survey on Attention-based Artificial Neural Networks for Automatic Speech Recognition Priyabrata Karmakar et.al. 2102.07259 null
2021-02-01 BCN2BRNO: ASR System Fusion for Albayzin 2020 Speech to Text Challenge Martin Kocour et.al. 2101.12729 null
2021-09-14 Multi-task Language Modeling for Improving Speech Recognition of Rare Words Chao-Han Huck Yang et.al. 2011.11715 null
2020-11-13 The CUHK-TUDELFT System for The SLT 2021 Children Speech Recognition Challenge Si-Ioi Ng et.al. 2011.06239 null
2020-11-10 Data Augmentation For Children's Speech Recognition -- The "Ethiopian" System For The SLT 2021 Children Speech Recognition Challenge Guoguo Chen et.al. 2011.04547 null
2020-11-10 Gated Recurrent Fusion with Joint Training Framework for Robust End-to-End Speech Recognition Cunhang Fan et.al. 2011.04249 null
2021-09-20 TutorNet: Towards Flexible Knowledge Distillation for End-to-End Speech Recognition Ji Won Yoon et.al. 2008.00671 null
2020-10-06 CTC-Segmentation of Large Corpora for German End-to-end Speech Recognition Ludwig Kürzinger et.al. 2007.09127 null
2020-06-04 The NTNU System at the Interspeech 2020 Non-Native Children's Speech ASR Challenge Tien-Hong Lo et.al. 2005.08433 null
2020-04-20 How to Teach DNNs to Pay Attention to the Visual Modality in Speech Recognition George Sterpu et.al. 2004.08250 null
2022-09-28 The Effect of Silence Feature in Dimensional Speech Emotion Recognition Bagus Tris Atmaja et.al. 2003.01277 null
2020-03-02 A Density Ratio Approach to Language Model Fusion in End-To-End Automatic Speech Recognition Erik McDermott et.al. 2002.11268 null
2020-01-08 Domain Adaptation via Teacher-Student Learning for End-to-End Speech Recognition Zhong Meng et.al. 2001.01798 null
2020-01-08 Character-Aware Attention-Based End-to-End Speech Recognition Zhong Meng et.al. 2001.01795 null
2023-05-23 Leveraging End-to-End Speech Recognition with Neural Architecture Search Ahmed Baruwa et.al. 1912.05946 null
2019-11-21 On using 2D sequence-to-sequence models for speech recognition Parnia Bahar et.al. 1911.08888 null
2019-11-13 Recurrent Neural Network Transducer for Audio-Visual Speech Recognition Takaki Makino et.al. 1911.04890 null
2019-10-15 VAIS ASR: Building a conversational speech recognition system using language model combination Quang Minh Nguyen et.al. 1910.05603 null
2020-05-08 Self-Training for End-to-End Speech Recognition Jacob Kahn et.al. 1909.09116 null
2020-03-17 Advancing Speech Recognition With No Speech Or With Noisy Speech Gautam Krishna et.al. 1906.08871 null
2019-04-26 Phonetically-Oriented Word Error Alignment for Speech Recognition Error Analysis in Speech Translation Nicholas Ruiz et.al. 1904.11024 null
2019-07-10 End-to-End Visual Speech Recognition for Small-Scale Datasets Stavros Petridis et.al. 1904.01954 null
2020-01-01 A Convolutional Neural Network model based on Neutrosophy for Noisy Speech Recognition Elyas Rashno et.al. 1901.10629 null
2018-11-20 Analysis of DNN Speech Signal Enhancement for Robust Speaker Recognition Ondrej Novotny et.al. 1811.07629 null
2018-11-13 Reinforcement Learning Based Speech Enhancement for Robust Speech Recognition Yih-Liang Shen et.al. 1811.04224 null
2023-05-15 End-to-end Audiovisual Speech Activity Detection with Bimodal Recurrent Neural Models Fei Tao et.al. 1809.04553 null
2018-09-13 Isolated and Ensemble Audio Preprocessing Methods for Detecting Adversarial Examples against Automatic Speech Recognition Krishan Rajaratnam et.al. 1809.04397 null
2018-07-04 Exploring End-to-End Techniques for Low-Resource Speech Recognition Vladimir Bataev et.al. 1807.00868 null
2018-05-29 Automatic context window composition for distant speech recognition Mirco Ravanelli et.al. 1805.10498 null
2022-03-17 Curriculum Learning for Speech Emotion Recognition from Crowdsourced Labels Reza Lotfian et.al. 1805.10339 null
2018-04-27 End-to-End Multimodal Speech Recognition Shruti Palaskar et.al. 1804.09713 null
2018-10-17 Deep Long Short-Term Memory Adaptive Beamforming Networks For Multichannel Robust Speech Recognition Zhong Meng et.al. 1711.08016 null
2019-05-01 Unsupervised Adaptation with Domain Separation Networks for Robust Speech Recognition Zhong Meng et.al. 1711.08010 null
2018-02-23 BridgeNets: Student-Teacher Transfer Learning Based on Recursive Neural Networks and its Application to Distant Speech Recognition Jaeyoung Kim et.al. 1710.10224 null
2018-06-29 Combining Multiple Views for Visual Speech Recognition Marina Zimmermann et.al. 1710.07168 null
2018-04-26 Visual speech recognition: aligning terminologies for better understanding Helen L Bear et.al. 1710.01292 null
2018-04-26 Resolution limits on visual speech recognition Helen L. Bear et.al. 1710.01073 null
2017-09-01 Leveraging Deep Neural Network Activation Entropy to cope with Unseen Data in Speech Recognition Vikramjit Mitra et.al. 1708.09516 null
2018-12-06 Single-Channel Multi-talker Speech Recognition with Permutation Invariant Training Yanmin Qian et.al. 1707.06527 null
2017-11-16 Rank-1 Constrained Multichannel Wiener Filter for Speech Recognition in Noisy Environments Ziteng Wang et.al. 1707.00201 null
2017-04-27 Towards Estimating the Upper Bound of Visual-Speech Recognition: The Visual Lip-Reading Feasibility Database Adriana Fernandez-Lopez et.al. 1704.08028 null
2016-12-07 Invariant Representations for Noisy Speech Recognition Dmitriy Serdyuk et.al. 1612.01928 null
2017-08-08 Robust coherence-based spectral enhancement for speech recognition in adverse real-world environments Hendrik Barfuss et.al. 1604.03393 null
2015-09-25 Noise-Robust ASR for the third 'CHiME' Challenge Exploiting Time-Frequency Masking based Multi-Channel Speech Enhancement and Recurrent Neural Network Zaihu Pang et.al. 1509.07211 null
2014-09-05 Visual Speech Recognition Ahmad B. A. Hassanat et.al. 1409.1411 null
2014-02-12 Modified SPLICE and its Extension to Non-Stereo Data for Noise Robust Speech Recognition D. S. Pavan Kumar et.al. 1307.4048 null

TTS

Publish Date Title Authors PDF Code
2026-03-05 Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection Junchuan Zhao et.al. 2603.05373 null
2026-03-05 Measuring the Redundancy of Decoder Layers in SpeechLLMs Adel Moumen et.al. 2603.05121 null
2026-03-04 ZeSTA: Zero-Shot TTS Augmentation with Domain-Conditioned Training for Data-Efficient Personalized Speech Synthesis Youngwon Choi et.al. 2603.04219 null
2026-03-04 VietNormalizer: An Open-Source, Dependency-Free Python Library for Vietnamese Text Normalization in TTS and NLP Applications Hung Vu Nguyen et.al. 2603.04145 null
2026-03-02 More Data, Fewer Diacritics: Scaling Arabic TTS Ahmed Musleh et.al. 2603.01622 null
2026-03-02 End-to-End Simultaneous Dysarthric Speech Reconstruction with Frame-Level Adaptor and Multiple Wait-k Knowledge Distillation Minghui Wu et.al. 2603.01382 null
2026-03-02 DARS: Dysarthria-Aware Rhythm-Style Synthesis for ASR Enhancement Minghui Wu et.al. 2603.01369 null
2026-03-01 S-VoCAL: A Dataset and Evaluation Framework for Inferring Speaking Voice Character Attributes in Literature Abigail Berthe-Pardo et.al. 2603.00958 null
2026-02-26 Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems Siyuan Liu et.al. 2602.23266 null
2026-02-26 TADA: A Generative Framework for Speech Modeling via Text-Acoustic Dual Alignment Trung Dang et.al. 2602.23068 null
2026-03-03 Scalable Multilingual Multimodal Machine Translation with Speech-Text Fusion Yexing Du et.al. 2602.21646 null
2026-02-25 The Design Space of Tri-Modal Masked Diffusion Models Louis Bethune et.al. 2602.21472 null
2026-02-23 Can You Tell It's AI? Human Perception of Synthetic Voices in Vishing Scenarios Zoha Hayat Bhatti et.al. 2602.20061 null
2026-02-23 CTC-TTS: LLM-based dual-streaming text-to-speech with CTC alignment Hanwen Liu et.al. 2602.19574 null
2026-02-19 CC-G2PnP: Streaming Grapheme-to-Phoneme and prosody with Conformer-CTC for unsegmented languages Yuma Shirahata et.al. 2602.17157 null
2026-02-13 Speech to Speech Synthesis for Voice Impersonation Bjorn Johnson et.al. 2602.16721 null
2026-02-18 How to Label Resynthesized Audio: The Dual Role of Neural Audio Codecs in Audio Deepfake Detection Yixuan Xiao et.al. 2602.16343 null
2026-02-17 LLM-to-Speech: A Synthetic Data Pipeline for Training Dialectal Text-to-Speech Models Ahmed Khaled Khamis et.al. 2602.15675 null
2026-03-03 UniTAF: A Modular Framework for Joint Text-to-Speech and Audio-to-Face Modeling Qiangong Zhou et.al. 2602.15651 null
2026-02-16 Disentangling Pitch and Creak for Speaker Identity Preservation in Speech Synthesis Frederik Rautenberg et.al. 2602.14686 null
2026-02-16 Probing Human Articulatory Constraints in End-to-End TTS with Reverse and Mismatched Speech-Text Directions Parth Khadse et.al. 2602.14664 null
2026-02-14 ELEAT-SAGA: Early & Late Integration with Evading Alternating Training for Spoof-Robust Speaker Verification Amro Asali et.al. 2602.13761 null
2026-02-13 PISHYAR: A Socially Intelligent Smart Cane for Indoor Social Navigation and Multimodal Human-Robot Interaction for Visually Impaired People Mahdi Haghighat Joo et.al. 2602.12597 null
2026-02-16 "Sorry, I Didn't Catch That": How Speech Models Miss What Matters Most Kaitlyn Zhou et.al. 2602.12249 null
2026-02-19 When Audio-LLMs Don't Listen: A Cross-Linguistic Study of Modality Arbitration Jayadev Billa et.al. 2602.11488 null
2026-02-12 SLD-L2S: Hierarchical Subspace Latent Diffusion for High-Fidelity Lip to Speech Synthesis Yifan Liang et.al. 2602.11477 null
2026-02-11 Calliope: A TTS-based Narrated E-book Creator Ensuring Exact Synchronization, Privacy, and Layout Fidelity Hugo L. Hammer et.al. 2602.10735 null
2026-02-10 Emotion-Coherent Speech Data Augmentation and Self-Supervised Contrastive Style Training for Enhancing Kids's Story Speech Synthesis Raymond Chung et.al. 2602.10164 null
2026-02-10 Covo-Audio Technical Report Wenfu Wang et.al. 2602.09823 null
2026-02-10 TVTSyn: Content-Synchronous Time-Varying Timbre for Streaming Voice Conversion and Anonymization Waris Quamer et.al. 2602.09389 null
2026-02-03 DSFlow: Dual Supervision and Step-Aware Architecture for One-Step Flow Matching Speech Synthesis Bin Lin et.al. 2602.09041 null
2026-02-19 Prototype-Based Disentanglement for Controllable Dysarthric Speech Synthesis Haoshen Wang et.al. 2602.08696 null
2026-02-08 SoulX-Singer: Towards High-Quality Zero-Shot Singing Voice Synthesis Jiale Qian et.al. 2602.07803 null
2026-01-14 PersonaPlex: Voice and Role Control for Full Duplex Conversational Speech Models Rajarshi Roy et.al. 2602.06053 null
2026-02-05 ARCHI-TTS: A flow-matching-based Text-to-Speech Model with Self-supervised Semantic Aligner and Accelerated Inference Chunyat Wu et.al. 2602.05207 null
2026-02-04 HoliAntiSpoof: Audio LLM for Holistic Speech Anti-Spoofing Xuenan Xu et.al. 2602.04535 null
2026-02-04 PFluxTTS: Hybrid Flow-Matching TTS with Robust Cross-Lingual Voice Cloning and Inference-Time Model Fusion Vikentii Pankov et.al. 2602.04160 null
2026-02-03 CoCoEmo: Composable and Controllable Human-Like Emotional TTS via Activation Steering Siyi Wang et.al. 2602.03420 null
2026-03-02 WAXAL: A Large-Scale Multilingual African Language Speech Corpus Abdoulaye Diack et.al. 2602.02734 null
2026-02-01 VividVoice: A Unified Framework for Scene-Aware Visually-Driven Speech Synthesis Chengyuan Ma et.al. 2602.02591 null
2026-02-02 LipSody: Lip-to-Speech Synthesis with Enhanced Prosody Consistency Jaejun Lee et.al. 2602.01908 null
2026-02-01 EmoAra: Emotion-Preserving English Speech Transcription and Cross-Lingual Translation with Arabic Text-to-Speech Besher Hassan et.al. 2602.01170 null
2026-02-01 Bias in the Ear of the Listener: Assessing Sensitivity in Audio Language Models Across Linguistic, Demographic, and Positional Variations Sheng-Lun Wei et.al. 2602.01030 null
2026-01-31 Edit Content, Preserve Acoustics: Imperceptible Text-Based Speech Editing via Self-Consistency Rewards Yong Ren et.al. 2602.00560 null
2026-01-30 Multi-Speaker Conversational Audio Deepfake: Taxonomy, Dataset and Pilot Study Alabi Ahmed et.al. 2602.00295 null
2026-01-30 Now You Hear Me: Audio Narrative Attacks Against Large Audio-Language Models Ye Yu et.al. 2601.23255 null
2026-01-30 EmoShift: Lightweight Activation Steering for Enhanced Emotion-Aware Speech Synthesis Li Zhou et.al. 2601.22873 null
2026-01-30 Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability Yong Ren et.al. 2601.22661 null
2026-01-29 Speech Quality-Based Localization of Low-Quality Speech and Text-to-Speech Synthesis Artefacts Michael Kuhlmann et.al. 2601.21886 null
2026-01-28 Audio Deepfake Detection in the Age of Advanced Text-to-Speech models Robin Singh et.al. 2601.20510 null
2026-01-28 Erasing Your Voice Before It's Heard: Training-free Speaker Unlearning for Zero-shot Text-to-Speech Myungjin Lee et.al. 2601.20481 null
2026-01-29 Unit-Based Agent for Semi-Cascaded Full-Duplex Dialogue Systems Haoyuan Yu et.al. 2601.20230 null
2026-01-27 T-Mimi: A Transformer-based Mimi Decoder for Real-Time On-Phone TTS Haibin Wu et.al. 2601.20094 null
2026-01-27 Phonological Tokenizer: Prosody-Aware Phonetic Token via Multi-Objective Fine-Tuning with Differentiable K-Means Kentaro Onda et.al. 2601.19781 null
2026-01-26 Neural Multi-Speaker Voice Cloning for Nepali in Low-Resource Settings Aayush M. Shrestha et.al. 2601.18694 null
2026-01-26 UrgentMOS: Unified Multi-Metric and Preference Learning for Robust Speech Quality Assessment Wei Wang et.al. 2601.18438 null
2026-01-25 Quran-MD: A Fine-Grained Multilingual Multimodal Dataset of the Quran Muhammad Umar Salman et.al. 2601.17880 null
2026-01-23 SonoEdit: Null-Space Constrained Knowledge Editing for Pronunciation Correction in LLM-Based TTS Ayush Pratap Singh et.al. 2601.17086 null
2026-01-16 AI-based System for Transforming text and sound to Educational Videos M. E. ElAlami et.al. 2601.17022 null
2026-01-16 ES4R: Speech Encoding Based on Prepositive Affective Modeling for Empathetic Response Generation Zhuoyue Gao et.al. 2601.16225 null
2026-01-22 Timbre-Aware LLM-based Direct Speech-to-Speech Translation Extendable to Multiple Language Pairs Lalaram Arya et.al. 2601.16023 null
2026-01-22 Qwen3-TTS Technical Report Hangrui Hu et.al. 2601.15621 link
2026-01-22 DeepASMR: LLM-Based Zero-Shot ASMR Speech Generation for Anyone of Any Voice Leying Zhang et.al. 2601.15596 null
2026-01-20 Prosody-Guided Harmonic Attention for Phase-Coherent Neural Vocoding in the Complex Spectrum Mohammed Salah Al-Radhi et.al. 2601.14472 null
2026-01-28 Quantifying Speaker Embedding Phonological Rule Interactions in Accented Speech Synthesis Thanathai Lertpetchpun et.al. 2601.14417 null
2026-01-20 Habibi: Laying the Open-Source Foundation of Unified-Dialectal Arabic Speech Synthesis Yushen Chen et.al. 2601.13802 null
2026-01-19 Lombard Speech Synthesis for Any Voice with Controllable Style Embeddings Seymanur Akti et.al. 2601.12966 null
2026-01-18 A Unified Neural Codec Language Model for Selective Editable Text to Speech Generation Hanchen Pei et.al. 2601.12480 null
2026-01-18 ParaMETA: Towards Learning Disentangled Paralinguistic Speaking Styles Representations from Speech Haowei Lou et.al. 2601.12289 null
2026-01-18 Confidence-based Filtering for Speech Dataset Curation with Generative Speech Enhancement Using Discrete Tokens Kazuki Yamauchi et.al. 2601.12254 null
2026-01-16 WenetSpeech-Wu: Datasets, Benchmarks, and Models for a Unified Chinese Wu Dialect Speech Processing Ecosystem Chengyou Wang et.al. 2601.11027 null
2026-01-15 Unifying Speech Recognition, Synthesis and Conversion with Autoregressive Transformers Runyuan Cai et.al. 2601.10770 null
2026-01-20 VoiceSculptor: Your Voice, Designed By You Jingbin Hu et.al. 2601.10629 null
2026-01-15 STEAMROLLER: A Multi-Agent System for Inclusive Automatic Speech Recognition for People who Stutter Ziqi Xu et.al. 2601.10223 null
2026-01-13 Decoding Order Matters in Autoregressive Speech Synthesis Minghui Zhao et.al. 2601.08450 null
2026-01-13 Detecting Mental Manipulation in Speech via Synthetic Multi-Speaker Dialogue Run Chen et.al. 2601.08342 null
2026-03-02 FOCAL: A Novel Benchmarking Technique for Multi-modal Agents Anupam Purwar et.al. 2601.07367 null
2026-02-05 ESDD2: Environment-Aware Speech and Sound Deepfake Detection Challenge Evaluation Plan Xueping Zhang et.al. 2601.07303 null
2026-01-10 Lightweight Resolution-Aware Audio Deepfake Detection via Cross-Scale Attention and Consistency Learning K. A. Shahriar et.al. 2601.06560 null
2026-01-09 Pantagruel: Unified Self-Supervised Encoders for French Text and Speech Phuong-Hang Le et.al. 2601.05911 null
2026-01-14 Afri-MCQA: Multimodal Cultural Question Answering for African Languages Atnafu Lambebo Tonja et.al. 2601.05699 null
2026-01-09 SPAM: Style Prompt Adherence Metric for Prompt-based TTS Chanhee Cho et.al. 2601.05554 null
2026-01-08 CosyEdit: Unlocking End-to-End Speech Editing Capability from Zero-Shot Text-to-Speech Models Junyang Chen et.al. 2601.05329 null
2026-01-08 FlexiVoice: Enabling Flexible Style Control in Zero-Shot TTS with Natural Language Instructions Dekun Chen et.al. 2601.04656 null
2026-01-08 LLMs-Integrated Automatic Hate Speech Recognition Using Controllable Text Generation Models Ryutaro Oshima et.al. 2601.04654 null
2026-01-09 IndexTTS 2.5 Technical Report Yunpei Li et.al. 2601.03888 null
2026-01-14 Stuttering-Aware Automatic Speech Recognition for Indonesian Language Fadhil Muhammad et.al. 2601.03727 null
2026-01-07 Domain Adaptation of the Pyannote Diarization Pipeline for Conversational Indonesian Audio Muhammad Daffa'i Rafi Prasetyo et.al. 2601.03684 null
2026-01-07 ReStyle-TTS: Relative and Continuous Style Control for Zero-Shot Speech Synthesis Haitao Li et.al. 2601.03632 null
2026-01-06 Tigrinya Number Verbalization: Rules, Algorithm, and Implementation Fitsum Gaim et.al. 2601.03403 null
2026-01-06 Segment-Aware Conditioning for Training-Free Intra-Utterance Emotion and Duration Control in Text-to-Speech Qifan Liang et.al. 2601.03170 null
2026-01-24 XLSR-MamBo: Scaling the Hybrid Mamba-Attention Backbone for Audio Deepfake Detection Kwok-Ho Ng et.al. 2601.02944 null
2026-01-06 Vulnerabilities of Audio-Based Biometric Authentication Systems Against Deepfake Speech Synthesis Mengze Hong et.al. 2601.02914 null
2026-01-06 Vclip: Face-based Speaker Generation by Face-voice Association Learning Yao Shi et.al. 2601.02753 null
2026-01-05 VocalBridge: Latent Diffusion-Bridge Purification for Defeating Perturbation-Based Voiceprint Defenses Maryam Abbasihafshejani et.al. 2601.02444 null
2026-01-05 Towards Prosodically Informed Mizo TTS without Explicit Tone Markings Abhijit Mohanta et.al. 2601.02073 null
2026-01-08 MM-Sonate: Multimodal Controllable Audio-Video Generation with Zero-Shot Voice Cloning Chunyu Qiang et.al. 2601.01568 null
2026-01-04 OV-InstructTTS: Towards Open-Vocabulary Instruct Text-to-Speech Yong Ren et.al. 2601.01459 null
2026-01-02 Improving Code-Switching Speech Recognition with TTS Data Augmentation Yue Heng Yeo et.al. 2601.00935 null
2026-01-01 DepFlow: Disentangled Speech Generation to Mitigate Semantic Bias in Depression Detection Yuxin Li et.al. 2601.00303 null
2025-12-29 AI4Reading: Chinese Audiobook Interpretation System Based on Multi-Agent Collaboration Minjiang Huang et.al. 2512.23300 null
2025-12-27 ManchuTTS: Towards High-Quality Manchu Speech Synthesis via Flow Matching and Hierarchical Text Representation Suhua Wang et.al. 2512.22491 null
2025-12-25 Zero-Shot to Zero-Lies: Detecting Bengali Deepfake Audio through Transfer Learning Most. Sharmin Sultana Samu et.al. 2512.21702 null
2026-01-20 Fun-Audio-Chat Technical Report Tongyi Fun Team et.al. 2512.20156 link
2025-12-21 Smark: A Watermark for Text-to-Speech Diffusion Models via Discrete Wavelet Transform Yichuan Zhang et.al. 2512.18791 null
2025-12-21 Task Vector in TTS: Toward Emotionally Expressive Dialectal Speech Synthesis Pengchao Feng et.al. 2512.18699 null
2025-12-19 Training Text-to-Speech Model with Purely Synthetic Data: Feasibility, Sensitivity, and Generalization Capability Tingxiao Zhou et.al. 2512.17356 null
2025-12-19 Robust TTS Training via Self-Purifying Flow Matching for the WildSpoof 2026 TTS Track June Young Yi et.al. 2512.17293 null
2025-12-24 Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs Sara Papi et.al. 2512.16378 link
2025-12-16 Adapting Speech Language Model to Singing Voice Synthesis Yiwen Zhao et.al. 2512.14657 null
2025-12-16 GLM-TTS Technical Report Jiayan Cui et.al. 2512.14291 link
2025-12-18 A stylometric analysis of speaker attribution from speech transcripts Cristina Aggazzotti et.al. 2512.13667 null
2025-12-15 Reproducing and Dissecting Denoising Language Models for Speech Recognition Dorian Koch et.al. 2512.13576 null
2026-01-04 DisCo-Speech: Controllable Zero-Shot Speech Generation with A Disentangled Speech Codec Tao Li et.al. 2512.13251 null
2025-12-11 CompanionCast: A Multi-Agent Conversational AI Framework with Spatial Audio for Social Co-Viewing Experiences Yiyang Wang et.al. 2512.10918 null
2025-12-10 DMP-TTS: Disentangled multi-modal Prompting for Controllable Text-to-Speech with Chained Guidance Kang Yin et.al. 2512.09504 null
2025-12-09 LG Uplus System with Multi-Speaker IDs and Discriminator-based Sub-Judges for the WildSpoof Challenge Jinyoung Park et.al. 2512.09000 null
2025-12-08 Beyond Unified Models: A Service-Oriented Approach to Low Latency, Context Aware Phonemization for Real Time TTS Mahta Fetrat et.al. 2512.08006 link
2025-12-06 Sanvaad: A Multimodal Accessibility Framework for ISL Recognition and Voice-Based Interaction Kush Revankar et.al. 2512.06485 null
2025-12-05 SEA-SafeguardBench: Evaluating AI Safety in SEA Languages and Cultures Panuthep Tasawong et.al. 2512.05501 null
2025-11-23 SyncVoice: Towards Video Dubbing with Vision-Augmented Pretrained TTS Model Kaidi Wang et.al. 2512.05126 null
2025-12-04 HiPPO: Exploring A Novel Hierarchical Pronunciation Assessment Approach for Spoken Languages Bi-Cheng Yan et.al. 2512.04964 null
2025-12-04 M3-TTS: Multi-modal DiT Alignment & Mel-latent for Zero-shot High-fidelity Speech Synthesis Xiaopeng Wang et.al. 2512.04720 null
2026-01-26 RRPO: Robust Reward Policy Optimization for LLM-based Emotional TTS Cong Wang et.al. 2512.04552 null
2025-12-02 How to DP-fy Your Data: A Practical Guide to Generating Synthetic Data With Differential Privacy Natalia Ponomareva et.al. 2512.03238 null
2025-12-02 BOOM: Beyond Only One Modality KIT's Multimodal Multilingual Lecture Companion Sai Koneru et.al. 2512.02817 null
2025-12-02 Hear What Matters! Text-conditioned Selective Video-to-Audio Generation Junwon Lee et.al. 2512.02650 null
2025-12-02 Spoken Conversational Agents with Large Language Models Chao-Han Huck Yang et.al. 2512.02593 null
2025-12-01 MCAT: Scaling Many-to-Many Speech-to-Text Translation with MLLMs to 70 Languages Yexing Du et.al. 2512.01512 null
2025-12-01 fMRI2GES: Co-speech Gesture Reconstruction from fMRI Signal with Dual Brain Decoding Alignment Chunzheng Zhu et.al. 2512.01189 null
2025-11-30 Arabic TTS with FastPitch: Reproducible Baselines, Adversarial Training, and Oversmoothing Analysis Lars Nippert et.al. 2512.00937 null
2025-12-03 STCTS: Generative Semantic Compression for Ultra-Low Bitrate Speech via Explicit Text-Prosody-Timbre Decomposition Siyu Wang et.al. 2512.00451 null
2025-11-28 OmniFusion: Simultaneous Multilingual Multimodal Translations via Modular Fusion Sai Koneru et.al. 2512.00234 link
2025-11-28 CoordSpeaker: Exploiting Gesture Captioning for Coordinated Caption-Empowered Co-Speech Gesture Generation Fengyi Fang et.al. 2511.22863 null
2025-11-27 PURE Codec: Progressive Unfolding of Residual Entropy for Speech Codec Learning Jiatong Shi et.al. 2511.22687 null
2025-11-27 Joint Speech and Text Training for LLM-Based End-to-End Spoken Dialogue State Tracking Katia Vendrame et.al. 2511.22503 null
2025-11-27 GLA-Grad++: An Improved Griffin-Lim Guided Diffusion Model for Speech Synthesis Teysir Baoueb et.al. 2511.22293 null
2025-11-27 VSpeechLM: A Visual Speech Language Model for Visual Text-to-Speech Task Yuyue Wang et.al. 2511.22229 null
2025-11-27 Layover or Direct Flight: Rethinking Audio-Guided Image Segmentation Joel Alberto Santos et.al. 2511.22025 null
2025-11-26 Advancing Marine Bioacoustics with Deep Generative Models: A Hybrid Augmentation Strategy for Southern Resident Killer Whale Detection Bruno Padovese et.al. 2511.21872 null
2025-12-05 Decoding inner speech with an end-to-end brain-to-text neural interface Yizi Zhang et.al. 2511.21740 null
2025-11-26 Voice, Bias, and Coreference: An Interpretability Study of Gender in Speech Translation Lina Conti et.al. 2511.21517 null
2025-11-26 Multi-Reward GRPO for Stable and Prosodic Single-Codebook TTS LLMs at Scale Yicheng Zhong et.al. 2511.21270 null
2025-11-26 RosettaSpeech: Zero-Shot Speech-to-Speech Translation from Monolingual Data Zhisheng Zheng et.al. 2511.20974 null
2025-12-24 SingingSDS: A Singing-Capable Spoken Dialogue System for Conversational Roleplay Applications Jionghao Han et.al. 2511.20972 link
2025-11-25 Continual Audio Deepfake Detection via Universal Adversarial Perturbation Wangjie Li et.al. 2511.19974 null
2025-11-25 It Hears, It Sees too: Multi-Modal LLM for Depression Detection By Integrating Visual Understanding into Audio Language Models Xiangyu Zhao et.al. 2511.19877 null
2025-11-24 Dynamic Multi-Species Bird Soundscape Generation with Acoustic Patterning and 3D Spatialization Ellie L. Zhang et.al. 2511.19275 null
2025-11-24 Context-Aware Whisper for Arabic ASR Under Linguistic Varieties Bashar Talafha et.al. 2511.18774 null
2025-12-03 First Deep Learning Approach to Hammering Acoustics for Stem Stability Assessment in Total Hip Arthroplasty Dongqi Zhu et.al. 2511.18725 null
2025-11-24 AIRHILT: A Human-in-the-Loop Testbed for Multimodal Conflict Detection in Aviation Omar Garib et.al. 2511.18718 null
2025-11-23 InstructAudio: Unified speech and music generation with natural language instruction Chunyu Qiang et.al. 2511.18487 null
2025-11-23 A Multimodal Conversational Agent for Tabular Data Analysis Mohammad Nour Al Awad et.al. 2511.18405 null
2025-11-22 A superpersuasive autonomous policy debating system Allen Roush et.al. 2511.17854 null
2025-11-12 Speech Recognition Model Improves Text-to-Speech Synthesis using Fine-Grained Reward Guansu Wang et.al. 2511.17555 null
2025-11-21 AI in Music and Sound: Pedagogical Reflections, Post-Structuralist Approaches and Creative Outcomes in Seminar Practice Guilherme Coelho et.al. 2511.17425 null
2025-11-21 Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM Chiori Hori et.al. 2511.17335 null
2025-11-20 Revisiting Audio-language Pretraining for Learning General-purpose Audio Representation Wei-Cheng Tseng et.al. 2511.16757 null
2025-11-20 Codec2Vec: Self-Supervised Speech Representation Learning Using Neural Speech Codecs Wei-Cheng Tseng et.al. 2511.16639 null
2025-11-20 SceneGuard: Training-Time Voice Protection with Scene-Consistent Audible Background Noise Rui Sang et.al. 2511.16114 null
2025-11-26 Step-Audio-R1 Technical Report Fei Tian et.al. 2511.15848 link
2025-11-24 PresentCoach: Dual-Agent Presentation Coaching through Exemplars and Interactive Feedback Sirui Chen et.al. 2511.15253 null
2025-11-18 Quality-Controlled Multimodal Emotion Recognition in Conversations with Identity-Based Transfer Learning and MAMBA Fusion Zanxu Wang et.al. 2511.14969 null
2025-11-18 Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech Nam-Gyu Kim et.al. 2511.14824 null
2025-11-06 The Impact of Prosodic Segmentation on Speech Synthesis of Spontaneous Speech Julio Cesar Galdino et.al. 2511.14779 null
2025-11-18 Ground Truth Generation for Multilingual Historical NLP using LLMs Clovis Gladstone et.al. 2511.14688 null
2025-11-18 TTA: Transcribe, Translate and Alignment for Cross-lingual Speech Representation Wei Liu et.al. 2511.14410 null
2025-11-19 StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model Yifan Yang et.al. 2511.14223 null
2025-11-20 FxSearcher: gradient-free text-driven audio transformation Hojoon Ki et.al. 2511.14138 null
2025-11-17 Human-centric Maintenance Process Through Integration of AI, Speech, and AR Parul Khanna et.al. 2511.13918 null
2025-11-26 Passive Dementia Screening via Facial Temporal Micro-Dynamics Analysis of In-the-Wild Talking-Head Video Filippo Cenacchi et.al. 2511.13802 null
2025-11-17 Computational Measurement of Political Positions: A Review of Text-Based Ideal Point Estimation Algorithms Patrick Parschan et.al. 2511.13238 null
2025-11-24 FoleyBench: A Benchmark For Video-to-Audio Models Satvik Dixit et.al. 2511.13219 null
2025-11-17 Distinguishing Repetition Disfluency from Morphological Reduplication in Bangla ASR Transcripts: A Novel Corpus and Benchmarking Analysis Zaara Zabeen Arpa et.al. 2511.13159 null
2025-11-17 A Smart-Glasses for Emergency Medical Services via Multimodal Multitask Learning Liuyi Jin et.al. 2511.13078 null
2025-11-16 Improving Direct Persian-English Speech-to-Speech Translation with Discrete Units and Synthetic Parallel Data Sina Rashidi et.al. 2511.12690 null
2025-11-16 Hi-Reco: High-Fidelity Real-Time Conversational Digital Humans Hongbin Huang et.al. 2511.12662 null
2025-11-23 Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data Yunxin Li et.al. 2511.12609 link
2025-11-16 DenseAnnotate: Enabling Scalable Dense Caption Collection for Images and 3D Scenes via Spoken Descriptions Xiaoyu Lin et.al. 2511.12452 null
2025-11-15 VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing Zhisheng Zheng et.al. 2511.12347 null
2025-11-15 Fusionista2.0: Efficiency Retrieval System for Large-Scale Datasets Huy M. Le et.al. 2511.12255 null
2025-10-27 TimeStampEval: A Simple LLM Eval and a Little Fuzzy Matching Trick to Improve Search Accuracy James McCammon et.al. 2511.11594 null
2025-11-14 Language-Aided State Estimation Yuki Miyoshi et.al. 2511.11285 null
2025-11-14 CLARITY: Contextual Linguistic Adaptation and Accent Retrieval for Dual-Bias Mitigation in Text-to-Speech Generation Crystal Min Hui Poon et.al. 2511.11104 null
2025-11-14 Synthetic Voices, Real Threats: Evaluating Large Text-to-Speech Models in Generating Harmful Audio Guangke Chen et.al. 2511.10913 null
2025-11-13 Curved Worlds, Clear Boundaries: Generalizing Speech Deepfake Detection using Hyperbolic and Spherical Geometry Spaces Farhan Sheth et.al. 2511.10793 null
2025-11-12 Do AI Voices Learn Social Nuances? A Case of Politeness and Speech Rate Eyal Rabin et.al. 2511.10693 null
2025-11-12 StyleBreak: Revealing Alignment Vulnerabilities in Large Audio-Language Models via Style-Aware Audio Jailbreak Hongyi Li et.al. 2511.10692 null
2025-11-09 Towards Fine-Grained Code-Switch Speech Translation with Semantic Space Alignment Yan Gao et.al. 2511.10670 null
2025-11-13 VocalNet-M2: Advancing Low-Latency Spoken Language Modeling via Integrated Multi-Codebook Tokenization and Multi-Token Prediction Yuhao Wang et.al. 2511.10232 null
2025-11-14 Speech-Audio Compositional Attacks on Multimodal LLMs and Their Mitigation with SALMONN-Guard Yudong Yang et.al. 2511.10222 null
2025-11-13 FabasedVC: Enhancing Voice Conversion with Text Modality Fusion and Phoneme-Level SSL Features Wenyu Wang et.al. 2511.10112 null
2025-11-13 Time-Layer Adaptive Alignment for Speaker Similarity in Flow-Matching Based Zero-Shot TTS Haoyu Li et.al. 2511.09995 null
2025-11-12 End-to-end Contrastive Language-Speech Pretraining Model For Long-form Spoken Question Answering Jiliang Hu et.al. 2511.09282 null
2025-11-12 POTSA: A Cross-Lingual Speech Alignment Framework for Low Resource Speech-to-Text Translation Xuanchen Li et.al. 2511.09232 null
2025-11-01 Retrieval-Augmented Generation of Pediatric Speech-Language Pathology vignettes: A Proof-of-Concept Study Yilan Liu et.al. 2511.08600 null
2025-11-11 ParliaBench: An Evaluation and Benchmarking Framework for LLM-Generated Parliamentary Speech Marios Koniaris et.al. 2511.08247 null
2025-11-11 State of the Art in Text Classification for South Slavic Languages: Fine-Tuning or Prompting? Taja Kuzman Pungeršek et.al. 2511.07989 null
2025-11-30 SpeechJudge: Towards Human-Level Judgment for Speech Naturalness Xueyao Zhang et.al. 2511.07931 null
2025-11-24 SynTTS-Commands: A Public Dataset for On-Device KWS via TTS-Synthesized Multilingual Speech Lu Gan et.al. 2511.07821 link
2025-11-10 Conditional Diffusion as Latent Constraints for Controllable Symbolic Music Generation Matteo Pettenó et.al. 2511.07156 null
2025-11-10 Generating Novel and Realistic Speakers for Voice Conversion Meiying Melissa Chen et.al. 2511.07135 null
2025-11-10 On the Joint Minimization of Regularization Loss Functions in Deep Variational Bayesian Methods for Attribute-Controlled Symbolic Music Generation Matteo Pettenó et.al. 2511.07118 null
2025-11-10 E2E-VGuard: Adversarial Prevention for Production LLM-based End-To-End Speech Synthesis Zhisheng Zhang et.al. 2511.07099 null
2025-11-10 MedVoiceBias: A Controlled Study of Audio LLM Behavior in Clinical Decision-Making Zhi Rui Tam et.al. 2511.06592 null
2025-11-09 SAR-LM: Symbolic Audio Reasoning with Large Language Models Termeh Taheri et.al. 2511.06483 null
2025-11-18 TalkSketch: Multimodal Generative AI for Real-time Sketch Ideation with Speech Weiyan Shi et.al. 2511.05817 null
2025-11-07 Shared Latent Representation for Joint Text-to-Audio-Visual Synthesis Dogucan Yaman et.al. 2511.05432 null
2025-11-07 Synthesizing speech with selected perceptual voice qualities - A case study with creaky voice Frederik Rautenberg et.al. 2511.05143 null
2025-11-06 PromptSep: Generative Audio Separation via Multimodal Prompting Yutong Wen et.al. 2511.04623 null
2025-11-19 Step-Audio-EditX Technical Report Chao Yan et.al. 2511.03601 link
2025-11-05 Seeing What You Say: Expressive Image Generation from Speech Jiyoung Lee et.al. 2511.03423 null
2025-11-05 TASU: Text-Only Alignment for Speech Understanding Jing Peng et.al. 2511.03310 null
2025-11-11 How to Evaluate Speech Translation with Source-Aware Neural MT Metrics Mauro Cettolo et.al. 2511.03295 null
2025-11-05 PolyNorm: Few-Shot LLM-Based Text Normalization for Text-to-Speech Michel Wong et.al. 2511.03080 null
2025-11-04 Augmenting Open-Vocabulary Dysarthric Speech Assessment with Human Perceptual Supervision Kaimeng Jia et.al. 2511.02270 null
2025-11-03 Toward Objective and Interpretable Prosody Evaluation in Text-to-Speech: A Linguistically Motivated Approach Cedric Chan et.al. 2511.02104 null
2025-11-03 SeaLLMs-Audio: Large Audio-Language Models for Southeast Asia Chaoqun Liu et.al. 2511.01670 null
2025-11-03 Speech-DRAME: A Framework for Human-Aligned Benchmarks in Speech Role-Play Jiatong Shi et.al. 2511.01261 null
2025-11-28 LongCat-Flash-Omni Technical Report Meituan LongCat Team et.al. 2511.00279 null
2025-10-31 Reconstructing Unseen Sentences from Speech-related Biosignals for Open-vocabulary Neural Communication Deok-Seon Kim et.al. 2510.27247 null
2025-10-30 UniTok-Audio: A Unified Audio Generation Framework via Generative Modeling on Discrete Codec Tokens Chengwei Liu et.al. 2510.26372 null
2025-10-30 SP-MCQA: Evaluating Intelligibility of TTS Beyond the Word Level Hitomi Jin Ling Tee et.al. 2510.26190 null
2025-10-30 ALMGuard: Safety Shortcuts and Where to Find Them as Guardrails for Audio-Language Models Weifei Jin et.al. 2510.26096 null
2025-10-27 SFMS-ALR: Script-First Multilingual Speech Synthesis with Adaptive Locale Resolution Dharma Teja Donepudi et.al. 2510.25178 null
2025-10-29 Explainable Disentanglement on Discrete Speech Representations for Noise-Robust ASR Shreyas Gopal et.al. 2510.25150 null
2025-10-30 Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech Pedro Corrêa et.al. 2510.25054 null
2025-10-28 POWSM: A Phonetic Open Whisper-Style Speech Foundation Model Chin-Jou Li et.al. 2510.24992 null
2025-11-25 Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation Inclusion AI et.al. 2510.24821 null
2025-11-28 STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence Zihan Liu et.al. 2510.24693 link
2025-10-28 Levée d'ambiguïtés par grammaires locales Eric G. C. Laporte et.al. 2510.24530 null
2025-10-28 Bayesian Speech synthesizers Can Learn from Multiple Teachers Ziyang Zhang et.al. 2510.24372 null
2025-10-28 Abjad AI at NADI 2025: CATT-Whisper: Multimodal Diacritic Restoration Using Text and Speech Representations Ahmad Ghannam et.al. 2510.24247 null
2025-10-28 V-SAT: Video Subtitle Annotation Tool Arpita Kundu et.al. 2510.24180 null
2025-10-30 TeleEgo: Benchmarking Egocentric AI Assistants in the Wild Jiaqi Yan et.al. 2510.23981 null
2025-10-28 emg2speech: synthesizing speech from electromyography using self-supervised speech models Harshavardhana T. Gowda et.al. 2510.23969 null
2025-10-27 AfriMTEB and AfriE5: Benchmarking and Adapting Text Embedding Models for African Languages Kosei Uemura et.al. 2510.23896 null
2025-11-01 RoboOmni: Proactive Robot Manipulation in Omni-modal Context Siyin Wang et.al. 2510.23763 link
2025-10-28 SoulX-Podcast: Towards Realistic Long-form Podcasts with Dialectal and Paralinguistic Diversity Hanke Xie et.al. 2510.23541 null
2025-10-29 Are ASR foundation models generalized enough to capture features of regional dialects for low-resource languages? Tawsif Tashwar Dipto et.al. 2510.23252 null
2025-10-27 Flexing in 73 Languages: A Single Small Model for Multilingual Inflection Tomáš Sourada et.al. 2510.23114 null
2025-10-27 Adapting Speech Foundation Models with Large Language Models for Unified Speech Recognition Jing-Xuan Zhang et.al. 2510.22961 null
2025-10-30 DiffRhythm 2: Efficient and High Fidelity Song Generation via Block Flow Matching Yuepeng Jiang et.al. 2510.22950 null
2025-10-26 UltraVoice: Scaling Fine-Grained Style-Controlled Speech Conversations for Spoken Dialogue Models Wenming Tu et.al. 2510.22588 link
2025-10-25 M-CIF: Multi-Scale Alignment For CIF-Based Non-Autoregressive ASR Ruixiang Mao et.al. 2510.22172 null
2025-10-23 GuitarFlow: Realistic Electric Guitar Synthesis From Tablatures via Flow Matching and Style Transfer Jackson Loth et.al. 2510.21872 null
2025-10-24 Foley Control: Aligning a Frozen Latent Text-to-Audio Model to Video Ciara Rowles et.al. 2510.21581 null
2025-10-24 SindBERT, the Sailor: Charting the Seas of Turkish NLP Raphael Scheible-Schmitt et.al. 2510.21364 null
2025-10-30 Elementary, My Dear Watson: Non-Invasive Neural Keyword Spotting in the LibriBrain Dataset Gereon Elvers et.al. 2510.21038 null
2025-10-27 ReFESS-QI: Reference-Free Evaluation For Speech Separation With Joint Quality And Intelligibility Scoring Ari Frummer et.al. 2510.21014 null
2025-11-13 Can Current Detectors Catch Face-to-Voice Deepfake Attacks? Nguyen Linh Bao Nguyen et.al. 2510.21004 null
2025-10-22 Data-Centric Lessons To Improve Speech-Language Pretraining Vishaal Udandarao et.al. 2510.20860 null
2025-10-23 CantoNLU: A benchmark for Cantonese natural language understanding Junghyun Min et.al. 2510.20670 null
2025-10-23 Speaking Clearly: A Simplified Whisper-Based Codec for Low-Bitrate Speech Coding Xin Zhang et.al. 2510.20504 null
2025-10-23 Vox-Evaluator: Enhancing Stability and Fidelity for Zero-shot TTS with A Multi-Level Evaluator Hualei Wang et.al. 2510.20210 null
2025-10-23 Are Stereotypes Leading LLMs' Zero-Shot Stance Detection? Anthony Dubreuil et.al. 2510.20154 null
2025-10-23 SpeechAgent: An End-to-End Mobile Infrastructure for Speech Impairment Assistance Haowei Lou et.al. 2510.20113 null
2025-10-22 OmniMotion-X: Versatile Multimodal Whole-Body Motion Generation Guowei Xu et.al. 2510.19789 null
2025-10-23 Adapting Multilingual Models to Code-Mixed Tasks via Model Merging Prashant Kodali et.al. 2510.19782 null
2025-10-22 Style Attack Disguise: When Fonts Become a Camouflage for Adversarial Intent Yangshijie Zhang et.al. 2510.19641 null
2025-10-22 Re-evaluating Minimum Bayes Risk Decoding for Automatic Speech Recognition Yuu Jinnai et.al. 2510.19471 null
2025-10-22 EchoFake: A Replay-Aware Dataset for Practical Speech Deepfake Detection Tong Zhang et.al. 2510.19414 null
2025-10-22 SONAR-SLT: Multilingual Sign Language Translation via Language-Agnostic Sentence Embedding Supervision Yasser Hamidullah et.al. 2510.19398 null
2025-10-22 M3-SLU: Evaluating Speaker-Attributed Reasoning in Multimodal Large Language Models Yejin Kwon et.al. 2510.19358 null
2025-10-22 Modeling Turn-Taking with Semantically Informed Gestures Varsha Suresh et.al. 2510.19350 null
2025-10-22 Slot Filling as a Reasoning Task for SpeechLLMs Kadri Hacioglu et.al. 2510.19326 null
2025-10-21 Steering Autoregressive Music Generation with Recursive Feature Machines Daniel Zhao et.al. 2510.19127 null
2025-11-07 Re:Member: Emotional Question Generation from Personal Memories Zackary Rackauckas et.al. 2510.19030 null
2025-11-05 StutterZero and StutterFormer: End-to-End Speech Conversion for Stuttering Transcription and Correction Qianheng Xu et.al. 2510.18938 null
2025-10-21 ProLAP: Probabilistic Language-Audio Pre-Training Toranosuke Manabe et.al. 2510.18423 null
2025-10-21 KrishokBondhu: A Retrieval-Augmented Voice-Based Agricultural Advisory Call Center for Bengali Farmers Mohd Ruhul Ameen et.al. 2510.18355 null
2025-10-21 ParaStyleTTS: Toward Efficient and Robust Paralinguistic Style Control for Expressive Text-to-Speech Generation Haowei Lou et.al. 2510.18308 link
2025-10-20 SARSteer: Safeguarding Large Audio Language Models via Safe-Ablated Refusal Steering Weilin Lin et.al. 2510.17633 null
2025-10-20 ImaGGen: Zero-Shot Generation of Co-Speech Semantic Gestures Grounded in Language and Image Input Hendric Voss et.al. 2510.17617 null
2025-10-20 Addressing Antisocial Behavior in Multi-Party Dialogs Through Multimodal Representation Learning Hajar Bakarou et.al. 2510.17289 null
2025-10-19 Investigating Safety Vulnerabilities of Large Audio-Language Models Under Speaker Emotional Variations Bo-Han Feng et.al. 2510.16893 link
2025-12-14 SAC: Neural Speech Codec with Semantic-Acoustic Dual-Stream Quantization Wenxi Chen et.al. 2510.16841 link
2025-10-19 End-to-end Listen, Look, Speak and Act Siyin Wang et.al. 2510.16756 null
2025-10-19 U-Codec: Ultra Low Frame-rate Neural Speech Codec for Fast High-fidelity Speech Generation Xusheng Yang et.al. 2510.16718 null
2025-10-19 Zero- and One-Shot Data Augmentation for Sentence-Level Dysarthric Speech Recognition in Constrained Scenarios Shiyao Wang et.al. 2510.16700 null
2025-10-18 Edge-Based Speech Transcription and Synthesis for Kinyarwanda and Swahili Languages Pacome Simon Mbonimpa et.al. 2510.16497 null
2025-10-18 Probing the Hidden Talent of ASR Foundation Models for L2 English Oral Assessment Fu-An Chao et.al. 2510.16387 null
2025-10-17 AsyncVoice Agent: Real-Time Explanation for LLM Planning and Reasoning Yueqian Lin et.al. 2510.16156 null
2025-10-17 Leveraging LLMs for Context-Aware Implicit Textual and Multimodal Hate Speech Detection Joshua Wolfe Brook et.al. 2510.15685 null
2025-10-17 SpikeVox: Towards Energy-Efficient Speech Therapy Framework with Spike-driven Generative Language Models Rachmad Vidya Wicaksana Putra et.al. 2510.15566 null
2025-10-17 Extending Audio Context for Long-Form Understanding in Large Audio-Language Models Yuatyong Chaichana et.al. 2510.15231 null
2025-10-17 LongCat-Audio-Codec: An Audio Tokenizer and Detokenizer Solution Designed for Speech Large Language Models Xiaohan Zhao et.al. 2510.15227 null
2025-10-16 OmniMotion: Multimodal Motion Generation with Continuous Masked Autoregression Zhe Li et.al. 2510.14954 null
2025-10-16 TASLA: Text-Aligned Speech Tokens with Multiple Layer-Aggregation Ming-Hao Hsu et.al. 2510.14934 null
2025-10-16 TRI-DEP: A Trimodal Comparative Study for Depression Detection Using Speech, Text, and EEG Annisaa Fitri Nurfidausi et.al. 2510.14922 null
2025-10-16 RLAIF-SPA: Optimizing LLM-based Emotional Speech Synthesis via RLAIF Qing Yang et.al. 2510.14628 null
2025-10-15 Closing the Gap Between Text and Speech Understanding in LLMs Santiago Cuervo et.al. 2510.13632 null
2025-10-15 Mismatch Aware Guidance for Robust Emotion Control in Auto-Regressive TTS Models Yizhou Peng et.al. 2510.13293 null
2025-10-23 Continuous-Token Diffusion for Speaker-Referenced TTS in Multimodal LLMs Xinlu He et.al. 2510.12995 null
2025-10-15 DiSTAR: Diffusion over a Scalable Token Autoregressive Representation for Speech Generation Yakun Song et.al. 2510.12210 null
2025-10-14 Understanding the Modality Gap: An Empirical Study on the Speech-Text Alignment Mechanism of Large Speech Language Models Bajian Xiang et.al. 2510.12116 null
2025-10-13 BridgeCode: A Dual Speech Representation Paradigm for Autoregressive Zero-Shot Text-to-Speech Synthesis Jingyuan Xing et.al. 2510.11646 null
2025-10-13 Perturbation Self-Supervised Representations for Cross-Lingual Emotion TTS: Stage-Wise Modeling of Emotion and Speaker Cheng Gong et.al. 2510.11124 null
2025-10-14 ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis Mohammad Javad Ranjbar Kalahroodi et.al. 2510.10774 null
2025-10-12 End-to-end Speech Recognition with similar length speech and text Peng Fan et.al. 2510.10453 null
2025-10-17 MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations Wenxiang Guo et.al. 2510.10396 link
2025-10-10 O_O-VC: Synthetic Data-Driven One-to-One Alignment for Any-to-Any Voice Conversion Huu Tuong Tu et.al. 2510.09061 link
2025-10-09 DialoSpeech: Dual-Speaker Dialogue Generation with LLM and Flow Matching Hanke Xie et.al. 2510.08373 null
2025-10-09 IntMeanFlow: Few-step Speech Generation with Integral Velocity Distillation Wei Wang et.al. 2510.07979 null
2025-10-09 Standard-to-Dialect Transfer Trends Differ across Text and Speech: A Case Study on Intent and Topic Classification in German Dialects Verena Blaschke et.al. 2510.07890 null
2025-10-08 Making Machines Sound Sarcastic: LLM-Enhanced and Retrieval-Guided Sarcastic Speech Synthesis Zhu Li et.al. 2510.07096 null
2025-10-08 Towards Responsible Evaluation for Text-to-Speech Yifan Yang et.al. 2510.06927 null
2025-10-08 XLSR-Kanformer: A KAN-Intergrated model for Synthetic Speech Detection Phuong Tuan Dat et.al. 2510.06706 null
2025-10-07 EverydayMMQA: A Multilingual and Multimodal Framework for Culturally Grounded Spoken Visual QA Firoj Alam et.al. 2510.06371 null
2025-10-08 TokenChain: A Discrete Speech Chain via Semantic Token Modeling Mingxuan Wang et.al. 2510.06201 null
2025-10-07 Latent Speech-Text Transformer Yen-Ju Lu et.al. 2510.06195 null
2025-10-07 ECTSpeech: Enhancing Efficient Speech Synthesis via Easy Consistency Tuning Tao Zhu et.al. 2510.05984 null
2025-10-07 Data-efficient Targeted Token-level Preference Optimization for LLM-based Text-to-Speech Rikuto Kotoge et.al. 2510.05799 null
2025-10-07 Sparse deepfake detection promotes better disentanglement Antoine Teissier et.al. 2510.05696 null
2025-10-09 Paper2Video: Automatic Video Generation from Scientific Papers Zeyu Zhu et.al. 2510.05096 link
2025-10-06 Speak, Edit, Repeat: High-Fidelity Voice Editing and Zero-Shot TTS with Cross-Attentive Mamba Baher Mohammad et.al. 2510.04738 null
2025-11-20 UniVoice: Unifying Autoregressive ASR and Flow-Matching based TTS with Large Language Models Wenhao Guan et.al. 2510.04593 link
2025-10-07 Synthetic Audio Forensics Evaluation (SAFE) Challenge Kirill Trapeznikov et.al. 2510.03387 null
2025-10-03 Flamed-TTS: Flow Matching Attention-Free Models for Efficient Generating and Dynamic Pacing Zero-shot Text-to-Speech Hieu-Nghia Huynh-Nguyen et.al. 2510.02848 null
2025-09-26 KAME: Tandem Architecture for Enhancing Knowledge in Real-Time Speech-to-Speech Conversational AI So Kuroki et.al. 2510.02327 null
2025-09-24 SpeechCT-CLIP: Distilling Text-Image Knowledge to Speech for Voice-Native Multimodal CT Analysis Lukas Buess et.al. 2510.02322 null
2025-10-02 Emotional Text-To-Speech Based on Mutual-Information-Guided Emotion-Timbre Disentanglement Jianing Yang et.al. 2510.01722 null
2025-09-30 BatonVoice: An Operationalist Framework for Enhancing Controllable Speech Synthesis with Linguistic Intelligence from LLMs Yue Wang et.al. 2509.26514 link
2025-09-30 Optimizing Speech Language Models for Acoustic Consistency Morteza Rohanian et.al. 2509.26276 null
2025-09-30 HiStyle: Hierarchical Style Embedding Predictor for Text-Prompt-Guided Controllable Speech Synthesis Ziyu Zhang et.al. 2509.25842 null
2025-09-30 LTA-L2S: Lexical Tone-Aware Lip-to-Speech Synthesis for Mandarin with Cross-Lingual Transfer Learning Kang Yang et.al. 2509.25670 null
2025-09-29 Emotion-Aligned Generation in Diffusion Text to Speech Models via Preference-Guided Optimization Jiacheng Shi et.al. 2509.25416 null
2025-09-29 MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech Chengyao Wang et.al. 2509.25131 link
2025-09-30 VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning Xin Cheng et.al. 2509.24773 null
2025-09-29 VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning Yixuan Zhou et.al. 2509.24650 null
2025-09-29 Word-Level Emotional Expression Control in Zero-Shot Text-to-Speech Synthesis Tianrui Wang et.al. 2509.24629 null
2025-09-29 ISSE: An Instruction-Guided Speech Style Editing Dataset And Benchmark Yun Chen et.al. 2509.24570 null
2025-09-29 UniFlow-Audio: Unified Flow Matching for Audio Generation from Omni-Modalities Xuenan Xu et.al. 2509.24391 link
2025-09-28 Generalizable Speech Deepfake Detection via Information Bottleneck Enhanced Adversarial Alignment Pu Huang et.al. 2509.23618 null
2025-09-27 BFA: Real-time Multilingual Text-to-speech Forced Alignment Abdul Rehman et.al. 2509.23147 null
2025-09-26 ArFake: A Multi-Dialect Benchmark and Baselines for Arabic Spoof-Speech Detection Mohamed Maged et.al. 2509.22808 null
2025-09-25 DiaMoE-TTS: A Unified IPA-Based Dialect TTS Framework with Mixture-of-Experts and Parameter-Efficient Zero-Shot Adaptation Ziqi Chen et.al. 2509.22727 null
2025-09-26 Semantic-VAE: Semantic-Alignment Latent Representation for Better Speech Synthesis Zhikang Niu et.al. 2509.22167 null
2025-09-26 Speaker Anonymisation for Speech-based Suicide Risk Detection Ziyun Cui et.al. 2509.22148 null
2025-09-26 Comprehend and Talk: Text to Speech Synthesis via Dual Language Modeling Junjie Cao et.al. 2509.22062 null
2025-09-26 Align2Speak: Improving TTS for Low Resource Languages via ASR-Guided Online Preference Optimization Shehzeen Hussain et.al. 2509.21718 null
2025-09-25 UniSS: Unified Expressive Speech-to-Speech Translation with Your Voice Sitong Cheng et.al. 2509.21144 null
2025-09-27 i-LAVA: Insights on Low Latency Voice-2-Voice Architecture for Agents Anupam Purwar et.al. 2509.20971 null
2025-09-26 SPADE: Structured Pruning and Adaptive Distillation for Efficient LLM-TTS Tan Dat Nguyen et.al. 2509.20802 null
2025-09-24 Objective Evaluation of Prosody and Intelligibility in Speech Synthesis via Conditional Prediction of Discrete Tokens Ismail Rasim Ulgen et.al. 2509.20485 null
2025-09-20 Beyond Global Emotion: Fine-Grained Emotional Speech Synthesis with Dynamic Word-Level Modulation Sirui Wang et.al. 2509.20378 null
2025-09-24 OLaPh: Optimal Language Phonemizer Johannes Wirth et.al. 2509.20086 null
2025-09-25 Measuring Prosody Diversity in Zero-Shot TTS: A New Metric, Benchmark, and Exploration Yifan Yang et.al. 2509.19928 null
2025-09-24 CoMelSinger: Discrete Token-Based Zero-Shot Singing Synthesis With Structured Melody Control and Guidance Junchuan Zhao et.al. 2509.19883 null
2025-09-24 Eliminating stability hallucinations in llm-based tts models via attention guidance ShiMing Wang et.al. 2509.19852 null
2025-09-24 Efficient Speech Watermarking for Speech Synthesis via Progressive Knowledge Distillation Yang Cui et.al. 2509.19812 null
2025-09-24 PART: Progressive Alignment Representation Training for Multilingual Speech-To-Text with LLMs Pei Zhang et.al. 2509.19745 null
2025-09-24 Selective Classifier-free Guidance for Zero-shot Text-to-speech John Zheng et.al. 2509.19668 null
2025-09-23 HD-PPT: Hierarchical Decoding of Content- and Prompt-Preference Tokens for Instruction-based TTS Sihang Nie et.al. 2509.19001 null
2025-09-23 Direct Preference Optimization for Speech Autoregressive Diffusion Models Zhijun Liu et.al. 2509.18928 null
2025-09-23 Group Relative Policy Optimization for Text-to-Speech with Large Language Models Chang Liu et.al. 2509.18798 null
2025-09-23 Explore the Reinforcement Learning for the LLM based ASR and TTS system Changfeng Gao et.al. 2509.18569 null
2025-09-23 No Verifiable Reward for Prosody: Toward Preference-Guided Prosody Learning in TTS Seungyoun Shin et.al. 2509.18531 null
2025-10-13 Discrete-Time Diffusion-Like Models for Speech Synthesis Xiaozhou Tan et.al. 2509.18470 null
2025-09-22 TMD-TTS: A Unified Tibetan Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation Yutong Liu et.al. 2509.18060 null
2025-09-22 Nord-Parl-TTS: Finnish and Swedish TTS Dataset from Parliament Speech Zirui Li et.al. 2509.17988 null
2025-09-22 Audiobook-CC: Controllable Long-context Speech Generation for Multicast Audiobook Min Liu et.al. 2509.17516 null
2025-09-29 Sidon: Fast and Robust Open-Source Multilingual Speech Restoration for Large-scale Dataset Cleansing Wataru Nakata et.al. 2509.17052 link
2025-09-21 Bridging the gap between training and inference in LM-based TTS models Ruonan Zhang et.al. 2509.17021 null
2025-09-21 MBCodec: Thorough disentangle for high-fidelity audio compression Ruonan Zhang et.al. 2509.17006 null
2025-09-19 Fed-PISA: Federated Voice Cloning via Personalized Identity-Style Adaptation Qi Wang et.al. 2509.16010 null
2025-09-19 VoXtream: Full-Stream Text-to-Speech with Extremely Low Latency Nikita Torgashov et.al. 2509.15969 link
2025-09-19 Deep Dubbing: End-to-End Auto-Audiobook System with Text-to-Timbre and Context-Aware Instruct-TTS Ziqi Dai et.al. 2509.15845 null
2025-09-19 LibriTTS-VI: A Public Corpus and Novel Methods for Efficient Voice Impression Control Junki Ohmura et.al. 2509.15626 null
2025-09-19 Beyond Video-to-SFX: Video to Audio Synthesis with Environmentally Aware Speech Xinlei Niu et.al. 2509.15492 null
2025-09-18 A Novel Semantic Compression Approach for Ultra-low Bandwidth Voice Communication Ryan Collette et.al. 2509.15462 null
2025-09-23 Frustratingly Easy Data Augmentation for Low-Resource ASR Katsumi Ibaraki et.al. 2509.15373 null
2025-09-18 Emotion-Aware Speech Generation with Character-Specific Voices for Comics Zhiwen Qian et.al. 2509.15253 null
2025-09-18 Real-Time Streaming Mel Vocoding with Generative Flow Matching Simon Welker et.al. 2509.15085 null
2025-09-18 MELA-TTS: Joint transformer-diffusion model with representation alignment for speech synthesis Keyu An et.al. 2509.14784 null
2025-09-19 DAIEN-TTS: Disentangled Audio Infilling for Environment-Aware Text-to-Speech Synthesis Ye-Xin Lu et.al. 2509.14684 null
2025-09-18 Stochastic Clock Attention for Aligning Continuous and Ordered Sequences Hyungjoon Soh et.al. 2509.14678 null
2025-09-20 Cross-Lingual F5-TTS: Towards Language-Agnostic Voice Cloning and Speech Synthesis Qingyu Liu et.al. 2509.14579 null
2025-09-17 SpeechOp: Inference-Time Task Composition for Generative Speech Processing Justin Lovelace et.al. 2509.14298 null
2025-10-01 SpeechWeave: Diverse Multilingual Synthetic Text & Audio Data Generation Pipeline for Training Text to Speech Models Karan Dua et.al. 2509.14270 null
2025-09-17 CS-FLEURS: A Massively Multilingual and Code-Switched Speech Dataset Brian Yan et.al. 2509.14161 null
2025-09-22 Do You Hear What I Mean? Quantifying the Instruction-Perception Gap in Instruction-Guided Expressive Text-To-Speech Systems Yi-Cheng Lin et.al. 2509.13989 null
2025-10-15 MSR-Codec: A Low-Bitrate Multi-Stream Residual Codec for High-Fidelity Speech Generation with Information Disentanglement Jingyu Li et.al. 2509.13068 null
2025-09-16 A Lightweight Pipeline for Noisy Speech Voice Cloning and Accurate Lip Sync Synthesis Javeria Amir et.al. 2509.12831 null
2025-10-16 Preservation of Language Understanding Capabilities in Speech-aware Large Language Models Marek Kubis et.al. 2509.12171 null
2025-09-29 FuseCodec: Semantic-Contextual Fusion and Supervision for Neural Codecs Md Mubtasim Ahasan et.al. 2509.11425 null
2025-09-14 Length-Aware Rotary Position Embedding for Text-Speech Alignment Hyeongju Kim et.al. 2509.11084 null
2025-09-12 WhisTLE: Deeply Supervised, Text-Only Domain Adaptation for Pretrained Speech Recognition Transformers Akshat Pandey et.al. 2509.10452 null
2025-09-12 Towards Data Drift Monitoring for Speech Deepfake Detection in the context of MLOps Xin Wang et.al. 2509.10086 null
2025-09-11 DiTReducio: A Training-Free Acceleration for DiT-Based TTS via Progressive Calibration Yanru Huo et.al. 2509.09748 null
2025-09-12 DiFlow-TTS: Discrete Flow Matching with Factorized Speech Tokens for Low-Latency Zero-Shot Text-To-Speech Ngoc-Son Nguyen et.al. 2509.09631 null
2025-09-11 HISPASpoof: A New Dataset For Spanish Speech Forensics Maria Risques et.al. 2509.09155 null
2025-09-29 Streaming Sequence-to-Sequence Learning with Delayed Streams Modeling Neil Zeghidour et.al. 2509.08753 null
2025-09-09 ParCzech4Speech: A New Speech Corpus Derived from Czech Parliamentary Data Vladislav Stankov et.al. 2509.06675 null
2025-08-19 Integrating Feedback Loss from Bi-modal Sarcasm Detector for Sarcastic Speech Synthesis Zhu Li et.al. 2508.13028 null
2025-10-07 EmoSSLSphere: Multilingual Emotional Speech Synthesis with Spherical Vectors and Discrete Speech Tokens Joonyong Park et.al. 2508.11273 null
2025-08-08 UniTalker: Conversational Speech-Visual Synthesis Yifan Hu et.al. 2508.04585 null
2025-08-29 Parallel GPT: Harmonizing the Independence and Interdependence of Acoustic and Semantic Information for Zero-Shot Text-to-Speech Jingyuan Xing et.al. 2508.04141 null
2025-07-23 AI Telephone Surveying: Automating Quantitative Data Collection with an AI Interviewer Danny D. Leybzon et.al. 2507.17718 null
2025-07-23 BoSS: Beyond-Semantic Speech Qing Wang et.al. 2507.17563 null
2025-07-22 SplitMeanFlow: Interval Splitting Consistency in Few-Step Generative Modeling Yi Guo et.al. 2507.16884 null
2025-07-15 Evaluating Speech-to-Text x LLM x Text-to-Speech Combinations for AI Interview Systems Nima Yazdani et.al. 2507.16835 null
2025-07-21 A2TTS: TTS for Low Resource Indian Languages Ayush Singh Bhadoriya et.al. 2507.15272 null
2025-07-21 EchoVoices: Preserving Generational Voices and Memories for Seniors and Children Haiying Xu et.al. 2507.15221 null
2025-07-22 Hear Your Code Fail, Voice-Assisted Debugging for Python Sayed Mahbub Hasan Amiri et.al. 2507.15007 null
2025-07-20 DMOSpeech 2: Reinforcement Learning for Duration Prediction in Metric-Optimized Speech Synthesis Yinghao Aaron Li et.al. 2507.14988 null
2025-07-17 A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models Kirill Borodin et.al. 2507.13563 null
2025-07-17 NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech Maksim Borisov et.al. 2507.13155 null
2025-07-17 Intelligent Virtual Sonographer (IVS): Enhancing Physician-Robot-Patient Communication Tianyu Song et.al. 2507.13052 null
2025-07-17 Enkidu: Universal Frequential Perturbation for Real-Time Audio Privacy Protection against Voice Deepfakes Zhou Feng et.al. 2507.12932 null
2025-07-16 Quantize More, Lose Less: Autoregressive Generation from Residually Quantized Speech Representations Yichen Han et.al. 2507.12197 null
2025-07-16 EME-TTS: Unlocking the Emphasis and Emotion Link in Speech Synthesis Haoxun Li et.al. 2507.12015 null
2025-07-15 Towards Scalable AASIST: Refining Graph Attention for Speech Deepfake Detection Ivan Viakhirev et.al. 2507.11777 null
2025-07-15 P.808 Multilingual Speech Enhancement Testing: Approach and Results of URGENT 2025 Challenge Marvin Sach et.al. 2507.11306 null
2025-07-20 Supporting SENCOTEN Language Documentation Efforts with Automatic Speech Recognition Mengzhe Geng et.al. 2507.10827 null
2025-07-14 An Empirical Evaluation of AI-Powered Non-Player Characters' Perceived Realism and Performance in Virtual Reality Environments Mikko Korkiakoski et.al. 2507.10469 null
2025-07-12 ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching Han Zhu et.al. 2507.09318 null
2025-07-12 Voice Conversion for Lombard Speaking Style with Implicit and Explicit Acoustic Feature Conditioning Dominika Woszczyk et.al. 2507.09310 null
2025-07-12 ClaritySpeech: Dementia Obfuscation in Speech Dominika Woszczyk et.al. 2507.09282 null
2025-07-11 SemAlignVC: Enhancing zero-shot timbre conversion using semantic alignment Shivam Mehta et.al. 2507.09070 null
2025-07-11 Exploiting Leaderboards for Large-Scale Distribution of Malicious Models Anshuman Suri et.al. 2507.08983 null
2025-07-06 A Hybrid Machine Learning Framework for Optimizing Crop Selection via Agronomic and Economic Forecasting Niranjan Mallikarjun Sindhur et.al. 2507.08832 null
2025-07-11 Unlocking Speech Instruction Data Potential with Query Rewriting Yonghua Hei et.al. 2507.08603 null
2025-07-11 MIDI-VALLE: Improving Expressive Piano Performance Synthesis Through Neural Codec Language Modelling Jingjing Tang et.al. 2507.08530 null
2025-07-11 Active Learning for Text-to-Speech Synthesis with Informative Sample Collection Kentaro Seki et.al. 2507.08319 null
2025-07-05 RepeaTTS: Towards Feature Discovery through Repeated Fine-Tuning Atli Sigurgeirsson et.al. 2507.08012 null
2025-07-10 SecureSpeech: Prompt-based Speaker and Content Protection Belinda Soh Hui Hui et.al. 2507.07799 null
2025-07-09 Learning Japanese with Jouzu: Interaction Outcomes with Stylized Dialogue Fictional Agents Zackary Rackauckas et.al. 2507.06483 null
2025-07-08 Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis Xintong Hu et.al. 2507.06116 null
2025-07-08 Differentiable Reward Optimization for LLM based TTS system Changfeng Gao et.al. 2507.05911 null
2025-07-08 OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model Chen Wang et.al. 2507.05177 null
2025-07-07 Multi-Step Prediction and Control of Hierarchical Emotion Distribution in Text-to-Speech Synthesis Sho Inoue et.al. 2507.04598 null
2025-07-06 TTS-CtrlNet: Time varying emotion aligned text-to-speech generation with ControlNet Jaeseok Jeong et.al. 2507.04349 null
2025-07-05 PresentAgent: Multimodal Agent for Presentation Video Generation Jingwei Shi et.al. 2507.04036 null
2025-07-08 Prosody Labeling with Phoneme-BERT and Speech Foundation Models Tomoki Koriyama et.al. 2507.03912 null
2025-07-05 Traceable TTS: Toward Watermark-Free TTS with Strong Traceability Yuxiang Zhao et.al. 2507.03887 null
2025-07-14 DeepGesture: A conversational gesture synthesis system based on emotions and semantics Thanh Hoang-Minh et.al. 2507.03147 null
2025-07-03 Open-Source System for Multilingual Translation and Cloned Speech Synthesis Mateo Cámara et.al. 2507.02530 null
2025-07-03 JoyTTS: LLM-based Spoken Chatbot With Voice Cloning Fangru Zhou et.al. 2507.02380 null
2025-07-02 Analyzing and Improving Speaker Similarity Assessment for Speech Synthesis Marc-André Carbonneau et.al. 2507.02176 null
2025-07-08 Pronunciation Editing for Finnish Speech using Phonetic Posteriorgrams Zirui Li et.al. 2507.02115 null
2025-07-02 A Dataset for Automatic Assessment of TTS Quality in Spanish Alejandro Sosa Welford et.al. 2507.01805 null
2025-07-02 Voice Conversion for Likability Control via Automated Rating of Speech Synthesis Corpora Hitoshi Suda et.al. 2507.01356 null
2025-07-08 SpeechAccentLLM: A Unified Framework for Foreign Accent Conversion and Text to Speech Zhuangfei Cheng et.al. 2507.01348 null
2025-07-02 Multi-interaction TTS toward professional recording reproduction Hiroki Kanagawa et.al. 2507.00808 null
2025-07-18 MuteSwap: Visual-informed Silent Video Identity Conversion Yifan Liu et.al. 2507.00498 null
2025-06-30 Collecting, Curating, and Annotating Good Quality Speech deepfake dataset for Famous Figures: Process and Challenges Hashim Ali et.al. 2507.00324 null
2025-06-30 Investigating Stochastic Methods for Prosody Modeling in Speech Synthesis Paul Mayer et.al. 2507.00227 null
2025-06-30 JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching Mingi Kwon et.al. 2506.23552 null
2025-06-29 You Sound a Little Tense: L2 Tailored Clear TTS Using Durational Vowel Properties Paige Tuttösí et.al. 2506.23367 null
2025-06-27 Evaluating Pointing Gestures for Target Selection in Human-Robot Collaboration Noora Sassali et.al. 2506.22116 null
2025-06-27 Robust and Efficient Autoregressive Speech Synthesis with Dynamic Chunk-wise Prediction Policy Bohan Li et.al. 2506.22023 null
2025-06-23 IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech Siyi Zhou et.al. 2506.21619 null
2025-06-27 A Multi-Stage Framework for Multimodal Controllable Speech Synthesis Rui Niu et.al. 2506.20945 null
2025-06-25 An Exploration of ECAPA-TDNN and x-vector Speaker Representations in Zero-shot Multi-speaker TTS Marie Kunešová et.al. 2506.20190 null
2025-06-24 TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems Christoph Minixhofer et.al. 2506.19441 null
2025-06-23 Selecting N-lowest scores for training MOS prediction models Yuto Kondo et.al. 2506.18326 null
2025-06-23 Rethinking Mean Opinion Scores in Speech Quality Assessment: Aggregation through Quantized Distribution Fitting Yuto Kondo et.al. 2506.18307 null
2025-07-15 JIS: A Speech Corpus of Japanese Idol Speakers with Various Speaking Styles Yuto Kondo et.al. 2506.18296 null
2025-06-21 OpusLM: A Family of Open Unified Speech Language Models Jinchuan Tian et.al. 2506.17611 null
2025-06-20 RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching Hyun Joon Park et.al. 2506.16741 null
2025-06-20 LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization Daejin Jo et.al. 2506.16738 null
2025-06-20 V-CASS: Vision-context-aware Expressive Speech Synthesis for Enhancing User Understanding of Videos Qixin Wang et.al. 2506.16716 null
2025-06-19 Streaming Non-Autoregressive Model for Accent Conversion and Pronunciation Improvement Tuan-Nam Nguyen et.al. 2506.16580 null
2025-06-19 InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems Kexin Huang et.al. 2506.16381 link
2025-06-19 Optimizing Multilingual Text-To-Speech with Accents & Emotions Pranav Pawar et.al. 2506.16310 null
2025-06-18 TTSOps: A Closed-Loop Corpus Optimization Framework for Training Multi-Speaker TTS Models from Dark Data Kentaro Seki et.al. 2506.15614 null
2025-06-18 PredGen: Accelerated Inference of Large Language Models through Input-Time Speculation for Real-Time Speech Interaction Shufan Li et.al. 2506.15556 null
2025-06-18 EmojiVoice: Towards long-term controllable expressivity in robot speech Paige Tuttösí et.al. 2506.15085 null
2025-06-18 An accurate and revised version of optical character recognition-based speech synthesis using LabVIEW Prateek Mehta et.al. 2506.15029 null
2025-06-17 Investigation of Zero-shot Text-to-Speech Models for Enhancing Short-Utterance Speaker Verification Yiyang Zhao et.al. 2506.14226 null
2025-06-17 Pushing the Performance of Synthetic Speech Detection with Kolmogorov-Arnold Networks and Self-Supervised Learning Models Tuan Dat Phuong et.al. 2506.14153 link
2025-06-16 EmoNews: A Spoken Dialogue System for Expressive News Conversations Ryuki Matsuura et.al. 2506.13894 link
2025-07-08 Multimodal Integration Challenges in Emotionally Expressive Child Avatars for Training Applications Pegah Salehi et.al. 2506.13477 null
2025-06-20 ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching Han Zhu et.al. 2506.13053 link
2025-06-14 StreamMel: Real-Time Zero-shot Text-to-Speech via Interleaved Continuous Autoregressive Modeling Hui Wang et.al. 2506.12570 null
2025-06-14 Phonikud: Hebrew Grapheme-to-Phoneme Conversion for Real-Time Text-to-Speech Yakov Kolani et.al. 2506.12311 null
2025-07-08 S2ST-Omni: An Efficient Multilingual Speech-to-Speech Translation Framework via Seamless Speech-Text Alignment and Progressive Fine-tuning Yu Pan et.al. 2506.11160 null
2025-06-16 A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data Cheng-Kang Chou et.al. 2506.11130 null
2025-06-10 GUIRoboTron-Speech: Towards Automated GUI Agents Based on Speech Instructions Wenkang Han et.al. 2506.11127 null
2025-06-10 ASRJam: Human-Friendly AI Speech Jamming to Prevent Automated Phone Scams Freddie Grabovski et.al. 2506.11125 null
2025-06-05 Intelligibility of Text-to-Speech Systems for Mathematical Expressions Sujoy Roychowdhury et.al. 2506.11086 null
2025-06-12 Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMs Hayato Futami et.al. 2506.10299 null
2025-07-10 UmbraTTS: Adapting Text-to-Speech to Environmental Contexts with Flow Matching Neta Glazer et.al. 2506.09874 null
2025-06-15 EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection Christoph Schuhmann et.al. 2506.09827 null
2025-06-11 OmniDRCA: Parallel Speech-Text Foundation Model via Dual-Resolution Speech Representations and Contrastive Alignment Chao-Hong Tan et.al. 2506.09349 link
2025-06-11 Ming-Omni: A Unified Multimodal Model for Perception and Generation Inclusion AI et.al. 2506.09344 link
2025-06-13 Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model Ailin Huang et.al. 2506.08967 null
2025-06-10 A Review on Score-based Generative Models for Audio Applications Ge Zhu et.al. 2506.08457 null
2025-06-09 Seeing Voices: Generating A-Roll Video from Audio with Mirage Aditi Sundararaman et.al. 2506.08279 null
2025-06-09 Transcript-Prompted Whisper with Dictionary-Enhanced Decoding for Japanese Speech Annotation Rui Hu et.al. 2506.07646 null
2025-06-07 SynHate: Detecting Hate Speech in Synthetic Deepfake Audio Rishabh Ranjan et.al. 2506.06772 null
2025-06-09 Voice Impression Control in Zero-Shot TTS Keinichi Fujita et.al. 2506.05688 null
2025-05-28 Speaking images. A novel framework for the automated self-description of artworks Valentine Bernasconi et.al. 2506.05368 null
2025-06-05 Grapheme-Coherent Phonemic and Prosodic Annotation of Speech by Implicit and Explicit Grapheme Conditioning Hien Ohnaka et.al. 2506.04527 null
2025-06-04 Can we reconstruct a dysarthric voice with the large speech model Parler TTS? Ariadna Sanchez et.al. 2506.04397 null
2025-06-04 HiFiTTS-2: A Large-Scale High Bandwidth Speech Dataset Ryan Langman et.al. 2506.04152 null
2025-07-23 UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation Jinting Wang et.al. 2506.04134 null
2025-06-04 A Novel Data Augmentation Approach for Automatic Speaking Assessment on Opinion Expressions Chung-Chun Wang et.al. 2506.04077 null
2025-06-04 Kinship in Speech: Leveraging Linguistic Relatedness for Zero-Shot TTS in Indian Languages Utkarsh Pathak et.al. 2506.03884 null
2025-06-04 Mark My Words: A Robust Multilingual Model for Punctuation in Text and Speech Transcripts Sidharth Pulipaka et.al. 2506.03793 null
2025-06-04 Comparative Analysis of Fast and High-Fidelity Neural Vocoders for Low-Latency Streaming Synthesis in Resource-Constrained Environments Reo Yoneyama et.al. 2506.03554 null
2025-06-04 BitTTS: Highly Compact Text-to-Speech Using 1.58-bit Quantization and Weight Indexing Masaya Kawamura et.al. 2506.03515 null
2025-06-03 Controllable Text-to-Speech Synthesis with Masked-Autoencoded Style-Rich Representation Yongqi Wang et.al. 2506.02997 null
2025-06-03 Towards a Japanese Full-duplex Spoken Dialogue System Atsumoto Ohashi et.al. 2506.02979 null
2025-06-03 CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech Helin Wang et.al. 2506.02863 null
2025-06-03 Prompt-Unseen-Emotion: Zero-shot Expressive Speech Synthesis with Prompt-LLM Contextual Knowledge for Mixed Emotions Xiaoxue Gao et.al. 2506.02742 null
2025-06-03 Trusted Fake Audio Detection Based on Dirichlet Distribution Chi Ding et.al. 2506.02401 null
2025-06-02 SALF-MOS: Speaker Agnostic Latent Features Downsampled for MOS Prediction Saurabh Agrawal et.al. 2506.02082 null
2025-06-02 Speech-to-Speech Translation Pipelines for Conversations in Low-Resource Languages Andrei Popescu-Belis et.al. 2506.01406 null
2025-06-02 Zero-Shot Text-to-Speech for Vietnamese Thi Vu et.al. 2506.01322 null
2025-06-02 CleanS2S: Single-file Framework for Proactive Speech-to-Speech Interaction Yudong Lu et.al. 2506.01268 null
2025-06-02 WCTC-Biasing: Retraining-free Contextual Biasing ASR with Wildcard CTC-based Keyword Spotting and Inter-layer Biasing Yu Nakagome et.al. 2506.01263 null
2025-06-01 DS-TTS: Zero-Shot Speaker Style Adaptation from Voice Clips via Dynamic Dual-Style Feature Modulation Ming Meng et.al. 2506.01020 null
2025-06-01 Counterfactual Activation Editing for Post-hoc Prosody and Mispronunciation Correction in TTS Models Kyowoon Lee et.al. 2506.00832 null
2025-05-31 Chain-of-Thought Training for Open E2E Spoken Dialogue Systems Siddhant Arora et.al. 2506.00722 null
2025-05-30 Werewolf: A Straightforward Game Framework with TTS for Improved User Engagement Qihui Fan et.al. 2506.00160 null
2025-05-30 SwitchLingua: The First Large-Scale Multilingual and Multi-Ethnic Code-Switching Dataset Peng Xie et.al. 2506.00087 null
2025-05-30 Speech Token Prediction via Compressed-to-fine Language Modeling for Speech Generation Wenrui Liu et.al. 2505.24496 null
2025-05-30 DS-Codec: Dual-Stage Training with Mirror-to-NonMirror Architecture Switching for Speech Codec Peijie Chen et.al. 2505.24314 null
2025-05-29 Can Emotion Fool Anti-spoofing? Aurosweta Mahapatra et.al. 2505.23962 null
2025-05-29 Few-Shot Speech Deepfake Detection Adaptation with Gaussian Processes Neta Glazer et.al. 2505.23619 link
2025-05-29 EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge Ruskin Raj Manku et.al. 2505.23009 link
2025-05-29 LLM-Synth4KWS: Scalable Automatic Generation and Synthesis of Confusable Data for Custom Keyword Spotting Pai Zhu et.al. 2505.22995 null
2025-05-28 BinauralFlow: A Causal and Streamable Approach for High-Quality Binaural Speech Synthesis with Flow Matching Models Susan Liang et.al. 2505.22865 null
2025-05-28 Tell me Habibi, is it Real or Fake? Kartik Kuckreja et.al. 2505.22581 null
2025-05-28 A Linguistically Motivated Analysis of Intonational Phrasing in Text-to-Speech Systems: Revealing Gaps in Syntactic Sensitivity Charlotte Pouw et.al. 2505.22236 null
2025-06-29 Spotlight-TTS: Spotlighting the Style via Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech Nam-Gyu Kim et.al. 2505.20868 null
2025-05-26 ArVoice: A Multi-Speaker Dataset for Arabic Speech Synthesis Hawau Olamide Toyin et.al. 2505.20506 null
2025-06-04 Accelerating Flow-Matching-Based Text-to-Speech via Empirically Pruned Step Sampling Qixi Zheng et.al. 2505.19931 null
2025-05-26 DiEmo-TTS: Disentangled Emotion Representations via Self-Supervised Distillation for Cross-Speaker Emotion Transfer in Text-to-Speech Deok-Hyeon Cho et.al. 2505.19687 null
2025-05-26 KIT's Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization Zhaolin Li et.al. 2505.19679 null
2025-06-02 Zero-Shot Streaming Text to Speech Synthesis with Transducer and Auto-Regressive Modeling Haiyang Sun et.al. 2505.19669 null
2025-05-30 Accelerating Diffusion-based Text-to-Speech Model Training with Dual Modality Alignment Jeongsoo Choi et.al. 2505.19595 link
2025-05-26 GSA-TTS : Toward Zero-Shot Speech Synthesis based on Gradual Style Adaptor Seokgi Lee et.al. 2505.19384 null
2025-05-25 SpeakStream: Streaming Text-to-Speech with Interleaved Data Richard He Bai et.al. 2505.19206 null
2025-05-25 CloneShield: A Framework for Universal Perturbation Against Zero-Shot Voice Cloning Renyuan Li et.al. 2505.19119 null
2025-05-27 Revival with Voice: Multi-modal Controllable Text-to-Speech Synthesis Minsu Kim et.al. 2505.18972 null
2025-05-27 RASMALAI: Resources for Adaptive Speech Modeling in Indian Languages with Accents and Intonations Ashwin Sankar et.al. 2505.18609 null
2025-05-24 MPE-TTS: Customized Emotion Zero-Shot Text-To-Speech Using Multi-Modal Prompt Zhichao Wu et.al. 2505.18453 null
2025-05-27 CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training Zhihao Du et.al. 2505.17589 null
2025-05-23 What You Read Isn't What You Hear: Linguistic Sensitivity in Deepfake Speech Detection Binh Nguyen et.al. 2505.17513 null
2025-05-23 UniTTS: An end-to-end TTS system without decoupling of acoustic and semantic information Rui Wang et.al. 2505.17426 link
2025-05-23 Speechless: Speech Instruction Training Without Speech for Low Resource Languages Alan Dao et.al. 2505.17417 link
2025-05-22 Benchmarking Expressive Japanese Character Text-to-Speech with VITS and Style-BERT-VITS2 Zackary Rackauckas et.al. 2505.17320 null
2025-05-21 Voicing Personas: Rewriting Persona Descriptions into Style Prompts for Controllable Text-to-Speech Yejin Lee et.al. 2505.17093 null
2025-06-13 Impact of Frame Rates on Speech Tokenizer: A Case Study on Mandarin and English Haoyang Zhang et.al. 2505.17076 null
2025-05-22 From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition Tianduo Wang et.al. 2505.16972 link
2025-05-21 MIKU-PAL: An Automated and Standardized Multi-Modal Method for Speech Paralinguistic and Affect Labeling Yifan Cheng et.al. 2505.15772 null
2025-05-21 Segmentation-Variant Codebooks for Preservation of Paralinguistic and Prosodic Information Nicholas Sanders et.al. 2505.15667 null
2025-05-21 Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models Zirui Song et.al. 2505.15406 link
2025-05-21 Prosody-Adaptable Audio Codecs for Zero-Shot Voice Conversion via In-Context Learning Junchuan Zhao et.al. 2505.15402 null
2025-06-03 Accelerating Autoregressive Speech Synthesis Inference With Speech Speculative Decoding Zijian Lin et.al. 2505.15380 null
2025-05-20 Pairwise Evaluation of Accent Similarity in Speech Synthesis Jinzuomu Zhong et.al. 2505.14410 null
2025-05-20 FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation Yutong Liu et.al. 2505.14351 null
2025-05-21 AudioJailbreak: Jailbreak Attacks against End-to-End Large Audio-Language Models Guangke Chen et.al. 2505.14103 null
2025-05-20 SeamlessEdit: Background Noise Aware Zero-Shot Speech Editing with in-Context Enhancement Kuan-Yu Chen et.al. 2505.14066 null
2025-05-22 Improving Noise Robustness of LLM-based Zero-shot TTS via Discrete Acoustic Token Denoising Ye-Xin Lu et.al. 2505.13830 null
2025-05-29 Articulatory Feature Prediction from Surface EMG during Speech Production Jihwan Lee et.al. 2505.13814 null
2025-05-19 Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space Zhengrui Ma et.al. 2505.13181 link
2025-05-19 OZSpeech: One-step Zero-shot Speech Synthesis with Learned-Prior-Conditioned Flow Matching Hieu-Nghia Huynh-Nguyen et.al. 2505.12800 null
2025-05-19 RoVo: Robust Voice Protection Against Unauthorized Speech Synthesis with Embedding-Level Perturbations Seungmin Kim et.al. 2505.12686 null
2025-05-19 Chain-Talker: Chain Understanding and Rendering for Empathetic Conversational Speech Synthesis Yifan Hu et.al. 2505.12597 link
2025-05-18 Shallow Flow Matching for Coarse-to-Fine Text-to-Speech Synthesis Dong Yang et.al. 2505.12226 null
2025-05-16 Audio Turing Test: Benchmarking the Human-likeness of Large Language Model-based Text-to-Speech Systems in Chinese Xihuai Wang et.al. 2505.11200 null
2025-05-16 BanglaFake: Constructing and Evaluating a Specialized Bengali Deepfake Audio Dataset Istiaq Ahmed Fahad et.al. 2505.10885 link
2025-05-15 UDDETTS: Unifying Discrete and Dimensional Emotions for Controllable Emotional Text-to-Speech Jiaxuan Liu et.al. 2505.10599 null
2025-05-14 DPN-GAN: Inducing Periodic Activations in Generative Adversarial Networks for High-Fidelity Audio Synthesis Zeeshan Ahmad et.al. 2505.09091 null
2025-05-13 Investigating self-supervised features for expressive, multilingual voice conversion Álvaro Martín-Cortinas et.al. 2505.08278 null
2025-05-12 MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder Bowen Zhang et.al. 2505.07916 null
2025-05-13 Lightweight End-to-end Text-to-speech Synthesis for low resource on-device applications Biel Tura Vecino et.al. 2505.07701 null
2025-05-10 VTutor: An Animated Pedagogical Agent SDK that Provide Real Time Multi-Model Feedback Eason Chen et.al. 2505.06676 null
2025-05-10 Bridging the Gap: An Intermediate Language for Enhanced and Cost-Effective Grapheme-to-Phoneme Conversion with Homographs with Multiple Pronunciations Disambiguation Abbas Bertina et.al. 2505.06599 null
2025-05-15 FlexSpeech: Towards Stable, Controllable and Expressive Text-to-Speech Linhan Ma et.al. 2505.05159 null
2025-05-08 Teochew-Wild: The First In-the-wild Teochew Dataset with Orthographic Annotations Linrong Pan et.al. 2505.05056 null
2025-05-08 A Multi-Agent AI Framework for Immersive Audiobook Production through Spatial Audio and Neural Narration Shaja Arul Selvamani et.al. 2505.04885 null
2025-06-06 Advancing Zero-shot Text-to-Speech Intelligibility across Diverse Domains via Preference Alignment Xueyao Zhang et.al. 2505.04113 null
2025-05-06 VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model Zuwei Long et.al. 2505.03739 link
2025-05-13 SonicRAG : High Fidelity Sound Effects Synthesis Based on Retrival Augmented Generation Yu-Ren Guo et.al. 2505.03244 null
2025-05-05 Generating Narrated Lecture Videos from Slides with Synchronized Highlights Alexander Holmberg et.al. 2505.02966 null
2025-05-05 Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play Yemin Shi et.al. 2505.02707 link
2025-05-05 LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis Qingkai Fang et.al. 2505.02625 link
2025-04-30 Sadeed: Advancing Arabic Diacritization Through Small Language Model Zeina Aldallal et.al. 2504.21635 null
2025-04-29 AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation Jeongsoo Choi et.al. 2504.20629 null
2025-05-28 ClonEval: An Open Voice Cloning Benchmark Iwona Christop et.al. 2504.20581 link
2025-05-02 Towards Flow-Matching-based TTS without Classifier-Free Guidance Yuzhe Liang et.al. 2504.20334 null
2025-04-27 Generative Adversarial Network based Voice Conversion: Techniques, Challenges, and Recent Advancements Sandipan Dhar et.al. 2504.19197 null
2025-04-27 Muyan-TTS: A Trainable Text-to-Speech Model Optimized for Podcast Scenarios with a $50K Budget Xin Li et.al. 2504.19146 link
2025-04-22 FADEL: Uncertainty-aware Fake Audio Detection with Evidential Deep Learning Ju Yeon Kang et.al. 2504.15663 null
2025-04-22 A Multi-Agent Framework for Automated Qinqiang Opera Script Generation Using Large Language Models Gengxian Cao et.al. 2504.15552 null
2025-04-21 SOLIDO: A Robust Watermarking Method for Speech Synthesis via Low-Rank Adaptation Yue Li et.al. 2504.15035 null
2025-04-20 DialogueAgents: A Hybrid Agent-Based Speech Synthesis Framework for Multi-Party Dialogue Xiang Li et.al. 2504.14482 link
2025-04-18 ChatNekoHacker: Real-Time Fan Engagement with Conversational Agents Takuya Sera et.al. 2504.13793 null
2025-04-18 Collective Learning Mechanism based Optimal Transport Generative Adversarial Network for Non-parallel Voice Conversion Sandipan Dhar et.al. 2504.13791 null
2025-04-22 EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting Guanrou Yang et.al. 2504.12867 null
2025-05-28 GOAT-TTS: Expressive and Realistic Speech Generation via A Dual-Branch LLM Yaodong Song et.al. 2504.12339 null
2025-04-15 Dopamine Audiobook: A Training-free MLLM Agent for Emotional and Human-like Audiobook Generation Yan Rong et.al. 2504.11002 null
2025-04-15 Generalized Audio Deepfake Detection Using Frame-level Latent Information Entropy Botao Zhao et.al. 2504.10819 null
2025-04-14 Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis Yifan Yang et.al. 2504.10352 null
2025-04-14 AutoStyle-TTS: Retrieval-Augmented Generation based Automatic Style Matching Text-to-Speech Synthesis Dan Luo et.al. 2504.10309 null
2025-04-14 SafeSpeech: Robust and Universal Voice Protection Against Malicious Speech Synthesis Zhisheng Zhang et.al. 2504.09839 link
2025-04-12 AMNet: An Acoustic Model Network for Enhanced Mandarin Speech Synthesis Yubing Cao et.al. 2504.09225 null
2025-04-11 Generalized Multilingual Text-to-Speech Generation with Language-Aware Style Adaptation Haowei Lou et.al. 2504.08274 null
2025-04-10 Empowering Global Voices: A Data-Efficient, Phoneme-Tone Adaptive Approach to High-Fidelity Speech Synthesis Yizhong Geng et.al. 2504.07858 null
2025-05-16 SlimSpeech: Lightweight and Efficient Text-to-Speech with Slim Rectified Flow Kaidi Wang et.al. 2504.07776 null
2025-04-08 AVENet: Disentangling Features by Approximating Average Features for Voice Conversion Wenyu Wang et.al. 2504.05833 null
2025-04-07 SpeakEasy: Enhancing Text-to-Speech Interactions for Expressive Content Creation Stephen Brade et.al. 2504.05106 null
2025-04-04 RWKVTTS: Yet another TTS based on RWKV-7 Lin Yueyu et.al. 2504.03289 link
2025-04-22 F5R-TTS: Improving Flow-Matching based Text-to-Speech with Group Relative Policy Optimization Xiaohui Sun et.al. 2504.02407 null
2025-04-03 VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models Kim Sung-Bin et.al. 2504.02386 null
2025-04-02 TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection Zhiming Ma et.al. 2503.24115 link
2025-03-31 SpeechDialogueFactory: Generating High-Quality Speech Dialogue Data to Accelerate Your Speech-LLM Development Minghan Wang et.al. 2503.23848 link
2025-03-30 Speculative End-Turn Detector for Efficient Speech Chatbot Assistant Hyunjong Ok et.al. 2503.23439 null
2025-05-16 SupertonicTTS: Towards Highly Scalable and Efficient Text-to-Speech System Hyeongju Kim et.al. 2503.23108 null
2025-03-26 Dual Audio-Centric Modality Coupling for Talking Head Generation Ao Fu et.al. 2503.22728 null
2025-03-28 DeepAudio-V1:Towards Multi-Modal Multi-Stage End-to-End Video to Speech and Audio Generation Haomin Zhang et.al. 2503.22265 null
2025-03-26 Text-Driven Voice Conversion via Latent State-Space Modeling Wen Li et.al. 2503.20999 null
2025-05-26 FireRedTTS-1S: An Upgraded Streamable Foundation Text-to-Speech System Hao-Han Guo et.al. 2503.20499 null
2025-03-21 Your voice is your voice: Supporting Self-expression through Speech Generation and LLMs in Augmented and Alternative Communication Yiwen Xu et.al. 2503.17479 null
2025-03-21 From Faces to Voices: Learning Hierarchical Representations for High-quality Video-to-Speech Ji-Hoon Kim et.al. 2503.16956 null
2025-03-20 WaveFM: A High-Fidelity and Efficient Vocoder Based on Flow Matching Tianze Luo et.al. 2503.16689 link
2025-03-10 VocalEyes: Enhancing Environmental Perception for the Visually Impaired through Vision-Language Models and Distance-Aware Object Detection Kunal Chavan et.al. 2503.16488 null
2025-01-22 Development of an Inclusive Educational Platform Using Open Technologies and Machine Learning: A Case Study on Accessibility Enhancement Jimi Togni et.al. 2503.15501 null
2025-01-14 AI-Powered Assistive Technologies for Visual Impairment Prudhvi Naayini et.al. 2503.15494 null
2025-03-19 MoonCast: High-Quality Zero-Shot Podcast Generation Zeqian Ju et.al. 2503.14345 link
2025-03-26 InnerSelf: Designing Self-Deepfaked Voice for Emotional Well-being Guang Dai et.al. 2503.14257 null
2025-03-14 MAVFlow: Preserving Paralinguistic Elements with Conditional Flow Matching for Zero-Shot AV2AV Multilingual Translation Sungwoo Cho et.al. 2503.11026 null
2025-03-11 An Exhaustive Evaluation of TTS- and VC-based Data Augmentation for ASR Sewade Ogun et.al. 2503.08954 null
2025-03-07 DiVISe: Direct Visual-Input Speech Synthesis Preserving Speaker Characteristics And Intelligibility Yifan Liu et.al. 2503.05223 link
2025-03-03 Direct Speech to Speech Translation: A Review Mohammad Sarim et.al. 2503.04799 null
2025-03-06 LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM Sambal Shikhar et.al. 2503.04724 null
2025-03-06 Scaling Rich Style-Prompted Text-to-Speech Datasets Anuj Diwan et.al. 2503.04713 link
2025-03-05 Good practices for evaluation of synthesized speech Erica Cooper et.al. 2503.03250 null
2025-03-04 InSerter: Speech Instruction Following with Unsupervised Interleaved Pre-training Dingdong Wang et.al. 2503.02769 null
2025-03-03 Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens Xinsheng Wang et.al. 2503.01710 link
2025-03-03 Voice Cloning for Dysarthric Speech Synthesis: Addressing Data Scarcity in Speech-Language Pathology Birger Moell et.al. 2503.01266 null
2025-03-02 UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation Alexander H. Liu et.al. 2503.00733 null
2025-03-01 PodAgent: A Comprehensive Framework for Podcast Generation Yujia Xiao et.al. 2503.00455 link
2025-03-12 Telephone Surveys Meet Conversational AI: Evaluating a LLM-Based Telephone Survey System at Scale Max M. Lang et.al. 2502.20140 null
2025-02-27 DiffCSS: Diverse and Expressive Conversational Speech Synthesis with Diffusion Models Weihao Wu et.al. 2502.19924 null
2025-03-28 MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis Ziyue Jiang et.al. 2502.18924 null
2025-03-08 Clip-TTS: Contrastive Text-content and Mel-spectrogram, A High-Quality Text-to-Speech Method based on Contextual Semantic Understanding Tianyun Liu et.al. 2502.18889 null
2025-02-24 Balancing Speech Understanding and Generation Using Continual Pre-training for Codec-based Speech LLM Jiatong Shi et.al. 2502.16897 null
2025-02-18 AV-Flow: Transforming Text to Audio-Visual Human-like Interactions Aggelina Chatziagapi et.al. 2502.13133 null
2025-02-18 High-Fidelity Music Vocoder using Neural Audio Codecs Luca A. Lanzendörfer et.al. 2502.12759 null
2025-02-18 A Survey on Bridging EEG Signals and Generative AI: From Image and Text to Beyond Shreya Shukla et.al. 2502.12048 null
2025-02-17 NaturalL2S: End-to-End High-quality Multispeaker Lip-to-Speech Synthesis with Differential Digital Signal Processing Yifan Liang et.al. 2502.12002 null
2025-02-16 FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching Hui Wang et.al. 2502.11128 null
2025-02-16 SyncSpeech: Low-Latency and Efficient Dual-Stream Text-to-Speech based on Temporal Masked Transformer Zhengyan Sheng et.al. 2502.11094 null
2025-02-14 VocalCrypt: Novel Active Defense Against Deepfake Voice Based on Masking Effect Qingyuan Fei et.al. 2502.10329 null
2025-02-13 TokenSynth: A Token-based Neural Synthesizer for Instrument Cloning and Text-to-Instrument Kyungsu Kim et.al. 2502.08939 link
2025-04-24 ASVspoof 5: Design, Collection and Validation of Resources for Spoofing, Deepfake, and Adversarial Attack Detection Using Crowdsourced Speech Xin Wang et.al. 2502.08857 null
2025-02-11 LoRP-TTS: Low-Rank Personalized Text-To-Speech Łukasz Bondaruk et.al. 2502.07562 null
2025-02-11 Advanced Zero-Shot Text-to-Speech for Background Removal and Preservation with Controllable Masked Speech Prediction Leying Zhang et.al. 2502.07345 null
2025-02-11 Vevo: Controllable Zero-Shot Voice Imitation with Self-Supervised Disentanglement Xueyao Zhang et.al. 2502.07243 null
2025-02-10 Synthetic Audio Helps for Cognitive State Tasks Adil Soubki et.al. 2502.06922 link
2025-02-19 Speech to Speech Translation with Translatotron: A State of the Art Review Jules R. Kala et.al. 2502.05980 null
2025-02-09 Non-invasive electromyographic speech neuroprosthesis: a geometric perspective Harshavardhana T. Gowda et.al. 2502.05762 null
2025-02-09 BnTTS: Few-Shot Speaker Adaptation in Low-Resource Setting Mohammad Jahid Ibna Basher et.al. 2502.05729 null
2025-02-08 Gender Bias in Instruction-Guided Speech Synthesis Models Chun-Yi Kuan et.al. 2502.05649 null
2025-02-08 IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System Wei Deng et.al. 2502.05512 link
2025-02-22 Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis Zhen Ye et.al. 2502.04128 link
2025-02-05 Metis: A Foundation Speech Generation Model with Masked Generative Pre-training Yuancheng Wang et.al. 2502.03128 link
2025-02-05 Fine-grained Preference Optimization Improves Zero-shot Text-to-Speech Jixun Yao et.al. 2502.02950 null
2025-02-04 Developing multilingual speech synthesis system for Ojibwe, Mi'kmaq, and Maliseet Shenran Wang et.al. 2502.02703 link
2025-02-04 Streaming Speaker Change Detection and Gender Classification for Transducer-Based Multi-Talker Speech Translation Peidong Wang et.al. 2502.02683 null
2025-02-13 Continuous Autoregressive Modeling with Stochastic Monotonic Alignment for Speech Synthesis Weiwei Lin et.al. 2502.01084 null
2025-02-02 EmoTalkingGaussian: Continuous Emotion-conditioned Talking Head Synthesis Junuk Cha et.al. 2502.00654 null
2025-01-31 VisualSpeech: Enhance Prosody with Visual Context in TTS Shumin Que et.al. 2501.19258 null
2025-01-29 BreezyVoice: Adapting TTS for Taiwanese Mandarin with Enhanced Polyphone Disambiguation -- Challenges and Insights Chan-Jan Hsu et.al. 2501.17790 null
2025-01-28 Compact Neural TTS Voices for Accessibility Kunal Jain et.al. 2501.17332 null
2025-02-11 Overview of the Amphion Toolkit (v0.2) Jiaqi Li et.al. 2501.15442 link
2025-01-24 Characteristic-Specific Partial Fine-Tuning for Efficient Emotion and Speaker Adaptation in Codec Language Text-to-Speech Models Tianrui Wang et.al. 2501.14273 null
2025-01-24 Generalizable Audio Deepfake Detection via Latent Space Refinement and Augmentation Wen Huang et.al. 2501.14240 null
2025-01-24 LoCoML: A Framework for Real-World ML Inference Pipelines Kritin Maddireddy et.al. 2501.14165 null
2025-01-23 Generative Data Augmentation Challenge: Zero-Shot Speech Synthesis for Personalized Speech Enhancement Jae-Sung Bae et.al. 2501.13372 null
2025-01-21 A Domain Adaptation Framework for Speech Recognition Systems with Only Synthetic data Minh Tran et.al. 2501.12501 null
2025-01-20 A Non-autoregressive Model for Joint STT and TTS Vishal Sunder et.al. 2501.09104 null
2025-01-15 Speech Synthesis along Perceptual Voice Quality Dimensions Frederik Rautenberg et.al. 2501.08791 null
2025-01-15 Adaptive Data Augmentation with NaturalSpeech3 for Far-field Speaker Verification Li Zhang et.al. 2501.08691 null
2025-01-15 Towards Lightweight and Stable Zero-shot TTS with Self-distilled Representation Disentanglement Qianniu Chen et.al. 2501.08566 null
2025-03-17 CodecFake+: A Large-Scale Neural Audio Codec-Based Deepfake Speech Dataset Xuanjun Chen et.al. 2501.08238 null
2025-01-13 Exploring the encoding of linguistic representations in the Fully-Connected Layer of generative CNNs for Speech Bruno Ferenc Šegedin et.al. 2501.07726 null
2025-01-19 MathReader : Text-to-Speech for Mathematical Documents Sieun Hyeon et.al. 2501.07088 link
2025-01-11 Retrieval-Augmented Dialogue Knowledge Aggregation for Expressive Conversational Speech Synthesis Rui Liu et.al. 2501.06467 link
2025-01-10 TTS-Transducer: End-to-End Speech Synthesis with Neural Transducer Vladimir Bataev et.al. 2501.06320 null
2025-01-10 MinMo: A Multimodal Large Language Model for Seamless Voice Interaction Qian Chen et.al. 2501.06282 null
2025-01-10 PROEMO: Prompt-Driven Text-to-Speech Synthesis Based on Emotion and Intensity Control Shaozuo Zhang et.al. 2501.06276 null
2025-06-03 Low-Resource Text-to-Speech Synthesis Using Noise-Augmented Training of ForwardTacotron Kishor Kayyar Lakshminarayana et.al. 2501.05976 null
2025-01-10 MARS6: A Small and Robust Hierarchical-Codec Text-to-Speech Model Matthew Baas et.al. 2501.05787 null
2025-01-09 Probing Speaker-specific Features in Speaker Representations Aemon Yat Fei Chiu et.al. 2501.05310 null
2025-01-09 JELLY: Joint Emotion Recognition and Context Reasoning with LLMs for Conversational Speech Synthesis Jun-Hyeok Cha et.al. 2501.04904 null
2025-01-08 Cued Speech Generation Leveraging a Pre-trained Audiovisual Text-to-Speech Model Sanjana Sankar et.al. 2501.04799 null
2025-01-08 FleSpeech: Flexibly Controllable Speech Generation with Various Prompts Hanzhao Li et.al. 2501.04644 null
2025-02-23 OpenOmni: Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-Time Self-Aware Emotional Speech Synthesis Run Luo et.al. 2501.04561 link
2025-01-08 DrawSpeech: Expressive Speech Synthesis Using Prosodic Sketches as Control Conditions Weidong Chen et.al. 2501.04256 null
2025-01-07 NeuroIncept Decoder for High-Fidelity Speech Reconstruction from Neural Activity Owais Mujtaba Khanday et.al. 2501.03757 link
2025-01-02 FaceSpeak: Expressive and High-Quality Speech Synthesis from Human Portraits of Different Styles Tian-Hao Zhang et.al. 2501.03181 null
2025-01-02 RingFormer: A Neural Vocoder with Ring Attention and Convolution-Augmented Transformer Seongho Hong et.al. 2501.01182 link
2025-01-02 Disambiguation of Chinese Polyphones in an End-to-End Framework with Semantic Features Extracted by Pre-trained BERT Dongyang Dai et.al. 2501.01102 null
2025-01-06 Takeaways from Applying LLM Capabilities to Multiple Conversational Avatars in a VR Pilot Study Mykola Maslych et.al. 2501.00168 null
2024-12-16 SECodec: Structural Entropy-based Compressive Speech Representation Codec for Speech Language Models Linqin Wang et.al. 2501.00018 null
2024-12-28 Stable-TTS: Stable Speaker-Adaptive Text-to-Speech Synthesis via Prosody Prompting Wooseok Han et.al. 2412.20155 null
2024-12-28 CrossSpeech++: Cross-lingual Speech Synthesis with Decoupled Language and Speaker Generation Ji-Hoon Kim et.al. 2412.20048 null
2024-12-26 VoiceDiT: Dual-Condition Diffusion Transformer for Environment-Aware Speech Synthesis Jaemin Jung et.al. 2412.19259 null
2024-12-26 "I've Heard of You!": Generate Spoken Named Entity Recognition Data for Unseen Entities Jiawei Yu et.al. 2412.19102 null
2024-12-26 Indonesian-English Code-Switching Speech Synthesizer Utilizing Multilingual STEN-TTS and Bert LID Ahmad Alfani Handoyo et.al. 2412.19043 null
2025-01-23 Advancing NAM-to-Speech Conversion with Novel Methods and the MultiNAM Dataset Neil Shah et.al. 2412.18839 null
2025-01-17 MRI2Speech: Speech Synthesis from Articulatory Movements Recorded by Real-time MRI Neil Shah et.al. 2412.18836 null
2024-12-25 Intra- and Inter-modal Context Interaction Modeling for Conversational Speech Synthesis Zhenqi Jia et.al. 2412.18733 null
2024-12-24 GenPod: Constructive News Framing in AI-Generated Podcasts More Effectively Reduces Negative Emotions Than Non-Constructive Framing Wen Ku et.al. 2412.18300 null
2025-03-27 VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music Jiatong Shi et.al. 2412.17667 link
2024-12-22 Why Do Speech Language Models Fail to Generate Semantically Coherent Outputs? A Modality Evolving Perspective Hankun Wang et.al. 2412.17048 null
2024-12-22 Incremental Disentanglement for Environment-Aware Zero-Shot Text-to-Speech Synthesis Ye-Xin Lu et.al. 2412.16977 null
2025-09-18 KALL-E:Autoregressive Speech Synthesis with Next-Distribution Prediction Kangxiang Xia et.al. 2412.16846 null
2024-12-23 Interleaved Speech-Text Language Models are Simple Streaming Text to Speech Synthesizers Yifan Yang et.al. 2412.16102 null
2024-12-19 Scale This, Not That: Investigating Key Dataset Attributes for Efficient Speech Enhancement Scaling Leying Zhang et.al. 2412.14890 null
2024-12-17 Deep Speech Synthesis from Multimodal Articulatory Representations Peter Wu et.al. 2412.13387 null
2024-12-17 Synthetic Speech Classification: IEEE Signal Processing Cup 2022 challenge Mahieyin Rahmun et.al. 2412.13279 link
2024-12-17 Enhancing Naturalness in LLM-Generated Utterances through Disfluency Insertion Syed Zohaib Hassan et.al. 2412.12710 null
2024-12-17 Phoneme-Level Feature Discrepancies: A Key to Detecting Sophisticated Speech Deepfakes Kuiyuan Zhang et.al. 2412.12619 null
2025-01-10 Hierarchical Control of Emotion Rendering in Speech Synthesis Sho Inoue et.al. 2412.12498 link
2024-12-19 ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis Xiangheng He et.al. 2412.11795 null
2024-12-16 Region-Based Optimization in Continual Learning for Audio Deepfake Detection Yujie Chen et.al. 2412.11551 link
2025-01-15 Multi-modal and Multi-scale Spatial Environment Understanding for Immersive Visual Text-to-Speech Rui Liu et.al. 2412.11409 link
2024-12-16 Efficient Generative Modeling with Residual Vector Quantization-Based Tokens Jaehyeon Kim et.al. 2412.10208 null
2024-12-25 CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models Zhihao Du et.al. 2412.10117 link
2024-12-13 AMuSeD: An Attentive Deep Neural Network for Multimodal Sarcasm Detection Incorporating Bi-modal Data Augmentation Xiyuan Gao et.al. 2412.10103 null
2024-12-13 CSSinger: End-to-End Chunkwise Streaming Singing Voice Synthesis System Based on Conditional Variational Autoencoder Jianwei Cui et.al. 2412.08918 null
2024-12-11 Multimodal Latent Language Modeling with Next-Token Diffusion Yutao Sun et.al. 2412.08635 link
2024-12-11 Zero-Shot Mono-to-Binaural Speech Synthesis Alon Levkovitch et.al. 2412.08356 null
2024-12-11 A Unified Model For Voice and Accent Conversion In Speech and Singing using Self-Supervised Learning and Feature Extraction Sowmya Cheripally et.al. 2412.08312 null
2024-12-11 A Preliminary Analysis of Automatic Word and Syllable Prominence Detection in Non-Native Speech With Text-to-Speech Prosody Embeddings Anindita Mondal et.al. 2412.08283 null
2024-12-11 LatentSpeech: Latent Diffusion for Text-To-Speech Generation Haowei Lou et.al. 2412.08117 null
2024-12-11 Aligner-Guided Training Paradigm: Advancing Text-to-Speech Models with Aligner Guided Duration Haowei Lou et.al. 2412.08112 null
2024-12-09 Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey Tianxin Xie et.al. 2412.06602 link
2024-12-12 EmoSpeech: A Corpus of Emotionally Rich and Contextually Detailed Speech Annotations Weizhen Bian et.al. 2412.06581 null
2024-12-01 Text Is Not All You Need: Multimodal Prompting Helps LLMs Understand Humor Ashwin Baluja et.al. 2412.05315 null
2024-12-04 DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling for Text-to-Speech with Diverse and Controllable Styles Jiaxuan Liu et.al. 2412.03388 null
2024-12-05 Analytic Study of Text-Free Speech Synthesis for Raw Audio using a Self-Supervised Learning Model Joonyong Park et.al. 2412.03074 null
2024-12-03 GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot Aohan Zeng et.al. 2412.02612 link
2024-11-19 A Context-Based Numerical Format Prediction for a Text-To-Speech System Yaser Darwesh et.al. 2412.00028 null
2024-11-27 Continual Learning in Machine Speech Chain Using Gradient Episodic Memory Geoffrey Tyndall et.al. 2411.18320 null
2024-11-27 SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation Wenyi Yu et.al. 2411.18138 null
2024-11-26 Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis Akshita Gupta et.al. 2411.17690 null
2024-11-22 VQalAttent: a Transparent Speech Generation Pipeline based on Transformer-learned VQ-VAE Latent Space Armani Rodriguez et.al. 2411.14642 null
2024-11-26 WavChat: A Survey of Spoken Dialogue Models Shengpeng Ji et.al. 2411.13577 link
2024-12-02 I2TTS: Image-indicated Immersive Text-to-speech Synthesis with Spatial Perception Jiawei Zhang et.al. 2411.13314 null
2024-11-20 Hard-Synth: Synthesizing Diverse Hard Samples for ASR using Zero-Shot TTS and LLM Jiawei Yu et.al. 2411.13159 null
2024-12-15 Rethinking MUSHRA: Addressing Modern Challenges in Text-to-Speech Evaluation Praveen Srinivasa Varadhan et.al. 2411.12719 null
2024-11-19 Leveraging Virtual Reality and AI Tutoring for Language Learning: A Case Study of a Virtual Campus Environment with OpenAI GPT Integration with Unity 3D Adithya TG et.al. 2411.12619 null
2024-11-18 ESTVocoder: An Excitation-Spectral-Transformed Neural Vocoder Conditioned on Mel Spectrogram Xiao-Hang Jiang et.al. 2411.11258 null
2024-11-18 SAMOS: A Neural MOS Prediction Model Leveraging Semantic Representations and Acoustic Features Yu-Fei Shi et.al. 2411.11232 null
2024-11-15 SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers Joseph Liu et.al. 2411.10510 link
2024-11-14 Robust AI-Synthesized Speech Detection Using Feature Decomposition Learning and Synthesizer Feature Augmentation Kuiyuan Zhang et.al. 2411.09167 null
2024-11-14 Evaluating Synthetic Command Attacks on Smart Voice Assistants Zhengxian He et.al. 2411.08316 null
2024-11-12 Improving Grapheme-to-Phoneme Conversion through In-Context Knowledge Retrieval with Large Language Models Dongrui Han et.al. 2411.07563 null
2024-11-11 Enhancing Accessibility in Special Libraries: A Study on AI-Powered Assistive Technologies for Patrons with Disabilities Snehasish Paul Shivali Chauhan et.al. 2411.06970 null
2024-12-04 Debatts: Zero-Shot Debating Text-to-Speech Synthesis Yiqiao Huang et.al. 2411.06540 null
2024-11-07 CUIfy the XR: An Open-Source Package to Embed LLM-powered Conversational Agents in XR Kadir Burak Buldu et.al. 2411.04671 null
2024-11-04 EmoSphere++: Emotion-Controllable Zero-Shot Text-to-Speech via Emotion-Adaptive Spherical Vector Deok-Hyeon Cho et.al. 2411.02625 link
2024-11-04 Complete reconstruction of the tongue contour through acoustic to articulatory inversion using real-time MRI data Sofiane Azzouz et.al. 2411.02037 null
2024-11-09 Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis Shijia Liao et.al. 2411.01156 link
2024-10-31 Speech is More Than Words: Do Speech-to-Text Translation Systems Leverage Prosody? Ioannis Tsiamas et.al. 2410.24019 null
2024-10-30 Lina-Speech: Gated Linear Attention is a Fast and Parameter-Efficient Learner for text-to-speech synthesis Théodor Lemerle et.al. 2410.23320 link
2024-10-30 Augmenting Polish Automatic Speech Recognition System With Synthetic Data Łukasz Bondaruk et.al. 2410.22903 null
2024-10-29 Very Attentive Tacotron: Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech Eric Battenberg et.al. 2410.22179 link
2024-10-29 Fast and High-Quality Auto-Regressive Speech Synthesis via Speculative Decoding Bohan Li et.al. 2410.21951 null
2024-10-29 RDSinger: Reference-based Diffusion Network for Singing Voice Synthesis Kehan Sui et.al. 2410.21641 null
2024-10-28 Asynchronous Tool Usage for Real-Time Agents Antonio A. Ginart et.al. 2410.21620 null
2024-10-28 Enhancing TTS Stability in Hebrew using Discrete Semantic Units Ella Zeldes et.al. 2410.21502 null
2024-10-28 Mitigating Unauthorized Speech Synthesis for Voice Protection Zhisheng Zhang et.al. 2410.20742 link
2024-10-27 Get Large Language Models Ready to Speak: A Late-fusion Approach for Speech Generation Maohao Shen et.al. 2410.20336 null
2024-10-24 Making Social Platforms Accessible: Emotion-Aware Speech Generation with Integrated Text Analysis Suparna De et.al. 2410.19199 null
2024-10-24 STTATTS: Unified Speech-To-Text And Text-To-Speech Model Hawau Olamide Toyin et.al. 2410.18607 link
2024-10-24 Evaluating and Improving Automatic Speech Recognition Systems for Korean Meteorological Experts ChaeHun Park et.al. 2410.18444 null
2024-10-23 ELAICHI: Enhancing Low-resource TTS by Addressing Infrequent and Low-frequency Character Bigrams Srija Anand et.al. 2410.17901 null
2024-10-22 Continuous Speech Tokenizer in Text To Speech Yixing Li et.al. 2410.17081 null
2024-10-22 Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap Guanrou Yang et.al. 2410.16726 null
2024-10-21 Continuous Speech Synthesis using per-token Latent Diffusion Arnon Turetzky et.al. 2410.16048 null
2024-10-18 A Unified Framework for Collecting Text-to-Speech Synthesis Datasets for 22 Indian Languages Sujitha Sathiyamoorthy et.al. 2410.14197 null
2024-12-23 Multi-Source Spatial Knowledge Understanding for Immersive Visual Text-to-Speech Shuwei He et.al. 2410.14101 link
2024-10-17 Accelerating Codec-based Speech Synthesis with Multi-Token Prediction and Speculative Decoding Tan Dat Nguyen et.al. 2410.13839 null
2024-10-17 Enhancing Crowdsourced Audio for Text-to-Speech Models José Giraldo et.al. 2410.13357 null
2024-10-17 DART: Disentanglement of Accent and Speaker Representation in Multispeaker Text-to-Speech Jan Melechovsky et.al. 2410.13342 null
2024-10-17 DurIAN-E 2: Duration Informed Attention Network with Adaptive Variational Autoencoder and Adversarial Learning for Expressive Text-to-Speech Synthesis Yu Gu et.al. 2410.13288 null
2024-10-17 Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation Sreyan Ghosh et.al. 2410.13198 null
2024-10-16 ERVQ: Enhanced Residual Vector Quantization with Intra-and-Inter-Codebook Optimization for Neural Audio Codecs Rui-Chen Zheng et.al. 2410.12359 null
2024-10-16 Beyond Oversmoothing: Evaluating DDPM and MSE for Scalable Speech Synthesis in ASR Christoph Minixhofer et.al. 2410.12279 null
2024-10-14 IsoChronoMeter: A simple and effective isochronic translation evaluation metric Nikolai Rozanov et.al. 2410.11127 null
2024-10-14 DMDSpeech: Distilled Diffusion Model Surpassing The Teacher in Zero-shot Speech Synthesis via Direct Metric Optimization Yinghao Aaron Li et.al. 2410.11097 null
2024-10-14 Everyday Speech in the Indian Subcontinent Utkarsh Pathak et.al. 2410.10508 null
2024-10-12 Emphasis Rendering for Conversational Text-to-Speech with Multi-modal Multi-scale Context Modeling Rui Liu et.al. 2410.09524 null
2024-10-10 Unsupervised Data Validation Methods for Efficient Model Training Yurii Paniv et.al. 2410.07880 null
2024-10-15 F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching Yushen Chen et.al. 2410.06885 link
2024-10-09 Efficient training strategies for natural sounding speech synthesis and speaker adaptation based on FastPitch Teodora Răgman et.al. 2410.06787 null
2024-10-09 Bahasa Harmony: A Comprehensive Dataset for Bahasa Text-to-Speech Synthesis with Discrete Codec Modeling of EnGen-TTS Onkar Kishor Susladkar et.al. 2410.06608 null
2024-10-09 Can DeepFake Speech be Reliably Detected? Hongbin Liu et.al. 2410.06572 null
2024-10-07 SegINR: Segment-wise Implicit Neural Representation for Sequence Alignment in Neural Text-to-Speech Minchan Kim et.al. 2410.04690 null
2024-10-06 HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis Yuto Nishimura et.al. 2410.04380 null
2024-10-10 SONAR: A Synthetic AI-Audio Detection Framework and Benchmark Xiang Li et.al. 2410.04324 link
2024-10-05 Adversarial Attacks and Robust Defenses in Speaker Embedding based Zero-Shot Text-to-Speech System Ze Li et.al. 2410.04017 null
2024-10-01 Recent Advances in Speech Language Models: A Survey Wenqian Cui et.al. 2410.03751 null
2024-09-30 Accent conversion using discrete units with parallel data synthesized from controllable accented TTS Tuan Nam Nguyen et.al. 2410.03734 null
2024-09-28 FluentEditor+: Text-based Speech Editing by Modeling Local Hierarchical Acoustic Smoothness and Global Prosody Consistency Rui Liu et.al. 2410.03719 null
2024-10-04 Generative Semantic Communication for Text-to-Speech Synthesis Jiahao Zheng et.al. 2410.03459 null
2024-10-04 Textless Streaming Speech-to-Speech Translation using Semantic Speech Tokens Jinzheng Zhao et.al. 2410.03298 null
2024-10-04 Narrative Player: Reviving Data Narratives with Visuals Zekai Shao et.al. 2410.03268 null
2024-10-04 MultiVerse: Efficient and Expressive Zero-Shot Multi-Task Text-to-Speech Taejun Bak et.al. 2410.03192 null
2024-10-07 Algorithms For Automatic Accentuation And Transcription Of Russian Texts In Speech Recognition Systems Olga Iakovenko et.al. 2410.02538 null
2024-10-01 Augmentation through Laundering Attacks for Audio Spoof Detection Hashim Ali et.al. 2410.01108 null
2024-10-01 Zero-Shot Text-to-Speech from Continuous Text Streams Trung Dang et.al. 2410.00767 null
2024-10-01 EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control Haozhe Chen et.al. 2410.00316 link
2024-10-02 Moshi: a speech-text foundation model for real-time dialogue Alexandre Défossez et.al. 2410.00037 link
2024-09-30 Word-wise intonation model for cross-language TTS systems Tomilov A. A. et.al. 2409.20374 null
2024-09-29 Quantitative Analysis of Audio-Visual Tasks: An Information-Theoretic Perspective Chen Chen et.al. 2409.19575 null
2024-09-27 Audio-Based Linguistic Feature Extraction for Enhancing Multi-lingual and Low-Resource Text-to-Speech Youngjae Kim et.al. 2409.18622 null
2024-09-27 EmoPro: A Prompt Selection Strategy for Emotional Expression in LM-based Speech Synthesis Haoyu Wang et.al. 2409.18512 null
2024-09-26 Description-based Controllable Text-to-Speech with Cross-Lingual Voice Control Ryuichi Yamamoto et.al. 2409.17452 null
2024-09-25 Exploring synthetic data for cross-speaker style transfer in style representation based TTS Lucas H. Ueda et.al. 2409.17364 null
2024-09-18 SpoofCeleb: Speech Deepfake Detection and SASV In The Wild Jee-weon Jung et.al. 2409.17285 null
2024-09-25 Emotional Dimension Control in Language Model-Based Text-to-Speech: Spanning a Broad Spectrum of Human Emotions Kun Zhou et.al. 2409.16681 null
2024-09-25 Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation Siyin Wang et.al. 2409.16644 null
2024-09-24 FastTalker: Jointly Generating Speech and Conversational Gestures from Text Zixin Guo et.al. 2409.16404 null
2024-09-24 Beyond Text-to-Text: An Overview of Multimodal and Generative Artificial Intelligence for Education Using Topic Modeling Ville Heilala et.al. 2409.16376 null
2024-09-24 Facial Expression-Enhanced TTS: Combining Face Representation and Emotion Intensity for Adaptive Speech Yunji Chu et.al. 2409.16203 null
2024-09-24 NanoVoice: Efficient Speaker-Adaptive Text-to-Speech for Multiple Speakers Nohil Park et.al. 2409.15760 null
2024-09-24 VoiceGuider: Enhancing Out-of-Domain Performance in Parameter-Efficient Speaker-Adaptive Text-to-Speech via Autoguidance Jiheum Yeom et.al. 2409.15759 null
2024-09-24 StyleFusion TTS: Multimodal Style-control and Enhanced Feature Fusion for Zero-shot Text-to-speech Synthesis Zhiyong Chen et.al. 2409.15741 null
2024-09-04 Real-time Robotics Situation Awareness for Accident Prevention in Industry Juan M. Deniz et.al. 2409.15305 null
2024-11-28 A Comprehensive Survey with Critical Analysis for Deepfake Speech Detection Lam Pham et.al. 2409.15180 null
2024-09-23 HiFi-Glot: Neural Formant Synthesis with Differentiable Resonant Filters Lauri Juvela et.al. 2409.14823 null
2024-09-23 LlamaPartialSpoof: An LLM-Driven Fake Speech Dataset Simulating Disinformation Generation Hieu-Thi Luong et.al. 2409.14743 null
2024-09-20 Zero-shot Cross-lingual Voice Transfer for TTS Fadi Biadsy et.al. 2409.13910 null
2024-09-20 On the Feasibility of Fully AI-automated Vishing Attacks João Figueiredo et.al. 2409.13793 null
2024-09-24 Enhancing Kurdish Text-to-Speech with Native Corpus Training: A High-Quality WaveGlow Vocoder Approach Abdulhady Abas Abdullah et.al. 2409.13734 null
2024-09-20 Audio Codec Augmentation for Robust Collaborative Watermarking of Speech Synthesis Lauri Juvela et.al. 2409.13382 link
2024-09-19 Enhancing Synthetic Training Data for Speech Commands: From ASR-Based Filtering to Domain Adaptation in SSL Latent Space Sebastião Quintas et.al. 2409.12745 null
2024-09-19 NDVQ: Robust Neural Audio Codec with Normal Distribution-Based Vector Quantization Zhikang Niu et.al. 2409.12717 null
2024-09-19 Preference Alignment Improves Language Model-Based TTS Jinchuan Tian et.al. 2409.12403 null
2024-09-10 Prosodic Parameter Manipulation in TTS generated speech for Controlled Speech Generation Podakanti Satyajith Chary et.al. 2409.12176 null
2024-09-18 Low Frame-rate Speech Codec: a Codec Designed for Fast High-quality Speech LLM Training and Inference Edresson Casanova et.al. 2409.12117 null
2024-09-18 Exploring an Inter-Pausal Unit (IPU) based Approach for Indic End-to-End TTS Systems Anusha Prakash et.al. 2409.11915 null
2024-09-18 Mixture of Experts Fusion for Fake Audio Detection Using Frozen wav2vec 2.0 Zhiyong Wang et.al. 2409.11909 null
2024-09-18 DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech Xin Qi et.al. 2409.11835 null
2024-09-18 Speaking from Coarse to Fine: Improving Neural Codec Language Model via Multi-Scale Speech Coding and Generation Haohan Guo et.al. 2409.11630 null
2024-09-17 SpMis: An Investigation of Synthetic Spoken Misinformation Detection Peizhuo Liu et.al. 2409.11308 null
2024-09-19 The Art of Storytelling: Multi-Agent Generative AI for Dynamic Multimodal Narratives Samee Arif et.al. 2409.11261 link
2024-09-17 Zero Shot Text to Speech Augmentation for Automatic Speech Recognition on Low-Resource Accented Speech Corpora Francesco Nespoli et.al. 2409.11107 null
2024-09-17 Single-stage TTS with Masked Audio Token Modeling and Semantic Knowledge Distillation Gerard I. Gállego et.al. 2409.11003 null
2024-09-17 Enhancing Multilingual Speech Generation and Recognition Abilities in LLMs with Constructed Code-switched Data Jing Xu et.al. 2409.10969 null
2024-09-16 Emo-DPO: Controllable Emotional Speech Synthesis through Direct Preference Optimization Xiaoxue Gao et.al. 2409.10157 null
2024-09-16 StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion Yinghao Aaron Li et.al. 2409.10058 null
2024-09-15 Acquiring Pronunciation Knowledge from Transcribed Speech Audio via Multi-task Learning Siqi Sun et.al. 2409.09891 null
2025-01-13 MacST: Multi-Accent Speech Synthesis via Text Transliteration for Accent Conversion Sho Inoue et.al. 2409.09352 null
2024-09-14 E1 TTS: Simple and Fast Non-Autoregressive TTS Zhijun Liu et.al. 2409.09351 null
2024-09-14 Improving Robustness of Diffusion-Based Zero-Shot Speech Synthesis via Stable Formant Generation Changjin Han et.al. 2409.09311 null
2024-09-14 SafeEar: Content Privacy-Preserving Audio Deepfake Detection Xinfeng Li et.al. 2409.09272 link
2024-09-13 AccentBox: Towards High-Fidelity Zero-Shot Accent Generation Jinzuomu Zhong et.al. 2409.09098 null
2024-09-17 HLTCOE JHU Submission to the Voice Privacy Challenge 2024 Henry Li Xinyuan et.al. 2409.08913 null
2024-09-13 Text-To-Speech Synthesis In The Wild Jee-weon Jung et.al. 2409.08711 null
2024-09-13 LLM-Powered Grapheme-to-Phoneme Conversion: Benchmark and Case Study Mahta Fetrat Qharabagh et.al. 2409.08554 null
2024-09-14 Exploring Accessibility Trends and Challenges in Mobile App Development: A Study of Stack Overflow Questions Amila Indika et.al. 2409.07945 null
2024-09-12 Full-text Error Correction for Chinese Speech Recognition with Large Language Model Zhiyuan Tang et.al. 2409.07790 null
2025-01-03 SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis Helin Wang et.al. 2409.07556 link
2024-09-11 D-CAPTCHA++: A Study of Resilience of Deepfake CAPTCHA under Transferable Imperceptible Adversarial Attack Hong-Hanh Nguyen-Le et.al. 2409.07390 null
2024-09-11 Cross-Dialect Text-To-Speech in Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level BERT Kazuki Yamauchi et.al. 2409.07265 null
2024-09-11 Zero-Shot Text-to-Speech as Golden Speech Generator: A Systematic Framework and its Applicability in Automatic Pronunciation Assessment Tien-Hong Lo et.al. 2409.07151 null
2024-09-11 The VoiceMOS Challenge 2024: Beyond Speech Quality Prediction Wen-Chin Huang et.al. 2409.07001 null
2024-09-10 Enhancing Emotional Text-to-Speech Controllability with Natural Language Guidance through Contrastive Learning and Diffusion Models Xin Jing et.al. 2409.06451 null
2024-09-26 What happens to diffusion model likelihood when your model is conditional? Mattias Cross et.al. 2409.06364 null
2024-09-10 VoiceWukong: Benchmarking Deepfake Voice Detection Ziwei Yan et.al. 2409.06348 null
2024-09-10 AS-Speech: Adaptive Style For Speech Synthesis Zhipeng Li et.al. 2409.05730 null
2024-10-07 IndicVoices-R: Unlocking a Massive Multilingual Multi-speaker Speech Corpus for Scaling Indian TTS Ashwin Sankar et.al. 2409.05356 link
2024-09-10 Disentangling the Prosody and Semantic Information with Pre-trained Model for In-Context Learning based Zero-Shot Voice Conversion Zhengyang Chen et.al. 2409.05004 null
2024-09-01 Sample-Efficient Diffusion for Text-To-Speech Synthesis Justin Lovelace et.al. 2409.03717 link
2024-09-10 LAST: Language Model Aware Speech Tokenization Arnon Turetzky et.al. 2409.03701 null
2024-09-05 FireRedTTS: A Foundation Text-To-Speech Framework for Industry-Level Generative Speech Applications Hao-Han Guo et.al. 2409.03283 null
2024-09-04 Training Universal Vocoders with Feature Smoothing-Based Augmentation Methods for High-Quality TTS Systems Jeongmin Liu et.al. 2409.02517 null
2024-09-04 Fast, High-Quality and Parameter-Efficient Articulatory Synthesis using Differentiable DSP Yisi Liu et.al. 2409.02451 null
2024-09-11 vec2wav 2.0: Advancing Voice Conversion via Discrete Token Vocoders Yiwei Guo et.al. 2409.01995 null
2024-10-02 VoxHakka: A Dialectally Diverse Multi-speaker Text-to-Speech System for Taiwanese Hakka Li-Wei Chen et.al. 2409.01548 null
2024-09-02 A multilingual training strategy for low resource Text to Speech Asma Amalas et.al. 2409.01217 null
2024-09-02 A Framework for Synthetic Audio Conversations Generation using Large Language Models Kaung Myat Kyaw et.al. 2409.00946 null
2024-09-02 SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model Based Text-to-Speech Synthesis Haohan Guo et.al. 2409.00933 link
2024-10-11 MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer Yuancheng Wang et.al. 2409.00750 null
2024-08-30 SelectTTS: Synthesizing Anyone's Voice via Discrete Unit-Based Frame Selection Ismail Rasim Ulgen et.al. 2408.17432 null
2024-08-30 AASIST3: KAN-Enhanced AASIST Speech Deepfake Detection using SSL Features and Additional Regularization for the ASVspoof 2024 Challenge Kirill Borodin et.al. 2408.17352 null
2024-09-19 Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model Zhen Ye et.al. 2408.17175 link
2024-08-30 Utilizing Speaker Profiles for Impersonation Audio Detection Hao Gu et.al. 2408.17009 null
2024-08-30 Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming Zhifei Xie et.al. 2408.16725 link
2024-08-29 RAVE for Speech: Efficient Voice Conversion at High Sampling Rates Anders R. Bargum et.al. 2408.16546 null
2024-08-29 Enabling Beam Search for Language Model-Based Text-to-Speech Synthesis Zehai Tu et.al. 2408.16373 null
2024-08-28 Multi-modal Adversarial Training for Zero-Shot Voice Cloning John Janiczek et.al. 2408.15916 null
2024-08-29 Easy, Interpretable, Effective: openSMILE for voice deepfake detection Octavian Pascu et.al. 2408.15775 null
2024-08-28 VoxInstruct: Expressive Human Instruction-to-Speech Generation with Unified Multilingual Codec Language Modelling Yixuan Zhou et.al. 2408.15676 link
2024-08-27 Literary and Colloquial Dialect Identification for Tamil using Acoustic Features M. Nanmalar et.al. 2408.14887 null
2024-08-28 VoiceTailor: Lightweight Plug-In Adapter for Diffusion-Based Personalized Text-to-Speech Heeseung Kim et.al. 2408.14739 null
2024-08-27 StyleSpeech: Parameter-efficient Fine Tuning for Pre-trained Controllable Text-to-Speech Haowei Lo

About

Update ASR paper everyday
