Usage instructions: here
This page is modified from here
| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2026-03-05 | PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration | Mohammad Javad Ranjbar Kalahroodi et.al. | 2603.05314 | null |
| 2026-03-05 | Visual-Informed Speech Enhancement Using Attention-Based Beamforming | Chihyun Liu et.al. | 2603.05270 | null |
| 2026-03-05 | Beyond Word Error Rate: Auditing the Diversity Tax in Speech Recognition through Dataset Cartography | Ting-Hui Cheng et.al. | 2603.05267 | null |
| 2026-03-05 | Boosting ASR Robustness via Test-Time Reinforcement Learning with Audio-Text Semantic Rewards | Linghan Fang et.al. | 2603.05231 | null |
| 2026-03-05 | Federated Heterogeneous Language Model Optimization for Hybrid Automatic Speech Recognition | Mengze Hong et.al. | 2603.04945 | null |
| 2026-03-05 | Spectral dynamics reservoir computing for high-speed hardware-efficient neuromorphic processing | Jiaxuan Chen et.al. | 2603.04901 | null |
| 2026-03-05 | WhisperAlign: Word-Boundary-Aware ASR and WhisperX-Anchored Pyannote Diarization for Long-Form Bengali Speech | Aurchi Chowdhury et.al. | 2603.04809 | null |
| 2026-03-05 | When Denoising Hinders: Revisiting Zero-Shot ASR with SAM-Audio and Whisper | Akif Islam et.al. | 2603.04710 | null |
| 2026-02-16 | Generating Realistic, Protocol-Compliant Maritime Radio Dialogues using Self-Instruct and Low-Rank Adaptation | Gürsel Akdeniz et.al. | 2603.04423 | null |
| 2026-03-04 | Robust LLM-based Audio-Visual Speech Recognition with Sparse Modality Alignment and Visual Unit-Guided Refinement | Fei Su et.al. | 2603.03811 | null |
| 2026-02-28 | ACES: Accent Subspaces for Coupling, Explanations, and Stress-Testing in Automatic Speech Recognition | Swapnil Parekh et.al. | 2603.03359 | null |
| 2026-03-03 | An Investigation Into Various Approaches For Bengali Long-Form Speech Transcription and Bengali Speaker Diarization | Epshita Jahan et.al. | 2603.03158 | null |
| 2026-03-03 | Speech recognition assisted by large language models to command software orally -- Application to an augmented and virtual reality web app for immersive molecular graphics | Fabio Cortes Rodriguez et.al. | 2603.02901 | null |
| 2026-03-04 | SilentWear: an Ultra-Low Power Wearable System for EMG-based Silent Speech Recognition | Giusy Spacone et.al. | 2603.02847 | null |
| 2026-03-05 | Benchmarking Speech Systems for Frontline Health Conversations: The DISPLACE-M Challenge | Dhanya E et.al. | 2603.02813 | null |
| 2026-03-02 | GLoRIA: Gated Low-Rank Interpretable Adaptation for Dialectal ASR | Pouya Mehralian et.al. | 2603.02464 | null |
| 2026-03-02 | RO-N3WS: Enhancing Generalization in Low-Resource ASR with Diverse Romanian Speech Benchmarks | Alexandra Diaconu et.al. | 2603.02368 | null |
| 2026-03-02 | Sequence-Level Unsupervised Training in Speech Recognition: A Theoretical Study | Zijian Yang et.al. | 2603.02285 | null |
| 2026-02-27 | Whisper-RIR-Mega: A Paired Clean-Reverberant Speech Benchmark for ASR Robustness to Room Acoustics | Mandip Goswami et.al. | 2603.02252 | link |
| 2026-02-25 | Quality of Automatic Speech Recognition -- Polish Language case study -- from Wav2Vec to Scribe ElevenLabs | Marcin Pietroń et.al. | 2603.02246 | null |
| 2026-03-02 | VietSuperSpeech: A Large-Scale Vietnamese Conversational Speech Dataset for ASR Fine-Tuning in Chatbot, Customer Support, and Call Center Applications | Loan Do et.al. | 2603.01894 | null |
| 2026-03-02 | More Data, Fewer Diacritics: Scaling Arabic TTS | Ahmed Musleh et.al. | 2603.01622 | null |
| 2026-03-02 | The USTC-NERCSLIP Systems for the CHiME-9 MCoRec Challenge | Ya Jiang et.al. | 2603.01415 | null |
| 2026-03-02 | End-to-End Simultaneous Dysarthric Speech Reconstruction with Frame-Level Adaptor and Multiple Wait-k Knowledge Distillation | Minghui Wu et.al. | 2603.01382 | null |
| 2026-03-02 | DARS: Dysarthria-Aware Rhythm-Style Synthesis for ASR Enhancement | Minghui Wu et.al. | 2603.01369 | null |
| 2026-03-03 | Using Songs to Improve Kazakh Automatic Speech Recognition | Rustem Yeshpanov et.al. | 2603.00961 | null |
| 2026-03-01 | Towards Orthographically-Informed Evaluation of Speech Recognition Systems for Indian Languages | Kaushal Santosh Bhogale et.al. | 2603.00941 | null |
| 2026-02-28 | Polynomial Mixing for Efficient Self-supervised Speech Encoders | Eva Feillet et.al. | 2603.00683 | null |
| 2026-02-28 | Whisper-MLA: Reducing GPU Memory Consumption of ASR Models based on MHA2MLA Conversion | Sen Zhang et.al. | 2603.00563 | null |
| 2026-02-16 | Iterative LLM-based improvement for French Clinical Interview Transcription and Speaker Diarization | Ambre Marie et.al. | 2603.00086 | null |
| 2026-02-27 | Chunk-wise Attention Transducers for Fast and Accurate Streaming Speech-to-Text | Hainan Xu et.al. | 2602.24245 | null |
| 2026-02-27 | Dialect and Gender Bias in YouTube's Spanish Captioning System | Iris Dania Jimenez et.al. | 2602.24002 | null |
| 2026-02-26 | Challenges in Automatic Speech Recognition for Adults with Cognitive Impairment | Michelle Cohn et.al. | 2602.23436 | null |
| 2026-02-16 | Hello-Chat: Towards Realistic Social Audio Interactions | Yueran Hou et.al. | 2602.23387 | null |
| 2026-02-26 | Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment | Sanjid Hasan et.al. | 2602.23070 | null |
| 2026-02-26 | A Holistic Framework for Robust Bangla ASR and Speaker Diarization with Optimized VAD and CTC Alignment | Zarif Ishmam et.al. | 2602.22935 | null |
| 2026-02-26 | Efficient Dialect-Aware Modeling and Conditioning for Low-Resource Taiwanese Hakka Speech Processing | An-Ci Peng et.al. | 2602.22522 | null |
| 2026-02-25 | TG-ASR: Translation-Guided Learning with Parallel Gated Cross Attention for Low-Resource Automatic Speech Recognition | Cheng-Yeh Yang et.al. | 2602.22039 | null |
| 2026-02-25 | Robust Long-Form Bangla Speech Processing: Automatic Speech Recognition and Speaker Diarization | MD. Sagor Chowdhury et.al. | 2602.21741 | null |
| 2026-03-02 | Mitigating Structural Noise in Low-Resource S2TT: An Optimized Cascaded Nepali-English Pipeline with Punctuation Restoration | Tangsang Chongbang et.al. | 2602.21647 | null |
| 2026-02-24 | 823-OLT @ BUET DL Sprint 4.0: Context-Aware Windowing for ASR and Fine-Tuned Speaker Diarization in Bengali Long Form Audio | Ratnajit Dhar et.al. | 2602.21183 | null |
| 2026-02-24 | Training-Free Intelligibility-Guided Observation Addition for Noisy ASR | Haoyang Li et.al. | 2602.20967 | null |
| 2026-02-23 | An Approach to Combining Video and Speech with Large Language Models in Human-Robot Interaction | Guanting Shen et.al. | 2602.20219 | null |
| 2026-02-22 | Pay Attention to CTC: Fast and Robust Pseudo-Labelling for Unified Speech Recognition | Alexandros Haliassos et.al. | 2602.19316 | null |
| 2026-02-21 | Whisper: Courtside Edition Enhancing ASR Performance Through LLM-Driven Context Generation | Yonathan Ron et.al. | 2602.18966 | null |
| 2026-02-21 | ReHear: Iterative Pseudo-Label Refinement for Semi-Supervised Speech Recognition via Audio Large Language Models | Zefang Liu et.al. | 2602.18721 | null |
| 2026-02-18 | Fine-Pruning: A Biologically Inspired Algorithm for Personalization of Machine Learning Models | Joseph Bingham et.al. | 2602.18507 | null |
| 2026-02-19 | Voice-Driven Semantic Perception for UAV-Assisted Emergency Networks | Nuno Saavedra et.al. | 2602.17394 | null |
| 2026-02-13 | Speech to Speech Synthesis for Voice Impersonation | Bjorn Johnson et.al. | 2602.16721 | null |
| 2026-02-24 | Enroll-on-Wakeup: A First Comparative Study of Target Speech Extraction for Seamless Interaction in Real Noisy Human-Machine Dialogue Scenarios | Yiming Yang et.al. | 2602.15519 | null |
| 2026-02-17 | Joint Enhancement and Classification using Coupled Diffusion Models of Signals and Logits | Gilad Nurko et.al. | 2602.15405 | null |
| 2026-02-16 | CLAP-Based Automatic Word Naming Recognition in Post-Stroke Aphasia | Yacouba Kaloga et.al. | 2602.14584 | null |
| 2026-02-15 | From Scarcity to Scale: A Release-Level Analysis of the Pashto Common Voice Dataset | Jandad Jahani et.al. | 2602.14062 | null |
| 2026-02-15 | Eureka-Audio: Triggering Audio Intelligence in Compact Language Models | Dan Zhang et.al. | 2602.13954 | null |
| 2026-02-14 | voice2mode: Phonation Mode Classification in Singing using Self-Supervised Speech Models | Aju Ani Justus et.al. | 2602.13928 | null |
| 2026-02-03 | Multimodal Consistency-Guided Reference-Free Data Selection for ASR Accent Adaptation | Ligong Lei et.al. | 2602.13263 | null |
| 2026-02-13 | ViMedCSS: A Vietnamese Medical Code-Switching Speech Dataset & Benchmark | Tung X. Nguyen et.al. | 2602.12911 | null |
| 2026-02-13 | Lamer-SSL: Layer-aware Mixture of LoRA Experts for Continual Multilingual Expansion of Self-supervised Models without Forgetting | Jing Xu et.al. | 2602.12746 | null |
| 2026-02-13 | PISHYAR: A Socially Intelligent Smart Cane for Indoor Social Navigation and Multimodal Human-Robot Interaction for Visually Impaired People | Mahdi Haghighat Joo et.al. | 2602.12597 | null |
| 2026-02-13 | Decoder-only Conformer with Modality-aware Sparse Mixtures of Experts for ASR | Jaeyoung Lee et.al. | 2602.12546 | null |
| 2026-01-21 | Retrieval-Augmented Self-Taught Reasoning Model with Adaptive Chain-of-Thought for ASR Named Entity Correction | Junjie An et.al. | 2602.12287 | null |
| 2026-02-16 | "Sorry, I Didn't Catch That": How Speech Models Miss What Matters Most | Kaitlyn Zhou et.al. | 2602.12249 | null |
| 2026-02-12 | Moonshine v2: Ergodic Streaming Encoder ASR for Latency-Critical Speech Applications | Manjunath Kudlur et.al. | 2602.12241 | null |
| 2026-02-12 | On the Sensitivity of Firing Rate-Based Federated Spiking Neural Networks to Differential Privacy | Luiz Pereira et.al. | 2602.12009 | null |
| 2026-02-28 | TC-BiMamba: Trans-Chunk bidirectionally within BiMamba for unified streaming and non-streaming ASR | Qingshun She et.al. | 2602.11546 | null |
| 2026-02-21 | Voxtral Realtime | Alexander H. Liu et.al. | 2602.11298 | null |
| 2026-02-11 | Self-Supervised Learning for Speaker Recognition: A study and review | Theo Lepage et.al. | 2602.10829 | null |
| 2026-02-10 | ViSpeechFormer: A Phonemic Approach for Vietnamese Automatic Speech Recognition | Khoa Anh Nguyen et.al. | 2602.10003 | null |
| 2026-02-10 | Where Are We At with Automatic Speech Recognition for the Bambara Language? | Seydou Diallo et.al. | 2602.09785 | null |
| 2026-02-04 | Beyond the Utterance: An Empirical Study of Very Long Context Speech Recognition | Robert Flynn et.al. | 2602.09044 | null |
| 2026-02-04 | Windowed SummaryMixing: An Efficient Fine-Tuning of Self-Supervised Learning Models for Low-resource Speech Recognition | Aditya Srinivas Menon et.al. | 2602.09043 | null |
| 2026-02-19 | Prototype-Based Disentanglement for Controllable Dysarthric Speech Synthesis | Haoshen Wang et.al. | 2602.08696 | null |
| 2026-02-09 | Cross-Modal Bottleneck Fusion For Noise Robust Audio-Visual Speech Recognition | Seaone Ok et.al. | 2602.08293 | null |
| 2026-02-08 | D-ORCA: Dialogue-Centric Optimization for Robust Audio-Visual Captioning | Changli Tang et.al. | 2602.07960 | null |
| 2026-02-06 | Equipping LLM with Directional Multi-Talker Speech Understanding Capabilities | Ju Lin et.al. | 2602.07211 | null |
| 2026-02-05 | From Hallucination to Articulation: Language Model-Driven Losses for Ultra Low-Bitrate Neural Speech Coding | Jayeon Yi et.al. | 2602.06213 | null |
| 2026-02-05 | Enabling Automatic Disordered Speech Recognition: An Impaired Speech Dataset in the Akan Language | Isaac Wiafe et.al. | 2602.05406 | null |
| 2026-02-11 | Evaluating Kubernetes Performance for GenAI Inference: From Automatic Speech Recognition to LLM Summarization | Sai Sindhur Malleni et.al. | 2602.04900 | null |
| 2026-02-04 | Speaker-Aware Simulation Improves Conversational Speech Recognition | Máté Gedeon et.al. | 2602.04776 | null |
| 2026-03-01 | Universal Robust Speech Adaptation for Cross-Domain Speech Recognition and Enhancement | Chien-Chun Wang et.al. | 2602.04307 | null |
| 2026-02-04 | Frontend Token Enhancement for Token-Based Speech Recognition | Takanori Ashihara et.al. | 2602.04217 | null |
| 2026-02-06 | Benchmarking Automatic Speech Recognition for Indian Languages in Agricultural Contexts | Chandrashekar M S et.al. | 2602.03868 | null |
| 2026-02-03 | Mići Princ -- A Little Boy Teaching Speech Technologies the Chakavian Dialect | Nikola Ljubešić et.al. | 2602.03245 | null |
| 2026-03-02 | WAXAL: A Large-Scale Multilingual African Language Speech Corpus | Abdoulaye Diack et.al. | 2602.02734 | null |
| 2026-02-02 | Mixture-of-Experts with Intermediate CTC Supervision for Accented Speech Recognition | Wonjun Lee et.al. | 2602.01967 | null |
| 2026-02-02 | BBPE16: UTF-16-based byte-level byte-pair encoding for improved multilingual speech recognition | Hyunsik Kim et.al. | 2602.01717 | null |
| 2026-02-01 | EmoAra: Emotion-Preserving English Speech Transcription and Cross-Lingual Translation with Arabic Text-to-Speech | Besher Hassan et.al. | 2602.01170 | null |
| 2026-02-01 | Adapting Where It Matters: Depth-Aware Adaptation for Efficient Multilingual Speech Recognition in Low-Resource Languages | Yang Xiao et.al. | 2602.01008 | null |
| 2026-02-01 | MedSpeak: A Knowledge Graph-Aided ASR Error Correction Framework for Spoken Medical QA | Yutong Song et.al. | 2602.00981 | null |
| 2026-01-30 | CALM: Joint Contextual Acoustic-Linguistic Modeling for Personalization of Multi-Speaker ASR | Muhammad Shakeel et.al. | 2601.22792 | null |
| 2026-01-30 | Streaming Speech Recognition with Decoder-Only Large Language Models and Latency Optimization | Genshun Wan et.al. | 2601.22779 | null |
| 2026-01-29 | Towards Robust Dysarthric Speech Recognition: LLM-Agent Post-ASR Correction Beyond WER | Xiuwen Zheng et.al. | 2601.21347 | null |
| 2026-01-30 | Qwen3-ASR Technical Report | Xian Shi et.al. | 2601.21337 | link |
| 2026-01-28 | asr_eval: Algorithms and tools for multi-reference and streaming speech recognition evaluation | Oleg Sedukhin et.al. | 2601.20992 | null |
| 2026-01-30 | Text-only adaptation in LLM-based ASR through text denoising | Sergio Burdisso et.al. | 2601.20900 | null |
| 2026-01-28 | Reducing Prompt Sensitivity in LLM-based Speech Recognition Through Learnable Projection | Sergio Burdisso et.al. | 2601.20898 | null |
| 2026-01-28 | A Study of Data Selection Strategies for Pre-training Self-Supervised Speech Models | Ryan Whetten et.al. | 2601.20896 | null |
| 2026-01-28 | SW-ASR: A Context-Aware Hybrid ASR Pipeline for Robust Single Word Speech Recognition | Manali Sharma et.al. | 2601.20890 | null |
| 2026-01-27 | MA-LipNet: Multi-Dimensional Attention Networks for Robust Lipreading | Matteo Rossi et.al. | 2601.20881 | null |
| 2026-01-28 | ASR for Affective Speech: Investigating Impact of Emotion and Speech Generative Strategy | Ya-Tse Wu et.al. | 2601.20319 | null |
| 2026-01-28 | Mind the Shift: Using Delta SSL Embeddings to Enhance Child ASR | Zilai Wang et.al. | 2601.20142 | null |
| 2026-01-27 | Do we really need Self-Attention for Streaming Automatic Speech Recognition? | Youness Dkhissi et.al. | 2601.19960 | null |
| 2026-01-23 | Benchmarking von ASR-Modellen im deutschen medizinischen Kontext: Eine Leistungsanalyse anhand von Anamnesegesprächen | Thomas Schuster et.al. | 2601.19945 | null |
| 2026-01-08 | FastWhisper: Adaptive Self-knowledge Distillation for Real-time Automatic Speech Recognition | Junseok Lee et.al. | 2601.19919 | null |
| 2026-01-27 | SLM-SS: Speech Language Model for Generative Speech Separation | Tianhua Li et.al. | 2601.19533 | null |
| 2026-01-27 | Dynamic Multi-Expert Projectors with Stabilized Routing for Multilingual Speech Recognition | Isha Pandey et.al. | 2601.19451 | null |
| 2026-01-27 | SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper | Alexander Polok et.al. | 2601.19194 | null |
| 2026-02-02 | Language Family Matters: Evaluating LLM-Based ASR Across Linguistic Boundaries | Yuchen Zhang et.al. | 2601.18899 | null |
| 2026-01-29 | Unheard in the Digital Age: Rethinking AI Bias and Speech Diversity | Onyedikachi Hope Amaechi-Okorie et.al. | 2601.18641 | null |
| 2026-01-26 | Pisets: A Robust Speech Recognition System for Lectures and Interviews | Ivan Bondarenko et.al. | 2601.18415 | link |
| 2026-01-26 | Noise-Robust AV-ASR Using Visual Features Both in the Whisper Encoder and Decoder | Zhengyang Li et.al. | 2601.18396 | null |
| 2026-01-26 | OCR-Enhanced Multimodal ASR Can Read While Listening | Junli Chen et.al. | 2601.18393 | null |
| 2026-01-26 | Efficient Rehearsal for Continual Learning in ASR via Singular Value Tuning | Steven Vander Eeckt et.al. | 2601.18266 | null |
| 2026-01-26 | VIBEVOICE-ASR Technical Report | Zhiliang Peng et.al. | 2601.18184 | null |
| 2026-01-25 | SpatialEmb: Extract and Encode Spatial Information for 1-Stage Multi-channel Multi-speaker ASR on Arbitrary Microphone Arrays | Yiwen Shao et.al. | 2601.18037 | null |
| 2026-01-25 | dLLM-ASR: A Faster Diffusion LLM-based Framework for Speech Recognition | Wenjie Tian et.al. | 2601.17902 | null |
| 2026-02-28 | Speech Emotion Recognition with ASR Integration | Yuanchao Li et.al. | 2601.17901 | null |
| 2026-01-25 | Quran-MD: A Fine-Grained Multilingual Multimodal Dataset of the Quran | Muhammad Umar Salman et.al. | 2601.17880 | null |
| 2026-01-25 | BanglaRobustNet: A Hybrid Denoising-Attention Architecture for Robust Bangla Speech Recognition | Md Sazzadul Islam Ridoy et.al. | 2601.17679 | null |
| 2026-01-25 | End-to-End Joint ASR and Speaker Role Diarization with Child-Adult Interactions | Anfeng Xu et.al. | 2601.17640 | link |
| 2026-01-24 | Window Size Versus Accuracy Experiments in Voice Activity Detectors | Max McKinnon et.al. | 2601.17270 | null |
| 2026-01-22 | Sink or SWIM: Tackling Real-Time ASR at Scale | Federico Bruzzone et.al. | 2601.17097 | null |
| 2026-01-16 | AI-based System for Transforming text and sound to Educational Videos | M. E. ElAlami et.al. | 2601.17022 | null |
| 2026-01-21 | Test-Time Adaptation for Speech Emotion Recognition | Jiaheng Dong et.al. | 2601.16240 | null |
| 2026-01-20 | SoundBreak: A Systematic Study of Audio-Only Adversarial Attacks on Trimodal Models | Aafiya Hussain et.al. | 2601.16231 | null |
| 2026-01-22 | Quantum Dimension Reduction of Hidden Markov Models | Rishi Sundar et.al. | 2601.16126 | null |
| 2026-01-27 | Distillation-based Layer Dropping (DLD): Effective End-to-end Framework for Dynamic Speech Networks | Abdul Hannan et.al. | 2601.16117 | null |
| 2026-01-20 | Lost in Transcription: How Speech-to-Text Errors Derail Code Understanding | Jayant Havare et.al. | 2601.15339 | null |
| 2026-01-22 | Deaf and Hard of Hearing Access to Intelligent Personal Assistants: Comparison of Voice-Based Options with an LLM-Powered Touch Interface | Paige S. DeVries et.al. | 2601.15209 | null |
| 2026-01-21 | Inverse-Hessian Regularization for Continual Learning in ASR | Steven Vander Eeckt et.al. | 2601.14751 | null |
| 2026-01-19 | Typhoon ASR Real-time: FastConformer-Transducer for Thai Automatic Speech Recognition | Warit Sirichotedumrong et.al. | 2601.13044 | link |
| 2026-01-19 | DUAP: Dual-task Universal Adversarial Perturbations Against Voice Control Systems | Suyang Sun et.al. | 2601.12786 | null |
| 2026-01-18 | SSVD-O: Parameter-Efficient Fine-Tuning with Structured SVD for Speech Recognition | Pu Wang et.al. | 2601.12600 | null |
| 2026-01-18 | Purification Before Fusion: Toward Mask-Free Speech Enhancement for Robust Audio-Visual Speech Recognition | Linzhi Wu et.al. | 2601.12436 | null |
| 2026-01-18 | CTC-DID: CTC-Based Arabic dialect identification for streaming applications | Muhammad Umar Farooq et.al. | 2601.12199 | null |
| 2026-01-16 | WenetSpeech-Wu: Datasets, Benchmarks, and Models for a Unified Chinese Wu Dialect Speech Processing Ecosystem | Chengyou Wang et.al. | 2601.11027 | null |
| 2026-01-15 | Unifying Speech Recognition, Synthesis and Conversion with Autoregressive Transformers | Runyuan Cai et.al. | 2601.10770 | null |
| 2026-01-15 | STEAMROLLER: A Multi-Agent System for Inclusive Automatic Speech Recognition for People who Stutter | Ziqi Xu et.al. | 2601.10223 | null |
| 2025-12-23 | Multi-Level Embedding Conformer Framework for Bengali Automatic Speech Recognition | Md. Nazmus Sakib et.al. | 2601.09710 | null |
| 2026-01-14 | Linear Complexity Self-Supervised Learning for Music Understanding with Random Quantizer | Petros Vavaroutsos et.al. | 2601.09603 | null |
| 2026-01-14 | Speech-Hands: A Self-Reflection Voice Agentic Approach to Speech Recognition and Audio Reasoning with Omni Perception | Zhen Wan et.al. | 2601.09413 | null |
| 2026-01-14 | SLAM-LLM: A Modular, Open-Source Multimodal Large Language Model Framework and Best Practice for Speech, Language, Audio and Music Processing | Ziyang Ma et.al. | 2601.09385 | null |
| 2026-01-17 | MCGA: A Multi-task Classical Chinese Literary Genre Audio Corpus | Yexing Du et.al. | 2601.09270 | link |
| 2026-01-13 | Robust CAPTCHA Using Audio Illusions in the Era of Large Language Models: from Evaluation to Advances | Ziqi Ding et.al. | 2601.08516 | null |
| 2026-01-12 | Towards Comprehensive Semantic Speech Embeddings for Chinese Dialects | Kalvin Chang et.al. | 2601.07274 | link |
| 2026-01-11 | Task Arithmetic with Support Languages for Low-Resource ASR | Emma Rafkin et.al. | 2601.07038 | null |
| 2026-01-11 | Categorize Early, Integrate Late: Divergent Processing Strategies in Automatic Speech Recognition | Nathan Roll et.al. | 2601.06972 | null |
| 2026-01-11 | Variational decomposition autoencoding improves disentanglement of latent representations | Ioannis Ziogas et.al. | 2601.06844 | null |
| 2026-01-11 | Doing More with Less: Data Augmentation for Sudanese Dialect Automatic Speech Recognition | Ayman Mansour et.al. | 2601.06802 | null |
| 2026-01-10 | QMAVIS: Long Video-Audio Understanding using Fusion of Large Multimodal Models | Zixing Lin et.al. | 2601.06573 | null |
| 2026-01-09 | An Intelligent AI glasses System with Multi-Agent Architecture for Real-Time Voice Processing and Task Execution | Sheng-Kai Chen et.al. | 2601.06235 | null |
| 2026-01-13 | GenAITEd Ghana: A First-of-Its-Kind Context-Aware and Curriculum-Aligned Conversational AI Agent for Teacher Education | Matthew Nyaaba et.al. | 2601.06093 | null |
| 2026-01-09 | Multimodal In-context Learning for ASR of Low-resource Languages | Zhaolin Li et.al. | 2601.05707 | null |
| 2026-01-08 | LLMs-Integrated Automatic Hate Speech Recognition Using Controllable Text Generation Models | Ryutaro Oshima et.al. | 2601.04654 | null |
| 2026-01-08 | WESR: Scaling and Evaluating Word-level Event-Speech Recognition | Chenchen Yang et.al. | 2601.04508 | null |
| 2026-01-08 | Latent-Level Enhancement with Flow Matching for Robust Automatic Speech Recognition | Da-Hee Yang et.al. | 2601.04459 | null |
| 2026-01-14 | Stuttering-Aware Automatic Speech Recognition for Indonesian Language | Fadhil Muhammad et.al. | 2601.03727 | null |
| 2026-01-08 | TellWhisper: Tell Whisper Who Speaks When | Yifan Hu et.al. | 2601.03712 | null |
| 2026-01-06 | Linear Script Representations in Speech Foundation Models Enable Zero-Shot Transliteration | Ryan Soh-Eun Shim et.al. | 2601.02906 | null |
| 2026-01-06 | Multi-channel multi-speaker transformer for speech recognition | Guo Yifan et.al. | 2601.02688 | null |
| 2026-01-05 | Dynamic Quantization Error Propagation in Encoder-Decoder ASR Quantization | Xinyu Wang et.al. | 2601.02455 | null |
| 2026-01-05 | VocalBridge: Latent Diffusion-Bridge Purification for Defeating Perturbation-Based Voiceprint Defenses | Maryam Abbasihafshejani et.al. | 2601.02444 | null |
| 2026-01-14 | MORE: Multi-Objective Adversarial Attacks on Speech Recognition | Xiaoxue Gao et.al. | 2601.01852 | null |
| 2026-01-03 | IO-RAE: Information-Obfuscation Reversible Adversarial Example for Audio Privacy Protection | Jiajie Zhu et.al. | 2601.01239 | null |
| 2026-01-02 | Improving Code-Switching Speech Recognition with TTS Data Augmentation | Yue Heng Yeo et.al. | 2601.00935 | null |
| 2025-12-31 | Index-ASR Technical Report | Zheshu Song et.al. | 2601.00890 | null |
| 2026-01-02 | Three factor delay learning rules for spiking neural networks | Luke Vassallo et.al. | 2601.00668 | null |
| 2026-01-01 | IKFST: IOO and KOO Algorithms for Accelerated and Precise WFST-based End-to-End Automatic Speech Recognition | Zhuoran Zhuang et.al. | 2601.00160 | null |
| 2025-12-31 | Learning Speech Representations with Variational Predictive Coding | Sung-Lin Yeh et.al. | 2601.00100 | null |
| 2025-12-31 | SLM-TTA: A Framework for Test-Time Adaptation of Generative Spoken Language Models | Yuan-Kuei Wu et.al. | 2512.24739 | null |
| 2025-12-29 | PROFASR-BENCH: A Benchmark for Context-Conditioned ASR in High-Stakes Professional Speech | Deepak Babu Piskala et.al. | 2512.23686 | link |
| 2025-12-17 | Marco-ASR: A Principled and Metric-Driven Framework for Fine-Tuning Large-Scale ASR Models for Domain Adaptation | Xuanfan Ni et.al. | 2512.22165 | null |
| 2025-12-14 | EEG-to-Voice Decoding of Spoken and Imagined speech Using Non-Invasive EEG | Hanbeot Park et.al. | 2512.22146 | null |
| 2025-12-26 | Contextual Biasing for LLM-Based ASR with Hotword Retrieval and Reinforcement Learning | YuXiang Kong et.al. | 2512.21828 | null |
| 2025-12-25 | Broadband tunable microwave photonic radar for simultaneous detection of human respiration, heartbeat, and speech with deep learning-based speech recognition | Lei Gao et.al. | 2512.21566 | null |
| 2025-12-29 | VALLR-Pin: Uncertainty-Factorized Visual Speech Recognition for Mandarin with Pinyin Guidance | Chang Sun et.al. | 2512.20032 | null |
| 2025-12-22 | From Speech to Subtitles: Evaluating ASR Models in Subtitling Italian Television Programs | Alessandro Lucca et.al. | 2512.19161 | null |
| 2025-12-22 | Enhancing Fully Formatted End-to-End Speech Recognition with Knowledge Distillation via Multi-Codebook Vector Quantization | Jian You et.al. | 2512.18967 | null |
| 2025-12-20 | Phoneme-based speech recognition driven by large language models and sampling marginalization | Te Ma et.al. | 2512.18371 | null |
| 2025-12-20 | TICL+: A Case Study On Speech In-Context Learning for Children's Speech Recognition | Haolong Zheng et.al. | 2512.18263 | null |
| 2025-11-27 | Supplementary Resources and Analysis for Automatic Speech Recognition Systems Trained on the Loquacious Dataset | Nick Rossenbach et.al. | 2512.17915 | null |
| 2025-12-19 | Peeking Into The Future For Contextual Biasing | Ramaneswaran Selvakumar et.al. | 2512.17657 | null |
| 2025-12-19 | When De-noising Hurts: A Systematic Study of Speech Enhancement Effects on Modern Medical ASR Systems | Sujal Chondhekar et.al. | 2512.17562 | null |
| 2025-12-19 | Zero-Shot Recognition of Dysarthric Speech Using Commercial Automatic Speech Recognition and Multimodal Large Language Models | Ali Alsayegh et.al. | 2512.17474 | null |
| 2025-12-19 | Incorporating Error Level Noise Embedding for Improving LLM-Assisted Robustness in Persian Speech Recognition | Zahra Rahmani et.al. | 2512.17247 | null |
| 2025-11-04 | V-Agent: An Interactive Video Search System Using Vision-Language Models | SunYoung Park et.al. | 2512.16925 | null |
| 2026-01-14 | Navigating the Reality Gap: Privacy-Preserving On-Device Continual Adaptation of ASR for Clinical Telephony | Darshil Chauhan et.al. | 2512.16401 | null |
| 2026-01-15 | TinyMyo: a Tiny Foundation Model for Flexible EMG Signal Processing at the Edge | Matteo Fasulo et.al. | 2512.15729 | link |
| 2025-12-16 | ComMark: Covert and Robust Black-Box Model Watermarking with Compressed Samples | Yunfei Yang et.al. | 2512.15641 | null |
| 2025-12-16 | Adapting Speech Language Model to Singing Voice Synthesis | Yiwen Zhao et.al. | 2512.14657 | null |
| 2025-12-16 | Scalable Frameworks for Real-World Audio-Visual Speech Recognition | Sungnyun Kim et.al. | 2512.14083 | null |
| 2025-12-15 | Reproducing and Dissecting Denoising Language Models for Speech Recognition | Dorian Koch et.al. | 2512.13576 | null |
| 2025-12-18 | Adaptive Edge-Cloud Inference for Speech-to-Action Systems Using ASR and Large Language Models | Mohammad Jalili Torkamani et.al. | 2512.12769 | null |
| 2025-12-13 | System X: A Mobile Voice-Based AI System for EMR Generation and Clinical Decision Support in Low-Resource Maternal Healthcare | Maryam Mustafa et.al. | 2512.12240 | null |
| 2025-12-12 | All-in-One ASR: Unifying Encoder-Decoder Models of CTC, Attention, and Transducer in Dual-Mode ASR | Takafumi Moriya et.al. | 2512.11543 | null |
| 2025-12-12 | The Affective Bridge: Unifying Feature Representations for Speech Deepfake Detection | Yupei Li et.al. | 2512.11241 | null |
| 2025-12-11 | The TCG CREST -- RKMVERI Submission for the NCIIPC Startup India AI Grand Challenge | Nikhil Raghav et.al. | 2512.11009 | null |
| 2025-11-30 | Benchmarking Automatic Speech Recognition Models for African Languages | Alvin Nahabwe et.al. | 2512.10968 | null |
| 2025-11-30 | ASR Under the Stethoscope: Evaluating Biases in Clinical Speech Recognition across Indian Languages | Subham Kumar et.al. | 2512.10967 | null |
| 2025-12-11 | TRIDENT: A Redundant Architecture for Caribbean-Accented Emergency Speech Triage | Elroy Galbraith et.al. | 2512.10741 | null |
| 2025-12-10 | Robust Speech Activity Detection in the Presence of Singing Voice | Philipp Grundhuber et.al. | 2512.09713 | null |
| 2025-12-02 | Enhancing Automatic Speech Recognition Through Integrated Noise Detection Architecture | Karamvir Singh et.al. | 2512.08973 | null |
| 2025-12-08 | A Simple Method to Enhance Pre-trained Language Models with Speech Tokens for Classification | Nicolas Calbucura et.al. | 2512.07571 | null |
| 2025-12-08 | Efficient ASR for Low-Resource Languages: Leveraging Cross-Lingual Unlabeled Data | Srihari Bandarupalli et.al. | 2512.07277 | null |
| 2025-12-06 | Sanvaad: A Multimodal Accessibility Framework for ISL Recognition and Voice-Based Interaction | Kush Revankar et.al. | 2512.06485 | null |
| 2025-12-01 | KidSpeak: A General Multi-purpose LLM for Kids' Speech Recognition and Screening | Rohan Sharma et.al. | 2512.05994 | null |
| 2025-11-23 | SyncVoice: Towards Video Dubbing with Vision-Augmented Pretrained TTS Model | Kaidi Wang et.al. | 2512.05126 | null |
| 2025-12-04 | Measuring the Unspoken: A Disentanglement Model and Benchmark for Psychological Analysis in the Wild | Yigui Feng et.al. | 2512.04728 | null |
| 2025-12-04 | Multi-Loss Learning for Speech Emotion Recognition with Energy-Adaptive Mixup and Frame-Level Attention | Cong Wang et.al. | 2512.04551 | null |
| 2025-12-02 | Comparing Unsupervised and Supervised Semantic Speech Tokens: A Case Study of Child ASR | Mohan Shi et.al. | 2512.03301 | null |
| 2025-12-02 | MAViD: A Multimodal Framework for Audio-Visual Dialogue Understanding and Generation | Youxin Pang et.al. | 2512.03034 | null |
| 2025-12-02 | Bangla Hate Speech Classification with Fine-tuned Transformer Models | Yalda Keivan Jafari et.al. | 2512.02845 | null |
| 2025-12-02 | Reasoning-Aware Multimodal Fusion for Hateful Video Detection | Shuonan Yang et.al. | 2512.02743 | null |
| 2025-12-02 | Hear What Matters! Text-conditioned Selective Video-to-Audio Generation | Junwon Lee et.al. | 2512.02650 | null |
| 2025-12-01 | See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models | Le Thien Phuc Nguyen et.al. | 2512.02231 | null |
| 2026-01-19 | Swivuriso: The South African Next Voices Multilingual Speech Dataset | Vukosi Marivate et.al. | 2512.02201 | null |
| 2025-11-18 | On the Difficulty of Token-Level Modeling of Dysfluency and Fluency Shaping Artifacts | Kashaf Gulzar et.al. | 2512.02027 | null |
| 2025-12-01 | MAC-SLU: Multi-Intent Automotive Cabin Spoken Language Understanding Benchmark | Yuezhang Peng et.al. | 2512.01603 | link |
| 2025-12-01 | ZO-ASR: Zeroth-Order Fine-Tuning of Speech Foundation Models without Back-Propagation | Yuezhang Peng et.al. | 2512.01267 | null |
| 2025-11-28 | OmniFusion: Simultaneous Multilingual Multimodal Translations via Modular Fusion | Sai Koneru et.al. | 2512.00234 | link |
| 2025-11-28 | Scaling HuBERT for African Languages: From Base to Large and XL | Antoine Caubrière et.al. | 2511.23370 | null |
| 2025-11-28 | HPSU: A Benchmark for Human-Level Perception in Real-World Spoken Speech Understanding | Chen Li et.al. | 2511.23178 | null |
| 2025-11-28 | Group-Aware Partial Model Merging for Children's Automatic Speech Recognition | Thomas Rolland et.al. | 2511.23098 | null |
| 2025-11-27 | Modeling Romanized Hindi and Bengali: Dataset Creation and Multilingual LLM Integration | Kanchon Gharami et.al. | 2511.22769 | null |
| 2025-11-27 | Do You See What I Say? Generalizable Deepfake Detection based on Visual Speech Recognition | Maheswar Bora et.al. | 2511.22443 | null |
| 2025-11-27 | Layover or Direct Flight: Rethinking Audio-Guided Image Segmentation | Joel Alberto Santos et.al. | 2511.22025 | null |
| 2025-11-16 | On the Cross-lingual Transferability of Pre-trained wav2vec2-based Models | Jonatas Grosman et.al. | 2511.21704 | null |
| 2025-11-26 | Multi-Reward GRPO for Stable and Prosodic Single-Codebook TTS LLMs at Scale | Yicheng Zhong et.al. | 2511.21270 | null |
| 2025-11-26 | ASR Error Correction in Low-Resource Burmese with Alignment-Enhanced Transformers using Phonetic Features | Ye Bhone Lin et.al. | 2511.21088 | null |
| 2025-11-26 | Towards Audio Token Compression in Large Audio Language Models | Saurabhchand Bhati et.al. | 2511.20973 | null |
| 2025-12-24 | SingingSDS: A Singing-Capable Spoken Dialogue System for Conversational Roleplay Applications | Jionghao Han et.al. | 2511.20972 | link |
| 2025-11-25 | Bridging the Language Gap: Synthetic Voice Diversity via Latent Mixup for Equitable Speech Recognition | Wesley Bian et.al. | 2511.20534 | null |
| 2025-11-25 | Mispronunciation Detection and Diagnosis Without Model Training: A Retrieval-Based Approach | Huu Tuong Tu et.al. | 2511.20107 | null |
| 2025-11-25 | EM2LDL: A Multilingual Speech Corpus for Mixed Emotion Recognition through Label Distribution Learning | Xingfeng Li et.al. | 2511.20106 | null |
| 2025-11-25 | It Hears, It Sees too: Multi-Modal LLM for Depression Detection By Integrating Visual Understanding into Audio Language Models | Xiangyu Zhao et.al. | 2511.19877 | null |
| 2025-11-24 | Neural Architecture Search for Quantum Autoencoders | Hibah Agha et.al. | 2511.19246 | null |
| 2025-11-24 | AuViRe: Audio-visual Speech Representation Reconstruction for Deepfake Temporal Localization | Christos Koutlis et.al. | 2511.18993 | null |
| 2025-11-27 | PrismAudio: Decomposed Chain-of-Thoughts and Multi-dimensional Rewards for Video-to-Audio Generation | Huadai Liu et.al. | 2511.18833 | null |
| 2025-11-24 | Context-Aware Whisper for Arabic ASR Under Linguistic Varieties | Bashar Talafha et.al. | 2511.18774 | null |
| 2025-11-24 | AIRHILT: A Human-in-the-Loop Testbed for Multimodal Conflict Detection in Aviation | Omar Garib et.al. | 2511.18718 | null |
| 2025-11-23 | A Multimodal Conversational Agent for Tabular Data Analysis | Mohammad Nour Al Awad et.al. | 2511.18405 | null |
| 2025-11-21 | Point of Order: Action-Aware LLM Persona Modeling for Realistic Civic Simulation | Scott Merrill et.al. | 2511.17813 | null |
| 2025-11-12 | Speech Recognition Model Improves Text-to-Speech Synthesis using Fine-Grained Reward | Guansu Wang et.al. | 2511.17555 | null |
| 2025-11-21 | Enhancing Quranic Learning: A Multimodal Deep Learning Approach for Arabic Phoneme Recognition | Ayhan Kucukmanisa et.al. | 2511.17477 | null |
| 2025-11-21 | Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM | Chiori Hori et.al. | 2511.17335 | null |
| 2025-11-21 | Investigating self-supervised representations for audio-visual deepfake detection | Dragos-Alexandru Boldisor et.al. | 2511.17181 | null |
| 2026-01-19 | WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue | Zachary Ellis et.al. | 2511.16544 | null |
| 2025-12-03 | NLP Datasets for Idiom and Figurative Language Tasks | Blake Matheny et.al. | 2511.16345 | null |
| 2025-11-20 | Train Short, Infer Long: Speech-LLM Enables Zero-Shot Streamable Joint ASR and Diarization on Long Audio | Mohan Shi et.al. | 2511.16046 | null |
| 2025-11-19 | Scriboora: Rethinking Human Pose Forecasting | Daniel Bermuth et.al. | 2511.15565 | null |
| 2025-11-18 | Quality-Controlled Multimodal Emotion Recognition in Conversations with Identity-Based Transfer Learning and MAMBA Fusion | Zanxu Wang et.al. | 2511.14969 | null |
| 2025-11-18 | Ground Truth Generation for Multilingual Historical NLP using LLMs | Clovis Gladstone et.al. | 2511.14688 | null |
| 2025-12-01 | IMSE: Efficient U-Net-based Speech Enhancement using Inception Depthwise Convolution and Amplitude-Aware Linear Attention | Xinxin Tang et.al. | 2511.14515 | null |
| 2025-11-18 | TTA: Transcribe, Translate and Alignment for Cross-lingual Speech Representation | Wei Liu et.al. | 2511.14410 | null |
| 2025-11-18 | AfriSpeech-MultiBench: A Verticalized Multidomain Multicountry Benchmark Suite for African Accented English ASR | Gabrial Zencha Ashungafac et.al. | 2511.14255 | null |
| 2025-11-19 | StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model | Yifan Yang et.al. | 2511.14223 | null |
| 2025-11-18 | Listen Like a Teacher: Mitigating Whisper Hallucinations using Adaptive Layer Attention and Knowledge Distillation | Kumud Tripathi et.al. | 2511.14219 | null |
| 2025-11-17 | Human-centric Maintenance Process Through Integration of AI, Speech, and AR | Parul Khanna et.al. | 2511.13918 | null |
| 2025-11-19 | Segmenting Collision Sound Sources in Egocentric Videos | Kranti Kumar Parida et.al. | 2511.13863 | null |
| 2025-11-26 | Passive Dementia Screening via Facial Temporal Micro-Dynamics Analysis of In-the-Wild Talking-Head Video | Filippo Cenacchi et.al. | 2511.13802 | null |
| 2025-11-05 | Emotion Recognition in Multi-Speaker Conversations through Speaker Identification, Knowledge Distillation, and Hierarchical Fusion | Xiao Li et.al. | 2511.13731 | null |
| 2026-01-14 | Toward Conversational Hungarian Speech Recognition: Introducing the BEA-Large and BEA-Dialogue Datasets | Máté Gedeon et.al. | 2511.13529 | null |
| 2025-11-17 | Spatial Blind Spot: Auditory Motion Perception Deficits in Audio LLMs | Zhe Sun et.al. | 2511.13273 | null |
| 2025-11-17 | Distinguishing Repetition Disfluency from Morphological Reduplication in Bangla ASR Transcripts: A Novel Corpus and Benchmarking Analysis | Zaara Zabeen Arpa et.al. | 2511.13159 | null |
| 2025-11-16 | Hi-Reco: High-Fidelity Real-Time Conversational Digital Humans | Hongbin Huang et.al. | 2511.12662 | null |
| 2025-11-23 | Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data | Yunxin Li et.al. | 2511.12609 | link |
| 2025-11-15 | How Far Do SSL Speech Models Listen for Tone? Temporal Focus of Tone Representation under Low-resource Transfer | Minu Kim et.al. | 2511.12285 | null |
| 2025-11-15 | Fusionista2.0: Efficiency Retrieval System for Large-Scale Datasets | Huy M. Le et.al. | 2511.12255 | null |
| 2025-11-12 | Tighter Truncated Rectangular Prism Approximation for RNN Robustness Verification | Xingqi Lin et.al. | 2511.11699 | null |
| 2025-11-12 | Beyond saliency: enhancing explanation of speech emotion recognition with expert-referenced acoustic cues | Seham Nasr et.al. | 2511.11691 | null |
| 2025-11-14 | Speech-Aware Long Context Pruning and Integration for Contextualized Automatic Speech Recognition | Yiming Rong et.al. | 2511.11139 | null |
| 2025-11-13 | TEDxTN: A Three-way Speech Translation Corpus for Code-Switched Tunisian Arabic - English | Fethi Bougares et.al. | 2511.10780 | null |
| 2025-11-09 | Towards Fine-Grained Code-Switch Speech Translation with Semantic Space Alignment | Yan Gao et.al. | 2511.10670 | null |
| 2025-11-13 | Music Flamingo: Scaling Music Understanding in Audio Language Models | Sreyan Ghosh et.al. | 2511.10289 | null |
| 2025-11-12 | Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages | Omnilingual ASR team et.al. | 2511.09690 | link |
| 2025-11-12 | End-to-end Contrastive Language-Speech Pretraining Model For Long-form Spoken Question Answering | Jiliang Hu et.al. | 2511.09282 | null |
| 2025-11-12 | Context-Aware Dynamic Chunking for Streaming Tibetan Speech Recognition | Chao Wang et.al. | 2511.09085 | null |
| 2025-11-12 | Towards Effective and Efficient Non-autoregressive decoders for Conformer and LLM-based ASR using Block-based Attention Mask | Tianzi Wang et.al. | 2511.09084 | null |
| 2025-11-11 | Quantizing Whisper-small: How design choices affect ASR performance | Arthur Söhler et.al. | 2511.08093 | null |
| 2025-11-11 | Speech Emotion Recognition with Phonation Excitation Information and Articulatory Kinematics | Ziqian Zhang et.al. | 2511.07955 | null |
| 2025-11-13 | SpikCommander: A High-performance Spiking Transformer with Multi-view Learning for Efficient Speech Command Recognition | Jiaqi Wang et.al. | 2511.07883 | null |
| 2025-11-24 | SynTTS-Commands: A Public Dataset for On-Device KWS via TTS-Synthesized Multilingual Speech | Lu Gan et.al. | 2511.07821 | null |
| 2025-11-10 | LiveNeRF: Efficient Face Replacement Through Neural Radiance Fields Integration | Tung Vu et.al. | 2511.07552 | null |
| 2025-11-10 | Enabling Automatic Self-Talk Detection via Earables | Euihyeok Lee et.al. | 2511.07493 | null |
| 2025-11-11 | Surgical Agent Orchestration Platform for Voice-directed Patient Data Interaction | Hyeryun Park et.al. | 2511.07392 | null |
| 2025-11-10 | Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models | Umberto Cappellazzo et.al. | 2511.07253 | link |
| 2025-11-10 | Improving Remote Patient Monitoring Systems Using a Fog-based IoT Platform with Speech Recognition | Marc Jayson Baucas et.al. | 2511.07189 | null |
| 2025-11-10 | E2E-VGuard: Adversarial Prevention for Production LLM-based End-To-End Speech Synthesis | Zhisheng Zhang et.al. | 2511.07099 | null |
| 2025-11-10 | CLiFT-ASR: A Cross-Lingual Fine-Tuning Framework for Low-Resource Taiwanese Hokkien Speech Recognition | Hung-Yang Sung et.al. | 2511.06860 | null |
| 2025-11-10 | MedVoiceBias: A Controlled Study of Audio LLM Behavior in Clinical Decision-Making | Zhi Rui Tam et.al. | 2511.06592 | null |
| 2025-11-07 | Shared Latent Representation for Joint Text-to-Audio-Visual Synthesis | Dogucan Yaman et.al. | 2511.05432 | null |
| 2025-11-12 | MERaLiON-SER: Robust Speech Emotion Recognition Model for English and SEA Languages | Hardik B. Sailor et.al. | 2511.04914 | null |
| 2025-11-06 | CantoASR: Prosody-Aware ASR-LALM Collaboration for Low-Resource Cantonese | Dazhong Chen et.al. | 2511.04139 | null |
| 2025-11-06 | WST: Weakly Supervised Transducer for Automatic Speech Recognition | Dongji Gao et.al. | 2511.04035 | null |
| 2025-11-06 | Accelerating scientific discovery with the common task framework | J. Nathan Kutz et.al. | 2511.04001 | null |
| 2025-11-05 | Seeing What You Say: Expressive Image Generation from Speech | Jiyoung Lee et.al. | 2511.03423 | null |
| 2025-11-05 | Open Source State-Of-the-Art Solution for Romanian Speech Recognition | Gabriel Pirlogeanu et.al. | 2511.03361 | null |
| 2025-11-05 | TASU: Text-Only Alignment for Speech Understanding | Jing Peng et.al. | 2511.03310 | null |
| 2025-11-11 | How to Evaluate Speech Translation with Source-Aware Neural MT Metrics | Mauro Cettolo et.al. | 2511.03295 | null |
| 2025-11-04 | An unscented Kalman filter method for real time input-parameter-state estimation | Marios Impraimakis et.al. | 2511.02717 | null |
| 2025-11-04 | Energy-Efficient Hardware Acceleration of Whisper ASR on a CGLA | Takuto Ando et.al. | 2511.02269 | null |
| 2025-11-03 | SeaLLMs-Audio: Large Audio-Language Models for Southeast Asia | Chaoqun Liu et.al. | 2511.01670 | null |
| 2025-11-02 | MULTI-Bench: A Multi-Turn Interactive Benchmark for Assessing Emotional Intelligence ability of Spoken Dialogue Models | Yayue Deng et.al. | 2511.00850 | null |
| 2025-11-01 | Emotion Detection in Speech Using Lightweight and Transformer-Based Models: A Comparative and Ablation Study | Lucky Onyekwelu-Udoka et.al. | 2511.00402 | null |
| 2025-10-31 | Reference Microphone Selection for Guided Source Separation based on the Normalized L-p Norm | Anselm Lohmann et.al. | 2510.27198 | null |
| 2025-10-30 | Overview of the MEDIQA-OE 2025 Shared Task on Medical Order Extraction from Doctor-Patient Consultations | Jean-Philippe Corbeil et.al. | 2510.26974 | null |
| 2025-10-29 | Multi-Representation Attention Framework for Underwater Bioacoustic Denoising and Recognition | Amine Razig et.al. | 2510.26838 | null |
| 2025-10-29 | Audio-Visual Speech Enhancement In Complex Scenarios With Separation And Dereverberation Joint Modeling | Jiarong Du et.al. | 2510.26825 | null |
| 2025-10-28 | Cross-Corpus Validation of Speech Emotion Recognition in Urdu using Domain-Knowledge Acoustic Features | Unzela Talpur et.al. | 2510.26823 | null |
| 2025-10-28 | See the Speaker: Crafting High-Resolution Talking Faces from Speech with Prior Guidance and Region Refinement | Jinting Wang et.al. | 2510.26819 | null |
| 2025-10-30 | HMM for short independent sequences: Multiple sequence Baum-Welch application | Margarita Cabrera-Bean et.al. | 2510.26532 | null |
| 2025-10-29 | Lost in Phonation: Voice Quality Variation as an Evaluation Dimension for Speech Foundation Models | Harm Lameris et.al. | 2510.25577 | null |
| 2025-10-29 | Learning Disentangled Speech- and Expression-Driven Blendshapes for 3D Talking Face Animation | Yuxiang Mao et.al. | 2510.25234 | null |
| 2025-10-30 | Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech | Pedro Corrêa et.al. | 2510.25054 | null |
| 2025-10-28 | POWSM: A Phonetic Open Whisper-Style Speech Foundation Model | Chin-Jou Li et.al. | 2510.24992 | null |
| 2025-11-25 | Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation | Inclusion AI et.al. | 2510.24821 | null |
| 2025-10-28 | BEST-RQ-Based Self-Supervised Learning for Whisper Domain Adaptation | Raphaël Bagat et.al. | 2510.24570 | null |
| 2025-10-28 | Levée d'ambiguïtés par grammaires locales | Eric G. C. Laporte et.al. | 2510.24530 | null |
| 2025-10-30 | Audio Signal Processing Using Time Domain Mel-Frequency Wavelet Coefficient | Rinku Sebastian et.al. | 2510.24519 | null |
| 2025-10-28 | Sound Source Localization for Spatial Mapping of Surgical Actions in Dynamic Scenes | Jonas Hein et.al. | 2510.24332 | null |
| 2025-10-28 | V-SAT: Video Subtitle Annotation Tool | Arpita Kundu et.al. | 2510.24180 | null |
| 2025-10-28 | RegSpeech12: A Regional Corpus of Bengali Spontaneous Speech Across Dialects | Md. Rezuwan Hassan et.al. | 2510.24096 | null |
| 2025-10-28 | Listening without Looking: Modality Bias in Audio-Visual Captioning | Yuchi Ishikawa et.al. | 2510.24024 | null |
| 2025-10-30 | TeleEgo: Benchmarking Egocentric AI Assistants in the Wild | Jiaqi Yan et.al. | 2510.23981 | null |
| 2025-10-27 | A Neural Model for Contextual Biasing Score Learning and Filtering | Wanting Huang et.al. | 2510.23849 | null |
| 2025-11-01 | RoboOmni: Proactive Robot Manipulation in Omni-modal Context | Siyin Wang et.al. | 2510.23763 | link |
| 2025-10-27 | LibriConvo: Simulating Conversations from Read Literature for ASR and Diarization | Máté Gedeon et.al. | 2510.23320 | null |
| 2025-10-27 | Arabic Little STT: Arabic Children Speech Recognition Dataset | Mouhand Alkadri et.al. | 2510.23319 | null |
| 2025-10-27 | A Cocktail-Party Benchmark: Multi-Modal dataset and Comparative Evaluation Results | Thai-Binh Nguyen et.al. | 2510.23276 | null |
| 2025-10-29 | Are ASR foundation models generalized enough to capture features of regional dialects for low-resource languages? | Tawsif Tashwar Dipto et.al. | 2510.23252 | null |
| 2025-10-27 | Treble10: A high-quality dataset for far-field speech recognition, dereverberation, and enhancement | Sarabeth S. Mullins et.al. | 2510.23141 | null |
| 2025-10-27 | Adapting Speech Foundation Models with Large Language Models for Unified Speech Recognition | Jing-Xuan Zhang et.al. | 2510.22961 | null |
| 2025-10-26 | EchoMind: An Interrelated Multi-level Benchmark for Evaluating Empathetic Speech Language Models | Li Zhou et.al. | 2510.22758 | null |
| 2025-10-26 | LRW-Persian: Lip-reading in the Wild Dataset for Persian Language | Zahra Taghizadeh et.al. | 2510.22716 | null |
| 2025-10-28 | Look and Tell: A Dataset for Multimodal Grounding Across Egocentric and Exocentric Views | Anna Deichler et.al. | 2510.22672 | null |
| 2025-11-02 | Mitigating Attention Sinks and Massive Activations in Audio-Visual Speech Recognition with LLMs | Anand et.al. | 2510.22603 | link |
| 2025-10-26 | A Sociophonetic Analysis of Racial Bias in Commercial ASR Systems Using the Pacific Northwest English Corpus | Michael Scott et.al. | 2510.22495 | null |
| 2025-10-26 | The Tonogenesis Continuum in Tibetan: A Computational Investigation | Siyu Liang et.al. | 2510.22485 | null |
| 2025-10-25 | M-CIF: Multi-Scale Alignment For CIF-Based Non-Autoregressive ASR | Ruixiang Mao et.al. | 2510.22172 | null |
| 2025-10-23 | LSF-Animation: Label-Free Speech-Driven Facial Animation via Implicit Feature Representation | Xin Lu et.al. | 2510.21864 | null |
| 2025-10-24 | Compressing Quaternion Convolutional Neural Networks for Audio Classification | Arshdeep Singh et.al. | 2510.21388 | null |
| 2025-10-24 | SindBERT, the Sailor: Charting the Seas of Turkish NLP | Raphael Scheible-Schmitt et.al. | 2510.21364 | null |
| 2025-10-27 | ReFESS-QI: Reference-Free Evaluation For Speech Separation With Joint Quality And Intelligibility Scoring | Ari Frummer et.al. | 2510.21014 | null |
| 2025-10-22 | Beyond Hearing: Learning Task-agnostic ExG Representations from Earphones via Physiology-informed Tokenization | Hyungjun Yoon et.al. | 2510.20853 | null |
| 2025-10-21 | Can large audio language models understand child stuttering speech? speech summarization, and source separation | Chibuzor Okocha et.al. | 2510.20850 | null |
| 2025-10-23 | Decoding the Ear: A Framework for Objectifying Expressiveness from Human Preference Through Efficient Alignment | Zhiyu Lin et.al. | 2510.20513 | null |
| 2025-10-23 | Speaking Clearly: A Simplified Whisper-Based Codec for Low-Bitrate Speech Coding | Xin Zhang et.al. | 2510.20504 | link |
| 2025-10-23 | SpeechAgent: An End-to-End Mobile Infrastructure for Speech Impairment Assistance | Haowei Lou et.al. | 2510.20113 | null |
| 2025-10-22 | Re-evaluating Minimum Bayes Risk Decoding for Automatic Speech Recognition | Yuu Jinnai et.al. | 2510.19471 | null |
| 2025-10-22 | Time delay embeddings to characterize the timbre of musical instruments using Topological Data Analysis: a study on synthetic and real data | Gakusei Sato et.al. | 2510.19435 | null |
| 2025-10-23 | FLASH Viterbi: Fast and Adaptive Viterbi Decoding for Modern Data Systems | Ziheng Deng et.al. | 2510.19301 | null |
| 2025-10-22 | Tibetan Language and AI: A Comprehensive Survey of Resources, Methods and Challenges | Cheng Huang et.al. | 2510.19144 | null |
| 2025-11-05 | StutterZero and StutterFormer: End-to-End Speech Conversion for Stuttering Transcription and Correction | Qianheng Xu et.al. | 2510.18938 | null |
| 2025-10-28 | RIR-Mega: a large-scale simulated room impulse response dataset for machine learning and room acoustics modeling | Mandip Goswami et.al. | 2510.18917 | link |
| 2025-10-21 | Adapting Language Balance in Code-Switching Speech | Enes Yavuz Ugan et.al. | 2510.18724 | null |
| 2025-10-23 | MLMA: Towards Multilingual ASR With Mamba-based Architectures | Mohamed Nabih Ali et.al. | 2510.18684 | null |
| 2025-10-21 | KrishokBondhu: A Retrieval-Augmented Voice-Based Agricultural Advisory Call Center for Bengali Farmers | Mohd Ruhul Ameen et.al. | 2510.18355 | null |
| 2025-10-20 | Transformer Redesign for Late Fusion of Audio-Text Features on Ultra-Low-Power Edge Hardware | Stavros Mitsis et.al. | 2510.18036 | null |
| 2025-10-20 | ImaGGen: Zero-Shot Generation of Co-Speech Semantic Gestures Grounded in Language and Image Input | Hendric Voss et.al. | 2510.17617 | null |
| 2025-10-20 | Conveying Meaning through Gestures: An Investigation into Semantic Co-Speech Gesture Generation | Hendric Voss et.al. | 2510.17599 | null |
| 2025-10-19 | End-to-end Listen, Look, Speak and Act | Siyin Wang et.al. | 2510.16756 | null |
| 2025-10-19 | Zero- and One-Shot Data Augmentation for Sentence-Level Dysarthric Speech Recognition in Constrained Scenarios | Shiyao Wang et.al. | 2510.16700 | null |
| 2025-10-18 | Hallucination Benchmark for Speech Foundation Models | Alkis Koudounas et.al. | 2510.16567 | null |
| 2025-10-18 | Interpreting the Dimensions of Speaker Embedding Space | Mark Huckvale et.al. | 2510.16489 | null |
| 2025-10-18 | Probing the Hidden Talent of ASR Foundation Models for L2 English Oral Assessment | Fu-An Chao et.al. | 2510.16387 | null |
| 2025-10-18 | MuseTok: Symbolic Music Tokenization for Generation and Semantic Understanding | Jingyue Huang et.al. | 2510.16273 | null |
| 2025-10-17 | SpeechLLMs for Large-scale Contextualized Zero-shot Slot Filling | Kadri Hacioglu et.al. | 2510.15851 | null |
| 2025-10-17 | Magnitude and Phase-based Feature Fusion Using Co-attention Mechanism for Speaker recognition | Rongfeng Su et.al. | 2510.15659 | null |
| 2025-10-17 | SpikeVox: Towards Energy-Efficient Speech Therapy Framework with Spike-driven Generative Language Models | Rachmad Vidya Wicaksana Putra et.al. | 2510.15566 | null |
| 2025-10-17 | VocalBench-DF: A Benchmark for Evaluating Speech LLM Robustness to Disfluency | Hongcheng Liu et.al. | 2510.15406 | null |
| 2025-10-16 | OmniMotion: Multimodal Motion Generation with Continuous Masked Autoregression | Zhe Li et.al. | 2510.14954 | null |
| 2025-10-16 | RLAIF-SPA: Optimizing LLM-based Emotional Speech Synthesis via RLAIF | Qing Yang et.al. | 2510.14628 | null |
| 2025-10-15 | Do Slides Help? Multi-modal Context for Automatic Transcription of Conference Talks | Supriti Sinhamahapatra et.al. | 2510.13979 | null |
| 2025-10-15 | Two Heads Are Better Than One: Audio-Visual Speech Error Correction with Dual Hypotheses | Sungnyun Kim et.al. | 2510.13281 | null |
| 2025-11-13 | A Critical Review of the Need for Knowledge-Centric Evaluation of Quranic Recitation | Mohammed Hilal Al-Kharusi et.al. | 2510.12858 | null |
| 2025-10-14 | Adaptive vector steering: A training-free, layer-wise intervention for hallucination mitigation in large audio and multimodal models | Tsung-En Lin et.al. | 2510.12851 | null |
| 2025-10-11 | Automatic Speech Recognition in the Modern Era: Architectures, Training, and Evaluation | Md. Nayeem et.al. | 2510.12827 | null |
| 2025-10-14 | Structured Sparsity and Weight-adaptive Pruning for Memory and Compute efficient Whisper models | Prasenjit K Mudi et.al. | 2510.12666 | null |
| 2025-10-12 | End-to-end Speech Recognition with similar length speech and text | Peng Fan et.al. | 2510.10453 | null |
| 2025-10-11 | End-to-end Automatic Speech Recognition and Speech Translation: Integration of Speech Foundational Models and LLMs | Nam Luu et.al. | 2510.10329 | null |
| 2025-10-11 | SyncLipMAE: Contrastive Masked Pretraining for Audio-Visual Talking-Face Representation | Zeyu Ling et.al. | 2510.10069 | null |
| 2025-10-10 | Accent-Invariant Automatic Speech Recognition via Saliency-Driven Spectrogram Masking | Mohammad Hossein Sameti et.al. | 2510.09528 | null |
| 2025-10-10 | WildElder: A Chinese Elderly Speech Dataset from the Wild with Fine-Grained Manual Annotations | Hui Wang et.al. | 2510.09344 | null |
| 2025-10-10 | Effects of automotive microphone frequency response characteristics and noise conditions on speech and ASR quality -- an experimental evaluation | Michele Buccoli et.al. | 2510.09236 | null |
| 2025-10-10 | FLToP CTC: Frame-Level Token Pruning via Relative Threshold for Efficient and Memory-Saving Decoding on Diverse Platforms | Atul Shree et.al. | 2510.09085 | null |
| 2025-10-08 | Look before Transcription: End-to-End SlideASR with Visually-Anchored Policy Optimization | Rui Hu et.al. | 2510.08618 | null |
| 2025-10-01 | Articulation-Informed ASR: Integrating Articulatory Features into ASR via Auxiliary Speech Inversion and Cross-Attention Fusion | Ahmed Adel Attia et.al. | 2510.08585 | null |
| 2025-10-09 | Pseudo2Real: Task Arithmetic for Pseudo-Label Correction in Automatic Speech Recognition | Yi-Cheng Lin et.al. | 2510.08047 | null |
| 2025-10-09 | Bloodroot: When Watermarking Turns Poisonous For Stealthy Backdoor | Kuan-Yu Chen et.al. | 2510.07909 | null |
| 2025-10-08 | How much speech data is necessary for ASR in African languages? An evaluation of data scaling in Kinyarwanda and Kikuyu | Benjamin Akera et.al. | 2510.07221 | null |
| 2025-10-09 | Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation | Vaibhav Srivastav et.al. | 2510.06961 | null |
| 2025-10-07 | Linguistically Informed Tokenization Improves ASR for Underresourced Languages | Massimo Daul et.al. | 2510.06461 | null |
| 2025-10-06 | How I Built ASR for Endangered Languages with a Spoken Dictionary | Christopher Bartley et.al. | 2510.04832 | null |
| 2025-10-06 | UniVoice: Unifying Autoregressive ASR and Flow-Matching based TTS with Large Language Models | Wenhao Guan et.al. | 2510.04593 | null |
| 2025-10-06 | Evaluating Self-Supervised Speech Models via Text-Based LLMS | Takashi Maekaku et.al. | 2510.04463 | null |
| 2025-10-05 | Probing Whisper for Dysarthric Speech in Detection and Assessment | Zhengjun Yue et.al. | 2510.04219 | null |
| 2025-10-05 | Drax: Speech Recognition with Discrete Flow Matching | Aviv Navon et.al. | 2510.04162 | link |
| 2025-10-05 | MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition | Umberto Cappellazzo et.al. | 2510.04136 | null |
| 2025-10-04 | Adapting Diarization-Conditioned Whisper for End-to-End Multi-Talker Speech Recognition | Martin Kocour et.al. | 2510.03723 | null |
| 2025-10-04 | Towards Unsupervised Speech Recognition at the Syllable-Level | Liming Wang et.al. | 2510.03639 | null |
| 2025-10-04 | Scaling Multi-Talker ASR with Speaker-Agnostic Activity Streams | Xiluo He et.al. | 2510.03630 | null |
| 2025-10-03 | Listening or Reading? Evaluating Speech Awareness in Chain-of-Thought Speech-to-Text Translation | Jacobo Romero-Díaz et.al. | 2510.03115 | null |
| 2025-10-03 | Revisiting Direct Speech-to-Text Translation with Speech LLMs: Better Scaling than CoT Prompting? | Oriol Pareras et.al. | 2510.03093 | null |
| 2025-10-16 | Transcribe, Translate, or Transliterate: An Investigation of Intermediate Representations in Spoken Language Models | Tolúlopé Ògúnrèmí et.al. | 2510.02569 | null |
| 2025-09-26 | KAME: Tandem Architecture for Enhancing Knowledge in Real-Time Speech-to-Speech Conversational AI | So Kuroki et.al. | 2510.02327 | null |
| 2025-10-02 | EvolveCaptions: Empowering DHH Users Through Real-Time Collaborative Captioning | Liang-Yuan Wu et.al. | 2510.02181 | null |
| 2025-10-01 | Backdoor Attacks Against Speech Language Models | Alexandrine Fortier et.al. | 2510.01157 | null |
| 2025-10-01 | Automatic Speech Recognition (ASR) for African Low-Resource Languages: A Systematic Literature Review | Sukairaj Hafiz Imam et.al. | 2510.01145 | null |
| 2025-10-01 | Spiralformer: Low Latency Encoder for Streaming Speech Recognition with Circular Layer Skipping and Early Exiting | Emiru Tsunoo et.al. | 2510.00982 | null |
| 2025-09-30 | IR-UWB Radar-Based Contactless Silent Speech Recognition with Attention-Enhanced Temporal Convolutional Networks | Sunghwa Lee et.al. | 2509.26409 | null |
| 2025-09-30 | ASR Under Noise: Exploring Robustness for Sundanese and Javanese | Salsabila Zahirah Pranida et.al. | 2509.25878 | null |
| 2025-09-29 | Beyond WER: Probing Whisper's Sub-token Decoder Across Diverse Language Resource Levels | Siyu Liang et.al. | 2509.25516 | null |
| 2025-09-29 | Confidence-Guided Error Correction for Disordered Speech Recognition | Abner Hernandez et.al. | 2509.25048 | null |
| 2025-10-05 | HiKE: Hierarchical Evaluation Framework for Korean-English Code-Switching Speech Recognition | Gio Paik et.al. | 2509.24613 | link |
| 2025-09-29 | A Text-To-Text Alignment Algorithm for Better Evaluation of Modern Speech Recognition Systems | Lasse Borgholt et.al. | 2509.24478 | null |
| 2025-09-29 | Code-switching Speech Recognition Under the Lens: Model- and Data-Centric Perspectives | Hexin Liu et.al. | 2509.24310 | null |
| 2025-09-28 | AISHELL6-whisper: A Chinese Mandarin Audio-visual Whisper Speech Dataset with Speech Recognition Baselines | Cancan Li et.al. | 2509.23833 | null |
| 2025-09-28 | Automatic Speech Recognition for Greek Medical Dictation | Vardis Georgilas et.al. | 2509.23550 | null |
| 2025-09-30 | MeanFlowSE: One-Step Generative Speech Enhancement via MeanFlow | Yike Zhu et.al. | 2509.23299 | null |
| 2025-09-26 | ArFake: A Multi-Dialect Benchmark and Baselines for Arabic Spoof-Speech Detection | Mohamed Maged et.al. | 2509.22808 | null |
| 2025-09-26 | Index-MSR: A high-efficiency multimodal fusion framework for speech recognition | Jinming Chen et.al. | 2509.22744 | null |
| 2025-10-10 | From Coarse to Fine: Recursive Audio-Visual Semantic Enhancement for Speech Separation | Ke Xue et.al. | 2509.22425 | null |
| 2025-09-26 | Decoding Deception: Understanding Automatic Speech Recognition Vulnerabilities in Evasion and Poisoning Attacks | Aravindhan G et.al. | 2509.22060 | null |
| 2025-09-26 | A Parallel Ultra-Low Power Silent Speech Interface based on a Wearable, Fully-dry EMG Neckband | Fiona Meier et.al. | 2509.21964 | null |
| 2025-09-26 | Lightweight Front-end Enhancement for Robust ASR via Frame Resampling and Sub-Band Pruning | Siyi Zhao et.al. | 2509.21833 | null |
| 2025-09-26 | Align2Speak: Improving TTS for Low Resource Languages via ASR-Guided Online Preference Optimization | Shehzeen Hussain et.al. | 2509.21718 | null |
| 2025-09-27 | i-LAVA: Insights on Low Latency Voice-2-Voice Architecture for Agents | Anupam Purwar et.al. | 2509.20971 | null |
| 2025-09-25 | Real-Time System for Audio-Visual Target Speech Enhancement | T. Aleksandra Ma et.al. | 2509.20741 | null |
| 2025-09-25 | Visual Authority and the Rhetoric of Health Misinformation: A Multimodal Analysis of Social Media Videos | Mohammad Reza Zarei et.al. | 2509.20724 | null |
| 2025-09-23 | Variational Low-Rank Adaptation for Personalized Impaired Speech Recognition | Niclas Pokel et.al. | 2509.20397 | null |
| 2025-09-23 | Data-Efficient ASR Personalization for Non-Normative Speech Using an Uncertainty-Based Phoneme Difficulty Score for Guided Sampling | Niclas Pokel et.al. | 2509.20396 | null |
| 2025-09-26 | MMedFD: A Real-world Healthcare Benchmark for Multi-turn Full-Duplex Automatic Speech Recognition | Hongzhao Chen et.al. | 2509.19817 | null |
| 2025-09-23 | Retrieval Augmented Generation based context discovery for ASR | Dimitrios Siskos et.al. | 2509.19567 | null |
| 2025-09-23 | SloPalSpeech: A 2,800-Hour Slovak Speech Corpus from Parliamentary Data | Erik Božík et.al. | 2509.19270 | null |
| 2025-09-23 | HD-PPT: Hierarchical Decoding of Content- and Prompt-Preference Tokens for Instruction-based TTS | Sihang Nie et.al. | 2509.19001 | null |
| 2025-09-23 | Group Relative Policy Optimization for Text-to-Speech with Large Language Models | Chang Liu et.al. | 2509.18798 | null |
| 2025-09-24 | M4SER: Multimodal, Multirepresentation, Multitask, and Multistrategy Learning for Speech Emotion Recognition | Jiajun He et.al. | 2509.18706 | null |
| 2025-09-23 | HarmoniFuse: A Component-Selective and Prompt-Adaptive Framework for Multi-Task Speech Language Modeling | Yuke Si et.al. | 2509.18570 | null |
| 2025-09-23 | Explore the Reinforcement Learning for the LLM based ASR and TTS system | Changfeng Gao et.al. | 2509.18569 | null |
| 2025-09-24 | MNV-17: A High-Quality Performative Mandarin Dataset for Nonverbal Vocalization Recognition in Speech | Jialong Mai et.al. | 2509.18196 | null |
| 2025-09-22 | Transformer-Encoder Trees for Efficient Multilingual Machine Translation and Speech Translation | Yiwen Guan et.al. | 2509.17930 | null |
| 2025-09-22 | Leveraging Audio-Visual Data to Reduce the Multilingual Gap in Self-Supervised Speech Models | María Andrea Cruz Blandón et.al. | 2509.17523 | null |
| 2025-09-29 | Sidon: Fast and Robust Open-Source Multilingual Speech Restoration for Large-scale Dataset Cleansing | Wataru Nakata et.al. | 2509.17052 | link |
| 2025-09-20 | Idiosyncratic Versus Normative Modeling of Atypical Speech Recognition: Dysarthric Case Studies | Vishnu Raja et.al. | 2509.16718 | null |
| 2025-10-09 | Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing | Mengqi Wang et.al. | 2509.16622 | null |
| 2025-09-26 | GLip: A Global-Local Integrated Progressive Framework for Robust Visual Speech Recognition | Tianyue Wang et.al. | 2509.16031 | null |
| 2025-09-22 | Interpreting the Role of Visemes in Audio-Visual Speech Recognition | Aristeidis Papadopoulos et.al. | 2509.16023 | null |
| 2025-09-19 | VOX-KRIKRI: Unifying Speech and Language through Continuous Fusion | Dimitrios Damianos et.al. | 2509.15667 | null |
| 2025-09-19 | Layer-wise Minimal Pair Probing Reveals Contextual Grammatical-Conceptual Hierarchy in Speech Representations | Linyang He et.al. | 2509.15655 | null |
| 2025-09-19 | Thinking in cocktail party: Chain-of-Thought and reinforcement learning for target speaker automatic speech recognition | Yiru Zhang et.al. | 2509.15612 | null |
| 2025-09-19 | Chunk Based Speech Pre-training with High Resolution Finite Scalar Quantization | Yun Tang et.al. | 2509.15579 | null |
| 2025-09-19 | State-of-the-Art Dysarthric Speech Recognition with MetaICL for on-the-fly Personalization | Dhruuv Agarwal et.al. | 2509.15516 | null |
| 2025-09-18 | Impact of Phonetics on Speaker Identity in Adversarial Voice Attack | Daniyal Kabir Dar et.al. | 2509.15437 | null |
| 2025-09-18 | BiRQ: Bi-Level Self-Labeling Random Quantization for Self-Supervised Speech Recognition | Liuyuan Jiang et.al. | 2509.15430 | null |
| 2025-09-23 | Frustratingly Easy Data Augmentation for Low-Resource ASR | Katsumi Ibaraki et.al. | 2509.15373 | null |
| 2025-09-25 | Speech Language Models for Under-Represented Languages: Insights from Wolof | Yaya Sy et.al. | 2509.15362 | null |
| 2025-09-20 | Listening, Imagining & Refining: A Heuristic Optimized ASR Correction Framework with LLMs | Yutong Liu et.al. | 2509.15095 | null |
| 2025-09-18 | From Who Said What to Who They Are: Modular Training-free Identity-Aware LLM Refinement of Speaker Diarization | Yu-Wen Chen et.al. | 2509.15082 | null |
| 2025-09-19 | From Hype to Insight: Rethinking Large Language Model Integration in Visual Speech Recognition | Rishabh Jain et.al. | 2509.14880 | null |
| 2025-09-18 | UMA-Split: unimodal aggregation for both English and Mandarin non-autoregressive speech recognition | Ying Fang et.al. | 2509.14653 | null |
| 2025-09-17 | Multi-Channel Differential ASR for Robust Wearer Speech Recognition on Smart Glasses | Yufeng Yang et.al. | 2509.14430 | null |
| 2025-09-17 | CS-FLEURS: A Massively Multilingual and Code-Switched Speech Dataset | Brian Yan et.al. | 2509.14161 | null |
| 2025-09-25 | Canary-1B-v2 & Parakeet-TDT-0.6B-v3: Efficient and High-Performance Models for Multilingual ASR and AST | Monica Sekoyan et.al. | 2509.14128 | null |
| 2025-09-17 | Language Conditioning Improves Accuracy of Aircraft Goal Prediction in Untowered Airspace | Sundhar Vinodh Sangeetha et.al. | 2509.14063 | null |
| 2025-09-17 | Conducting Mission-Critical Voice Experiments with Automated Speech Recognition and Crowdsourcing | Jan Janak et.al. | 2509.13724 | null |
| 2025-09-09 | On the Contribution of Lexical Features to Speech Emotion Recognition | David Combei et.al. | 2509.05634 | null |
| 2025-07-23 | AI Telephone Surveying: Automating Quantitative Data Collection with an AI Interviewer | Danny D. Leybzon et.al. | 2507.17718 | null |
| 2025-07-23 | Synthetic Voice Data for Automatic Speech Recognition in African Languages | Brian DeRenzi et.al. | 2507.17578 | null |
| 2025-07-23 | BoSS: Beyond-Semantic Speech | Qing Wang et.al. | 2507.17563 | null |
| 2025-07-23 | Application of Whisper in Clinical Practice: the Post-Stroke Speech Assessment during a Naming Task | Milena Davudova et.al. | 2507.17326 | null |
| 2025-07-23 | Triple X: A LLM-Based Multilingual Speech Recognition System for the INTERSPEECH2025 MLC-SLM Challenge | Miaomiao Gao et.al. | 2507.17288 | null |
| 2025-07-20 | Weak Supervision Techniques towards Enhanced ASR Models in Industry-level CRM Systems | Zhongsheng Wang et.al. | 2507.16843 | null |
| 2025-07-15 | Towards Robust Speech Recognition for Jamaican Patois Music Transcription | Jordan Madden et.al. | 2507.16834 | null |
| 2025-07-22 | Step-Audio 2 Technical Report | Boyong Wu et.al. | 2507.16632 | null |
| 2025-07-22 | An approach to measuring the performance of Automatic Speech Recognition (ASR) models in the context of Large Language Model (LLM) powered applications | Sujith Pulikodan et.al. | 2507.16456 | null |
| 2025-07-21 | Beyond Rate Coding: Surrogate Gradients Enable Spike Timing Learning in Spiking Neural Networks | Ziqiao Yu et.al. | 2507.16043 | null |
| 2025-07-21 | Mixture to Beamformed Mixture: Leveraging Beamformed Mixture as Weak-Supervision for Speech Enhancement and Noise-Robust ASR | Zhong-Qiu Wang et.al. | 2507.15229 | null |
| 2025-07-21 | EchoVoices: Preserving Generational Voices and Memories for Seniors and Children | Haiying Xu et.al. | 2507.15221 | null |
| 2025-07-19 | Adapting Whisper for Lightweight and Efficient Automatic Speech Recognition of Children for On-device Edge Applications | Satwik Dutta et.al. | 2507.14451 | null |
| 2025-07-18 | Open Automatic Speech Recognition Models for Classical and Modern Standard Arabic | Lilit Grigoryan et.al. | 2507.13977 | null |
| 2025-07-18 | Optimizing ASR for Catalan-Spanish Code-Switching: A Comparative Analysis of Methodologies | Carlos Mena et.al. | 2507.13875 | null |
| 2025-07-17 | Reading Between the Lines: Combining Pause Dynamics and Semantic Coherence for Automated Assessment of Thought Disorder | Feng Chen et.al. | 2507.13551 | null |
| 2025-07-18 | Automatically assessing oral narratives of Afrikaans and isiXhosa children | Retief Louw et.al. | 2507.13205 | null |
| 2025-07-17 | NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech | Maksim Borisov et.al. | 2507.13155 | null |
| 2025-07-17 | UniSLU: Unified Spoken Language Understanding from Heterogeneous Cross-Task Datasets | Zhichao Sheng et.al. | 2507.12951 | null |
| 2025-07-17 | Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine | Anastasia Kuznetsova et.al. | 2507.12701 | null |
| 2025-07-16 | Improving Contextual ASR via Multi-grained Fusion with Large Language Models | Shilin Zhou et.al. | 2507.12252 | null |
| 2025-07-14 | WhisperKit: On-device Real-time ASR with Billion-Scale Transformers | Atila Orhon et.al. | 2507.10860 | null |
| 2025-07-20 | Supporting SENCOTEN Language Documentation Efforts with Automatic Speech Recognition | Mengzhe Geng et.al. | 2507.10827 | null |
| 2025-07-14 | DQLoRA: A Lightweight Domain-Aware Denoising ASR via Adapter-guided Distillation | Yiru Yang et.al. | 2507.10313 | null |
| 2025-07-13 | The DKU System for Multi-Speaker Automatic Speech Recognition in MLC-SLM Challenge | Yuke Lin et.al. | 2507.09499 | null |
| 2025-07-12 | Can We Really Repurpose Multi-Speaker ASR Corpus for Speaker Diarization? | Shota Horiguchi et.al. | 2507.09226 | null |
| 2025-07-22 | Mixture of LoRA Experts with Multi-Modal and Multi-Granularity LLM Generative Error Correction for Accented Speech Recognition | Bingshen Mu et.al. | 2507.09116 | null |
| 2025-07-06 | A Hybrid Machine Learning Framework for Optimizing Crop Selection via Agronomic and Economic Forecasting | Niranjan Mallikarjun Sindhur et.al. | 2507.08832 | null |
| 2025-07-11 | The Impact of Automatic Speech Transcription on Speaker Attribution | Cristina Aggazzotti et.al. | 2507.08660 | null |
| 2025-07-11 | ILT-Iterative LoRA Training through Focus-Feedback-Fix for Multilingual Speech Recognition | Qingliang Meng et.al. | 2507.08477 | null |
| 2025-07-10 | DARAS: Dynamic Audio-Room Acoustic Synthesis for Blind Room Impulse Response Estimation | Chunxi Wang et.al. | 2507.08135 | null |
| 2025-07-10 | Modèle physique variationnel pour l'estimation de réponses impulsionnelles de salles | Louis Lalay et.al. | 2507.08051 | null |
| 2025-07-10 | Edge-ASR: Towards Low-Bit Quantization of Automatic Speech Recognition Models | Chen Feng et.al. | 2507.07877 | null |
| 2025-07-10 | Code-Switching in End-to-End Automatic Speech Recognition: A Systematic Literature Review | Maha Tufail Agro et.al. | 2507.07741 | null |
| 2025-07-08 | Deep Feed-Forward Neural Network for Bangla Isolated Speech Recognition | Dipayan Bhadra et.al. | 2507.07068 | null |
| 2025-07-04 | Pronunciation-Lexicon Free Training for Phoneme-based Crosslingual ASR via Joint Stochastic Approximation | Saierdaer Yusuyin et.al. | 2507.06249 | null |
| 2025-07-21 | VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis | Alexandre Symeonidis-Herzig et.al. | 2507.06060 | null |
| 2025-07-08 | How to Evaluate Automatic Speech Recognition: Comparing Different Performance and Bias Measures | Tanvina Patel et.al. | 2507.05885 | null |
| 2025-07-08 | ContextASR-Bench: A Massive Contextual Speech Recognition Benchmark | He Wang et.al. | 2507.05727 | null |
| 2025-11-06 | Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition | Zijin Gu et.al. | 2507.05724 | null |
| 2025-07-07 | Adaptive Slimming for Scalable and Efficient Speech Enhancement | Riccardo Miccini et.al. | 2507.04879 | null |
| 2025-07-08 | SHNU Multilingual Conversational Speech Recognition System for INTERSPEECH 2025 MLC-SLM Challenge | Yuxiang Mei et.al. | 2507.03343 | null |
| 2025-06-26 | A Unified Speech LLM for Diarization and Speech Recognition in Multilingual Conversations | Phurich Saengthong et.al. | 2507.02927 | null |
| 2025-07-03 | Open-Source System for Multilingual Translation and Cloned Speech Synthesis | Mateo Cámara et.al. | 2507.02530 | null |
| 2025-07-03 | A Cookbook for Community-driven Data Collection of Impaired Speech in Low-Resource Languages | Sumaya Ahmed Salihs et.al. | 2507.02428 | null |
| 2025-07-03 | Benchmarking Akan ASR Models Across Domain-Specific Datasets: A Comparative Evaluation of Performance, Scalability, and Adaptability | Mark Atta Mensah et.al. | 2507.02407 | null |
| 2025-07-02 | Adaptability of ASR Models on Low-Resource Language: A Comparative Study of Whisper and Wav2Vec-BERT on Bangla | Md Sazzadul Islam Ridoy et.al. | 2507.01931 | null |
| 2025-07-02 | First Steps Towards Voice Anonymization for Code-Switching Speech | Sarina Meyer et.al. | 2507.01765 | null |
| 2025-07-02 | PERTINENCE: Input-based Opportunistic Neural Network Dynamic Execution | Omkar Shende et.al. | 2507.01695 | null |
| 2025-07-02 | Learning from Random Subspace Exploration: Generalized Test-Time Augmentation with Self-supervised Distillation | Andrei Jelea et.al. | 2507.01347 | null |
| 2025-07-02 | AI Meets Maritime Training: Precision Analytics for Enhanced Safety and Performance | Vishakha Lall et.al. | 2507.01274 | null |
| 2025-06-16 | Hello Afrika: Speech Commands in Kinyarwanda | George Igwegbe et.al. | 2507.01024 | null |
| 2025-07-01 | MambAttention: Mamba with Multi-Head Attention for Generalizable Single-Channel Speech Enhancement | Nikolai Lund Kühne et.al. | 2507.00966 | null |
| 2025-07-01 | Rectifying Magnitude Neglect in Linear Attention | Qihang Fan et.al. | 2507.00698 | null |
| 2025-07-01 | Audio-3DVG: Unified Audio - Point Cloud Fusion for 3D Visual Grounding | Duc Cao-Dinh et.al. | 2507.00669 | null |
| 2025-06-29 | Research on Comprehensive Classroom Evaluation System Based on Multiple AI Models | Cong Xie et.al. | 2506.23079 | null |
| 2025-06-28 | Mind the Gap: Entity-Preserved Context-Aware ASR Structured Transcriptions | Duygu Altinok et.al. | 2506.22858 | null |
| 2025-06-28 | Boosting CTC-Based ASR Using LLM-Based Intermediate Loss Regularization | Duygu Altinok et.al. | 2506.22846 | null |
| 2025-06-28 | A Self-Training Approach for Whisper to Enhance Long Dysarthric Speech Recognition | Shiyao Wang et.al. | 2506.22810 | null |
| 2025-06-27 | Speaker Targeting via Self-Speaker Adaptation for Multi-talker ASR | Weiqing Wang et.al. | 2506.22646 | null |
| 2025-06-27 | Cross-lingual Data Selection Using Clip-level Acoustic Similarity for Enhancing Low-resource Automatic Speech Recognition | Shunsuke Mitsumori et.al. | 2506.22194 | null |
| 2025-06-27 | SAGE: Spliced-Audio Generated Data for Enhancing Foundational Models in Low-Resource Arabic-English Code-Switched Speech Recognition | Muhammad Umar Farooq et.al. | 2506.22143 | null |
| 2025-06-27 | Analyzing and Fine-Tuning Whisper Models for Multilingual Pilot Speech Transcription in the Cockpit | Kartheek Kumar Reddy Nareddy et.al. | 2506.21990 | null |
| 2025-06-23 | Adapting Foundation Speech Recognition Models to Impaired Speech: A Semantic Re-chaining Approach for Personalization of German Speech | Niclas Pokel et.al. | 2506.21622 | null |
| 2025-06-16 | Language-Aware Prompt Tuning for Parameter-Efficient Seamless Language Expansion in Multilingual ASR | Hongli Yang et.al. | 2506.21577 | null |
| 2025-06-16 | Adapting Whisper for Parameter-efficient Code-Switching Speech Recognition via Soft Prompt Tuning | Hongli Yang et.al. | 2506.21576 | null |
| 2025-06-12 | FormosanBench: Benchmarking Low-Resource Austronesian Languages in the Era of Large Language Models | Kaiying Kevin Lin et.al. | 2506.21563 | null |
| 2025-06-11 | Efficient Multilingual ASR Finetuning via LoRA Language Experts | Jiahong Li et.al. | 2506.21555 | null |
| 2025-06-25 | Multimodal Representation Learning and Fusion | Qihang Jin et.al. | 2506.20494 | null |
| 2025-06-25 | Lightweight Target-Speaker-Based Overlap Transcription for Practical Streaming ASR | Aleš Pražák et.al. | 2506.20288 | null |
| 2025-06-24 | Accurate, fast, cheap: Choose three. Replacing Multi-Head-Attention with Bidirectional Recurrent Attention for Long-Form ASR | Martin Ratajczak et.al. | 2506.19761 | null |
| 2025-06-23 | Context Biasing for Pronunciations-Orthography Mismatch in Automatic Speech Recognition | Christian Huber et.al. | 2506.18703 | null |
| 2025-06-23 | Evaluating Multichannel Speech Enhancement Algorithms at the Phoneme Scale Across Genders | Nasser-Eddine Monir et.al. | 2506.18691 | null |
| 2025-06-23 | End-to-End Spoken Grammatical Error Correction | Mengjie Qian et.al. | 2506.18532 | null |
| 2025-06-28 | AI-Generated Song Detection via Lyrics Transcripts | Markus Frohmann et.al. | 2506.18488 | null |
| 2025-06-22 | Splitformer: An improved early-exit architecture for automatic speech recognition on edge devices | Maxence Lasbordes et.al. | 2506.18035 | null |
| 2025-06-21 | OpusLM: A Family of Open Unified Speech Language Models | Jinchuan Tian et.al. | 2506.17611 | null |
| 2025-06-27 | Data Quality Issues in Multilingual Speech Datasets: The Need for Sociolinguistic Awareness and Proactive Language Planning | Mingfei Lau et.al. | 2506.17525 | null |
| 2025-06-20 | Breaking the Transcription Bottleneck: Fine-tuning ASR Models for Extremely Low-Resource Fieldwork Languages | Siyu Liang et.al. | 2506.17459 | null |
| 2025-06-20 | Simultaneous Translation with Offline Speech and LLM Models in CUNI Submission to IWSLT 2025 | Dominik Macháček et.al. | 2506.17077 | link |
| 2025-06-20 | Instituto de Telecomunicações at IWSLT 2025: Aligning Small-Scale Speech and Language Models for Speech-to-Text Learning | Giuseppe Attanasio et.al. | 2506.17019 | link |
| 2025-06-27 | State-Space Models in Efficient Whispered and Multi-dialect Speech Recognition | Aref Farhadipour et.al. | 2506.16969 | null |
| 2025-06-20 | LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization | Daejin Jo et.al. | 2506.16738 | null |
| 2025-06-19 | Weight Factorization and Centralization for Continual Learning in Speech Recognition | Enes Yavuz Ugan et.al. | 2506.16574 | null |
| 2025-06-19 | Automatic Speech Recognition Biases in Newcastle English: an Error Analysis | Dana Serditova et.al. | 2506.16558 | null |
| 2025-06-18 | Exploiting Music Source Separation for Automatic Lyrics Transcription with Whisper | Jaza Syed et.al. | 2506.15514 | null |
| 2025-06-18 | Foundation of Affective Computing and Interaction | Changzeng Fu et.al. | 2506.15497 | null |
| 2025-06-17 | Thinking in Directivity: Speech Large Language Model for Multi-Talker Directional Speech Recognition | Jiamin Xie et.al. | 2506.14973 | null |
| 2025-06-17 | Unifying Streaming and Non-streaming Zipformer-based ASR | Bidisha Sharma et.al. | 2506.14434 | null |
| 2025-06-17 | Improving Practical Aspects of End-to-End Multi-Talker Speech Recognition for Online and Offline Scenarios | Aswin Shanmugam Subramanian et.al. | 2506.14204 | null |
| 2025-06-17 | AsyncSwitch: Asynchronous Text-Speech Adaptation for Code-Switched ASR | Tuan Nguyen et.al. | 2506.14190 | null |
| 2025-06-16 | A Silent Speech Decoding System from EEG and EMG with Heterogenous Electrode Configurations | Masakazu Inoue et.al. | 2506.13835 | null |
| 2025-07-07 | Qwen vs. Gemma Integration with Whisper: A Comparative Study in Multilingual SpeechLLM Systems | Tuan Nguyen et.al. | 2506.13596 | null |
| 2025-06-16 | BUT System for the MLC-SLM Challenge | Alexander Polok et.al. | 2506.13414 | null |
| 2025-07-04 | Bi-directional Context-Enhanced Speech Large Language Models for Multilingual Conversational ASR | Yizhou Peng et.al. | 2506.13396 | null |
| 2025-07-04 | NTU Speechlab LLM-Based Multilingual ASR System for Interspeech MLC-SLM Challenge 2025 | Yizhou Peng et.al. | 2506.13339 | null |
| 2025-06-18 | Seewo's Submission to MLC-SLM: Lessons learned from Speech Reasoning Language Models | Bo Li et.al. | 2506.13300 | null |
| 2025-06-15 | SC-SOT: Conditioning the Decoder on Diarized Speaker Information for End-to-End Overlapped Speech Recognition | Yuta Hirano et.al. | 2506.12672 | null |
| 2025-06-13 | Adapting Whisper for Streaming Speech Recognition via Two-Pass Decoding | Haoran Zhou et.al. | 2506.12154 | null |
| 2025-05-31 | CMT-LLM: Contextual Multi-Talker ASR Utilizing Large Language Models | Jiajun He et.al. | 2506.12059 | null |
| 2025-06-13 | Enabling automatic transcription of child-centered audio recordings from real-world environments | Daniil Kocharov et.al. | 2506.11747 | null |
| 2025-06-13 | Lightweight and Robust Multi-Channel End-to-End Speech Recognition with Spherical Harmonic Transform | Xiangzhu Kong et.al. | 2506.11630 | null |
| 2025-06-13 | (SimPhon Speech Test): A Data-Driven Method for In Silico Design and Validation of a Phonetically Balanced Speech Test | Stefan Bleeck et.al. | 2506.11620 | null |
| 2025-06-13 | Machine Unlearning for Robust DNNs: Attribution-Guided Partitioning and Neuron Pruning in Noisy Environments | Deliang Jin et.al. | 2506.11615 | null |
| 2025-06-12 | Advances in Small-Footprint Keyword Spotting: A Comprehensive Review of Efficient Models and Algorithms | Soumen Garai et.al. | 2506.11169 | link |
| 2025-06-10 | ASRJam: Human-Friendly AI Speech Jamming to Prevent Automated Phone Scams | Freddie Grabovski et.al. | 2506.11125 | null |
| 2025-06-09 | Benchmarking Foundation Speech and Language Models for Alzheimer's Disease and Related Dementia Detection from Spontaneous Speech | Jingyu Li et.al. | 2506.11119 | null |
| 2025-06-05 | Customizing Speech Recognition Model with Large Language Model Feedback | Shaoshi Ling et.al. | 2506.11091 | null |
| 2025-06-05 | Better Pseudo-labeling with Multi-ASR Fusion and Error Correction by SpeechLLM | Jeena Prakash et.al. | 2506.11089 | null |
| 2025-06-04 | Improving Child Speech Recognition and Reading Mistake Detection by Using Prompts | Lingyun Gao et.al. | 2506.11079 | null |
| 2025-06-02 | Regularized Federated Learning for Privacy-Preserving Dysarthric and Elderly Speech Recognition | Tao Zhong et.al. | 2506.11069 | null |
| 2025-05-31 | PMF-CEC: Phoneme-augmented Multimodal Fusion for Context-aware ASR Error Correction with Error-specific Selective Decoding | Jiajun He et.al. | 2506.11064 | null |
| 2025-06-12 | Improving Named Entity Transcription with Contextual LLM-based Revision | Viet Anh Trinh et.al. | 2506.10779 | null |
| 2025-06-12 | FairASR: Fair Audio Contrastive Learning for Automatic Speech Recognition | Jongsuk Kim et.al. | 2506.10747 | null |
| 2025-06-12 | Joint ASR and Speaker Role Tagging with Serialized Output Training | Anfeng Xu et.al. | 2506.10349 | null |
| 2025-06-11 | Regularizing Learnable Feature Extraction for Automatic Speech Recognition | Peter Vieting et.al. | 2506.09804 | null |
| 2025-06-11 | OWSM-Biasing: Contextualizing Open Whisper-Style Speech Models for Automatic Speech Recognition with Dynamic Vocabulary | Yui Sudo et.al. | 2506.09448 | null |
| 2025-06-10 | SimClass: A Classroom Speech Dataset Generated via Game Engine Simulation For Automatic Speech Recognition Research | Ahmed Adel Attia et.al. | 2506.09206 | null |
| 2025-07-11 | Addressing Pitfalls in Auditing Practices of Automatic Speech Recognition Technologies: A Case Study of People with Aphasia | Katelyn Xiaoying Mei et.al. | 2506.08846 | link |
| 2025-06-09 | Uncovering the Functional Roles of Nonlinearity in Memory | Manuel Brenner et.al. | 2506.07919 | null |
| 2025-06-09 | Unified Semi-Supervised Pipeline for Automatic Speech Recognition | Nune Tadevosyan et.al. | 2506.07659 | null |
| 2025-06-09 | Transcript-Prompted Whisper with Dictionary-Enhanced Decoding for Japanese Speech Annotation | Rui Hu et.al. | 2506.07646 | null |
| 2025-06-09 | Speaker-Distinguishable CTC: Learning Speaker Distinction Using CTC for Multi-Talker Speech Recognition | Asahi Sakuma et.al. | 2506.07515 | null |
| 2025-06-09 | DeRAGEC: Denoising Named Entity Candidates with Synthetic Rationale for ASR Error Correction | Solee Im et.al. | 2506.07510 | null |
| 2025-06-11 | Towards Energy-Efficient and Low-Latency Voice-Controlled Smart Homes: A Proposal for Offline Speech Recognition and IoT Integration | Peng Huang et.al. | 2506.07494 | null |
| 2025-06-08 | Speech Recognition on TV Series with Video-guided Post-Correction | Haoyuan Yang et.al. | 2506.07323 | null |
| 2025-06-08 | Technical Report: A Practical Guide to Kaldi ASR Optimization | Mengze Hong et.al. | 2506.07149 | null |
| 2025-06-07 | Automatic Speech Recognition of African American English: Lexical and Contextual Effects | Hamid Mojarad et.al. | 2506.06888 | null |
| 2025-06-07 | Beyond Classification: Towards Speech Emotion Reasoning with Multitask AudioLLMs | Wenyu Zhang et.al. | 2506.06820 | null |
| 2025-06-07 | A Survey of Retentive Network | Haiqi Yang et.al. | 2506.06708 | null |
| 2025-06-06 | AS-ASR: A Lightweight Framework for Aphasia-Specific Automatic Speech Recognition | Chen Bao et.al. | 2506.06566 | null |
| 2025-06-13 | Structured State Space Model Dynamics and Parametrization for Spiking Neural Networks | Maxime Fabre et.al. | 2506.06374 | link |
| 2025-06-06 | Lightweight Prompt Biasing for Contextualized End-to-End ASR Systems | Bo Ren et.al. | 2506.06252 | null |
| 2025-06-06 | Phonetically-Augmented Discriminative Rescoring for Voice Search Error Correction | Christophe Van Gysel et.al. | 2506.06117 | null |
| 2025-06-06 | Diarization-Aware Multi-Speaker Automatic Speech Recognition via Large Language Models | Yuke Lin et.al. | 2506.05796 | null |
| 2025-06-06 | Bridging the Modality Gap: Softly Discretizing Audio Representation for LLM-based Automatic Speech Recognition | Mu Yang et.al. | 2506.05706 | null |
| 2025-06-06 | Low-Resource Domain Adaptation for Speech LLMs via Text-Only Fine-Tuning | Yangui Fang et.al. | 2506.05671 | null |
| 2025-06-03 | Auto Review: Second Stage Error Detection for Highly Accurate Information Extraction from Phone Conversations | Ayesha Qamar et.al. | 2506.05400 | null |
| 2025-06-05 | LLM-based phoneme-to-grapheme for phoneme-based speech recognition | Te Ma et.al. | 2506.04711 | null |
| 2025-06-05 | ViCocktail: Automated Multi-Modal Data Collection for Vietnamese Audio-Visual Speech Recognition | Thai-Binh Nguyen et.al. | 2506.04635 | null |
| 2025-06-05 | LESS: Large Language Model Enhanced Semi-Supervised Learning for Speech Foundational Models | Wen Ding et.al. | 2506.04586 | null |
| 2025-06-04 | Effects of Speaker Count, Duration, and Accent Diversity on Zero-Shot Accent Robustness in Low-Resource ASR | Zheng-Xin Yong et.al. | 2506.04364 | null |
| 2025-06-04 | MFLA: Monotonic Finite Look-ahead Attention for Streaming Speech Recognition | Yinfeng Xia et.al. | 2506.03722 | null |
| 2025-06-03 | A Multi-Dialectal Dataset for German Dialect ASR and Dialect-to-Standard Speech Translation | Verena Blaschke et.al. | 2506.02894 | null |
| 2025-06-03 | Overcoming Data Scarcity in Multi-Dialectal Arabic ASR via Whisper Fine-Tuning | Ömer Tarik Özyilmaz et.al. | 2506.02627 | null |
| 2025-06-03 | On the Language and Gender Biases in PSTN, VoIP and Neural Audio Codecs | Kemal Altwlkany et.al. | 2506.02545 | null |
| 2025-06-03 | SOVA-Bench: Benchmarking the Speech Conversation Ability for LLM-based Voice Assistant | Yixuan Hou et.al. | 2506.02457 | null |
| 2025-06-03 | Enhancing Lyrics Transcription on Music Mixtures with Consistency Loss | Jiawen Huang et.al. | 2506.02339 | null |
| 2025-06-02 | Cocktail-Party Audio-Visual Speech Recognition | Thai-Binh Nguyen et.al. | 2506.02178 | null |
| 2025-06-02 | HENT-SRT: Hierarchical Efficient Neural Transducer with Self-Distillation for Joint Speech Recognition and Translation | Amir Hussein et.al. | 2506.02157 | null |
| 2025-06-01 | Enhancing Speech Instruction Understanding and Disambiguation in Robotics via Speech Prosody | David Sasu et.al. | 2506.02057 | null |
| 2025-05-31 | No Audiogram: Leveraging Existing Scores for Personalized Speech Intelligibility Prediction | Haoshuai Zhou et.al. | 2506.02039 | null |
| 2025-05-27 | Leveraging Large Language Models in Visual Speech Recognition: Model Scaling, Context-Aware Decoding, and Iterative Polishing | Zehua Liu et.al. | 2506.02012 | null |
| 2025-05-27 | CNVSRC 2024: The Second Chinese Continuous Visual Speech Recognition Challenge | Zehua Liu et.al. | 2506.02010 | null |
| 2025-06-02 | DNCASR: End-to-End Training for Speaker-Attributed ASR | Xianrui Zheng et.al. | 2506.01916 | null |
| 2025-06-02 | Reasoning-Based Approach with Chain-of-Thought for Alzheimer's Detection Using Speech and Large Language Models | Chanwoo Park et.al. | 2506.01683 | null |
| 2025-06-02 | Self-Supervised Speech Quality Assessment (S3QA): Leveraging Speech Foundation Models for a Scalable Speech Quality Metric | Mattson Ogg et.al. | 2506.01655 | null |
| 2025-06-02 | Riemannian Time Warping: Multiple Sequence Alignment in Curved Spaces | Julian Richter et.al. | 2506.01635 | null |
| 2025-06-02 | Unsupervised Rhythm and Voice Conversion to Improve ASR on Dysarthric Speech | Karl El Hajal et.al. | 2506.01618 | null |
| 2025-06-02 | Analyzing the Importance of Blank for CTC-Based Knowledge Distillation | Benedikt Hilmes et.al. | 2506.01503 | null |
| 2025-06-02 | TalTech Systems for the Interspeech 2025 ML-SUPERB 2.0 Challenge | Tanel Alumäe et.al. | 2506.01458 | null |
| 2025-06-02 | Whale: Large-Scale multilingual ASR model with w2v-BERT and E-Branchformer with large speech data | Yosuke Kashiwagi et.al. | 2506.01439 | null |
| 2025-06-02 | Speech-to-Speech Translation Pipelines for Conversations in Low-Resource Languages | Andrei Popescu-Belis et.al. | 2506.01406 | null |
| 2025-06-02 | CleanS2S: Single-file Framework for Proactive Speech-to-Speech Interaction | Yudong Lu et.al. | 2506.01268 | null |
| 2025-06-02 | WCTC-Biasing: Retraining-free Contextual Biasing ASR with Wildcard CTC-based Keyword Spotting and Inter-layer Biasing | Yu Nakagome et.al. | 2506.01263 | null |
| 2025-06-01 | GigaAM: Efficient Self-Supervised Learner for Speech Recognition | Aleksandr Kutsakov et.al. | 2506.01192 | link |
| 2025-06-01 | What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training | Marianne de Heer Kloots et.al. | 2506.00981 | link |
| 2025-06-01 | Fine-Tuning ASR for Stuttered Speech: Personalized vs. Generalized Approaches | Dena Mujtaba et.al. | 2506.00853 | null |
| 2025-05-31 | Chain-of-Thought Training for Open E2E Spoken Dialogue Systems | Siddhant Arora et.al. | 2506.00722 | null |
| 2025-05-31 | Towards Temporally Explainable Dysarthric Speech Clarity Assessment | Seohyun Park et.al. | 2506.00454 | link |
| 2025-05-31 | DYNAC: Dynamic Vocabulary based Non-Autoregressive Contextualization for Speech Recognition | Yui Sudo et.al. | 2506.00422 | null |
| 2025-05-31 | Causal Structure Discovery for Error Diagnostics of Children's ASR | Vishwanath Pratap Singh et.al. | 2506.00402 | null |
| 2025-05-30 | Can LLMs Understand Unvoiced Speech? Exploring EMG-to-Text Conversion with LLMs | Payal Mohapatra et.al. | 2506.00304 | null |
| 2025-05-30 | Vedavani: A Benchmark Corpus for ASR on Vedic Sanskrit Poetry | Sujeet Kumar et.al. | 2506.00145 | null |
| 2025-05-30 | SwitchLingua: The First Large-Scale Multilingual and Multi-Ethnic Code-Switching Dataset | Peng Xie et.al. | 2506.00087 | null |
| 2025-05-30 | Running Conventional Automatic Speech Recognition on Memristor Hardware: A Simulated Approach | Nick Rossenbach et.al. | 2505.24721 | null |
| 2025-06-02 | MSDA: Combining Pseudo-labeling and Self-Supervision for Unsupervised Domain Adaptation in ASR | Dimitrios Damianos et.al. | 2505.24656 | null |
| 2025-05-30 | SuPseudo: A Pseudo-supervised Learning Method for Neural Speech Enhancement in Far-field Speech Recognition | Longjie Luo et.al. | 2505.24450 | null |
| 2025-05-30 | Pseudo Labels-based Neural Speech Enhancement for the AVSR Task in the MISP-Meeting Challenge | Longjie Luo et.al. | 2505.24446 | null |
| 2025-06-05 | Fewer Hallucinations, More Verification: A Three-Stage LLM-Based Framework for ASR Error Correction | Yangui Fang et.al. | 2505.24347 | null |
| 2025-05-30 | Dynamic Context-Aware Streaming Pretrained Language Model For Inverse Text Normalization | Luong Ho et.al. | 2505.24229 | null |
| 2025-05-30 | MOPSA: Mixture of Prompt-Experts Based Speaker Adaptation for Elderly Speech Recognition | Chengxi Deng et.al. | 2505.24224 | null |
| 2025-06-03 | Improving Multilingual Speech Models on ML-SUPERB 2.0: Fine-tuning with Data Augmentation and LID-Aware CTC | Qingzheng Wang et.al. | 2505.24200 | null |
| 2025-05-29 | BeaverTalk: Oregon State University's IWSLT 2025 Simultaneous Speech Translation System | Matthew Raffel et.al. | 2505.24016 | link |
| 2025-05-29 | Prompting Whisper for Improved Verbatim Transcription and End-to-end Miscue Detection | Griffin Dietz Smith et.al. | 2505.23627 | null |
| 2025-05-29 | Contextualized Automatic Speech Recognition with Dynamic Vocabulary Prediction and Activation | Zhennan Lin et.al. | 2505.23077 | null |
| 2025-05-29 | AISHELL-5: The First Open-Source In-Car Multi-Channel Multi-Speaker Speech Dataset for Automatic Speech Diarization and Recognition | Yuhang Dai et.al. | 2505.23036 | link |
| 2025-05-28 | NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding | Vladimir Bataev et.al. | 2505.22857 | null |
| 2025-06-05 | Evaluation of LLMs in Speech is Often Flawed: Test Set Contamination in Large Language Models for Speech Recognition | Yuan Tseng et.al. | 2505.22251 | null |
| 2025-05-28 | Advancing Hearing Assessment: An ASR-Based Frequency-Specific Speech Test for Diagnosing Presbycusis | Stefan Bleeck et.al. | 2505.22231 | null |
| 2025-05-28 | On-the-fly Routing for Zero-shot MoE Speaker Adaptation of Speech Foundation Models for Dysarthric Speech Recognition | Shujie HU et.al. | 2505.22072 | null |
| 2025-05-28 | Weakly Supervised Data Refinement and Flexible Sequence Compression for Efficient Thai LLM-based ASR | Mingchen Shao et.al. | 2505.22063 | null |
| 2025-05-28 | Overlap-Adaptive Hybrid Speaker Diarization and ASR-Aware Observation Addition for MISP 2025 Challenge | Shangkun Huang et.al. | 2505.22013 | null |
| 2025-05-28 | Leveraging LLM for Stuttering Speech: A Unified Architecture Bridging Recognition and Event Detection | Shangkun Huang et.al. | 2505.22005 | null |
| 2025-05-27 | GMU Systems for the IWSLT 2025 Low-Resource Speech Translation Shared Task | Chutong Meng et.al. | 2505.21781 | null |
| 2025-05-27 | Loquacious Set: 25,000 Hours of Transcribed and Diverse English Speech Recognition Data for Research and Commercial Use | Titouan Parcollet et.al. | 2505.21578 | null |
| 2025-05-25 | WhisperD: Dementia Speech Recognition and Filler Word Detection with Whisper | Emmanuel Akinrintoyo et.al. | 2505.21551 | null |
| 2025-05-29 | VietASR: Achieving Industry-level Vietnamese ASR with 50-hour labeled data and Large-Scale Speech Pretraining | Jianheng Zhuo et.al. | 2505.21527 | null |
| 2025-05-27 | Towards One-bit ASR: Extremely Low-bit Conformer Quantization Using Co-training and Stochastic Precision | Zhaoqing Li et.al. | 2505.21245 | null |
| 2025-05-27 | PSRB: A Comprehensive Benchmark for Evaluating Persian ASR Systems | Nima Sedghiyeh et.al. | 2505.21230 | null |
| 2025-05-27 | Topological Deep Learning for Speech Data | Zhiwang Yu et.al. | 2505.21173 | null |
| 2025-05-27 | Leveraging LLM and Self-Supervised Training Models for Speech Recognition in Chinese Dialects: A Comparative Analysis | Tianyi Xu et.al. | 2505.21138 | null |
| 2025-05-27 | Towards Pretraining Robust ASR Foundation Model with Acoustic-Aware Data Augmentation | Dancheng Liu et.al. | 2505.20606 | null |
| 2025-05-30 | The NaijaVoices Dataset: Cultivating Large-Scale, High-Quality, Culturally-Rich Speech Data for African Languages | Chris Emezue et.al. | 2505.20564 | null |
| 2025-05-26 | Robust fine-tuning of speech recognition models via model merging: application to disordered speech | Alexandre Ducorroy et.al. | 2505.20477 | null |
| 2025-06-05 | In-context Language Learning for Endangered Languages in Speech Recognition | Zhaolin Li et.al. | 2505.20445 | null |
| 2025-05-26 | Continuous Learning for Children's ASR: Overcoming Catastrophic Forgetting with Elastic Weight Consolidation and Synaptic Intelligence | Edem Ahadzi et.al. | 2505.20216 | null |
| 2025-05-26 | Exploring Generative Error Correction for Dysarthric Speech Recognition | Moreno La Quatra et.al. | 2505.20163 | link |
| 2025-05-26 | Mixture of LoRA Experts for Low-Resourced Multi-Accent Automatic Speech Recognition | Raphaël Bagat et.al. | 2505.20006 | null |
| 2025-05-26 | Novel Loss-Enhanced Universal Adversarial Patches for Sustainable Speaker Privacy | Elvir Karimov et.al. | 2505.19951 | null |
| 2025-05-26 | KIT's Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization | Zhaolin Li et.al. | 2505.19679 | null |
| 2025-05-26 | Languages in Multilingual Speech Foundation Models Align Both Phonetically and Semantically | Ryan Soh-Eun Shim et.al. | 2505.19606 | null |
| 2025-05-26 | Beyond Manual Transcripts: The Potential of Automated Speech Recognition Errors in Improving Alzheimer's Disease Detection | Yin-Long Liu et.al. | 2505.19448 | null |
| 2025-05-25 | BR-ASR: Efficient and Scalable Bias Retrieval Framework for Contextual Biasing ASR in Speech LLM | Xun Gong et.al. | 2505.19179 | null |
| 2025-05-24 | Building a Functional Machine Translation Corpus for Kpelle | Kweku Andoh Yamoah et.al. | 2505.18905 | null |
| 2025-05-24 | StandUp4AI: A New Multilingual Dataset for Humor Detection in Stand-up Comedy Videos | Valentin Barriere et.al. | 2505.18903 | null |
| 2025-05-24 | CHSER: A Dataset and Case Study on Generative Speech Error Correction for Child ASR | Natarajan Balaji Shankar et.al. | 2505.18463 | link |
| 2025-05-23 | Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities | Ziwei Zhou et.al. | 2505.17862 | link |
| 2025-05-27 | CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training | Zhihao Du et.al. | 2505.17589 | null |
| 2025-05-23 | Swedish Whispers; Leveraging a Massive Speech Corpus for Swedish Speech Recognition | Leonora Vesterbacka et.al. | 2505.17538 | null |
| 2025-05-23 | Speechless: Speech Instruction Training Without Speech for Low Resource Languages | Alan Dao et.al. | 2505.17417 | link |
| 2025-05-23 | LLM-based Generative Error Correction for Rare Words with Synthetic Data and Phonetic Context | Natsuo Yamashita et.al. | 2505.17410 | link |
| 2025-06-02 | An End-to-End Approach for Child Reading Assessment in the Xhosa Language | Sergio Chevtchenko et.al. | 2505.17371 | null |
| 2025-05-20 | From Weak Labels to Strong Results: Utilizing 5,000 Hours of Noisy Classroom Transcripts with Minimal Accurate Data | Ahmed Adel Attia et.al. | 2505.17088 | null |
| 2025-05-30 | Impact of Frame Rates on Speech Tokenizer: A Case Study on Mandarin and English | Haoyang Zhang et.al. | 2505.17076 | null |
| 2025-05-28 | An Effective Training Framework for Light-Weight Automatic Speech Recognition Models | Abdul Hannan et.al. | 2505.16991 | null |
| 2025-05-22 | From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition | Tianduo Wang et.al. | 2505.16972 | link |
| 2025-05-22 | SoccerChat: Integrating Multimodal Data for Enhanced Soccer Game Understanding | Sushant Gautam et.al. | 2505.16630 | null |
| 2025-05-27 | X-ARES: A Comprehensive Framework for Assessing Audio Encoder Performance | Junbo Zhang et.al. | 2505.16369 | link |
| 2025-05-24 | Large Language Models based ASR Error Correction for Child Conversations | Anfeng Xu et.al. | 2505.16212 | null |
| 2025-05-22 | Differentiable K-means for Fully-optimized Discrete Token-based ASR | Kentaro Onda et.al. | 2505.16207 | null |
| 2025-05-22 | Prosodically Enhanced Foreign Accent Simulation by Discrete Token-based Resynthesis Only with Native Speech Corpora | Kentaro Onda et.al. | 2505.16191 | null |
| 2025-05-22 | Selective Invocation for Multilingual ASR: A Cost-effective Approach Adapting to Speech Recognition Difficulty | Hongfei Xue et.al. | 2505.16168 | null |
| 2025-05-21 | Word Level Timestamp Generation for Automatic Speech Recognition and Translation | Ke Hu et.al. | 2505.15646 | link |
| 2025-05-20 | In-Context Learning Boosts Speech Recognition via Human-like Adaptation to Speakers and Language Varieties | Nathan Roll et.al. | 2505.14887 | null |
| 2025-05-30 | Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages | Chin-Jou Li et.al. | 2505.14874 | link |
| 2025-05-20 | Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits | Tiantian Feng et.al. | 2505.14648 | link |
| 2025-05-20 | Dual Precision Quantization for Efficient and Accurate Deep Neural Networks Inference | Tomer Gafni et.al. | 2505.14638 | link |
| 2025-05-20 | PersonaTAB: Predicting Personality Traits using Textual, Acoustic, and Behavioral Cues in Fully-Duplex Speech Dialogs | Sho Inoue et.al. | 2505.14356 | link |
| 2025-05-21 | Scaling and Enhancing LLM-based AVSR: A Sparse Mixture of Projectors Approach | Umberto Cappellazzo et.al. | 2505.14336 | null |
| 2025-05-23 | HausaNLP: Current Status, Challenges and Future Directions for Hausa Natural Language Processing | Shamsuddeen Hassan Muhammad et.al. | 2505.14311 | null |
| 2025-05-27 | The Multimodal Information Based Speech Processing (MISP) 2025 Challenge: Audio-Visual Diarization and Recognition | Ming Gao et.al. | 2505.13971 | null |
| 2025-08-12 | Transfer Learning from Visual Speech Recognition to Mouthing Recognition in German Sign Language | Dinh Nam Pham et.al. | 2505.13784 | null |
| 2025-05-21 | Multi-head Temporal Latent Attention | Keqi Deng et.al. | 2505.13544 | link |
| 2025-05-21 | Granary: Speech Recognition and Translation Dataset in 25 European Languages | Nithin Rao Koluguri et.al. | 2505.13404 | null |
| 2025-05-19 | Cross-modal Knowledge Transfer Learning as Graph Matching Based on Optimal Transport for ASR | Xugang Lu et.al. | 2505.13079 | null |
| 2025-05-19 | KIT's Offline Speech Translation and Instruction Following Submission for IWSLT 2025 | Sai Koneru et.al. | 2505.13036 | null |
| 2025-05-19 | Personalized Fine-Tuning with Controllable Synthetic Speech from LLM-Generated Transcripts for Dysarthric Speech Recognition | Dominik Wagner et.al. | 2505.12991 | null |
| 2025-05-19 | Calm-Whisper: Reduce Whisper Hallucination On Non-Speech By Calming Crazy Heads Down | Yingzhi Wang et.al. | 2505.12969 | null |
| 2025-05-16 | Automatic Speech Recognition for African Low-Resource Languages: Challenges and Future Directions | Sukairaj Hafiz Imam et.al. | 2505.11690 | null |
| 2025-05-16 | ASR-FAIRBENCH: Measuring and Benchmarking Equity Across Speech Recognition Systems | Anand Rai et.al. | 2505.11572 | null |
| 2025-05-26 | LipDiffuser: Lip-to-Speech Generation with Conditional Diffusion Models | Danilo de Oliveira et.al. | 2505.11391 | null |
| 2025-05-16 | LegoSLM: Connecting LLM with Speech Encoder using CTC Posteriors | Rao Ma et.al. | 2505.11352 | null |
| 2025-05-16 | Survey of End-to-End Multi-Speaker Automatic Speech Recognition for Monaural Audio | Xinlu He et.al. | 2505.10975 | null |
| 2025-05-27 | Multi-Stage Speaker Diarization for Noisy Classrooms | Ali Sartaz Khan et.al. | 2505.10879 | link |
| 2025-05-15 | Inclusivity of AI Speech in Healthcare: A Decade Look Back | Retno Larasati et.al. | 2505.10596 | null |
| 2025-05-15 | Quantized Approximate Signal Processing (QASP): Towards Homomorphic Encryption for audio | Tu Duyen Nguyen et.al. | 2505.10500 | null |
| 2025-05-12 | Full simulation on the dynamics of auditory synaptic fusion: Strong clustering of calcium channel might be the origin of the coherent release in the auditory hair cells | Jaeyun Yoo et.al. | 2505.07273 | null |
| 2025-05-09 | Remote Rowhammer Attack using Adversarial Observations on Federated Learning Clients | Jinsheng Yuan et.al. | 2505.06335 | null |
| 2025-05-08 | Teochew-Wild: The First In-the-wild Teochew Dataset with Orthographic Annotations | Linrong Pan et.al. | 2505.05056 | null |
| 2025-05-07 | SwinLip: An Efficient Visual Speech Encoder for Lip Reading Using Swin Transformer | Young-Hu Park et.al. | 2505.04394 | null |
| 2025-05-09 | Robust Speech Recognition with Schrödinger Bridge-Based Speech Enhancement | Rauf Nasretdinov et.al. | 2505.04237 | null |
| 2025-05-06 | VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model | Zuwei Long et.al. | 2505.03739 | link |
| 2025-05-06 | Fairness of Automatic Speech Recognition in Cleft Lip and Palate Speech | Susmita Bhattacharjee et.al. | 2505.03697 | null |
| 2025-05-26 | SepALM: Audio Language Models Are Error Correctors for Robust Speech Separation | Zhaoxi Mu et.al. | 2505.03273 | null |
| 2025-05-15 | CoGenAV: Versatile Audio-Visual Representation Learning via Contrastive-Generative Synchronization | Detao Bai et.al. | 2505.03186 | link |
| 2025-05-05 | Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play | Yemin Shi et.al. | 2505.02707 | link |
| 2025-05-08 | Transforming faces into video stories -- VideoFace2.0 | Branko Brkljač et.al. | 2505.02060 | link |
| 2025-05-06 | A Synergistic Framework of Nonlinear Acoustic Computing and Reinforcement Learning for Real-World Human-Robot Interaction | Xiaoliang Chen et.al. | 2505.01998 | null |
| 2025-05-02 | Transfer Learning-Based Deep Residual Learning for Speech Recognition in Clean and Noisy Environments | Noussaiba Djeffal et.al. | 2505.01632 | null |
| 2025-05-01 | Scaling On-Device GPU Inference for Large Generative Models | Jiuqiang Tang et.al. | 2505.00232 | null |
| 2025-07-31 | BERSting at the Screams: A Benchmark for Distanced, Emotional and Shouted Speech Recognition | Paige Tuttösí et.al. | 2505.00059 | link |
| 2025-04-30 | Retrieval-Enhanced Few-Shot Prompting for Speech Event Extraction | Máté Gedeon et.al. | 2504.21372 | null |
| 2025-04-28 | A Comprehensive Part-of-Speech Tagging to Standardize Central-Kurdish Language: A Research Guide for Kurdish Natural Language Processing Tasks | Shadan Shukr Sabr et.al. | 2504.19645 | null |
| 2025-04-25 | Kimi-Audio Technical Report | KimiTeam et.al. | 2504.18425 | link |
| 2025-04-28 | Augmenting Captions with Emotional Cues: An AR Interface for Real-Time Accessible Communication | Sunday David Ubur et.al. | 2504.17171 | null |
| 2025-04-22 | TinyML for Speech Recognition | Andrew Barovic et.al. | 2504.16213 | null |
| 2025-04-22 | LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale | Joya Chen et.al. | 2504.16030 | null |
| 2025-04-22 | Development and evaluation of a deep learning algorithm for German word recognition from lip movements | Dinh Nam Pham et.al. | 2504.15792 | null |
| 2025-04-21 | Chinese-LiPS: A Chinese audio-visual speech recognition dataset with Lip-reading and Presentation Slides | Jinghua Zhao et.al. | 2504.15066 | null |
| 2025-04-21 | StableQuant: Layer Adaptive Post-Training Quantization for Speech Foundation Models | Yeona Hong et.al. | 2504.14915 | null |
| 2025-04-17 | Acoustic to Articulatory Inversion of Speech; Data Driven Approaches, Challenges, Applications, and Future Scope | Leena G Pillai et.al. | 2504.13308 | null |
| 2025-05-04 | Dysarthria Normalization via Local Lie Group Transformations for Robust ASR | Mikhail Osipov et.al. | 2504.12279 | link |
| 2025-04-03 | Edge Intelligence for Wildlife Conservation: Real-Time Hornbill Call Classification Using TinyML | Kong Ka Hing et.al. | 2504.12272 | null |
| 2025-04-19 | Advancing Arabic Speech Recognition Through Large-Scale Weakly Supervised Learning | Mahmoud Salhab et.al. | 2504.12254 | null |
| 2025-04-15 | Real-Time Word-Level Temporal Segmentation in Streaming Speech Recognition | Naoto Nishida et.al. | 2504.10849 | null |
| 2025-04-25 | Spatial Audio Processing with Large Language Model on Wearable Devices | Ayushi Mishra et.al. | 2504.08907 | null |
| 2025-04-10 | From Speech to Summary: A Comprehensive Survey of Speech Summarization | Fabian Retkowski et.al. | 2504.08024 | null |
| 2025-04-09 | Visual-Aware Speech Recognition for Noisy Scenarios | Lakshmipathi Balaji et.al. | 2504.07229 | null |
| 2025-04-09 | RNN-Transducer-based Losses for Speech Recognition on Noisy Targets | Vladimir Bataev et.al. | 2504.06963 | link |
| 2025-04-07 | DoCIA: An Online Document-Level Context Incorporation Agent for Speech Translation | Xinglin Lyu et.al. | 2504.05122 | null |
| 2025-04-06 | Public speech recognition transcripts as a configuring parameter | Damien Rudaz et.al. | 2504.04488 | null |
| 2025-04-06 | Selective Masking Adversarial Attack on Automatic Speech Recognition Systems | Zheng Fang et.al. | 2504.04394 | null |
| 2025-05-08 | An Efficient GPU-based Implementation for Noise Robust Sound Source Localization | Zirui Lin et.al. | 2504.03373 | null |
| 2025-04-04 | A Human Digital Twin Architecture for Knowledge-based Interactions and Context-Aware Conversations | Abdul Mannan Mohammed et.al. | 2504.03147 | null |
| 2025-03-26 | Efficient First-Order Optimization on the Pareto Set for Multi-Objective Learning under Preference Guidance | Lisha Chen et.al. | 2504.02854 | null |
| 2025-04-03 | LinTO Audio and Textual Datasets to Train and Evaluate Automatic Speech Recognition in Tunisian Arabic Dialect | Hedi Naouara et.al. | 2504.02604 | null |
| 2025-04-22 | F5R-TTS: Improving Flow-Matching based Text-to-Speech with Group Relative Policy Optimization | Xiaohui Sun et.al. | 2504.02407 | null |
| 2025-04-02 | Chain of Correction for Full-text Speech Recognition with Large Language Models | Zhiyuan Tang et.al. | 2504.01519 | null |
| 2025-04-01 | Whispering Under the Eaves: Protecting User Privacy Against Commercial and LLM-powered Automatic Speech Recognition Systems | Weifei Jin et.al. | 2504.00858 | link |
| 2025-03-31 | SVLA: A Unified Speech-Vision-Language Assistant with Multimodal Reasoning and Speech Generation | Ngoc Dung Huynh et.al. | 2503.24164 | null |
| 2025-04-02 | TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection | Zhiming Ma et.al. | 2503.24115 | link |
| 2025-03-30 | The Impact of Code-switched Synthetic Data Quality is Task Dependent: Insights from MT and ASR | Injy Hamed et.al. | 2503.23576 | null |
| 2025-03-30 | Whisper-LM: Improving ASR Models with Language Models for Low-Resource Languages | Xabier de Zuazo et.al. | 2503.23542 | link |
| 2025-03-30 | Scaling Auditory Cognition via Test-Time Compute in Audio Language Models | Ting Dang et.al. | 2503.23395 | null |
| 2025-04-25 | Coverage-Guaranteed Speech Emotion Recognition via Calibrated Uncertainty-Adaptive Prediction Sets | Zijun Jia et.al. | 2503.22712 | null |
| 2025-03-13 | Enhancing Aviation Communication Transcription: Fine-Tuning Distil-Whisper with LoRA | Shokoufeh Mirzaei et.al. | 2503.22692 | null |
| 2025-03-05 | Qieemo: Speech Is All You Need in the Emotion Recognition in Conversations | Jinming Chen et.al. | 2503.22687 | null |
| 2025-03-11 | Lend a Hand: Semi Training-Free Cued Speech Recognition via MLLM-Driven Hand Modeling for Barrier-free Communication | Guanjie Huang et.al. | 2503.21785 | link |
| 2025-03-27 | VALLR: Visual ASR Language Model for Lip Reading | Marshall Thomas et.al. | 2503.21408 | null |
| 2025-03-27 | A 71.2- | Chih-Chyau Yang et.al. | 2503.21337 | null |
| 2025-03-26 | Improving Speech Recognition Accuracy Using Custom Language Models with the Vosk Toolkit | Aniket Abhishek Soni et.al. | 2503.21025 | null |
| 2025-03-26 | FinAudio: A Benchmark for Audio Large Language Models in Financial Applications | Yupeng Cao et.al. | 2503.20990 | null |
| 2025-03-26 | Dolphin: A Large-Scale Automatic Speech Recognition Model for Eastern Languages | Yangyang Meng et.al. | 2503.20212 | link |
| 2025-03-25 | Contextual Metric Meta-Evaluation by Measuring Local Metric Accuracy | Athiya Deviyani et.al. | 2503.19828 | null |
| 2025-03-25 | Boosting the Transferability of Audio Adversarial Examples with Acoustic Representation Optimization | Weifei Jin et.al. | 2503.19591 | null |
| 2025-03-25 | Design of Seamless Multi-modal Interaction Framework for Intelligent Virtual Agents in Wearable Mixed Reality Environment | Ghazanfar Ali et.al. | 2503.19334 | null |
| 2025-05-13 | From S4 to Mamba: A Comprehensive Survey on Structured State Space Models | Shriyank Somvanshi et.al. | 2503.18970 | null |
| 2025-03-28 | Whispering in Amharic: Fine-tuning Whisper for Low-resource Language | Dawit Ketema Gete et.al. | 2503.18485 | null |
| 2025-03-23 | Elevating Robust Multi-Talker ASR by Decoupling Speaker Separation and Speech Recognition | Yufeng Yang et.al. | 2503.17886 | null |
| 2025-03-21 | Your voice is your voice: Supporting Self-expression through Speech Generation and LLMs in Augmented and Alternative Communication | Yiwen Xu et.al. | 2503.17479 | null |
| 2025-03-20 | SeniorTalk: A Chinese Conversation Dataset with Rich Annotations for Super-Aged Seniors | Yang Chen et.al. | 2503.16578 | null |
| 2025-03-19 | A Comprehensive Survey on Architectural Advances in Deep CNNs: Challenges, Applications, and Emerging Research Directions | Saddam Hussain Khan et.al. | 2503.16546 | null |
| 2025-02-27 | ACE, Action and Control via Explanations: A Proposal for LLMs to Provide Human-Centered Explainability for Multimodal AI Assistants | Elizabeth Anne Watkins et.al. | 2503.16466 | null |
| 2025-03-19 | Evaluating ASR Confidence Scores for Automated Error Detection in User-Assisted Correction Interfaces | Korbinian Kuhn et.al. | 2503.15124 | null |
| 2025-03-19 | Communication Access Real-Time Translation Through Collaborative Correction of Automatic Speech Recognition | Korbinian Kuhn et.al. | 2503.15120 | null |
| 2025-03-07 | A Causal Inference Approach for Quantifying Research Impact | Keiichi Ochiai et.al. | 2503.13485 | null |
| 2025-04-19 | Halving transcription time: A fast, user-friendly and GDPR-compliant workflow to create AI-assisted transcripts for content analysis | Jakob Sponholz et.al. | 2503.13031 | null |
| 2025-03-04 | CORDIC Is All You Need | Omkar Kokane et.al. | 2503.11685 | null |
| 2025-03-14 | MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens | Jeong Hun Yeo et.al. | 2503.11315 | link |
| 2025-03-13 | Whisper Speaker Identification: Leveraging Pre-Trained Multilingual Transformers for Robust Speaker Embeddings | Jakaria Islam Emon et.al. | 2503.10446 | link |
| 2025-03-14 | Proceedings of the ISCA/ITG Workshop on Diversity in Large Speech and Language Models | Sebastian Möller et.al. | 2503.10298 | null |
| 2025-04-07 | ValSub: Subsampling Validation Data to Mitigate Forgetting during ASR Personalization | Haaris Mehmood et.al. | 2503.09906 | null |
| 2025-03-12 | Quantization for OpenAI's Whisper Models: A Comparative Analysis | Allison Andreyev et.al. | 2503.09905 | link |
| 2025-03-12 | Everything Can Be Described in Words: A Simple Unified Multi-Modal Framework with Semantic and Temporal Alignment | Xiaowei Bi et.al. | 2503.09081 | null |
| 2025-03-11 | An Exhaustive Evaluation of TTS- and VC-based Data Augmentation for ASR | Sewade Ogun et.al. | 2503.08954 | null |
| 2025-03-11 | Prompt2LVideos: Exploring Prompts for Understanding Long-Form Multimodal Videos | Soumya Shamarao Jahagirdar et.al. | 2503.08335 | null |
| 2025-03-10 | Building English ASR model with regional language support | Purvi Agrawal et.al. | 2503.07522 | null |
| 2025-03-30 | Automatic Speech Recognition for Non-Native English: Accuracy and Disfluency Handling | Michael McGuire et.al. | 2503.06924 | null |
| 2025-03-09 | Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs | Umberto Cappellazzo et.al. | 2503.06362 | null |
| 2025-03-08 | Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations | Jeong Hun Yeo et.al. | 2503.06273 | link |
| 2025-03-08 | A Noise-Robust Turn-Taking System for Real-World Dialogue Robots: A Field Experiment | Koji Inoue et.al. | 2503.06241 | null |
| 2025-03-06 | From Voice to Safety: Language AI Powered Pilot-ATC Communication Understanding for Airport Surface Movement Collision Risk Assessment | Yutian Pang et.al. | 2503.04974 | null |
| 2025-03-04 | Normalization through Fine-tuning: Understanding Wav2vec 2.0 Embeddings for Phonetic Analysis | Yiming Wang et.al. | 2503.04814 | null |
| 2025-03-03 | Direct Speech to Speech Translation: A Review | Mohammad Sarim et.al. | 2503.04799 | null |
| 2025-03-06 | Self-Supervised Models for Phoneme Recognition: Applications in Children's Speech for Reading Learning | Lucas Block Medin et.al. | 2503.04710 | null |
| 2025-03-07 | Efficient Finetuning for Dimensional Speech Emotion Recognition in the Age of Transformers | Aneesha Sampath et.al. | 2503.03756 | null |
| 2025-03-03 | Fine-Tuning Whisper for Inclusive Prosodic Stress Analysis | Samuel S. Sohn et.al. | 2503.02907 | null |
| 2025-03-04 | Go Beyond Your Means: Unlearning with Per-Sample Gradient Orthogonalization | Aviv Shamsian et.al. | 2503.02312 | null |
| 2025-03-05 | Pruning Deep Neural Networks via a Combination of the Marchenko-Pastur Distribution and Regularization | Leonid Berlyand et.al. | 2503.01922 | null |
| 2025-03-07 | Nexus-O: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision | Che Liu et.al. | 2503.01879 | null |
| 2025-03-02 | Unveiling Biases while Embracing Sustainability: Assessing the Dual Challenges of Automatic Speech Recognition Systems | Ajinkya Kulkarni et.al. | 2503.00907 | null |
| 2025-03-02 | UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation | Alexander H. Liu et.al. | 2503.00733 | null |
| 2025-02-27 | LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation | Keisuke Kamahori et.al. | 2502.20583 | link |
| 2025-02-27 | Adapting Automatic Speech Recognition for Accented Air Traffic Control Communications | Marcus Yu Zhe Wee et.al. | 2502.20311 | null |
| 2025-02-27 | CleanMel: Mel-Spectrogram Enhancement for Improving Both Speech Quality and ASR | Nian Shao et.al. | 2502.20040 | null |
| 2025-03-12 | CS-Dialogue: A 104-Hour Dataset of Spontaneous Mandarin-English Code-Switching Dialogues for Speech Recognition | Jiaming Zhou et.al. | 2502.18913 | null |
| 2025-02-26 | Exploring Gender Disparities in Automatic Speech Recognition Technology | Hend ElGhazaly et.al. | 2502.18434 | null |
| 2025-02-25 | Silent Speech Sentence Recognition with Six-Axis Accelerometers using Conformer and CTC Algorithm | Yudong Xie et.al. | 2502.17829 | null |
| 2025-02-26 | Low-Rank and Sparse Model Merging for Multi-Lingual Speech Recognition and Translation | Qiuming Zhao et.al. | 2502.17380 | null |
| 2025-02-25 | Improving the Inclusivity of Dutch Speech Recognition by Fine-tuning Whisper on the JASMIN-CGN Corpus | Golshid Shekoufandeh et.al. | 2502.17284 | link |
| 2025-02-24 | Balancing Speech Understanding and Generation Using Continual Pre-training for Codec-based Speech LLM | Jiatong Shi et.al. | 2502.16897 | null |
| 2025-02-22 | Understanding Zero-shot Rare Word Recognition Improvements Through LLM Integration | Haoxuan Wang et.al. | 2502.16142 | null |
| 2025-02-21 | The Esethu Framework: Reimagining Sustainable Dataset Governance and Curation for Low-Resource Languages | Jenalea Rajab et.al. | 2502.15916 | null |
| 2025-02-21 | Retrieval-Augmented Speech Recognition Approach for Domain Challenges | Peng Shen et.al. | 2502.15264 | null |
| 2025-02-21 | Enhancing Speech Large Language Models with Prompt-Aware Mixture of Audio Encoders | Weiqiao Shan et.al. | 2502.15178 | null |
| 2025-02-21 | Improving Streaming Speech Recognition With Time-Shifted Contextual Attention And Dynamic Right Context Masking | Khanh Le et.al. | 2502.15158 | null |
| 2025-02-20 | WavRAG: Audio-Integrated Retrieval Augmented Generation for Spoken Dialogue Models | Yifu Chen et.al. | 2502.14727 | null |
| 2025-02-20 | SegAug: CTC-Aligned Segmented Augmentation For Robust RNN-Transducer Based Speech Recognition | Khanh Le et.al. | 2502.14685 | null |
| 2025-02-20 | Moshi Moshi? A Model Selection Hijacking Adversarial Attack | Riccardo Petrucci et.al. | 2502.14586 | null |
| 2025-02-18 | Gesture-Aware Zero-Shot Speech Recognition for Patients with Language Disorders | Seungbae Kim et.al. | 2502.13983 | null |
| 2025-02-18 | Benchmarking Automatic Speech Recognition coupled LLM Modules for Medical Diagnostics | Kabir Kumar et.al. | 2502.13982 | null |
| 2025-02-19 | Measuring the Effect of Transcription Noise on Downstream Language Understanding Tasks | Ori Shapira et.al. | 2502.13645 | link |
| 2025-02-21 | VLAS: Vision-Language-Action Model With Speech Instructions For Customized Robot Manipulation | Wei Zhao et.al. | 2502.13508 | link |
| 2025-02-19 | Adopting Whisper for Confidence Estimation | Vaibhav Aggarwal et.al. | 2502.13446 | null |
| 2025-02-18 | Neuro-oscillatory models of cortical speech processing | Olesia Dogonasheva et.al. | 2502.12935 | null |
| 2025-02-18 | Lost in Transcription, Found in Distribution Shift: Demystifying Hallucination in Speech Foundation Models | Hanin Atwany et.al. | 2502.12414 | null |
| 2025-02-18 | On the Robust Approximation of ASR Metrics | Abdul Waheed et.al. | 2502.12408 | null |
| 2025-02-17 | NaturalL2S: End-to-End High-quality Multispeaker Lip-to-Speech Synthesis with Differential Digital Signal Processing | Yifan Liang et.al. | 2502.12002 | null |
| 2025-02-17 | Can you pass that tool?: Implications of Indirect Speech in Physical Human-Robot Collaboration | Yan Zhang et.al. | 2502.11720 | null |
| 2025-02-28 | In Situ Optimization of an Optoelectronic Reservoir Computer with Digital Delayed Feedback | Fyodor Morozko et.al. | 2502.11126 | null |
| 2025-04-03 | DuplexMamba: Enhancing Real-time Speech Conversations with Duplex and Streaming Capabilities | Xiangyu Lu et.al. | 2502.11123 | link |
| 2025-02-11 | MoHAVE: Mixture of Hierarchical Audio-Visual Experts for Robust Speech Recognition | Sungnyun Kim et.al. | 2502.10447 | null |
| 2025-02-14 | OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models | William Chen et.al. | 2502.10373 | null |
| 2025-02-14 | MTLM: an Innovative Language Model Training Paradigm for ASR | Qingliang Meng et.al. | 2502.10058 | null |
| 2025-02-14 | A Preliminary Exploration with GPT-4o Voice Mode | Yu-Xiang Lin et.al. | 2502.09940 | null |
| 2025-02-14 | Microphone Array Geometry Independent Multi-Talker Distant ASR: NTT System for the DASR Task of the CHiME-8 Challenge | Naoyuki Kamo et.al. | 2502.09859 | null |
| 2025-02-13 | Shortcut Learning Susceptibility in Vision Classifiers | Pirzada Suhail et.al. | 2502.09150 | null |
| 2025-02-13 | Quantum Approaches for Dysphonia Assessment in Small Speech Datasets | Ha Tran et.al. | 2502.08968 | null |
| 2025-02-12 | Causal Analysis of ASR Errors for Children: Quantifying the Impact of Physiological, Cognitive, and Extrinsic Factors | Vishwanath Pratap Singh et.al. | 2502.08587 | null |
| 2025-02-24 | VINP: Variational Bayesian Inference with Neural Speech Prior for Joint ASR-Effective Speech Dereverberation and Blind RIR Identification | Pengyu Wang et.al. | 2502.07205 | link |
| 2025-02-16 | A Comparative Study of ASR Implementations in Resource-Constrained Wireless Sensor Networks for Real-Time Voice Communication | Inaam F. Qutaiba I. Ali et.al. | 2502.06969 | null |
| 2025-02-19 | Speech to Speech Translation with Translatotron: A State of the Art Review | Jules R. Kala et.al. | 2502.05980 | null |
| 2025-02-09 | Audio-Visual Representation Learning via Knowledge Distillation from Speech Foundation Models | Jing-Xuan Zhang et.al. | 2502.05766 | link |
| 2025-02-07 | Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance | Shehzeen Hussain et.al. | 2502.05236 | null |
| 2025-02-06 | Aligner-Encoders: Self-Attention Transformers Can Be Self-Transducers | Adam Stooke et.al. | 2502.05232 | null |
| 2025-02-07 | Evaluating Standard and Dialectal Frisian ASR: Multilingual Fine-tuning and Language Identification for Improved Low-resource Performance | Reihaneh Amooie et.al. | 2502.04883 | null |
| 2025-02-07 | Lightweight Operations for Visual Speech Recognition | Iason Ioannis Panagos et.al. | 2502.04834 | null |
| 2025-02-06 | Afrispeech-Dialog: A Benchmark Dataset for Spontaneous English Conversations in Healthcare and Beyond | Mardhiyah Sanni et.al. | 2502.03945 | null |
| 2025-02-06 | Rule-Based Modeling of Low-Dimensional Data with PCA and Binary Particle Swarm Optimization (BPSO) in ANFIS | Afnan Al-Ali et.al. | 2502.03895 | null |
| 2025-02-05 | Integrating automatic speech recognition into remote healthcare interpreting: A pilot study of its impact on interpreting quality | Shiyi Tan et.al. | 2502.03381 | null |
| 2025-02-05 | Leveraging Broadcast Media Subtitle Transcripts for Automatic Speech Recognition and Subtitling | Jakob Poncelet et.al. | 2502.03212 | link |
| 2025-01-26 | SEAL: Speech Embedding Alignment Learning for Speech Large Language Model with Retrieval-Augmented Generation | Chunyu Sun et.al. | 2502.02603 | null |
| 2025-03-05 | CTC-DRO: Robust Optimization for Reducing Language Disparities in Speech Recognition | Martijn Bartelds et.al. | 2502.01777 | null |
| 2025-02-03 | Adapter-Based Multi-Agent AVSR Extension for Pre-Trained ASR Models | Christopher Simic et.al. | 2502.01709 | null |
| 2025-01-29 | Privacy-Preserving Edge Speech Understanding with Tiny Foundation Models | Afsara Benazir et.al. | 2502.01649 | null |
| 2025-02-03 | A Differentiable Alignment Framework for Sequence-to-Sequence Modeling via Optimal Transport | Yacouba Kaloga et.al. | 2502.01588 | null |
| 2025-02-11 | mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition | Andrew Rouditchenko et.al. | 2502.01547 | link |
| 2025-02-03 | Gradient Norm-based Fine-Tuning for Backdoor Defense in Automatic Speech Recognition | Nanjun Zhou et.al. | 2502.01152 | null |
| 2025-02-01 | Data-Driven Mispronunciation Pattern Discovery for Robust Speech Recognition | Anna Seo Gyeong Choi et.al. | 2502.00583 | null |
| 2025-02-17 | Evaluation of End-to-End Continuous Spanish Lipreading in Different Data Conditions | David Gimeno-Gómez et.al. | 2502.00464 | link |
| 2025-02-04 | Sagalee: an Open Source Automatic Speech Recognition Dataset for Oromo Language | Turi Abu et.al. | 2502.00421 | link |
| 2025-02-01 | When End-to-End is Overkill: Rethinking Cascaded Speech-to-Text Translation | Anna Min et.al. | 2502.00377 | null |
| 2025-02-03 | SELMA: A Speech-Enabled Language Model for Virtual Assistant Interactions | Dominik Wagner et.al. | 2501.19377 | null |
| 2025-01-31 | Language Bias in Self-Supervised Learning For Automatic Speech Recognition | Edward Storey et.al. | 2501.19321 | null |
| 2025-02-03 | DyPCL: Dynamic Phoneme-level Contrastive Learning for Dysarthric Speech Recognition | Wonjun Lee et.al. | 2501.19010 | null |
| 2025-01-29 | Cross-lingual Embedding Clustering for Hierarchical Softmax in Low-Resource Multilingual Speech Recognition | Zhengdong Yang et.al. | 2501.17615 | null |
| 2025-01-28 | RDMM: Fine-Tuned LLM Models for On-Device Robotic Decision Making with Enhanced Contextual Awareness in Specific Domains | Shady Nasrat et.al. | 2501.16899 | link |
| 2025-01-28 | AVE Speech Dataset: A Comprehensive Benchmark for Multi-Modal Speech Recognition Integrating Audio, Visual, and Electromyographic Signals | Dongliang Zhou et.al. | 2501.16780 | null |
| 2025-01-28 | SCDiar: a streaming diarization system based on speaker change detection and speech recognition | Naijun Zheng et.al. | 2501.16641 | null |
| 2025-01-27 | Optimized Self-supervised Training with BEST-RQ for Speech Recognition | Ilja Baumann et.al. | 2501.16131 | null |
| 2025-01-27 | Classification Error Bound for Low Bayes Error Conditions in Machine Learning | Zijian Yang et.al. | 2501.15977 | null |
| 2025-01-26 | End-to-End Target Speaker Speech Recognition Using Context-Aware Attention Mechanisms for Challenging Enrollment Scenario | Mohsen Ghane et.al. | 2501.15466 | null |
| 2025-01-25 | The Multicultural Medical Assistant: Can LLMs Improve Medical ASR Errors Across Borders? | Ayo Adedeji et.al. | 2501.15310 | null |
| 2025-01-25 | Speech Translation Refinement using Large Language Models | Huaixia Dou et.al. | 2501.15090 | link |
| 2025-01-25 | Robust Cross-Etiology and Speaker-Independent Dysarthric Speech Recognition | Satwinder Singh et.al. | 2501.14994 | null |
| 2025-02-07 | Methods to Increase the Amount of Data for Speech Recognition for Low Resource Languages | Alexan Ayrapetyan et.al. | 2501.14788 | null |
| 2025-01-24 | FireRedASR: Open-Source Industrial-Grade Mandarin Speech Recognition Models from Encoder-Decoder to LLM Integration | Kai-Tuo Xu et.al. | 2501.14350 | link |
| 2025-01-24 | LoCoML: A Framework for Real-World ML Inference Pipelines | Kritin Maddireddy et.al. | 2501.14165 | null |
| 2025-01-23 | Integrating Persian Lip Reading in Surena-V Humanoid Robot for Human-Robot Interaction | Ali Farshian Abbasi et.al. | 2501.13996 | null |
| 2025-01-18 | Fanar: An Arabic-Centric Multimodal Generative AI Platform | Fanar Team et.al. | 2501.13944 | null |
| 2025-01-23 | Predicting Compact Phrasal Rewrites with Large Language Models for ASR Post Editing | Hao Zhang et.al. | 2501.13831 | null |
| 2025-01-23 | Learning-based A Posteriori Speech Presence Probability Estimation and Applications | Shuai Tao et.al. | 2501.13642 | null |
| 2025-01-23 | DQ-Data2vec: Decoupling Quantization for Multilingual Speech Recognition | Qijie Shao et.al. | 2501.13497 | null |
| 2025-02-16 | OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia | Xuelong Geng et.al. | 2501.13306 | link |
| 2025-01-22 | Let SSMs be ConvNets: State-space Modeling with Optimal Tensor Contractions | Yan Ru Pei et.al. | 2501.13230 | null |
| 2025-01-22 | FlanEC: Exploring Flan-T5 for Post-ASR Error Correction | Moreno La Quatra et.al. | 2501.12979 | link |
| 2025-01-21 | A Domain Adaptation Framework for Speech Recognition Systems with Only Synthetic data | Minh Tran et.al. | 2501.12501 | null |
| 2025-01-21 | DOTA-ME-CS: Daily Oriented Text Audio-Mandarin English-Code Switching Dataset | Yupei Li et.al. | 2501.12122 | null |
| 2025-01-20 | Investigation of Whisper ASR Hallucinations Induced by Non-Speech Audio | Mateusz Barański et.al. | 2501.11378 | null |
| 2025-01-19 | Enhancing Neural Spoken Language Recognition: An Exploration with Multilingual Datasets | Or Haim Anidjar et.al. | 2501.11065 | null |
| 2025-01-18 | A Benchmark of French ASR Systems Based on Error Severity | Antoine Tholly et.al. | 2501.10879 | null |
| 2025-01-18 | GEC-RAG: Improving Generative Error Correction via Retrieval-Augmented Generation for Automatic Speech Recognition Systems | Amin Robatian et.al. | 2501.10734 | null |
| 2025-01-17 | Unsupervised Rhythm and Voice Conversion of Dysarthric to Healthy Speech for ASR | Karl El Hajal et.al. | 2501.10256 | null |
| 2025-01-17 | Automatic Speech Recognition for Sanskrit with Transfer Learning | Bidit Sadhukhan et.al. | 2501.10024 | null |
| 2025-01-21 | PIER: A Novel Metric for Evaluating What Matters in Code-Switching | Enes Yavuz Ugan et.al. | 2501.09512 | null |
| 2025-01-16 | Teaching Wav2Vec2 the Language of the Brain | Tobias Fiedler et.al. | 2501.09459 | link |
| 2025-01-16 | Delayed Fusion: Integrating Large Language Models into First-Pass Decoding in End-to-end Speech Recognition | Takaaki Hori et.al. | 2501.09258 | null |
| 2025-01-17 | persoDA: Personalized Data Augmentation for Personalized ASR | Pablo Peso Parada et.al. | 2501.09113 | null |
| 2025-01-20 | A Non-autoregressive Model for Joint STT and TTS | Vishal Sunder et.al. | 2501.09104 | null |
| 2025-01-13 | Discrimination loss vs. SRT: A model-based approach towards harmonizing speech test interpretations | Mareike Buhl et.al. | 2501.08921 | null |
| 2025-01-15 | Adapting Whisper for Regional Dialects: Enhancing Public Services for Vulnerable Populations in the United Kingdom | Melissa Torgbi et.al. | 2501.08502 | null |
| 2025-01-14 | Selective Attention Merging for low resource tasks: A case study of Child ASR | Natarajan Balaji Shankar et.al. | 2501.08468 | link |
| 2025-01-14 | Loudspeaker Beamforming to Enhance Speech Recognition Performance of Voice Driven Applications | Dimme de Groot et.al. | 2501.08104 | null |
| 2025-01-17 | Joint Automatic Speech Recognition And Structure Learning For Better Speech Understanding | Jiliang Hu et.al. | 2501.07329 | link |
| 2025-01-13 | Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model | Ziyang Ma et.al. | 2501.07246 | null |
| 2025-01-13 | AdaCS: Adaptive Normalization for Enhanced Code-Switching ASR | The Chuong Chu et.al. | 2501.07102 | link |
| 2025-01-11 | Discrete Speech Unit Extraction via Independent Component Analysis | Tomohiko Nakamura et.al. | 2501.06562 | link |
| 2025-01-11 | A Survey on Spoken Italian Datasets and Corpora | Marco Giordano et.al. | 2501.06557 | null |
| 2025-01-11 | Speech Recognition for Automatically Assessing Afrikaans and isiXhosa Preschool Oral Narratives | Christiaan Jacobs et.al. | 2501.06478 | null |
| 2025-01-10 | TTS-Transducer: End-to-End Speech Synthesis with Neural Transducer | Vladimir Bataev et.al. | 2501.06320 | null |
| 2025-01-10 | Contextual ASR Error Handling with LLMs Augmentation for Goal-Oriented Conversational AI | Yuya Asano et.al. | 2501.06129 | null |
| 2025-02-19 | Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding | Fabian David Schmidt et.al. | 2501.06117 | link |
| 2025-01-10 | Benchmarking Rotary Position Embeddings for Automatic Speech Recognition | Shucong Zhang et.al. | 2501.06051 | null |
| 2025-01-19 | Comparing Self-Supervised Learning Models Pre-Trained on Human Speech and Animal Vocalizations for Bioacoustics Processing | Eklavya Sarkar et.al. | 2501.05987 | link |
| 2025-01-10 | Universal-2-TF: Robust All-Neural Text Formatting for ASR | Yash Khare et.al. | 2501.05948 | null |
| 2025-01-09 | Right Label Context in End-to-End Training of Time-Synchronous ASR Models | Tina Raissi et.al. | 2501.04521 | null |
| 2025-01-08 | Phone-purity Guided Discrete Tokens for Dysarthric Speech Recognition | Huimeng Wang et.al. | 2501.04379 | null |
| 2025-01-08 | LipGen: Viseme-Guided Lip Video Generation for Enhancing Visual Speech Recognition | Bowen Hao et.al. | 2501.04204 | null |
| 2025-01-03 | Listening and Seeing Again: Generative Error Correction for Audio-Visual Speech Recognition | Rui Liu et.al. | 2501.04038 | link |
| 2025-01-07 | Universal Speaker Embedding Free Target Speaker Extraction and Personal Voice Activity Detection | Bang Zeng et.al. | 2501.03612 | null |
| 2025-01-14 | Towards a Generalizable Speech Marker for Parkinson's Disease Diagnosis | Maksim Siniukov et.al. | 2501.03581 | null |
| 2025-01-07 | Deep Learning for Pathological Speech: A Survey | Shakeel A. Sheikh et.al. | 2501.03536 | null |
| 2025-01-01 | Breaking Through the Spike: Spike Window Decoding for Accelerated and Precise Automatic Speech Recognition | Wei Zhang et.al. | 2501.03257 | null |
| 2025-01-08 | Samba-ASR: State-Of-The-Art Speech Recognition Leveraging Structured State-Space Models | Syed Abdul Gaffar Shakhadri et.al. | 2501.02832 | null |
| 2025-01-05 | Reducing the Gap Between Pretrained Speech Enhancement and Recognition Models Using a Real Speech-Trained Bridging Module | Zhongjian Cui et.al. | 2501.02452 | null |
| 2025-01-03 | Improving Transducer-Based Spoken Language Understanding with Self-Conditioned CTC and Knowledge Transfer | Vishal Sunder et.al. | 2501.01936 | null |
| 2025-01-11 | Advancing Singlish Understanding: Bridging the Gap with Datasets and Multimodal Models | Bin Wang et.al. | 2501.01034 | link |
| 2025-01-01 | Incremental Dialogue Management: Survey, Discussion, and Implications for HRI | Casey Kennington et.al. | 2501.00953 | null |
| 2025-01-01 | Large Language Models Are Read/Write Policy-Makers for Simultaneous Generation | Shoutao Guo et.al. | 2501.00868 | link |
| 2025-01-01 | Automatic Text Pronunciation Correlation Generation and Application for Contextual Biasing | Gaofeng Cheng et.al. | 2501.00804 | null |
| 2024-12-31 | Fotheidil: an Automatic Transcription System for the Irish Language | Liam Lonergan et.al. | 2501.00509 | null |
| 2024-12-31 | Whisper Turns Stronger: Augmenting Wav2Vec 2.0 for Superior ASR in Low-Resource Languages | Or Haim Anidjar et.al. | 2501.00425 | null |
| 2025-01-06 | Takeaways from Applying LLM Capabilities to Multiple Conversational Avatars in a VR Pilot Study | Mykola Maslych et.al. | 2501.00168 | null |
| 2024-12-30 | DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition | Alexander Polok et.al. | 2501.00114 | link |
| 2024-12-25 | Speech Recognition With LLMs Adapted to Disordered Speech Using Reinforcement Learning | Chirag Nagpal et.al. | 2501.00039 | null |
| 2024-12-27 | Enhancing Whisper's Accuracy and Speed for Indian Languages through Prompt-Tuning and Tokenization | Kumud Tripathi et.al. | 2412.19785 | null |
| 2024-12-26 | Towards a Single ASR Model That Generalizes to Disordered Speech | Jimmy Tobin et.al. | 2412.19315 | null |
| 2024-12-26 | Enhancing Audiovisual Speech Recognition through Bifocal Preference Optimization | Yihan Wu et.al. | 2412.19005 | link |
| 2024-12-25 | Structured Speaker-Deficiency Adaptation of Foundation Models for Dysarthric and Elderly Speech Recognition | Shujie Hu et.al. | 2412.18832 | null |
| 2024-12-30 | Zero-resource Speech Translation and Recognition with LLMs | Karel Mundnich et.al. | 2412.18566 | null |
| 2025-01-09 | Trading Devil RL: Backdoor attack via Stock market, Bayesian Optimization and Reinforcement Learning | Orson Mengara et.al. | 2412.17908 | null |
| 2024-12-09 | Ensemble Machine Learning Model for Inner Speech Recognition: A Subject-Specific Investigation | Shahamat Mustavi Tasin et.al. | 2412.17824 | null |
| 2024-12-23 | Investigating Prosodic Signatures via Speech Pre-Trained Models for Audio Deepfake Source Attribution | Orchid Chetia Phukan et.al. | 2412.17796 | null |
| 2024-12-23 | UME: Upcycling Mixture-of-Experts for Scalable and Efficient Automatic Speech Recognition | Li Fu et.al. | 2412.17507 | null |
| 2024-12-23 | Deep Learning in Proteomics Informatics: Applications, Challenges, and Future Directions | Yindan Luo et.al. | 2412.17349 | null |
| 2025-01-17 | Uncovering the Visual Contribution in Audio-Visual Speech Recognition | Zhaofeng Lin et.al. | 2412.17129 | null |
| 2025-01-05 | Adapting Whisper for Code-Switching through Encoding Refining and Language-Aware Decoding | Jiahui Zhao et.al. | 2412.16507 | null |
| 2025-01-03 | Speech Retrieval-Augmented Generation without Automatic Speech Recognition | Do June Min et.al. | 2412.16500 | null |
| 2024-12-21 | Enhancing Multilingual ASR for Unseen Languages via Language Embedding Modeling | Shao-Syuan Huang et.al. | 2412.16474 | null |
| 2024-12-21 | Transducer-Llama: Integrating LLMs into Streamable Transducer-based Speech Recognition | Keqi Deng et.al. | 2412.16464 | null |
| 2025-01-19 | MathSpeech: Leveraging Small LMs for Accurate Conversion in Mathematical Speech-to-Formula | Sieun Hyeon et.al. | 2412.15655 | link |
| 2024-12-20 | TouchASP: Elastic Automatic Speech Perception that Everyone Can Touch | Xingchen Song et.al. | 2412.15622 | null |
| 2024-12-19 | Transcribing and Translating, Fast and Slow: Joint Speech Translation and Recognition | Niko Moritz et.al. | 2412.15415 | null |
| 2024-12-23 | LAMA-UT: Language Agnostic Multilingual ASR through Orthography Unification and Language-Specific Transliteration | Sangmin Lee et.al. | 2412.15299 | null |
| 2025-01-09 | CAMEL: Cross-Attention Enhanced Mixture-of-Experts and Language Bias for Code-Switching Speech Recognition | He Wang et.al. | 2412.12760 | null |
| 2024-12-24 | Streaming Keyword Spotting Boosted by Cross-layer Discrimination Consistency | Yu Xi et.al. | 2412.12635 | null |
| 2024-12-11 | Greek2MathTex: A Greek Speech-to-Text Framework for LaTeX Equations Generation | Evangelia Gkritzali et.al. | 2412.12167 | null |
| 2024-12-09 | Harnessing Transfer Learning from Swahili: Advancing Solutions for Comorian Dialects | Naira Abdou Mohamed et.al. | 2412.12143 | null |
| 2024-12-17 | Speak & Improve Corpus 2025: an L2 English Speech Corpus for Language Assessment and Feedback | Kate Knill et.al. | 2412.11986 | null |
| 2024-12-17 | Speak & Improve Challenge 2025: Tasks and Baseline Systems | Mengjie Qian et.al. | 2412.11985 | null |
| 2024-12-20 | MERaLiON-SpeechEncoder: Towards a Speech Foundation Model for Singapore and Beyond | Muhammad Huzaifah et.al. | 2412.11538 | null |
| 2024-12-15 | Transliterated Zero-Shot Domain Adaptation for Automatic Speech Recognition | Han Zhu et.al. | 2412.11185 | null |
| 2024-12-14 | Robust Recognition of Persian Isolated Digits in Speech using Deep Neural Network | Ali Nasr-Esfahani et.al. | 2412.10857 | null |
| 2024-12-14 | Efficient Adaptation of Multilingual Models for Japanese ASR | Mark Bajo et.al. | 2412.10705 | link |
| 2025-01-16 | MERaLiON-AudioLLM: Bridging Audio and Language with Large Language Models | Yingxu He et.al. | 2412.09818 | null |
| 2024-11-26 | Enhancing Code-Switching ASR Leveraging Non-Peaky CTC Loss and Deep Language Posterior Injection | Tzu-Ting Yang et.al. | 2412.08651 | null |
| 2024-12-11 | Bilevel Joint Unsupervised and Supervised Training for Automatic Speech Recognition | Xiaodong Cui et.al. | 2412.08548 | null |
| 2024-12-10 | Style-agnostic evaluation of ASR using multiple reference transcripts | Quinten McNamara et.al. | 2412.07937 | null |
| 2024-12-09 | Effective Text Adaptation for LLM-based ASR through Soft Prompt Fine-Tuning | Yingyi Ma et.al. | 2412.06967 | null |
| 2024-12-09 | Not All Errors Are Equal: Investigation of Speech Recognition Errors in Alzheimer's Disease Detection | Jiawen Kang et.al. | 2412.06332 | null |
| 2024-12-09 | Leveraging Prompt Learning and Pause Encoding for Alzheimer's Disease Detection | Yin-Long Liu et.al. | 2412.06259 | null |
| 2024-12-07 | SQ-Whisper: Speaker-Querying based Whisper Model for Target-Speaker ASR | Pengcheng Guo et.al. | 2412.05589 | link |
| 2024-12-06 | Adaptive Dropout for Pruning Conformers | Yotaro Kubo et.al. | 2412.04836 | null |
| 2024-12-05 | Comprehensive Audio Query Handling System with Integrated Expert Models and Contextual Understanding | Vakada Naveen et.al. | 2412.03980 | null |
| 2024-12-05 | Speech Recognition-based Feature Extraction for Enhanced Automatic Severity Classification in Dysarthric Speech | Yerin Choi et.al. | 2412.03784 | null |
| 2024-12-04 | ASR-EC Benchmark: Evaluating Large Language Models on Chinese ASR Error Correction | Victor Junqiu Wei et.al. | 2412.03075 | null |
| 2024-12-03 | GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot | Aohan Zeng et.al. | 2412.02612 | link |
| 2024-12-01 | Late fusion ensembles for speech recognition on diverse input audio representations | Marin Jezidžić et.al. | 2412.01861 | null |
| 2024-12-01 | Automating Feedback Analysis in Surgical Training: Detection, Categorization, and Assessment | Firdavs Nasriddinov et.al. | 2412.00760 | link |
| 2024-12-04 | A Comparative Study of LLM-based ASR and Whisper in Low Resource and Code Switching Scenario | Zheshu Song et.al. | 2412.00721 | null |
| 2024-11-30 | Sample adaptive data augmentation with progressive scheduling | Hongxuan Lu et.al. | 2412.00415 | null |
| 2024-11-30 | Empowering the Deaf and Hard of Hearing Community: Enhancing Video Captions Using Large Language Models | Nadeen Fathallah et.al. | 2412.00342 | null |
| 2024-11-24 | High-precision medical speech recognition through synthetic data and semantic correction: UNITED-MEDASR | Sourav Banerjee et.al. | 2412.00055 | null |
| 2024-11-29 | Memristive Nanowire Network for Energy Efficient Audio Classification: Pre-Processing-Free Reservoir Computing with Reduced Latency | Akshaya Rajesh et.al. | 2411.19611 | null |
| 2024-11-28 | ArEEG_Words: Dataset for Envisioned Speech Recognition using EEG for Arabic Words | Hazem Darwish et.al. | 2411.18888 | null |
| 2024-11-20 | Towards Advanced Speech Signal Processing: A Statistical Perspective on Convolution-Based Architectures and its Applications | Nirmal Joshua Kapu et.al. | 2411.18636 | null |
| 2024-11-27 | EEG-Based Analysis of Brain Responses in Multi-Modal Human-Robot Interaction: Modulating Engagement | Suzanne Oliver et.al. | 2411.18587 | null |
| 2024-11-27 | AMPS: ASR with Multimodal Paraphrase Supervision | Amruta Parulekar et.al. | 2411.18368 | null |
| 2024-11-27 | Continual Learning in Machine Speech Chain Using Gradient Episodic Memory | Geoffrey Tyndall et.al. | 2411.18320 | null |
| 2024-11-27 | Aligning Pre-trained Models for Spoken Language Translation | Šimon Sedláček et.al. | 2411.18294 | null |
| 2024-11-27 | Efficient Nonlinear Function Approximation in Analog Resistive Crossbars for Recurrent Neural Networks | Junyi Yang et.al. | 2411.18271 | null |
| 2025-01-05 | How to Learn a New Language? An Efficient Solution for Self-Supervised Learning Models Unseen Languages Adaption in Low-Resource Scenario | Shih-Heng Wang et.al. | 2411.18217 | null |
| 2025-01-15 | MSA-ASR: Efficient Multilingual Speaker Attribution with frozen ASR Models | Thai-Binh Nguyen et.al. | 2411.18152 | null |
| 2024-11-27 | SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation | Wenyi Yu et.al. | 2411.18138 | null |
| 2024-11-27 | Fusion of Discrete Representations and Self-Augmented Representations for Multilingual Automatic Speech Recognition | Shih-heng Wang et.al. | 2411.18107 | null |
| 2024-11-26 | Disentangled-Transformer: An Explainable End-to-End Automatic Speech Recognition Model with Speech Content-Context Separation | Pu Wang et.al. | 2411.17846 | null |
| 2024-12-02 | Scaling Speech-Text Pre-training with Synthetic Interleaved Data | Aohan Zeng et.al. | 2411.17607 | null |
| 2024-11-26 | Towards Maximum Likelihood Training for Transducer-based Streaming Speech Recognition | Hyeonseung Lee et.al. | 2411.17537 | null |
| 2024-11-26 | Comparative Analysis of ASR Methods for Speech Deepfake Detection | Davide Salvi et.al. | 2411.17349 | null |
| 2024-11-26 | k2SSL: A Faster and Better Framework for Self-Supervised Speech Representation Learning | Yifan Yang et.al. | 2411.17100 | link |
| 2024-11-22 | TSkips: Efficiency Through Explicit Temporal Delay Connections in Spiking Neural Networks | Prajna G. Malettira et.al. | 2411.16711 | null |
| 2024-11-22 | Transforming NLU with Babylon: A Case Study in Development of Real-time, Edge-Efficient, Multi-Intent Translation System for Automated Drive-Thru Ordering | Mostafa Varzaneh et.al. | 2411.15372 | null |
| 2024-11-20 | From Statistical Methods to Pre-Trained Models; A Survey on Automatic Speech Recognition for Resource Scarce Urdu Language | Muhammad Sharif et.al. | 2411.14493 | null |
| 2024-11-26 | Tiny-Align: Bridging Automatic Speech Recognition and Large Language Model on the Edge | Ruiyang Qin et.al. | 2411.13766 | null |
| 2024-11-18 | A Novel Speech Analysis and Correction Tool for Arabic-Speaking Children | Lamia Berriche et.al. | 2411.13592 | null |
| 2024-11-26 | WavChat: A Survey of Spoken Dialogue Models | Shengpeng Ji et.al. | 2411.13577 | link |
| 2024-11-20 | CAFE A Novel Code switching Dataset for Algerian Dialect French and English | Houssam Eddine-Othman Lachemat et.al. | 2411.13424 | null |
| 2024-11-20 | Hard-Synth: Synthesizing Diverse Hard Samples for ASR using Zero-Shot TTS and LLM | Jiawei Yu et.al. | 2411.13159 | null |
| 2024-11-19 | Whisper Finetuning on Nepali Language | Sanjay Rijal et.al. | 2411.12587 | null |
| 2024-11-27 | Inter-linguistic Phonetic Composition (IPC): A Theoretical and Computational Approach to Enhance Second Language Pronunciation | Jisang Park et.al. | 2411.10927 | null |
| 2024-11-16 | BanglaDialecto: An End-to-End AI-Powered Regional Speech Standardization | Md. Nazmus Sadat Samin et.al. | 2411.10879 | link |
| 2024-12-08 | Interactive Cycle Model -- The Linkage Combination among Automatic Speech Recognition, Large Language Models and Smart Glasses | Libo Wang et.al. | 2411.10362 | link |
| 2024-11-15 | Systolic Arrays and Structured Pruning Co-design for Efficient Transformers in Edge Systems | Pedro Palacios et.al. | 2411.10285 | null |
| 2024-11-15 | DiMoDif: Discourse Modality-information Differentiation for Audio-visual Deepfake Detection and Localization | Christos Koutlis et.al. | 2411.10193 | null |
| 2024-11-15 | XLSR-Mamba: A Dual-Column Bidirectional State Space Model for Spoofing Attack Detection | Yang Xiao et.al. | 2411.10027 | null |
| 2024-11-14 | Everyone deserves their voice to be heard: Analyzing Predictive Gender Bias in ASR Models Applied to Dutch Speech Data | Rik Raes et.al. | 2411.09431 | null |
| 2024-11-14 | Transferable Adversarial Attacks against ASR | Xiaoxue Gao et.al. | 2411.09220 | null |
| 2024-10-28 | Multilingual Standalone Trustworthy Voice-Based Social Network for Disaster Situations | Majid Behravan et.al. | 2411.08889 | null |
| 2024-11-11 | Mamba-based Decoder-Only Approach with Bidirectional Speech Modeling for Speech Recognition | Yoshiki Masuyama et.al. | 2411.06968 | link |
| 2024-12-28 | DCF-DS: Deep Cascade Fusion of Diarization and Separation for Speech Recognition under Realistic Single-Channel Conditions | Shu-Tong Niu et.al. | 2411.06667 | null |
| 2024-11-10 | CTC-Assisted LLM-Based Contextual ASR | Guanrou Yang et.al. | 2411.06437 | link |
| 2024-12-04 | Dialectal Coverage And Generalization in Arabic Speech Recognition | Amirbek Djanibekov et.al. | 2411.05872 | link |
| 2024-11-07 | Sentiment Analysis of Spanish Political Party Tweets Using Pre-trained Language Models | Chuqiao Song et.al. | 2411.04862 | null |
| 2024-11-07 | Multistage Fine-tuning Strategies for Automatic Speech Recognition in Low-resource Languages | Leena G Pillai et.al. | 2411.04573 | null |
| 2024-11-04 | Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs | Alexandros Haliassos et.al. | 2411.02256 | link |
| 2024-11-03 | SPES: Spectrogram Perturbation for Explainable Speech-to-Text Generation | Dennis Fucci et.al. | 2411.01710 | null |
| 2024-11-08 | Enhancing AAC Software for Dysarthric Speakers in e-Health Settings: An Evaluation Using TORGO | Macarious Hui et.al. | 2411.00980 | null |
| 2024-11-04 | Optimizing Contextual Speech Recognition Using Vector Quantization for Efficient Retrieval | Nikolaos Flemotomos et.al. | 2411.00664 | null |
| 2024-10-31 | IO Transformer: Evaluating SwinV2-Based Reward Models for Computer Vision | Maxwell Meyer et.al. | 2411.00252 | null |
| 2024-10-31 | Speech is More Than Words: Do Speech-to-Text Translation Systems Leverage Prosody? | Ioannis Tsiamas et.al. | 2410.24019 | null |
| 2024-10-30 | Augmenting Polish Automatic Speech Recognition System With Synthetic Data | Łukasz Bondaruk et.al. | 2410.22903 | null |
| 2024-10-30 | Run-Time Adaptation of Neural Beamforming for Robust Speech Dereverberation and Denoising | Yoto Fujita et.al. | 2410.22805 | null |
| 2024-10-29 | Joint Beamforming and Speaker-Attributed ASR for Real Distant-Microphone Meeting Transcription | Can Cui et.al. | 2410.21849 | null |
| 2024-10-28 | Asynchronous Tool Usage for Real-Time Agents | Antonio A. Ginart et.al. | 2410.21620 | null |
| 2024-10-27 | Using Confidence Scores to Improve Eyes-free Detection of Speech Recognition Errors | Sadia Nowrin et.al. | 2410.20564 | null |
| 2024-10-27 | Improving Speech-based Emotion Recognition with Contextual Utterance Analysis and LLMs | Enshi Zhang et.al. | 2410.20334 | null |
| 2024-11-04 | emg2qwerty: A Large Dataset with Baselines for Touch Typing using Surface Electromyography | Viswanath Sivakumar et.al. | 2410.20081 | link |
| 2024-10-25 | A Survey on Speech Large Language Models | Jing Peng et.al. | 2410.18908 | null |
| 2024-10-24 | We Augmented Whisper With kNN and You Won't Believe What Came Next | Maya K. Nachesa et.al. | 2410.18850 | null |
| 2024-10-24 | STTATTS: Unified Speech-To-Text And Text-To-Speech Model | Hawau Olamide Toyin et.al. | 2410.18607 | link |
| 2024-10-24 | Evaluating and Improving Automatic Speech Recognition Systems for Korean Meteorological Experts | ChaeHun Park et.al. | 2410.18444 | null |
| 2024-10-24 | Contextual Biasing to Improve Domain-specific Custom Vocabulary Audio Transcription without Explicit Fine-Tuning of Whisper Model | Vishakha Lall et.al. | 2410.18363 | null |
| 2024-10-23 | ELAICHI: Enhancing Low-resource TTS by Addressing Infrequent and Low-frequency Character Bigrams | Srija Anand et.al. | 2410.17901 | null |
| 2024-10-23 | VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning | Yifan Peng et.al. | 2410.17485 | null |
| 2024-10-22 | mmWave-Whisper: Phone Call Eavesdropping and Transcription Using Millimeter-Wave Radar | Suryoday Basak et.al. | 2410.17457 | null |
| 2024-10-22 | Improving Automatic Speech Recognition with Decoder-Centric Regularisation in Encoder-Decoder Models | Alexander Polok et.al. | 2410.17437 | null |
| 2024-12-11 | VoiceBench: Benchmarking LLM-Based Voice Assistants | Yiming Chen et.al. | 2410.17196 | link |
| 2024-10-22 | Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap | Guanrou Yang et.al. | 2410.16726 | null |
| 2024-10-22 | DENOASR: Debiasing ASRs through Selective Denoising | Anand Kumar Rai et.al. | 2410.16712 | null |
| 2024-10-21 | AlignVSR: Audio-Visual Cross-Modal Alignment for Visual Speech Recognition | Zehua Liu et.al. | 2410.16438 | link |
| 2024-10-19 | End-to-End Transformer-based Automatic Speech Recognition for Northern Kurdish: A Pioneering Approach | Abdulhady Abas Abdullah et.al. | 2410.16330 | null |
| 2024-10-21 | Acoustic Model Optimization over Multiple Data Sources: Merging and Valuation | Victor Junqiu Wei et.al. | 2410.15620 | null |
| 2024-10-21 | Interventional Speech Noise Injection for ASR Generalizable Spoken Language Understanding | Yeonjoon Jung et.al. | 2410.15609 | null |
| 2024-10-22 | Moonshine: Speech Recognition for Live Transcription and Voice Commands | Nat Jeffries et.al. | 2410.15608 | link |
| 2024-10-20 | Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant | Alan Dao et.al. | 2410.15316 | link |
| 2024-10-19 | Enhancing Multimodal Sentiment Analysis for Missing Modality through Self-Distillation and Unified Modality Cross-Attention | Yuzhe Weng et.al. | 2410.15029 | link |
| 2024-10-18 | AC-Mix: Self-Supervised Adaptation for Low-Resource Automatic Speech Recognition using Agnostic Contrastive Mixup | Carlos Carvalho et.al. | 2410.14910 | null |
| 2024-10-09 | A two-stage transliteration approach to improve performance of a multilingual ASR | Rohit Kumar et.al. | 2410.14709 | null |
| 2024-10-17 | Parameter-efficient Adaptation of Multilingual Multimodal Models for Low-resource ASR | Abhishek Gupta et.al. | 2410.13445 | null |
| 2024-10-17 | Computational Approaches to Arabic-English Code-Switching | Caroline Sabty et.al. | 2410.13318 | null |
| 2024-10-17 | Roadmap towards Superhuman Speech Understanding using Large Language Models | Fan Bu et.al. | 2410.13268 | null |
| 2024-10-17 | Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation | Sreyan Ghosh et.al. | 2410.13198 | null |
| 2024-10-17 | EH-MAM: Easy-to-Hard Masked Acoustic Modeling for Self-Supervised Speech Representation Learning | Ashish Seth et.al. | 2410.13179 | link |
| 2024-10-17 | Deep Learning-based Software Engineering: Progress, Challenges, and Opportunities | Xiangping Chen et.al. | 2410.13110 | null |
| 2024-10-07 | Automatic Screening for Children with Speech Disorder using Automatic Speech Recognition: Opportunities and Challenges | Dancheng Liu et.al. | 2410.11865 | null |
| 2024-10-15 | A Framework for Adapting Human-Robot Interaction to Diverse User Groups | Theresa Pekarek Rosin et.al. | 2410.11377 | link |
| 2024-10-15 | Investigation of Speaker Representation for Target-Speaker Speech Processing | Takanori Ashihara et.al. | 2410.11243 | null |
| 2024-10-14 | Character-aware audio-visual subtitling in context | Jaesung Huh et.al. | 2410.11068 | null |
| 2024-10-14 | In-Materia Speech Recognition | Mohamadreza Zolfagharinejad et.al. | 2410.10434 | null |
| 2024-10-13 | State of NLP in Kenya: A Survey | Cynthia Jayne Amol et.al. | 2410.09948 | null |
| 2024-10-12 | SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs | Wenxi Chen et.al. | 2410.09503 | link |
| 2024-10-12 | Automatic Speech Recognition with BERT and CTC Transformers: A Review | Noussaiba Djeffal et.al. | 2410.09456 | null |
| 2024-10-11 | UniGlyph: A Seven-Segment Script for Universal Language Representation | G. V. Bency Sherin et.al. | 2410.08974 | null |
| 2024-10-14 | Enhancing Indonesian Automatic Speech Recognition: Evaluating Multilingual Models with Diverse Speech Variabilities | Aulia Adila et.al. | 2410.08828 | null |
| 2024-10-10 | Full-Rank No More: Low-Rank Weight Training for Modern Speech Recognition Models | Adriana Fernandez-Lopez et.al. | 2410.07771 | null |
| 2024-10-18 | Advocating Character Error Rate for Multilingual ASR Evaluation | Thennal D K et.al. | 2410.07400 | null |
| 2024-10-08 | The USTC-NERCSLIP Systems for the CHiME-8 MMCSG Challenge | Ya Jiang et.al. | 2410.05986 | null |
| 2024-10-07 | Incorporating Talker Identity Aids With Improving Speech Recognition in Adversarial Environments | Sagarika Alavilli et.al. | 2410.05423 | null |
| 2024-10-05 | The OCON model: an old but gold solution for distributable supervised classification | Stefano Giacomelli et.al. | 2410.05320 | link |
| 2024-10-07 | Enhancing Job Interview Preparation Through Immersive Experiences Using Photorealistic, AI-powered Metahuman Avatars | Navid Ashrafi et.al. | 2410.05131 | null |
| 2024-10-13 | CR-CTC: Consistency regularization on CTC for improved speech recognition | Zengwei Yao et.al. | 2410.05101 | link |
| 2024-10-06 | Punctuation Prediction for Polish Texts using Transformers | Jakub Pokrywka et.al. | 2410.04621 | null |
| 2024-10-06 | Casablanca: Data and Models for Multidialectal Arabic Speech Recognition | Bashar Talafha et.al. | 2410.04527 | null |
| 2024-10-05 | Efficient and Robust Long-Form Speech Recognition with Hybrid H3-Conformer | Tomoki Honda et.al. | 2410.04159 | link |
| 2024-10-05 | The OCON model: an old but green solution for distributable supervised classification for acoustic monitoring in smart cities | Stefano Giacomelli et.al. | 2410.04098 | null |
| 2024-10-05 | Enhancement of Dysarthric Speech Reconstruction by Contrastive Learning | Keshvari Fatemeh et.al. | 2410.04092 | null |
| 2024-10-04 | Reverb: Open-Source ASR and Diarization from Rev | Nishchal Bhandari et.al. | 2410.03930 | null |
| 2024-10-13 | Self-Powered LLM Modality Expansion for Large Speech-Text Models | Tengfei Yu et.al. | 2410.03798 | link |
| 2024-10-02 | SeeSay: An Assistive Device for the Visually Impaired Using Retrieval Augmented Generation | Melody Yu et.al. | 2410.03771 | null |
| 2024-10-02 | Efficient Streaming LLM for Speech Recognition | Junteng Jia et.al. | 2410.03752 | null |
| 2024-10-01 | Recent Advances in Speech Language Models: A Survey | Wenqian Cui et.al. | 2410.03751 | null |
| 2024-10-04 | Multi-Dialect Vietnamese: Task, Dataset, Baseline Models and Challenges | Nguyen Van Dinh et.al. | 2410.03458 | link |
| 2024-10-04 | Team MTS @ AutoMin 2021: An Overview of Existing Summarization Approaches and Comparison to Unsupervised Summarization Techniques | Olga Iakovenko et.al. | 2410.03412 | null |
| 2024-10-03 | Three-in-One: Fast and Accurate Transducer for Hybrid-Autoregressive ASR | Hainan Xu et.al. | 2410.02597 | null |
| 2024-10-04 | Convolutional Variational Autoencoders for Spectrogram Compression in Automatic Speech Recognition | Olga Iakovenko et.al. | 2410.02560 | null |
| 2024-10-03 | Algorithms For Automatic Accentuation And Transcription Of Russian Texts In Speech Recognition Systems | Olga Iakovenko et.al. | 2410.02538 | null |
| 2024-10-03 | A Pilot Study of Applying Sequence-to-Sequence Voice Conversion to Evaluate the Intelligibility of L2 Speech Using a Native Speaker's Shadowings | Haopeng Geng et.al. | 2410.02239 | null |
| 2024-09-27 | A GEN AI Framework for Medical Note Generation | Hui Yi Leong et.al. | 2410.01841 | null |
| 2024-10-02 | Spoken Grammar Assessment Using LLM | Sunil Kumar Kopparapu et.al. | 2410.01579 | null |
| 2024-10-01 | MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages | Marco Gaido et.al. | 2410.01036 | link |
| 2024-10-01 | Automatic Speech Recognition for the Ika Language | Uchenna Nzenwata et.al. | 2410.00940 | null |
| 2024-10-04 | VHASR: A Multimodal Speech Recognition System With Vision Hotwords | Jiliang Hu et.al. | 2410.00822 | link |
| 2024-10-01 | End-to-End Speech Recognition with Pre-trained Masked Language Model | Yosuke Higuchi et.al. | 2410.00528 | link |
| 2024-09-30 | Mamba for Streaming ASR Combined with Unimodal Aggregation | Ying Fang et.al. | 2410.00070 | link |
| 2024-10-02 | Moshi: a speech-text foundation model for real-time dialogue | Alexandre Défossez et.al. | 2410.00037 | link |
| 2024-09-30 | Boosting Hybrid Autoregressive Transducer-based ASR with Internal Acoustic Model Training and Dual Blank Thresholding | Takafumi Moriya et.al. | 2409.20313 | null |
| 2024-09-30 | Alignment-Free Training for Transducer-based Multi-Talker ASR | Takafumi Moriya et.al. | 2409.20301 | null |
| 2024-09-30 | AfriHuBERT: A self-supervised speech representation model for African languages | Jesujoba O. Alabi et.al. | 2409.20201 | null |
| 2024-09-30 | Predictive Speech Recognition and End-of-Utterance Detection Towards Spoken Dialog Systems | Oswald Zink et.al. | 2409.19990 | null |
| 2024-09-30 | HDMoLE: Mixture of LoRA Experts with Hierarchical Routing and Dynamic Thresholds for Fine-Tuning LLM-based ASR Models | Bingshen Mu et.al. | 2409.19878 | null |
| 2024-09-29 | Fine-Tuning Automatic Speech Recognition for People with Parkinson's: An Effective Strategy for Enhancing Speech Technology Accessibility | Xiuwen Zheng et.al. | 2409.19818 | null |
| 2024-09-29 | Efficient Long-Form Speech Recognition for General Speech In-Context Learning | Hao Yen et.al. | 2409.19757 | null |
| 2024-09-29 | Quantitative Analysis of Audio-Visual Tasks: An Information-Theoretic Perspective | Chen Chen et.al. | 2409.19575 | null |
| 2024-09-29 | CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought | Yexing Du et.al. | 2409.19510 | link |
| 2024-09-28 | Advanced Clustering Techniques for Speech Signal Enhancement: A Review and Metanalysis of Fuzzy C-Means, K-Means, and Kernel Fuzzy C-Means Methods | Abdulhady Abas Abdullah et.al. | 2409.19448 | null |
| 2024-09-27 | Speech-Mamba: Long-Context Speech Recognition with Selective State Spaces Models | Xiaoxue Gao et.al. | 2409.18654 | null |
| 2024-09-30 | ChildMandarin: A Comprehensive Mandarin Speech Dataset for Young Children Aged 3-5 | Jiaming Zhou et.al. | 2409.18584 | null |
| 2024-09-27 | Improving Multilingual ASR in the Wild Using Simple N-best Re-ranking | Brian Yan et.al. | 2409.18428 | link |
| 2024-09-26 | Unveiling the Role of Pretraining in Direct Speech Translation | Belen Alastruey et.al. | 2409.18044 | null |
| 2024-09-26 | Are Transformers in Pre-trained LM A Good ASR Encoder? An Empirical Study | Keyu An et.al. | 2409.17750 | null |
| 2024-09-26 | Paraformer-v2: An improved non-autoregressive transformer for noise-robust speech recognition | Keyu An et.al. | 2409.17746 | null |
| 2024-09-26 | Deep CLAS: Deep Contextual Listen, Attend and Spell | Shifu Xiong et.al. | 2409.17603 | null |
| 2024-11-08 | How to Connect Speech Foundation Models and Large Language Models? What Matters and What Does Not | Francesco Verdini et.al. | 2409.17044 | null |
| 2024-09-25 | MT2KD: Towards A General-Purpose Encoder for Speech, Speaker, and Audio Events | Xiaoyu Yang et.al. | 2409.17010 | null |
| 2024-09-25 | Weighted Cross-entropy for Low-Resource Languages in Multilingual Speech Recognition | Andrés Piñeiro-Martín et.al. | 2409.16954 | link |
| 2024-09-27 | Semi-Supervised Cognitive State Classification from Speech with Multi-View Pseudo-Labeling | Yuanchao Li et.al. | 2409.16937 | link |
| 2024-09-25 | Speech Recognition Rescoring with Large Speech-Text Foundation Models | Prashanth Gurunath Shivakumar et.al. | 2409.16654 | null |
| 2024-09-24 | Spelling Correction through Rewriting of Non-Autoregressive ASR Lattices | Leonid Velikovich et.al. | 2409.16469 | null |
| 2024-09-24 | Revisiting Acoustic Features for Robust ASR | Muhammad A. Shah et.al. | 2409.16399 | null |
| 2024-09-10 | How Redundant Is the Transformer Stack in Speech Representation Models? | Teresa Dorszewski et.al. | 2409.16302 | null |
| 2024-09-24 | Bridging Speech and Text: Enhancing ASR with Pinyin-to-Character Pre-training in LLMs | Yang Yuhang et.al. | 2409.16005 | null |
| 2024-10-31 | Boosting Code-Switching ASR with Mixture of Experts Enhanced Speech-Conditioned LLM | Fengrun Zhang et.al. | 2409.15905 | null |
| 2024-09-24 | WeSep: A Scalable and Flexible Toolkit Towards Generalizable Target Speaker Extraction | Shuai Wang et.al. | 2409.15799 | link |
| 2024-09-24 | Hypothesis Clustering and Merging: Novel MultiTalker Speech Recognition with Speaker Tokens | Yosuke Kashiwagi et.al. | 2409.15732 | null |
| 2024-09-23 | Revise, Reason, and Recognize: LLM-Based Emotion Recognition via Emotion-Specific Prompts and ASR Error Correction | Yuanchao Li et.al. | 2409.15551 | link |
| 2024-09-17 | A Joint Spectro-Temporal Relational Thinking Based Acoustic Modeling Framework | Zheng Nan et.al. | 2409.15357 | null |
| 2024-09-11 | Contextualization of ASR with LLM using phonetic retrieval-based augmentation | Zhihong Lei et.al. | 2409.15353 | null |
| 2024-09-10 | A Large Dataset of Spontaneous Speech with the Accent Spoken in São Paulo for Automatic Speech Recognition Evaluation | Rodrigo Lima et.al. | 2409.15350 | null |
| 2024-09-13 | CPT-Boosted Wav2vec2.0: Towards Noise Robust Speech Recognition for Classroom Environments | Ahmed Adel Attia et.al. | 2409.14494 | null |
| 2024-09-21 | Strong Alone, Stronger Together: Synergizing Modality-Binding Foundation Models with Optimal Transport for Non-Verbal Emotion Recognition | Orchid Chetia Phukan et.al. | 2409.14221 | null |
| 2024-09-21 | MultiMed: Multilingual Medical Speech Recognition via Attention Encoder Decoder | Khai Le-Duc et.al. | 2409.14074 | link |
| 2024-09-20 | Time and Tokens: Benchmarking End-to-End Speech Dysfluency Detection | Xuanru Zhou et.al. | 2409.13582 | null |
| 2024-09-20 | LM-assisted keyword biasing with Aho-Corasick algorithm for Transducer-based ASR | Iuliia Thorbecke et.al. | 2409.13514 | null |
| 2024-10-07 | Fast Streaming Transducer ASR Prototyping via Knowledge Distillation with Whisper | Iuliia Thorbecke et.al. | 2409.13499 | null |
| 2024-09-20 | A Multimodal Dense Retrieval Approach for Speech-Based Open-Domain Question Answering | Georgios Sidiropoulos et.al. | 2409.13483 | null |
| 2024-09-20 | Large Language Model Should Understand Pinyin for Chinese ASR Error Correction | Yuang Li et.al. | 2409.13262 | null |
| 2024-09-19 | Personalized Speech Recognition for Children with Test-Time Adaptation | Zhonghao Shi et.al. | 2409.13095 | null |
| 2024-09-19 | Enhancing Synthetic Training Data for Speech Commands: From ASR-Based Filtering to Domain Adaptation in SSL Latent Space | Sebastião Quintas et.al. | 2409.12745 | null |
| 2024-09-19 | Hidden in Plain Sound: Environmental Backdoor Poisoning Attacks on Whisper, and Mitigations | Jonatan Bartolini et.al. | 2409.12553 | null |
| 2024-09-19 | Disentangling Speakers in Multi-Talker Speech Recognition with Speaker-Aware CTC | Jiawen Kang et.al. | 2409.12388 | null |
| 2024-09-19 | Channel-Aware Domain-Adaptive Generative Adversarial Network for Robust Speech Recognition | Chien-Chun Wang et.al. | 2409.12386 | link |
| 2024-09-19 | Robust Audiovisual Speech Recognition Models with Mixture-of-Experts | Yihan Wu et.al. | 2409.12370 | null |
| 2024-09-18 | META-CAT: Speaker-Informed Speech Embeddings via Meta Information Concatenation for Multi-talker ASR | Jinhan Wang et.al. | 2409.12352 | null |
| 2024-09-18 | Large Language Models Are Strong Audio-Visual Speech Recognition Learners | Umberto Cappellazzo et.al. | 2409.12319 | null |
| 2024-09-19 | WeHelp: A Shared Autonomy System for Wheelchair Users | Abulikemu Abuduweili et.al. | 2409.12159 | link |
| 2024-09-18 | ASR Benchmarking: Need for a More Representative Conversational Dataset | Gaurav Maheshwari et.al. | 2409.12042 | link |
| 2024-09-18 | M2R-Whisper: Multi-stage and Multi-scale Retrieval Augmentation for Enhancing Whisper | Jiaming Zhou et.al. | 2409.11889 | null |
| 2024-09-19 | Simulating Native Speaker Shadowing for Nonnative Speech Assessment with Latent Speech Representations | Haopeng Geng et.al. | 2409.11742 | null |
| 2024-09-17 | Chain-of-Thought Prompting for Speech Translation | Ke Hu et.al. | 2409.11538 | null |
| 2024-09-17 | M-BEST-RQ: A Multi-Channel Speech Foundation Model for Smart Glasses | Yufeng Yang et.al. | 2409.11494 | null |
| 2024-09-17 | Bio-Inspired Mamba: Temporal Locality and Bioplausible Learning in Selective State Space Models | Jiahao Qin et.al. | 2409.11263 | null |
| 2024-09-17 | WER We Stand: Benchmarking Urdu ASR Models | Samee Arif et.al. | 2409.11252 | null |
| 2024-09-17 | Ideal-LLM: Integrating Dual Encoders and Language-Adapted LLM for Multilingual Speech-to-Text | Hongfei Xue et.al. | 2409.11214 | null |
| 2024-09-17 | Zero Shot Text to Speech Augmentation for Automatic Speech Recognition on Low-Resource Accented Speech Corpora | Francesco Nespoli et.al. | 2409.11107 | null |
| 2024-09-17 | Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models | Potsawee Manakul et.al. | 2409.10999 | null |
| 2024-09-17 | Speech Recognition for Analysis of Police Radio Communication | Tejes Srivastava et.al. | 2409.10858 | null |
| 2024-09-16 | An Efficient Self-Learning Framework For Interactive Spoken Dialog Systems | Hitesh Tulsiani et.al. | 2409.10515 | null |
| 2024-09-16 | Meta-Whisper: Speech-Based Meta-ICL for ASR on Low-Resource Languages | Ming-Hao Hsu et.al. | 2409.10429 | null |
| 2024-09-16 | Voice control interface for surgical robot assistants | Ana Davila et.al. | 2409.10225 | null |
| 2024-09-17 | Augmenting Automatic Speech Recognition Models with Disfluency Detection | Robin Amann et.al. | 2409.10177 | null |
| 2024-09-16 | Optimizing Dysarthria Wake-Up Word Spotting: An End-to-End Approach for SLT 2024 LRDWWS Challenge | Shuiyun Liu et.al. | 2409.10076 | null |
| 2024-09-16 | A Study on Zero-shot Non-intrusive Speech Assessment using Large Language Models | Ryandhimas E. Zezario et.al. | 2409.09914 | null |
| 2024-09-17 | Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition | Chao-Han Huck Yang et.al. | 2409.09785 | null |
| 2024-09-14 | ASR Error Correction using Large Language Models | Rao Ma et.al. | 2409.09554 | null |
| 2024-09-14 | M | Anna Wang et.al. | 2409.09284 | null |
| 2024-09-13 | Multi-modal Speech Transformer Decoders: When Do Multiple Modalities Improve Accuracy? | Yiwen Guan et.al. | 2409.09221 | null |
| 2024-09-13 | Learnings from curating a trustworthy, well-annotated, and useful dataset of disordered English speech | Pan-Pan Jiang et.al. | 2409.09190 | null |
| 2024-09-13 | Clean Label Attacks against SLU Systems | Henry Li Xinyuan et.al. | 2409.08985 | null |
| 2024-09-13 | Exploring the Impact of Data Quantity on ASR in Extremely Low-resource Languages | Yao-Fei Cheng et.al. | 2409.08872 | null |
| 2024-09-13 | Exploring SSL Discrete Tokens for Multilingual ASR | Mingyu Cui et.al. | 2409.08805 | null |
| 2024-09-13 | NEST-RQ: Next Token Prediction for Speech Self-Supervised Pre-Training | Minglun Han et.al. | 2409.08680 | null |
| 2024-09-13 | LA-RAG: Enhancing LLM-based ASR Accuracy with Retrieval-Augmented Generation | Shaojun Li et.al. | 2409.08597 | null |
| 2024-09-13 | Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions | Lingwei Meng et.al. | 2409.08596 | null |
| 2024-09-12 | Faster Speech-LLaMA Inference with Multi-token Prediction | Desh Raj et.al. | 2409.08148 | null |
| 2024-09-12 | WhisperNER: Unified Open Named Entity and Speech Recognition | Gil Ayache et.al. | 2409.08107 | null |
| 2024-10-06 | The Faetar Benchmark: Speech Recognition in a Very Under-Resourced Language | Michael Ong et.al. | 2409.08103 | null |
| 2024-09-12 | Auto-Landmark: Acoustic Landmark Dataset and Open-Source Toolkit for Landmark Extraction | Xiangyu Zhang et.al. | 2409.07969 | null |
| 2024-09-12 | Detecting and Defending Against Adversarial Attacks on Automatic Speech Recognition via Diffusion Models | Nikolai L. Kühne et.al. | 2409.07936 | link |
| 2024-09-12 | Full-text Error Correction for Chinese Speech Recognition with Large Language Model | Zhiyuan Tang et.al. | 2409.07790 | null |
| 2024-09-11 | Rethinking Mamba in Speech Processing by Self-Supervised Models | Xiangyu Zhang et.al. | 2409.07273 | null |
| 2024-09-11 | ManaTTS Persian: a recipe for creating TTS datasets for lower resource languages | Mahta Fetrat Qharabagh et.al. | 2409.07259 | null |
| 2024-09-11 | Enhancing CTC-Based Visual Speech Recognition | Hendrik Laux et.al. | 2409.07210 | null |
| 2024-09-11 | Linear Time Complexity Conformers with SummaryMixing for Streaming Speech Recognition | Titouan Parcollet et.al. | 2409.07165 | link |
| 2024-09-10 | An Effective Context-Balanced Adaptation Approach for Long-Tailed Speech Recognition | Yi-Cheng Wang et.al. | 2409.06468 | null |
| 2024-09-10 | Keyword-Aware ASR Error Augmentation for Robust Dialogue State Tracking | Jihyun Lee et.al. | 2409.06263 | null |
| 2024-09-10 | Advancing Topic Segmentation of Broadcasted Speech with Multilingual Semantic Embeddings | Sakshi Deo Shukla et.al. | 2409.06222 | link |
| 2024-09-09 | Retrieval Augmented Correction of Named Entity Speech Recognition Errors | Ernest Pusateri et.al. | 2409.06062 | null |
| 2024-09-09 | Consensus-based Distributed Quantum Kernel Learning for Speech Recognition | Kuan-Cheng Chen et.al. | 2409.05770 | null |
| 2024-09-09 | A Toolkit for Joint Speaker Diarization and Identification with Application to Speaker-Attributed ASR | Giovanni Morrone et.al. | 2409.05750 | null |
| 2024-09-11 | Evaluation of real-time transcriptions using end-to-end ASR models | Carlos Arriaga et.al. | 2409.05674 | null |
| 2024-09-09 | Longer is (Not Necessarily) Stronger: Punctuated Long-Sequence Training for Enhanced Speech Recognition and Translation | Nithin Rao Koluguri et.al. | 2409.05601 | null |
| 2024-09-09 | An investigation of modularity for noise robustness in conformer-based ASR | Louise Coppieters de Gibson et.al. | 2409.05589 | null |
| 2025-08-27 | Leveraging Content and Acoustic Representations for Speech Emotion Recognition | Soumya Dutta et.al. | 2409.05566 | null |
| 2024-09-09 | NTT Multi-Speaker ASR System for the DASR Task of CHiME-8 Challenge | Naoyuki Kamo et.al. | 2409.05554 | null |
| 2024-09-09 | Findings of the 2024 Mandarin Stuttering Event Detection and Automatic Speech Recognition Challenge | Hongfei Xue et.al. | 2409.05430 | null |
| 2024-09-08 | Exploring WavLM Back-ends for Speech Spoofing and Deepfake Detection | Theophile Stourbe et.al. | 2409.05032 | null |
| 2024-09-04 | Probing self-attention in self-supervised speech models for cross-linguistic differences | Sai Gopinath et.al. | 2409.03115 | null |
| 2024-09-04 | Quantification of stylistic differences in human- and ASR-produced transcripts of African American English | Annika Heuser et.al. | 2409.03059 | null |
| 2024-09-04 | Efficient Extraction of Noise-Robust Discrete Units from Self-Supervised Speech Models | Jakob Poncelet et.al. | 2409.02565 | null |
| 2024-09-04 | Parameter estimation of hidden Markov models: comparison of EM and quasi-Newton methods with a new hybrid algorithm | Sidonie Foulon et.al. | 2409.02477 | null |
| 2024-09-04 | What is lost in Normalization? Exploring Pitfalls in Multilingual ASR Model Evaluations | Kavya Manohar et.al. | 2409.02449 | null |
| 2024-09-05 | Temporal Order Preserved Optimal Transport-based Cross-modal Knowledge Transfer Learning for ASR | Xugang Lu et.al. | 2409.02239 | null |
| 2024-08-19 | Toward Large-scale Spiking Neural Networks: A Comprehensive Survey and Future Directions | Yangfan Hu et.al. | 2409.02111 | null |
| 2024-09-05 | Enhancing Code-Switching Speech Recognition with LID-Based Collaborative Mixture of Experts Model | Hukai Huang et.al. | 2409.02050 | null |
| 2024-09-03 | The USTC-NERCSLIP Systems for the CHiME-8 NOTSOFAR-1 Challenge | Shutong Niu et.al. | 2409.02041 | null |
| 2024-09-03 | Reassessing Noise Augmentation Methods in the Context of Adversarial Speech | Karla Pizzi et.al. | 2409.01813 | null |
| 2024-09-24 | VoxHakka: A Dialectally Diverse Multi-speaker Text-to-Speech System for Taiwanese Hakka | Li-Wei Chen et.al. | 2409.01548 | null |
| 2024-09-02 | Resource-Efficient Adaptation of Speech Foundation Models for Multi-Speaker ASR | Weiqing Wang et.al. | 2409.01438 | null |
| 2024-09-23 | Refined Statistical Bounds for Classification Error Mismatches with Constrained Bayes Error | Zijian Yang et.al. | 2409.01309 | null |
| 2024-09-02 | A Framework for Synthetic Audio Conversations Generation using Large Language Models | Kaung Myat Kyaw et.al. | 2409.00946 | null |
| 2024-09-11 | Serialized Speech Information Guidance with Overlapped Encoding Separation for Multi-Speaker Automatic Speech Recognition | Hao Shi et.al. | 2409.00815 | null |
| 2024-09-01 | Comparing Discrete and Continuous Space LLMs for Speech Recognition | Yaoxun Xu et.al. | 2409.00800 | null |
| 2024-09-11 | DCIM-AVSR: Efficient Audio-Visual Speech Recognition via Dual Conformer Interaction Module | Xinyu Wang et.al. | 2409.00481 | null |
| 2024-08-31 | Progressive Residual Extraction based Pre-training for Speech Representation Learning | Tianrui Wang et.al. | 2409.00387 | null |
| 2024-09-08 | ProGRes: Prompted Generative Rescoring on ASR n-Best | Ada Defne Tur et.al. | 2409.00217 | link |
| 2024-08-30 | Developing an End-to-End Framework for Predicting the Social Communication Severity Scores of Children with Autism Spectrum Disorder | Jihyun Mun et.al. | 2409.00158 | null |
| 2024-08-30 | Speaker Tagging Correction With Non-Autoregressive Language Models | Grigor Kirakosyan et.al. | 2409.00151 | null |
| 2024-08-30 | Advancing Multi-talker ASR Performance with Large Language Models | Mohan Shi et.al. | 2408.17431 | null |
| 2024-08-30 | Generative Modeling Perspective for Control and Reasoning in Robotics | Takuma Yoneda et.al. | 2408.17041 | null |
| 2024-08-29 | CrisperWhisper: Accurate Timestamps on Verbatim Speech Transcriptions | Laurin Wagner et.al. | 2408.16589 | link |
| 2024-08-29 | Human-Inspired Audio-Visual Speech Recognition: Spike Activity, Cueing Interaction and Causal Processing | Qianhui Liu et.al. | 2408.16564 | null |
| 2024-08-29 | Measuring the Accuracy of Automatic Speech Recognition Solutions | Korbinian Kuhn et.al. | 2408.16287 | link |
| 2024-08-29 | Revisit Micro-batch Clipping: Adaptive Data Pruning via Gradient Manipulation | Lun Wang et.al. | 2408.16204 | null |
| 2024-08-29 | Benchmarking Japanese Speech Recognition on ASR-LLM Setups with Multi-Pass Augmented Generative Error Correction | Yuka Ko et.al. | 2408.16180 | null |
| 2024-08-28 | Beyond Levenshtein: Leveraging Multiple Algorithms for Robust Word Error Rate Computations And Granular Error Classifications | Korbinian Kuhn et.al. | 2408.15616 | link |
| 2024-08-28 | Whisper-PMFA: Partial Multi-Scale Feature Aggregation for Speaker Verification using Whisper Models | Yiyang Zhao et.al. | 2408.15585 | null |
| 2024-08-27 | Speech Recognition Transformers: Topological-lingualism Perspective | Shruti Singh et.al. | 2408.14991 | null |
| 2024-08-27 | Literary and Colloquial Dialect Identification for Tamil using Acoustic Features | M. Nanmalar et.al. | 2408.14887 | null |
| 2024-09-06 | MEDSAGE: Enhancing Robustness of Medical Dialogue Summarization to ASR Errors with LLM-generated Synthetic Dialogues | Kuluhan Binici et.al. | 2408.14418 | null |
| 2024-08-26 | Self-supervised Speech Representations Still Struggle with African American Vernacular English | Kalvin Chang et.al. | 2408.14262 | link |
| 2024-08-26 | Automatic recognition and detection of aphasic natural speech | Mara Barberis et.al. | 2408.14082 | null |
| 2024-08-28 | Research Advances and New Paradigms for Biology-inspired Spiking Neural Networks | Tianyu Zheng et.al. | 2408.13996 | null |
| 2024-08-25 | Literary and Colloquial Tamil Dialect Identification | M. Nanmalar et.al. | 2408.13739 | null |
| 2024-08-24 | Studying the Effect of Audio Filters in Pre-Trained Models for Environmental Sound Classification | Aditya Dawn et.al. | 2408.13644 | null |
| 2024-09-18 | NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks | He Huang et.al. | 2408.13106 | link |
| 2024-08-23 | Focused Discriminative Training For Streaming CTC-Trained Automatic Speech Recognition Models | Adnan Haider et.al. | 2408.13008 | null |
| 2024-08-22 | Towards measuring fairness in speech recognition: Fair-Speech dataset | Irina-Elena Veliche et.al. | 2408.12734 | null |
| 2024-08-22 | WhisperMask: A Noise Suppressive Mask-Type Microphone for Whisper Speech | Hirotaka Hiraki et.al. | 2408.12500 | null |
| 2024-08-22 | Positional Description for Numerical Normalization | Deepanshu Gupta et.al. | 2408.12430 | null |
| 2024-08-22 | Developing vocal system impaired patient-aimed voice quality assessment approach using ASR representation-included multiple features | Shaoxiang Dang et.al. | 2408.12279 | null |
| 2024-08-21 | The State of Commercial Automatic French Legal Speech Recognition Systems and their Impact on Court Reporters et al | Nicolad Garneau et.al. | 2408.11940 | null |
| 2024-08-19 | Parameter-Efficient Transfer Learning under Federated Learning for Automatic Speech Recognition | Xuan Kan et.al. | 2408.11873 | null |
| 2024-08-13 | Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation | Yinghao Aaron Li et.al. | 2408.11849 | null |
| 2024-08-21 | Approaching Deep Learning through the Spectral Dynamics of Weights | David Yunis et.al. | 2408.11804 | link |
| 2024-08-21 | Improving Speech Recognition Error Prediction for Modern and Off-the-shelf Speech Recognizers | Prashant Serai et.al. | 2408.11258 | null |
| 2024-08-20 | XCB: an effective contextual biasing approach to bias cross-lingual phrases in speech recognition | Xucheng Wan et.al. | 2408.10524 | null |
| 2024-08-19 | Recording for Eyes, Not Echoing to Ears: Contextualized Spoken-to-Written Conversion of ASR Transcripts | Jiaqing Liu et.al. | 2408.09688 | null |
| 2024-08-18 | A Transcription Prompt-based Efficient Audio Large Language Model for Robust Speech Recognition | Yangze Li et.al. | 2408.09491 | null |
| 2024-08-17 | Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition | Samuele Cornell et.al. | 2408.09215 | link |
| 2024-08-15 | Enhancing Large Language Model-based Speech Recognition by Contextualization for Rare and Ambiguous Words | Kento Nozawa et.al. | 2408.08027 | null |
| 2024-08-14 | SER Evals: In-domain and Out-of-domain Benchmarking for Speech Emotion Recognition | Mohamed Osman et.al. | 2408.07851 | link |
| 2024-08-14 | DPSNN: Spiking Neural Network for Low-Latency Streaming Speech Enhancement | Tao Sun et.al. | 2408.07388 | null |
| 2024-08-16 | MathBridge: A Large Corpus Dataset for Translating Spoken Mathematical Expressions into | Kyudan Jung et.al. | 2408.07081 | null |
| 2024-08-12 | Cross-Lingual Conversational Speech Summarization with Large Language Models | Max Nelson et.al. | 2408.06484 | null |
| 2024-08-12 | Audio Enhancement for Computer Audition -- An Iterative Training Paradigm Using Sample Importance | Manuel Milling et.al. | 2408.06264 | null |
| 2024-08-12 | Enhancing Dialogue Speech Recognition with Robust Contextual Awareness via Noise Representation Learning | Wonjun Lee et.al. | 2408.06043 | null |
| 2024-08-11 | LI-TTA: Language Informed Test-Time Adaptation for Automatic Speech Recognition | Eunseop Yoon et.al. | 2408.05769 | null |
| 2024-08-11 | VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing | Chunyu Qiang et.al. | 2408.05758 | null |
| 2024-08-10 | Improving Whisper's Recognition Performance for Under-Represented Language Kazakh Leveraging Unpaired Speech and Text | Jinpeng Li et.al. | 2408.05554 | null |
| 2024-08-09 | MooER: LLM-based Speech Recognition and Translation Models from Moore Threads | Junhao Xu et.al. | 2408.05101 | link |
| 2024-08-08 | HydraFormer: One Encoder For All Subsampling Rates | Yaoxun Xu et.al. | 2408.04325 | link |
| 2024-08-08 | Preserving spoken content in voice anonymisation with character-level vocoder conditioning | Michele Panariello et.al. | 2408.04306 | link |
| 2024-08-08 | wav2graph: A Framework for Supervised Learning Knowledge Graph from Speech | Khai Le-Duc et.al. | 2408.04174 | link |
| 2024-08-07 | Speaker Adaptation for Quantised End-to-End ASR Models | Qiuming Zhao et.al. | 2408.03979 | null |
| 2024-08-06 | ASR-enhanced Multimodal Representation Learning for Cross-Domain Product Retrieval | Ruixiang Zhao et.al. | 2408.02978 | null |
| 2024-08-06 | Self-Supervised Learning for Multi-Channel Neural Transducer | Atsushi Kojima et.al. | 2408.02945 | null |
| 2024-08-05 | Clustering and Mining Accented Speech for Inclusive and Fair Speech Recognition | Jaeyoung Kim et.al. | 2408.02582 | null |
| 2024-09-12 | The NPU-ASLP System Description for Visual Speech Recognition in CNVSRC 2024 | He Wang et.al. | 2408.02369 | link |
| 2024-08-05 | StreamVoice+: Evolving into End-to-end Streaming Zero-shot Voice Conversion | Zhichao Wang et.al. | 2408.02178 | null |
| 2024-08-03 | ALIF: Low-Cost Adversarial Audio Attacks on Black-Box Speech Platforms using Linguistic Features | Peng Cheng et.al. | 2408.01808 | link |
| 2024-08-01 | SynesLM: A Unified Approach for Audio-visual Speech Recognition and Translation via Language Model and Synthetic Data | Yichen Lu et.al. | 2408.00624 | link |
| 2024-08-01 | Sentence-wise Speech Summarization: Task, Datasets, and End-to-End Modeling with LM Knowledge Distillation | Kohei Matsuura et.al. | 2408.00205 | null |
| 2024-07-18 | Framework for Curating Speech Datasets and Evaluating ASR Systems: A Case Study for Polish | Michał Junczyk et.al. | 2408.00005 | link |
| 2024-07-18 | Handling Numeric Expressions in Automatic Speech Recognition | Christian Huber et.al. | 2408.00004 | null |
| 2024-08-15 | The Llama 3 Herd of Models | Abhimanyu Dubey et.al. | 2407.21783 | null |
| 2024-07-31 | On the Problem of Text-To-Speech Model Selection for Synthetic Data Generation in Automatic Speech Recognition | Nick Rossenbach et.al. | 2407.21476 | null |
| 2024-07-31 | Towards interfacing large language models with ASR systems using confidence measures and prompting | Maryam Naderi et.al. | 2407.21414 | null |
| 2024-07-30 | Self-Supervised Models in Automatic Whispered Speech Recognition | Aref Farhadipour et.al. | 2407.21211 | null |
| 2024-07-28 | ELP-Adapters: Parameter Efficient Adapter Tuning for Various Speech Processing Tasks | Nakamasa Inoue et.al. | 2407.21066 | null |
| 2024-07-26 | Improving noisy student training for low-resource languages in End-to-End ASR using CycleGAN and inter-domain losses | Chia-Yu Li et.al. | 2407.21061 | null |
| 2024-07-10 | Dynamic Encoder Size Based on Data-Driven Layer-wise Pruning for Speech Recognition | Jingjing Xu et.al. | 2407.18930 | null |
| 2024-08-07 | Dynamic Language Group-Based MoE: Enhancing Code-Switching Speech Recognition with Hierarchical Routing | Hukai Huang et.al. | 2407.18581 | link |
| 2024-07-29 | Speech Bandwidth Expansion Via High Fidelity Generative Adversarial Networks | Mahmoud Salhab et.al. | 2407.18571 | null |
| 2024-07-26 | Enhancing Dysarthric Speech Recognition for Unseen Speakers via Prototype-Based Adaptation | Shiyao Wang et.al. | 2407.18461 | link |
| 2024-07-08 | Analyzing Speech Unit Selection for Textless Speech-to-Speech Translation | Jarod Duret et.al. | 2407.18332 | null |
| 2024-07-25 | On the Effect of Purely Synthetic Training Data for Different Automatic Speech Recognition Architectures | Nick Rossenbach et.al. | 2407.17997 | null |
| 2024-07-25 | Improving Domain-Specific ASR with LLM-Generated Contextual Descriptions | Jiwon Suh et.al. | 2407.17874 | null |
| 2024-07-25 | Scaling A Simple Approach to Zero-Shot Speech Recognition | Jinming Zhao et.al. | 2407.17852 | link |
| 2024-07-24 | Coupling Speech Encoders with Downstream Text Models | Ciprian Chelba et.al. | 2407.17605 | null |
| 2024-07-30 | Toward Automated Detection of Biased Social Signals from the Content of Clinical Conversations | Feng Chen et.al. | 2407.17477 | null |
| 2024-07-10 | Explaining Spectrograms in Machine Learning: A Study on Neural Networks for Speech Classification | Jesin James et.al. | 2407.17416 | null |
| 2024-07-24 | A Comparative Analysis of Bilingual and Trilingual Wav2Vec Models for Automatic Speech Recognition in Multilingual Oral History Archives | Jan Lehečka et.al. | 2407.17160 | null |
| 2024-07-23 | Quantifying the Role of Textual Predictability in Automatic Speech Recognition | Sean Robertson et.al. | 2407.16537 | null |
| 2024-07-23 | The CHiME-8 DASR Challenge for Generalizable and Array Agnostic Distant Automatic Speech Recognition and Diarization | Samuele Cornell et.al. | 2407.16447 | null |
| 2024-07-23 | Evolutionary Prompt Design for LLM-Based Post-ASR Error Correction | Rithik Sachdev et.al. | 2407.16370 | link |
| 2024-07-22 | dMel: Speech Tokenization made Simple | He Bai et.al. | 2407.15835 | null |
| 2024-07-22 | Robustness of Speech Separation Models for Similar-pitch Speakers | Bunlong Lay et.al. | 2407.15749 | null |
| 2024-07-22 | SELM: Enhancing Speech Emotion Recognition for Out-of-Domain Scenarios | Hazim Bukhari et.al. | 2407.15300 | null |
| 2024-08-24 | Trading Devil Final: Backdoor attack via Stock market and Bayesian Optimization | Orson Mengara et.al. | 2407.14573 | null |
| 2024-07-07 | Morse Code-Enabled Speech Recognition for Individuals with Visual and Hearing Impairments | Ritabrata Roy Choudhury et.al. | 2407.14525 | null |
| 2024-07-19 | GE2E-AC: Generalized End-to-End Loss Training for Accent Classification | Chihiro Watanabe et.al. | 2407.14021 | null |
| 2024-07-19 | Reexamining Racial Disparities in Automatic Speech Recognition Performance: The Role of Confounding by Provenance | Changye Li et.al. | 2407.13982 | null |
| 2024-07-22 | Self-supervised ASR Models and Features For Dysarthric and Elderly Speech Recognition | Shujie Hu et.al. | 2407.13782 | null |
| 2024-07-18 | Robust ASR Error Correction with Conservative Data Filtering | Takuma Udagawa et.al. | 2407.13300 | null |
| 2024-07-18 | Low-Resourced Speech Recognition for Iu Mien Language via Weakly-Supervised Phoneme-based Multilingual Pre-training | Lukuan Dong et.al. | 2407.13292 | null |
| 2024-07-18 | How Private is Low-Frequency Speech Audio in the Wild? An Analysis of Verbal Intelligibility by Humans and Machines | Ailin Liu et.al. | 2407.13266 | null |
| 2024-07-18 | A light-weight and efficient punctuation and word casing prediction model for on-device streaming ASR | Jian You et.al. | 2407.13142 | null |
| 2024-06-29 | Error Correction by Paying Attention to Both Acoustic and Confidence References for Automatic Speech Recognition | Yuchun Shu et.al. | 2407.12817 | null |
| 2024-07-17 | Morphosyntactic Analysis for CHILDES | Houjun Liu et.al. | 2407.12389 | null |
| 2024-07-17 | Adaptive Cascading Network for Continual Test-Time Adaptation | Kien X. Nguyen et.al. | 2407.12240 | null |
| 2024-07-16 | Identifying Speakers in Dialogue Transcripts: A Text-based Approach Using Pretrained Language Models | Minh Nguyen et.al. | 2407.12094 | link |
| 2024-06-29 | A Quality-Aware Voltage Overscaling Framework to Improve the Energy Efficiency and Lifetime of TPUs based on Statistical Error Modeling | Alireza Senobari et.al. | 2407.12029 | null |
| 2024-06-28 | TreeSeg: Hierarchical Topic Segmentation of Large Transcripts | Dimitrios C. Gklezakos et.al. | 2407.12028 | null |
| 2024-05-31 | Open the Data! Chuvash Datasets | Nikolay Plotnikov et.al. | 2407.11982 | null |
| 2024-07-17 | Vibravox: A Dataset of French Speech Captured with Body-conduction Audio Sensors | Julien Hauret et.al. | 2407.11828 | link |
| 2024-07-16 | Investigating the Effect of Label Topology and Training Criterion on ASR Performance and Alignment Quality | Tina Raissi et.al. | 2407.11641 | null |
| 2024-07-16 | The VoicePrivacy 2022 Challenge: Progress and Perspectives in Voice Anonymisation | Michele Panariello et.al. | 2407.11516 | null |
| 2024-07-16 | Beyond Binary: Multiclass Paraphasia Detection with Generative Pretrained Transformers and End-to-End Models | Matthew Perez et.al. | 2407.11345 | null |
| 2024-07-15 | Leave No Knowledge Behind During Knowledge Distillation: Towards Practical and Effective Knowledge Distillation for Code-Switching ASR Using Realistic Data | Liang-Hsuan Tseng et.al. | 2407.10603 | null |
| 2024-07-14 | Improving Neural Biasing for Contextual Speech Recognition by Early Context Injection and Text Perturbation | Ruizhe Huang et.al. | 2407.10303 | null |
| 2024-07-14 | CUSIDE-T: Chunking, Simulating Future and Decoding for Transducer based Streaming ASR | Wenbo Zhao et.al. | 2407.10255 | null |
| 2024-07-14 | Textless Dependency Parsing by Labeled Sequence Prediction | Shunsuke Kando et.al. | 2407.10118 | link |
| 2024-07-14 | Whisper-SV: Adapting Whisper for Low-data-resource Speaker Verification | Li Zhang et.al. | 2407.10048 | null |
| 2024-07-13 | Text-Based Detection of On-Hold Scripts in Contact Center Calls | Dmitrii Galimzianov et.al. | 2407.09849 | link |
| 2024-08-24 | Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System | Lingwei Meng et.al. | 2407.09817 | link |
| 2024-07-13 | A Streaming Multi-Channel End-to-End Speech Recognition System with Realistic Evaluations | Xiangzhu Kong et.al. | 2407.09807 | link |
| 2024-07-13 | Speech Slytherin: Examining the Performance and Efficiency of Mamba for Speech Separation, Recognition, and Synthesis | Xilin Jiang et.al. | 2407.09732 | link |
| 2024-07-10 | Evaluating Voice Command Pipelines for Drone Control: From STT and LLM to Direct Classification and Siamese Networks | Lucca Emmanuel Pineli Simões et.al. | 2407.08658 | null |
| 2024-08-12 | Tamil Language Computing: the Present and the Future | Kengatharaiyer Sarveswaran et.al. | 2407.08618 | null |
| 2024-07-10 | HebDB: a Weakly Supervised Dataset for Hebrew Speech Processing | Arnon Turetzky et.al. | 2407.07566 | null |
| 2024-07-09 | Tailored Design of Audio-Visual Speech Recognition Models using Branchformers | David Gimeno-Gómez et.al. | 2407.06606 | link |
| 2024-07-08 | Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation | Mengzhe Geng et.al. | 2407.06310 | null |
| 2024-07-09 | CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens | Zhihao Du et.al. | 2407.05407 | null |
| 2024-07-10 | Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition | Ye Bai et.al. | 2407.04675 | null |
| 2024-07-05 | Multitaper mel-spectrograms for keyword spotting | Douglas Baptista de Souza et.al. | 2407.04662 | null |
| 2024-07-05 | Pretraining End-to-End Keyword Search with Automatically Discovered Acoustic Units | Bolaji Yusuf et.al. | 2407.04652 | link |
| 2024-07-05 | Speculative Speech Recognition by Audio-Prefixed Low-Rank Adaptation of Language Models | Bolaji Yusuf et.al. | 2407.04641 | null |
| 2024-07-05 | Written Term Detection Improves Spoken Term Detection | Bolaji Yusuf et.al. | 2407.04601 | link |
| 2024-07-09 | Performance Analysis of Speech Encoders for Low-Resource SLU and ASR in Tunisian Dialect | Salima Mdhaffar et.al. | 2407.04533 | link |
| 2024-07-05 | Controlling Whisper: Universal Acoustic Adversarial Attacks to Control Speech Foundation Models | Vyas Raina et.al. | 2407.04482 | null |
| 2024-07-05 | XLSR-Transducer: Streaming ASR for Self-Supervised Pretrained Models | Shashi Kumar et.al. | 2407.04439 | null |
| 2024-07-05 | Romanization Encoding For Multilingual ASR | Wen Ding et.al. | 2407.04368 | null |
| 2024-07-05 | LearnerVoice: A Dataset of Non-Native English Learners' Spontaneous Speech | Haechan Kim et.al. | 2407.04280 | null |
| 2024-07-05 | Semi-supervised Learning for Code-Switching ASR with Large Language Model Filter | Yu Xi et.al. | 2407.04219 | null |
| 2024-07-11 | FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs | Keyu An et.al. | 2407.04051 | link |
| 2024-07-04 | Improving Accented Speech Recognition using Data Augmentation based on Unsupervised Text-to-Speech Synthesis | Cong-Thanh Do et.al. | 2407.04047 | null |
| 2024-07-04 | Serialized Output Training by Learned Dominance | Ying Shi et.al. | 2407.03966 | null |
| 2024-07-04 | Finetuning End-to-End Models for Estonian Conversational Spoken Language Translation | Tiia Sildam et.al. | 2407.03809 | null |
| 2024-07-04 | Improving Self-supervised Pre-training using Accent-Specific Codebooks | Darshan Prabhu et.al. | 2407.03734 | link |
| 2024-07-24 | Multi-Convformer: Extending Conformer with Multiple Convolution Kernels | Darshan Prabhu et.al. | 2407.03718 | link |
| 2024-07-04 | Learning Video Temporal Dynamics with Cross-Modal Attention for Robust Audio-Visual Speech Recognition | Sungnyun Kim et.al. | 2407.03563 | null |
| 2024-07-03 | Codec-ASR: Training Performant Automatic Speech Recognition Systems with Discrete Speech Representations | Kunal Dhawan et.al. | 2407.03495 | null |
| 2024-07-03 | Advanced Framework for Animal Sound Classification With Features Optimization | Qiang Yang et.al. | 2407.03440 | null |
| 2024-07-03 | Qifusion-Net: Layer-adapted Stream/Non-stream Model for End-to-End Multi-Accent Speech Recognition | Jinming Chen et.al. | 2407.03026 | null |
| 2024-07-02 | Towards the Next Frontier in Speech Representation Learning Using Disentanglement | Varun Krishna et.al. | 2407.02543 | null |
| 2024-07-02 | The USTC-NERCSLIP Systems for The ICMC-ASR Challenge | Minghui Wu et.al. | 2407.02052 | null |
| 2024-07-02 | Pinyin Regularization in Error Correction for Chinese Speech Recognition with Large Language Models | Zhiyuan Tang et.al. | 2407.01909 | link |
| 2024-06-30 | Less Forgetting for Better Generalization: Exploring Continual-learning Fine-tuning Methods for Speech Self-supervised Representations | Salah Zaiem et.al. | 2407.00756 | null |
| 2024-06-29 | When Robots Get Chatty: Grounding Multimodal Human-Robot Conversation and Collaboration | Philipp Allgeuer et.al. | 2407.00518 | null |
| 2024-07-18 | Open-Source Conversational AI with SpeechBrain 1.0 | Mirco Ravanelli et.al. | 2407.00463 | null |
| 2024-06-28 | SAML: Speaker Adaptive Mixture of LoRA Experts for End-to-End ASR | Qiuming Zhao et.al. | 2406.19706 | null |
| 2024-06-28 | Less is More: Accurate Speech Recognition & Translation without Web-Scale Data | Krishna C. Puvvada et.al. | 2406.19674 | null |
| 2024-06-27 | Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects | Orevaoghene Ahia et.al. | 2406.19564 | link |
| 2024-06-27 | Tradition or Innovation: A Comparison of Modern ASR Methods for Forced Alignment | Rotem Rousso et.al. | 2406.19363 | null |
| 2024-06-27 | Zero-Query Adversarial Attack on Black-box Automatic Speech Recognition Systems | Zheng Fang et.al. | 2406.19311 | null |
| 2024-06-27 | Applying LLMs for Rescoring N-best ASR Hypotheses of Casual Conversations: Effects of Domain Adaptation and Context Carry-over | Atsunori Ogawa et.al. | 2406.18972 | null |
| 2024-06-27 | Enhanced ASR Robustness to Packet Loss with a Front-End Adaptation Network | Yehoshua Dissen et.al. | 2406.18928 | null |
| 2024-06-27 | Streaming Decoder-Only Automatic Speech Recognition with Discrete Speech Units: A Pilot Study | Peikun Chen et.al. | 2406.18862 | link |
| 2024-06-26 | Dynamic Data Pruning for Automatic Speech Recognition | Qiao Xiao et.al. | 2406.18373 | null |
| 2024-06-26 | MSR-86K: An Evolving, Multilingual Corpus with 86,300 Hours of Transcribed Audio for Speech Recognition Research | Song Li et.al. | 2406.18301 | null |
| 2024-06-26 | Automatic Speech Recognition for Hindi | Anish Saha et.al. | 2406.18135 | null |
| 2024-07-12 | ArzEn-LLM: Code-Switched Egyptian Arabic-English Translation and Speech Recognition Using LLMs | Ahmed Heakl et.al. | 2406.18120 | link |
| 2024-06-26 | SC-MoE: Switch Conformer Mixture of Experts for Unified Streaming and Non-streaming Code-Switching ASR | Shuaishuai Ye et.al. | 2406.18021 | null |
| 2024-06-25 | Sequential Editing for Lifelong Training of Speech Recognition Models | Devang Kulshreshtha et.al. | 2406.17935 | null |
| 2024-06-25 | FASA: a Flexible and Automatic Speech Aligner for Extracting High-quality Aligned Children Speech Data | Dancheng Liu et.al. | 2406.17926 | link |
| 2024-06-25 | Automatic speech recognition for the Nepali language using CNN, bidirectional LSTM and ResNet | Manish Dhakal et.al. | 2406.17825 | link |
| 2024-06-25 | Towards Building an End-to-End Multilingual Automatic Lyrics Transcription Model | Jiawen Huang et.al. | 2406.17618 | link |
| 2024-06-25 | MSRS: Training Multimodal Speech Recognition Models from Scratch with Sparse Mask Optimization | Adriana Fernandez-Lopez et.al. | 2406.17614 | null |
| 2024-06-25 | A Comprehensive Solution to Connect Speech Encoder and Large Language Model for ASR | Van Tung Pham et.al. | 2406.17272 | null |
| 2024-06-24 | Investigating Confidence Estimation Measures for Speaker Diarization | Anurag Chowdhury et.al. | 2406.17124 | null |
| 2024-06-24 | Exploring the Capability of Mamba in Speech Applications | Koichi Miyazaki et.al. | 2406.16808 | null |
| 2024-06-24 | Blending LLMs into Cascaded Speech Translation: KIT's Offline Speech Translation System for IWSLT 2024 | Sai Koneru et.al. | 2406.16777 | null |
| 2024-06-23 | Contextualized End-to-end Automatic Speech Recognition with Intermediate Biasing Loss | Muhammad Shakeel et.al. | 2406.16120 | null |
| 2024-08-01 | Decoder-only Architecture for Streaming End-to-end Speech Recognition | Emiru Tsunoo et.al. | 2406.16107 | null |
| 2024-06-22 | Acoustic Feature Mixup for Balanced Multi-aspect Pronunciation Assessment | Heejin Do et.al. | 2406.15723 | null |
| 2024-06-21 | PI-Whisper: An Adaptive and Incremental ASR Framework for Diverse and Evolving Speaker Characteristics | Amir Nassereldine et.al. | 2406.15668 | null |
| 2024-06-21 | Perception of Phonological Assimilation by Neural Speech Recognition Models | Charlotte Pouw et.al. | 2406.15265 | null |
| 2024-06-21 | InterBiasing: Boost Unseen Word Recognition through Biasing Intermediate Predictions | Yu Nakagome et.al. | 2406.14890 | null |
| 2024-06-20 | An Adapter-Based Unified Model for Multiple Spoken Language Processing Tasks | Varsha Suresh et.al. | 2406.14747 | null |
| 2024-06-21 | DASB - Discrete Audio and Speech Benchmark | Pooneh Mousavi et.al. | 2406.14294 | null |
| 2024-06-20 | Intelligent Interface: Enhancing Lecture Engagement with Didactic Activity Summaries | Anna Wróblewska et.al. | 2406.14266 | null |
| 2024-06-19 | Joint vs Sequential Speaker-Role Detection and Automatic Speech Recognition for Air-traffic Control | Alexander Blatt et.al. | 2406.13842 | null |
| 2024-06-19 | ManWav: The First Manchu ASR Model | Jean Seo et.al. | 2406.13502 | null |
| 2024-06-24 | Children's Speech Recognition through Discrete Token Enhancement | Vrunda N. Sukhadia et.al. | 2406.13431 | null |
| 2024-06-17 | Self-Train Before You Transcribe | Robert Flynn et.al. | 2406.12937 | link |
| 2024-06-16 | Automatic Speech Recognition for Biomedical Data in Bengali Language | Shariar Kabir et.al. | 2406.12931 | null |
| 2024-06-18 | Bridging the Gap: Integrating Pre-trained Speech Enhancement and Recognition Models for Robust Speech Recognition | Kuan-Chen Wang et.al. | 2406.12699 | null |
| 2024-06-18 | Transcribe, Align and Segment: Creating speech datasets for low-resource languages | Taras Sereda et.al. | 2406.12674 | null |
| 2024-06-18 | Growing Trees on Sounds: Assessing Strategies for End-to-End Dependency Parsing of Speech | Adrien Pupier et.al. | 2406.12621 | link |
| 2024-06-18 | Rapid Language Adaptation for Multilingual E2E Speech Recognition Using Encoder Prompting | Yosuke Kashiwagi et.al. | 2406.12611 | null |
| 2024-06-18 | Unsupervised Online Continual Learning for Automatic Speech Recognition | Steven Vander Eeckt et.al. | 2406.12503 | link |
| 2024-06-18 | Performant ASR Models for Medical Entities in Accented Speech | Tejumade Afonja et.al. | 2406.12387 | null |
| 2024-06-18 | Finding Task-specific Subnetworks in Multi-task Spoken Language Understanding Model | Hayato Futami et.al. | 2406.12317 | null |
| 2024-06-18 | SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization | Young Jin Ahn et.al. | 2406.12233 | link |
| 2024-06-17 | GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement | Yifan Yang et.al. | 2406.11546 | link |
| 2024-06-16 | Continual Test-time Adaptation for End-to-end Speech Recognition on Noisy Speech | Guan-Ting Lin et.al. | 2406.11064 | null |
| 2024-06-16 | NAST: Noise Aware Speech Tokenization for Speech Language Models | Shoval Messica et.al. | 2406.11037 | link |
| 2024-06-16 | Large Language Models for Dysfluency Detection in Stuttered Speech | Dominik Wagner et.al. | 2406.11025 | null |
| 2024-06-16 | Outlier Reduction with Gated Attention for Improved Post-training Quantization in Large Sequence-to-sequence Speech Foundation Models | Dominik Wagner et.al. | 2406.11022 | null |
| 2024-06-16 | Optimized Speculative Sampling for GPU Hardware Accelerators | Dominik Wagner et.al. | 2406.11016 | null |
| 2024-06-16 | CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving | Bhavani Shankar et.al. | 2406.10993 | null |
| 2024-06-16 | Imperceptible Rhythm Backdoor Attacks: Exploring Rhythm Transformation for Embedding Undetectable Vulnerabilities on Speech Recognition | Wenhan Yao et.al. | 2406.10932 | null |
| 2024-06-15 | Speech Emotion Recognition Using CNN and Its Use Case in Digital Healthcare | Nishargo Nigar et.al. | 2406.10741 | null |
| 2024-06-21 | Trading Devil: Robust backdoor attack via Stochastic investment models and Bayesian approach | Orson Mengara et.al. | 2406.10719 | null |
| 2024-08-06 | Double Multi-Head Attention Multimodal System for Odyssey 2024 Speech Emotion Recognition Challenge | Federico Costa et.al. | 2406.10598 | null |
| 2024-06-14 | CNVSRC 2023: The First Chinese Continuous Visual Speech Recognition Challenge | Chen Chen et.al. | 2406.10313 | null |
| 2024-06-12 | Improving child speech recognition with augmented child-like speech | Yuanyuan Zhang et.al. | 2406.10284 | null |
| 2024-06-14 | Inclusive ASR for Disfluent Speech: Cascaded Large-Scale Self-Supervised Learning with Targeted Fine-Tuning and Data Augmentation | Dena Mujtaba et.al. | 2406.10177 | null |
| 2024-06-14 | On the Evaluation of Speech Foundation Models for Spoken Language Understanding | Siddhant Arora et.al. | 2406.10083 | null |
| 2024-06-14 | Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation | Andrew Rouditchenko et.al. | 2406.10082 | link |
| 2024-06-14 | Simul-Whisper: Attention-Guided Streaming Whisper with Truncation Detection | Haoyu Wang et.al. | 2406.10052 | link |
| 2024-06-14 | ROAR: Reinforcing Original to Augmented Data Ratio Dynamics for Wav2Vec2.0 Based ASR | Vishwanath Pratap Singh et.al. | 2406.09999 | null |
| 2024-06-14 | An efficient text augmentation approach for contextualized Mandarin speech recognition | Naijun Zheng et.al. | 2406.09950 | null |
| 2024-06-14 | Perceiver-Prompt: Flexible Speaker Adaptation in Whisper for Chinese Disordered Speech Recognition | Yicong Jiang et.al. | 2406.09873 | null |
| 2024-06-14 | MMM: Multi-Layer Multi-Residual Multi-Stream Discrete Speech Representation from Self-supervised Learning Model | Jiatong Shi et.al. | 2406.09869 | null |
| 2024-06-14 | Optimizing Byte-level Representation for End-to-end ASR | Roger Hsiao et.al. | 2406.09676 | null |
| 2024-06-14 | Learning Language Structures through Grounding | Freda Shi et.al. | 2406.09662 | null |
| 2024-06-13 | Multi-Modal Retrieval For Large Language Model Based Speech Recognition | Jari Kolehmainen et.al. | 2406.09618 | null |
| 2024-06-13 | Speech ReaLLM -- Real-time Streaming Speech Recognition with Multimodal LLMs by Teaching the Flow of Time | Frank Seide et.al. | 2406.09569 | null |
| 2024-06-13 | The Second DISPLACE Challenge: DIarization of SPeaker and LAnguage in Conversational Environments | Shareef Babu Kalluri et.al. | 2406.09494 | null |
| 2024-06-12 | Comparative Analysis of Personalized Voice Activity Detection Systems: Assessing Real-World Effectiveness | Satyam Kumar et.al. | 2406.09443 | null |
| 2024-04-13 | SGPRS: Seamless GPU Partitioning Real-Time Scheduler for Periodic Deep Learning Workloads | Amir Fakhim Babaei et.al. | 2406.09425 | null |
| 2024-06-13 | Language Complexity and Speech Recognition Accuracy: Orthographic Complexity Hurts, Phonological Complexity Doesn't | Chihiro Taguchi et.al. | 2406.09202 | link |
| 2024-06-13 | LASER: Learning by Aligning Self-supervised Representations of Speech for Improving Content-related Tasks | Amit Meghanani et.al. | 2406.09153 | link |
| 2024-06-13 | Transcription-Free Fine-Tuning of Speech Separation Models for Noisy and Reverberant Multi-Speaker Automatic Speech Recognition | William Ravenscroft et.al. | 2406.08914 | null |
| 2024-06-13 | AdaPTwin: Low-Cost Adaptive Compression of Product Twins in Transformers | Emil Biju et.al. | 2406.08904 | null |
| 2024-06-12 | ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets | Jiatong Shi et.al. | 2406.08641 | null |
| 2024-06-12 | Neural Blind Source Separation and Diarization for Distant Speech Recognition | Yoshiaki Bando et.al. | 2406.08396 | null |
| 2025-01-10 | Towards Unsupervised Speech Recognition Without Pronunciation Models | Junrui Ni et.al. | 2406.08380 | null |
| 2024-06-12 | Speech Emotion Recognition with ASR Transcripts: A Comprehensive Study on Word Error Rate and Fusion Techniques | Yuanchao Li et.al. | 2406.08353 | link |
| 2024-06-13 | Refining Self-Supervised Learnt Speech Representation using Brain Activations | Hengyu Li et.al. | 2406.08266 | null |
| 2024-06-12 | Transformer-based Model for ASR N-Best Rescoring and Rewriting | Iwen E. Kang et.al. | 2406.08207 | null |
| 2024-06-12 | Audio-conditioned phonemic and prosodic annotation for building text-to-speech models from unlabeled speech data | Yuma Shirahata et.al. | 2406.08111 | null |
| 2024-06-14 | Can Large Language Models Understand Spatial Audio? | Changli Tang et.al. | 2406.07914 | null |
| 2024-06-12 | Guiding Frame-Level CTC Alignments Using Self-knowledge Distillation | Eungbeom Kim et.al. | 2406.07909 | null |
| 2024-06-12 | DualVC 3: Leveraging Language Model Generated Pseudo Context for End-to-end Low Latency Streaming Voice Conversion | Ziqian Ning et.al. | 2406.07846 | null |
| 2024-06-12 | Dual-Pipeline with Low-Rank Adaptation for New Language Integration in Multilingual ASR | Yerbolat Khassanov et.al. | 2406.07842 | null |
| 2024-06-12 | PRoDeliberation: Parallel Robust Deliberation for End-to-End Spoken Language Understanding | Trang Le et.al. | 2406.07823 | null |
| 2024-06-12 | PolySpeech: Exploring Unified Multitask Speech Models for Competitiveness with Single-task Models | Runyan Yang et.al. | 2406.07801 | null |
| 2024-06-11 | The Interspeech 2024 Challenge on Speech Processing Using Discrete Units | Xuankai Chang et.al. | 2406.07725 | null |
| 2024-06-11 | Tag and correct: high precision post-editing approach to correction of speech recognition errors | Tomasz Ziętkiewicz et.al. | 2406.07589 | null |
| 2024-06-11 | AS-70: A Mandarin stuttered speech dataset for automatic speech recognition and stuttering event detection | Rong Gong et.al. | 2406.07256 | null |
| 2024-06-11 | Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter | Andrei Andrusenko et.al. | 2406.07096 | null |
| 2024-07-29 | Spoken Language Corpora Augmentation with Domain-Specific Voice-Cloned Speech | Mateusz Czyżnikiewicz et.al. | 2406.07090 | null |
| 2024-06-11 | Reading Miscue Detection in Primary School through Automatic Speech Recognition | Lingyun Gao et.al. | 2406.07060 | null |
| 2024-06-10 | Synthetic Query Generation using Large Language Models for Virtual Assistants | Sonal Sannigrahi et.al. | 2406.06729 | null |
| 2024-06-13 | ASTRA: Aligning Speech and Text Representations for Asr without Sampling | Neeraj Gaur et.al. | 2406.06664 | null |
| 2024-06-07 | LoRA-Whisper: Parameter-Efficient and Extensible Multilingual ASR | Zheshu Song et.al. | 2406.06619 | null |
| 2024-06-25 | Discrete Multimodal Transformers with a Pretrained Large Language Model for Mixed-Supervision Speech Processing | Viet Anh Trinh et.al. | 2406.06582 | null |
| 2024-06-10 | A Parameter-efficient Language Extension Framework for Multilingual ASR | Wei Liu et.al. | 2406.06329 | null |
| 2024-06-10 | Prompting Large Language Models with Audio for General-Purpose Speech Summarization | Wonjune Kang et.al. | 2406.05968 | link |
| 2024-07-18 | Do Prompts Really Prompt? Exploring the Prompt Understanding Capability of Whisper | Chih-Kai Yang et.al. | 2406.05806 | null |
| 2024-07-20 | Optimizing Multi-Stuttered Speech Classification: Leveraging Whisper's Encoder for Efficient Parameter Reduction in Automated Assessment | Huma Ameer et.al. | 2406.05784 | null |
| 2024-06-09 | MS-HuBERT: Mitigating Pre-training and Inference Mismatch in Masked Language Modelling methods for learning Speech Representations | Hemant Yadav et.al. | 2406.05661 | null |
| 2024-06-07 | LLM-based speaker diarization correction: A generalizable approach | Georgios Efstathiadis et.al. | 2406.04927 | link |
| 2024-07-02 | Speaker-Smoothed kNN Speaker Adaptation for End-to-End ASR | Shaojun Li et.al. | 2406.04791 | null |
| 2024-06-07 | Pitch-Aware RNN-T for Mandarin Chinese Mispronunciation Detection and Diagnosis | Xintong Wang et.al. | 2406.04595 | null |
| 2024-06-06 | Flexible Multichannel Speech Enhancement for Noise-Robust Frontend | Ante Jukić et.al. | 2406.04552 | null |
| 2024-06-06 | Label-Synchronous Neural Transducer for E2E Simultaneous Speech Translation | Keqi Deng et.al. | 2406.04541 | link |
| 2024-06-06 | To Distill or Not to Distill? On the Robustness of Robust Knowledge Distillation | Abdul Waheed et.al. | 2406.04512 | null |
| 2024-06-06 | LipGER: Visually-Conditioned Generative Error Correction for Robust Automatic Speech Recognition | Sreyan Ghosh et.al. | 2406.04432 | link |
| 2024-06-06 | Beyond Performance Plateaus: A Comprehensive Study on Scalability in Speech Enhancement | Wangyou Zhang et.al. | 2406.04269 | link |
| 2024-07-02 | Hypernetworks for Personalizing ASR to Atypical Speech | Max Müller-Eberstein et.al. | 2406.04240 | null |
| 2024-06-06 | Helsinki Speech Challenge 2024 | Martin Ludvigsen et.al. | 2406.04123 | null |
| 2024-06-06 | BLSP-Emo: Towards Empathetic Large Speech-Language Models | Chen Wang et.al. | 2406.03872 | link |
| 2024-06-14 | Improving Zero-Shot Chinese-English Code-Switching ASR with kNN-CTC and Gated Monolingual Datastores | Jiaming Zhou et.al. | 2406.03814 | null |
| 2024-06-06 | Speed of Light Exact Greedy Decoding for RNN-T Speech Recognition Models on GPU | Daniel Galvez et.al. | 2406.03791 | null |
| 2024-06-11 | Enhancing CTC-based speech recognition with diverse modeling units | Shiyi Han et.al. | 2406.03274 | null |
| 2024-06-05 | Error-preserving Automatic Speech Recognition of Young English Learners' Language | Janick Michot et.al. | 2406.03235 | link |
| 2024-06-05 | StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning | Shaolei Zhang et.al. | 2406.03049 | link |
| 2024-06-05 | 4D ASR: Joint Beam Search Integrating CTC, Attention, Transducer, and Mask Predict Decoders | Yui Sudo et.al. | 2406.02950 | null |
| 2024-06-15 | Task Arithmetic can Mitigate Synthetic-to-Real Gap in Automatic Speech Recognition | Hsuan Su et.al. | 2406.02925 | null |
| 2024-06-11 | Text Injection for Neural Contextual Biasing | Zhong Meng et.al. | 2406.02921 | null |
| 2024-06-04 | Keyword-Guided Adaptation of Automatic Speech Recognition | Aviv Shamsian et.al. | 2406.02649 | null |
| 2024-05-03 | Combining X-Vectors and Bayesian Batch Active Learning: Two-Stage Active Learning Pipeline for Speech Recognition | Ognjen Kundacina et.al. | 2406.02566 | null |
| 2024-05-02 | Sequence-to-sequence models in peer-to-peer learning: A practical application | Robert Šajina et.al. | 2406.02565 | null |
| 2024-04-29 | A cost minimization approach to fix the vocabulary size in a tokenizer for an End-to-End ASR system | Sunil Kumar Kopparapu et.al. | 2406.02563 | null |
| 2024-04-24 | Gated Low-rank Adaptation for personalized Code-Switching Automatic Speech Recognition on the low-spec devices | Gwantae Kim et.al. | 2406.02562 | null |
| 2024-04-23 | Breaking Walls: Pioneering Automatic Speech Recognition for Central Kurdish: End-to-End Transformer Paradigm | Abdulhady Abas Abdullah et.al. | 2406.02561 | null |
| 2024-07-18 | Less Peaky and More Accurate CTC Forced Alignment by Label Priors | Ruizhe Huang et.al. | 2406.02560 | link |
| 2024-03-27 | PhoWhisper: Automatic Speech Recognition for Vietnamese | Thanh-Thien Le et.al. | 2406.02555 | link |
| 2024-06-04 | Whistle: Data-Efficient Multilingual and Crosslingual Speech Recognition via Weakly Phonetic Supervision | Saierdaer Yusuyin et.al. | 2406.02166 | link |
| 2024-06-05 | Efficiently Train ASR Models that Memorize Less and Perform Better with Per-core Clipping | Lun Wang et.al. | 2406.02004 | null |
| 2024-06-03 | Enabling ASR for Low-Resource Languages: A Comprehensive Dataset Creation Approach | Ara Yeroyan et.al. | 2406.01446 | null |
| 2024-06-03 | Compute-Efficient Medical Image Classification with Softmax-Free Transformers and Sequence Normalization | Firas Khader et.al. | 2406.01314 | null |
| 2024-06-02 | YODAS: Youtube-Oriented Dataset for Audio and Speech | Xinjian Li et.al. | 2406.00899 | null |
| 2024-06-01 | Wav2Prompt: End-to-End Speech Prompt Generation and Tuning For LLM in Zero and Few-shot Learning | Keqi Deng et.al. | 2406.00522 | null |
| 2024-05-27 | ViSpeR: Multilingual Audio-Visual Speech Recognition | Sanath Narayan et.al. | 2406.00038 | null |
| 2024-05-14 | Sonos Voice Control Bias Assessment Dataset: A Methodology for Demographic Bias Assessment in Voice Assistants | Chloé Sekkat et.al. | 2405.19342 | null |
| 2024-05-31 | Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities | Vicky Zayats et.al. | 2405.18669 | null |
| 2024-05-28 | Augmented Conversation with Embedded Speech-Driven On-the-Fly Referencing in AR | Shivesh Jadon et.al. | 2405.18537 | null |
| 2024-05-28 | Intelligent Clinical Documentation: Harnessing Generative AI for Patient-Centric Clinical Note Generation | Anjanava Biswas et.al. | 2405.18346 | null |
| 2024-05-28 | NUTS, NARS, and Speech | D. van der Sluis et.al. | 2405.17874 | null |
| 2024-05-28 | TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation | Chenyang Le et.al. | 2405.17809 | null |
| 2024-05-27 | Federating Dynamic Models using Early-Exit Architectures for Automatic Speech Recognition on Heterogeneous Clients | Mohamed Nabih Ali et.al. | 2405.17376 | null |
| 2024-05-27 | "Pass the butter": A study on desktop-classic multitasking robotic arm based on advanced YOLOv7 and BERT | Haohua Que et.al. | 2405.17250 | null |
| 2024-05-27 | A Variance-Preserving Interpolation Approach for Diffusion Models with Applications to Single Channel Speech Enhancement and Recognition | Zilu Guo et.al. | 2405.16952 | link |
| 2024-05-24 | Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition | Zijin Gu et.al. | 2405.15216 | null |
| 2024-05-23 | Contrastive and Consistency Learning for Neural Noisy-Channel Model in Spoken Language Understanding | Suyoung Kim et.al. | 2405.15097 | link |
| 2024-06-02 | Let's Fuse Step by Step: A Generative Fusion Decoding Algorithm with LLMs for Multi-modal Text Recognition | Chan-Jan Hsu et.al. | 2405.14259 | link |
| 2024-05-23 | Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models | Yuchen Hu et.al. | 2405.14161 | link |
| 2024-05-23 | A Survey on Vision-Language-Action Models for Embodied AI | Yueen Ma et.al. | 2405.14093 | null |
| 2024-05-22 | ST-Gait++: Leveraging spatio-temporal convolutions for gait-based emotion recognition on videos | Maria Luísa Lima et.al. | 2405.13903 | null |
| 2024-09-12 | Joint Optimization of Streaming and Non-Streaming Automatic Speech Recognition with Multi-Decoder and Knowledge Distillation | Muhammad Shakeel et.al. | 2405.13514 | null |
| 2024-05-22 | A Near-Real-Time Processing Ego Speech Filtering Pipeline Designed for Speech Interruption During Human-Robot Interaction | Yue Li et.al. | 2405.13477 | null |
| 2024-05-22 | You don't understand me!: Comparing ASR results for L1 and L2 speakers of Swedish | Ronald Cumbal et.al. | 2405.13379 | null |
| 2024-05-22 | Contextualized Automatic Speech Recognition with Dynamic Vocabulary | Yui Sudo et.al. | 2405.13344 | null |
| 2024-05-28 | FairLENS: Assessing Fairness in Law Enforcement Speech Recognition | Yicheng Wang et.al. | 2405.13166 | null |
| 2024-05-21 | Non-autoregressive real-time Accent Conversion model with voice cloning | Vladimir Nechaev et.al. | 2405.13162 | null |
| 2024-05-15 | Continued Pretraining for Domain Adaptation of Wav2vec2.0 in Automatic Speech Recognition for Elementary Math Classroom Settings | Ahmed Adel Attia et.al. | 2405.13018 | null |
| 2024-05-12 | Large Language Models for Education: A Survey | Hanyi Xu et.al. | 2405.13001 | null |
| 2024-03-14 | Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer | Maxime Burchi et.al. | 2405.12983 | null |
| 2024-05-21 | Could a Computer Architect Understand our Brain? | Valentin Puente-Varona et.al. | 2405.12815 | null |
| 2024-07-01 | Mamba in Speech: Towards an Alternative to Self-Attention | Xiangyu Zhang et.al. | 2405.12609 | null |
| 2024-05-20 | Continuous Sign Language Recognition with Adapted Conformer via Unsupervised Pretraining | Neena Aloysius et.al. | 2405.12018 | null |
| 2024-05-21 | Acoustic modeling for Overlapping Speech Recognition: JHU Chime-5 Challenge System | Vimal Manohar et.al. | 2405.11078 | null |
| 2024-05-16 | Listen Again and Choose the Right Answer: A New Paradigm for Automatic Speech Recognition with Large Language Models | Yuchen Hu et.al. | 2405.10025 | null |
| 2024-05-15 | No More Mumbles: Enhancing Robot Intelligibility through Speech Adaptation | Qiaoqiao Ren et.al. | 2405.09708 | link |
| 2024-05-15 | Towards Evaluating the Robustness of Automatic Speech Recognition Systems via Audio Style Transfer | Weifei Jin et.al. | 2405.09470 | null |
| 2024-05-14 | Investigating the 'Autoencoder Behavior' in Speech Self-Supervised Models: a focus on HuBERT's Pretraining | Valentin Vielzeuf et.al. | 2405.08402 | null |
| 2024-05-31 | SpeechVerse: A Large-scale Generalizable Audio Language Model | Nilaksh Das et.al. | 2405.08295 | null |
| 2024-06-07 | Rene: A Pre-trained Multi-modal Architecture for Auscultation of Respiratory Diseases | Pengfei Zhang et.al. | 2405.07442 | link |
| 2024-05-12 | SoccerNet-Echoes: A Soccer Game Audio Commentary Dataset | Sushant Gautam et.al. | 2405.07354 | link |
| 2024-07-22 | DP-DyLoRA: Fine-Tuning Transformer-Based Models On-Device under Differentially Private Federated Learning using Dynamic Low-Rank Adaptation | Jie Xu et.al. | 2405.06368 | null |
| 2024-05-10 | Lost in Transcription: Identifying and Quantifying the Accuracy Biases of Automatic Speech Recognition Systems Against Disfluent Speech | Dena Mujtaba et.al. | 2405.06150 | null |
| 2024-07-17 | Muting Whisper: A Universal Acoustic Adversarial Attack on Speech Foundation Models | Vyas Raina et.al. | 2405.06134 | link |
| 2024-05-09 | The RoyalFlush Automatic Speech Diarization and Recognition System for In-Car Multi-Channel Automatic Speech Recognition Challenge | Jingguang Tian et.al. | 2405.05498 | null |
| 2024-05-07 | Open Implementation and Study of BEST-RQ for Speech Processing | Ryan Whetten et.al. | 2405.04296 | link |
| 2024-05-06 | Whispy: Adapting STT Whisper Models to Real-Time Environments | Antonio Bevilacqua et.al. | 2405.03484 | null |
| 2024-05-06 | MMGER: Multi-modal and Multi-granularity Generative Error Correction with LLM for Joint Accent and Speech Recognition | Bingshen Mu et.al. | 2405.03152 | null |
| 2024-05-11 | Analysis about Theoretical Foundations for Method to Enhancing ASR Performance using OCR Word Frequency Differences | Kyudan Jung et.al. | 2405.02995 | null |
| 2024-05-04 | Mixat: A Data Set of Bilingual Emirati-English Speech | Maryam Al Ali et.al. | 2405.02578 | link |
| 2024-05-06 | Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets | Xuelong Geng et.al. | 2405.02132 | null |
| 2024-05-01 | Efficient Sample-Specific Encoder Perturbations | Yassir Fathullah et.al. | 2405.01601 | null |
| 2024-05-02 | Low-resource speech recognition and dialect identification of Irish in a multi-task framework | Liam Lonergan et.al. | 2405.01293 | null |
| 2024-05-02 | Improving Membership Inference in ASR Model Auditing with Perturbed Loss Features | Francisco Teixeira et.al. | 2405.01207 | null |
| 2024-05-02 | Deep Learning Models in Speech Recognition: Measuring GPU Energy Consumption, Impact of Noise and Model Quantization for Edge Deployment | Aditya Chakravarty et.al. | 2405.01004 | link |
| 2024-05-02 | Efficient Compression of Multitask Multilingual Speech Models | Thomas Palmeira Ferraz et.al. | 2405.00966 | null |
| 2024-05-01 | Active Learning with Task Adaptation Pre-training for Speech Emotion Recognition | Dongyuan Li et.al. | 2405.00307 | null |
| 2024-07-24 | Confides: A Visual Analytics Solution for Automated Speech Recognition Analysis and Exploration | Sunwoo Ha et.al. | 2405.00223 | null |
| 2024-05-09 | Does Whisper understand Swiss German? An automatic, qualitative, and human evaluation | Eyal Liron Dolev et.al. | 2404.19310 | null |
| 2024-04-30 | EfficientASR: Speech Recognition Network Compression via Attention Redundancy and Chunk-Level FFN Optimization | Jianzong Wang et.al. | 2404.19214 | null |
| 2024-04-29 | Towards Dog Bark Decoding: Leveraging Human Speech Processing for Automated Bark Classification | Artem Abzaliev et.al. | 2404.18739 | null |
| 2024-04-26 | Child Speech Recognition in Human-Robot Interaction: Problem Solved? | Ruben Janssens et.al. | 2404.17394 | null |
| 2024-04-26 | Automatic Speech Recognition System-Independent Word Error Rate Estimation | Chanho Park et.al. | 2404.16743 | null |
| 2024-04-26 | Developing Acoustic Models for Automatic Speech Recognition in Swedish | Giampiero Salvi et.al. | 2404.16547 | null |
| 2024-04-25 | U2++ MoE: Scaling 4.7x parameters with minimal impact on RTF | Xingchen Song et.al. | 2404.16407 | null |
| 2024-04-24 | Mamba-360: Survey of State Space Models as Transformer Alternative for Long Sequence Modelling: Methods, Applications, and Challenges | Badri Narayana Patro et.al. | 2404.16112 | link |
| 2024-04-23 | Killkan: The Automatic Speech Recognition Dataset for Kichwa with Morphosyntactic Information | Chihiro Taguchi et.al. | 2404.15501 | link |
| 2024-04-18 | Artificial Neural Networks to Recognize Speakers Division from Continuous Bengali Speech | Hasmot Ali et.al. | 2404.15168 | null |
| 2024-04-23 | Rethinking Processing Distortions: Disentangling the Impact of Speech Enhancement Errors on Speech Recognition Performance | Tsubasa Ochiai et.al. | 2404.14860 | null |
| 2024-04-22 | Assessment of Sign Language-Based versus Touch-Based Input for Deaf Users Interacting with Intelligent Personal Assistants | Nina Tran et.al. | 2404.14605 | null |
| 2024-04-22 | Exploring neural oscillations during speech perception via surrogate gradient spiking neural networks | Alexandre Bittar et.al. | 2404.14024 | null |
| 2024-04-20 | Semantically Corrected Amharic Automatic Speech Recognition | Samuael Adnew et.al. | 2404.13362 | link |
| 2024-04-19 | Learn2Talk: 3D Talking Face Learns from 2D Talking Face | Yixiang Zhuang et.al. | 2404.12888 | null |
| 2024-04-19 | Efficient infusion of self-supervised representations in Automatic Speech Recognition | Darshan Prabhu et.al. | 2404.12628 | null |
| 2024-04-16 | Teaching a Multilingual Large Language Model to Understand Multilingual Speech via Multi-Instructional Training | Pavel Denisov et.al. | 2404.10922 | link |
| 2024-04-16 | Anatomy of Industrial Scale Multilingual ASR | Francis McCann Ramirez et.al. | 2404.09841 | null |
| 2024-04-15 | Resilience of Large Language Models for Noisy Instructions | Bin Wang et.al. | 2404.09754 | null |
| 2024-04-12 | Comparing Apples to Oranges: LLM-powered Multimodal Intention Prediction in an Object Categorization Task | Hassan Ali et.al. | 2404.08424 | null |
| 2024-07-26 | Automatic Speech Recognition Advancements for Indigenous Languages of the Americas | Monica Romero et.al. | 2404.08368 | null |
| 2024-04-10 | An inclusive review on deep learning techniques and their scope in handwriting recognition | Sukhdeep Singh et.al. | 2404.08011 | null |
| 2024-04-12 | An Effective Automated Speaking Assessment Approach to Mitigating Data Scarcity and Imbalanced Distribution | Tien-Hong Lo et.al. | 2404.07575 | null |
| 2024-04-12 | Conformer-1: Robust ASR via Large-Scale Semisupervised Bootstrapping | Kevin Zhang et.al. | 2404.07341 | null |
| 2024-03-31 | Houston we have a Divergence: A Subgroup Performance Analysis of ASR Models | Alkis Koudounas et.al. | 2404.07226 | null |
| 2024-04-10 | The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge | Yiwei Guo et.al. | 2404.06079 | null |
| 2024-05-28 | VietMed: A Dataset and Benchmark for Automatic Speech Recognition of Vietnamese in the Medical Domain | Khai Le-Duc et.al. | 2404.05659 | link |
| 2024-04-07 | Safeguarding Voice Privacy: Harnessing Near-Ultrasonic Interference To Protect Against Unauthorized Audio Recording | Forrest McKee et.al. | 2404.04769 | null |
| 2024-04-04 | Transducers with Pronunciation-aware Embeddings for Automatic Speech Recognition | Hainan Xu et.al. | 2404.04295 | null |
| 2024-04-03 | Mai Ho'omāuna i ka 'Ai: Language Models Improve Automatic Speech Recognition in Hawaiian | Kaavya Chaparala et.al. | 2404.03073 | null |
| 2024-04-03 | CMULAB: An Open-Source Framework for Training and Deployment of Natural Language Processing Models | Zaid Sheikh et.al. | 2404.02408 | link |
| 2024-04-02 | BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory Speech Recognition | Alexandros Haliassos et.al. | 2404.02098 | link |
| 2024-04-02 | Noise Masking Attacks and Defenses for Pretrained Speech Models | Matthew Jagielski et.al. | 2404.02052 | null |
| 2024-04-02 | Kallaama: A Transcribed Speech Dataset about Agriculture in the Three Most Widely Spoken Languages in Senegal | Elodie Gauthier et.al. | 2404.01991 | link |
| 2024-04-02 | Transfer Learning from Whisper for Microscopic Intelligibility Prediction | Paul Best et.al. | 2404.01737 | null |
| 2024-07-22 | ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models | Thibaut Thonet et.al. | 2403.20262 | link |
| 2024-03-28 | Multi-Stage Multi-Modal Pre-Training for Automatic Speech Recognition | Yash Jain et.al. | 2403.19822 | null |
| 2024-03-25 | Hierarchical Recurrent Adapters for Efficient Multi-Task Adaptation of Large Speech Models | Tsendsuren Munkhdalai et.al. | 2403.19709 | null |
| 2024-03-29 | Emotion Neural Transducer for Fine-Grained Speech Emotion Recognition | Siyuan Shen et.al. | 2403.19224 | null |
| 2024-03-28 | LV-CTC: Non-autoregressive ASR with CTC and latent variable models | Yuya Fujita et.al. | 2403.19207 | null |
| 2024-03-04 | JEP-KD: Joint-Embedding Predictive Architecture Based Knowledge Distillation for Visual Speech Recognition | Chang Sun et.al. | 2403.18843 | null |
| 2024-06-04 | PhysicsAssistant: An LLM-Powered Interactive Learning Robot for Physics Lab Investigations | Ehsan Latif et.al. | 2403.18721 | null |
| 2024-03-27 | ZAEBUC-Spoken: A Multilingual Multidialectal Arabic-English Speech Corpus | Injy Hamed et.al. | 2403.18182 | null |
| 2024-04-11 | DANCER: Entity Description Augmented Named Entity Corrector for Automatic Speech Recognition | Yi-Cheng Wang et.al. | 2403.17645 | null |
| 2024-03-26 | Extracting Biomedical Entities from Noisy Audio Transcripts | Nima Ebadi et.al. | 2403.17363 | null |
| 2024-03-25 | Grammatical vs Spelling Error Correction: An Investigation into the Responsiveness of Transformer-based Language Models using BART and MarianMT | Rohit Raju et.al. | 2403.16655 | null |
| 2024-03-22 | Privacy-Preserving End-to-End Spoken Language Understanding | Yinggui Wang et.al. | 2403.15510 | null |
| 2024-03-20 | Isometric Neural Machine Translation using Phoneme Count Ratio Reward-based Reinforcement Learning | Shivam Ratnakant Mhaskar et.al. | 2403.15469 | null |
| 2024-07-21 | Artificial Intelligence for Cochlear Implants: Review of Strategies, Challenges, and Perspectives | Billel Essaid et.al. | 2403.15442 | null |
| 2024-03-26 | A Multimodal Approach to Device-Directed Speech Detection with Large Language Models | Dominik Wagner et.al. | 2403.14438 | null |
| 2024-03-21 | XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception | HyoJung Han et.al. | 2403.14402 | null |
| 2024-06-04 | M | Zhe Chen et.al. | 2403.14168 | null |
| 2024-03-20 | Open Access NAO (OAN): a ROS2-based software framework for HRI applications with the NAO robot | Antonio Bono et.al. | 2403.13960 | null |
| 2024-03-20 | BanglaNum -- A Public Dataset for Bengali Digit Recognition from Speech | Mir Sayeed Mohammad et.al. | 2403.13465 | null |
| 2024-03-20 | Advanced Long-Content Speech Recognition With Factorized Neural Transducer | Xun Gong et.al. | 2403.13423 | null |
| 2024-03-21 | FlowerFormer: Empowering Neural Architecture Encoding using a Flow-aware Graph Transformer | Dongyeong Hwang et.al. | 2403.12821 | link |
| 2024-03-19 | Real-time Speech Extraction Using Spatially Regularized Independent Low-rank Matrix Analysis and Rank-constrained Spatial Covariance Matrix Estimation | Yuto Ishikawa et.al. | 2403.12477 | null |
| 2024-03-18 | Multimodal Human-Autonomous Agents Interaction Using Pre-Trained Language and Visual Foundation Models | Linus Nwankwo et.al. | 2403.12273 | null |
| 2024-03-18 | AdaMER-CTC: Connectionist Temporal Classification with Adaptive Maximum Entropy Regularization for Automatic Speech Recognition | SooHwan Eom et.al. | 2403.11578 | null |
| 2024-03-16 | Energy-Based Models with Applications to Speech and Language Processing | Zhijian Ou et.al. | 2403.10961 | null |
| 2024-03-16 | Initial Decoding with Minimally Augmented Language Model for Improved Lattice Rescoring in Low Resource ASR | Savitha Murthy et.al. | 2403.10937 | null |
| 2024-03-15 | Neural Networks Hear You Loud And Clear: Hearing Loss Compensation Using Deep Neural Networks | Peter Leer et.al. | 2403.10420 | null |
| 2024-03-14 | SpokeN-100: A Cross-Lingual Benchmarking Dataset for The Classification of Spoken Numbers in Different Languages | René Groh et.al. | 2403.09753 | link |
| 2024-03-15 | More than words: Advancements and challenges in speech recognition for singing | Anna Kruspe et.al. | 2403.09298 | null |
| 2024-05-21 | Skipformer: A Skip-and-Recover Strategy for Efficient Speech Recognition | Wenjing Zhu et.al. | 2403.08258 | null |
| 2024-03-13 | SpeechColab Leaderboard: An Open-Source Platform for Automatic Speech Recognition Evaluation | Jiayu Du et.al. | 2403.08196 | link |
| 2024-03-13 | Automatic Speech Recognition (ASR) for the Diagnosis of pronunciation of Speech Sound Disorders in Korean children | Taekyung Ahn et.al. | 2403.08187 | null |
| 2024-03-12 | Gujarati-English Code-Switching Speech Recognition using ensemble prediction of spoken language | Yash Sharma et.al. | 2403.08011 | null |
| 2024-03-11 | The evaluation of a code-switched Sepedi-English automatic speech recognition system | Amanda Phaladi et.al. | 2403.07947 | null |
| 2024-03-08 | Speech Robust Bench: A Robustness Benchmark For Speech Recognition | Muhammad A. Shah et.al. | 2403.07937 | null |
| 2024-03-12 | Beyond the Labels: Unveiling Text-Dependency in Paralinguistic Speech Recognition Datasets | Jan Pešán et.al. | 2403.07767 | null |
| 2024-03-11 | Real-Time Multimodal Cognitive Assistant for Emergency Medical Services | Keshara Weerasinghe et.al. | 2403.06734 | link |
| 2024-03-11 | Towards Decoupling Frontend Enhancement and Backend Recognition in Monaural Robust ASR | Yufeng Yang et.al. | 2403.06387 | null |
| 2024-03-10 | SCORE: Self-supervised Correspondence Fine-tuning for Improved Content Representations | Amit Meghanani et.al. | 2403.06260 | link |
| 2025-11-04 | Aligning Speech to Languages to Enhance Code-switching Speech Recognition | Hexin Liu et.al. | 2403.05887 | null |
| 2024-03-02 | A Cross-Modal Approach to Silent Speech with LLM-Enhanced Recognition | Tyler Benster et.al. | 2403.05583 | link |
| 2024-03-07 | Classist Tools: Social Class Correlates with Performance in NLP | Amanda Cercas Curry et.al. | 2403.04445 | null |
| 2024-05-30 | A New Benchmark for Evaluating Automatic Speech Recognition in the Arabic Call Domain | Qusai Abo Obaidah et.al. | 2403.04280 | null |
| 2024-03-07 | A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition | Yusheng Dai et.al. | 2403.04245 | link |
| 2024-03-06 | RADIA -- Radio Advertisement Detection with Intelligent Analytics | Jorge Álvarez et.al. | 2403.03538 | null |
| 2024-03-13 | Non-verbal information in spontaneous speech -- towards a new framework of analysis | Tirza Biron et.al. | 2403.03522 | null |
| 2024-03-05 | AIx Speed: Playback Speed Optimization Using Listening Comprehension of Speech Recognition Models | Kazuki Kawamura et.al. | 2403.02938 | null |
| 2024-03-04 | PixIT: Joint Training of Speaker Diarization and Speech Separation from Real-world Multi-speaker Recordings | Joonas Kalda et.al. | 2403.02288 | link |
| 2024-03-04 | What has LeBenchmark Learnt about French Syntax? | Zdravko Dugonjić et.al. | 2403.02173 | null |
| 2024-12-05 | EMOVOME: A Dataset for Emotion Recognition in Spontaneous Real-Life Speech | Lucía Gómez-Zaragozá et.al. | 2403.02167 | null |
| 2024-03-04 | SA-SOT: Speaker-Aware Serialized Output Training for Multi-Talker ASR | Zhiyun Fan et.al. | 2403.02010 | null |
| 2024-03-04 | Language and Speech Technology for Central Kurdish Varieties | Sina Ahmadi et.al. | 2403.01983 | link |
| 2024-03-03 | A Closer Look at Wav2Vec2 Embeddings for On-Device Single-Channel Speech Enhancement | Ravi Shankar et.al. | 2403.01369 | null |
| 2024-04-18 | Automatic Speech Recognition using Advanced Deep Learning Approaches: A survey | Hamza Kheddar et.al. | 2403.01255 | null |
| 2024-03-01 | Post-decoder Biasing for End-to-End Speech Recognition of Multi-turn Medical Interview | Heyang Liu et.al. | 2403.00370 | null |
| 2024-02-29 | Probing the Information Encoded in Neural-based Acoustic Models of Automatic Speech Recognition Systems | Quentin Raymondaud et.al. | 2402.19443 | null |
| 2024-02-29 | Inappropriate Pause Detection In Dysarthric Speech Using Large-Scale Speech Recognition | Jeehyun Lee et.al. | 2402.18923 | null |
| 2024-06-04 | Exploration of Adapter for Noise Robust Automatic Speech Recognition | Hao Shi et.al. | 2402.18275 | null |
| 2024-06-19 | Twists, Humps, and Pebbles: Multilingual Speech Recognition Models Exhibit Gender Performance Gaps | Giuseppe Attanasio et.al. | 2402.17954 | link |
| 2024-02-27 | An Effective Mixture-Of-Experts Approach For Code-Switching Speech Recognition Leveraging Encoder Disentanglement | Tzu-Ting Yang et.al. | 2402.17189 | null |
| 2024-02-27 | Extreme Encoder Output Frame Rate Reduction: Improving Computational Latencies of Large End-to-End Models | Rohit Prabhavalkar et.al. | 2402.17184 | null |
| 2024-04-01 | ArEEG_Chars: Dataset for Envisioned Speech Recognition using EEG for Arabic Characters | Hazem Darwish et.al. | 2402.15733 | null |
| 2024-05-14 | Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing | Jeong Hun Yeo et.al. | 2402.15151 | link |
| 2024-02-22 | Efficient data selection employing Semantic Similarity-based Graph Structures for model training | Roxana Petcu et.al. | 2402.14888 | null |
| 2024-02-22 | Wizard of Oz Experimentation for Language Technology Applications: Challenges and Tools | Stephan Schlögl et.al. | 2402.14563 | null |
| 2024-02-22 | HINT: High-quality INPainting Transformer with Mask-Aware Encoding and Enhanced Attention | Shuang Chen et.al. | 2402.14185 | link |
| 2024-02-21 | An Augmented Lagrangian Method for Training Recurrent Neural Networks | Yue Wang et.al. | 2402.13687 | null |
| 2024-02-22 | Mel-FullSubNet: Mel-Spectrogram Enhancement for Improving Both Speech Quality and ASR | Rui Zhou et.al. | 2402.13511 | null |
| 2024-02-20 | How do Hyenas deal with Human Speech? Speech Recognition and Translation with ConfHyena | Marco Gaido et.al. | 2402.13208 | link |
| 2024-02-20 | Not All Weights Are Created Equal: Enhancing Energy Efficiency in On-Device Streaming Speech Recognition | Yang Li et.al. | 2402.13076 | null |
| 2024-02-20 | Comparison of Conventional Hybrid and CTC/Attention Decoders for Continuous Visual Speech Recognition | David Gimeno-Gómez et.al. | 2402.13004 | null |
| 2024-06-16 | OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification | Yifan Peng et.al. | 2402.12654 | null |
| 2024-02-19 | Multimodal Emotion Recognition from Raw Audio with Sinc-convolution | Xiaohui Zhang et.al. | 2402.11954 | null |
| 2024-02-18 | Ain't Misbehavin' -- Using LLMs to Generate Expressive Robot Behavior in Conversations with the Tabletop Robot Haru | Zining Wang et.al. | 2402.11571 | null |
| 2024-02-18 | Cross-Attention Fusion of Visual and Geometric Features for Large Vocabulary Arabic Lipreading | Samar Daou et.al. | 2402.11520 | null |
| 2024-01-04 | AntiDeepFake: AI for Deep Fake Speech Recognition | Enkhtogtokh Togootogtokh et.al. | 2402.10218 | null |
| 2024-02-15 | A cross-talk robust multichannel VAD model for multiparty agent interactions trained using synthetic re-recordings | Hyewon Han et.al. | 2402.09797 | null |
| 2024-02-14 | Listening to Multi-talker Conversations: Modular and End-to-end Perspectives | Desh Raj et.al. | 2402.08932 | null |
| 2024-02-14 | UniEnc-CASSNAT: An Encoder-only Non-autoregressive ASR for Speech SSL Models | Ruchao Fan et.al. | 2402.08898 | null |
| 2024-02-13 | An Embarrassingly Simple Approach for LLM with Strong ASR Capacity | Ziyang Ma et.al. | 2402.08846 | link |
| 2024-02-13 | Syllable based DNN-HMM Cantonese Speech to Text System | Timothy Wong et.al. | 2402.08788 | null |
| 2024-05-03 | Careless Whisper: Speech-to-Text Hallucination Harms | Allison Koenecke et.al. | 2402.08021 | link |
| 2024-07-26 | AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension | Qian Yang et.al. | 2402.07729 | link |
| 2024-02-12 | The Sound of Healthcare: Improving Medical Transcription ASR Accuracy with Large Language Models | Ayo Adedeji et.al. | 2402.07658 | null |
| 2024-02-12 | The Balancing Act: Unmasking and Alleviating ASR Biases in Portuguese | Ajinkya Kulkarni et.al. | 2402.07513 | null |
| 2024-02-13 | SALAD: Smart AI Language Assistant Daily | Ragib Amin Nihal et.al. | 2402.07431 | null |
| 2024-02-11 | Does ChatGPT and Whisper Make Humanoid Robots More Relatable? | Xiaohui Chen et.al. | 2402.07095 | null |
| 2024-02-10 | DeepCover: Advancing RNN Test Coverage and Online Error Prediction using State Machine Extraction | Pouria Golshanrad et.al. | 2402.06966 | link |
| 2024-02-13 | CochCeps-Augment: A Novel Self-Supervised Contrastive Learning Using Cochlear Cepstrum-based Masking for Speech Emotion Recognition | Ioannis Ziogas et.al. | 2402.06923 | null |
| 2024-02-09 | Self-consistent context aware conformer transducer for speech recognition | Konstantin Kolokolov et.al. | 2402.06592 | null |
| 2024-02-08 | Unified Speech-Text Pretraining for Spoken Dialog Modeling | Heeseung Kim et.al. | 2402.05706 | null |
| 2024-02-08 | It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition | Chen Chen et.al. | 2402.05457 | null |
| 2024-02-07 | Progressive unsupervised domain adaptation for ASR using ensemble models and multi-stage training | Rehan Ahmad et.al. | 2402.04805 | null |
| 2024-05-28 | REBORN: Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR | Liang-Hsuan Tseng et.al. | 2402.03988 | link |
| 2024-02-05 | Resolving Transcription Ambiguity in Spanish: A Hybrid Acoustic-Lexical System for Punctuation Restoration | Xiliang Zhu et.al. | 2402.03519 | null |
| 2024-02-05 | A Comprehensive Study of the Current State-of-the-Art in Nepali Automatic Speech Recognition Systems | Rupak Raj Ghimire et.al. | 2402.03050 | null |
| 2024-02-03 | Predicting positive transfer for improved low-resource speech recognition using acoustic pseudo-tokens | Nay San et.al. | 2402.02302 | null |
| 2024-02-02 | Digits micro-model for accurate and secure transactions | Chirag Chhablani et.al. | 2402.01931 | null |
| 2024-02-02 | Whispering in Norwegian: Navigating Orthographic and Dialectic Challenges | Per E Kummervold et.al. | 2402.01917 | null |
| 2024-02-01 | Introduction to speech recognition | Gabriel Dauphin et.al. | 2402.01778 | null |
| 2024-02-02 | Streaming Sequence Transduction through Dynamic Compression | Weiting Tan et.al. | 2402.01172 | link |
| 2024-02-05 | AccentFold: A Journey through African Accents for Zero-Shot ASR Adaptation to Target Accents | Abraham Toluwase Owodunni et.al. | 2402.01152 | null |
| 2024-02-01 | Prosody in Cascade and Direct Speech-to-Text Translation: a case study on Korean Wh-Phrases | Giulio Zhou et.al. | 2402.00632 | null |
| 2024-01-31 | Exploring the limits of decoder-only models trained on public speech recognition corpora | Ankit Gupta et.al. | 2402.00235 | null |
| 2024-01-31 | SpeechComposer: Unifying Multiple Speech Tasks with Prompt Composition | Yihan Wu et.al. | 2401.18045 | null |
| 2024-02-08 | Computation and Parameter Efficient Multi-Modal Fusion Transformer for Cued Speech Recognition | Lei Liu et.al. | 2401.17604 | null |
| 2024-06-16 | OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer | Yifan Peng et.al. | 2401.16658 | null |
| 2024-01-28 | Phoneme-Based Proactive Anti-Eavesdropping with Controlled Recording Privilege | Peng Huang et.al. | 2401.15704 | null |
| 2024-01-28 | On Speaker Attribution with SURT | Desh Raj et.al. | 2401.15676 | link |
| 2024-01-28 | Byte Pair Encoding Is All You Need For Automatic Bengali Speech Recognition | Ahnaf Mozib Samin et.al. | 2401.15532 | null |
| 2024-01-27 | Towards Event Extraction from Speech with Contextual Clues | Jingqi Kang et.al. | 2401.15385 | link |
| 2024-01-26 | Comparison of parameters of vowel sounds of russian and english languages | V. I. Fedoseev et.al. | 2401.14890 | null |
| 2024-01-26 | Toward Practical Automatic Speech Recognition and Post-Processing: a Call for Explainable Error Benchmark Guideline | Seonmin Koo et.al. | 2401.14625 | null |
| 2024-01-25 | TDFNet: An Efficient Audio-Visual Speech Separation Model with Top-down Fusion | Samuel Pegg et.al. | 2401.14185 | link |
| 2024-01-24 | CNN architecture extraction on edge GPU | Peter Horvath et.al. | 2401.13575 | null |
| 2024-03-18 | SpeechDPR: End-to-End Spoken Passage Retrieval for Open-Domain Spoken Question Answering | Chyi-Jiunn Lin et.al. | 2401.13463 | null |
| 2024-05-28 | MF-AED-AEC: Speech Emotion Recognition by Leveraging Multimodal Fusion, Asr Error Detection, and Asr Error Correction | Jiajun He et.al. | 2401.13260 | null |
| 2024-01-23 | Locality enhanced dynamic biasing and sampling strategies for contextual ASR | Md Asif Jalal et.al. | 2401.13146 | null |
| 2024-01-23 | Multilingual and Fully Non-Autoregressive ASR with Large Language Model Fusion: A Comprehensive Study | W. Ronny Huang et.al. | 2401.12789 | null |
| 2024-01-22 | Consistency Based Unsupervised Self-training For ASR Personalisation | Jisi Zhang et.al. | 2401.12085 | null |
| 2024-01-22 | Lightweight Protection for Privacy in Offloaded Speech Understanding | Dongqi Cai et.al. | 2401.11983 | null |
| 2024-01-22 | Keep Decoding Parallel with Effective Knowledge Distillation from Language Models to End-to-end Speech Recognisers | Michael Hentschel et.al. | 2401.11700 | null |
| 2024-06-06 | Using Large Language Model for End-to-End Chinese ASR and NER | Yuang Li et.al. | 2401.11382 | null |
| 2024-02-02 | Word-Level ASR Quality Estimation for Efficient Corpus Sampling and Post-Editing through Analyzing Attentions of a Reference-Free Metric | Golara Javadi et.al. | 2401.11268 | link |
| 2024-01-20 | ConceptThread: Visualizing Threaded Concepts in MOOC Videos | Zhiguang Zhou et.al. | 2401.11132 | null |
| 2024-01-19 | Contextualized Automatic Speech Recognition with Attention-Based Bias Phrase Boosted Beam Search | Yui Sudo et.al. | 2401.10449 | null |
| 2024-01-19 | Investigating Training Strategies and Model Robustness of Low-Rank Adaptation for Language Modeling in Speech Recognition | Yu Yu et.al. | 2401.10447 | null |
| 2024-01-19 | Large Language Models are Efficient Learners of Noise-Robust Speech Recognition | Yuchen Hu et.al. | 2401.10446 | link |
| 2024-01-18 | AGADIR: Towards Array-Geometry Agnostic Directional Speech Recognition | Ju Lin et.al. | 2401.10411 | null |
| 2024-01-18 | Communication-Efficient Personalized Federated Learning for Speech-to-Text Tasks | Yichao Du et.al. | 2401.10070 | null |
| 2024-07-18 | Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation | Minsu Kim et.al. | 2401.09802 | null |
| 2024-07-02 | SlideAVSR: A Dataset of Paper Explanation Videos for Audio-Visual Speech Recognition | Hao Wang et.al. | 2401.09759 | null |
| 2024-01-12 | Transcending Controlled Environments Assessing the Transferability of ASRRobust NLU Models to Real-World Applications | Hania Khan et.al. | 2401.09354 | null |
| 2024-01-17 | On Speech Pre-emphasis as a Simple and Inexpensive Method to Boost Speech Enhancement | Iván López-Espejo et.al. | 2401.09315 | null |
| 2024-01-17 | Two-pass Endpoint Detection for Speech Recognition | Anirudh Raju et.al. | 2401.08916 | null |
| 2024-01-16 | NOTSOFAR-1 Challenge: New Datasets, Baseline, and Tasks for Distant Meeting Transcription | Alon Vinnikov et.al. | 2401.08887 | null |
| 2024-01-16 | Improving ASR Contextual Biasing with Guided Attention | Jiyang Tang et.al. | 2401.08835 | null |
| 2024-01-16 | Revisiting Self-supervised Learning of Speech Representation from a Mutual Information Perspective | Alexander H. Liu et.al. | 2401.08833 | null |
| 2024-03-01 | Multi-Input Multi-Output Target-Speaker Voice Activity Detection For Unified, Flexible, and Robust Audio-Visual Speaker Diarization | Ming Cheng et.al. | 2401.08052 | null |
| 2024-01-15 | Machine Perceptual Quality: Evaluating the Impact of Severe Lossy Compression on Audio and Image Models | Dan Jacobellis et.al. | 2401.07957 | link |
| 2024-07-24 | Cascaded Cross-Modal Transformer for Audio-Textual Classification | Nicolae-Catalin Ristea et.al. | 2401.07575 | link |
| 2024-01-15 | SeMaScore : a new evaluation metric for automatic speech recognition tasks | Zitha Sasindran et.al. | 2401.07506 | null |
| 2024-01-14 | Promptformer: Prompted Conformer Transducer for ASR | Sergio Duarte-Torres et.al. | 2401.07360 | null |
| 2024-01-13 | Joint Unsupervised and Supervised Training for Automatic Speech Recognition via Bilevel Optimization | A F M Saif et.al. | 2401.06980 | link |
| 2024-01-12 | XLS-R Deep Learning Model for Multilingual ASR on Low-Resource Languages: Indonesian, Javanese, and Sundanese | Panji Arisaputra et.al. | 2401.06832 | null |
| 2024-02-29 | The NPU-ASLP-LiAuto System Description for Visual Speech Recognition in CNVSRC 2023 | He Wang et.al. | 2401.06788 | link |
| 2024-01-15 | Dynamic Behaviour of Connectionist Speech Recognition with Strong Latency Constraints | Giampiero Salvi et.al. | 2401.06588 | null |
| 2024-01-12 | LCB-net: Long-Context Biasing for Audio-Visual Speech Recognition | Fan Yu et.al. | 2401.06390 | link |
| 2024-01-11 | End to end Hindi to English speech conversion using Bark, mBART and a finetuned XLSR Wav2Vec2 | Aniket Tathe et.al. | 2401.06183 | null |
| 2024-01-11 | UCorrect: An Unsupervised Framework for Automatic Speech Recognition Error Correction | Jiaxin Guo et.al. | 2401.05689 | null |
| 2024-01-10 | Useful Blunders: Can Automated Speech Recognition Errors Improve Downstream Dementia Classification? | Changye Li et.al. | 2401.05551 | null |
| 2024-01-10 | Towards Online Sign Language Recognition and Translation | Ronglai Zuo et.al. | 2401.05336 | link |
| 2024-07-17 | Continuously Learning New Words in Automatic Speech Recognition | Christian Huber et.al. | 2401.04482 | null |
| 2024-01-08 | High-precision Voice Search Query Correction via Retrievable Speech-text Embeddings | Christopher Li et.al. | 2401.04235 | null |
| 2024-07-22 | Cross-Speaker Encoding Network for Multi-Talker Speech Recognition | Jiawen Kang et.al. | 2401.04152 | link |
| 2024-01-08 | Exploratory Evaluation of Speech Content Masking | Jennifer Williams et.al. | 2401.03936 | null |
| 2024-03-07 | An audio-quality-based multi-strategy approach for target speaker extraction in the MISP 2023 Challenge | Runduo Han et.al. | 2401.03697 | null |
| 2024-06-10 | LUPET: Incorporating Hierarchical Information Path into Multilingual ASR | Wei Liu et.al. | 2401.03689 | null |
| 2024-01-08 | BS-PLCNet: Band-split Packet Loss Concealment Network with Multi-task Learning Framework and Multi-discriminators | Zihan Zhang et.al. | 2401.03687 | null |
| 2024-07-22 | DiarizationLM: Speaker Diarization Post-Processing with Large Language Models | Quan Wang et.al. | 2401.03506 | link |
| 2024-02-21 | ICMC-ASR: The ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition Challenge | He Wang et.al. | 2401.03473 | null |
| 2024-01-07 | Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation | Qiushi Zhu et.al. | 2401.03468 | link |
| 2024-04-08 | MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition | He Wang et.al. | 2401.03424 | null |
| 2024-01-06 | TeLeS: Temporal Lexeme Similarity Score to Estimate Confidence in End-to-End ASR | Nagarathna Ravi et.al. | 2401.03251 | link |
| 2024-01-06 | Part-of-Speech Tagger for Bodo Language using Deep Learning approach | Dhrubajyoti Pathak et.al. | 2401.03175 | null |
| 2024-01-05 | Towards ASR Robust Spoken Language Understanding Through In-Context Learning With Word Confusion Networks | Kevin Everson et.al. | 2401.02921 | null |
| 2024-01-05 | Nonlinear functional regression by functional deep neural network with kernel embedding | Zhongjie Shi et.al. | 2401.02890 | null |
| 2024-01-05 | A unified multichannel far-field speech recognition system: combining neural beamforming with attention based end-to-end model | Dongdi Zhao et.al. | 2401.02673 | null |
| 2024-01-04 | Task Oriented Dialogue as a Catalyst for Self-Supervised Automatic Speech Recognition | David M. Chan et.al. | 2401.02417 | link |
| 2024-01-04 | CTC Blank Triggered Dynamic Layer-Skipping for Efficient CTC-based Speech Recognition | Junfeng Hou et.al. | 2401.02046 | null |
| 2024-01-03 | Hallucinations in Neural Automatic Speech Recognition: Identifying Errors and Hallucinatory Models | Rita Frieske et.al. | 2401.01572 | null |
| 2024-06-04 | The Art of Deception: Robust Backdoor Attack using Dynamic Stacking of Triggers | Orson Mengara et.al. | 2401.01537 | null |
| 2024-01-01 | Enhancing Pre-trained ASR System Fine-tuning for Dysarthric Speech Recognition using Adversarial Data Augmentation | Huimeng Wang et.al. | 2401.00662 | null |
| 2024-05-02 | Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition | Vahid Noroozi et.al. | 2312.17279 | null |
| 2023-12-26 | The NUS-HLT System for ICASSP2024 ICMC-ASR Grand Challenge | Meng Ge et.al. | 2312.16002 | null |
| 2023-12-26 | Towards Probing Contact Center Large Language Models | Varun Nathan et.al. | 2312.15922 | null |
| 2023-12-24 | Exploring data augmentation in bias mitigation against non-native-accented speech | Yuanyuan Zhang et.al. | 2312.15499 | null |
| 2023-12-22 | BLSTM-Based Confidence Estimation for End-to-End Speech Recognition | Atsunori Ogawa et.al. | 2312.14609 | null |
| 2024-02-09 | Multimodal Attention Merging for Improved Speech Recognition and Audio Event Classification | Anirudh S. Sundar et.al. | 2312.14378 | null |
| 2024-07-22 | Multi-Sentence Grounding for Long-term Instructional Video | Zeqian Li et.al. | 2312.14055 | null |
| 2023-12-21 | BANSpEmo: A Bangla Emotional Speech Recognition Dataset | Md Gulzar Hussain et.al. | 2312.14020 | null |
| 2023-12-21 | Self-Supervised Adaptive AV Fusion Module for Pre-Trained ASR Models | Christopher Simic et.al. | 2312.13873 | null |
| 2024-02-03 | kNN-CTC: Enhancing ASR via Retrieval of CTC Pseudo Labels | Jiaming Zhou et.al. | 2312.13560 | link |
| 2025-01-14 | On the Effectiveness of ASR Representations in Real-world Noisy Speech Emotion Recognition | Xiaohan Shi et.al. | 2311.07093 | null |
| 2023-11-20 | Decoupling and Interacting Multi-Task Learning Network for Joint Speech and Accent Recognition | Qijie Shao et.al. | 2311.07062 | null |
| 2023-11-02 | An analysis of large speech models-based representations for speech emotion recognition | Adrian Bogdan Stânea et.al. | 2311.00394 | null |
| 2024-01-29 | Generative Speech Recognition Error Correction with Large Language Models and Task-Activating Prompting | Chao-Han Huck Yang et.al. | 2309.15649 | null |
| 2023-08-09 | Federated Representation Learning for Automatic Speech Recognition | Guruprasad V Ramesh et.al. | 2308.02013 | null |
| 2023-07-07 | Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition | Guinan Li et.al. | 2307.02909 | null |
| 2023-05-30 | HyperConformer: Multi-head HyperMixer for Efficient Speech Recognition | Florian Mai et.al. | 2305.18281 | null |
| 2023-04-24 | A vector quantized masked autoencoder for speech emotion recognition | Samir Sadok et.al. | 2304.11117 | null |
| 2023-03-06 | DWFormer: Dynamic Window transFormer for Speech Emotion Recognition | Shuaiqi Chen et.al. | 2303.01694 | null |
| 2024-11-08 | Pre-Finetuning for Few-Shot Emotional Speech Recognition | Maximillian Chen et.al. | 2302.12921 | null |
| 2023-03-07 | A Sidecar Separator Can Convert a Single-Talker Speech Recognition System to a Multi-Talker One | Lingwei Meng et.al. | 2302.09908 | null |
| 2022-11-16 | Improving Children's Speech Recognition by Fine-tuning Self-supervised Adult Speech Representations | Renee Lu et.al. | 2211.07769 | null |
| 2022-10-27 | Pretrained audio neural networks for Speech emotion recognition in Portuguese | Marcelo Matheus Gauy et.al. | 2210.14716 | null |
| 2022-04-07 | What can predictive speech coders learn from speaker recognizers? | Marcos Faundez-Zanuy et.al. | 2204.02400 | null |
| 2022-03-18 | Speaker Adaptation Using Spectro-Temporal Deep Features for Dysarthric and Elderly Speech Recognition | Mengzhe Geng et.al. | 2202.10290 | null |
| 2022-02-03 | Visualizing Automatic Speech Recognition -- Means for a Better Understanding? | Karla Markert et.al. | 2202.00673 | null |
| 2022-01-31 | Discovering Phonetic Inventories with Crosslingual Automatic Speech Recognition | Piotr Żelasko et.al. | 2201.11207 | null |
| 2021-12-22 | Voice Quality and Pitch Features in Transformer-Based Speech Recognition | Guillermo Cámbara et.al. | 2112.11391 | null |
| 2022-05-03 | Speech Pattern based Black-box Model Watermarking for Automatic Speech Recognition | Haozhe Chen et.al. | 2110.09814 | null |
| 2021-11-05 | Towards efficient end-to-end speech recognition with biologically-inspired neural networks | Thomas Bohnstingl et.al. | 2110.02743 | null |
| 2025-02-06 | Comparison of Self-Supervised Speech Pre-Training Methods on Flemish Dutch | Jakob Poncelet et.al. | 2109.14357 | null |
| 2021-07-27 | Differentiable Allophone Graphs for Language-Universal Speech Recognition | Brian Yan et.al. | 2107.11628 | null |
| 2021-07-06 | Arabic Code-Switching Speech Recognition using Monolingual Data | Ahmed Ali et.al. | 2107.01573 | null |
| 2021-07-05 | Supervised Contrastive Learning for Accented Speech Recognition | Tao Han et.al. | 2107.00921 | null |
| 2021-07-05 | Combining Frame-Synchronous and Label-Synchronous Systems for Speech Recognition | Qiujia Li et.al. | 2107.00764 | null |
| 2022-03-22 | Unsupervised Automatic Speech Recognition: A Review | Hanan Aldarmaki et.al. | 2106.04897 | null |
| 2021-10-05 | Non-autoregressive Mandarin-English Code-switching Speech Recognition | Shun-Po Chuang et.al. | 2104.02258 | null |
| 2021-02-16 | Thank you for Attention: A survey on Attention-based Artificial Neural Networks for Automatic Speech Recognition | Priyabrata Karmakar et.al. | 2102.07259 | null |
| 2021-02-01 | BCN2BRNO: ASR System Fusion for Albayzin 2020 Speech to Text Challenge | Martin Kocour et.al. | 2101.12729 | null |
| 2021-09-14 | Multi-task Language Modeling for Improving Speech Recognition of Rare Words | Chao-Han Huck Yang et.al. | 2011.11715 | null |
| 2020-11-13 | The CUHK-TUDELFT System for The SLT 2021 Children Speech Recognition Challenge | Si-Ioi Ng et.al. | 2011.06239 | null |
| 2020-11-10 | Data Augmentation For Children's Speech Recognition -- The "Ethiopian" System For The SLT 2021 Children Speech Recognition Challenge | Guoguo Chen et.al. | 2011.04547 | null |
| 2020-11-10 | Gated Recurrent Fusion with Joint Training Framework for Robust End-to-End Speech Recognition | Cunhang Fan et.al. | 2011.04249 | null |
| 2021-09-20 | TutorNet: Towards Flexible Knowledge Distillation for End-to-End Speech Recognition | Ji Won Yoon et.al. | 2008.00671 | null |
| 2020-10-06 | CTC-Segmentation of Large Corpora for German End-to-end Speech Recognition | Ludwig Kürzinger et.al. | 2007.09127 | null |
| 2020-06-04 | The NTNU System at the Interspeech 2020 Non-Native Children's Speech ASR Challenge | Tien-Hong Lo et.al. | 2005.08433 | null |
| 2020-04-20 | How to Teach DNNs to Pay Attention to the Visual Modality in Speech Recognition | George Sterpu et.al. | 2004.08250 | null |
| 2022-09-28 | The Effect of Silence Feature in Dimensional Speech Emotion Recognition | Bagus Tris Atmaja et.al. | 2003.01277 | null |
| 2020-03-02 | A Density Ratio Approach to Language Model Fusion in End-To-End Automatic Speech Recognition | Erik McDermott et.al. | 2002.11268 | null |
| 2020-01-08 | Domain Adaptation via Teacher-Student Learning for End-to-End Speech Recognition | Zhong Meng et.al. | 2001.01798 | null |
| 2020-01-08 | Character-Aware Attention-Based End-to-End Speech Recognition | Zhong Meng et.al. | 2001.01795 | null |
| 2023-05-23 | Leveraging End-to-End Speech Recognition with Neural Architecture Search | Ahmed Baruwa et.al. | 1912.05946 | null |
| 2019-11-21 | On using 2D sequence-to-sequence models for speech recognition | Parnia Bahar et.al. | 1911.08888 | null |
| 2019-11-13 | Recurrent Neural Network Transducer for Audio-Visual Speech Recognition | Takaki Makino et.al. | 1911.04890 | null |
| 2019-10-15 | VAIS ASR: Building a conversational speech recognition system using language model combination | Quang Minh Nguyen et.al. | 1910.05603 | null |
| 2020-05-08 | Self-Training for End-to-End Speech Recognition | Jacob Kahn et.al. | 1909.09116 | null |
| 2020-03-17 | Advancing Speech Recognition With No Speech Or With Noisy Speech | Gautam Krishna et.al. | 1906.08871 | null |
| 2019-04-26 | Phonetically-Oriented Word Error Alignment for Speech Recognition Error Analysis in Speech Translation | Nicholas Ruiz et.al. | 1904.11024 | null |
| 2019-07-10 | End-to-End Visual Speech Recognition for Small-Scale Datasets | Stavros Petridis et.al. | 1904.01954 | null |
| 2020-01-01 | A Convolutional Neural Network model based on Neutrosophy for Noisy Speech Recognition | Elyas Rashno et.al. | 1901.10629 | null |
| 2018-11-20 | Analysis of DNN Speech Signal Enhancement for Robust Speaker Recognition | Ondrej Novotny et.al. | 1811.07629 | null |
| 2018-11-13 | Reinforcement Learning Based Speech Enhancement for Robust Speech Recognition | Yih-Liang Shen et.al. | 1811.04224 | null |
| 2023-05-15 | End-to-end Audiovisual Speech Activity Detection with Bimodal Recurrent Neural Models | Fei Tao et.al. | 1809.04553 | null |
| 2018-09-13 | Isolated and Ensemble Audio Preprocessing Methods for Detecting Adversarial Examples against Automatic Speech Recognition | Krishan Rajaratnam et.al. | 1809.04397 | null |
| 2018-07-04 | Exploring End-to-End Techniques for Low-Resource Speech Recognition | Vladimir Bataev et.al. | 1807.00868 | null |
| 2018-05-29 | Automatic context window composition for distant speech recognition | Mirco Ravanelli et.al. | 1805.10498 | null |
| 2022-03-17 | Curriculum Learning for Speech Emotion Recognition from Crowdsourced Labels | Reza Lotfian et.al. | 1805.10339 | null |
| 2018-04-27 | End-to-End Multimodal Speech Recognition | Shruti Palaskar et.al. | 1804.09713 | null |
| 2018-10-17 | Deep Long Short-Term Memory Adaptive Beamforming Networks For Multichannel Robust Speech Recognition | Zhong Meng et.al. | 1711.08016 | null |
| 2019-05-01 | Unsupervised Adaptation with Domain Separation Networks for Robust Speech Recognition | Zhong Meng et.al. | 1711.08010 | null |
| 2018-02-23 | BridgeNets: Student-Teacher Transfer Learning Based on Recursive Neural Networks and its Application to Distant Speech Recognition | Jaeyoung Kim et.al. | 1710.10224 | null |
| 2018-06-29 | Combining Multiple Views for Visual Speech Recognition | Marina Zimmermann et.al. | 1710.07168 | null |
| 2018-04-26 | Visual speech recognition: aligning terminologies for better understanding | Helen L Bear et.al. | 1710.01292 | null |
| 2018-04-26 | Resolution limits on visual speech recognition | Helen L. Bear et.al. | 1710.01073 | null |
| 2017-09-01 | Leveraging Deep Neural Network Activation Entropy to cope with Unseen Data in Speech Recognition | Vikramjit Mitra et.al. | 1708.09516 | null |
| 2018-12-06 | Single-Channel Multi-talker Speech Recognition with Permutation Invariant Training | Yanmin Qian et.al. | 1707.06527 | null |
| 2017-11-16 | Rank-1 Constrained Multichannel Wiener Filter for Speech Recognition in Noisy Environments | Ziteng Wang et.al. | 1707.00201 | null |
| 2017-04-27 | Towards Estimating the Upper Bound of Visual-Speech Recognition: The Visual Lip-Reading Feasibility Database | Adriana Fernandez-Lopez et.al. | 1704.08028 | null |
| 2016-12-07 | Invariant Representations for Noisy Speech Recognition | Dmitriy Serdyuk et.al. | 1612.01928 | null |
| 2017-08-08 | Robust coherence-based spectral enhancement for speech recognition in adverse real-world environments | Hendrik Barfuss et.al. | 1604.03393 | null |
| 2015-09-25 | Noise-Robust ASR for the third 'CHiME' Challenge Exploiting Time-Frequency Masking based Multi-Channel Speech Enhancement and Recurrent Neural Network | Zaihu Pang et.al. | 1509.07211 | null |
| 2014-09-05 | Visual Speech Recognition | Ahmad B. A. Hassanat et.al. | 1409.1411 | null |
| 2014-02-12 | Modified SPLICE and its Extension to Non-Stereo Data for Noise Robust Speech Recognition | D. S. Pavan Kumar et.al. | 1307.4048 | null |

| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2026-03-05 | Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection | Junchuan Zhao et.al. | 2603.05373 | null |
| 2026-03-05 | Measuring the Redundancy of Decoder Layers in SpeechLLMs | Adel Moumen et.al. | 2603.05121 | null |
| 2026-03-04 | ZeSTA: Zero-Shot TTS Augmentation with Domain-Conditioned Training for Data-Efficient Personalized Speech Synthesis | Youngwon Choi et.al. | 2603.04219 | null |
| 2026-03-04 | VietNormalizer: An Open-Source, Dependency-Free Python Library for Vietnamese Text Normalization in TTS and NLP Applications | Hung Vu Nguyen et.al. | 2603.04145 | null |
| 2026-03-02 | More Data, Fewer Diacritics: Scaling Arabic TTS | Ahmed Musleh et.al. | 2603.01622 | null |
| 2026-03-02 | End-to-End Simultaneous Dysarthric Speech Reconstruction with Frame-Level Adaptor and Multiple Wait-k Knowledge Distillation | Minghui Wu et.al. | 2603.01382 | null |
| 2026-03-02 | DARS: Dysarthria-Aware Rhythm-Style Synthesis for ASR Enhancement | Minghui Wu et.al. | 2603.01369 | null |
| 2026-03-01 | S-VoCAL: A Dataset and Evaluation Framework for Inferring Speaking Voice Character Attributes in Literature | Abigail Berthe-Pardo et.al. | 2603.00958 | null |
| 2026-02-26 | Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems | Siyuan Liu et.al. | 2602.23266 | null |
| 2026-02-26 | TADA: A Generative Framework for Speech Modeling via Text-Acoustic Dual Alignment | Trung Dang et.al. | 2602.23068 | null |
| 2026-03-03 | Scalable Multilingual Multimodal Machine Translation with Speech-Text Fusion | Yexing Du et.al. | 2602.21646 | null |
| 2026-02-25 | The Design Space of Tri-Modal Masked Diffusion Models | Louis Bethune et.al. | 2602.21472 | null |
| 2026-02-23 | Can You Tell It's AI? Human Perception of Synthetic Voices in Vishing Scenarios | Zoha Hayat Bhatti et.al. | 2602.20061 | null |
| 2026-02-23 | CTC-TTS: LLM-based dual-streaming text-to-speech with CTC alignment | Hanwen Liu et.al. | 2602.19574 | null |
| 2026-02-19 | CC-G2PnP: Streaming Grapheme-to-Phoneme and prosody with Conformer-CTC for unsegmented languages | Yuma Shirahata et.al. | 2602.17157 | null |
| 2026-02-13 | Speech to Speech Synthesis for Voice Impersonation | Bjorn Johnson et.al. | 2602.16721 | null |
| 2026-02-18 | How to Label Resynthesized Audio: The Dual Role of Neural Audio Codecs in Audio Deepfake Detection | Yixuan Xiao et.al. | 2602.16343 | null |
| 2026-02-17 | LLM-to-Speech: A Synthetic Data Pipeline for Training Dialectal Text-to-Speech Models | Ahmed Khaled Khamis et.al. | 2602.15675 | null |
| 2026-03-03 | UniTAF: A Modular Framework for Joint Text-to-Speech and Audio-to-Face Modeling | Qiangong Zhou et.al. | 2602.15651 | null |
| 2026-02-16 | Disentangling Pitch and Creak for Speaker Identity Preservation in Speech Synthesis | Frederik Rautenberg et.al. | 2602.14686 | null |
| 2026-02-16 | Probing Human Articulatory Constraints in End-to-End TTS with Reverse and Mismatched Speech-Text Directions | Parth Khadse et.al. | 2602.14664 | null |
| 2026-02-14 | ELEAT-SAGA: Early & Late Integration with Evading Alternating Training for Spoof-Robust Speaker Verification | Amro Asali et.al. | 2602.13761 | null |
| 2026-02-13 | PISHYAR: A Socially Intelligent Smart Cane for Indoor Social Navigation and Multimodal Human-Robot Interaction for Visually Impaired People | Mahdi Haghighat Joo et.al. | 2602.12597 | null |
| 2026-02-16 | "Sorry, I Didn't Catch That": How Speech Models Miss What Matters Most | Kaitlyn Zhou et.al. | 2602.12249 | null |
| 2026-02-19 | When Audio-LLMs Don't Listen: A Cross-Linguistic Study of Modality Arbitration | Jayadev Billa et.al. | 2602.11488 | null |
| 2026-02-12 | SLD-L2S: Hierarchical Subspace Latent Diffusion for High-Fidelity Lip to Speech Synthesis | Yifan Liang et.al. | 2602.11477 | null |
| 2026-02-11 | Calliope: A TTS-based Narrated E-book Creator Ensuring Exact Synchronization, Privacy, and Layout Fidelity | Hugo L. Hammer et.al. | 2602.10735 | null |
| 2026-02-10 | Emotion-Coherent Speech Data Augmentation and Self-Supervised Contrastive Style Training for Enhancing Kids' Story Speech Synthesis | Raymond Chung et.al. | 2602.10164 | null |
| 2026-02-10 | Covo-Audio Technical Report | Wenfu Wang et.al. | 2602.09823 | null |
| 2026-02-10 | TVTSyn: Content-Synchronous Time-Varying Timbre for Streaming Voice Conversion and Anonymization | Waris Quamer et.al. | 2602.09389 | null |
| 2026-02-03 | DSFlow: Dual Supervision and Step-Aware Architecture for One-Step Flow Matching Speech Synthesis | Bin Lin et.al. | 2602.09041 | null |
| 2026-02-19 | Prototype-Based Disentanglement for Controllable Dysarthric Speech Synthesis | Haoshen Wang et.al. | 2602.08696 | null |
| 2026-02-08 | SoulX-Singer: Towards High-Quality Zero-Shot Singing Voice Synthesis | Jiale Qian et.al. | 2602.07803 | null |
| 2026-01-14 | PersonaPlex: Voice and Role Control for Full Duplex Conversational Speech Models | Rajarshi Roy et.al. | 2602.06053 | null |
| 2026-02-05 | ARCHI-TTS: A flow-matching-based Text-to-Speech Model with Self-supervised Semantic Aligner and Accelerated Inference | Chunyat Wu et.al. | 2602.05207 | null |
| 2026-02-04 | HoliAntiSpoof: Audio LLM for Holistic Speech Anti-Spoofing | Xuenan Xu et.al. | 2602.04535 | null |
| 2026-02-04 | PFluxTTS: Hybrid Flow-Matching TTS with Robust Cross-Lingual Voice Cloning and Inference-Time Model Fusion | Vikentii Pankov et.al. | 2602.04160 | null |
| 2026-02-03 | CoCoEmo: Composable and Controllable Human-Like Emotional TTS via Activation Steering | Siyi Wang et.al. | 2602.03420 | null |
| 2026-03-02 | WAXAL: A Large-Scale Multilingual African Language Speech Corpus | Abdoulaye Diack et.al. | 2602.02734 | null |
| 2026-02-01 | VividVoice: A Unified Framework for Scene-Aware Visually-Driven Speech Synthesis | Chengyuan Ma et.al. | 2602.02591 | null |
| 2026-02-02 | LipSody: Lip-to-Speech Synthesis with Enhanced Prosody Consistency | Jaejun Lee et.al. | 2602.01908 | null |
| 2026-02-01 | EmoAra: Emotion-Preserving English Speech Transcription and Cross-Lingual Translation with Arabic Text-to-Speech | Besher Hassan et.al. | 2602.01170 | null |
| 2026-02-01 | Bias in the Ear of the Listener: Assessing Sensitivity in Audio Language Models Across Linguistic, Demographic, and Positional Variations | Sheng-Lun Wei et.al. | 2602.01030 | null |
| 2026-01-31 | Edit Content, Preserve Acoustics: Imperceptible Text-Based Speech Editing via Self-Consistency Rewards | Yong Ren et.al. | 2602.00560 | null |
| 2026-01-30 | Multi-Speaker Conversational Audio Deepfake: Taxonomy, Dataset and Pilot Study | Alabi Ahmed et.al. | 2602.00295 | null |
| 2026-01-30 | Now You Hear Me: Audio Narrative Attacks Against Large Audio-Language Models | Ye Yu et.al. | 2601.23255 | null |
| 2026-01-30 | EmoShift: Lightweight Activation Steering for Enhanced Emotion-Aware Speech Synthesis | Li Zhou et.al. | 2601.22873 | null |
| 2026-01-30 | Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability | Yong Ren et.al. | 2601.22661 | null |
| 2026-01-29 | Speech Quality-Based Localization of Low-Quality Speech and Text-to-Speech Synthesis Artefacts | Michael Kuhlmann et.al. | 2601.21886 | null |
| 2026-01-28 | Audio Deepfake Detection in the Age of Advanced Text-to-Speech models | Robin Singh et.al. | 2601.20510 | null |
| 2026-01-28 | Erasing Your Voice Before It's Heard: Training-free Speaker Unlearning for Zero-shot Text-to-Speech | Myungjin Lee et.al. | 2601.20481 | null |
| 2026-01-29 | Unit-Based Agent for Semi-Cascaded Full-Duplex Dialogue Systems | Haoyuan Yu et.al. | 2601.20230 | null |
| 2026-01-27 | T-Mimi: A Transformer-based Mimi Decoder for Real-Time On-Phone TTS | Haibin Wu et.al. | 2601.20094 | null |
| 2026-01-27 | Phonological Tokenizer: Prosody-Aware Phonetic Token via Multi-Objective Fine-Tuning with Differentiable K-Means | Kentaro Onda et.al. | 2601.19781 | null |
| 2026-01-26 | Neural Multi-Speaker Voice Cloning for Nepali in Low-Resource Settings | Aayush M. Shrestha et.al. | 2601.18694 | null |
| 2026-01-26 | UrgentMOS: Unified Multi-Metric and Preference Learning for Robust Speech Quality Assessment | Wei Wang et.al. | 2601.18438 | null |
| 2026-01-25 | Quran-MD: A Fine-Grained Multilingual Multimodal Dataset of the Quran | Muhammad Umar Salman et.al. | 2601.17880 | null |
| 2026-01-23 | SonoEdit: Null-Space Constrained Knowledge Editing for Pronunciation Correction in LLM-Based TTS | Ayush Pratap Singh et.al. | 2601.17086 | null |
| 2026-01-16 | AI-based System for Transforming text and sound to Educational Videos | M. E. ElAlami et.al. | 2601.17022 | null |
| 2026-01-16 | ES4R: Speech Encoding Based on Prepositive Affective Modeling for Empathetic Response Generation | Zhuoyue Gao et.al. | 2601.16225 | null |
| 2026-01-22 | Timbre-Aware LLM-based Direct Speech-to-Speech Translation Extendable to Multiple Language Pairs | Lalaram Arya et.al. | 2601.16023 | null |
| 2026-01-22 | Qwen3-TTS Technical Report | Hangrui Hu et.al. | 2601.15621 | link |
| 2026-01-22 | DeepASMR: LLM-Based Zero-Shot ASMR Speech Generation for Anyone of Any Voice | Leying Zhang et.al. | 2601.15596 | null |
| 2026-01-20 | Prosody-Guided Harmonic Attention for Phase-Coherent Neural Vocoding in the Complex Spectrum | Mohammed Salah Al-Radhi et.al. | 2601.14472 | null |
| 2026-01-28 | Quantifying Speaker Embedding Phonological Rule Interactions in Accented Speech Synthesis | Thanathai Lertpetchpun et.al. | 2601.14417 | null |
| 2026-01-20 | Habibi: Laying the Open-Source Foundation of Unified-Dialectal Arabic Speech Synthesis | Yushen Chen et.al. | 2601.13802 | null |
| 2026-01-19 | Lombard Speech Synthesis for Any Voice with Controllable Style Embeddings | Seymanur Akti et.al. | 2601.12966 | null |
| 2026-01-18 | A Unified Neural Codec Language Model for Selective Editable Text to Speech Generation | Hanchen Pei et.al. | 2601.12480 | null |
| 2026-01-18 | ParaMETA: Towards Learning Disentangled Paralinguistic Speaking Styles Representations from Speech | Haowei Lou et.al. | 2601.12289 | null |
| 2026-01-18 | Confidence-based Filtering for Speech Dataset Curation with Generative Speech Enhancement Using Discrete Tokens | Kazuki Yamauchi et.al. | 2601.12254 | null |
| 2026-01-16 | WenetSpeech-Wu: Datasets, Benchmarks, and Models for a Unified Chinese Wu Dialect Speech Processing Ecosystem | Chengyou Wang et.al. | 2601.11027 | null |
| 2026-01-15 | Unifying Speech Recognition, Synthesis and Conversion with Autoregressive Transformers | Runyuan Cai et.al. | 2601.10770 | null |
| 2026-01-20 | VoiceSculptor: Your Voice, Designed By You | Jingbin Hu et.al. | 2601.10629 | null |
| 2026-01-15 | STEAMROLLER: A Multi-Agent System for Inclusive Automatic Speech Recognition for People who Stutter | Ziqi Xu et.al. | 2601.10223 | null |
| 2026-01-13 | Decoding Order Matters in Autoregressive Speech Synthesis | Minghui Zhao et.al. | 2601.08450 | null |
| 2026-01-13 | Detecting Mental Manipulation in Speech via Synthetic Multi-Speaker Dialogue | Run Chen et.al. | 2601.08342 | null |
| 2026-03-02 | FOCAL: A Novel Benchmarking Technique for Multi-modal Agents | Anupam Purwar et.al. | 2601.07367 | null |
| 2026-02-05 | ESDD2: Environment-Aware Speech and Sound Deepfake Detection Challenge Evaluation Plan | Xueping Zhang et.al. | 2601.07303 | null |
| 2026-01-10 | Lightweight Resolution-Aware Audio Deepfake Detection via Cross-Scale Attention and Consistency Learning | K. A. Shahriar et.al. | 2601.06560 | null |
| 2026-01-09 | Pantagruel: Unified Self-Supervised Encoders for French Text and Speech | Phuong-Hang Le et.al. | 2601.05911 | null |
| 2026-01-14 | Afri-MCQA: Multimodal Cultural Question Answering for African Languages | Atnafu Lambebo Tonja et.al. | 2601.05699 | null |
| 2026-01-09 | SPAM: Style Prompt Adherence Metric for Prompt-based TTS | Chanhee Cho et.al. | 2601.05554 | null |
| 2026-01-08 | CosyEdit: Unlocking End-to-End Speech Editing Capability from Zero-Shot Text-to-Speech Models | Junyang Chen et.al. | 2601.05329 | null |
| 2026-01-08 | FlexiVoice: Enabling Flexible Style Control in Zero-Shot TTS with Natural Language Instructions | Dekun Chen et.al. | 2601.04656 | null |
| 2026-01-08 | LLMs-Integrated Automatic Hate Speech Recognition Using Controllable Text Generation Models | Ryutaro Oshima et.al. | 2601.04654 | null |
| 2026-01-09 | IndexTTS 2.5 Technical Report | Yunpei Li et.al. | 2601.03888 | null |
| 2026-01-14 | Stuttering-Aware Automatic Speech Recognition for Indonesian Language | Fadhil Muhammad et.al. | 2601.03727 | null |
| 2026-01-07 | Domain Adaptation of the Pyannote Diarization Pipeline for Conversational Indonesian Audio | Muhammad Daffa'i Rafi Prasetyo et.al. | 2601.03684 | null |
| 2026-01-07 | ReStyle-TTS: Relative and Continuous Style Control for Zero-Shot Speech Synthesis | Haitao Li et.al. | 2601.03632 | null |
| 2026-01-06 | Tigrinya Number Verbalization: Rules, Algorithm, and Implementation | Fitsum Gaim et.al. | 2601.03403 | null |
| 2026-01-06 | Segment-Aware Conditioning for Training-Free Intra-Utterance Emotion and Duration Control in Text-to-Speech | Qifan Liang et.al. | 2601.03170 | null |
| 2026-01-24 | XLSR-MamBo: Scaling the Hybrid Mamba-Attention Backbone for Audio Deepfake Detection | Kwok-Ho Ng et.al. | 2601.02944 | null |
| 2026-01-06 | Vulnerabilities of Audio-Based Biometric Authentication Systems Against Deepfake Speech Synthesis | Mengze Hong et.al. | 2601.02914 | null |
| 2026-01-06 | Vclip: Face-based Speaker Generation by Face-voice Association Learning | Yao Shi et.al. | 2601.02753 | null |
| 2026-01-05 | VocalBridge: Latent Diffusion-Bridge Purification for Defeating Perturbation-Based Voiceprint Defenses | Maryam Abbasihafshejani et.al. | 2601.02444 | null |
| 2026-01-05 | Towards Prosodically Informed Mizo TTS without Explicit Tone Markings | Abhijit Mohanta et.al. | 2601.02073 | null |
| 2026-01-08 | MM-Sonate: Multimodal Controllable Audio-Video Generation with Zero-Shot Voice Cloning | Chunyu Qiang et.al. | 2601.01568 | null |
| 2026-01-04 | OV-InstructTTS: Towards Open-Vocabulary Instruct Text-to-Speech | Yong Ren et.al. | 2601.01459 | null |
| 2026-01-02 | Improving Code-Switching Speech Recognition with TTS Data Augmentation | Yue Heng Yeo et.al. | 2601.00935 | null |
| 2026-01-01 | DepFlow: Disentangled Speech Generation to Mitigate Semantic Bias in Depression Detection | Yuxin Li et.al. | 2601.00303 | null |
| 2025-12-29 | AI4Reading: Chinese Audiobook Interpretation System Based on Multi-Agent Collaboration | Minjiang Huang et.al. | 2512.23300 | null |
| 2025-12-27 | ManchuTTS: Towards High-Quality Manchu Speech Synthesis via Flow Matching and Hierarchical Text Representation | Suhua Wang et.al. | 2512.22491 | null |
| 2025-12-25 | Zero-Shot to Zero-Lies: Detecting Bengali Deepfake Audio through Transfer Learning | Most. Sharmin Sultana Samu et.al. | 2512.21702 | null |
| 2026-01-20 | Fun-Audio-Chat Technical Report | Tongyi Fun Team et.al. | 2512.20156 | link |
| 2025-12-21 | Smark: A Watermark for Text-to-Speech Diffusion Models via Discrete Wavelet Transform | Yichuan Zhang et.al. | 2512.18791 | null |
| 2025-12-21 | Task Vector in TTS: Toward Emotionally Expressive Dialectal Speech Synthesis | Pengchao Feng et.al. | 2512.18699 | null |
| 2025-12-19 | Training Text-to-Speech Model with Purely Synthetic Data: Feasibility, Sensitivity, and Generalization Capability | Tingxiao Zhou et.al. | 2512.17356 | null |
| 2025-12-19 | Robust TTS Training via Self-Purifying Flow Matching for the WildSpoof 2026 TTS Track | June Young Yi et.al. | 2512.17293 | null |
| 2025-12-24 | Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs | Sara Papi et.al. | 2512.16378 | link |
| 2025-12-16 | Adapting Speech Language Model to Singing Voice Synthesis | Yiwen Zhao et.al. | 2512.14657 | null |
| 2025-12-16 | GLM-TTS Technical Report | Jiayan Cui et.al. | 2512.14291 | link |
| 2025-12-18 | A stylometric analysis of speaker attribution from speech transcripts | Cristina Aggazzotti et.al. | 2512.13667 | null |
| 2025-12-15 | Reproducing and Dissecting Denoising Language Models for Speech Recognition | Dorian Koch et.al. | 2512.13576 | null |
| 2026-01-04 | DisCo-Speech: Controllable Zero-Shot Speech Generation with A Disentangled Speech Codec | Tao Li et.al. | 2512.13251 | null |
| 2025-12-11 | CompanionCast: A Multi-Agent Conversational AI Framework with Spatial Audio for Social Co-Viewing Experiences | Yiyang Wang et.al. | 2512.10918 | null |
| 2025-12-10 | DMP-TTS: Disentangled multi-modal Prompting for Controllable Text-to-Speech with Chained Guidance | Kang Yin et.al. | 2512.09504 | null |
| 2025-12-09 | LG Uplus System with Multi-Speaker IDs and Discriminator-based Sub-Judges for the WildSpoof Challenge | Jinyoung Park et.al. | 2512.09000 | null |
| 2025-12-08 | Beyond Unified Models: A Service-Oriented Approach to Low Latency, Context Aware Phonemization for Real Time TTS | Mahta Fetrat et.al. | 2512.08006 | link |
| 2025-12-06 | Sanvaad: A Multimodal Accessibility Framework for ISL Recognition and Voice-Based Interaction | Kush Revankar et.al. | 2512.06485 | null |
| 2025-12-05 | SEA-SafeguardBench: Evaluating AI Safety in SEA Languages and Cultures | Panuthep Tasawong et.al. | 2512.05501 | null |
| 2025-11-23 | SyncVoice: Towards Video Dubbing with Vision-Augmented Pretrained TTS Model | Kaidi Wang et.al. | 2512.05126 | null |
| 2025-12-04 | HiPPO: Exploring A Novel Hierarchical Pronunciation Assessment Approach for Spoken Languages | Bi-Cheng Yan et.al. | 2512.04964 | null |
| 2025-12-04 | M3-TTS: Multi-modal DiT Alignment & Mel-latent for Zero-shot High-fidelity Speech Synthesis | Xiaopeng Wang et.al. | 2512.04720 | null |
| 2026-01-26 | RRPO: Robust Reward Policy Optimization for LLM-based Emotional TTS | Cong Wang et.al. | 2512.04552 | null |
| 2025-12-02 | How to DP-fy Your Data: A Practical Guide to Generating Synthetic Data With Differential Privacy | Natalia Ponomareva et.al. | 2512.03238 | null |
| 2025-12-02 | BOOM: Beyond Only One Modality KIT's Multimodal Multilingual Lecture Companion | Sai Koneru et.al. | 2512.02817 | null |
| 2025-12-02 | Hear What Matters! Text-conditioned Selective Video-to-Audio Generation | Junwon Lee et.al. | 2512.02650 | null |
| 2025-12-02 | Spoken Conversational Agents with Large Language Models | Chao-Han Huck Yang et.al. | 2512.02593 | null |
| 2025-12-01 | MCAT: Scaling Many-to-Many Speech-to-Text Translation with MLLMs to 70 Languages | Yexing Du et.al. | 2512.01512 | null |
| 2025-12-01 | fMRI2GES: Co-speech Gesture Reconstruction from fMRI Signal with Dual Brain Decoding Alignment | Chunzheng Zhu et.al. | 2512.01189 | null |
| 2025-11-30 | Arabic TTS with FastPitch: Reproducible Baselines, Adversarial Training, and Oversmoothing Analysis | Lars Nippert et.al. | 2512.00937 | null |
| 2025-12-03 | STCTS: Generative Semantic Compression for Ultra-Low Bitrate Speech via Explicit Text-Prosody-Timbre Decomposition | Siyu Wang et.al. | 2512.00451 | null |
| 2025-11-28 | OmniFusion: Simultaneous Multilingual Multimodal Translations via Modular Fusion | Sai Koneru et.al. | 2512.00234 | link |
| 2025-11-28 | CoordSpeaker: Exploiting Gesture Captioning for Coordinated Caption-Empowered Co-Speech Gesture Generation | Fengyi Fang et.al. | 2511.22863 | null |
| 2025-11-27 | PURE Codec: Progressive Unfolding of Residual Entropy for Speech Codec Learning | Jiatong Shi et.al. | 2511.22687 | null |
| 2025-11-27 | Joint Speech and Text Training for LLM-Based End-to-End Spoken Dialogue State Tracking | Katia Vendrame et.al. | 2511.22503 | null |
| 2025-11-27 | GLA-Grad++: An Improved Griffin-Lim Guided Diffusion Model for Speech Synthesis | Teysir Baoueb et.al. | 2511.22293 | null |
| 2025-11-27 | VSpeechLM: A Visual Speech Language Model for Visual Text-to-Speech Task | Yuyue Wang et.al. | 2511.22229 | null |
| 2025-11-27 | Layover or Direct Flight: Rethinking Audio-Guided Image Segmentation | Joel Alberto Santos et.al. | 2511.22025 | null |
| 2025-11-26 | Advancing Marine Bioacoustics with Deep Generative Models: A Hybrid Augmentation Strategy for Southern Resident Killer Whale Detection | Bruno Padovese et.al. | 2511.21872 | null |
| 2025-12-05 | Decoding inner speech with an end-to-end brain-to-text neural interface | Yizi Zhang et.al. | 2511.21740 | null |
| 2025-11-26 | Voice, Bias, and Coreference: An Interpretability Study of Gender in Speech Translation | Lina Conti et.al. | 2511.21517 | null |
| 2025-11-26 | Multi-Reward GRPO for Stable and Prosodic Single-Codebook TTS LLMs at Scale | Yicheng Zhong et.al. | 2511.21270 | null |
| 2025-11-26 | RosettaSpeech: Zero-Shot Speech-to-Speech Translation from Monolingual Data | Zhisheng Zheng et.al. | 2511.20974 | null |
| 2025-12-24 | SingingSDS: A Singing-Capable Spoken Dialogue System for Conversational Roleplay Applications | Jionghao Han et.al. | 2511.20972 | link |
| 2025-11-25 | Continual Audio Deepfake Detection via Universal Adversarial Perturbation | Wangjie Li et.al. | 2511.19974 | null |
| 2025-11-25 | It Hears, It Sees too: Multi-Modal LLM for Depression Detection By Integrating Visual Understanding into Audio Language Models | Xiangyu Zhao et.al. | 2511.19877 | null |
| 2025-11-24 | Dynamic Multi-Species Bird Soundscape Generation with Acoustic Patterning and 3D Spatialization | Ellie L. Zhang et.al. | 2511.19275 | null |
| 2025-11-24 | Context-Aware Whisper for Arabic ASR Under Linguistic Varieties | Bashar Talafha et.al. | 2511.18774 | null |
| 2025-12-03 | First Deep Learning Approach to Hammering Acoustics for Stem Stability Assessment in Total Hip Arthroplasty | Dongqi Zhu et.al. | 2511.18725 | null |
| 2025-11-24 | AIRHILT: A Human-in-the-Loop Testbed for Multimodal Conflict Detection in Aviation | Omar Garib et.al. | 2511.18718 | null |
| 2025-11-23 | InstructAudio: Unified speech and music generation with natural language instruction | Chunyu Qiang et.al. | 2511.18487 | null |
| 2025-11-23 | A Multimodal Conversational Agent for Tabular Data Analysis | Mohammad Nour Al Awad et.al. | 2511.18405 | null |
| 2025-11-22 | A superpersuasive autonomous policy debating system | Allen Roush et.al. | 2511.17854 | null |
| 2025-11-12 | Speech Recognition Model Improves Text-to-Speech Synthesis using Fine-Grained Reward | Guansu Wang et.al. | 2511.17555 | null |
| 2025-11-21 | AI in Music and Sound: Pedagogical Reflections, Post-Structuralist Approaches and Creative Outcomes in Seminar Practice | Guilherme Coelho et.al. | 2511.17425 | null |
| 2025-11-21 | Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM | Chiori Hori et.al. | 2511.17335 | null |
| 2025-11-20 | Revisiting Audio-language Pretraining for Learning General-purpose Audio Representation | Wei-Cheng Tseng et.al. | 2511.16757 | null |
| 2025-11-20 | Codec2Vec: Self-Supervised Speech Representation Learning Using Neural Speech Codecs | Wei-Cheng Tseng et.al. | 2511.16639 | null |
| 2025-11-20 | SceneGuard: Training-Time Voice Protection with Scene-Consistent Audible Background Noise | Rui Sang et.al. | 2511.16114 | null |
| 2025-11-26 | Step-Audio-R1 Technical Report | Fei Tian et.al. | 2511.15848 | link |
| 2025-11-24 | PresentCoach: Dual-Agent Presentation Coaching through Exemplars and Interactive Feedback | Sirui Chen et.al. | 2511.15253 | null |
| 2025-11-18 | Quality-Controlled Multimodal Emotion Recognition in Conversations with Identity-Based Transfer Learning and MAMBA Fusion | Zanxu Wang et.al. | 2511.14969 | null |
| 2025-11-18 | Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech | Nam-Gyu Kim et.al. | 2511.14824 | null |
| 2025-11-06 | The Impact of Prosodic Segmentation on Speech Synthesis of Spontaneous Speech | Julio Cesar Galdino et.al. | 2511.14779 | null |
| 2025-11-18 | Ground Truth Generation for Multilingual Historical NLP using LLMs | Clovis Gladstone et.al. | 2511.14688 | null |
| 2025-11-18 | TTA: Transcribe, Translate and Alignment for Cross-lingual Speech Representation | Wei Liu et.al. | 2511.14410 | null |
| 2025-11-19 | StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model | Yifan Yang et.al. | 2511.14223 | null |
| 2025-11-20 | FxSearcher: gradient-free text-driven audio transformation | Hojoon Ki et.al. | 2511.14138 | null |
| 2025-11-17 | Human-centric Maintenance Process Through Integration of AI, Speech, and AR | Parul Khanna et.al. | 2511.13918 | null |
| 2025-11-26 | Passive Dementia Screening via Facial Temporal Micro-Dynamics Analysis of In-the-Wild Talking-Head Video | Filippo Cenacchi et.al. | 2511.13802 | null |
| 2025-11-17 | Computational Measurement of Political Positions: A Review of Text-Based Ideal Point Estimation Algorithms | Patrick Parschan et.al. | 2511.13238 | null |
| 2025-11-24 | FoleyBench: A Benchmark For Video-to-Audio Models | Satvik Dixit et.al. | 2511.13219 | null |
| 2025-11-17 | Distinguishing Repetition Disfluency from Morphological Reduplication in Bangla ASR Transcripts: A Novel Corpus and Benchmarking Analysis | Zaara Zabeen Arpa et.al. | 2511.13159 | null |
| 2025-11-17 | A Smart-Glasses for Emergency Medical Services via Multimodal Multitask Learning | Liuyi Jin et.al. | 2511.13078 | null |
| 2025-11-16 | Improving Direct Persian-English Speech-to-Speech Translation with Discrete Units and Synthetic Parallel Data | Sina Rashidi et.al. | 2511.12690 | null |
| 2025-11-16 | Hi-Reco: High-Fidelity Real-Time Conversational Digital Humans | Hongbin Huang et.al. | 2511.12662 | null |
| 2025-11-23 | Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data | Yunxin Li et.al. | 2511.12609 | link |
| 2025-11-16 | DenseAnnotate: Enabling Scalable Dense Caption Collection for Images and 3D Scenes via Spoken Descriptions | Xiaoyu Lin et.al. | 2511.12452 | null |
| 2025-11-15 | VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing | Zhisheng Zheng et.al. | 2511.12347 | null |
| 2025-11-15 | Fusionista2.0: Efficiency Retrieval System for Large-Scale Datasets | Huy M. Le et.al. | 2511.12255 | null |
| 2025-10-27 | TimeStampEval: A Simple LLM Eval and a Little Fuzzy Matching Trick to Improve Search Accuracy | James McCammon et.al. | 2511.11594 | null |
| 2025-11-14 | Language-Aided State Estimation | Yuki Miyoshi et.al. | 2511.11285 | null |
| 2025-11-14 | CLARITY: Contextual Linguistic Adaptation and Accent Retrieval for Dual-Bias Mitigation in Text-to-Speech Generation | Crystal Min Hui Poon et.al. | 2511.11104 | null |
| 2025-11-14 | Synthetic Voices, Real Threats: Evaluating Large Text-to-Speech Models in Generating Harmful Audio | Guangke Chen et.al. | 2511.10913 | null |
| 2025-11-13 | Curved Worlds, Clear Boundaries: Generalizing Speech Deepfake Detection using Hyperbolic and Spherical Geometry Spaces | Farhan Sheth et.al. | 2511.10793 | null |
| 2025-11-12 | Do AI Voices Learn Social Nuances? A Case of Politeness and Speech Rate | Eyal Rabin et.al. | 2511.10693 | null |
| 2025-11-12 | StyleBreak: Revealing Alignment Vulnerabilities in Large Audio-Language Models via Style-Aware Audio Jailbreak | Hongyi Li et.al. | 2511.10692 | null |
| 2025-11-09 | Towards Fine-Grained Code-Switch Speech Translation with Semantic Space Alignment | Yan Gao et.al. | 2511.10670 | null |
| 2025-11-13 | VocalNet-M2: Advancing Low-Latency Spoken Language Modeling via Integrated Multi-Codebook Tokenization and Multi-Token Prediction | Yuhao Wang et.al. | 2511.10232 | null |
| 2025-11-14 | Speech-Audio Compositional Attacks on Multimodal LLMs and Their Mitigation with SALMONN-Guard | Yudong Yang et.al. | 2511.10222 | null |
| 2025-11-13 | FabasedVC: Enhancing Voice Conversion with Text Modality Fusion and Phoneme-Level SSL Features | Wenyu Wang et.al. | 2511.10112 | null |
| 2025-11-13 | Time-Layer Adaptive Alignment for Speaker Similarity in Flow-Matching Based Zero-Shot TTS | Haoyu Li et.al. | 2511.09995 | null |
| 2025-11-12 | End-to-end Contrastive Language-Speech Pretraining Model For Long-form Spoken Question Answering | Jiliang Hu et.al. | 2511.09282 | null |
| 2025-11-12 | POTSA: A Cross-Lingual Speech Alignment Framework for Low Resource Speech-to-Text Translation | Xuanchen Li et.al. | 2511.09232 | null |
| 2025-11-01 | Retrieval-Augmented Generation of Pediatric Speech-Language Pathology vignettes: A Proof-of-Concept Study | Yilan Liu et.al. | 2511.08600 | null |
| 2025-11-11 | ParliaBench: An Evaluation and Benchmarking Framework for LLM-Generated Parliamentary Speech | Marios Koniaris et.al. | 2511.08247 | null |
| 2025-11-11 | State of the Art in Text Classification for South Slavic Languages: Fine-Tuning or Prompting? | Taja Kuzman Pungeršek et.al. | 2511.07989 | null |
| 2025-11-30 | SpeechJudge: Towards Human-Level Judgment for Speech Naturalness | Xueyao Zhang et.al. | 2511.07931 | null |
| 2025-11-24 | SynTTS-Commands: A Public Dataset for On-Device KWS via TTS-Synthesized Multilingual Speech | Lu Gan et.al. | 2511.07821 | link |
| 2025-11-10 | Conditional Diffusion as Latent Constraints for Controllable Symbolic Music Generation | Matteo Pettenó et.al. | 2511.07156 | null |
| 2025-11-10 | Generating Novel and Realistic Speakers for Voice Conversion | Meiying Melissa Chen et.al. | 2511.07135 | null |
| 2025-11-10 | On the Joint Minimization of Regularization Loss Functions in Deep Variational Bayesian Methods for Attribute-Controlled Symbolic Music Generation | Matteo Pettenó et.al. | 2511.07118 | null |
| 2025-11-10 | E2E-VGuard: Adversarial Prevention for Production LLM-based End-To-End Speech Synthesis | Zhisheng Zhang et.al. | 2511.07099 | null |
| 2025-11-10 | MedVoiceBias: A Controlled Study of Audio LLM Behavior in Clinical Decision-Making | Zhi Rui Tam et.al. | 2511.06592 | null |
| 2025-11-09 | SAR-LM: Symbolic Audio Reasoning with Large Language Models | Termeh Taheri et.al. | 2511.06483 | null |
| 2025-11-18 | TalkSketch: Multimodal Generative AI for Real-time Sketch Ideation with Speech | Weiyan Shi et.al. | 2511.05817 | null |
| 2025-11-07 | Shared Latent Representation for Joint Text-to-Audio-Visual Synthesis | Dogucan Yaman et.al. | 2511.05432 | null |
| 2025-11-07 | Synthesizing speech with selected perceptual voice qualities - A case study with creaky voice | Frederik Rautenberg et.al. | 2511.05143 | null |
| 2025-11-06 | PromptSep: Generative Audio Separation via Multimodal Prompting | Yutong Wen et.al. | 2511.04623 | null |
| 2025-11-19 | Step-Audio-EditX Technical Report | Chao Yan et.al. | 2511.03601 | link |
| 2025-11-05 | Seeing What You Say: Expressive Image Generation from Speech | Jiyoung Lee et.al. | 2511.03423 | null |
| 2025-11-05 | TASU: Text-Only Alignment for Speech Understanding | Jing Peng et.al. | 2511.03310 | null |
| 2025-11-11 | How to Evaluate Speech Translation with Source-Aware Neural MT Metrics | Mauro Cettolo et.al. | 2511.03295 | null |
| 2025-11-05 | PolyNorm: Few-Shot LLM-Based Text Normalization for Text-to-Speech | Michel Wong et.al. | 2511.03080 | null |
| 2025-11-04 | Augmenting Open-Vocabulary Dysarthric Speech Assessment with Human Perceptual Supervision | Kaimeng Jia et.al. | 2511.02270 | null |
| 2025-11-03 | Toward Objective and Interpretable Prosody Evaluation in Text-to-Speech: A Linguistically Motivated Approach | Cedric Chan et.al. | 2511.02104 | null |
| 2025-11-03 | SeaLLMs-Audio: Large Audio-Language Models for Southeast Asia | Chaoqun Liu et.al. | 2511.01670 | null |
| 2025-11-03 | Speech-DRAME: A Framework for Human-Aligned Benchmarks in Speech Role-Play | Jiatong Shi et.al. | 2511.01261 | null |
| 2025-11-28 | LongCat-Flash-Omni Technical Report | Meituan LongCat Team et.al. | 2511.00279 | null |
| 2025-10-31 | Reconstructing Unseen Sentences from Speech-related Biosignals for Open-vocabulary Neural Communication | Deok-Seon Kim et.al. | 2510.27247 | null |
| 2025-10-30 | UniTok-Audio: A Unified Audio Generation Framework via Generative Modeling on Discrete Codec Tokens | Chengwei Liu et.al. | 2510.26372 | null |
| 2025-10-30 | SP-MCQA: Evaluating Intelligibility of TTS Beyond the Word Level | Hitomi Jin Ling Tee et.al. | 2510.26190 | null |
| 2025-10-30 | ALMGuard: Safety Shortcuts and Where to Find Them as Guardrails for Audio-Language Models | Weifei Jin et.al. | 2510.26096 | null |
| 2025-10-27 | SFMS-ALR: Script-First Multilingual Speech Synthesis with Adaptive Locale Resolution | Dharma Teja Donepudi et.al. | 2510.25178 | null |
| 2025-10-29 | Explainable Disentanglement on Discrete Speech Representations for Noise-Robust ASR | Shreyas Gopal et.al. | 2510.25150 | null |
| 2025-10-30 | Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech | Pedro Corrêa et.al. | 2510.25054 | null |
| 2025-10-28 | POWSM: A Phonetic Open Whisper-Style Speech Foundation Model | Chin-Jou Li et.al. | 2510.24992 | null |
| 2025-11-25 | Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation | Inclusion AI et.al. | 2510.24821 | null |
| 2025-11-28 | STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence | Zihan Liu et.al. | 2510.24693 | link |
| 2025-10-28 | Levée d'ambiguïtés par grammaires locales | Eric G. C. Laporte et.al. | 2510.24530 | null |
| 2025-10-28 | Bayesian Speech synthesizers Can Learn from Multiple Teachers | Ziyang Zhang et.al. | 2510.24372 | null |
| 2025-10-28 | Abjad AI at NADI 2025: CATT-Whisper: Multimodal Diacritic Restoration Using Text and Speech Representations | Ahmad Ghannam et.al. | 2510.24247 | null |
| 2025-10-28 | V-SAT: Video Subtitle Annotation Tool | Arpita Kundu et.al. | 2510.24180 | null |
| 2025-10-30 | TeleEgo: Benchmarking Egocentric AI Assistants in the Wild | Jiaqi Yan et.al. | 2510.23981 | null |
| 2025-10-28 | emg2speech: synthesizing speech from electromyography using self-supervised speech models | Harshavardhana T. Gowda et.al. | 2510.23969 | null |
| 2025-10-27 | AfriMTEB and AfriE5: Benchmarking and Adapting Text Embedding Models for African Languages | Kosei Uemura et.al. | 2510.23896 | null |
| 2025-11-01 | RoboOmni: Proactive Robot Manipulation in Omni-modal Context | Siyin Wang et.al. | 2510.23763 | link |
| 2025-10-28 | SoulX-Podcast: Towards Realistic Long-form Podcasts with Dialectal and Paralinguistic Diversity | Hanke Xie et.al. | 2510.23541 | null |
| 2025-10-29 | Are ASR foundation models generalized enough to capture features of regional dialects for low-resource languages? | Tawsif Tashwar Dipto et.al. | 2510.23252 | null |
| 2025-10-27 | Flexing in 73 Languages: A Single Small Model for Multilingual Inflection | Tomáš Sourada et.al. | 2510.23114 | null |
| 2025-10-27 | Adapting Speech Foundation Models with Large Language Models for Unified Speech Recognition | Jing-Xuan Zhang et.al. | 2510.22961 | null |
| 2025-10-30 | DiffRhythm 2: Efficient and High Fidelity Song Generation via Block Flow Matching | Yuepeng Jiang et.al. | 2510.22950 | null |
| 2025-10-26 | UltraVoice: Scaling Fine-Grained Style-Controlled Speech Conversations for Spoken Dialogue Models | Wenming Tu et.al. | 2510.22588 | link |
| 2025-10-25 | M-CIF: Multi-Scale Alignment For CIF-Based Non-Autoregressive ASR | Ruixiang Mao et.al. | 2510.22172 | null |
| 2025-10-23 | GuitarFlow: Realistic Electric Guitar Synthesis From Tablatures via Flow Matching and Style Transfer | Jackson Loth et.al. | 2510.21872 | null |
| 2025-10-24 | Foley Control: Aligning a Frozen Latent Text-to-Audio Model to Video | Ciara Rowles et.al. | 2510.21581 | null |
| 2025-10-24 | SindBERT, the Sailor: Charting the Seas of Turkish NLP | Raphael Scheible-Schmitt et.al. | 2510.21364 | null |
| 2025-10-30 | Elementary, My Dear Watson: Non-Invasive Neural Keyword Spotting in the LibriBrain Dataset | Gereon Elvers et.al. | 2510.21038 | null |
| 2025-10-27 | ReFESS-QI: Reference-Free Evaluation For Speech Separation With Joint Quality And Intelligibility Scoring | Ari Frummer et.al. | 2510.21014 | null |
| 2025-11-13 | Can Current Detectors Catch Face-to-Voice Deepfake Attacks? | Nguyen Linh Bao Nguyen et.al. | 2510.21004 | null |
| 2025-10-22 | Data-Centric Lessons To Improve Speech-Language Pretraining | Vishaal Udandarao et.al. | 2510.20860 | null |
| 2025-10-23 | CantoNLU: A benchmark for Cantonese natural language understanding | Junghyun Min et.al. | 2510.20670 | null |
| 2025-10-23 | Speaking Clearly: A Simplified Whisper-Based Codec for Low-Bitrate Speech Coding | Xin Zhang et.al. | 2510.20504 | null |
| 2025-10-23 | Vox-Evaluator: Enhancing Stability and Fidelity for Zero-shot TTS with A Multi-Level Evaluator | Hualei Wang et.al. | 2510.20210 | null |
| 2025-10-23 | Are Stereotypes Leading LLMs' Zero-Shot Stance Detection? | Anthony Dubreuil et.al. | 2510.20154 | null |
| 2025-10-23 | SpeechAgent: An End-to-End Mobile Infrastructure for Speech Impairment Assistance | Haowei Lou et.al. | 2510.20113 | null |
| 2025-10-22 | OmniMotion-X: Versatile Multimodal Whole-Body Motion Generation | Guowei Xu et.al. | 2510.19789 | null |
| 2025-10-23 | Adapting Multilingual Models to Code-Mixed Tasks via Model Merging | Prashant Kodali et.al. | 2510.19782 | null |
| 2025-10-22 | Style Attack Disguise: When Fonts Become a Camouflage for Adversarial Intent | Yangshijie Zhang et.al. | 2510.19641 | null |
| 2025-10-22 | Re-evaluating Minimum Bayes Risk Decoding for Automatic Speech Recognition | Yuu Jinnai et.al. | 2510.19471 | null |
| 2025-10-22 | EchoFake: A Replay-Aware Dataset for Practical Speech Deepfake Detection | Tong Zhang et.al. | 2510.19414 | null |
| 2025-10-22 | SONAR-SLT: Multilingual Sign Language Translation via Language-Agnostic Sentence Embedding Supervision | Yasser Hamidullah et.al. | 2510.19398 | null |
| 2025-10-22 | M3-SLU: Evaluating Speaker-Attributed Reasoning in Multimodal Large Language Models | Yejin Kwon et.al. | 2510.19358 | null |
| 2025-10-22 | Modeling Turn-Taking with Semantically Informed Gestures | Varsha Suresh et.al. | 2510.19350 | null |
| 2025-10-22 | Slot Filling as a Reasoning Task for SpeechLLMs | Kadri Hacioglu et.al. | 2510.19326 | null |
| 2025-10-21 | Steering Autoregressive Music Generation with Recursive Feature Machines | Daniel Zhao et.al. | 2510.19127 | null |
| 2025-11-07 | Re:Member: Emotional Question Generation from Personal Memories | Zackary Rackauckas et.al. | 2510.19030 | null |
| 2025-11-05 | StutterZero and StutterFormer: End-to-End Speech Conversion for Stuttering Transcription and Correction | Qianheng Xu et.al. | 2510.18938 | null |
| 2025-10-21 | ProLAP: Probabilistic Language-Audio Pre-Training | Toranosuke Manabe et.al. | 2510.18423 | null |
| 2025-10-21 | KrishokBondhu: A Retrieval-Augmented Voice-Based Agricultural Advisory Call Center for Bengali Farmers | Mohd Ruhul Ameen et.al. | 2510.18355 | null |
| 2025-10-21 | ParaStyleTTS: Toward Efficient and Robust Paralinguistic Style Control for Expressive Text-to-Speech Generation | Haowei Lou et.al. | 2510.18308 | link |
| 2025-10-20 | SARSteer: Safeguarding Large Audio Language Models via Safe-Ablated Refusal Steering | Weilin Lin et.al. | 2510.17633 | null |
| 2025-10-20 | ImaGGen: Zero-Shot Generation of Co-Speech Semantic Gestures Grounded in Language and Image Input | Hendric Voss et.al. | 2510.17617 | null |
| 2025-10-20 | Addressing Antisocial Behavior in Multi-Party Dialogs Through Multimodal Representation Learning | Hajar Bakarou et.al. | 2510.17289 | null |
| 2025-10-19 | Investigating Safety Vulnerabilities of Large Audio-Language Models Under Speaker Emotional Variations | Bo-Han Feng et.al. | 2510.16893 | link |
| 2025-12-14 | SAC: Neural Speech Codec with Semantic-Acoustic Dual-Stream Quantization | Wenxi Chen et.al. | 2510.16841 | link |
| 2025-10-19 | End-to-end Listen, Look, Speak and Act | Siyin Wang et.al. | 2510.16756 | null |
| 2025-10-19 | U-Codec: Ultra Low Frame-rate Neural Speech Codec for Fast High-fidelity Speech Generation | Xusheng Yang et.al. | 2510.16718 | null |
| 2025-10-19 | Zero- and One-Shot Data Augmentation for Sentence-Level Dysarthric Speech Recognition in Constrained Scenarios | Shiyao Wang et.al. | 2510.16700 | null |
| 2025-10-18 | Edge-Based Speech Transcription and Synthesis for Kinyarwanda and Swahili Languages | Pacome Simon Mbonimpa et.al. | 2510.16497 | null |
| 2025-10-18 | Probing the Hidden Talent of ASR Foundation Models for L2 English Oral Assessment | Fu-An Chao et.al. | 2510.16387 | null |
| 2025-10-17 | AsyncVoice Agent: Real-Time Explanation for LLM Planning and Reasoning | Yueqian Lin et.al. | 2510.16156 | null |
| 2025-10-17 | Leveraging LLMs for Context-Aware Implicit Textual and Multimodal Hate Speech Detection | Joshua Wolfe Brook et.al. | 2510.15685 | null |
| 2025-10-17 | SpikeVox: Towards Energy-Efficient Speech Therapy Framework with Spike-driven Generative Language Models | Rachmad Vidya Wicaksana Putra et.al. | 2510.15566 | null |
| 2025-10-17 | Extending Audio Context for Long-Form Understanding in Large Audio-Language Models | Yuatyong Chaichana et.al. | 2510.15231 | null |
| 2025-10-17 | LongCat-Audio-Codec: An Audio Tokenizer and Detokenizer Solution Designed for Speech Large Language Models | Xiaohan Zhao et.al. | 2510.15227 | null |
| 2025-10-16 | OmniMotion: Multimodal Motion Generation with Continuous Masked Autoregression | Zhe Li et.al. | 2510.14954 | null |
| 2025-10-16 | TASLA: Text-Aligned Speech Tokens with Multiple Layer-Aggregation | Ming-Hao Hsu et.al. | 2510.14934 | null |
| 2025-10-16 | TRI-DEP: A Trimodal Comparative Study for Depression Detection Using Speech, Text, and EEG | Annisaa Fitri Nurfidausi et.al. | 2510.14922 | null |
| 2025-10-16 | RLAIF-SPA: Optimizing LLM-based Emotional Speech Synthesis via RLAIF | Qing Yang et.al. | 2510.14628 | null |
| 2025-10-15 | Closing the Gap Between Text and Speech Understanding in LLMs | Santiago Cuervo et.al. | 2510.13632 | null |
| 2025-10-15 | Mismatch Aware Guidance for Robust Emotion Control in Auto-Regressive TTS Models | Yizhou Peng et.al. | 2510.13293 | null |
| 2025-10-23 | Continuous-Token Diffusion for Speaker-Referenced TTS in Multimodal LLMs | Xinlu He et.al. | 2510.12995 | null |
| 2025-10-15 | DiSTAR: Diffusion over a Scalable Token Autoregressive Representation for Speech Generation | Yakun Song et.al. | 2510.12210 | null |
| 2025-10-14 | Understanding the Modality Gap: An Empirical Study on the Speech-Text Alignment Mechanism of Large Speech Language Models | Bajian Xiang et.al. | 2510.12116 | null |
| 2025-10-13 | BridgeCode: A Dual Speech Representation Paradigm for Autoregressive Zero-Shot Text-to-Speech Synthesis | Jingyuan Xing et.al. | 2510.11646 | null |
| 2025-10-13 | Perturbation Self-Supervised Representations for Cross-Lingual Emotion TTS: Stage-Wise Modeling of Emotion and Speaker | Cheng Gong et.al. | 2510.11124 | null |
| 2025-10-14 | ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis | Mohammad Javad Ranjbar Kalahroodi et.al. | 2510.10774 | null |
| 2025-10-12 | End-to-end Speech Recognition with similar length speech and text | Peng Fan et.al. | 2510.10453 | null |
| 2025-10-17 | MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations | Wenxiang Guo et.al. | 2510.10396 | link |
| 2025-10-10 | O_O-VC: Synthetic Data-Driven One-to-One Alignment for Any-to-Any Voice Conversion | Huu Tuong Tu et.al. | 2510.09061 | link |
| 2025-10-09 | DialoSpeech: Dual-Speaker Dialogue Generation with LLM and Flow Matching | Hanke Xie et.al. | 2510.08373 | null |
| 2025-10-09 | IntMeanFlow: Few-step Speech Generation with Integral Velocity Distillation | Wei Wang et.al. | 2510.07979 | null |
| 2025-10-09 | Standard-to-Dialect Transfer Trends Differ across Text and Speech: A Case Study on Intent and Topic Classification in German Dialects | Verena Blaschke et.al. | 2510.07890 | null |
| 2025-10-08 | Making Machines Sound Sarcastic: LLM-Enhanced and Retrieval-Guided Sarcastic Speech Synthesis | Zhu Li et.al. | 2510.07096 | null |
| 2025-10-08 | Towards Responsible Evaluation for Text-to-Speech | Yifan Yang et.al. | 2510.06927 | null |
| 2025-10-08 | XLSR-Kanformer: A KAN-Intergrated model for Synthetic Speech Detection | Phuong Tuan Dat et.al. | 2510.06706 | null |
| 2025-10-07 | EverydayMMQA: A Multilingual and Multimodal Framework for Culturally Grounded Spoken Visual QA | Firoj Alam et.al. | 2510.06371 | null |
| 2025-10-08 | TokenChain: A Discrete Speech Chain via Semantic Token Modeling | Mingxuan Wang et.al. | 2510.06201 | null |
| 2025-10-07 | Latent Speech-Text Transformer | Yen-Ju Lu et.al. | 2510.06195 | null |
| 2025-10-07 | ECTSpeech: Enhancing Efficient Speech Synthesis via Easy Consistency Tuning | Tao Zhu et.al. | 2510.05984 | null |
| 2025-10-07 | Data-efficient Targeted Token-level Preference Optimization for LLM-based Text-to-Speech | Rikuto Kotoge et.al. | 2510.05799 | null |
| 2025-10-07 | Sparse deepfake detection promotes better disentanglement | Antoine Teissier et.al. | 2510.05696 | null |
| 2025-10-09 | Paper2Video: Automatic Video Generation from Scientific Papers | Zeyu Zhu et.al. | 2510.05096 | link |
| 2025-10-06 | Speak, Edit, Repeat: High-Fidelity Voice Editing and Zero-Shot TTS with Cross-Attentive Mamba | Baher Mohammad et.al. | 2510.04738 | null |
| 2025-11-20 | UniVoice: Unifying Autoregressive ASR and Flow-Matching based TTS with Large Language Models | Wenhao Guan et.al. | 2510.04593 | link |
| 2025-10-07 | Synthetic Audio Forensics Evaluation (SAFE) Challenge | Kirill Trapeznikov et.al. | 2510.03387 | null |
| 2025-10-03 | Flamed-TTS: Flow Matching Attention-Free Models for Efficient Generating and Dynamic Pacing Zero-shot Text-to-Speech | Hieu-Nghia Huynh-Nguyen et.al. | 2510.02848 | null |
| 2025-09-26 | KAME: Tandem Architecture for Enhancing Knowledge in Real-Time Speech-to-Speech Conversational AI | So Kuroki et.al. | 2510.02327 | null |
| 2025-09-24 | SpeechCT-CLIP: Distilling Text-Image Knowledge to Speech for Voice-Native Multimodal CT Analysis | Lukas Buess et.al. | 2510.02322 | null |
| 2025-10-02 | Emotional Text-To-Speech Based on Mutual-Information-Guided Emotion-Timbre Disentanglement | Jianing Yang et.al. | 2510.01722 | null |
| 2025-09-30 | BatonVoice: An Operationalist Framework for Enhancing Controllable Speech Synthesis with Linguistic Intelligence from LLMs | Yue Wang et.al. | 2509.26514 | link |
| 2025-09-30 | Optimizing Speech Language Models for Acoustic Consistency | Morteza Rohanian et.al. | 2509.26276 | null |
| 2025-09-30 | HiStyle: Hierarchical Style Embedding Predictor for Text-Prompt-Guided Controllable Speech Synthesis | Ziyu Zhang et.al. | 2509.25842 | null |
| 2025-09-30 | LTA-L2S: Lexical Tone-Aware Lip-to-Speech Synthesis for Mandarin with Cross-Lingual Transfer Learning | Kang Yang et.al. | 2509.25670 | null |
| 2025-09-29 | Emotion-Aligned Generation in Diffusion Text to Speech Models via Preference-Guided Optimization | Jiacheng Shi et.al. | 2509.25416 | null |
| 2025-09-29 | MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech | Chengyao Wang et.al. | 2509.25131 | link |
| 2025-09-30 | VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning | Xin Cheng et.al. | 2509.24773 | null |
| 2025-09-29 | VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning | Yixuan Zhou et.al. | 2509.24650 | null |
| 2025-09-29 | Word-Level Emotional Expression Control in Zero-Shot Text-to-Speech Synthesis | Tianrui Wang et.al. | 2509.24629 | null |
| 2025-09-29 | ISSE: An Instruction-Guided Speech Style Editing Dataset And Benchmark | Yun Chen et.al. | 2509.24570 | null |
| 2025-09-29 | UniFlow-Audio: Unified Flow Matching for Audio Generation from Omni-Modalities | Xuenan Xu et.al. | 2509.24391 | link |
| 2025-09-28 | Generalizable Speech Deepfake Detection via Information Bottleneck Enhanced Adversarial Alignment | Pu Huang et.al. | 2509.23618 | null |
| 2025-09-27 | BFA: Real-time Multilingual Text-to-speech Forced Alignment | Abdul Rehman et.al. | 2509.23147 | null |
| 2025-09-26 | ArFake: A Multi-Dialect Benchmark and Baselines for Arabic Spoof-Speech Detection | Mohamed Maged et.al. | 2509.22808 | null |
| 2025-09-25 | DiaMoE-TTS: A Unified IPA-Based Dialect TTS Framework with Mixture-of-Experts and Parameter-Efficient Zero-Shot Adaptation | Ziqi Chen et.al. | 2509.22727 | null |
| 2025-09-26 | Semantic-VAE: Semantic-Alignment Latent Representation for Better Speech Synthesis | Zhikang Niu et.al. | 2509.22167 | null |
| 2025-09-26 | Speaker Anonymisation for Speech-based Suicide Risk Detection | Ziyun Cui et.al. | 2509.22148 | null |
| 2025-09-26 | Comprehend and Talk: Text to Speech Synthesis via Dual Language Modeling | Junjie Cao et.al. | 2509.22062 | null |
| 2025-09-26 | Align2Speak: Improving TTS for Low Resource Languages via ASR-Guided Online Preference Optimization | Shehzeen Hussain et.al. | 2509.21718 | null |
| 2025-09-25 | UniSS: Unified Expressive Speech-to-Speech Translation with Your Voice | Sitong Cheng et.al. | 2509.21144 | null |
| 2025-09-27 | i-LAVA: Insights on Low Latency Voice-2-Voice Architecture for Agents | Anupam Purwar et.al. | 2509.20971 | null |
| 2025-09-26 | SPADE: Structured Pruning and Adaptive Distillation for Efficient LLM-TTS | Tan Dat Nguyen et.al. | 2509.20802 | null |
| 2025-09-24 | Objective Evaluation of Prosody and Intelligibility in Speech Synthesis via Conditional Prediction of Discrete Tokens | Ismail Rasim Ulgen et.al. | 2509.20485 | null |
| 2025-09-20 | Beyond Global Emotion: Fine-Grained Emotional Speech Synthesis with Dynamic Word-Level Modulation | Sirui Wang et.al. | 2509.20378 | null |
| 2025-09-24 | OLaPh: Optimal Language Phonemizer | Johannes Wirth et.al. | 2509.20086 | null |
| 2025-09-25 | Measuring Prosody Diversity in Zero-Shot TTS: A New Metric, Benchmark, and Exploration | Yifan Yang et.al. | 2509.19928 | null |
| 2025-09-24 | CoMelSinger: Discrete Token-Based Zero-Shot Singing Synthesis With Structured Melody Control and Guidance | Junchuan Zhao et.al. | 2509.19883 | null |
| 2025-09-24 | Eliminating stability hallucinations in LLM-based TTS models via attention guidance | ShiMing Wang et.al. | 2509.19852 | null |
| 2025-09-24 | Efficient Speech Watermarking for Speech Synthesis via Progressive Knowledge Distillation | Yang Cui et.al. | 2509.19812 | null |
| 2025-09-24 | PART: Progressive Alignment Representation Training for Multilingual Speech-To-Text with LLMs | Pei Zhang et.al. | 2509.19745 | null |
| 2025-09-24 | Selective Classifier-free Guidance for Zero-shot Text-to-speech | John Zheng et.al. | 2509.19668 | null |
| 2025-09-23 | HD-PPT: Hierarchical Decoding of Content- and Prompt-Preference Tokens for Instruction-based TTS | Sihang Nie et.al. | 2509.19001 | null |
| 2025-09-23 | Direct Preference Optimization for Speech Autoregressive Diffusion Models | Zhijun Liu et.al. | 2509.18928 | null |
| 2025-09-23 | Group Relative Policy Optimization for Text-to-Speech with Large Language Models | Chang Liu et.al. | 2509.18798 | null |
| 2025-09-23 | Explore the Reinforcement Learning for the LLM based ASR and TTS system | Changfeng Gao et.al. | 2509.18569 | null |
| 2025-09-23 | No Verifiable Reward for Prosody: Toward Preference-Guided Prosody Learning in TTS | Seungyoun Shin et.al. | 2509.18531 | null |
| 2025-10-13 | Discrete-Time Diffusion-Like Models for Speech Synthesis | Xiaozhou Tan et.al. | 2509.18470 | null |
| 2025-09-22 | TMD-TTS: A Unified Tibetan Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation | Yutong Liu et.al. | 2509.18060 | null |
| 2025-09-22 | Nord-Parl-TTS: Finnish and Swedish TTS Dataset from Parliament Speech | Zirui Li et.al. | 2509.17988 | null |
| 2025-09-22 | Audiobook-CC: Controllable Long-context Speech Generation for Multicast Audiobook | Min Liu et.al. | 2509.17516 | null |
| 2025-09-29 | Sidon: Fast and Robust Open-Source Multilingual Speech Restoration for Large-scale Dataset Cleansing | Wataru Nakata et.al. | 2509.17052 | link |
| 2025-09-21 | Bridging the gap between training and inference in LM-based TTS models | Ruonan Zhang et.al. | 2509.17021 | null |
| 2025-09-21 | MBCodec: Thorough disentangle for high-fidelity audio compression | Ruonan Zhang et.al. | 2509.17006 | null |
| 2025-09-19 | Fed-PISA: Federated Voice Cloning via Personalized Identity-Style Adaptation | Qi Wang et.al. | 2509.16010 | null |
| 2025-09-19 | VoXtream: Full-Stream Text-to-Speech with Extremely Low Latency | Nikita Torgashov et.al. | 2509.15969 | link |
| 2025-09-19 | Deep Dubbing: End-to-End Auto-Audiobook System with Text-to-Timbre and Context-Aware Instruct-TTS | Ziqi Dai et.al. | 2509.15845 | null |
| 2025-09-19 | LibriTTS-VI: A Public Corpus and Novel Methods for Efficient Voice Impression Control | Junki Ohmura et.al. | 2509.15626 | null |
| 2025-09-19 | Beyond Video-to-SFX: Video to Audio Synthesis with Environmentally Aware Speech | Xinlei Niu et.al. | 2509.15492 | null |
| 2025-09-18 | A Novel Semantic Compression Approach for Ultra-low Bandwidth Voice Communication | Ryan Collette et.al. | 2509.15462 | null |
| 2025-09-23 | Frustratingly Easy Data Augmentation for Low-Resource ASR | Katsumi Ibaraki et.al. | 2509.15373 | null |
| 2025-09-18 | Emotion-Aware Speech Generation with Character-Specific Voices for Comics | Zhiwen Qian et.al. | 2509.15253 | null |
| 2025-09-18 | Real-Time Streaming Mel Vocoding with Generative Flow Matching | Simon Welker et.al. | 2509.15085 | null |
| 2025-09-18 | MELA-TTS: Joint transformer-diffusion model with representation alignment for speech synthesis | Keyu An et.al. | 2509.14784 | null |
| 2025-09-19 | DAIEN-TTS: Disentangled Audio Infilling for Environment-Aware Text-to-Speech Synthesis | Ye-Xin Lu et.al. | 2509.14684 | null |
| 2025-09-18 | Stochastic Clock Attention for Aligning Continuous and Ordered Sequences | Hyungjoon Soh et.al. | 2509.14678 | null |
| 2025-09-20 | Cross-Lingual F5-TTS: Towards Language-Agnostic Voice Cloning and Speech Synthesis | Qingyu Liu et.al. | 2509.14579 | null |
| 2025-09-17 | SpeechOp: Inference-Time Task Composition for Generative Speech Processing | Justin Lovelace et.al. | 2509.14298 | null |
| 2025-10-01 | SpeechWeave: Diverse Multilingual Synthetic Text & Audio Data Generation Pipeline for Training Text to Speech Models | Karan Dua et.al. | 2509.14270 | null |
| 2025-09-17 | CS-FLEURS: A Massively Multilingual and Code-Switched Speech Dataset | Brian Yan et.al. | 2509.14161 | null |
| 2025-09-22 | Do You Hear What I Mean? Quantifying the Instruction-Perception Gap in Instruction-Guided Expressive Text-To-Speech Systems | Yi-Cheng Lin et.al. | 2509.13989 | null |
| 2025-10-15 | MSR-Codec: A Low-Bitrate Multi-Stream Residual Codec for High-Fidelity Speech Generation with Information Disentanglement | Jingyu Li et.al. | 2509.13068 | null |
| 2025-09-16 | A Lightweight Pipeline for Noisy Speech Voice Cloning and Accurate Lip Sync Synthesis | Javeria Amir et.al. | 2509.12831 | null |
| 2025-10-16 | Preservation of Language Understanding Capabilities in Speech-aware Large Language Models | Marek Kubis et.al. | 2509.12171 | null |
| 2025-09-29 | FuseCodec: Semantic-Contextual Fusion and Supervision for Neural Codecs | Md Mubtasim Ahasan et.al. | 2509.11425 | null |
| 2025-09-14 | Length-Aware Rotary Position Embedding for Text-Speech Alignment | Hyeongju Kim et.al. | 2509.11084 | null |
| 2025-09-12 | WhisTLE: Deeply Supervised, Text-Only Domain Adaptation for Pretrained Speech Recognition Transformers | Akshat Pandey et.al. | 2509.10452 | null |
| 2025-09-12 | Towards Data Drift Monitoring for Speech Deepfake Detection in the context of MLOps | Xin Wang et.al. | 2509.10086 | null |
| 2025-09-11 | DiTReducio: A Training-Free Acceleration for DiT-Based TTS via Progressive Calibration | Yanru Huo et.al. | 2509.09748 | null |
| 2025-09-12 | DiFlow-TTS: Discrete Flow Matching with Factorized Speech Tokens for Low-Latency Zero-Shot Text-To-Speech | Ngoc-Son Nguyen et.al. | 2509.09631 | null |
| 2025-09-11 | HISPASpoof: A New Dataset For Spanish Speech Forensics | Maria Risques et.al. | 2509.09155 | null |
| 2025-09-29 | Streaming Sequence-to-Sequence Learning with Delayed Streams Modeling | Neil Zeghidour et.al. | 2509.08753 | null |
| 2025-09-09 | ParCzech4Speech: A New Speech Corpus Derived from Czech Parliamentary Data | Vladislav Stankov et.al. | 2509.06675 | null |
| 2025-08-19 | Integrating Feedback Loss from Bi-modal Sarcasm Detector for Sarcastic Speech Synthesis | Zhu Li et.al. | 2508.13028 | null |
| 2025-10-07 | EmoSSLSphere: Multilingual Emotional Speech Synthesis with Spherical Vectors and Discrete Speech Tokens | Joonyong Park et.al. | 2508.11273 | null |
| 2025-08-08 | UniTalker: Conversational Speech-Visual Synthesis | Yifan Hu et.al. | 2508.04585 | null |
| 2025-08-29 | Parallel GPT: Harmonizing the Independence and Interdependence of Acoustic and Semantic Information for Zero-Shot Text-to-Speech | Jingyuan Xing et.al. | 2508.04141 | null |
| 2025-07-23 | AI Telephone Surveying: Automating Quantitative Data Collection with an AI Interviewer | Danny D. Leybzon et.al. | 2507.17718 | null |
| 2025-07-23 | BoSS: Beyond-Semantic Speech | Qing Wang et.al. | 2507.17563 | null |
| 2025-07-22 | SplitMeanFlow: Interval Splitting Consistency in Few-Step Generative Modeling | Yi Guo et.al. | 2507.16884 | null |
| 2025-07-15 | Evaluating Speech-to-Text x LLM x Text-to-Speech Combinations for AI Interview Systems | Nima Yazdani et.al. | 2507.16835 | null |
| 2025-07-21 | A2TTS: TTS for Low Resource Indian Languages | Ayush Singh Bhadoriya et.al. | 2507.15272 | null |
| 2025-07-21 | EchoVoices: Preserving Generational Voices and Memories for Seniors and Children | Haiying Xu et.al. | 2507.15221 | null |
| 2025-07-22 | Hear Your Code Fail, Voice-Assisted Debugging for Python | Sayed Mahbub Hasan Amiri et.al. | 2507.15007 | null |
| 2025-07-20 | DMOSpeech 2: Reinforcement Learning for Duration Prediction in Metric-Optimized Speech Synthesis | Yinghao Aaron Li et.al. | 2507.14988 | null |
| 2025-07-17 | A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models | Kirill Borodin et.al. | 2507.13563 | null |
| 2025-07-17 | NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech | Maksim Borisov et.al. | 2507.13155 | null |
| 2025-07-17 | Intelligent Virtual Sonographer (IVS): Enhancing Physician-Robot-Patient Communication | Tianyu Song et.al. | 2507.13052 | null |
| 2025-07-17 | Enkidu: Universal Frequential Perturbation for Real-Time Audio Privacy Protection against Voice Deepfakes | Zhou Feng et.al. | 2507.12932 | null |
| 2025-07-16 | Quantize More, Lose Less: Autoregressive Generation from Residually Quantized Speech Representations | Yichen Han et.al. | 2507.12197 | null |
| 2025-07-16 | EME-TTS: Unlocking the Emphasis and Emotion Link in Speech Synthesis | Haoxun Li et.al. | 2507.12015 | null |
| 2025-07-15 | Towards Scalable AASIST: Refining Graph Attention for Speech Deepfake Detection | Ivan Viakhirev et.al. | 2507.11777 | null |
| 2025-07-15 | P.808 Multilingual Speech Enhancement Testing: Approach and Results of URGENT 2025 Challenge | Marvin Sach et.al. | 2507.11306 | null |
| 2025-07-20 | Supporting SENCOTEN Language Documentation Efforts with Automatic Speech Recognition | Mengzhe Geng et.al. | 2507.10827 | null |
| 2025-07-14 | An Empirical Evaluation of AI-Powered Non-Player Characters' Perceived Realism and Performance in Virtual Reality Environments | Mikko Korkiakoski et.al. | 2507.10469 | null |
| 2025-07-12 | ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching | Han Zhu et.al. | 2507.09318 | null |
| 2025-07-12 | Voice Conversion for Lombard Speaking Style with Implicit and Explicit Acoustic Feature Conditioning | Dominika Woszczyk et.al. | 2507.09310 | null |
| 2025-07-12 | ClaritySpeech: Dementia Obfuscation in Speech | Dominika Woszczyk et.al. | 2507.09282 | null |
| 2025-07-11 | SemAlignVC: Enhancing zero-shot timbre conversion using semantic alignment | Shivam Mehta et.al. | 2507.09070 | null |
| 2025-07-11 | Exploiting Leaderboards for Large-Scale Distribution of Malicious Models | Anshuman Suri et.al. | 2507.08983 | null |
| 2025-07-06 | A Hybrid Machine Learning Framework for Optimizing Crop Selection via Agronomic and Economic Forecasting | Niranjan Mallikarjun Sindhur et.al. | 2507.08832 | null |
| 2025-07-11 | Unlocking Speech Instruction Data Potential with Query Rewriting | Yonghua Hei et.al. | 2507.08603 | null |
| 2025-07-11 | MIDI-VALLE: Improving Expressive Piano Performance Synthesis Through Neural Codec Language Modelling | Jingjing Tang et.al. | 2507.08530 | null |
| 2025-07-11 | Active Learning for Text-to-Speech Synthesis with Informative Sample Collection | Kentaro Seki et.al. | 2507.08319 | null |
| 2025-07-05 | RepeaTTS: Towards Feature Discovery through Repeated Fine-Tuning | Atli Sigurgeirsson et.al. | 2507.08012 | null |
| 2025-07-10 | SecureSpeech: Prompt-based Speaker and Content Protection | Belinda Soh Hui Hui et.al. | 2507.07799 | null |
| 2025-07-09 | Learning Japanese with Jouzu: Interaction Outcomes with Stylized Dialogue Fictional Agents | Zackary Rackauckas et.al. | 2507.06483 | null |
| 2025-07-08 | Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis | Xintong Hu et.al. | 2507.06116 | null |
| 2025-07-08 | Differentiable Reward Optimization for LLM based TTS system | Changfeng Gao et.al. | 2507.05911 | null |
| 2025-07-08 | OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model | Chen Wang et.al. | 2507.05177 | null |
| 2025-07-07 | Multi-Step Prediction and Control of Hierarchical Emotion Distribution in Text-to-Speech Synthesis | Sho Inoue et.al. | 2507.04598 | null |
| 2025-07-06 | TTS-CtrlNet: Time varying emotion aligned text-to-speech generation with ControlNet | Jaeseok Jeong et.al. | 2507.04349 | null |
| 2025-07-05 | PresentAgent: Multimodal Agent for Presentation Video Generation | Jingwei Shi et.al. | 2507.04036 | null |
| 2025-07-08 | Prosody Labeling with Phoneme-BERT and Speech Foundation Models | Tomoki Koriyama et.al. | 2507.03912 | null |
| 2025-07-05 | Traceable TTS: Toward Watermark-Free TTS with Strong Traceability | Yuxiang Zhao et.al. | 2507.03887 | null |
| 2025-07-14 | DeepGesture: A conversational gesture synthesis system based on emotions and semantics | Thanh Hoang-Minh et.al. | 2507.03147 | null |
| 2025-07-03 | Open-Source System for Multilingual Translation and Cloned Speech Synthesis | Mateo Cámara et.al. | 2507.02530 | null |
| 2025-07-03 | JoyTTS: LLM-based Spoken Chatbot With Voice Cloning | Fangru Zhou et.al. | 2507.02380 | null |
| 2025-07-02 | Analyzing and Improving Speaker Similarity Assessment for Speech Synthesis | Marc-André Carbonneau et.al. | 2507.02176 | null |
| 2025-07-08 | Pronunciation Editing for Finnish Speech using Phonetic Posteriorgrams | Zirui Li et.al. | 2507.02115 | null |
| 2025-07-02 | A Dataset for Automatic Assessment of TTS Quality in Spanish | Alejandro Sosa Welford et.al. | 2507.01805 | null |
| 2025-07-02 | Voice Conversion for Likability Control via Automated Rating of Speech Synthesis Corpora | Hitoshi Suda et.al. | 2507.01356 | null |
| 2025-07-08 | SpeechAccentLLM: A Unified Framework for Foreign Accent Conversion and Text to Speech | Zhuangfei Cheng et.al. | 2507.01348 | null |
| 2025-07-02 | Multi-interaction TTS toward professional recording reproduction | Hiroki Kanagawa et.al. | 2507.00808 | null |
| 2025-07-18 | MuteSwap: Visual-informed Silent Video Identity Conversion | Yifan Liu et.al. | 2507.00498 | null |
| 2025-06-30 | Collecting, Curating, and Annotating Good Quality Speech deepfake dataset for Famous Figures: Process and Challenges | Hashim Ali et.al. | 2507.00324 | null |
| 2025-06-30 | Investigating Stochastic Methods for Prosody Modeling in Speech Synthesis | Paul Mayer et.al. | 2507.00227 | null |
| 2025-06-30 | JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching | Mingi Kwon et.al. | 2506.23552 | null |
| 2025-06-29 | You Sound a Little Tense: L2 Tailored Clear TTS Using Durational Vowel Properties | Paige Tuttösí et.al. | 2506.23367 | null |
| 2025-06-27 | Evaluating Pointing Gestures for Target Selection in Human-Robot Collaboration | Noora Sassali et.al. | 2506.22116 | null |
| 2025-06-27 | Robust and Efficient Autoregressive Speech Synthesis with Dynamic Chunk-wise Prediction Policy | Bohan Li et.al. | 2506.22023 | null |
| 2025-06-23 | IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech | Siyi Zhou et.al. | 2506.21619 | null |
| 2025-06-27 | A Multi-Stage Framework for Multimodal Controllable Speech Synthesis | Rui Niu et.al. | 2506.20945 | null |
| 2025-06-25 | An Exploration of ECAPA-TDNN and x-vector Speaker Representations in Zero-shot Multi-speaker TTS | Marie Kunešová et.al. | 2506.20190 | null |
| 2025-06-24 | TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems | Christoph Minixhofer et.al. | 2506.19441 | null |
| 2025-06-23 | Selecting N-lowest scores for training MOS prediction models | Yuto Kondo et.al. | 2506.18326 | null |
| 2025-06-23 | Rethinking Mean Opinion Scores in Speech Quality Assessment: Aggregation through Quantized Distribution Fitting | Yuto Kondo et.al. | 2506.18307 | null |
| 2025-07-15 | JIS: A Speech Corpus of Japanese Idol Speakers with Various Speaking Styles | Yuto Kondo et.al. | 2506.18296 | null |
| 2025-06-21 | OpusLM: A Family of Open Unified Speech Language Models | Jinchuan Tian et.al. | 2506.17611 | null |
| 2025-06-20 | RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching | Hyun Joon Park et.al. | 2506.16741 | null |
| 2025-06-20 | LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization | Daejin Jo et.al. | 2506.16738 | null |
| 2025-06-20 | V-CASS: Vision-context-aware Expressive Speech Synthesis for Enhancing User Understanding of Videos | Qixin Wang et.al. | 2506.16716 | null |
| 2025-06-19 | Streaming Non-Autoregressive Model for Accent Conversion and Pronunciation Improvement | Tuan-Nam Nguyen et.al. | 2506.16580 | null |
| 2025-06-19 | InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems | Kexin Huang et.al. | 2506.16381 | link |
| 2025-06-19 | Optimizing Multilingual Text-To-Speech with Accents & Emotions | Pranav Pawar et.al. | 2506.16310 | null |
| 2025-06-18 | TTSOps: A Closed-Loop Corpus Optimization Framework for Training Multi-Speaker TTS Models from Dark Data | Kentaro Seki et.al. | 2506.15614 | null |
| 2025-06-18 | PredGen: Accelerated Inference of Large Language Models through Input-Time Speculation for Real-Time Speech Interaction | Shufan Li et.al. | 2506.15556 | null |
| 2025-06-18 | EmojiVoice: Towards long-term controllable expressivity in robot speech | Paige Tuttösí et.al. | 2506.15085 | null |
| 2025-06-18 | An accurate and revised version of optical character recognition-based speech synthesis using LabVIEW | Prateek Mehta et.al. | 2506.15029 | null |
| 2025-06-17 | Investigation of Zero-shot Text-to-Speech Models for Enhancing Short-Utterance Speaker Verification | Yiyang Zhao et.al. | 2506.14226 | null |
| 2025-06-17 | Pushing the Performance of Synthetic Speech Detection with Kolmogorov-Arnold Networks and Self-Supervised Learning Models | Tuan Dat Phuong et.al. | 2506.14153 | link |
| 2025-06-16 | EmoNews: A Spoken Dialogue System for Expressive News Conversations | Ryuki Matsuura et.al. | 2506.13894 | link |
| 2025-07-08 | Multimodal Integration Challenges in Emotionally Expressive Child Avatars for Training Applications | Pegah Salehi et.al. | 2506.13477 | null |
| 2025-06-20 | ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching | Han Zhu et.al. | 2506.13053 | link |
| 2025-06-14 | StreamMel: Real-Time Zero-shot Text-to-Speech via Interleaved Continuous Autoregressive Modeling | Hui Wang et.al. | 2506.12570 | null |
| 2025-06-14 | Phonikud: Hebrew Grapheme-to-Phoneme Conversion for Real-Time Text-to-Speech | Yakov Kolani et.al. | 2506.12311 | null |
| 2025-07-08 | S2ST-Omni: An Efficient Multilingual Speech-to-Speech Translation Framework via Seamless Speech-Text Alignment and Progressive Fine-tuning | Yu Pan et.al. | 2506.11160 | null |
| 2025-06-16 | A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data | Cheng-Kang Chou et.al. | 2506.11130 | null |
| 2025-06-10 | GUIRoboTron-Speech: Towards Automated GUI Agents Based on Speech Instructions | Wenkang Han et.al. | 2506.11127 | null |
| 2025-06-10 | ASRJam: Human-Friendly AI Speech Jamming to Prevent Automated Phone Scams | Freddie Grabovski et.al. | 2506.11125 | null |
| 2025-06-05 | Intelligibility of Text-to-Speech Systems for Mathematical Expressions | Sujoy Roychowdhury et.al. | 2506.11086 | null |
| 2025-06-12 | Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMs | Hayato Futami et.al. | 2506.10299 | null |
| 2025-07-10 | UmbraTTS: Adapting Text-to-Speech to Environmental Contexts with Flow Matching | Neta Glazer et.al. | 2506.09874 | null |
| 2025-06-15 | EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection | Christoph Schuhmann et.al. | 2506.09827 | null |
| 2025-06-11 | OmniDRCA: Parallel Speech-Text Foundation Model via Dual-Resolution Speech Representations and Contrastive Alignment | Chao-Hong Tan et.al. | 2506.09349 | link |
| 2025-06-11 | Ming-Omni: A Unified Multimodal Model for Perception and Generation | Inclusion AI et.al. | 2506.09344 | link |
| 2025-06-13 | Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model | Ailin Huang et.al. | 2506.08967 | null |
| 2025-06-10 | A Review on Score-based Generative Models for Audio Applications | Ge Zhu et.al. | 2506.08457 | null |
| 2025-06-09 | Seeing Voices: Generating A-Roll Video from Audio with Mirage | Aditi Sundararaman et.al. | 2506.08279 | null |
| 2025-06-09 | Transcript-Prompted Whisper with Dictionary-Enhanced Decoding for Japanese Speech Annotation | Rui Hu et.al. | 2506.07646 | null |
| 2025-06-07 | SynHate: Detecting Hate Speech in Synthetic Deepfake Audio | Rishabh Ranjan et.al. | 2506.06772 | null |
| 2025-06-09 | Voice Impression Control in Zero-Shot TTS | Keinichi Fujita et.al. | 2506.05688 | null |
| 2025-05-28 | Speaking images. A novel framework for the automated self-description of artworks | Valentine Bernasconi et.al. | 2506.05368 | null |
| 2025-06-05 | Grapheme-Coherent Phonemic and Prosodic Annotation of Speech by Implicit and Explicit Grapheme Conditioning | Hien Ohnaka et.al. | 2506.04527 | null |
| 2025-06-04 | Can we reconstruct a dysarthric voice with the large speech model Parler TTS? | Ariadna Sanchez et.al. | 2506.04397 | null |
| 2025-06-04 | HiFiTTS-2: A Large-Scale High Bandwidth Speech Dataset | Ryan Langman et.al. | 2506.04152 | null |
| 2025-07-23 | UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation | Jinting Wang et.al. | 2506.04134 | null |
| 2025-06-04 | A Novel Data Augmentation Approach for Automatic Speaking Assessment on Opinion Expressions | Chung-Chun Wang et.al. | 2506.04077 | null |
| 2025-06-04 | Kinship in Speech: Leveraging Linguistic Relatedness for Zero-Shot TTS in Indian Languages | Utkarsh Pathak et.al. | 2506.03884 | null |
| 2025-06-04 | Mark My Words: A Robust Multilingual Model for Punctuation in Text and Speech Transcripts | Sidharth Pulipaka et.al. | 2506.03793 | null |
| 2025-06-04 | Comparative Analysis of Fast and High-Fidelity Neural Vocoders for Low-Latency Streaming Synthesis in Resource-Constrained Environments | Reo Yoneyama et.al. | 2506.03554 | null |
| 2025-06-04 | BitTTS: Highly Compact Text-to-Speech Using 1.58-bit Quantization and Weight Indexing | Masaya Kawamura et.al. | 2506.03515 | null |
| 2025-06-03 | Controllable Text-to-Speech Synthesis with Masked-Autoencoded Style-Rich Representation | Yongqi Wang et.al. | 2506.02997 | null |
| 2025-06-03 | Towards a Japanese Full-duplex Spoken Dialogue System | Atsumoto Ohashi et.al. | 2506.02979 | null |
| 2025-06-03 | CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech | Helin Wang et.al. | 2506.02863 | null |
| 2025-06-03 | Prompt-Unseen-Emotion: Zero-shot Expressive Speech Synthesis with Prompt-LLM Contextual Knowledge for Mixed Emotions | Xiaoxue Gao et.al. | 2506.02742 | null |
| 2025-06-03 | Trusted Fake Audio Detection Based on Dirichlet Distribution | Chi Ding et.al. | 2506.02401 | null |
| 2025-06-02 | SALF-MOS: Speaker Agnostic Latent Features Downsampled for MOS Prediction | Saurabh Agrawal et.al. | 2506.02082 | null |
| 2025-06-02 | Speech-to-Speech Translation Pipelines for Conversations in Low-Resource Languages | Andrei Popescu-Belis et.al. | 2506.01406 | null |
| 2025-06-02 | Zero-Shot Text-to-Speech for Vietnamese | Thi Vu et.al. | 2506.01322 | null |
| 2025-06-02 | CleanS2S: Single-file Framework for Proactive Speech-to-Speech Interaction | Yudong Lu et.al. | 2506.01268 | null |
| 2025-06-02 | WCTC-Biasing: Retraining-free Contextual Biasing ASR with Wildcard CTC-based Keyword Spotting and Inter-layer Biasing | Yu Nakagome et.al. | 2506.01263 | null |
| 2025-06-01 | DS-TTS: Zero-Shot Speaker Style Adaptation from Voice Clips via Dynamic Dual-Style Feature Modulation | Ming Meng et.al. | 2506.01020 | null |
| 2025-06-01 | Counterfactual Activation Editing for Post-hoc Prosody and Mispronunciation Correction in TTS Models | Kyowoon Lee et.al. | 2506.00832 | null |
| 2025-05-31 | Chain-of-Thought Training for Open E2E Spoken Dialogue Systems | Siddhant Arora et.al. | 2506.00722 | null |
| 2025-05-30 | Werewolf: A Straightforward Game Framework with TTS for Improved User Engagement | Qihui Fan et.al. | 2506.00160 | null |
| 2025-05-30 | SwitchLingua: The First Large-Scale Multilingual and Multi-Ethnic Code-Switching Dataset | Peng Xie et.al. | 2506.00087 | null |
| 2025-05-30 | Speech Token Prediction via Compressed-to-fine Language Modeling for Speech Generation | Wenrui Liu et.al. | 2505.24496 | null |
| 2025-05-30 | DS-Codec: Dual-Stage Training with Mirror-to-NonMirror Architecture Switching for Speech Codec | Peijie Chen et.al. | 2505.24314 | null |
| 2025-05-29 | Can Emotion Fool Anti-spoofing? | Aurosweta Mahapatra et.al. | 2505.23962 | null |
| 2025-05-29 | Few-Shot Speech Deepfake Detection Adaptation with Gaussian Processes | Neta Glazer et.al. | 2505.23619 | link |
| 2025-05-29 | EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge | Ruskin Raj Manku et.al. | 2505.23009 | link |
| 2025-05-29 | LLM-Synth4KWS: Scalable Automatic Generation and Synthesis of Confusable Data for Custom Keyword Spotting | Pai Zhu et.al. | 2505.22995 | null |
| 2025-05-28 | BinauralFlow: A Causal and Streamable Approach for High-Quality Binaural Speech Synthesis with Flow Matching Models | Susan Liang et.al. | 2505.22865 | null |
| 2025-05-28 | Tell me Habibi, is it Real or Fake? | Kartik Kuckreja et.al. | 2505.22581 | null |
| 2025-05-28 | A Linguistically Motivated Analysis of Intonational Phrasing in Text-to-Speech Systems: Revealing Gaps in Syntactic Sensitivity | Charlotte Pouw et.al. | 2505.22236 | null |
| 2025-06-29 | Spotlight-TTS: Spotlighting the Style via Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech | Nam-Gyu Kim et.al. | 2505.20868 | null |
| 2025-05-26 | ArVoice: A Multi-Speaker Dataset for Arabic Speech Synthesis | Hawau Olamide Toyin et.al. | 2505.20506 | null |
| 2025-06-04 | Accelerating Flow-Matching-Based Text-to-Speech via Empirically Pruned Step Sampling | Qixi Zheng et.al. | 2505.19931 | null |
| 2025-05-26 | DiEmo-TTS: Disentangled Emotion Representations via Self-Supervised Distillation for Cross-Speaker Emotion Transfer in Text-to-Speech | Deok-Hyeon Cho et.al. | 2505.19687 | null |
| 2025-05-26 | KIT's Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization | Zhaolin Li et.al. | 2505.19679 | null |
| 2025-06-02 | Zero-Shot Streaming Text to Speech Synthesis with Transducer and Auto-Regressive Modeling | Haiyang Sun et.al. | 2505.19669 | null |
| 2025-05-30 | Accelerating Diffusion-based Text-to-Speech Model Training with Dual Modality Alignment | Jeongsoo Choi et.al. | 2505.19595 | link |
| 2025-05-26 | GSA-TTS: Toward Zero-Shot Speech Synthesis based on Gradual Style Adaptor | Seokgi Lee et.al. | 2505.19384 | null |
| 2025-05-25 | SpeakStream: Streaming Text-to-Speech with Interleaved Data | Richard He Bai et.al. | 2505.19206 | null |
| 2025-05-25 | CloneShield: A Framework for Universal Perturbation Against Zero-Shot Voice Cloning | Renyuan Li et.al. | 2505.19119 | null |
| 2025-05-27 | Revival with Voice: Multi-modal Controllable Text-to-Speech Synthesis | Minsu Kim et.al. | 2505.18972 | null |
| 2025-05-27 | RASMALAI: Resources for Adaptive Speech Modeling in Indian Languages with Accents and Intonations | Ashwin Sankar et.al. | 2505.18609 | null |
| 2025-05-24 | MPE-TTS: Customized Emotion Zero-Shot Text-To-Speech Using Multi-Modal Prompt | Zhichao Wu et.al. | 2505.18453 | null |
| 2025-05-27 | CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training | Zhihao Du et.al. | 2505.17589 | null |
| 2025-05-23 | What You Read Isn't What You Hear: Linguistic Sensitivity in Deepfake Speech Detection | Binh Nguyen et.al. | 2505.17513 | null |
| 2025-05-23 | UniTTS: An end-to-end TTS system without decoupling of acoustic and semantic information | Rui Wang et.al. | 2505.17426 | link |
| 2025-05-23 | Speechless: Speech Instruction Training Without Speech for Low Resource Languages | Alan Dao et.al. | 2505.17417 | link |
| 2025-05-22 | Benchmarking Expressive Japanese Character Text-to-Speech with VITS and Style-BERT-VITS2 | Zackary Rackauckas et.al. | 2505.17320 | null |
| 2025-05-21 | Voicing Personas: Rewriting Persona Descriptions into Style Prompts for Controllable Text-to-Speech | Yejin Lee et.al. | 2505.17093 | null |
| 2025-06-13 | Impact of Frame Rates on Speech Tokenizer: A Case Study on Mandarin and English | Haoyang Zhang et.al. | 2505.17076 | null |
| 2025-05-22 | From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition | Tianduo Wang et.al. | 2505.16972 | link |
| 2025-05-21 | MIKU-PAL: An Automated and Standardized Multi-Modal Method for Speech Paralinguistic and Affect Labeling | Yifan Cheng et.al. | 2505.15772 | null |
| 2025-05-21 | Segmentation-Variant Codebooks for Preservation of Paralinguistic and Prosodic Information | Nicholas Sanders et.al. | 2505.15667 | null |
| 2025-05-21 | Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models | Zirui Song et.al. | 2505.15406 | link |
| 2025-05-21 | Prosody-Adaptable Audio Codecs for Zero-Shot Voice Conversion via In-Context Learning | Junchuan Zhao et.al. | 2505.15402 | null |
| 2025-06-03 | Accelerating Autoregressive Speech Synthesis Inference With Speech Speculative Decoding | Zijian Lin et.al. | 2505.15380 | null |
| 2025-05-20 | Pairwise Evaluation of Accent Similarity in Speech Synthesis | Jinzuomu Zhong et.al. | 2505.14410 | null |
| 2025-05-20 | FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation | Yutong Liu et.al. | 2505.14351 | null |
| 2025-05-21 | AudioJailbreak: Jailbreak Attacks against End-to-End Large Audio-Language Models | Guangke Chen et.al. | 2505.14103 | null |
| 2025-05-20 | SeamlessEdit: Background Noise Aware Zero-Shot Speech Editing with in-Context Enhancement | Kuan-Yu Chen et.al. | 2505.14066 | null |
| 2025-05-22 | Improving Noise Robustness of LLM-based Zero-shot TTS via Discrete Acoustic Token Denoising | Ye-Xin Lu et.al. | 2505.13830 | null |
| 2025-05-29 | Articulatory Feature Prediction from Surface EMG during Speech Production | Jihwan Lee et.al. | 2505.13814 | null |
| 2025-05-19 | Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space | Zhengrui Ma et.al. | 2505.13181 | link |
| 2025-05-19 | OZSpeech: One-step Zero-shot Speech Synthesis with Learned-Prior-Conditioned Flow Matching | Hieu-Nghia Huynh-Nguyen et.al. | 2505.12800 | null |
| 2025-05-19 | RoVo: Robust Voice Protection Against Unauthorized Speech Synthesis with Embedding-Level Perturbations | Seungmin Kim et.al. | 2505.12686 | null |
| 2025-05-19 | Chain-Talker: Chain Understanding and Rendering for Empathetic Conversational Speech Synthesis | Yifan Hu et.al. | 2505.12597 | link |
| 2025-05-18 | Shallow Flow Matching for Coarse-to-Fine Text-to-Speech Synthesis | Dong Yang et.al. | 2505.12226 | null |
| 2025-05-16 | Audio Turing Test: Benchmarking the Human-likeness of Large Language Model-based Text-to-Speech Systems in Chinese | Xihuai Wang et.al. | 2505.11200 | null |
| 2025-05-16 | BanglaFake: Constructing and Evaluating a Specialized Bengali Deepfake Audio Dataset | Istiaq Ahmed Fahad et.al. | 2505.10885 | link |
| 2025-05-15 | UDDETTS: Unifying Discrete and Dimensional Emotions for Controllable Emotional Text-to-Speech | Jiaxuan Liu et.al. | 2505.10599 | null |
| 2025-05-14 | DPN-GAN: Inducing Periodic Activations in Generative Adversarial Networks for High-Fidelity Audio Synthesis | Zeeshan Ahmad et.al. | 2505.09091 | null |
| 2025-05-13 | Investigating self-supervised features for expressive, multilingual voice conversion | Álvaro Martín-Cortinas et.al. | 2505.08278 | null |
| 2025-05-12 | MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder | Bowen Zhang et.al. | 2505.07916 | null |
| 2025-05-13 | Lightweight End-to-end Text-to-speech Synthesis for low resource on-device applications | Biel Tura Vecino et.al. | 2505.07701 | null |
| 2025-05-10 | VTutor: An Animated Pedagogical Agent SDK that Provide Real Time Multi-Model Feedback | Eason Chen et.al. | 2505.06676 | null |
| 2025-05-10 | Bridging the Gap: An Intermediate Language for Enhanced and Cost-Effective Grapheme-to-Phoneme Conversion with Homographs with Multiple Pronunciations Disambiguation | Abbas Bertina et.al. | 2505.06599 | null |
| 2025-05-15 | FlexSpeech: Towards Stable, Controllable and Expressive Text-to-Speech | Linhan Ma et.al. | 2505.05159 | null |
| 2025-05-08 | Teochew-Wild: The First In-the-wild Teochew Dataset with Orthographic Annotations | Linrong Pan et.al. | 2505.05056 | null |
| 2025-05-08 | A Multi-Agent AI Framework for Immersive Audiobook Production through Spatial Audio and Neural Narration | Shaja Arul Selvamani et.al. | 2505.04885 | null |
| 2025-06-06 | Advancing Zero-shot Text-to-Speech Intelligibility across Diverse Domains via Preference Alignment | Xueyao Zhang et.al. | 2505.04113 | null |
| 2025-05-06 | VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model | Zuwei Long et.al. | 2505.03739 | link |
| 2025-05-13 | SonicRAG: High Fidelity Sound Effects Synthesis Based on Retrival Augmented Generation | Yu-Ren Guo et.al. | 2505.03244 | null |
| 2025-05-05 | Generating Narrated Lecture Videos from Slides with Synchronized Highlights | Alexander Holmberg et.al. | 2505.02966 | null |
| 2025-05-05 | Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play | Yemin Shi et.al. | 2505.02707 | link |
| 2025-05-05 | LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis | Qingkai Fang et.al. | 2505.02625 | link |
| 2025-04-30 | Sadeed: Advancing Arabic Diacritization Through Small Language Model | Zeina Aldallal et.al. | 2504.21635 | null |
| 2025-04-29 | AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation | Jeongsoo Choi et.al. | 2504.20629 | null |
| 2025-05-28 | ClonEval: An Open Voice Cloning Benchmark | Iwona Christop et.al. | 2504.20581 | link |
| 2025-05-02 | Towards Flow-Matching-based TTS without Classifier-Free Guidance | Yuzhe Liang et.al. | 2504.20334 | null |
| 2025-04-27 | Generative Adversarial Network based Voice Conversion: Techniques, Challenges, and Recent Advancements | Sandipan Dhar et.al. | 2504.19197 | null |
| 2025-04-27 | Muyan-TTS: A Trainable Text-to-Speech Model Optimized for Podcast Scenarios with a $50K Budget | Xin Li et.al. | 2504.19146 | link |
| 2025-04-22 | FADEL: Uncertainty-aware Fake Audio Detection with Evidential Deep Learning | Ju Yeon Kang et.al. | 2504.15663 | null |
| 2025-04-22 | A Multi-Agent Framework for Automated Qinqiang Opera Script Generation Using Large Language Models | Gengxian Cao et.al. | 2504.15552 | null |
| 2025-04-21 | SOLIDO: A Robust Watermarking Method for Speech Synthesis via Low-Rank Adaptation | Yue Li et.al. | 2504.15035 | null |
| 2025-04-20 | DialogueAgents: A Hybrid Agent-Based Speech Synthesis Framework for Multi-Party Dialogue | Xiang Li et.al. | 2504.14482 | link |
| 2025-04-18 | ChatNekoHacker: Real-Time Fan Engagement with Conversational Agents | Takuya Sera et.al. | 2504.13793 | null |
| 2025-04-18 | Collective Learning Mechanism based Optimal Transport Generative Adversarial Network for Non-parallel Voice Conversion | Sandipan Dhar et.al. | 2504.13791 | null |
| 2025-04-22 | EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting | Guanrou Yang et.al. | 2504.12867 | null |
| 2025-05-28 | GOAT-TTS: Expressive and Realistic Speech Generation via A Dual-Branch LLM | Yaodong Song et.al. | 2504.12339 | null |
| 2025-04-15 | Dopamine Audiobook: A Training-free MLLM Agent for Emotional and Human-like Audiobook Generation | Yan Rong et.al. | 2504.11002 | null |
| 2025-04-15 | Generalized Audio Deepfake Detection Using Frame-level Latent Information Entropy | Botao Zhao et.al. | 2504.10819 | null |
| 2025-04-14 | Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis | Yifan Yang et.al. | 2504.10352 | null |
| 2025-04-14 | AutoStyle-TTS: Retrieval-Augmented Generation based Automatic Style Matching Text-to-Speech Synthesis | Dan Luo et.al. | 2504.10309 | null |
| 2025-04-14 | SafeSpeech: Robust and Universal Voice Protection Against Malicious Speech Synthesis | Zhisheng Zhang et.al. | 2504.09839 | link |
| 2025-04-12 | AMNet: An Acoustic Model Network for Enhanced Mandarin Speech Synthesis | Yubing Cao et.al. | 2504.09225 | null |
| 2025-04-11 | Generalized Multilingual Text-to-Speech Generation with Language-Aware Style Adaptation | Haowei Lou et.al. | 2504.08274 | null |
| 2025-04-10 | Empowering Global Voices: A Data-Efficient, Phoneme-Tone Adaptive Approach to High-Fidelity Speech Synthesis | Yizhong Geng et.al. | 2504.07858 | null |
| 2025-05-16 | SlimSpeech: Lightweight and Efficient Text-to-Speech with Slim Rectified Flow | Kaidi Wang et.al. | 2504.07776 | null |
| 2025-04-08 | AVENet: Disentangling Features by Approximating Average Features for Voice Conversion | Wenyu Wang et.al. | 2504.05833 | null |
| 2025-04-07 | SpeakEasy: Enhancing Text-to-Speech Interactions for Expressive Content Creation | Stephen Brade et.al. | 2504.05106 | null |
| 2025-04-04 | RWKVTTS: Yet another TTS based on RWKV-7 | Lin Yueyu et.al. | 2504.03289 | link |
| 2025-04-22 | F5R-TTS: Improving Flow-Matching based Text-to-Speech with Group Relative Policy Optimization | Xiaohui Sun et.al. | 2504.02407 | null |
| 2025-04-03 | VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models | Kim Sung-Bin et.al. | 2504.02386 | null |
| 2025-04-02 | TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection | Zhiming Ma et.al. | 2503.24115 | link |
| 2025-03-31 | SpeechDialogueFactory: Generating High-Quality Speech Dialogue Data to Accelerate Your Speech-LLM Development | Minghan Wang et.al. | 2503.23848 | link |
| 2025-03-30 | Speculative End-Turn Detector for Efficient Speech Chatbot Assistant | Hyunjong Ok et.al. | 2503.23439 | null |
| 2025-05-16 | SupertonicTTS: Towards Highly Scalable and Efficient Text-to-Speech System | Hyeongju Kim et.al. | 2503.23108 | null |
| 2025-03-26 | Dual Audio-Centric Modality Coupling for Talking Head Generation | Ao Fu et.al. | 2503.22728 | null |
| 2025-03-28 | DeepAudio-V1: Towards Multi-Modal Multi-Stage End-to-End Video to Speech and Audio Generation | Haomin Zhang et.al. | 2503.22265 | null |
| 2025-03-26 | Text-Driven Voice Conversion via Latent State-Space Modeling | Wen Li et.al. | 2503.20999 | null |
| 2025-05-26 | FireRedTTS-1S: An Upgraded Streamable Foundation Text-to-Speech System | Hao-Han Guo et.al. | 2503.20499 | null |
| 2025-03-21 | Your voice is your voice: Supporting Self-expression through Speech Generation and LLMs in Augmented and Alternative Communication | Yiwen Xu et.al. | 2503.17479 | null |
| 2025-03-21 | From Faces to Voices: Learning Hierarchical Representations for High-quality Video-to-Speech | Ji-Hoon Kim et.al. | 2503.16956 | null |
| 2025-03-20 | WaveFM: A High-Fidelity and Efficient Vocoder Based on Flow Matching | Tianze Luo et.al. | 2503.16689 | link |
| 2025-03-10 | VocalEyes: Enhancing Environmental Perception for the Visually Impaired through Vision-Language Models and Distance-Aware Object Detection | Kunal Chavan et.al. | 2503.16488 | null |
| 2025-01-22 | Development of an Inclusive Educational Platform Using Open Technologies and Machine Learning: A Case Study on Accessibility Enhancement | Jimi Togni et.al. | 2503.15501 | null |
| 2025-01-14 | AI-Powered Assistive Technologies for Visual Impairment | Prudhvi Naayini et.al. | 2503.15494 | null |
| 2025-03-19 | MoonCast: High-Quality Zero-Shot Podcast Generation | Zeqian Ju et.al. | 2503.14345 | link |
| 2025-03-26 | InnerSelf: Designing Self-Deepfaked Voice for Emotional Well-being | Guang Dai et.al. | 2503.14257 | null |
| 2025-03-14 | MAVFlow: Preserving Paralinguistic Elements with Conditional Flow Matching for Zero-Shot AV2AV Multilingual Translation | Sungwoo Cho et.al. | 2503.11026 | null |
| 2025-03-11 | An Exhaustive Evaluation of TTS- and VC-based Data Augmentation for ASR | Sewade Ogun et.al. | 2503.08954 | null |
| 2025-03-07 | DiVISe: Direct Visual-Input Speech Synthesis Preserving Speaker Characteristics And Intelligibility | Yifan Liu et.al. | 2503.05223 | link |
| 2025-03-03 | Direct Speech to Speech Translation: A Review | Mohammad Sarim et.al. | 2503.04799 | null |
| 2025-03-06 | LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM | Sambal Shikhar et.al. | 2503.04724 | null |
| 2025-03-06 | Scaling Rich Style-Prompted Text-to-Speech Datasets | Anuj Diwan et.al. | 2503.04713 | link |
| 2025-03-05 | Good practices for evaluation of synthesized speech | Erica Cooper et.al. | 2503.03250 | null |
| 2025-03-04 | InSerter: Speech Instruction Following with Unsupervised Interleaved Pre-training | Dingdong Wang et.al. | 2503.02769 | null |
| 2025-03-03 | Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens | Xinsheng Wang et.al. | 2503.01710 | link |
| 2025-03-03 | Voice Cloning for Dysarthric Speech Synthesis: Addressing Data Scarcity in Speech-Language Pathology | Birger Moell et.al. | 2503.01266 | null |
| 2025-03-02 | UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation | Alexander H. Liu et.al. | 2503.00733 | null |
| 2025-03-01 | PodAgent: A Comprehensive Framework for Podcast Generation | Yujia Xiao et.al. | 2503.00455 | link |
| 2025-03-12 | Telephone Surveys Meet Conversational AI: Evaluating a LLM-Based Telephone Survey System at Scale | Max M. Lang et.al. | 2502.20140 | null |
| 2025-02-27 | DiffCSS: Diverse and Expressive Conversational Speech Synthesis with Diffusion Models | Weihao Wu et.al. | 2502.19924 | null |
| 2025-03-28 | MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis | Ziyue Jiang et.al. | 2502.18924 | null |
| 2025-03-08 | Clip-TTS: Contrastive Text-content and Mel-spectrogram, A High-Quality Text-to-Speech Method based on Contextual Semantic Understanding | Tianyun Liu et.al. | 2502.18889 | null |
| 2025-02-24 | Balancing Speech Understanding and Generation Using Continual Pre-training for Codec-based Speech LLM | Jiatong Shi et.al. | 2502.16897 | null |
| 2025-02-18 | AV-Flow: Transforming Text to Audio-Visual Human-like Interactions | Aggelina Chatziagapi et.al. | 2502.13133 | null |
| 2025-02-18 | High-Fidelity Music Vocoder using Neural Audio Codecs | Luca A. Lanzendörfer et.al. | 2502.12759 | null |
| 2025-02-18 | A Survey on Bridging EEG Signals and Generative AI: From Image and Text to Beyond | Shreya Shukla et.al. | 2502.12048 | null |
| 2025-02-17 | NaturalL2S: End-to-End High-quality Multispeaker Lip-to-Speech Synthesis with Differential Digital Signal Processing | Yifan Liang et.al. | 2502.12002 | null |
| 2025-02-16 | FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching | Hui Wang et.al. | 2502.11128 | null |
| 2025-02-16 | SyncSpeech: Low-Latency and Efficient Dual-Stream Text-to-Speech based on Temporal Masked Transformer | Zhengyan Sheng et.al. | 2502.11094 | null |
| 2025-02-14 | VocalCrypt: Novel Active Defense Against Deepfake Voice Based on Masking Effect | Qingyuan Fei et.al. | 2502.10329 | null |
| 2025-02-13 | TokenSynth: A Token-based Neural Synthesizer for Instrument Cloning and Text-to-Instrument | Kyungsu Kim et.al. | 2502.08939 | link |
| 2025-04-24 | ASVspoof 5: Design, Collection and Validation of Resources for Spoofing, Deepfake, and Adversarial Attack Detection Using Crowdsourced Speech | Xin Wang et.al. | 2502.08857 | null |
| 2025-02-11 | LoRP-TTS: Low-Rank Personalized Text-To-Speech | Łukasz Bondaruk et.al. | 2502.07562 | null |
| 2025-02-11 | Advanced Zero-Shot Text-to-Speech for Background Removal and Preservation with Controllable Masked Speech Prediction | Leying Zhang et.al. | 2502.07345 | null |
| 2025-02-11 | Vevo: Controllable Zero-Shot Voice Imitation with Self-Supervised Disentanglement | Xueyao Zhang et.al. | 2502.07243 | null |
| 2025-02-10 | Synthetic Audio Helps for Cognitive State Tasks | Adil Soubki et.al. | 2502.06922 | link |
| 2025-02-19 | Speech to Speech Translation with Translatotron: A State of the Art Review | Jules R. Kala et.al. | 2502.05980 | null |
| 2025-02-09 | Non-invasive electromyographic speech neuroprosthesis: a geometric perspective | Harshavardhana T. Gowda et.al. | 2502.05762 | null |
| 2025-02-09 | BnTTS: Few-Shot Speaker Adaptation in Low-Resource Setting | Mohammad Jahid Ibna Basher et.al. | 2502.05729 | null |
| 2025-02-08 | Gender Bias in Instruction-Guided Speech Synthesis Models | Chun-Yi Kuan et.al. | 2502.05649 | null |
| 2025-02-08 | IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System | Wei Deng et.al. | 2502.05512 | link |
| 2025-02-22 | Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis | Zhen Ye et.al. | 2502.04128 | link |
| 2025-02-05 | Metis: A Foundation Speech Generation Model with Masked Generative Pre-training | Yuancheng Wang et.al. | 2502.03128 | link |
| 2025-02-05 | Fine-grained Preference Optimization Improves Zero-shot Text-to-Speech | Jixun Yao et.al. | 2502.02950 | null |
| 2025-02-04 | Developing multilingual speech synthesis system for Ojibwe, Mi'kmaq, and Maliseet | Shenran Wang et.al. | 2502.02703 | link |
| 2025-02-04 | Streaming Speaker Change Detection and Gender Classification for Transducer-Based Multi-Talker Speech Translation | Peidong Wang et.al. | 2502.02683 | null |
| 2025-02-13 | Continuous Autoregressive Modeling with Stochastic Monotonic Alignment for Speech Synthesis | Weiwei Lin et.al. | 2502.01084 | null |
| 2025-02-02 | EmoTalkingGaussian: Continuous Emotion-conditioned Talking Head Synthesis | Junuk Cha et.al. | 2502.00654 | null |
| 2025-01-31 | VisualSpeech: Enhance Prosody with Visual Context in TTS | Shumin Que et.al. | 2501.19258 | null |
| 2025-01-29 | BreezyVoice: Adapting TTS for Taiwanese Mandarin with Enhanced Polyphone Disambiguation -- Challenges and Insights | Chan-Jan Hsu et.al. | 2501.17790 | null |
| 2025-01-28 | Compact Neural TTS Voices for Accessibility | Kunal Jain et.al. | 2501.17332 | null |
| 2025-02-11 | Overview of the Amphion Toolkit (v0.2) | Jiaqi Li et.al. | 2501.15442 | link |
| 2025-01-24 | Characteristic-Specific Partial Fine-Tuning for Efficient Emotion and Speaker Adaptation in Codec Language Text-to-Speech Models | Tianrui Wang et.al. | 2501.14273 | null |
| 2025-01-24 | Generalizable Audio Deepfake Detection via Latent Space Refinement and Augmentation | Wen Huang et.al. | 2501.14240 | null |
| 2025-01-24 | LoCoML: A Framework for Real-World ML Inference Pipelines | Kritin Maddireddy et.al. | 2501.14165 | null |
| 2025-01-23 | Generative Data Augmentation Challenge: Zero-Shot Speech Synthesis for Personalized Speech Enhancement | Jae-Sung Bae et.al. | 2501.13372 | null |
| 2025-01-21 | A Domain Adaptation Framework for Speech Recognition Systems with Only Synthetic data | Minh Tran et.al. | 2501.12501 | null |
| 2025-01-20 | A Non-autoregressive Model for Joint STT and TTS | Vishal Sunder et.al. | 2501.09104 | null |
| 2025-01-15 | Speech Synthesis along Perceptual Voice Quality Dimensions | Frederik Rautenberg et.al. | 2501.08791 | null |
| 2025-01-15 | Adaptive Data Augmentation with NaturalSpeech3 for Far-field Speaker Verification | Li Zhang et.al. | 2501.08691 | null |
| 2025-01-15 | Towards Lightweight and Stable Zero-shot TTS with Self-distilled Representation Disentanglement | Qianniu Chen et.al. | 2501.08566 | null |
| 2025-03-17 | CodecFake+: A Large-Scale Neural Audio Codec-Based Deepfake Speech Dataset | Xuanjun Chen et.al. | 2501.08238 | null |
| 2025-01-13 | Exploring the encoding of linguistic representations in the Fully-Connected Layer of generative CNNs for Speech | Bruno Ferenc Šegedin et.al. | 2501.07726 | null |
| 2025-01-19 | MathReader: Text-to-Speech for Mathematical Documents | Sieun Hyeon et.al. | 2501.07088 | link |
| 2025-01-11 | Retrieval-Augmented Dialogue Knowledge Aggregation for Expressive Conversational Speech Synthesis | Rui Liu et.al. | 2501.06467 | link |
| 2025-01-10 | TTS-Transducer: End-to-End Speech Synthesis with Neural Transducer | Vladimir Bataev et.al. | 2501.06320 | null |
| 2025-01-10 | MinMo: A Multimodal Large Language Model for Seamless Voice Interaction | Qian Chen et.al. | 2501.06282 | null |
| 2025-01-10 | PROEMO: Prompt-Driven Text-to-Speech Synthesis Based on Emotion and Intensity Control | Shaozuo Zhang et.al. | 2501.06276 | null |
| 2025-06-03 | Low-Resource Text-to-Speech Synthesis Using Noise-Augmented Training of ForwardTacotron | Kishor Kayyar Lakshminarayana et.al. | 2501.05976 | null |
| 2025-01-10 | MARS6: A Small and Robust Hierarchical-Codec Text-to-Speech Model | Matthew Baas et.al. | 2501.05787 | null |
| 2025-01-09 | Probing Speaker-specific Features in Speaker Representations | Aemon Yat Fei Chiu et.al. | 2501.05310 | null |
| 2025-01-09 | JELLY: Joint Emotion Recognition and Context Reasoning with LLMs for Conversational Speech Synthesis | Jun-Hyeok Cha et.al. | 2501.04904 | null |
| 2025-01-08 | Cued Speech Generation Leveraging a Pre-trained Audiovisual Text-to-Speech Model | Sanjana Sankar et.al. | 2501.04799 | null |
| 2025-01-08 | FleSpeech: Flexibly Controllable Speech Generation with Various Prompts | Hanzhao Li et.al. | 2501.04644 | null |
| 2025-02-23 | OpenOmni: Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-Time Self-Aware Emotional Speech Synthesis | Run Luo et.al. | 2501.04561 | link |
| 2025-01-08 | DrawSpeech: Expressive Speech Synthesis Using Prosodic Sketches as Control Conditions | Weidong Chen et.al. | 2501.04256 | null |
| 2025-01-07 | NeuroIncept Decoder for High-Fidelity Speech Reconstruction from Neural Activity | Owais Mujtaba Khanday et.al. | 2501.03757 | link |
| 2025-01-02 | FaceSpeak: Expressive and High-Quality Speech Synthesis from Human Portraits of Different Styles | Tian-Hao Zhang et.al. | 2501.03181 | null |
| 2025-01-02 | RingFormer: A Neural Vocoder with Ring Attention and Convolution-Augmented Transformer | Seongho Hong et.al. | 2501.01182 | link |
| 2025-01-02 | Disambiguation of Chinese Polyphones in an End-to-End Framework with Semantic Features Extracted by Pre-trained BERT | Dongyang Dai et.al. | 2501.01102 | null |
| 2025-01-06 | Takeaways from Applying LLM Capabilities to Multiple Conversational Avatars in a VR Pilot Study | Mykola Maslych et.al. | 2501.00168 | null |
| 2024-12-16 | SECodec: Structural Entropy-based Compressive Speech Representation Codec for Speech Language Models | Linqin Wang et.al. | 2501.00018 | null |
| 2024-12-28 | Stable-TTS: Stable Speaker-Adaptive Text-to-Speech Synthesis via Prosody Prompting | Wooseok Han et.al. | 2412.20155 | null |
| 2024-12-28 | CrossSpeech++: Cross-lingual Speech Synthesis with Decoupled Language and Speaker Generation | Ji-Hoon Kim et.al. | 2412.20048 | null |
| 2024-12-26 | VoiceDiT: Dual-Condition Diffusion Transformer for Environment-Aware Speech Synthesis | Jaemin Jung et.al. | 2412.19259 | null |
| 2024-12-26 | "I've Heard of You!": Generate Spoken Named Entity Recognition Data for Unseen Entities | Jiawei Yu et.al. | 2412.19102 | null |
| 2024-12-26 | Indonesian-English Code-Switching Speech Synthesizer Utilizing Multilingual STEN-TTS and Bert LID | Ahmad Alfani Handoyo et.al. | 2412.19043 | null |
| 2025-01-23 | Advancing NAM-to-Speech Conversion with Novel Methods and the MultiNAM Dataset | Neil Shah et.al. | 2412.18839 | null |
| 2025-01-17 | MRI2Speech: Speech Synthesis from Articulatory Movements Recorded by Real-time MRI | Neil Shah et.al. | 2412.18836 | null |
| 2024-12-25 | Intra- and Inter-modal Context Interaction Modeling for Conversational Speech Synthesis | Zhenqi Jia et.al. | 2412.18733 | null |
| 2024-12-24 | GenPod: Constructive News Framing in AI-Generated Podcasts More Effectively Reduces Negative Emotions Than Non-Constructive Framing | Wen Ku et.al. | 2412.18300 | null |
| 2025-03-27 | VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music | Jiatong Shi et.al. | 2412.17667 | link |
| 2024-12-22 | Why Do Speech Language Models Fail to Generate Semantically Coherent Outputs? A Modality Evolving Perspective | Hankun Wang et.al. | 2412.17048 | null |
| 2024-12-22 | Incremental Disentanglement for Environment-Aware Zero-Shot Text-to-Speech Synthesis | Ye-Xin Lu et.al. | 2412.16977 | null |
| 2025-09-18 | KALL-E: Autoregressive Speech Synthesis with Next-Distribution Prediction | Kangxiang Xia et.al. | 2412.16846 | null |
| 2024-12-23 | Interleaved Speech-Text Language Models are Simple Streaming Text to Speech Synthesizers | Yifan Yang et.al. | 2412.16102 | null |
| 2024-12-19 | Scale This, Not That: Investigating Key Dataset Attributes for Efficient Speech Enhancement Scaling | Leying Zhang et.al. | 2412.14890 | null |
| 2024-12-17 | Deep Speech Synthesis from Multimodal Articulatory Representations | Peter Wu et.al. | 2412.13387 | null |
| 2024-12-17 | Synthetic Speech Classification: IEEE Signal Processing Cup 2022 challenge | Mahieyin Rahmun et.al. | 2412.13279 | link |
| 2024-12-17 | Enhancing Naturalness in LLM-Generated Utterances through Disfluency Insertion | Syed Zohaib Hassan et.al. | 2412.12710 | null |
| 2024-12-17 | Phoneme-Level Feature Discrepancies: A Key to Detecting Sophisticated Speech Deepfakes | Kuiyuan Zhang et.al. | 2412.12619 | null |
| 2025-01-10 | Hierarchical Control of Emotion Rendering in Speech Synthesis | Sho Inoue et.al. | 2412.12498 | link |
| 2024-12-19 | ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis | Xiangheng He et.al. | 2412.11795 | null |
| 2024-12-16 | Region-Based Optimization in Continual Learning for Audio Deepfake Detection | Yujie Chen et.al. | 2412.11551 | link |
| 2025-01-15 | Multi-modal and Multi-scale Spatial Environment Understanding for Immersive Visual Text-to-Speech | Rui Liu et.al. | 2412.11409 | link |
| 2024-12-16 | Efficient Generative Modeling with Residual Vector Quantization-Based Tokens | Jaehyeon Kim et.al. | 2412.10208 | null |
| 2024-12-25 | CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models | Zhihao Du et.al. | 2412.10117 | link |
| 2024-12-13 | AMuSeD: An Attentive Deep Neural Network for Multimodal Sarcasm Detection Incorporating Bi-modal Data Augmentation | Xiyuan Gao et.al. | 2412.10103 | null |
| 2024-12-13 | CSSinger: End-to-End Chunkwise Streaming Singing Voice Synthesis System Based on Conditional Variational Autoencoder | Jianwei Cui et.al. | 2412.08918 | null |
| 2024-12-11 | Multimodal Latent Language Modeling with Next-Token Diffusion | Yutao Sun et.al. | 2412.08635 | link |
| 2024-12-11 | Zero-Shot Mono-to-Binaural Speech Synthesis | Alon Levkovitch et.al. | 2412.08356 | null |
| 2024-12-11 | A Unified Model For Voice and Accent Conversion In Speech and Singing using Self-Supervised Learning and Feature Extraction | Sowmya Cheripally et.al. | 2412.08312 | null |
| 2024-12-11 | A Preliminary Analysis of Automatic Word and Syllable Prominence Detection in Non-Native Speech With Text-to-Speech Prosody Embeddings | Anindita Mondal et.al. | 2412.08283 | null |
| 2024-12-11 | LatentSpeech: Latent Diffusion for Text-To-Speech Generation | Haowei Lou et.al. | 2412.08117 | null |
| 2024-12-11 | Aligner-Guided Training Paradigm: Advancing Text-to-Speech Models with Aligner Guided Duration | Haowei Lou et.al. | 2412.08112 | null |
| 2024-12-09 | Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey | Tianxin Xie et.al. | 2412.06602 | link |
| 2024-12-12 | EmoSpeech: A Corpus of Emotionally Rich and Contextually Detailed Speech Annotations | Weizhen Bian et.al. | 2412.06581 | null |
| 2024-12-01 | Text Is Not All You Need: Multimodal Prompting Helps LLMs Understand Humor | Ashwin Baluja et.al. | 2412.05315 | null |
| 2024-12-04 | DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling for Text-to-Speech with Diverse and Controllable Styles | Jiaxuan Liu et.al. | 2412.03388 | null |
| 2024-12-05 | Analytic Study of Text-Free Speech Synthesis for Raw Audio using a Self-Supervised Learning Model | Joonyong Park et.al. | 2412.03074 | null |
| 2024-12-03 | GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot | Aohan Zeng et.al. | 2412.02612 | link |
| 2024-11-19 | A Context-Based Numerical Format Prediction for a Text-To-Speech System | Yaser Darwesh et.al. | 2412.00028 | null |
| 2024-11-27 | Continual Learning in Machine Speech Chain Using Gradient Episodic Memory | Geoffrey Tyndall et.al. | 2411.18320 | null |
| 2024-11-27 | SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation | Wenyi Yu et.al. | 2411.18138 | null |
| 2024-11-26 | Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis | Akshita Gupta et.al. | 2411.17690 | null |
| 2024-11-22 | VQalAttent: a Transparent Speech Generation Pipeline based on Transformer-learned VQ-VAE Latent Space | Armani Rodriguez et.al. | 2411.14642 | null |
| 2024-11-26 | WavChat: A Survey of Spoken Dialogue Models | Shengpeng Ji et.al. | 2411.13577 | link |
| 2024-12-02 | I2TTS: Image-indicated Immersive Text-to-speech Synthesis with Spatial Perception | Jiawei Zhang et.al. | 2411.13314 | null |
| 2024-11-20 | Hard-Synth: Synthesizing Diverse Hard Samples for ASR using Zero-Shot TTS and LLM | Jiawei Yu et.al. | 2411.13159 | null |
| 2024-12-15 | Rethinking MUSHRA: Addressing Modern Challenges in Text-to-Speech Evaluation | Praveen Srinivasa Varadhan et.al. | 2411.12719 | null |
| 2024-11-19 | Leveraging Virtual Reality and AI Tutoring for Language Learning: A Case Study of a Virtual Campus Environment with OpenAI GPT Integration with Unity 3D | Adithya TG et.al. | 2411.12619 | null |
| 2024-11-18 | ESTVocoder: An Excitation-Spectral-Transformed Neural Vocoder Conditioned on Mel Spectrogram | Xiao-Hang Jiang et.al. | 2411.11258 | null |
| 2024-11-18 | SAMOS: A Neural MOS Prediction Model Leveraging Semantic Representations and Acoustic Features | Yu-Fei Shi et.al. | 2411.11232 | null |
| 2024-11-15 | SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers | Joseph Liu et.al. | 2411.10510 | link |
| 2024-11-14 | Robust AI-Synthesized Speech Detection Using Feature Decomposition Learning and Synthesizer Feature Augmentation | Kuiyuan Zhang et.al. | 2411.09167 | null |
| 2024-11-14 | Evaluating Synthetic Command Attacks on Smart Voice Assistants | Zhengxian He et.al. | 2411.08316 | null |
| 2024-11-12 | Improving Grapheme-to-Phoneme Conversion through In-Context Knowledge Retrieval with Large Language Models | Dongrui Han et.al. | 2411.07563 | null |
| 2024-11-11 | Enhancing Accessibility in Special Libraries: A Study on AI-Powered Assistive Technologies for Patrons with Disabilities | Snehasish Paul Shivali Chauhan et.al. | 2411.06970 | null |
| 2024-12-04 | Debatts: Zero-Shot Debating Text-to-Speech Synthesis | Yiqiao Huang et.al. | 2411.06540 | null |
| 2024-11-07 | CUIfy the XR: An Open-Source Package to Embed LLM-powered Conversational Agents in XR | Kadir Burak Buldu et.al. | 2411.04671 | null |
| 2024-11-04 | EmoSphere++: Emotion-Controllable Zero-Shot Text-to-Speech via Emotion-Adaptive Spherical Vector | Deok-Hyeon Cho et.al. | 2411.02625 | link |
| 2024-11-04 | Complete reconstruction of the tongue contour through acoustic to articulatory inversion using real-time MRI data | Sofiane Azzouz et.al. | 2411.02037 | null |
| 2024-11-09 | Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis | Shijia Liao et.al. | 2411.01156 | link |
| 2024-10-31 | Speech is More Than Words: Do Speech-to-Text Translation Systems Leverage Prosody? | Ioannis Tsiamas et.al. | 2410.24019 | null |
| 2024-10-30 | Lina-Speech: Gated Linear Attention is a Fast and Parameter-Efficient Learner for text-to-speech synthesis | Théodor Lemerle et.al. | 2410.23320 | link |
| 2024-10-30 | Augmenting Polish Automatic Speech Recognition System With Synthetic Data | Łukasz Bondaruk et.al. | 2410.22903 | null |
| 2024-10-29 | Very Attentive Tacotron: Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech | Eric Battenberg et.al. | 2410.22179 | link |
| 2024-10-29 | Fast and High-Quality Auto-Regressive Speech Synthesis via Speculative Decoding | Bohan Li et.al. | 2410.21951 | null |
| 2024-10-29 | RDSinger: Reference-based Diffusion Network for Singing Voice Synthesis | Kehan Sui et.al. | 2410.21641 | null |
| 2024-10-28 | Asynchronous Tool Usage for Real-Time Agents | Antonio A. Ginart et.al. | 2410.21620 | null |
| 2024-10-28 | Enhancing TTS Stability in Hebrew using Discrete Semantic Units | Ella Zeldes et.al. | 2410.21502 | null |
| 2024-10-28 | Mitigating Unauthorized Speech Synthesis for Voice Protection | Zhisheng Zhang et.al. | 2410.20742 | link |
| 2024-10-27 | Get Large Language Models Ready to Speak: A Late-fusion Approach for Speech Generation | Maohao Shen et.al. | 2410.20336 | null |
| 2024-10-24 | Making Social Platforms Accessible: Emotion-Aware Speech Generation with Integrated Text Analysis | Suparna De et.al. | 2410.19199 | null |
| 2024-10-24 | STTATTS: Unified Speech-To-Text And Text-To-Speech Model | Hawau Olamide Toyin et.al. | 2410.18607 | link |
| 2024-10-24 | Evaluating and Improving Automatic Speech Recognition Systems for Korean Meteorological Experts | ChaeHun Park et.al. | 2410.18444 | null |
| 2024-10-23 | ELAICHI: Enhancing Low-resource TTS by Addressing Infrequent and Low-frequency Character Bigrams | Srija Anand et.al. | 2410.17901 | null |
| 2024-10-22 | Continuous Speech Tokenizer in Text To Speech | Yixing Li et.al. | 2410.17081 | null |
| 2024-10-22 | Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap | Guanrou Yang et.al. | 2410.16726 | null |
| 2024-10-21 | Continuous Speech Synthesis using per-token Latent Diffusion | Arnon Turetzky et.al. | 2410.16048 | null |
| 2024-10-18 | A Unified Framework for Collecting Text-to-Speech Synthesis Datasets for 22 Indian Languages | Sujitha Sathiyamoorthy et.al. | 2410.14197 | null |
| 2024-12-23 | Multi-Source Spatial Knowledge Understanding for Immersive Visual Text-to-Speech | Shuwei He et.al. | 2410.14101 | link |
| 2024-10-17 | Accelerating Codec-based Speech Synthesis with Multi-Token Prediction and Speculative Decoding | Tan Dat Nguyen et.al. | 2410.13839 | null |
| 2024-10-17 | Enhancing Crowdsourced Audio for Text-to-Speech Models | José Giraldo et.al. | 2410.13357 | null |
| 2024-10-17 | DART: Disentanglement of Accent and Speaker Representation in Multispeaker Text-to-Speech | Jan Melechovsky et.al. | 2410.13342 | null |
| 2024-10-17 | DurIAN-E 2: Duration Informed Attention Network with Adaptive Variational Autoencoder and Adversarial Learning for Expressive Text-to-Speech Synthesis | Yu Gu et.al. | 2410.13288 | null |
| 2024-10-17 | Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation | Sreyan Ghosh et.al. | 2410.13198 | null |
| 2024-10-16 | ERVQ: Enhanced Residual Vector Quantization with Intra-and-Inter-Codebook Optimization for Neural Audio Codecs | Rui-Chen Zheng et.al. | 2410.12359 | null |
| 2024-10-16 | Beyond Oversmoothing: Evaluating DDPM and MSE for Scalable Speech Synthesis in ASR | Christoph Minixhofer et.al. | 2410.12279 | null |
| 2024-10-14 | IsoChronoMeter: A simple and effective isochronic translation evaluation metric | Nikolai Rozanov et.al. | 2410.11127 | null |
| 2024-10-14 | DMDSpeech: Distilled Diffusion Model Surpassing The Teacher in Zero-shot Speech Synthesis via Direct Metric Optimization | Yinghao Aaron Li et.al. | 2410.11097 | null |
| 2024-10-14 | Everyday Speech in the Indian Subcontinent | Utkarsh Pathak et.al. | 2410.10508 | null |
| 2024-10-12 | Emphasis Rendering for Conversational Text-to-Speech with Multi-modal Multi-scale Context Modeling | Rui Liu et.al. | 2410.09524 | null |
| 2024-10-10 | Unsupervised Data Validation Methods for Efficient Model Training | Yurii Paniv et.al. | 2410.07880 | null |
| 2024-10-15 | F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching | Yushen Chen et.al. | 2410.06885 | link |
| 2024-10-09 | Efficient training strategies for natural sounding speech synthesis and speaker adaptation based on FastPitch | Teodora Răgman et.al. | 2410.06787 | null |
| 2024-10-09 | Bahasa Harmony: A Comprehensive Dataset for Bahasa Text-to-Speech Synthesis with Discrete Codec Modeling of EnGen-TTS | Onkar Kishor Susladkar et.al. | 2410.06608 | null |
| 2024-10-09 | Can DeepFake Speech be Reliably Detected? | Hongbin Liu et.al. | 2410.06572 | null |
| 2024-10-07 | SegINR: Segment-wise Implicit Neural Representation for Sequence Alignment in Neural Text-to-Speech | Minchan Kim et.al. | 2410.04690 | null |
| 2024-10-06 | HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis | Yuto Nishimura et.al. | 2410.04380 | null |
| 2024-10-10 | SONAR: A Synthetic AI-Audio Detection Framework and Benchmark | Xiang Li et.al. | 2410.04324 | link |
| 2024-10-05 | Adversarial Attacks and Robust Defenses in Speaker Embedding based Zero-Shot Text-to-Speech System | Ze Li et.al. | 2410.04017 | null |
| 2024-10-01 | Recent Advances in Speech Language Models: A Survey | Wenqian Cui et.al. | 2410.03751 | null |
| 2024-09-30 | Accent conversion using discrete units with parallel data synthesized from controllable accented TTS | Tuan Nam Nguyen et.al. | 2410.03734 | null |
| 2024-09-28 | FluentEditor+: Text-based Speech Editing by Modeling Local Hierarchical Acoustic Smoothness and Global Prosody Consistency | Rui Liu et.al. | 2410.03719 | null |
| 2024-10-04 | Generative Semantic Communication for Text-to-Speech Synthesis | Jiahao Zheng et.al. | 2410.03459 | null |
| 2024-10-04 | Textless Streaming Speech-to-Speech Translation using Semantic Speech Tokens | Jinzheng Zhao et.al. | 2410.03298 | null |
| 2024-10-04 | Narrative Player: Reviving Data Narratives with Visuals | Zekai Shao et.al. | 2410.03268 | null |
| 2024-10-04 | MultiVerse: Efficient and Expressive Zero-Shot Multi-Task Text-to-Speech | Taejun Bak et.al. | 2410.03192 | null |
| 2024-10-07 | Algorithms For Automatic Accentuation And Transcription Of Russian Texts In Speech Recognition Systems | Olga Iakovenko et.al. | 2410.02538 | null |
| 2024-10-01 | Augmentation through Laundering Attacks for Audio Spoof Detection | Hashim Ali et.al. | 2410.01108 | null |
| 2024-10-01 | Zero-Shot Text-to-Speech from Continuous Text Streams | Trung Dang et.al. | 2410.00767 | null |
| 2024-10-01 | EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control | Haozhe Chen et.al. | 2410.00316 | link |
| 2024-10-02 | Moshi: a speech-text foundation model for real-time dialogue | Alexandre Défossez et.al. | 2410.00037 | link |
| 2024-09-30 | Word-wise intonation model for cross-language TTS systems | Tomilov A. A. et.al. | 2409.20374 | null |
| 2024-09-29 | Quantitative Analysis of Audio-Visual Tasks: An Information-Theoretic Perspective | Chen Chen et.al. | 2409.19575 | null |
| 2024-09-27 | Audio-Based Linguistic Feature Extraction for Enhancing Multi-lingual and Low-Resource Text-to-Speech | Youngjae Kim et.al. | 2409.18622 | null |
| 2024-09-27 | EmoPro: A Prompt Selection Strategy for Emotional Expression in LM-based Speech Synthesis | Haoyu Wang et.al. | 2409.18512 | null |
| 2024-09-26 | Description-based Controllable Text-to-Speech with Cross-Lingual Voice Control | Ryuichi Yamamoto et.al. | 2409.17452 | null |
| 2024-09-25 | Exploring synthetic data for cross-speaker style transfer in style representation based TTS | Lucas H. Ueda et.al. | 2409.17364 | null |
| 2024-09-18 | SpoofCeleb: Speech Deepfake Detection and SASV In The Wild | Jee-weon Jung et.al. | 2409.17285 | null |
| 2024-09-25 | Emotional Dimension Control in Language Model-Based Text-to-Speech: Spanning a Broad Spectrum of Human Emotions | Kun Zhou et.al. | 2409.16681 | null |
| 2024-09-25 | Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation | Siyin Wang et.al. | 2409.16644 | null |
| 2024-09-24 | FastTalker: Jointly Generating Speech and Conversational Gestures from Text | Zixin Guo et.al. | 2409.16404 | null |
| 2024-09-24 | Beyond Text-to-Text: An Overview of Multimodal and Generative Artificial Intelligence for Education Using Topic Modeling | Ville Heilala et.al. | 2409.16376 | null |
| 2024-09-24 | Facial Expression-Enhanced TTS: Combining Face Representation and Emotion Intensity for Adaptive Speech | Yunji Chu et.al. | 2409.16203 | null |
| 2024-09-24 | NanoVoice: Efficient Speaker-Adaptive Text-to-Speech for Multiple Speakers | Nohil Park et.al. | 2409.15760 | null |
| 2024-09-24 | VoiceGuider: Enhancing Out-of-Domain Performance in Parameter-Efficient Speaker-Adaptive Text-to-Speech via Autoguidance | Jiheum Yeom et.al. | 2409.15759 | null |
| 2024-09-24 | StyleFusion TTS: Multimodal Style-control and Enhanced Feature Fusion for Zero-shot Text-to-speech Synthesis | Zhiyong Chen et.al. | 2409.15741 | null |
| 2024-09-04 | Real-time Robotics Situation Awareness for Accident Prevention in Industry | Juan M. Deniz et.al. | 2409.15305 | null |
| 2024-11-28 | A Comprehensive Survey with Critical Analysis for Deepfake Speech Detection | Lam Pham et.al. | 2409.15180 | null |
| 2024-09-23 | HiFi-Glot: Neural Formant Synthesis with Differentiable Resonant Filters | Lauri Juvela et.al. | 2409.14823 | null |
| 2024-09-23 | LlamaPartialSpoof: An LLM-Driven Fake Speech Dataset Simulating Disinformation Generation | Hieu-Thi Luong et.al. | 2409.14743 | null |
| 2024-09-20 | Zero-shot Cross-lingual Voice Transfer for TTS | Fadi Biadsy et.al. | 2409.13910 | null |
| 2024-09-20 | On the Feasibility of Fully AI-automated Vishing Attacks | João Figueiredo et.al. | 2409.13793 | null |
| 2024-09-24 | Enhancing Kurdish Text-to-Speech with Native Corpus Training: A High-Quality WaveGlow Vocoder Approach | Abdulhady Abas Abdullah et.al. | 2409.13734 | null |
| 2024-09-20 | Audio Codec Augmentation for Robust Collaborative Watermarking of Speech Synthesis | Lauri Juvela et.al. | 2409.13382 | link |
| 2024-09-19 | Enhancing Synthetic Training Data for Speech Commands: From ASR-Based Filtering to Domain Adaptation in SSL Latent Space | Sebastião Quintas et.al. | 2409.12745 | null |
| 2024-09-19 | NDVQ: Robust Neural Audio Codec with Normal Distribution-Based Vector Quantization | Zhikang Niu et.al. | 2409.12717 | null |
| 2024-09-19 | Preference Alignment Improves Language Model-Based TTS | Jinchuan Tian et.al. | 2409.12403 | null |
| 2024-09-10 | Prosodic Parameter Manipulation in TTS generated speech for Controlled Speech Generation | Podakanti Satyajith Chary et.al. | 2409.12176 | null |
| 2024-09-18 | Low Frame-rate Speech Codec: a Codec Designed for Fast High-quality Speech LLM Training and Inference | Edresson Casanova et.al. | 2409.12117 | null |
| 2024-09-18 | Exploring an Inter-Pausal Unit (IPU) based Approach for Indic End-to-End TTS Systems | Anusha Prakash et.al. | 2409.11915 | null |
| 2024-09-18 | Mixture of Experts Fusion for Fake Audio Detection Using Frozen wav2vec 2.0 | Zhiyong Wang et.al. | 2409.11909 | null |
| 2024-09-18 | DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech | Xin Qi et.al. | 2409.11835 | null |
| 2024-09-18 | Speaking from Coarse to Fine: Improving Neural Codec Language Model via Multi-Scale Speech Coding and Generation | Haohan Guo et.al. | 2409.11630 | null |
| 2024-09-17 | SpMis: An Investigation of Synthetic Spoken Misinformation Detection | Peizhuo Liu et.al. | 2409.11308 | null |
| 2024-09-19 | The Art of Storytelling: Multi-Agent Generative AI for Dynamic Multimodal Narratives | Samee Arif et.al. | 2409.11261 | link |
| 2024-09-17 | Zero Shot Text to Speech Augmentation for Automatic Speech Recognition on Low-Resource Accented Speech Corpora | Francesco Nespoli et.al. | 2409.11107 | null |
| 2024-09-17 | Single-stage TTS with Masked Audio Token Modeling and Semantic Knowledge Distillation | Gerard I. Gállego et.al. | 2409.11003 | null |
| 2024-09-17 | Enhancing Multilingual Speech Generation and Recognition Abilities in LLMs with Constructed Code-switched Data | Jing Xu et.al. | 2409.10969 | null |
| 2024-09-16 | Emo-DPO: Controllable Emotional Speech Synthesis through Direct Preference Optimization | Xiaoxue Gao et.al. | 2409.10157 | null |
| 2024-09-16 | StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion | Yinghao Aaron Li et.al. | 2409.10058 | null |
| 2024-09-15 | Acquiring Pronunciation Knowledge from Transcribed Speech Audio via Multi-task Learning | Siqi Sun et.al. | 2409.09891 | null |
| 2025-01-13 | MacST: Multi-Accent Speech Synthesis via Text Transliteration for Accent Conversion | Sho Inoue et.al. | 2409.09352 | null |
| 2024-09-14 | E1 TTS: Simple and Fast Non-Autoregressive TTS | Zhijun Liu et.al. | 2409.09351 | null |
| 2024-09-14 | Improving Robustness of Diffusion-Based Zero-Shot Speech Synthesis via Stable Formant Generation | Changjin Han et.al. | 2409.09311 | null |
| 2024-09-14 | SafeEar: Content Privacy-Preserving Audio Deepfake Detection | Xinfeng Li et.al. | 2409.09272 | link |
| 2024-09-13 | AccentBox: Towards High-Fidelity Zero-Shot Accent Generation | Jinzuomu Zhong et.al. | 2409.09098 | null |
| 2024-09-17 | HLTCOE JHU Submission to the Voice Privacy Challenge 2024 | Henry Li Xinyuan et.al. | 2409.08913 | null |
| 2024-09-13 | Text-To-Speech Synthesis In The Wild | Jee-weon Jung et.al. | 2409.08711 | null |
| 2024-09-13 | LLM-Powered Grapheme-to-Phoneme Conversion: Benchmark and Case Study | Mahta Fetrat Qharabagh et.al. | 2409.08554 | null |
| 2024-09-14 | Exploring Accessibility Trends and Challenges in Mobile App Development: A Study of Stack Overflow Questions | Amila Indika et.al. | 2409.07945 | null |
| 2024-09-12 | Full-text Error Correction for Chinese Speech Recognition with Large Language Model | Zhiyuan Tang et.al. | 2409.07790 | null |
| 2025-01-03 | SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis | Helin Wang et.al. | 2409.07556 | link |
| 2024-09-11 | D-CAPTCHA++: A Study of Resilience of Deepfake CAPTCHA under Transferable Imperceptible Adversarial Attack | Hong-Hanh Nguyen-Le et.al. | 2409.07390 | null |
| 2024-09-11 | Cross-Dialect Text-To-Speech in Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level BERT | Kazuki Yamauchi et.al. | 2409.07265 | null |
| 2024-09-11 | Zero-Shot Text-to-Speech as Golden Speech Generator: A Systematic Framework and its Applicability in Automatic Pronunciation Assessment | Tien-Hong Lo et.al. | 2409.07151 | null |
| 2024-09-11 | The VoiceMOS Challenge 2024: Beyond Speech Quality Prediction | Wen-Chin Huang et.al. | 2409.07001 | null |
| 2024-09-10 | Enhancing Emotional Text-to-Speech Controllability with Natural Language Guidance through Contrastive Learning and Diffusion Models | Xin Jing et.al. | 2409.06451 | null |
| 2024-09-26 | What happens to diffusion model likelihood when your model is conditional? | Mattias Cross et.al. | 2409.06364 | null |
| 2024-09-10 | VoiceWukong: Benchmarking Deepfake Voice Detection | Ziwei Yan et.al. | 2409.06348 | null |
| 2024-09-10 | AS-Speech: Adaptive Style For Speech Synthesis | Zhipeng Li et.al. | 2409.05730 | null |
| 2024-10-07 | IndicVoices-R: Unlocking a Massive Multilingual Multi-speaker Speech Corpus for Scaling Indian TTS | Ashwin Sankar et.al. | 2409.05356 | link |
| 2024-09-10 | Disentangling the Prosody and Semantic Information with Pre-trained Model for In-Context Learning based Zero-Shot Voice Conversion | Zhengyang Chen et.al. | 2409.05004 | null |
| 2024-09-01 | Sample-Efficient Diffusion for Text-To-Speech Synthesis | Justin Lovelace et.al. | 2409.03717 | link |
| 2024-09-10 | LAST: Language Model Aware Speech Tokenization | Arnon Turetzky et.al. | 2409.03701 | null |
| 2024-09-05 | FireRedTTS: A Foundation Text-To-Speech Framework for Industry-Level Generative Speech Applications | Hao-Han Guo et.al. | 2409.03283 | null |
| 2024-09-04 | Training Universal Vocoders with Feature Smoothing-Based Augmentation Methods for High-Quality TTS Systems | Jeongmin Liu et.al. | 2409.02517 | null |
| 2024-09-04 | Fast, High-Quality and Parameter-Efficient Articulatory Synthesis using Differentiable DSP | Yisi Liu et.al. | 2409.02451 | null |
| 2024-09-11 | vec2wav 2.0: Advancing Voice Conversion via Discrete Token Vocoders | Yiwei Guo et.al. | 2409.01995 | null |
| 2024-10-02 | VoxHakka: A Dialectally Diverse Multi-speaker Text-to-Speech System for Taiwanese Hakka | Li-Wei Chen et.al. | 2409.01548 | null |
| 2024-09-02 | A multilingual training strategy for low resource Text to Speech | Asma Amalas et.al. | 2409.01217 | null |
| 2024-09-02 | A Framework for Synthetic Audio Conversations Generation using Large Language Models | Kaung Myat Kyaw et.al. | 2409.00946 | null |
| 2024-09-02 | SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model Based Text-to-Speech Synthesis | Haohan Guo et.al. | 2409.00933 | link |
| 2024-10-11 | MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer | Yuancheng Wang et.al. | 2409.00750 | null |
| 2024-08-30 | SelectTTS: Synthesizing Anyone's Voice via Discrete Unit-Based Frame Selection | Ismail Rasim Ulgen et.al. | 2408.17432 | null |
| 2024-08-30 | AASIST3: KAN-Enhanced AASIST Speech Deepfake Detection using SSL Features and Additional Regularization for the ASVspoof 2024 Challenge | Kirill Borodin et.al. | 2408.17352 | null |
| 2024-09-19 | Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model | Zhen Ye et.al. | 2408.17175 | link |
| 2024-08-30 | Utilizing Speaker Profiles for Impersonation Audio Detection | Hao Gu et.al. | 2408.17009 | null |
| 2024-08-30 | Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming | Zhifei Xie et.al. | 2408.16725 | link |
| 2024-08-29 | RAVE for Speech: Efficient Voice Conversion at High Sampling Rates | Anders R. Bargum et.al. | 2408.16546 | null |
| 2024-08-29 | Enabling Beam Search for Language Model-Based Text-to-Speech Synthesis | Zehai Tu et.al. | 2408.16373 | null |
| 2024-08-28 | Multi-modal Adversarial Training for Zero-Shot Voice Cloning | John Janiczek et.al. | 2408.15916 | null |
| 2024-08-29 | Easy, Interpretable, Effective: openSMILE for voice deepfake detection | Octavian Pascu et.al. | 2408.15775 | null |
| 2024-08-28 | VoxInstruct: Expressive Human Instruction-to-Speech Generation with Unified Multilingual Codec Language Modelling | Yixuan Zhou et.al. | 2408.15676 | link |
| 2024-08-27 | Literary and Colloquial Dialect Identification for Tamil using Acoustic Features | M. Nanmalar et.al. | 2408.14887 | null |
| 2024-08-28 | VoiceTailor: Lightweight Plug-In Adapter for Diffusion-Based Personalized Text-to-Speech | Heeseung Kim et.al. | 2408.14739 | null |
| 2024-08-27 | StyleSpeech: Parameter-efficient Fine Tuning for Pre-trained Controllable Text-to-Speech | Haowei Lo |