- Shanghai Jiao Tong University
- https://yfyeung.github.io
- https://scholar.google.com/citations?user=slhAlQ0AAAAJ
- LinkedIn: in/yifan-yang-290ba624b
Starred repositories
A SOTA Industrial-Grade Voice Activity Detection & Audio Event Detection, supporting 100+ languages, outperforming Silero-VAD, TEN-VAD, FunASR-VAD and WebRTC-VAD
Qwen3.5 is the large language model series developed by the Qwen team at Alibaba Cloud.
A Framework for Speech, Language, Audio, and Music Processing with Large Language Models
A SOTA Industrial-Grade All-in-One ASR system with ASR, VAD, LID, and Punc modules. FireRedASR2 supports Chinese (Mandarin, 20+ dialects/accents), English, code-switching, and both speech and singi…
MOSS-TTS Family is an open-source speech and sound generation model family from MOSI.AI and the OpenMOSS team. It is designed for high-fidelity, high-expressiveness, and complex real-world scenario…
📚 "Building Agents from Scratch" (《从零开始构建智能体》): a from-scratch tutorial on agent principles and practice
Open-Ended Speaking Style Modeling via Fine-Grained and Multi-Granular Contrastive Language-Speech Pre-training
UniAudio 2.0: An audio foundation model for text, speech, sound, and music
Elevate your AI research writing, no more tedious polishing ✨
Qwen3-ASR is an open-source series of ASR models developed by the Qwen team at Alibaba Cloud, supporting stable multilingual speech/music/song recognition, language detection and timestamp prediction.
Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders
Qwen3-TTS is an open-source series of TTS models developed by the Qwen team at Alibaba Cloud, supporting stable, expressive, and streaming speech generation, free-form voice design, and vivid voice…
A Large-scale Wu Dialect Speech Corpus with Multi-dimensional Annotations
Official repository for the WenetSpeech-Chuan dataset.
An All-in-One Speech, Sound, Music Codec with Single Nested Codebook
Hybrid Flow Matching and GAN with Multi-Resolution Network for Few-Step High-Fidelity Audio Generation
Fun-ASR is an end-to-end large speech recognition model launched by Tongyi Lab.
The repository provides code for running inference with the Meta Segment Anything Audio Model (SAM-Audio), links for downloading the trained model checkpoints, and example notebooks that show how t…
GLM-TTS: Controllable & Emotion-Expressive Zero-shot TTS with Multi-Reward Reinforcement Learning
Codebase for 'Scaling Rich Style-Prompted Text-to-Speech Datasets'
PyTorch implementation of JiT https://arxiv.org/abs/2511.13720
A powerful 3B-parameter, LLM-based reinforcement learning audio editing model that excels at editing emotion, speaking style, and paralinguistics, and features robust zero-shot text-to-speech
An intuitive and low-overhead instrumentation tool for Python
SpeechJudge: Towards Human-Level Judgment for Speech Naturalness (https://arxiv.org/abs/2511.07931)
Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages
[CVPR 2025] MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis