We highlight a collaborative breakthrough from HKU and ByteDance introducing JoVA, a streamlined framework for high-fidelity joint video and audio generation. By using a single joint self-attention layer for cross-modal interaction, JoVA eliminates the need for the external fusion modules found in traditional cascaded or end-to-end models. The work addresses the critical challenge of lip-syncing through a Mouth-Aware Supervision strategy, which applies weighted flow-matching losses to mouth regions mapped into latent space. Trained on a diverse dataset of roughly 1.9 million samples, the model achieves a state-of-the-art LSE-C score of 6.64, outperforming existing solutions in both temporal alignment and audio-visual consistency. The research offers developers a more efficient architectural blueprint for multimodal diffusion models, simplifying the path toward realistic digital human synthesis.
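As a rough illustration of the two ideas summarized above, the sketch below concatenates video and audio tokens before a single self-attention layer (so cross-modal interaction happens inside ordinary self-attention rather than a separate fusion module) and computes a flow-matching loss with extra weight on mouth-region latents. This is not the authors' implementation: the tensor shapes, the `mouth_weight` value, the rectified-flow velocity convention, and names such as `JointSelfAttentionBlock` and `mouth_weighted_fm_loss` are all assumptions made for illustration.

```python
# Minimal sketch (not the JoVA code): joint self-attention over concatenated
# video/audio tokens + a mouth-weighted flow-matching loss. Shapes, names, and
# hyperparameters are hypothetical.
import torch
import torch.nn as nn


class JointSelfAttentionBlock(nn.Module):
    """One self-attention layer applied to the concatenation of video and
    audio tokens, so the two modalities interact without an external fusion module."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, video_tokens: torch.Tensor, audio_tokens: torch.Tensor):
        # video_tokens: (B, Nv, D), audio_tokens: (B, Na, D)
        x = torch.cat([video_tokens, audio_tokens], dim=1)   # (B, Nv + Na, D)
        h = self.norm(x)
        h, _ = self.attn(h, h, h, need_weights=False)
        x = x + h                                             # residual connection
        n_v = video_tokens.shape[1]
        return x[:, :n_v], x[:, n_v:]                         # split back per modality


def mouth_weighted_fm_loss(pred_velocity, noise, clean_latent, mouth_mask, mouth_weight=5.0):
    """Flow-matching (velocity) loss with higher weight on mouth-region latents.

    pred_velocity, noise, clean_latent: (B, C, T, H, W) video latents
    mouth_mask: (B, 1, T, H, W) binary mask marking mouth regions in latent space
    """
    target_velocity = clean_latent - noise                    # one common rectified-flow target
    per_elem = (pred_velocity - target_velocity) ** 2
    weights = 1.0 + (mouth_weight - 1.0) * mouth_mask         # upweight masked positions
    return (weights * per_elem).mean()


if __name__ == "__main__":
    B, D = 2, 256
    block = JointSelfAttentionBlock(D)
    v, a = block(torch.randn(B, 64, D), torch.randn(B, 32, D))
    print(v.shape, a.shape)  # torch.Size([2, 64, 256]) torch.Size([2, 32, 256])
```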
Topic: Diffusion Models
A curated collection of WindFlash AI Daily Report items tagged “Diffusion Models” (bilingual summaries with evidence quotes).
1 item
December 30, 2025
机器之心 · Dec 30, 07:17 AM