We highlight JoVA, a collaboration between HKU and ByteDance: a streamlined framework for high-fidelity joint video and audio generation. By using a single joint self-attention layer for cross-modal interaction, JoVA eliminates the need for the complex external fusion modules found in traditional cascaded or end-to-end models. It addresses the critical challenge of lip-syncing through a novel Mouth-Aware Supervision strategy, which applies weighted flow matching losses to mouth regions that have been precisely mapped into latent space. Using a diverse dataset of approximately 1.9 million samples, the model achieves a state-of-the-art LSE-C score of 6.64, outperforming existing solutions in both temporal alignment and audio-visual consistency. This research gives developers a more efficient architectural blueprint for multimodal diffusion models, simplifying the path toward realistic digital human synthesis.
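To make the idea of a mouth-weighted flow matching loss concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not JoVA's actual implementation: the function names, the VAE downsample factor of 8, the straight-line velocity target (with `x0` as noise and `x1` as the data latent), and the `mouth_weight` value are all hypothetical choices that the summary does not specify.

```python
import numpy as np


def mouth_mask_in_latent(bbox_px, latent_hw, downsample=8):
    """Map a pixel-space mouth box (x0, y0, x1, y1) onto a latent-grid mask.

    The downsample factor is an assumed VAE stride, not a value from the source.
    """
    h, w = latent_hw
    x0, y0, x1, y1 = bbox_px
    # Floor the start and ceil the end so partially covered latent cells count.
    lx0, ly0 = x0 // downsample, y0 // downsample
    lx1 = min(w, -(-x1 // downsample))
    ly1 = min(h, -(-y1 // downsample))
    mask = np.zeros((h, w), dtype=np.float32)
    mask[ly0:ly1, lx0:lx1] = 1.0
    return mask


def weighted_flow_matching_loss(v_pred, x0, x1, mask, mouth_weight=2.0):
    """Weighted MSE between predicted and target velocity.

    v_pred, x0, x1 have shape (C, H, W); mask has shape (H, W).
    """
    # Straight-line velocity target from noise x0 to data x1 (an assumption).
    v_target = x1 - x0
    # Weight is 1 outside the mouth region and mouth_weight inside it.
    weights = 1.0 + (mouth_weight - 1.0) * mask
    sq_err = (v_pred - v_target) ** 2  # mask broadcasts over the channel axis
    # Normalize by the total weight so the result is a weighted mean.
    return float((weights * sq_err).sum() / (weights.sum() * v_pred.shape[0]))
```

With this weighting, an error of the same size costs more when it falls inside the mouth region than elsewhere in the frame, which is the general effect the summary attributes to Mouth-Aware Supervision.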
Topic: Digital Humans
A curated collection of WindFlash AI Daily Report items tagged “Digital Humans” (bilingual summaries with evidence quotes).
1 item → Browse Daily Reports
What this topic covers
This hub groups WindFlash coverage of models, tools, companies, and workflows related to Digital Humans.
Why it matters
We prioritize changes that affect development, product decisions, creator workflows, or small-team strategy.
How to use it
Start with the newest dates, scan important items, sources, and summaries, then open the original source or related report.
December 30, 2025
Open this daily report → 机器之心, Dec 30, 07:17 AM
FAQ
Where do these items come from?
They come from published WindFlash AI Daily items, with source, summary, and report links preserved.
Will this hub update?
Yes. New daily report items tagged with this topic are added to this hub.