Multimodal AI Lab @ KAIST

Publications

2025

CrossSpeech++: Cross-lingual Speech Synthesis with Decoupled Language and Speaker Generation
J. Kim, H. Yang, Y. Ju, I. Kim, B. Kim, J. S. Chung
IEEE Transactions on Audio, Speech and Language Processing
PDF

Seeing Speech and Sound: Distinguishing and Locating Audio Sources in Visual Scenes
H. Ryu, S. Kim, J. S. Chung, A. Senocak
IEEE Conference on Computer Vision and Pattern Recognition
PDF

From Faces to Voices: Learning Hierarchical Representations for High-quality Video-to-Speech
J. Kim, J. Choi, J. Kim, C. Jung, J. S. Chung
IEEE Conference on Computer Vision and Pattern Recognition
PDF

High-Quality Joint Image and Video Compression with Causal VAE
D. M. Argaw, X. Liu, Q. Zhang, J. S. Chung, M. Liu
International Conference on Learning Representations
PDF

AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models
S. Kim, H. Oh, J. Lee, A. Senocak, J. S. Chung, T. Oh
International Conference on Learning Representations
PDF

V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow
J. Choi, J. Kim, J. Li, J. S. Chung, S. Liu
International Conference on Acoustics, Speech, and Signal Processing
PDF

LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport
K. Rho, H. Lee, V. Iverson, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF

VoiceDiT: Dual-Condition Diffusion Transformer for Environment-Aware Speech Synthesis
J. Jung, J. Ahn, C. Jung, T. D. Nguyen, Y. Jang, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF

Accelerating Codec-based Speech Synthesis with Multi-Token Prediction and Speculative Decoding
T. D. Nguyen, J. Kim, J. Choi, S. Choi, J. Park, Y. Lee, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF

AdaptVC: High Quality Voice Conversion with Adaptive Learning
J. Kim, J. Kim, Y. Choi, T. D. Nguyen, S. Mun, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF

2024

Audio Mamba: Bidirectional State Space Model for Audio Representation Learning
M. H. Erol, A. Senocak, J. Feng, J. S. Chung
IEEE Signal Processing Letters
PDF

Bridging the Gap between Audio and Text using Parallel-attention for User-defined Keyword Spotting
Y. Kim, J. Jung, J. Park, B. Kim, J. S. Chung
IEEE Signal Processing Letters
PDF

Let Me Finish My Sentence: Video Temporal Grounding with Holistic Text Understanding
J. Woo, H. Ryu, Y. Jang, J. W. Cho, J. S. Chung
ACM International Conference on Multimedia
PDF

VoxSim: A perceptual voice similarity dataset
J. Ahn, Y. Kim, Y. Choi, D. Kwak, J. Kim, S. Mun, J. S. Chung
Interspeech
PDF

Lightweight Audio Segmentation for Long-form Speech Translation
J. Lee, S. Kim, H. Kim, J. S. Chung
Interspeech
PDF

ElasticAST: An Audio Spectrogram Transformer for All Length and Resolutions
J. Feng, M. H. Erol, J. S. Chung, A. Senocak
Interspeech
PDF

To what extent can ASV systems naturally defend against spoofing attacks?
J. Jung, X. Wang, N. Evans, S. Watanabe, H. Shim, H. Tak, S. Arora, J. Yamagishi, J. S. Chung
Interspeech
PDF

Disentangled Representation Learning for Environment-agnostic Speaker Recognition
K. Nam, H. Heo, J. Jung, J. S. Chung
Interspeech
PDF

FlowAVSE: Efficient Audio-Visual Speech Enhancement with Conditional Flow Matching
C. Jung, S. Lee, J. Kim, J. S. Chung
Interspeech
PDF

EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning
J. Kim, H. Lee, K. Rho, J. Kim, J. S. Chung
International Conference on Machine Learning
PDF

Faces that Speak: Jointly Synthesising Talking Face and Speech from Text
Y. Jang, J. Kim, J. Ahn, D. Kwak, H. Yang, Y. Ju, I. Kim, B. Kim, J. S. Chung
IEEE Conference on Computer Vision and Pattern Recognition
PDF

Scaling Up Video Summarization Pretraining with Large Language Models
D. M. Argaw, S. Yoon, F. C. Heilbron, H. Deilamsalehy, T. Bui, Z. Wang, F. Dernoncourt, J. S. Chung
IEEE Conference on Computer Vision and Pattern Recognition
PDF

Towards Automated Movie Trailer Generation
D. M. Argaw, M. Soldan, A. Pardo, C. Zhao, F. C. Heilbron, J. S. Chung, B. Ghanem
IEEE Conference on Computer Vision and Pattern Recognition
PDF

FreGrad: Lightweight and fast frequency-aware diffusion vocoder
T. D. Nguyen, J. Kim, Y. Jang, J. Kim, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF Project page

SlowFast Network for Continuous Sign Language Recognition
J. Ahn, Y. Jang, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF

Rethinking Session Variability: Leveraging Session Embeddings for Session Robustness in Speaker Verification
H. Heo, K. Nam, B. Lee, Y. Kwon, M. Lee, Y. J. Kim, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF

Speech Guided Masked Image Modeling for Visually Grounded Speech
J. Woo, H. Ryu, A. Senocak, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF

VoxMM: Rich Transcription of Conversations in the Wild
D. Kwak, J. Jung, K. Nam, Y. Jang, J. Jung, S. Watanabe, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF

From Coarse To Fine: Efficient Training for Audio Spectrogram Transformers
J. Feng, M. H. Erol, J. S. Chung, A. Senocak
International Conference on Acoustics, Speech, and Signal Processing
PDF

VoiceLDM: Text-to-Audio Generation with Linguistic Content
Y. Lee, I. Yeon, J. Nam, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF Project page

TalkNCE: Improving Active Speaker Detection with Talking-Aware Contrastive Learning
C. Jung, S. Lee, K. Nam, K. Rho, Y. J. Kim, Y. Jang, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF

Seeing Through the Conversation: Audio-Visual Speech Separation based on Diffusion Model
S. Lee, C. Jung, Y. Jang, J. Kim, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF

Let There Be Sound: Reconstructing High Quality Speech from Silent Videos
J. Kim, J. Kim, J. S. Chung
AAAI Conference on Artificial Intelligence
PDF Project page

Can CLIP Help Sound Source Localization?
S. Park, A. Senocak, J. S. Chung
Winter Conference on Applications of Computer Vision
PDF

2023

That's What I Said: Fully-Controllable Talking Face Generation
Y. Jang, K. Rho, J. Woo, H. Lee, J. Park, Y. Lim, B. Kim, J. S. Chung
ACM International Conference on Multimedia
PDF Project page

Sound Source Localization is All about Cross-Modal Alignment
A. Senocak, H. Ryu, J. Kim, T. Oh, H. Pfister, J. S. Chung
International Conference on Computer Vision
PDF

FlexiAST: Flexibility is What AST Needs
J. Feng, M. H. Erol, J. S. Chung, A. Senocak
Interspeech
PDF

Disentangled Representation Learning for Multilingual Speaker Recognition
K. Nam, Y. Kim, J. Huh, H. Heo, J. Jung, J. S. Chung
Interspeech
PDF Project page

Curriculum learning for self-supervised speaker verification
H. Heo, J. Jung, J. Kang, Y. Kwon, B. Lee, Y. J. Kim, J. S. Chung
Interspeech
PDF

Self-sufficient framework for continuous sign language recognition
Y. Jang, Y. Oh, J. W. Cho, M. Kim, D. Kim, I. S. Kweon, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF Project page

Metric learning for user-defined keyword spotting
J. Jung, Y. Kim, J. Park, Y. Lim, B. Kim, Y. Jang, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF Project page

Hindi as a second language: improving visually grounded speech with semantically similar samples
H. Ryu, A. Senocak, I. S. Kweon, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF

MarginNCE: Robust Sound Localization with a Negative Margin
S. Park, A. Senocak, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF

Advancing the dimensionality reduction of speaker embeddings for speaker diarisation: disentangling noise and informing speech activity
Y. J. Kim, H. Heo, J. Jung, Y. Kwon, B. Lee, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF

In search of strong embedding extractors for speaker diarisation
J. Jung, B. Lee, J. Huh, A. Brown, Y. Kwon, S. Watanabe, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF

Imaginary Voice: Face-styled Diffusion Model for Text-to-Speech
J. Lee, J. S. Chung, S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF

2022

Signing Outside the Studio: Benchmarking Background Robustness for Continuous Sign Language Recognition
Y. Jang, Y. Oh, J. W. Cho, D. Kim, J. S. Chung, I. S. Kweon
British Machine Vision Conference
PDF Project page

Augmentation adversarial training for self-supervised speaker representation learning
J. Kang, J. Huh, H. Heo, J. S. Chung
IEEE Journal of Selected Topics in Signal Processing
PDF

Pushing the limits of raw waveform speaker recognition
J. Jung, Y. J. Kim, H. Heo, B. Lee, Y. Kwon, J. S. Chung
Interspeech
PDF

Spell my name: Keyword boosted speech recognition
N. Jung, G. Kim, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF

Multi-scale speaker embedding-based graph attention networks for speaker diarisation
Y. Kwon, H. Heo, J. Jung, Y. J. Kim, B. Lee, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF

AASIST: Audio Anti-Spoofing using Integrated Spectro-Temporal Graph Attention Networks
J. Jung, H. Heo, H. Tak, H. Shim, J. S. Chung, B. Lee, H. Yu, N. Evans
International Conference on Acoustics, Speech, and Signal Processing
PDF