| Course Number | Instructor | Title |
| --- | --- | --- |
| CS 6804 | C. Thomas | Multimodal Vision |
Humans can reason about how concepts read in text, heard in audio, and seen in visual content (different modalities of data) relate to one another by drawing on a learned multimodal understanding of the world. For example, a textual description might allow a person to recognize a bird they have never seen before by drawing on a background understanding of what the color blue looks like and what a high-pitched bird call sounds like. Thus, building artificially intelligent systems capable of similarly robust multimodal reasoning is an area of intense research interest.
This graduate-level seminar course will introduce students to the latest research in multimodal computer vision, with a significant emphasis on vision and language. The course will feature foundational lectures, in-depth student-led presentations on state-of-the-art research, and classroom discussions. Students will complete an intensive, semester-long research project and report. A background in deep learning is strongly recommended.
Example topics include representation learning, fusion, pretraining, privileged modalities, prompt learning, cross-modal retrieval, model architectures (e.g., transformers, two-stream networks, and convolutional networks), attention mechanisms, zero-shot and few-shot recognition, knowledge representation, generative models, and embodied vision.