| Course Number | Instructor | Title |
| --- | --- | --- |
| CS 6804 | C. Thomas | Multimodal Vision |
Humans can reason about how concepts read in text, heard in audio, and seen in visual content (different modalities of data) relate to one another by drawing on a learned multimodal understanding of the world. For example, a textual description might allow a person to recognize a bird they have never seen before by drawing on a background understanding of what the color blue looks like and what a high-pitched bird call sounds like. Thus, building artificially intelligent systems capable of similarly robust multimodal reasoning is an area of intense research interest.
This graduate-level seminar course will introduce students to the latest research in multimodal computer vision, with a significant emphasis on vision and language. The course will feature foundational lectures, in-depth student-led presentations on state-of-the-art research, and classroom discussions. Students will complete an intensive, semester-long research project and report. A background in deep learning is strongly recommended.
Example topics include representation learning, fusion, pretraining, privileged modalities, prompt learning, cross-modal retrieval, model architectures (e.g., transformers, two-stream networks, and convolutional networks), attention mechanisms, zero-shot and few-shot recognition, knowledge representation, generative models, and embodied vision.