Transformers and other AI breakthroughs have delivered state-of-the-art performance across a range of modalities:
Text-to-Text (OpenAI ChatGPT)
Text-to-Image (Stable Diffusion)
Image-to-Text (OpenAI CLIP)
Speech-to-Text (OpenAI Whisper)
Text-to-Speech (Meta’s Massively Multilingual Speech)
Text-to-Audio (Meta MusicGen)
Text-to-Code (OpenAI Codex / GitHub Copilot)
Code-to-Text (ChatGPT, etc.)
The next frontier in AI is combining these modalities in interesting ways. Explain what’s happening in a photo. Debug a program with your voice. Generate music from an image. There’s still technical work to be done to combine these modalities, but the greatest challenge is not a technical one; it’s a user experience one.
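To make "debug a program with your voice" concrete, here’s a minimal sketch of what the chaining could look like using Hugging Face pipelines. The model choices, file names, and prompt are illustrative assumptions for this sketch, not any particular product’s implementation.

```python
# Sketch: "debug a program with your voice" as two chained models.
# Model names and file paths below are placeholders.
from transformers import pipeline

# 1. Speech -> text: transcribe a spoken bug report from an audio file.
transcriber = pipeline("automatic-speech-recognition", model="openai/whisper-small")
bug_report = transcriber("bug_report.wav")["text"]

# 2. Text (+ code) -> text: ask a language model to diagnose the bug.
with open("buggy_script.py") as f:
    source_code = f.read()

prompt = (
    "A developer described this bug out loud:\n"
    f"{bug_report}\n\n"
    "Here is the code in question:\n"
    f"{source_code}\n\n"
    "Explain the likely cause and suggest a fix:\n"
)

# Illustrative model choice; any code-capable text model would do.
assistant = pipeline("text-generation", model="bigcode/starcoderbase-1b")
suggestion = assistant(prompt, max_new_tokens=256)[0]["generated_text"]
print(suggestion)
```

Even in this toy version, wiring the models together is the easy part. Deciding how the user records the report, reviews the transcript, and acts on the suggested fix is where the real design work lives.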
What is the right UX for these use cases?
Chat isn’t always the best interface for a task, although it’s one of the most intuitive, especially when users are being introduced to new technology (why does every AI cycle start with chat?). Sticking images, audio, and other modalities into a chat interface can get confusing very quickly. It’s the same reason technologies like Jupyter Notebooks, which combine markup, graphs, and code in a single interface, are so polarizing: great for many exploratory tasks, but a master of none.
There’s a huge opportunity in the UX layer for integrating these different modalities. How do we best present these different types of outputs to users — audio, text, images, or code? How do we allow users to iterate on these models and provide feedback (e.g., what does it mean to fine-tune a multimodal model)?
Matt, I would love for you to write an updated version of this a few months from now. I think it adds a lot of value.
I'm looking forward to seeing Obsidian and Notion evolve their products around these modalities. There might also be a single app or agent in the future that can deliver all modalities through one UI, changing its layout depending on what task it needs from the human.