Towards more multimodal, robust, and general AI systems
Next week marks the start of the 37th annual conference on Neural Information Processing Systems (NeurIPS), the largest artificial intelligence (AI) conference in the world. NeurIPS 2023 takes place December 10-16 in New Orleans, USA.
Teams from across Google DeepMind are presenting more than 180 papers at the main conference and workshops.
We’ll be showcasing demos of our cutting-edge AI models for global weather forecasting, materials discovery, and watermarking AI-generated content. There will also be an opportunity to hear from the team behind Gemini, our largest and most capable AI model.
Here’s a look at some of our research highlights:
Multimodality: language, video, action
Generative AI models can create paintings, compose music, and write stories. But however capable these models may be in one medium, most struggle to transfer those skills to another. We explore how generative abilities could help models learn across modalities. In a spotlight presentation, we show that diffusion models can be used to classify images with no additional training required. Diffusion models like Imagen classify images in a more human-like way than other models, relying on shapes rather than textures. What’s more, we show that simply predicting captions from images can improve computer-vision learning. Our approach surpassed current methods on vision and language tasks, and showed more potential to scale.
More multimodal models could pave the way for more useful digital and robot assistants that help people in their everyday lives. In a spotlight poster, we create agents that can interact with the digital world like humans do: through screenshots and keyboard and mouse actions. Separately, we show that by leveraging video generation, including subtitles and closed captioning, models can transfer knowledge by predicting video plans for real robot actions.
One of the next milestones could be to generate realistic experience in response to actions carried out by humans, robots, and other types of interactive agents. We’ll be showcasing a demo of UniSim, our universal simulator of real-world interactions. This type of technology could have applications across industries from video games and film, to training agents for the real world.
Building safe and understandable AI
Large language models (LLMs) can generate impressive answers, but are prone to “hallucinations”: text that seems correct but is made up. Our researchers ask whether a method that finds where a fact is stored in a model (localization) also makes it possible to edit that fact. Surprisingly, they found that locating a fact and editing that location does not reliably edit the fact, hinting at the complexity of understanding and controlling stored information in LLMs. With Tracr, we propose a novel way of evaluating interpretability methods by translating human-readable programs into transformer models. We’ve open-sourced a version of Tracr to serve as a ground truth for evaluating interpretability methods.
When developing and deploying large models, privacy needs to be embedded at every step of the way. For training, our teams are studying how to measure whether language models are memorizing data, in order to protect private and sensitive material. In parallel, our researchers demonstrate how to evaluate privacy-preserving training with a technique that is efficient enough for real-world use. In another oral presentation, our scientists investigate the limitations of training through “student” and “teacher” models that have different levels of access and vulnerability if attacked.
As large models become more capable, our research is pushing the limits of new abilities to develop more general AI systems.
While language models are used for general tasks, they lack the exploratory and contextual understanding needed to solve more complex problems. We introduce Tree of Thoughts, a new framework for language model inference that helps models explore and reason over a wide range of possible solutions. By organizing reasoning and planning as a tree instead of the commonly used flat chain of thought, we demonstrate that a language model is able to solve complex tasks like the “Game of 24” much more accurately.
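To make the tree idea concrete, here is a minimal sketch of searching over partial solutions for the arithmetic puzzle mentioned above (combine four numbers with +, -, *, / to reach 24). It is an illustration under loud assumptions, not the framework itself: no language model is involved, the hypothetical propose function simply enumerates every arithmetic step where a model would suggest a few promising "thoughts", and there is no learned evaluator to prune branches. What it does show is the structural difference from a flat chain: the search can branch at every step and back up from dead ends.

```python
from fractions import Fraction
from itertools import combinations


def propose(state):
    """Stand-in for a model's thought generator (assumption: exhaustive enumeration).

    A state is (remaining_numbers, steps_so_far). Each proposed thought
    combines two remaining numbers into one with a single operation.
    """
    nums, steps = state
    for i, j in combinations(range(len(nums)), 2):
        a, b = nums[i], nums[j]
        rest = [n for k, n in enumerate(nums) if k not in (i, j)]
        candidates = [
            (a + b, f"{a} + {b}"),
            (a * b, f"{a} * {b}"),
            (a - b, f"{a} - {b}"),
            (b - a, f"{b} - {a}"),
        ]
        if b != 0:
            candidates.append((a / b, f"{a} / {b}"))
        if a != 0:
            candidates.append((b / a, f"{b} / {a}"))
        for value, expr in candidates:
            yield rest + [value], steps + [f"{expr} = {value}"]


def solve(numbers, target=24):
    """Depth-first search over the tree of partial solutions.

    Returns the list of steps along one successful branch, or None if
    every branch is a dead end.
    """
    stack = [([Fraction(n) for n in numbers], [])]
    while stack:
        nums, steps = stack.pop()
        if len(nums) == 1:
            if nums[0] == target:
                return steps
            continue  # dead end: abandon this branch and try another
        stack.extend(propose((nums, steps)))
    return None


solution = solve([4, 9, 10, 13])
unsolvable = solve([1, 1, 1, 1])
```

Here solution holds three arithmetic steps whose last result is 24, while unsolvable is None because no branch reaches the target. Exact fractions are used so that divisions such as 8 / 3 do not introduce rounding errors.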
To help people solve problems and find what they’re looking for, AI models need to process billions of unique values efficiently. With Feature Multiplexing, one single representation space is used for many different features, allowing large embedding models (LEMs) to scale to products for billions of users.
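As a rough illustration of sharing one representation space across features, the sketch below stores a single table of vectors and maps every (feature, value) pair into it. The details are assumptions made for the example, not a description of the production system: the class name SharedEmbeddingTable, the use of a hash of the feature name and value to pick a row, and the table sizes are all hypothetical.

```python
import hashlib
import random


class SharedEmbeddingTable:
    """One table of vectors shared by all features (hypothetical sketch)."""

    def __init__(self, num_rows, dim, seed=0):
        rng = random.Random(seed)
        self.num_rows = num_rows
        # In a real model these rows would be learned; here they are random.
        self.rows = [
            [rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(num_rows)
        ]

    def row_index(self, feature, value):
        # Assumption: hash the feature name together with the value, so
        # different features spread over the same rows in different ways.
        key = f"{feature}={value}".encode("utf-8")
        digest = hashlib.sha256(key).digest()
        return int.from_bytes(digest[:8], "big") % self.num_rows

    def lookup(self, feature, value):
        return self.rows[self.row_index(feature, value)]


table = SharedEmbeddingTable(num_rows=1024, dim=8)
query_vec = table.lookup("query_token", "umbrella")
country_vec = table.lookup("country", "NZ")
```

The memory cost depends only on num_rows and dim, not on how many features or distinct values exist, which is the property that lets a single table serve many features. The price is that unrelated values can collide on the same row.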
Finally, with DoReMi we show how using AI to automate the mixture of training data types can significantly speed up language model training and improve performance on new and unseen tasks.
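One way to picture automated data mixing is as a reweighting loop: domains where a small proxy model still lags a reference model get sampled more often. The sketch below shows a single such update, assuming a multiplicative rule on the per-domain loss gap followed by renormalization and a little smoothing toward uniform. The function name, the step size, the smoothing constant, and the loss values are all made up for illustration and are not taken from our experiments.

```python
import math


def update_domain_weights(weights, proxy_loss, reference_loss,
                          step_size=1.0, smoothing=1e-3):
    """One reweighting step over data domains (hypothetical sketch).

    weights, proxy_loss and reference_loss are dicts keyed by domain name.
    """
    domains = list(weights)
    # Excess loss: how far the proxy model lags the reference, floored at 0.
    excess = {d: max(proxy_loss[d] - reference_loss[d], 0.0) for d in domains}
    # Multiplicative update: upweight domains with larger excess loss.
    unnormalized = {
        d: weights[d] * math.exp(step_size * excess[d]) for d in domains
    }
    total = sum(unnormalized.values())
    uniform = 1.0 / len(domains)
    # Renormalize, then mix in a little uniform mass so no domain hits zero.
    return {
        d: (1.0 - smoothing) * unnormalized[d] / total + smoothing * uniform
        for d in domains
    }


# Invented numbers, for illustration only.
weights = {"web": 1 / 3, "code": 1 / 3, "books": 1 / 3}
proxy_loss = {"web": 2.0, "code": 3.0, "books": 2.5}
reference_loss = {"web": 2.0, "code": 2.2, "books": 2.4}
new_weights = update_domain_weights(weights, proxy_loss, reference_loss)
```

With these invented losses, the code domain has the largest gap and so receives the largest new weight, followed by books and then web. Repeating the step during proxy training and averaging the weights over time would yield a mixture to use for training the larger model.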
Fostering a global AI community
We’re proud to sponsor NeurIPS and to support workshops led by LatinX in AI, QueerInAI, and Women In ML, helping to foster research collaborations and develop a diverse AI and machine learning community. This year, NeurIPS will have a creative track featuring our Visualising AI project, which commissions artists to create more diverse and accessible representations of AI.
If you’re attending NeurIPS, come by our booth to learn more about our cutting-edge research and meet our teams hosting workshops and presenting across the conference.