Google Launches First Natively Multimodal Embedding Model, Gemini Embedding 2, Supporting Text, Images, Video, and Audio in a Unified Space

Mar 11, 2026
Google

Summary

Google has launched Gemini Embedding 2, its first natively multimodal embedding model, which processes text, images, video, audio, and documents together in a single unified embedding space across 100+ languages. It is now available in public preview via the Gemini API and Vertex AI.

Key Points

  • Google releases Gemini Embedding 2 in public preview, its first natively multimodal embedding model built on the Gemini architecture; it maps text, images, video, audio, and documents into a single unified embedding space across 100+ languages.
  • The model supports flexible input types including up to 8,192 text tokens, 6 images per request, 120 seconds of video, native audio ingestion, and PDFs up to 6 pages, while also handling interleaved multimodal inputs in a single request.
  • Gemini Embedding 2 uses Matryoshka Representation Learning for scalable output dimensions and is accessible via the Gemini API, Vertex AI, and integrations with tools such as LangChain, LlamaIndex, and ChromaDB (see the sketch after this list).
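
To make the access path concrete, below is a minimal sketch of a multimodal embedding request using the google-genai Python SDK. The model name gemini-embedding-2, the ability to pass image Parts to embed_content, and the chosen output dimensionality are assumptions drawn from this article and the SDK's existing embed_content pattern, not confirmed preview behavior.

```python
# Sketch of a multimodal embedding call via the google-genai SDK.
# Assumptions (from the article, not confirmed API): the preview model
# is named "gemini-embedding-2" and embed_content accepts interleaved
# text/image Parts in a single request.
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

with open("product_photo.jpg", "rb") as f:
    image_bytes = f.read()

response = client.models.embed_content(
    model="gemini-embedding-2",  # hypothetical preview model name
    contents=[
        types.Part.from_text(text="Red trail-running shoe, size guide"),
        types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
    ],
    # Matryoshka Representation Learning: request a truncated output
    # dimensionality instead of the model's full-size vector.
    config=types.EmbedContentConfig(output_dimensionality=768),
)

vector = response.embeddings[0].values
print(len(vector))  # 768
```

Because MRL trains nested prefixes of the full embedding, a truncated vector such as the 768-dimensional one requested here should remain usable for retrieval at lower storage and compute cost; the article does not state which dimensions the preview supports.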
