Google Launches Gemini Embedding 2, Its First Natively Multimodal Embedding Model, Unifying Text, Images, Video, and Audio in a Single Space
Summary
Google has launched Gemini Embedding 2, its first natively multimodal embedding model. It maps text, images, video, audio, and documents into a single unified embedding space across 100+ languages and is available now in public preview via the Gemini API and Vertex AI.
Key Points
- Google releases Gemini Embedding 2 in public preview: its first natively multimodal embedding model, built on the Gemini architecture, that maps text, images, video, audio, and documents into a single unified embedding space across 100+ languages.
- The model supports flexible inputs: up to 8,192 text tokens, 6 images per request, 120 seconds of video, native audio ingestion, and PDFs up to 6 pages. It also handles interleaved multimodal inputs in a single request (see the request sketch after this list).
- Gemini Embedding 2 features Matryoshka Representation Learning, which concentrates signal in the leading dimensions so embeddings can be truncated to smaller sizes with minimal quality loss (see the second sketch below). The model is accessible via the Gemini API, Vertex AI, and integrations with tools like LangChain, LlamaIndex, ChromaDB, and others.
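To make the interleaved-input claim concrete, here is a minimal sketch of what a mixed text-plus-image embedding request could look like, assuming the new model follows the existing google-genai SDK's embed_content pattern. The model ID "gemini-embedding-2" and the use of multimodal Part objects as contents are assumptions based on the announcement, not a confirmed API surface.

```python
# Hypothetical sketch: embedding interleaved text + image with the new model.
# Assumes the google-genai SDK's existing embed_content pattern extends to
# multimodal inputs; the model ID and Part-based contents are assumptions.
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

with open("product_photo.jpg", "rb") as f:
    image_bytes = f.read()

result = client.models.embed_content(
    model="gemini-embedding-2",  # assumed model ID, not confirmed
    contents=[
        types.Part.from_text(text="Red trail-running shoe, size 42"),
        types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
    ],
)
print(len(result.embeddings[0].values))  # embedding dimensionality
```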
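And a sketch of what Matryoshka Representation Learning buys you on the client side: keep a prefix of the full vector and re-normalize, trading a little quality for much cheaper storage and search. The existing Gemini embedding API exposes a server-side output_dimensionality option in EmbedContentConfig; whether the new model uses the same knob is an assumption, and the dimension sizes below are illustrative.

```python
# Illustrative Matryoshka-style truncation: MRL-trained embeddings
# concentrate signal in the leading dimensions, so a prefix of the full
# vector stays usable for similarity search. Sizes are placeholders.
import numpy as np

def truncate_embedding(vec: list[float], dim: int) -> np.ndarray:
    """Keep the first `dim` components and L2-normalize for cosine similarity."""
    v = np.asarray(vec[:dim], dtype=np.float32)
    return v / np.linalg.norm(v)

full = np.random.randn(3072).tolist()   # stand-in for a full-size embedding
small = truncate_embedding(full, 768)   # compact version, ~4x less storage
print(small.shape)                      # (768,)
```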