
Vertex AI Multi-Modal

Foundation embedding model projecting text, images, and video into a shared space.

  • Vendor: Google (proprietary, multi-modal)
  • Model Type: Generic
  • Overall Rank: #3
  • Avg P@1: 42.8%
  • Avg mAP@10: 34.8%
  • Embed Dim: 1408
  • Input Res: N/A
  • Datasets: 8

About This Model

Overview

Google's Vertex AI Multimodal Embeddings model is a foundation embedding model that projects text, images, and video into a shared semantic space. The model exposes a multimodalembedding@001 endpoint that outputs 1,408-dimensional vectors for all supported modalities.
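
To make the interface concrete, here is a minimal sketch of calling the endpoint through the Vertex AI Python SDK (google-cloud-aiplatform); the project ID, region, and image path are placeholders, not values from this page:

```python
# Minimal sketch using the Vertex AI Python SDK.
# Project ID, region, and file path are placeholders.
import vertexai
from vertexai.vision_models import Image, MultiModalEmbeddingModel

vertexai.init(project="your-project-id", location="us-central1")

model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")
response = model.get_embeddings(
    image=Image.load_from_file("chair.jpg"),
    contextual_text="a red office chair",
)

# Image and text vectors share the same 1,408-dimensional space.
print(len(response.image_embedding))  # 1408
print(len(response.text_embedding))   # 1408
```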

Capabilities

The embeddings are designed for tasks such as:

  • Semantic search
  • Recommendation
  • Content moderation
  • Classification
  • Similarity-based retrieval across modalities

Both image and text embeddings share the same dimensionality and space, enabling cross-modal queries (e.g. text-to-image retrieval).
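
As an illustration of what the shared space buys you (our own sketch, not part of the model's API), a gallery of image embeddings can be ranked against a text query embedding by plain cosine similarity:

```python
# Illustrative text-to-image retrieval over precomputed embeddings.
# text_vec: (1408,) query vector; image_vecs: (N, 1408) gallery matrix.
import numpy as np

def rank_by_cosine(text_vec: np.ndarray, image_vecs: np.ndarray) -> np.ndarray:
    """Return gallery indices ordered from most to least similar."""
    q = text_vec / np.linalg.norm(text_vec)
    g = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
    return np.argsort(-(g @ q))
```

The same function covers image-to-image retrieval by passing an image embedding as the query, which is exactly the setting evaluated below.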

Evaluation Setup

In our study, we use only the image embedding pathway and evaluate the model in a pure image-to-image retrieval setting, in order to understand how a general-purpose multimodal model behaves on industrial, instance-level search tasks.
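
For reference, the headline metrics reported below can be computed as in the following sketch. It uses standard definitions (P@1 is the fraction of queries whose top-ranked item is relevant; mAP@10 is normalized by the number of relevant items reachable in the top 10), which may differ in detail from the benchmark's exact implementation:

```python
# Sketch of the reported metrics under standard definitions; the
# normalization used for mAP@10 is an assumption here.
import numpy as np

def precision_at_1(rel: np.ndarray) -> float:
    """rel: (num_queries, gallery_size) 0/1 relevance matrix, each row
    already sorted by descending similarity to its query."""
    return float(rel[:, 0].mean())

def map_at_10(rel: np.ndarray) -> float:
    aps = []
    for row in rel:
        top = row[:10]
        hit_ranks = np.flatnonzero(top)          # 0-based ranks of hits
        if hit_ranks.size == 0:
            aps.append(0.0)
            continue
        prec = np.cumsum(top)[hit_ranks] / (hit_ranks + 1)
        denom = min(int(row.sum()), 10)          # reachable relevant items
        aps.append(float(prec.sum() / denom))
    return float(np.mean(aps))
```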

Performance Across Datasets

Dataset                  | Category     | P@1    | P@5    | R@1    | R@5    | mAP@10
VPRC 2023                | Mixed Retail | 29.66% | 14.42% | 19.76% | 42.64% | 33.57%
Intercars                | Automotive   | 19.69% | 18.13% |  6.38% | 21.37% | 21.84%
Stanford Online Products | E-commerce   | 76.88% | 51.66% | 19.05% | 50.09% | 54.68%
IKEA                     | Furniture    | 52.29% | 33.26% | 15.04% | 36.71% | 36.93%
Hornbach                 | Hardware/DIY | 24.46% |  9.51% | 24.46% | 47.54% | 34.07%
ARaymond                 | Industrial   |  8.77% |  6.17% |  0.55% |  1.93% |  2.47%
Products-10K             | E-commerce   | 63.29% | 39.70% | 13.50% | 41.66% | 40.88%
TOPEX                    | Industrial   | 67.47% | 64.55% |  2.11% | 10.09% | 54.03%
Average                  |              | 42.81% | 29.68% | 12.61% | 31.50% | 34.81%