Your search says “contract termination clauses.” Your contract database has documents about “agreement dissolution provisions.” Without embeddings, those don’t match. With the right embedding model, they do.
Embeddings convert text into numerical vectors — lists of hundreds or thousands of decimal numbers that represent a piece of text’s meaning in high-dimensional space. Two semantically similar texts produce vectors that are close together. Two semantically unrelated texts produce vectors that are far apart.
This is what makes RAG systems work. A traditional keyword search requires the query to share words with the document. Semantic search, powered by embeddings, finds conceptually related content even when the words don’t match.
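To make "close together" concrete, here is a minimal sketch using cosine similarity, one common closeness measure (vector databases also offer others, such as dot product or Euclidean distance). The 4-dimensional vectors are made up for illustration; a real model would produce hundreds or thousands of values per text.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Made-up vectors, NOT output from any real embedding model.
query = [0.9, 0.1, 0.0, 0.3]          # stands in for "contract termination clauses"
similar_doc = [0.8, 0.2, 0.1, 0.4]    # stands in for "agreement dissolution provisions"
unrelated_doc = [0.0, 0.9, 0.8, 0.0]  # stands in for an unrelated document

score_similar = cosine_similarity(query, similar_doc)
score_unrelated = cosine_similarity(query, unrelated_doc)
print(f"similar: {score_similar:.2f}, unrelated: {score_unrelated:.2f}")
```

With these toy numbers the similar pair scores far higher than the unrelated pair, which is the property retrieval relies on.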
How the Process Works
- At index time: Each document chunk is passed through an embedding model, which produces a vector. The vector is stored in a vector database alongside the original text.
- At query time: The user’s query is passed through the same embedding model, producing a query vector.
- Matching: The vector database finds document vectors closest to the query vector. Those documents are retrieved and passed to the LLM as context.
The embedding model used at index time and at query time must be identical. Switch models mid-deployment and every stored vector becomes incompatible — requiring a full re-indexing of your knowledge base.
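The three steps and the same-model rule can be sketched in a few lines. Everything here is a stand-in: `toy_embed` is a hashed bag-of-words function rather than a real embedding model (it only matches shared words, not meaning), `TinyVectorIndex` is a hypothetical in-memory substitute for a vector database, and the model name is invented.

```python
import hashlib
import math

def toy_embed(text, dim=256):
    """Stand-in for an embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.sha256(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

class TinyVectorIndex:
    """Hypothetical in-memory stand-in for a vector database."""

    def __init__(self, model_name):
        self.model_name = model_name  # recorded so mismatched queries can be rejected
        self.rows = []                # (vector, original_text) pairs

    def add(self, vector, text):
        # Index time: store the vector alongside the original text.
        self.rows.append((vector, text))

    def search(self, query_vector, query_model, k=3):
        # Vectors from different models are not comparable, so refuse the query.
        if query_model != self.model_name:
            raise ValueError(
                f"index built with {self.model_name!r}, query uses {query_model!r}"
            )
        # Matching: rank stored vectors by dot product (cosine, since all are normalized).
        ranked = sorted(
            self.rows,
            key=lambda row: sum(a * b for a, b in zip(query_vector, row[0])),
            reverse=True,
        )
        return [text for _, text in ranked[:k]]

MODEL = "toy-embed-v1"  # invented name for this sketch
index = TinyVectorIndex(MODEL)
for chunk in [
    "contract termination clauses and notice periods",
    "office parking policy",
    "quarterly sales report",
]:
    index.add(toy_embed(chunk), chunk)

# Query time: embed the query with the same model, then search.
hits = index.search(toy_embed("termination clauses"), MODEL, k=1)
print(hits)
```

Recording the model name next to the vectors is a cheap guard: a query embedded with a different model fails loudly instead of silently returning poor matches.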
Model Options
| Model | Hosting | Cost | Dimensions | Notes |
|---|---|---|---|---|
| text-embedding-3-small (OpenAI) | API | $0.02/1M tokens | 1536 | Good default for most use cases |
| text-embedding-3-large (OpenAI) | API | $0.13/1M tokens | 3072 | Higher accuracy, about 6.5× more expensive |
| nomic-embed-text | Local (Ollama) | Free | 768 | Strong performer for local deployments |
| all-MiniLM-L6-v2 (Sentence-Transformers) | Local | Free | 384 | Smaller, fast, lower accuracy |
| mxbai-embed-large | Local (Ollama) | Free | 1024 | Competitive with OpenAI small on MTEB |
The MTEB (Massive Text Embedding Benchmark) leaderboard on Hugging Face is the most widely used public ranking of embedding model performance across retrieval, classification, and clustering tasks.
Dimensions and What They Mean
Every embedding model produces vectors of a fixed dimension — typically 384 to 3072 numbers per chunk. Dimensions roughly correspond to how much semantic nuance the model captures.
More dimensions generally mean:
- Better accuracy on semantically subtle queries
- More storage space per document chunk
- Slower similarity search at large scale
For most SMB deployments (< 1 million documents), dimension count has negligible impact on search latency. The accuracy difference between models matters more than the dimension count difference.
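The storage side of this trade-off is simple arithmetic. The sketch below assumes each value is stored as a 4-byte float32 and counts only the raw vectors; index structures, metadata, and the stored text add overhead that depends on the database.

```python
def vector_storage_bytes(num_chunks, dimensions, bytes_per_value=4):
    """Raw vector storage only; assumes float32 values (4 bytes each)."""
    return num_chunks * dimensions * bytes_per_value

models = [
    ("all-MiniLM-L6-v2", 384),
    ("text-embedding-3-small", 1536),
    ("text-embedding-3-large", 3072),
]
for name, dims in models:
    megabytes = vector_storage_bytes(1_000_000, dims) / 1_000_000
    print(f"{name}: {megabytes:,.0f} MB for 1M chunks")
```

Under that assumption, one million chunks need roughly 1.5 GB at 384 dimensions and roughly 12.3 GB at 3072, so storage grows linearly with dimension count.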
The Model Choice Is a Long-Term Decision
Switching embedding models requires re-embedding your entire knowledge base. If you embedded 50,000 documents with text-embedding-3-small, switching to nomic-embed-text means:
- Re-processing all 50,000 documents through the new model
- Replacing all stored vectors in your vector database
- Accepting that your retrieval behavior will change, for better or worse
For small knowledge bases (under 10,000 chunks), this is a few hours of work. For large ones, it’s a weekend project. Choose your embedding model at the start of a project, not after you’ve indexed everything.
Local vs API
API embeddings (OpenAI, Cohere, Google): Lower setup friction, high quality, cost scales with document count and query volume. At $0.02/million tokens for text-embedding-3-small, embedding a 10,000-page knowledge base is cheap: assuming roughly 500 to 1,000 tokens per page, that is 5 to 10 million tokens, or about $0.10 to $0.20 one-time. Ongoing query costs depend on traffic volume.
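A one-time indexing estimate is just total tokens times the per-token price. The sketch below uses the $0.02 per million tokens figure from the table; the tokens-per-page values are assumptions, and they drive the result, so measure your own documents before budgeting. Chunk overlap, if you use it, increases the token count further.

```python
def embedding_cost_usd(pages, tokens_per_page, price_per_million_tokens):
    """One-time cost of embedding a corpus, ignoring chunk overlap and retries."""
    total_tokens = pages * tokens_per_page
    return total_tokens / 1_000_000 * price_per_million_tokens

PRICE_SMALL = 0.02  # USD per 1M tokens, text-embedding-3-small (from the table above)

# tokens_per_page values are illustrative assumptions, not measurements.
for tokens_per_page in (500, 1_000, 2_000):
    cost = embedding_cost_usd(10_000, tokens_per_page, PRICE_SMALL)
    print(f"{tokens_per_page} tokens/page: ${cost:.2f}")
```

The same function works for query traffic: replace pages with queries per month and tokens per page with average query length.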
Local embeddings (via Ollama): Zero per-query cost after setup. nomic-embed-text running on a CPU server handles hundreds of thousands of embeddings per day for free. The cost is setup time and the hardware it runs on.
For GDPR or DPDP deployments where documents contain sensitive data, local embeddings keep the indexing pipeline on your own infrastructure. No document content leaves your network during indexing.
The Quality Problem with Mismatch
A common failure pattern: choosing a fast, cheap embedding model at the start and discovering months later that certain queries fail to retrieve relevant documents. The mismatch between how users phrase questions and how documents are written is too large for a small, lower-accuracy model to bridge.
Before committing to an embedding model, test it with the actual queries your users will ask against the actual documents you’ll retrieve from. A model that scores well on general benchmarks may perform poorly on your specific domain vocabulary.
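One way to run that test is to label a small set of real queries with the documents that should come back, then compare candidate models on recall@k. The sketch below is a minimal version of that idea; the queries, document IDs, and the two result sets are hypothetical placeholders for what your own retrieval runs would return.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        raise ValueError("each query needs at least one relevant document")
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def mean_recall_at_k(results, labels, k):
    """Average recall@k over every labeled query."""
    scores = [recall_at_k(results[query], labels[query], k) for query in labels]
    return sum(scores) / len(scores)

# Hypothetical ground truth: query -> IDs of documents that should be retrieved.
labels = {
    "contract termination clauses": ["doc-12"],
    "late payment penalties": ["doc-07", "doc-31"],
}
# Hypothetical ranked results from two candidate embedding models.
results_model_a = {
    "contract termination clauses": ["doc-12", "doc-03", "doc-44"],
    "late payment penalties": ["doc-07", "doc-02", "doc-31"],
}
results_model_b = {
    "contract termination clauses": ["doc-03", "doc-44", "doc-09"],
    "late payment penalties": ["doc-07", "doc-02", "doc-18"],
}

score_a = mean_recall_at_k(results_model_a, labels, k=3)
score_b = mean_recall_at_k(results_model_b, labels, k=3)
print(f"model A recall@3: {score_a:.2f}, model B recall@3: {score_b:.2f}")
```

Even a few dozen labeled queries drawn from real user questions usually say more about fit than a general benchmark score does.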
Related
- RAG — The retrieval pattern that embeddings enable
- Vector Databases — Where embeddings are stored and searched
- Ollama — Runs embedding models locally (nomic-embed-text, mxbai-embed-large)
- Data & Knowledge — The knowledge infrastructure layer
- Knowledge Base Decay — When the embedded documents go stale