Designing the Embedding Engine: The Quiet Powerhouse Behind LLM4S
September 25, 2025 · Embedding Engine
Most people praise the "intelligence" of large language models. Few talk about the invisible machinery that makes them useful in the real world. This article is about that machinery: specifically, the Embedding Engine I built for the LLM4S project as part of my Google Summer of Code 2025 work.
If LLMs are the brain, then embeddings are the memory architecture they rely on to understand, search, retrieve, and stay grounded.
No embeddings → no retrieval → no context → no real intelligence.
Below, I walk through the entire journey: why I built it, what it actually does, how it's architected, where it fits inside LLM4S, and what unexpected problems showed up along the way.
The Real Reason an Embedding Engine Is Necessary
I'll be blunt: most "RAG" systems built today fail not because of the LLM, not because of the database, but because their embedding layer is weak.
Here are the four truths I realized early:
Documents are ugly. PDFs break, tables collapse, DOCX files hide formatting landmines. You cannot feed this directly to an embedding API.
Providers don't behave consistently. They differ in vector dimensions, latency, token limits, rate limits, and error formats.
Chunking determines accuracy. The same LLM can perform terribly or brilliantly depending on how documents are segmented.
RAG is only as good as its embeddings. No amount of prompt engineering rescues bad vector representations.
If LLM4S wanted serious developers to use it for retrieval, search, assistants, or agent workflows, it needed a strong, configurable, modular, stable embedding engine, not a wrapper around API calls.
This became my first mission.
The Embedding Engine in One Sentence
A fully modular, provider-agnostic system that extracts real-world documents, cleans them, chunks them intelligently, embeds them using configurable providers, and returns normalized vectors ready for semantic retrieval.
Breaking it down:
Support for OpenAI, VoyageAI, and future providers
Processing for PDF, DOCX, TXT, and XLSX files, plus HTML URLs
Static model selection via .env
Chunking with natural boundaries
Vector normalization and dimensional consistency
Dual inputs: document embedding + query embedding
Cosine similarity scoring
Error-hardened retry, timeout, and rate-limit logic
Integration-ready output for pgvector, RAG, and agents
This wasn't just an API wrapper; it became a full pipeline.
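To make that concrete, here is a minimal sketch of what the provider abstraction and the .env-driven model selection can look like. The trait, class, and environment-variable names below are illustrative stand-ins, not the exact LLM4S API.

```scala
// Minimal sketch of a provider-agnostic embedding abstraction.
// Names, signatures, and env variables are illustrative, not the exact LLM4S API.
trait EmbeddingProvider {
  def name: String
  def dimensions: Int
  def embed(texts: Seq[String]): Either[String, Seq[Array[Float]]]
}

final class OpenAIEmbeddings(apiKey: String, model: String, val dimensions: Int)
    extends EmbeddingProvider {
  val name = "openai"
  def embed(texts: Seq[String]): Either[String, Seq[Array[Float]]] =
    Left(s"HTTP call to the OpenAI embeddings endpoint for model $model goes here") // placeholder
}

object EmbeddingProvider {
  // Static model selection via environment variables (for example, loaded from .env).
  def fromEnv(): Either[String, EmbeddingProvider] =
    sys.env.getOrElse("EMBEDDING_PROVIDER", "openai") match {
      case "openai" =>
        for {
          key  <- sys.env.get("OPENAI_API_KEY").toRight("OPENAI_API_KEY is not set")
          model = sys.env.getOrElse("OPENAI_EMBEDDING_MODEL", "text-embedding-3-small")
        } yield new OpenAIEmbeddings(key, model, dimensions = 1536)
      case other =>
        Left(s"Unknown embedding provider: $other")
    }
}
```

Swapping providers then becomes a configuration change rather than a code change, which is exactly the property the rest of the pipeline relies on.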
The Internal Architecture (with Diagram)
┌──────────────────────────────────────────┐
│               Input Layer                │
│       (.pdf / .docx / .txt / URL)        │
└────────────────────┬─────────────────────┘
                     │
                     ▼
┌──────────────────────────────────────────┐
│         UniversalExtractor.scala         │
│  - PDF parsing (multi-strategy)          │
│  - DOCX extraction                       │
│  - XLSX table flattening                 │
│  - HTML cleaning for URLs                │
│  - Noise removal & fallback logic        │
└────────────────────┬─────────────────────┘
                     │
                     ▼
┌──────────────────────────────────────────┐
│              Chunker Layer               │
│  - Sentence-aware segmentation           │
│  - Token-limit guard                     │
│  - Metadata tracking                     │
└────────────────────┬─────────────────────┘
                     │
                     ▼
┌──────────────────────────────────────────┐
│         Embedding Provider Layer         │
│        EmbeddingProvider (trait)         │
│          ├── OpenAIEmbeddings            │
│          └── VoyageAIEmbeddings          │
│  - Static model selection via config     │
│  - Retry & backoff                       │
│  - Dimension checking                    │
└────────────────────┬─────────────────────┘
                     │
                     ▼
┌──────────────────────────────────────────┐
│        Post-Processing + Scoring         │
│  - Vector normalization                  │
│  - Dual-input embeddings                 │
│  - Cosine similarity                     │
│  - Result packaging                      │
└────────────────────┬─────────────────────┘
                     │
                     ▼
┌──────────────────────────────────────────┐
│         Downstream Integrations          │
│  - pgvector storage                      │
│  - RAG retrieval                         │
│  - Agent context injection               │
└──────────────────────────────────────────┘
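In code, the stages in the diagram compose roughly as shown below. The chunker here is deliberately naive (a character budget instead of a real token counter), extractText stands in for UniversalExtractor, and the EmbeddingProvider trait comes from the earlier sketch; none of these are the exact LLM4S signatures.

```scala
// Illustrative composition of the diagram's stages; hypothetical names, not the LLM4S API.
final case class Chunk(text: String, index: Int)
final case class EmbeddedChunk(chunk: Chunk, vector: Array[Float])

object EmbeddingPipeline {

  // Naive sentence-aware chunking: split on sentence boundaries, then pack
  // sentences into chunks under a rough character budget (a real token guard
  // would count tokens with the provider's tokenizer instead).
  def chunk(text: String, maxChars: Int = 2000): Seq[Chunk] = {
    val sentences = text.split("(?<=[.!?])\\s+").toSeq
    val grouped = sentences.foldLeft(Vector(Vector.empty[String])) { (acc, s) =>
      if ((acc.last :+ s).mkString(" ").length <= maxChars) acc.init :+ (acc.last :+ s)
      else acc :+ Vector(s)
    }
    grouped.filter(_.nonEmpty).zipWithIndex.map { case (ss, i) => Chunk(ss.mkString(" "), i) }
  }

  // L2-normalize a vector so cosine similarity reduces to a dot product downstream.
  def normalize(v: Array[Float]): Array[Float] = {
    val norm = math.sqrt(v.map(x => x.toDouble * x).sum).toFloat
    if (norm == 0f) v else v.map(_ / norm)
  }

  // extractText stands in for UniversalExtractor (PDF/DOCX/XLSX/HTML handling).
  def run(extractText: () => Either[String, String],
          provider: EmbeddingProvider): Either[String, Seq[EmbeddedChunk]] =
    for {
      text    <- extractText()
      chunks   = chunk(text)
      vectors <- provider.embed(chunks.map(_.text))
      _       <- Either.cond(
                   vectors.forall(_.length == provider.dimensions),
                   (),
                   s"Unexpected embedding dimension from ${provider.name}")
    } yield chunks.zip(vectors.map(normalize)).map { case (c, v) => EmbeddedChunk(c, v) }
}
```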
The Embedding Engine Inside LLM4S
The Embedding Engine sits at the center of four major workflows:
1. Vector Database (pgvector)
Every embedded chunk is sent into the DB → searchable via vector similarity.
2. RAG (Retrieval Augmented Generation)
Queries use the same provider to generate a comparable embedding → retrieve top-k chunks → feed them into the LLM (see the retrieval sketch after this list).
3. Agent Context Management
Agents retrieve relevant memory from previous interactions using embeddings.
4. Multi-provider Flexibility
Developers can plug in OpenAI, VoyageAI, or future models (local CPU/GPU embedding models, sentence transformers, etc.).
Essentially, nothing downstream works reliably without this engine upstream.
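To illustrate the retrieval path concretely: the query is embedded with the same provider that embedded the documents, scored against the stored chunk vectors with cosine similarity, and the top-k matches are returned. This sketch reuses the hypothetical EmbeddedChunk and EmbeddingProvider types from the earlier snippets and is not the exact LLM4S API.

```scala
// Cosine-similarity retrieval over embedded chunks; illustrative only.
object Retrieval {

  def cosine(a: Array[Float], b: Array[Float]): Double = {
    require(a.length == b.length, "vectors must have the same dimension")
    var dot = 0.0; var na = 0.0; var nb = 0.0
    var i = 0
    while (i < a.length) {
      dot += a(i) * b(i); na += a(i) * a(i); nb += b(i) * b(i); i += 1
    }
    if (na == 0.0 || nb == 0.0) 0.0 else dot / (math.sqrt(na) * math.sqrt(nb))
  }

  // Embed the query with the same provider used for the documents, then rank chunks.
  def topK(query: String,
           store: Seq[EmbeddedChunk],
           provider: EmbeddingProvider,
           k: Int = 5): Either[String, Seq[(EmbeddedChunk, Double)]] =
    provider.embed(Seq(query)).map { vectors =>
      val queryVector = vectors.head
      store.map(ec => ec -> cosine(ec.vector, queryVector)).sortBy(-_._2).take(k)
    }
}
```

The top-k chunks are then handed to the LLM as context (RAG) or stored in pgvector, where the same cosine scoring can happen inside the database.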
Key Challenges (and what they taught me)
1. Long documents crashed providers
Solution: multi-stage chunking + preventive token checks.
2. PDFs behaved differently depending on their origin
Solution: layered extractor with PDF → OCR → HTML fallback.
3. Providers returned inconsistent vector dimensions
Solution: dimension enforcement + normalization + registry.
4. Costs exploded during early tests
Solution: caching & batching (sketched after this list).
5. Functional error handling in Scala was hard
Solution: outcome modeling (like ZIO's Either) to keep the API sane (sketched below).
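Here is roughly what that outcome modeling looks like: a small error type plus Either-returning calls, wrapped in a retry-with-backoff helper. The error cases and the backoff policy are illustrative assumptions, not LLM4S's actual ones.

```scala
// Sketch of Either-based error modeling with retry/backoff around provider calls.
// Error cases and backoff values are illustrative, not the exact LLM4S design.
sealed trait EmbeddingError
object EmbeddingError {
  final case class RateLimited(retryAfterMs: Long) extends EmbeddingError
  final case class Timeout(ms: Long)               extends EmbeddingError
  final case class ProviderError(message: String)  extends EmbeddingError
}

object Retry {
  import EmbeddingError._

  @annotation.tailrec
  def withBackoff[A](attemptsLeft: Int, delayMs: Long)(
      call: () => Either[EmbeddingError, A]): Either[EmbeddingError, A] =
    call() match {
      case Left(RateLimited(_)) | Left(Timeout(_)) if attemptsLeft > 1 =>
        Thread.sleep(delayMs) // crude, blocking backoff; good enough for a sketch
        withBackoff(attemptsLeft - 1, delayMs * 2)(call)
      case other => other
    }
}
```

Every failure mode is a value rather than an exception, so callers can pattern-match on what went wrong instead of guessing from stack traces.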
Every issue forced the architecture to become cleaner.
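The caching and batching from challenge 4 can be sketched under the same assumptions: a wrapper provider that batches cache misses and memoizes results, so repeated test runs stop re-embedding (and re-billing) identical text. Names and the batching policy are illustrative, not the real LLM4S implementation.

```scala
// Illustrative in-memory cache plus request batching around any provider.
import scala.collection.mutable

final class CachingBatchingProvider(underlying: EmbeddingProvider, batchSize: Int = 64)
    extends EmbeddingProvider {
  def name: String    = underlying.name
  def dimensions: Int = underlying.dimensions

  private val cache = mutable.Map.empty[String, Array[Float]]

  def embed(texts: Seq[String]): Either[String, Seq[Array[Float]]] = {
    val missing = texts.distinct.filterNot(cache.contains)
    // Embed cache misses in batches to cut per-request overhead and cost.
    val fetched = missing.grouped(batchSize).foldLeft(Right(()): Either[String, Unit]) {
      (acc, batch) =>
        acc.flatMap { _ =>
          underlying.embed(batch).map { vectors =>
            batch.zip(vectors).foreach { case (text, vector) => cache(text) = vector }
          }
        }
    }
    fetched.map(_ => texts.map(cache))
  }
}
```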
What This Engine Unlocks (the real impact)
Scalable semantic search
Clean integration with pgvector
Fast RAG pipelines
Multi-step agent workflows
Structured conversation memory
Enterprise-ready document ingestion
This is the foundation upon which every intelligent feature of LLM4S depends.
References
Project (repo root):
[GitHub: LLM4S repository]
Embedding Engine code:
[GitHub: embeddings module path]
UniversalExtractor.scala:
[GitHub: file link]
PgVector integration (teaser for Article 2):
[GitHub: pgvector module path]
Key PRs: