Gopi Trinadh Maddikunta

Copyright © 2025 GT Groups.
All rights reserved.

Designing the Embedding Engine: The Quiet Powerhouse Behind LLM4S

📅 September 25, 2025 – Embedding Engine

Most people praise the “intelligence” of large language models. Few talk about the invisible machinery that makes them useful in the real world. This article is about that machinery — specifically, the Embedding Engine I built for the LLM4S project as part of my Google Summer of Code 2025 work.

If LLMs are the brain, then embeddings are the memory architecture they rely on to understand, search, retrieve, and stay grounded.
No embeddings → no retrieval → no context → no real intelligence.

Below, I walk through the entire journey: why I built it, what it actually does, how it’s architected, where it fits inside LLM4S, and what unexpected problems showed up along the way.

The Real Reason an Embedding Engine Is Necessary

I’ll be blunt: most “RAG” systems built today fail not because of the LLM, not because of the database, but because their embedding layer is weak.

Here are the four truths I realized early:

Documents are ugly. PDFs break, tables collapse, DOCX files hide formatting landmines. You cannot feed this directly to an embedding API.

Providers don’t behave consistently. Different vector dimensions, latency, token limits, rate limits, and inconsistent errors.

Chunking determines accuracy. The same LLM can perform terribly or brilliantly depending on how documents are segmented.

RAG is only as good as its embeddings. No amount of prompt engineering rescues bad vector representations.

If LLM4S wanted serious developers to use it for retrieval, search, assistants, or agent workflows, it needed a strong, configurable, modular, stable embedding engine — not a wrapper around API calls.

This became my first mission.

The Embedding Engine in One Sentence

A fully modular, provider-agnostic system that extracts real-world documents, cleans them, chunks them intelligently, embeds them using configurable providers, and returns normalized vectors ready for semantic retrieval.

Breaking it down:

Support for OpenAI, VoyageAI, and future providers

Processing for PDF, DOCX, TXT, and XLSX files, plus HTML content fetched from URLs

Static model selection via .env (sketched below)

Chunking with natural boundaries

Vector normalization and dimensional consistency

Dual inputs: document embedding + query embedding

Cosine similarity scoring

Error-hardened retry, timeout, and rate-limit logic

Integration-ready output for pgvector, RAG, and agents

This wasn’t just an API wrapper — it became a full pipeline.
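
To make the “static model selection via .env” item concrete, here is a minimal sketch of configuration loading. The variable names EMBEDDING_PROVIDER and EMBEDDING_MODEL are assumptions for illustration, not necessarily the exact keys LLM4S reads.

```scala
// Minimal sketch of ".env"-style static model selection (illustrative only).
final case class EmbeddingConfig(provider: String, model: String)

object EmbeddingConfig {
  def fromEnv(): Either[String, EmbeddingConfig] =
    for {
      provider <- sys.env.get("EMBEDDING_PROVIDER").toRight("EMBEDDING_PROVIDER is not set")
      model    <- sys.env.get("EMBEDDING_MODEL").toRight("EMBEDDING_MODEL is not set")
    } yield EmbeddingConfig(provider, model)
}
```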

The Internal Architecture (with Diagram)

┌──────────────────────────────────────────┐
│               Input Layer                │
│       (.pdf / .docx / .txt / URL)        │
└─────────────────────┬────────────────────┘
                      │
                      ▼
┌──────────────────────────────────────────┐
│         UniversalExtractor.scala         │
│  - PDF parsing (multi-strategy)          │
│  - DOCX extraction                       │
│  - XLSX table flattening                 │
│  - HTML cleaning for URLs                │
│  - Noise removal & fallback logic        │
└─────────────────────┬────────────────────┘
                      │
                      ▼
┌──────────────────────────────────────────┐
│              Chunker Layer               │
│  - Sentence-aware segmentation           │
│  - Token-limit guard                     │
│  - Metadata tracking                     │
└─────────────────────┬────────────────────┘
                      │
                      ▼
┌──────────────────────────────────────────┐
│         Embedding Provider Layer         │
│  EmbeddingProvider (trait)               │
│  ├── OpenAIEmbeddings                    │
│  └── VoyageAIEmbeddings                  │
│  - Static model selection via config     │
│  - Retry & backoff                       │
│  - Dimension checking                    │
└─────────────────────┬────────────────────┘
                      │
                      ▼
┌──────────────────────────────────────────┐
│        Post-Processing + Scoring         │
│  - Vector normalization                  │
│  - Dual-input embeddings                 │
│  - Cosine similarity                     │
│  - Result packaging                      │
└─────────────────────┬────────────────────┘
                      │
                      ▼
┌──────────────────────────────────────────┐
│         Downstream Integrations          │
│  - pgvector storage                      │
│  - RAG retrieval                         │
│  - Agent context injection               │
└──────────────────────────────────────────┘
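
To make the provider layer concrete, here is a rough sketch of what that abstraction can look like. The trait name EmbeddingProvider matches the diagram above; the method signatures, the error type, and the toy in-memory implementation are illustrative assumptions, not the actual LLM4S source.

```scala
// Illustrative shape of the provider abstraction; not the actual LLM4S code.
final case class EmbeddingError(message: String, retryable: Boolean = false)

trait EmbeddingProvider {
  def model: String
  def dimensions: Int
  def embed(texts: Seq[String]): Either[EmbeddingError, Seq[Vector[Double]]]
}

// Toy provider that only demonstrates the contract; the real OpenAIEmbeddings and
// VoyageAIEmbeddings call their HTTP APIs with retry, backoff, and dimension checks.
final class InMemoryProvider(val model: String, val dimensions: Int) extends EmbeddingProvider {
  def embed(texts: Seq[String]): Either[EmbeddingError, Seq[Vector[Double]]] =
    Right(texts.map(t => Vector.tabulate(dimensions)(i => ((t.hashCode.abs + i) % 97).toDouble / 97.0)))
}
```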

The Embedding Engine Inside LLM4S

The Embedding Engine sits at the center of four major workflows:

1. Vector Database (pgvector)

Every embedded chunk is sent into the DB → searchable via vector similarity.
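
A hedged sketch of that hand-off over plain JDBC, assuming a hypothetical chunks table with a pgvector embedding column:

```scala
// Sketch of storing one embedded chunk in Postgres + pgvector over JDBC.
// Assumed schema: CREATE TABLE chunks (id SERIAL PRIMARY KEY, content TEXT, embedding VECTOR(1536));
import java.sql.DriverManager

object PgVectorStore {
  def storeChunk(jdbcUrl: String, content: String, embedding: Vector[Double]): Unit = {
    val conn = DriverManager.getConnection(jdbcUrl)
    try {
      val stmt = conn.prepareStatement("INSERT INTO chunks (content, embedding) VALUES (?, ?::vector)")
      stmt.setString(1, content)
      stmt.setString(2, embedding.mkString("[", ",", "]")) // pgvector accepts the [x,y,...] text literal
      stmt.executeUpdate()
    } finally conn.close()
  }
}
```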

2. RAG (Retrieval Augmented Generation)

Queries use the same provider to generate a comparable embedding → retrieve top-k chunks → feed them into the LLM.
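
For intuition, here is a small sketch of the scoring step once query and chunk vectors exist. The names are illustrative, and in practice the similarity search usually runs inside pgvector itself.

```scala
// Sketch of top-k retrieval by cosine similarity over already-embedded chunks.
object SemanticSearch {
  def cosine(a: Vector[Double], b: Vector[Double]): Double = {
    require(a.length == b.length, "dimension mismatch")
    val dot   = a.iterator.zip(b.iterator).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }

  // Score every stored chunk against the query vector and keep the k best matches.
  def topK(query: Vector[Double], chunks: Seq[(String, Vector[Double])], k: Int): Seq[(String, Double)] =
    chunks
      .map { case (text, vec) => text -> cosine(query, vec) }
      .sortBy { case (_, score) => -score }
      .take(k)
}
```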

3. Agent Context Management

Agents retrieve relevant memory from previous interactions using embeddings.

4. Multi-provider Flexibility

Developers can plug in OpenAI, VoyageAI, or future models (local CPU/GPU embedding models, sentence transformers, etc.).

Essentially, nothing downstream works reliably without this engine upstream.

Key Challenges (and what they taught me)
1. Long documents crashed providers

Solution: multi-stage chunking + preventive token checks.
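
A simplified sketch of that idea, using a whitespace word count as a stand-in for real provider tokenization:

```scala
// Sentence-aware chunking with a hard token-budget guard (word count approximates tokens).
object Chunker {
  def chunk(text: String, maxTokens: Int): Vector[String] = {
    val sentences = text.split("(?<=[.!?])\\s+").toVector.filter(_.nonEmpty)
    sentences.foldLeft(Vector.empty[String]) { (chunks, sentence) =>
      chunks.lastOption match {
        // Grow the current chunk while the combined size stays under the budget...
        case Some(last) if (last + " " + sentence).split("\\s+").length <= maxTokens =>
          chunks.init :+ (last + " " + sentence)
        // ...otherwise start a new chunk at a natural sentence boundary.
        case _ =>
          chunks :+ sentence
      }
    }
  }
}
```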

2. PDFs behaved differently depending on their origin

Solution: layered extractor with PDF → OCR → HTML fallback.
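
The fallback chain can be sketched roughly like this, with placeholder strategies standing in for the real PDF, OCR, and HTML paths:

```scala
// Layered extractor sketch: try strategies in order, keep the first non-empty result.
object FallbackExtractor {
  type Strategy = Array[Byte] => Either[String, String]

  def extract(bytes: Array[Byte], strategies: List[Strategy]): Either[String, String] =
    strategies.foldLeft[Either[String, String]](Left("no extraction strategy succeeded")) {
      case (ok @ Right(text), _) if text.trim.nonEmpty => ok // an earlier strategy already worked
      case (_, next)                                   => next(bytes)
    }
}
```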

3. Providers returned inconsistent vector dimensions

Solution: dimension enforcement + normalization + registry.
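
A minimal sketch of that guard, assuming L2 normalization and a per-model expected dimension:

```scala
// Enforce the expected dimension and L2-normalize so scores are comparable across providers.
object VectorPostProcessing {
  def normalize(v: Vector[Double], expectedDim: Int): Either[String, Vector[Double]] =
    if (v.length != expectedDim)
      Left(s"expected $expectedDim dimensions, got ${v.length}")
    else {
      val norm = math.sqrt(v.map(x => x * x).sum)
      if (norm == 0.0) Left("cannot normalize a zero vector")
      else Right(v.map(_ / norm))
    }
}
```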

4. Costs exploded during early tests

Solution: caching & batching.
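
One way to picture both controls together; the in-memory cache and the batch size of 64 are illustrative simplifications:

```scala
import scala.collection.mutable

// Cost-control sketch: batch texts per provider call and skip re-embedding identical chunks.
final class CachingEmbedder(embedBatch: Seq[String] => Seq[Vector[Double]], batchSize: Int = 64) {
  private val cache = mutable.Map.empty[String, Vector[Double]]

  def embedAll(texts: Seq[String]): Seq[Vector[Double]] = {
    val missing = texts.distinct.filterNot(cache.contains)
    missing.grouped(batchSize).foreach { group =>
      group.zip(embedBatch(group)).foreach { case (text, vec) => cache.update(text, vec) }
    }
    texts.map(cache) // every text now has a cached vector
  }
}
```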

5. Functional error handling in Scala was hard

Solution: modeling outcomes as values (Either-style results, in the spirit of ZIO) to keep the API sane.
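
A rough sketch of what outcomes-as-values means here; the failure ADT is an assumption for illustration, not the actual LLM4S error hierarchy:

```scala
// Failures become ordinary values, so callers can pattern match, retry, or report them
// without exceptions leaking through the API.
sealed trait EmbeddingFailure
final case class RateLimited(retryAfterMs: Long) extends EmbeddingFailure
final case class ProviderError(message: String)  extends EmbeddingFailure

object SafeEmbedding {
  def embedSafely(text: String, call: String => Vector[Double]): Either[EmbeddingFailure, Vector[Double]] =
    try Right(call(text))
    catch { case scala.util.control.NonFatal(e) => Left(ProviderError(e.getMessage)) }
}
```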

Every issue forced the architecture to become cleaner.

What This Engine Unlocks (the real impact)

Scalable semantic search

Clean integration with pgvector

Fast RAG pipelines

Multi-step agent workflows

Structured conversation memory

Enterprise-ready document ingestion

This is the foundation upon which every intelligent feature of LLM4S depends.
