Gopi Trinadh Maddikunta

Copyright © 2025 GT Groups. All rights reserved.

🌍 Embedding Anything: Real-World Progress in LLM4S (GSoC 2025, Phase 1)

🧩 TL;DR

This blog shares my GSoC 2025 Phase 1 milestone progress with the Scala Center. I implemented OpenAI and VoyageAI embedding support and built a universal document extractor that reads PDFs, DOCX, HTML, and TXT files. This is part of a larger goal: creating a modular, end-to-end RAG (Retrieval-Augmented Generation) system in Scala.

🎯 Introduction: GSoC, LLM4S, and Why This Matters

As a GSoC 2025 contributor under the Scala Center, I'm working on the LLM4S project. My main goal is to bring embedding support to LLMClient, allowing developers to convert any document into vector representations, a key building block for AI applications like:

  • Semantic Search

  • Q&A over custom data

  • Document classification

  • Retrieval-Augmented Generation (RAG)

Why should you care? Because while RAG systems are popular in Python, Scala developers are left behind. With this work, we're bridging that gap and making embeddings, document parsing, and provider switching dead simple in a typed, functional Scala ecosystem.

❓ Why This Milestone Matters

💡 Following mentor advice: explain why/how/what clearly.

✅ Why

Embeddings are the foundation of intelligent retrieval. Without them, your LLM is just guessing. But embedding documents takes work: parsing them, sending them to APIs, handling models, configs, and responses. My work makes that accessible in just a few lines.

🛠️ How

By introducing:

  • A generic interface for any embedding provider (OpenAI, VoyageAI… more coming)

  • A universal extractor for real-world file formats (PDF, DOCX, HTML, TXT)

  • A configuration-first design: swap models or providers with no code changes (see the sketch below)
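
To make the configuration-first idea concrete, here is a minimal sketch of how provider and model selection could be driven purely by configuration. The setting keys (EMBEDDING_PROVIDER, EMBEDDING_MODEL) and the lookup order are illustrative assumptions, not the exact keys LLM4S uses.

// Minimal sketch of config-first selection. The keys and defaults below are
// assumptions for illustration, not the exact configuration used in LLM4S.
object EmbeddingConfig {
  // Prefer system properties, fall back to environment variables (.env).
  private def setting(key: String, default: String): String =
    sys.props.get(key).orElse(sys.env.get(key)).getOrElse(default)

  val provider: String = setting("EMBEDDING_PROVIDER", "openai")              // or "voyage"
  val model: String    = setting("EMBEDDING_MODEL", "text-embedding-ada-002") // or "voyage-large-2"
}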

🔍 What

Two major PRs:

  • PR #92: Embedding support with OpenAI & VoyageAI

  • PR #100: Universal document extractor

Both are tested, extensible, and designed to fit cleanly into the larger LLMClient ecosystem.

🔧 PR #1: Embedding Support via OpenAI & VoyageAI

🧩 Highlights:

  • EmbeddingProvider trait defines a common interface

  • OpenAIEmbeddingProvider and VoyageAIEmbeddingProvider follow it

  • Configuration handled through .env or system props

  • Supports model switching (text-embedding-ada-002, voyage-large-2)
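
Here is a rough sketch of what that common interface might look like. The trait and provider names come from the PR, but the embed signature, the Either-based error handling, and the dummy implementation are assumptions for illustration.

// Sketch of the common embedding interface. EmbeddingProvider,
// OpenAIEmbeddingProvider and VoyageAIEmbeddingProvider are the names from
// PR #92; the embed signature and error handling below are assumptions.
trait EmbeddingProvider {
  def embed(texts: Seq[String], model: String): Either[String, Seq[Array[Float]]]
}

// A stand-in implementation so the sketch compiles on its own; the real
// providers call the OpenAI and VoyageAI HTTP APIs with the configured key.
final class DummyEmbeddingProvider extends EmbeddingProvider {
  def embed(texts: Seq[String], model: String): Either[String, Seq[Array[Float]]] =
    Right(texts.map(t => Array.fill(8)(t.length.toFloat)))
}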

⚙️ Usage:

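Usage might then look roughly like this, building on the sketch above. In the real API the provider and model come from configuration rather than being hard-coded, and the result handling is an assumption.

// Hypothetical usage, building on the EmbeddingProvider sketch above.
val provider: EmbeddingProvider = new DummyEmbeddingProvider // stand-in for OpenAIEmbeddingProvider / VoyageAIEmbeddingProvider
val model = "text-embedding-ada-002"                         // or "voyage-large-2"

provider.embed(Seq("Embeddings from Scala!"), model) match {
  case Right(vectors) => println(s"Got ${vectors.size} vector(s) of dimension ${vectors.head.length}")
  case Left(error)    => println(s"Embedding failed: $error")
}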


📂 PR #2: Universal Extractor for Real-World Documents

🧠 Supported Formats:

  • .txt – via standard file reader

  • .pdf – using Apache PDFBox

  • .docx – via Apache POI

  • .html – with JSoup

🧰 Design:

  • FileExtractor trait defines the contract

  • Each format has its own implementation

  • UniversalExtractor dispatches based on file extension
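
A rough sketch of that design follows, assuming a simple Try-based contract. FileExtractor and UniversalExtractor are the names from the PR; the extract signature, TxtExtractor, and the commented-out dispatch cases are illustrative assumptions.

import java.nio.file.{Files, Path}
import scala.util.{Failure, Try}

// Sketch of the extractor design from PR #100; signatures are assumptions.
trait FileExtractor {
  def extract(path: Path): Try[String]
}

// Plain-text extraction: the simplest of the per-format implementations.
object TxtExtractor extends FileExtractor {
  def extract(path: Path): Try[String] = Try(Files.readString(path))
}

// UniversalExtractor picks an implementation based on the file extension.
object UniversalExtractor {
  def extract(path: Path): Try[String] =
    path.getFileName.toString.toLowerCase match {
      case n if n.endsWith(".txt")  => TxtExtractor.extract(path)
      // case n if n.endsWith(".pdf")  => PdfExtractor.extract(path)   // Apache PDFBox
      // case n if n.endsWith(".docx") => DocxExtractor.extract(path)  // Apache POI
      // case n if n.endsWith(".html") => HtmlExtractor.extract(path)  // JSoup
      case other => Failure(new IllegalArgumentException(s"Unsupported file type: $other"))
    }
}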

⚙️ Usage:

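Usage then comes down to a single call (hypothetical, building on the UniversalExtractor sketched above):

import java.nio.file.Paths
import scala.util.{Failure, Success}

// Hypothetical call, building on the UniversalExtractor sketch above.
UniversalExtractor.extract(Paths.get("notes.txt")) match {
  case Success(text) => println(text.take(200)) // preview the first 200 extracted characters
  case Failure(err)  => println(s"Extraction failed: ${err.getMessage}")
}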

🧭 RAG Is the Destination

These modules are part of a larger pipeline that looks like this:

  • 🧩 Extraction: ✅ Universal Extractor

  • 🔒 Embedding: ✅ OpenAI & Voyage

  • 📦 Storage: 🔜 FAISS Integration

  • 🔍 Retrieval: 🔜 Similarity Scoring

  • 🧠 Generation: 🔜 LLM Response
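
To show how these layers are meant to chain together, here is a forward-looking sketch. Only the extraction and embedding layers exist today; every name and signature below is a placeholder rather than real LLM4S code.

import java.nio.file.Path

// Forward-looking outline of the planned RAG flow; all names are placeholders.
object RagPipelineSketch {
  def extract(doc: Path): String                        = ??? // ✅ UniversalExtractor (PR #100)
  def embed(text: String): Array[Float]                 = ??? // ✅ EmbeddingProvider (PR #92)
  def store(vector: Array[Float]): Unit                 = ??? // 🔜 FAISS integration
  def retrieve(queryVec: Array[Float]): Seq[String]     = ??? // 🔜 similarity scoring
  def generate(query: String, ctx: Seq[String]): String = ??? // 🔜 LLM response

  // The eventual end-to-end flow once every layer is in place:
  def answer(doc: Path, query: String): String = {
    store(embed(extract(doc)))
    generate(query, retrieve(embed(query)))
  }
}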


🧪 Testing Philosophy

🗒️ Per mentor advice: focus on what matters to the reader and skip trivial unit-test details.

Each extractor is tested with real sample files. Providers are tested using mocked or live API responses. No hardcoded paths. Everything is file-safe, clean, and config-driven.


💬 Mentor Feedback (and How It Helped)

Thanks to my mentor Rory Graves, both this blog and the implementation behind it gained better structure, clarity, and real-world usefulness. His suggestions led me to:

  • Frame my work using Why/How/What

  • Remove redundant detail that didn’t help readers

  • Focus on meaningful contribution, not just implementation

  • Think about the reader’s perspective, not just my own


🔗 Useful Links


📣 Wrapping Up

This phase was about making it real. I wrote code that:

  • Embeds any text via two providers

  • Parses real documents automatically

  • Sets the stage for full RAG support in Scala

"Thanks for following along. The code is live, and I'm just getting started."