🚀 Diving Into LLMClient: Starting My GSoC 2025 Journey with Embedding Support

🧩 TL;DR
In this post, I share my early experience contributing to the LLMClient project under the Scala Center as a GSoC 2025 contributor. I walk through my onboarding journey, repo exploration, technical roadblocks, and initial contributions. This work lays the groundwork for a fully modular RAG system in Scala, and I’m excited to share what’s ahead.
🚀 Introduction: A Summer of Code and Curiosity
Hi, I’m Gopi Trinadh Maddikunta — a graduate student in Engineering Data Science at the University of Houston, currently contributing to Google Summer of Code 2025 with the Scala Center.
My GSoC project is focused on enabling embedding support in LLMClient, a library designed for working with large language models in Scala. Embeddings are the bedrock of intelligent retrieval systems like semantic search, document Q&A, and RAG (Retrieval-Augmented Generation).
This post covers my first few weeks: understanding the codebase, breaking things (and fixing them), and setting the stage for meaningful contributions.
🧠 The Mission: Real-World Embeddings in Scala
The core problem is simple: LLMs don’t remember what they haven’t seen. Embeddings give them a way to retrieve relevant context from external documents. My project focuses on:
- Supporting OpenAI and VoyageAI embedding providers
- Handling real-world input formats: `.txt`, `.pdf`, `.docx`, `.html`
- Enabling configuration-driven pipelines
- Setting up for integration with vector databases like FAISS
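To make the “configuration-driven” part concrete, here’s a minimal sketch of what a provider abstraction can look like. `EmbeddingProvider` is the scaffold I mention later in this post; everything else here is illustrative and not the actual llm4s API:

```scala
// A minimal sketch of a provider abstraction. EmbeddingProvider is the
// scaffold mentioned later in this post; the rest is illustrative and
// not the actual llm4s API.
final case class EmbeddingError(message: String)

trait EmbeddingProvider {
  def embed(texts: Seq[String]): Either[EmbeddingError, Seq[Vector[Double]]]
}

// Stubs standing in for the real OpenAI / VoyageAI clients.
final class OpenAIProvider extends EmbeddingProvider {
  def embed(texts: Seq[String]): Either[EmbeddingError, Seq[Vector[Double]]] =
    Right(texts.map(_ => Vector(0.0))) // a real impl would call the OpenAI API
}

final class VoyageAIProvider extends EmbeddingProvider {
  def embed(texts: Seq[String]): Either[EmbeddingError, Seq[Vector[Double]]] =
    Right(texts.map(_ => Vector(0.0))) // a real impl would call the VoyageAI API
}

// Configuration-driven selection: swap providers without touching call sites.
def providerFor(name: String): Either[EmbeddingError, EmbeddingProvider] =
  name.toLowerCase match {
    case "openai"   => Right(new OpenAIProvider)
    case "voyageai" => Right(new VoyageAIProvider)
    case other      => Left(EmbeddingError(s"Unknown provider: $other"))
  }
```

Returning `Either` instead of throwing keeps provider failures explicit at every call site.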
🌍 A Bigger Vision: Building a RAG System
Although my GSoC focus is on embedding support, the long-term plan is to develop a modular Retrieval-Augmented Generation (RAG) system using LLMClient. Here’s the vision:
1. Document Extraction → Parse PDFs, DOCX, Web
2. Embedding Generation → Use provider APIs to vectorize content
3. Retrieval Engine → Index and search using FAISS
4. Augmented Prompting → Feed top-k results into an LLM
5. Answer Generation → Stream final responses
Every part of the pipeline will be configurable, extensible, and reusable across backends.
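Here’s a rough Scala sketch of how those five stages could compose. All of these names are hypothetical; they just mirror the pipeline above, with each stage behind its own trait so backends stay swappable:

```scala
// Hypothetical stage interfaces mirroring the pipeline above; none of
// these names come from llm4s itself.
final case class Chunk(text: String, source: String)
final case class Scored(chunk: Chunk, score: Double)

trait Extractor { def extract(path: String): Seq[Chunk] }              // 1. extraction
trait Embedder  { def embed(texts: Seq[String]): Seq[Vector[Double]] } // 2. embedding
trait VectorIndex {                                                    // 3. retrieval
  def add(chunks: Seq[Chunk], vectors: Seq[Vector[Double]]): Unit
  def search(query: Vector[Double], k: Int): Seq[Scored]
}
trait Generator {                                                      // 4–5. prompt + answer
  def answer(question: String, context: Seq[Chunk]): String
}

// Wiring the stages together end to end.
def ask(question: String, docPath: String,
        extractor: Extractor, embedder: Embedder,
        index: VectorIndex, generator: Generator): String = {
  val chunks = extractor.extract(docPath)
  index.add(chunks, embedder.embed(chunks.map(_.text)))
  val top = index.search(embedder.embed(Seq(question)).head, k = 5)
  generator.answer(question, top.map(_.chunk))
}
```

Because each stage is a trait, a FAISS-backed `VectorIndex` or a different embedding provider can be dropped in without touching the rest of the pipeline.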
🧭 Exploring the LLM4S Codebase
When I first opened the `llm4s` repo, I was overwhelmed. Scala wasn’t my daily driver, and the repo had layers. So I broke things down:
📦 `llmconnect/` – Core embedding logic (providers, client)
🧪 `samples/` – Run-ready code examples
🔐 `config/` – `.env` loading, model keys
📄 `model/` – Request/Response case classes
To get up to speed, I took the Rock the JVM: Scala at Light Speed course. That gave me just enough to stop being afraid of `Either`, `case class`, and for-comprehensions.
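For anyone else coming from outside Scala, here’s the kind of pattern that course demystified (plain Scala, nothing llm4s-specific):

```scala
// Case classes model data, Either models failure, and a for-comprehension
// chains the happy path while short-circuiting on the first Left.
final case class ApiKey(value: String)

def readKey(env: Map[String, String]): Either[String, ApiKey] =
  env.get("OPENAI_API_KEY").map(ApiKey.apply).toRight("OPENAI_API_KEY not set")

def validate(key: ApiKey): Either[String, ApiKey] =
  if (key.value.nonEmpty) Right(key) else Left("API key is empty")

val key: Either[String, ApiKey] =
  for {
    raw     <- readKey(sys.env)
    checked <- validate(raw)
  } yield checked
```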
🪓 Git, SBT, and Growing Pains
Before writing code, I had to get the project working. Sounds easy. It wasn’t:
❌ SBT version mismatches (Scala 2.13 vs 3.7)
❌ Environment variables not detected
❌ `sample.pdf` not found → hours lost to a file path
❌ Git merge issues from syncing `upstream/main`
But I logged everything I broke and fixed. That doc now lives next to me every day.
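For example, the version-mismatch fix boiled down to pinning a single Scala version in the build. The numbers below are illustrative, not the project’s actual settings:

```scala
// build.sbt -- pin one Scala version so every module agrees.
// (Versions shown are illustrative, not llm4s's actual settings.)
ThisBuild / scalaVersion := "3.7.1"

// project/build.properties pins sbt itself, e.g.:
//   sbt.version=1.10.7
```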
✅ First Wins: From Config to Vector
The best moment? Watching my config-driven code hit the OpenAI endpoint and return a valid embedding vector.
In short:
✅ Loaded `.env` config with API key
✅ Created a working `EmbeddingProvider` scaffold
✅ Printed JSON output from OpenAI successfully
Now I can toggle between providers, pass sample text, and extract embeddings from live APIs.
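Put together, the “config to vector” flow looks roughly like this. `loadEnv` is a hypothetical helper, and `providerFor` reuses the provider sketch from earlier in this post; the real llm4s wiring differs in detail:

```scala
import scala.io.Source
import scala.util.Using

// Hypothetical .env loader: KEY=VALUE lines, comments and blanks skipped.
def loadEnv(path: String): Map[String, String] =
  Using.resource(Source.fromFile(path)) { src =>
    src.getLines()
      .map(_.trim)
      .filter(line => line.nonEmpty && !line.startsWith("#") && line.contains('='))
      .map { line =>
        val i = line.indexOf('=')
        line.take(i).trim -> line.drop(i + 1).trim
      }
      .toMap
  }

@main def embedDemo(): Unit = {
  val env      = loadEnv(".env")
  val provider = env.getOrElse("EMBEDDING_PROVIDER", "openai") // toggle providers here
  providerFor(provider).flatMap(_.embed(Seq("hello, embeddings"))) match {
    case Right(vectors) => println(s"Got a ${vectors.head.length}-dimensional vector")
    case Left(err)      => println(s"Embedding failed: ${err.message}")
  }
}
```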
🔧 What’s Next
Here’s a snapshot of what’s coming in the next phases:
| Milestone | Status |
| --- | --- |
| PR 1: OpenAI & VoyageAI Embedding Providers | ✅ Done |
| PR 2: Universal Extractor (Text, PDF, DOCX, HTML) | 🔧 In Progress |
| PR 3: FAISS Integration + Vector Search | 🧩 Coming Up |
| CLI Tool, Chunking, Metadata | 🚧 Scheduled |
🙏 A Quick Shout-Out
Big thanks to my mentors Rory Graves, Kannupriya Kalra and Dmitry Mamonov for their feedback, support, and encouragement. Every review is a learning moment.
📣 Let’s Connect
If you’re exploring RAG systems, embeddings, or Scala itself, I’d love to hear from you. Blog 2 will dive deep into the architecture, design, and lessons learned from my first two PRs.
Stay tuned — the real building has just begun.