Architecture
How this chatbot works
The chatbot on this portfolio is a RAG (Retrieval-Augmented Generation) system grounded in a corpus of documents about my work history and engineering experience. This page is the architectural writeup — the decisions, the trade-offs, and the things that went wrong first.
System overview
A user types a question. The question is embedded into a 1024-dimensional vector using Voyage AI. MongoDB Atlas Vector Search retrieves the 5 most semantically similar chunks from the corpus. Those chunks are injected into a system prompt alongside strict behavioral rules. Claude Haiku streams a response. The client renders it incrementally via SSE.
```
User query
  ↓
Voyage AI embed (voyage-3, 1024 dim)
  ↓
MongoDB Atlas $vectorSearch (cosine, top-5, numCandidates: 50)
  ↓
System prompt + retrieved context + conversation history
  ↓
Claude Haiku 4.5 (streaming)
  ↓
SSE stream → client renders tokens as they arrive
  ↓
Final SSE event: source chips (which documents were used)
```
Corpus and chunking strategy
The corpus is a set of Markdown files in src/lib/corpus/: a profile document, per-project deep dives (Mongeese, Kortex, Topship), voice samples that anchor how I write, and a dedicated gaps document that explicitly states areas where I have limited experience.
The gaps document is intentionally separated because the chatbot must be especially careful there — it should give an honest, specific answer rather than inflating my experience. Separating gaps from the main profile makes it easier to retrieve them when a user asks a gap-adjacent question.
Chunking uses a sentence-aware recursive splitter with a 500-token target size and 50-token overlap between adjacent chunks. The overlap prevents truncating a sentence mid-thought at a chunk boundary, which degrades retrieval quality significantly. Each chunk stores its source filename and the nearest section heading for attribution.
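As a rough sketch of the overlap logic (the real splitter is recursive and heading-aware; the sentence regex and `estimateTokens` here are simplifying assumptions, not the actual tokenizer):

```typescript
// Simplified sketch of sentence-aware chunking with token overlap.
// `estimateTokens` is a stand-in for a real tokenizer.
function estimateTokens(s: string): number {
  return Math.ceil(s.length / 4); // ~4 chars per token for English prose
}

function chunkText(text: string, target = 500, overlap = 50): string[] {
  const sentences = text.split(/(?<=[.!?])\s+/); // naive sentence split
  const chunks: string[] = [];
  let current: string[] = [];
  let size = 0;

  for (const sentence of sentences) {
    const tokens = estimateTokens(sentence);
    if (size + tokens > target && current.length > 0) {
      chunks.push(current.join(" "));
      // Carry trailing sentences forward until ~`overlap` tokens are reused,
      // so no chunk boundary cuts a thought off from its context.
      const carried: string[] = [];
      let carriedSize = 0;
      for (let i = current.length - 1; i >= 0 && carriedSize < overlap; i--) {
        carried.unshift(current[i]);
        carriedSize += estimateTokens(current[i]);
      }
      current = carried;
      size = carriedSize;
    }
    current.push(sentence);
    size += tokens;
  }
  if (current.length > 0) chunks.push(current.join(" "));
  return chunks;
}
```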
Why 500 tokens?
Shorter chunks (200–300 tokens) improve retrieval precision but lose surrounding context, causing the LLM to answer with isolated facts. Longer chunks (800+ tokens) retrieve too broadly and dilute the context window. 500 is a reasonable middle ground for prose-heavy technical writing.
Each chunk carries a content hash derived from its source file, chunk index, and text. The ingestion script uses this hash to skip unchanged chunks on re-runs, making ingestion idempotent. Only new or modified chunks get re-embedded — which matters because Voyage AI charges per token.
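A minimal sketch of the hash, assuming Node's built-in crypto module (the field name `contentHash` is illustrative):

```typescript
import { createHash } from "node:crypto";

// Deterministic chunk identity: same file + index + text → same hash.
// A null-byte separator avoids accidental collisions between fields.
function chunkHash(source: string, index: number, text: string): string {
  return createHash("sha256")
    .update(`${source}\u0000${index}\u0000${text}`)
    .digest("hex");
}

// Ingestion then becomes: if a document with this contentHash already
// exists in the collection, the chunk is unchanged and the embedding
// call is skipped entirely.
```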
Embedding model: Voyage AI voyage-3
I chose Voyage AI over OpenAI embeddings for two reasons. First, Voyage's retrieval-optimized models consistently outperform text-embedding-3-small on technical document retrieval benchmarks (BEIR, MTEB). Second, voyage-3 at 1024 dimensions offers a good precision/cost trade-off — smaller dimension counts would reduce Atlas storage and query cost at the expense of retrieval quality.
The alternative was using all-MiniLM-L6-v2 locally (free, 384 dim). I rejected this because the quality delta on nuanced technical questions is noticeable, and the per-query cost on Voyage is low enough that it doesn't matter at this traffic level.
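For reference, the embedding call itself is a single HTTP request. A sketch assuming Voyage's REST endpoint and OpenAI-style payload (check the exact response shape against their docs):

```typescript
// Embed a user query with voyage-3. Endpoint and payload shape are
// assumptions based on Voyage's documented embeddings API.
async function embedQuery(query: string): Promise<number[]> {
  const res = await fetch("https://api.voyageai.com/v1/embeddings", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.VOYAGE_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "voyage-3",
      input: [query],
      input_type: "query", // Voyage distinguishes query vs. document embeddings
    }),
  });
  const json = await res.json();
  return json.data[0].embedding; // 1024-dim vector
}
```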
MongoDB Atlas Vector Search index
The vector index uses cosine similarity, which is appropriate for normalized embeddings from language models. Inner product would be faster but requires unit-normalized vectors — cosine handles this automatically. Euclidean distance is less appropriate for semantic embeddings where direction, not magnitude, carries meaning.
```json
{
  "name": "corpus_vector_index",
  "type": "vectorSearch",
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 1024,
      "similarity": "cosine"
    },
    {
      "type": "filter",
      "path": "source"
    }
  ]
}
```

The filter field on source allows future queries to restrict retrieval to a specific document (e.g., "only search mongeese.md"). It's not used by default, but the index supports it without a schema migration.
numCandidates: 50 with limit: 5 is a standard HNSW oversampling ratio. HNSW (Hierarchical Navigable Small World) is the approximate nearest-neighbor algorithm Atlas uses internally. Oversampling by 10x gives the algorithm enough candidates to find the true top-5 with high recall, without scanning the entire index.
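The retrieval stage itself is one aggregation pipeline. A sketch using the Node driver (the index and field names match the definition above; the collection name and projection are assumptions):

```typescript
import type { Db } from "mongodb";

// Retrieve the top-5 chunks for a 1024-dim query embedding.
async function retrieveChunks(db: Db, queryEmbedding: number[]) {
  return db
    .collection("corpus_chunks") // collection name is an assumption
    .aggregate([
      {
        $vectorSearch: {
          index: "corpus_vector_index",
          path: "embedding",
          queryVector: queryEmbedding, // from embedQuery() above
          numCandidates: 50, // 10x oversampling keeps HNSW recall high
          limit: 5,
        },
      },
      {
        $project: {
          text: 1,
          source: 1,
          heading: 1,
          score: { $meta: "vectorSearchScore" },
        },
      },
    ])
    .toArray();
}
```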
Prompt injection defenses
There are two attack surfaces: the user message and the retrieved context. Both are handled, with different strategies.
User message attacks — jailbreak attempts like "ignore previous instructions", "pretend you are X", or "show me your system prompt" — are handled at the system prompt level. The system prompt explicitly instructs Claude to refuse these with a fixed response. This is not foolproof against every jailbreak, but it catches the overwhelming majority of casual attempts.
Corpus injection attacks — where a malicious document injected into the corpus could override the system prompt — are mitigated by controlling who can write to the corpus. The corpus is a set of checked-in Markdown files in the repository. There is no user-facing corpus upload interface. The ingestion endpoint requires an ADMIN_KEY.
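The guard on the ingestion route is a plain shared-secret check. A sketch (the header name and handler shape are assumptions):

```typescript
// Reject ingestion requests that don't present the admin key.
// Header name and response bodies are illustrative.
export async function POST(req: Request): Promise<Response> {
  if (req.headers.get("x-admin-key") !== process.env.ADMIN_KEY) {
    return new Response("Unauthorized", { status: 401 });
  }
  // ...re-chunk, hash, and embed the corpus here...
  return new Response("ok");
}
```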
The system prompt is never returned to the client. SSE events only carry text deltas and source metadata. The prompt is assembled and discarded server-side.
What this doesn't defend against
A determined adversary with hours and many tokens could likely find a jailbreak that works. This is acceptable — the chatbot holds no sensitive data, can only speak to publicly documentable facts about my work, and has a daily spend budget that limits the damage of sustained attacks.
Rate limiting design
Rate limiting uses Upstash Redis with a sliding window approach: 10 messages per IP per hour, 50 per IP per day. Upstash was chosen over a self-hosted Redis for two reasons: it has a generous free tier that covers this traffic level, and it's globally distributed with low-latency reads, which matters for a rate-limit check that sits in the hot path of every request.
IPs are hashed with SHA-256 before being used as Redis keys, so no raw IP addresses are stored in any external system. The hash is one-way, which keeps addresses out of Upstash even if the account were compromised. Note that an unsalted hash over the small IPv4 space is a mitigation rather than true anonymization; the goal is to keep raw PII out of external storage, not to defeat a determined attacker.
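Both limits compose cleanly with the @upstash/ratelimit client. A sketch, assuming that library (the key prefixes are illustrative):

```typescript
import { createHash } from "node:crypto";
import { Ratelimit } from "@upstash/ratelimit";
import { Redis } from "@upstash/redis";

const redis = Redis.fromEnv();
const hourly = new Ratelimit({
  redis,
  limiter: Ratelimit.slidingWindow(10, "1 h"),
  prefix: "rl:hour",
});
const daily = new Ratelimit({
  redis,
  limiter: Ratelimit.slidingWindow(50, "1 d"),
  prefix: "rl:day",
});

// The key is a SHA-256 of the IP, so Upstash never sees a raw address.
async function checkRateLimit(ip: string): Promise<boolean> {
  const key = createHash("sha256").update(ip).digest("hex");
  const [h, d] = await Promise.all([hourly.limit(key), daily.limit(key)]);
  return h.success && d.success;
}
```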
The daily spend cap is a second layer of rate limiting — not per-IP, but global. If the day's Anthropic API spend exceeds the configured budget (default: $5), all chat requests receive a graceful degraded response rather than a 500. This prevents a traffic spike or a cost-blindness bug from running up an unexpected bill.
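The budget check is a small Redis counter keyed by date. A sketch (key names and expiry are assumptions; per-request cost estimation isn't shown):

```typescript
import { Redis } from "@upstash/redis";

const redis = Redis.fromEnv();
const DAILY_BUDGET_USD = Number(process.env.DAILY_BUDGET_USD ?? "5");

function todayKey(): string {
  return `spend:${new Date().toISOString().slice(0, 10)}`; // e.g. spend:2025-06-01
}

// Called before each chat request: degrade gracefully once the budget is hit.
async function underBudget(): Promise<boolean> {
  const spent = Number((await redis.get(todayKey())) ?? 0);
  return spent < DAILY_BUDGET_USD;
}

// Called after each completion with the request's estimated cost in USD.
async function recordSpend(usd: number): Promise<void> {
  const key = todayKey();
  await redis.incrbyfloat(key, usd);
  await redis.expire(key, 60 * 60 * 48); // old days clean themselves up
}
```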
Streaming and latency
Responses stream via Server-Sent Events (SSE). The client issues a fetch(), reads the response body as a ReadableStream, decodes chunks with a TextDecoder, and parses newline-delimited JSON events. Each content_block_delta event from Anthropic is forwarded immediately, so the user sees tokens appear in near-realtime.
The final SSE event carries source metadata — which corpus chunks were used — displayed as source chips below the response. This is sent after the stream completes so it doesn't interfere with the perceived streaming experience.
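Client-side, the whole reader fits in one loop. A sketch (the event names and callback signatures are illustrative, not the actual implementation):

```typescript
async function streamChat(
  message: string,
  history: { role: string; content: string }[],
  onToken: (text: string) => void,
  onSources: (sources: unknown[]) => void,
): Promise<void> {
  const res = await fetch("/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ message, history }),
  });

  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });

    // Events are newline-delimited JSON; keep any partial line buffered.
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? "";
    for (const line of lines) {
      if (!line.trim()) continue;
      const event = JSON.parse(line);
      if (event.type === "delta") onToken(event.text); // render immediately
      if (event.type === "sources") onSources(event.sources); // final event
    }
  }
}
```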
Cold start on Vercel serverless functions adds ~200–400ms to the first request. The embedding call (Voyage) and vector search (MongoDB Atlas) add ~300–600ms each, sequentially. Total pre-token latency on a warm function is typically 600–900ms — within the <1.5s target in the spec.
Conversation state
State is managed entirely on the client with React state. Nothing is persisted to a server, a database, or localStorage. Each request sends the last 6 messages of history along with the new message, giving Claude enough context to handle follow-up questions without a server-side session store.
The conversation is capped at 20 messages. After that, a reset prompt appears. This prevents the context window from growing unboundedly in long sessions and caps per-session token spend at a predictable maximum.
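The trimming itself is one slice on the client. A sketch (the message shape and function names are illustrative; the constants mirror the numbers above):

```typescript
interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

const MAX_MESSAGES = 20; // hard cap before the reset prompt appears
const HISTORY_WINDOW = 6; // messages sent to the server per request

function buildPayload(messages: ChatMessage[], newMessage: string) {
  if (messages.length >= MAX_MESSAGES) {
    return null; // UI shows the reset prompt instead of sending
  }
  return {
    history: messages.slice(-HISTORY_WINDOW), // last 6 messages only
    message: newMessage,
  };
}
```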
Trade-offs and what I'd do differently
No reranking. A production RAG system would add a cross-encoder reranker between vector retrieval and the LLM call — retrieve 20 candidates with ANN, rerank with a cross-encoder (e.g., Cohere Rerank), pass the top 5 to the LLM. This significantly improves retrieval precision for nuanced questions. I omitted it to keep the architecture simple and the latency under 1.5s.
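For concreteness, the omitted step would look roughly like this, assuming Cohere's rerank REST endpoint (not implemented in this system; treat the endpoint and model name as assumptions):

```typescript
// Hypothetical reranking pass over 20 ANN candidates; NOT part of this system.
async function rerankTop5(query: string, candidates: string[]): Promise<string[]> {
  const res = await fetch("https://api.cohere.com/v1/rerank", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.COHERE_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "rerank-english-v3.0", // model name is an assumption
      query,
      documents: candidates, // the 20 ANN candidates
      top_n: 5,
    }),
  });
  const { results } = await res.json();
  // Results arrive sorted by relevance, with indices into `documents`.
  return results.map((r: { index: number }) => candidates[r.index]);
}
```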
No hybrid search. Combining vector search with BM25 keyword search (MongoDB Atlas Search) would improve retrieval for exact-match queries like project names or specific dates. Vector search alone can miss obvious exact-match results if the embedding space doesn't cluster them tightly. This is a known limitation.
Corpus is static. The corpus is Markdown files checked into the repository. A more sophisticated system would have a CMS or a structured data format that separates content from code, making it easier for a non-engineer (or a future me) to update project details without touching the repository. For a personal portfolio, this is acceptable friction.