Every team we meet wants the same thing: a clean conversational interface where customers, employees or partners can ask anything and get useful, branded answers — backed by their own knowledge. The demo is easy. The production system is not.
This post walks through how we design and operate an AI Conversations SaaS — the kind of platform Devonstank ships for clients who need it embedded into their product, support flow or partner portal.
What we mean by “AI Conversations SaaS”
A multi-tenant platform where each tenant can:
- Ingest their own knowledge (docs, PDFs, URLs, databases, APIs).
- Configure assistants per use case — support, sales, internal ops, partner portal.
- Deploy the assistant across channels — web chat, in-app, WhatsApp, email, voice.
- See conversation analytics, satisfaction signals and cost per conversation.
- Stay in their own data lane (tenant isolation, audit logs, retention controls).
High-level architecture
We break the platform into five layers that can scale independently:
- Ingestion & sync — connectors for docs, websites, Notion, Confluence, Drive, S3 and databases. Incremental sync, change detection and chunking.
- Retrieval & memory — vector store (we like pgvector or Milvus depending on scale), hybrid search (BM25 + dense), tenant-scoped namespaces.
- Orchestration — an agent runtime that selects tools, calls the LLM and handles fallbacks. We typically build on top of LangGraph or a thin in-house orchestrator.
- Channels — web widget, WhatsApp Business API, Slack, Teams, voice. Same brain, different mouth.
- Ops & analytics — eval harness, conversation review, cost dashboards, content gap detection.
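To make the "incremental sync, change detection" part of the ingestion layer concrete, here is a minimal sketch of hash-based change detection. It assumes each connector can return a snapshot of document id to text for a tenant; `SyncState`, `content_hash` and `plan_sync` are hypothetical names used for illustration, and a real system would persist the hashes rather than keep them in memory.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class SyncState:
    """Remembers the last content hash seen for each (tenant, document) pair."""
    hashes: dict[tuple[str, str], str] = field(default_factory=dict)


def content_hash(text: str) -> str:
    # Collapse whitespace so cosmetic re-exports do not trigger a re-index.
    normalised = " ".join(text.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def plan_sync(state: SyncState, tenant_id: str, docs: dict[str, str]) -> dict[str, list[str]]:
    """Compare a connector snapshot against stored hashes for one tenant.

    Returns document ids to re-chunk and re-embed ("upsert") and ids to
    drop from the index ("delete"). Updates `state` in place.
    """
    upsert: list[str] = []
    delete: list[str] = []

    for doc_id, text in docs.items():
        digest = content_hash(text)
        key = (tenant_id, doc_id)
        if state.hashes.get(key) != digest:
            upsert.append(doc_id)
            state.hashes[key] = digest

    # Anything we indexed before for this tenant that is no longer in the source.
    for tenant, doc_id in list(state.hashes):
        if tenant == tenant_id and doc_id not in docs:
            delete.append(doc_id)
            del state.hashes[(tenant, doc_id)]

    return {"upsert": sorted(upsert), "delete": sorted(delete)}
```

Keying the state by tenant and document id keeps one tenant's sync from ever touching another tenant's entries, which is the same isolation rule the retrieval layer applies to namespaces.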
The hardest part of a conversational-AI SaaS isn't the LLM call — it's everything around the LLM call.
Retrieval is where most products silently fail
Once you go past the demo, retrieval quality becomes the dominant driver of perceived intelligence. We invest heavily in:
- Chunking that respects structure — headings, lists and tables aren't just text, they carry signal.
- Hybrid retrieval — lexical + semantic, with reranking on top.
- Per-tenant tuning — B2B customers have wildly different knowledge shapes. The default config is the worst config for everyone.
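As an illustration of the hybrid retrieval step, the sketch below merges a lexical ranking and a dense ranking with reciprocal rank fusion (RRF). RRF is one common fusion method, used here as an assumption rather than a description of any specific production setup. The `lexical_search` and `dense_search` callables are hypothetical stand-ins for a BM25 index and a vector store, and the reranking stage is left out.

```python
from collections import defaultdict
from typing import Callable

# A search backend takes (tenant_id, query) and returns chunk ids, best first.
SearchFn = Callable[[str, str], list[str]]


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk ids into one.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. A larger k flattens the gap between ranks.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; break ties by chunk id so output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def hybrid_search(
    tenant_id: str,
    query: str,
    lexical_search: SearchFn,
    dense_search: SearchFn,
    top_k: int = 5,
) -> list[str]:
    """Run both retrievers inside the tenant's namespace and fuse the results."""
    lexical = lexical_search(tenant_id, query)
    dense = dense_search(tenant_id, query)
    return reciprocal_rank_fusion([lexical, dense])[:top_k]
```

Passing `tenant_id` into both backends, instead of filtering after fusion, keeps every candidate list tenant-scoped from the start. The fused list is a natural input for a reranker, and `k` and `top_k` are the kind of settings that per-tenant tuning would adjust.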
Evaluation: how we know it actually works
We build an internal eval harness from day one. It runs on every prompt change, every retrieval tweak and every model bump. Tests include:
- Golden Q&A pairs per tenant with rubric-based grading.
- Hallucination probes — questions whose ground truth isn't in the index.
- Tool-use traces — did the agent pick the right tool, in the right order?
- Latency and cost regression checks — quality isn't free.
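A stripped-down version of such a harness might look like the sketch below. It covers three of the test types above: golden Q&A, hallucination probes and tool-use traces. Everything here is an assumption for illustration: `EvalCase`, `AssistantResult`, `grade` and `run_suite` are hypothetical names, and the "rubric" is reduced to a required-phrase check where a real harness would likely use a richer grader.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AssistantResult:
    answer: str
    tools_called: list[str]
    abstained: bool  # True when the assistant said it does not know


@dataclass
class EvalCase:
    question: str
    must_mention: list[str]    # crude rubric: phrases a good answer contains
    expected_tools: list[str]  # tools in the order they should be called
    answerable: bool = True    # False marks a hallucination probe


def grade(case: EvalCase, result: AssistantResult) -> dict[str, bool]:
    """Score one case on content and on tool-use trace."""
    if case.answerable:
        answer_lc = result.answer.lower()
        content_ok = (not result.abstained) and all(
            phrase.lower() in answer_lc for phrase in case.must_mention
        )
    else:
        # Ground truth is not in the index, so the only correct move is to abstain.
        content_ok = result.abstained
    tools_ok = result.tools_called == case.expected_tools
    return {"content": content_ok, "tools": tools_ok}


def run_suite(cases: list[EvalCase], assistant: Callable[[str], AssistantResult]) -> float:
    """Return the fraction of cases that pass every check."""
    if not cases:
        raise ValueError("eval suite is empty")
    passed = sum(all(grade(case, assistant(case.question)).values()) for case in cases)
    return passed / len(cases)
```

Running `run_suite` in CI against a fixed set of cases per tenant, and failing the build when the pass rate drops, is one way to make "runs on every prompt change" enforceable. Latency and cost checks would wrap the `assistant` call with timers and token counters.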
Cost control
Everyone expects an LLM bill, and almost everyone is still surprised by its size. We record token counts, model and tool cost for every conversation, then apply the usual tricks:
- Smaller models for routing and classification, larger ones only for the final answer.
- Aggressive prompt caching for system prompts and tenant context.
- Streaming & early termination for “I don't know” flows.
- Tenant-level budgets and alerts so a runaway integration never becomes a runaway invoice.
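Model routing and tenant budgets can be sketched in a few lines. The model names, prices and thresholds below are made up for illustration (real prices vary by provider and change over time), and `model_for`, `call_cost` and `TenantBudget` are hypothetical helpers.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per 1,000 tokens.
PRICE_PER_1K = {
    "small-router": {"input": 0.0002, "output": 0.0008},
    "large-answer": {"input": 0.0030, "output": 0.0150},
}


def model_for(step: str) -> str:
    """Send cheap steps to the small model, only the final answer to the large one."""
    return "large-answer" if step == "answer" else "small-router"


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single LLM call."""
    price = PRICE_PER_1K[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1000


@dataclass
class TenantBudget:
    monthly_limit_usd: float
    alert_ratio: float = 0.8  # warn once 80% of the budget is spent
    spent_usd: float = 0.0

    def record(self, model: str, input_tokens: int, output_tokens: int) -> str:
        """Add one call to the tenant's spend and return 'ok', 'alert' or 'blocked'."""
        self.spent_usd += call_cost(model, input_tokens, output_tokens)
        if self.spent_usd >= self.monthly_limit_usd:
            return "blocked"
        if self.spent_usd >= self.alert_ratio * self.monthly_limit_usd:
            return "alert"
        return "ok"
```

What the platform does on "blocked" is a product decision: it might refuse new conversations, fall back to the small model, or only page the account owner. The sketch just surfaces the state so that choice can be made per tenant.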
Where Devonstank fits in
We build AI Conversations SaaS platforms in one of three ways:
- A managed product — we design, build and operate it under your brand.
- An embedded squad — we build alongside your engineers and hand over.
- A discovery sprint — 4–8 week pilot to validate ROI before committing.
Frequently asked questions
What is an AI Conversations SaaS?
A multi-tenant platform where each tenant can ingest their own knowledge, configure assistants per use case, deploy them across channels (web, in-app, WhatsApp, email, voice) and monitor analytics, satisfaction and cost.
Which models do you use?
We are model-agnostic: Anthropic Claude, OpenAI GPT, Google Gemini and open-weight models via Bedrock, Vertex or on-prem. We pick per use case based on quality, latency and cost.
How do you control LLM costs?
Smaller models for routing and classification, larger for final answers; aggressive prompt caching; streaming with early termination; tenant-level budgets and alerts.
How long does a pilot take?
Typically 4 to 8 weeks for a focused, production-leaning pilot.
Want one for your product?
Let's scope what a conversational AI layer could look like for your business.

