From RASA to a multi-model system that plans, reasons, and responds with empathy
This is the story of UniMind — a student-facing mental-health chatbot we built at the University of Northampton. As part of our second-year group project module, we were given two main objectives:
- The chatbot should not attempt to solve problems directly, but rather signpost students to appropriate resources or provide supportive guidance.
- The chatbot must implement conversation planning to maintain a coherent therapeutic trajectory. This ensures that even during extended interactions, the conversation remains structured and goal-oriented, rather than devolving into disconnected, one-off responses.
It began as a rule-driven RASA assistant and grew into a safety-first, multi-agent AI that plans the conversation before it speaks.
How we started: bridging the support gap
University counselling teams work incredibly hard — and they’re also at capacity. Our goal with UniMind was not to replace clinicians but to offer safe, 24/7 first-line support that complements existing services and seamlessly routes students to real help.
The final system hits those targets across multiple metrics: layered crisis detection with ~99.99% reliability (internal estimate; 100% on our scenario set, ≥99.9% with a single-provider outage, ~99% worst-case local fallback), 8.2/10 user satisfaction, and an ~89% projected cost reduction.
Phase 1 — The legacy RASA approach (and what it taught us)
We began with a conventional RASA stack and built a deep intent hierarchy (200+ intents) with stories covering crisis indicators, academic stress, finances, relationships and campus services. In controlled tests we reached 100% story-level accuracy, but intent-classification accuracy on real, unscripted input averaged only ~5.24%, a critical limitation for this domain.

Phase 2 — From rules to reasoning: the multi-agent redesign

We rebuilt UniMind as a three-agent therapeutic architecture:
- PSY-OVERSEER-1 (Planner) — proposes a short plan for the next step(s) of the conversation (Initial → Assessment → Treatment).
- PSY-OVERSEER-2 (Guidance) — translates each plan step into micro-instructions.
- PSY_mini (Conversationalist) — a psychotherapy-tuned LLM that speaks in a warm, concise style, under strict safety and tone rules.
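To make the division of labour concrete, here is a minimal sketch of how one turn flows through the three agents. All function names and the stubbed responses are illustrative placeholders, not the actual UniMind API; in the real system the planner and guidance steps call Gemini 2.0 Flash (with an LM Studio fallback) and the final reply comes from the psychotherapy-tuned 8B model.

```python
# Sketch of one conversational turn through the three agents (hypothetical names).
from dataclasses import dataclass, field

@dataclass
class SessionState:
    stage: str = "Initial"                      # Initial -> Assessment -> Treatment
    plan: list = field(default_factory=list)
    history: list = field(default_factory=list)

def plan_next_steps(state, user_message):
    """PSY-OVERSEER-1: propose a short plan for the next step(s). Stubbed here."""
    return ["Explore what is driving the student's stress"]

def write_guidance(plan_step, state, user_message):
    """PSY-OVERSEER-2: translate the plan step into micro-instructions. Stubbed here."""
    return ("Validate the feeling, reflect the key point in one sentence, "
            "then ask exactly one open question about: " + plan_step)

def generate_reply(guidance, user_message):
    """PSY_mini: produce a warm, <=2-sentence reply following the guidance. Stubbed here."""
    return "That sounds really heavy. What part of it feels most pressing right now?"

def handle_turn(user_message, state):
    state.plan = plan_next_steps(state, user_message)               # strategic planning
    guidance = write_guidance(state.plan[0], state, user_message)   # tactical guidance
    reply = generate_reply(guidance, user_message)                  # empathetic execution
    state.history.append({"user": user_message, "assistant": reply})
    return reply

print(handle_turn("I'm behind on two assignments and can't sleep.", SessionState()))
```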
Agent Responsibility Matrix
| Agent | Primary Function | Technology | Input | Output |
|---|---|---|---|---|
| PSY-OVERSEER-1 | Strategic Planning | Gemini 2.0 Flash | User conversation + context | Therapeutic plan steps |
| PSY-OVERSEER-2 | Tactical Guidance | Combined with PSY-OVERSEER-1 (Gemini 2.0 Flash or LM Studio fallback) | Plan step + user state + conversation context | Response guidance |
| PSY_mini | Conversation Execution | Psychotherapy-LLM (8B) | Guidance + user message | Empathetic response |
| RESOURCES_assist | University Integration | OpenAI Assistant API | Conversation content | UON resources |
Table 1 — Agent Responsibility Matrix (see technical report, p.15).
Making “helpful” feel human — style & safety by design
A strict system prompt sets a warm persona, keeps replies short (max 2 sentences, ≈60 words), and adds a safety override for crisis language.
The PSY_mini prompt enforces a micro-structure:
validation → reflection → optional psychoeducation-with-application → one open question.
An action threshold (typically 60–70% plan progress) delays concrete skills and advice until sufficient understanding has been built. This prevents premature advice-dumping and improves perceived empathy.
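A minimal sketch of how the micro-structure and the action threshold could combine into a guidance template. The 0.65 threshold value and the helper names are assumptions for illustration, not the production values.

```python
# Illustrative action-threshold gate: concrete coping skills are only unlocked
# once enough of the plan has been explored. Threshold and names are assumed.

ACTION_THRESHOLD = 0.65   # "typically 60-70% plan progress"

def plan_progress(completed_steps: int, total_steps: int) -> float:
    return completed_steps / total_steps if total_steps else 0.0

def build_guidance(plan_step: str, completed_steps: int, total_steps: int) -> str:
    guidance = (
        "1. Validate the user's feeling in one short sentence.\n"
        "2. Reflect their key point (no tag-questions).\n"
    )
    if plan_progress(completed_steps, total_steps) >= ACTION_THRESHOLD:
        # Enough exploration has happened: a concrete skill may be offered,
        # framed as psychoeducation applied to the user's own situation.
        guidance += "3. Offer one brief coping skill tied to what they described.\n"
    guidance += "4. End with exactly one open question."
    return guidance

print(build_guidance("Explore exam anxiety triggers", completed_steps=1, total_steps=4))
```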
How we measured progress — cycles, rubrics, and evidence
Evaluation covered quantitative and qualitative signals across internal sessions, a public demo, and external participants. Reliability and performance summaries are detailed in the technical report.
Prompt-design analysis (pre-evaluation)
We chose this quantifiable route so changes could be defended with data rather than handcrafted rules. Using base-data comparisons (original therapy vs student responses) and an LLM-based scoring rubric evaluated with OpenAI's o4-mini model, we targeted the upper quartile for human-likeness and tone, then encoded the highest-impact micro-elements into guidance.
Table 2: LLM Scoring Rubric Dimensions
| Dimension | Weight | Description |
|---|---|---|
| Therapeutic Goals | 35% | Validation, insight building, coping-skill development, exploration, reframing, emotional regulation |
| Intervention Types | 25% | Reflective listening, open-ended questions, psychoeducation, empathetic responses, summarisation, homework assignments |
| Human-Likeness | 15% | Lexical variety, mild hedging, grounded references, natural rhythm, no boilerplate |
| Emotional Tone | 15% | Empathy, support, curiosity, warmth, neutrality |
| Response Structure | 10% | Validation→question, reflection→reframe, psychoeducation→application, empathy→coping, exploration-focused |
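Given the weights in Table 2, a per-response score can be aggregated as a simple weighted sum. The sketch below assumes each dimension is scored 0–100 by the LLM judge; the dictionary keys and example values are illustrative only.

```python
# Weighted aggregation of the rubric dimensions from Table 2.
# Per-dimension scores (0-100) would come from the o4-mini judge;
# the example values below are made up for illustration.

RUBRIC_WEIGHTS = {
    "therapeutic_goals": 0.35,
    "intervention_types": 0.25,
    "human_likeness": 0.15,
    "emotional_tone": 0.15,
    "response_structure": 0.10,
}

def overall_score(dimension_scores: dict) -> float:
    """Weighted sum of 0-100 dimension scores; the weights sum to 1.0."""
    return sum(RUBRIC_WEIGHTS[d] * dimension_scores[d] for d in RUBRIC_WEIGHTS)

example = {
    "therapeutic_goals": 72,
    "intervention_types": 68,
    "human_likeness": 75,
    "emotional_tone": 80,
    "response_structure": 60,
}
print(f"{overall_score(example):.1f}")  # upper-quartile responses landed around 65-78
```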
Table 3: Upper Quartile Response Characteristics (65–78% scores)
| Micro-Element | Implementation |
|---|---|
| Early validation | Response begins with acknowledgment of user’s feelings |
| Reflective sentence | One sentence mirroring the user’s key point (no tag-questions) |
| Open question | Exactly one open-ended question to encourage exploration |
| Conversational tone | Warm, natural language with varied sentence length and contractions |
Analysis revealed that responses scoring in the upper quartile (typically 65–78%) consistently featured these micro-elements, which we directly encoded into the PSY-OVERSEER-2 guidance template.
See analysis assets in the repo’s Analysis_of_Test_Cycles.



Key finding: High-scoring responses consistently featured early validation, one reflective sentence, exactly one open question, gentle reframing, brief psychoeducation→application, timed coping/action suggestions, warm tone, and avoidance of bullet lists. These micro-elements were encoded directly into the PSY-OVERSEER-2 guidance template.
Technical addendum (from cycle insights)
- Micro-structure enforcement: PSY-OVERSEER-2 guidance always yields validation → reflection → optional psychoeducation-with-application → one open question. This structure is encoded directly in the combined prompt.
- Action threshold gating: Coping/homework appears only after sufficient exploration progress (typically 60–70% plan completion) to avoid premature advice.
- Stage transitions: A fixed three-step initial plan, then automatic Initial → Assessment → Treatment progression with dynamic plan alignment; the system re-checks plan alignment on every user message.
- Conversation summarisation: Every 10 interactions, the system automatically summarises conversation chunks to maintain context in long sessions without exceeding token limits.
- Lightweight eval harness: Internal scoring correlated with the micro-elements above; details in the repo Cycle Insights report.
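A sketch of the stage-transition and summarisation bookkeeping described above, under the assumption that stage changes simply follow plan completion and that summarisation fires on a fixed interaction count; the helper names are placeholders.

```python
# Illustrative bookkeeping for stage transitions and periodic summarisation.
# The stage order and the 10-interaction cadence come from the addendum above.

STAGES = ["Initial", "Assessment", "Treatment"]
SUMMARISE_EVERY = 10

def next_stage(current: str, plan_completed: bool) -> str:
    """Advance Initial -> Assessment -> Treatment once the current plan is done."""
    if plan_completed and current != STAGES[-1]:
        return STAGES[STAGES.index(current) + 1]
    return current

def summarise_chunk(turns: list) -> str:
    # Stub: the real system asks an LLM to compress these turns.
    return f"Summary of {len(turns)} earlier turns."

def maybe_summarise(history: list, interaction_count: int) -> list:
    """Every 10 interactions, fold older turns into a summary to bound context size."""
    if interaction_count % SUMMARISE_EVERY == 0 and len(history) > SUMMARISE_EVERY:
        summary = summarise_chunk(history[:-SUMMARISE_EVERY])
        return [{"summary": summary}] + history[-SUMMARISE_EVERY:]
    return history
```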
Therapy session feedback (evaluation notes)
Short therapy‑style test sessions provided qualitative checks on warmth, brevity and progression, plus safety behaviour under crisis language. We used these logs and reports to refine style guards, the action threshold and plan transitions. Session artefacts live in experiments/therapy_sessions and the framework notes in psychocounsel_testing_framework.md.
Model strengths & weaknesses
| Type | Detail |
|---|---|
| Strength | Multi‑layer crisis safety (deterministic crisis copy + layered detection); warm, concise tone with structured micro‑moves (validate → reflect → optional psychoeducation‑to‑application → single open question); clear plan stages; UON resource integration. |
| Strength | Neutral, jargon‑free language aiding inclusivity; consistent structure reduces ad‑hoc bias; explicit safety overrides. |
| Weakness | UK/UON‑centric resources by default; needs broader localisation for non‑UK contexts and international students (including crisis lines and services). |
| Weakness | Prompts lack explicit cultural‑sensitivity cues; missing gentle checks for cultural/identity context when relevant. |
| Weakness | Occasional edge‑case tone misreads (e.g., “disrespectful” perception) and handling of abrupt topic shifts; plan‑alignment smoothing needed. |
User feedback & next steps
- User feedback (final system): external student testing averaged 8.2/10; feedback highlighted understanding/empathy, appropriate response quality, and a calming UI; follow‑up questions felt natural; a small number flagged tone edge‑cases in specific turns.
- Performance notes: public demo showed ~20 s cold‑start then ~9 s warm responses; current optimised stack targets ≈2.5 s on warm paths.
- Improvements in progress: pre‑warming and pooling to reduce cold‑start spikes; broadened resource localisation; explicit cultural‑sensitivity cues in prompts; continued tuning of action‑threshold and plan alignment; structured feedback collection in sessions.
- Deployment: web deployment work is active to make the system publicly accessible, alongside GDPR controls, monitoring, and reliability targets.
Results (prototype stage)
- Layered crisis detection: ~99.99% reliability (internal estimate), with 100% on our scenario set, 99.9% under a single-provider outage, and ~99% in the worst-case local fallback (OpenAI 94%, Gemini 96%, local 88%; Table 2, p.24; Table 3, p.28). The fallback chain is sketched below.
- Satisfaction: 8.2/10 (Table 3, p.28).
- Latency: ~9s warm on early demo; ≈2.5s optimised (p.25 and Table 3, p.28).
- Cost: £13.36 per user/month (~89% reduction) — p.26 and Fig. 5 on p.27.
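A sketch of the layered detection idea behind those crisis-reliability numbers: each provider layer is tried in turn, any positive flag triggers the deterministic crisis response, and a provider outage falls through to the next layer. The detector wrappers, crisis terms and exception handling here are placeholders, not the actual integration code.

```python
# Layered crisis detection with provider fallback (sketch, not production code).
# Any layer flagging triggers the crisis path; an outage falls through to the
# next layer, which is what the 99.9% / ~99% degraded-mode figures reflect.

CRISIS_TERMS = ("end my life", "kill myself", "hurt myself")  # illustrative only

def openai_flags_crisis(message: str) -> bool:
    raise ConnectionError("stand-in for the OpenAI classification call")

def gemini_flags_crisis(message: str) -> bool:
    raise ConnectionError("stand-in for the Gemini classification call")

def local_model_flags_crisis(message: str) -> bool:
    # Last-resort local fallback; the real layer is a small local classifier.
    return any(term in message.lower() for term in CRISIS_TERMS)

DETECTORS = [openai_flags_crisis, gemini_flags_crisis, local_model_flags_crisis]

def is_crisis(message: str) -> bool:
    for detect in DETECTORS:
        try:
            if detect(message):
                return True          # any positive layer -> deterministic crisis copy
        except Exception:
            continue                 # provider outage -> try the next layer
    return False

print(is_crisis("Lately I've wanted to end my life."))  # True via the local fallback
```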
