Building UniMind: A Multi-Agent Mental Health Chatbot

From RASA to a multi-model system that plans, reasons, and responds with empathy

This is the story of UniMind — a student-facing mental-health chatbot we built at the University of Northampton. As part of our second-year group project module, we were given two main objectives:

  1. The chatbot should not attempt to solve problems directly, but rather signpost students to appropriate resources or provide supportive guidance.
  2. The chatbot must implement conversation planning to maintain a coherent therapeutic trajectory. This ensures that even during extended interactions, the conversation remains structured and goal-oriented, rather than devolving into disconnected, one-off responses.

It began as a rule-driven RASA assistant and grew into a safety-first, multi-agent AI that plans the conversation before it speaks.

How we started: bridging the support gap

University counselling teams work incredibly hard — and they’re also at capacity. Our goal with UniMind was not to replace clinicians but to offer safe, 24/7 first-line support that complements existing services and seamlessly routes students to real help.
The final system hits those targets across multiple metrics: layered crisis detection with ~99.99% estimated reliability (internal figures: 100% on our scenario set, ≥99.9% with a single-provider outage, and ~99% in the worst-case local fallback), 8.2/10 user satisfaction, and an ~89% projected cost reduction.

Phase 1 — The legacy RASA approach (and what it taught us)

We began with a conventional RASA stack and built a deep intent hierarchy (200+ intents) with stories covering crisis indicators, academic stress, finances, relationships and campus services. In controlled tests we reached 100% story-level accuracy, but intent classification on real user input averaged only ~5.24%, a critical limitation in this domain.

Phase 2 — From rules to reasoning: the multi-agent redesign

We rebuilt UniMind as a three-agent therapeutic architecture:

Agent Responsibility Matrix

| Agent | Primary Function | Technology | Input | Output |
| --- | --- | --- | --- | --- |
| PSY-OVERSEER-1 | Strategic planning | Gemini 2.0 Flash | User conversation + context | Therapeutic plan steps |
| PSY-OVERSEER-2 | Tactical guidance | Combined with PSY-OVERSEER-1 (Gemini 2.0 Flash, LM Studio fallback) | Plan step + user state + conversation context | Response guidance |
| PSY_mini | Conversation execution | Psychotherapy-LLM (8B) | Guidance + user message | Empathetic response |
| RESOURCES_assist | University integration | OpenAI Assistant API | Conversation content | UON resources |

Table 1 — Agent Responsibility Matrix. See report p.15 (Table 1).
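
To make the hand-offs concrete, here is a minimal sketch of a single turn in Python. The stub functions stand in for the model calls listed in Table 1 (RESOURCES_assist is omitted for brevity), and all names are illustrative rather than the production code:

```python
from dataclasses import dataclass, field

@dataclass
class SessionState:
    history: list = field(default_factory=list)
    plan_steps: list = field(default_factory=lambda: [
        "build rapport", "explore the stressor", "introduce a coping skill"])
    step_index: int = 0

def plan_conversation(state: SessionState) -> str:
    """PSY-OVERSEER-1: choose the current therapeutic plan step."""
    return state.plan_steps[min(state.step_index, len(state.plan_steps) - 1)]

def build_guidance(step: str, state: SessionState) -> str:
    """PSY-OVERSEER-2: turn plan step + user state into response guidance."""
    return f"Stay on '{step}': validate, reflect, ask one open question."

def generate_reply(guidance: str, user_message: str) -> str:
    """PSY_mini: write the empathetic reply under the guidance."""
    return f"[reply to {user_message!r} shaped by: {guidance}]"

def handle_turn(user_message: str, state: SessionState) -> str:
    state.history.append(("student", user_message))
    guidance = build_guidance(plan_conversation(state), state)
    reply = generate_reply(guidance, user_message)
    state.history.append(("unimind", reply))
    return reply

print(handle_turn("I'm behind on two assignments and can't sleep.", SessionState()))
```

The key design point is that PSY_mini never sees the raw plan: it only receives the distilled guidance, which keeps the small 8B model on-trajectory without asking it to reason about strategy.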

Making “helpful” feel human — style & safety by design

A strict system prompt sets a warm persona, keeps replies short (max 2 sentences, ≈60 words), and adds a safety override for crisis language.

The PSY_mini prompt enforces a micro-structure:
validation → reflection → optional psychoeducation-with-application → one open question.
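
A minimal sketch of what such a prompt can look like (the wording here is illustrative, not our production text, but it encodes the same constraints: warm persona, hard length cap, micro-structure, and the crisis override):

```python
# Illustrative PSY_mini system prompt -- a sketch, not the shipped prompt.
PSY_MINI_SYSTEM_PROMPT = """\
You are UniMind, a warm, supportive companion for university students.
Reply in at most 2 sentences (roughly 60 words).
Shape every reply as: validation -> reflection -> optional brief
psychoeducation applied to the student's situation -> exactly one
open question.
SAFETY OVERRIDE: if the message contains crisis or self-harm language,
ignore all style rules and return the fixed crisis-support response
with emergency contacts.
"""
```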

An action threshold (typically 60–70% plan progress) delays concrete skills and advice until sufficient understanding has been built. This prevents premature advice-dumping and improves perceived empathy.
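
In code, the gate is little more than a progress check. A minimal sketch, with an illustrative 65% cut-off picked from the middle of that range:

```python
ACTION_THRESHOLD = 0.65  # illustrative mid-point of the 60-70% range

def allowed_moves(completed_steps: int, total_steps: int) -> set[str]:
    """Gate concrete skills/advice behind sufficient plan progress."""
    progress = completed_steps / max(total_steps, 1)
    moves = {"validate", "reflect", "open_question", "psychoeducation"}
    if progress >= ACTION_THRESHOLD:
        # Only now may the overseer guidance include concrete actions.
        moves |= {"coping_skill", "action_suggestion"}
    return moves

assert "coping_skill" not in allowed_moves(2, 5)  # 40%: still exploring
assert "coping_skill" in allowed_moves(4, 5)      # 80%: actions unlocked
```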

How we measured progress — cycles, rubrics, and evidence

Evaluation covered quantitative and qualitative signals across internal sessions, a public demo, and external participants. Reliability and performance summaries are detailed in the technical report.

Prompt-design analysis (pre-evaluation)

We chose this quantifiable route so changes could be defended with data rather than handcrafted rules. Using baseline data comparisons (original therapy vs student responses) and an LLM-based scoring rubric evaluated with OpenAI's o4-mini model, we targeted the upper quartile for human-likeness and tone, then encoded the highest-impact micro-elements into the guidance.

Table 2: LLM Scoring Rubric Dimensions

| Dimension | Weight | Description |
| --- | --- | --- |
| Therapeutic Goals | 35% | Validation, insight building, coping-skill development, exploration, reframing, emotional regulation |
| Intervention Types | 25% | Reflective listening, open-ended questions, psychoeducation, empathetic responses, summarisation, homework assignments |
| Human-Likeness | 15% | Lexical variety, mild hedging, grounded references, natural rhythm, no boilerplate |
| Emotional Tone | 15% | Empathy, support, curiosity, warmth, neutrality |
| Response Structure | 10% | Validation→question, reflection→reframe, psychoeducation→application, empathy→coping, exploration-focused |
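
Since the weights sum to 100%, the composite is a plain weighted average. A minimal sketch, assuming each dimension is scored 0–100 by the o4-mini judge (the example scores below are made up):

```python
# Weighted composite for the Table 2 rubric.
RUBRIC_WEIGHTS = {
    "therapeutic_goals": 0.35,
    "intervention_types": 0.25,
    "human_likeness": 0.15,
    "emotional_tone": 0.15,
    "response_structure": 0.10,
}

def composite_score(scores: dict[str, float]) -> float:
    return sum(w * scores[dim] for dim, w in RUBRIC_WEIGHTS.items())

example = {
    "therapeutic_goals": 72, "intervention_types": 70,
    "human_likeness": 65, "emotional_tone": 80, "response_structure": 60,
}
print(f"{composite_score(example):.1f}%")  # ~70.5: inside the 65-78% upper-quartile band
```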

Table 3: Upper Quartile Response Characteristics (65–78% scores)

| Micro-Element | Implementation |
| --- | --- |
| Early validation | Response begins with acknowledgment of user's feelings |
| Reflective sentence | One sentence mirroring the user's key point (no tag-questions) |
| Open question | Exactly one open-ended question to encourage exploration |
| Conversational tone | Warm, natural language with varied sentence length and contractions |

Analysis revealed that responses scoring in the upper quartile (typically 65–78%) consistently featured these micro-elements.
See analysis assets in the repo’s Analysis_of_Test_Cycles.

Key finding: High-scoring responses consistently featured early validation, one reflective sentence, exactly one open question, gentle reframing, brief psychoeducation→application, timed coping/action suggestions, warm tone, and avoidance of bullet lists. These micro-elements were encoded directly into the PSY-OVERSEER-2 guidance template.
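
Several of these constraints are cheap to verify mechanically after generation. A minimal sketch of such a post-generation lint, with simple heuristics standing in for the real checks:

```python
import re

def lint_reply(reply: str) -> list[str]:
    """Heuristic checks for the encoded micro-elements."""
    problems = []
    if reply.count("?") != 1:
        problems.append("expected exactly one open question")
    if len(reply.split()) > 60:
        problems.append("over the ~60-word cap")
    if re.search(r"^\s*[-*\u2022]", reply, re.MULTILINE):
        problems.append("bullet lists are disallowed")
    return problems

print(lint_reply("That sounds exhausting, and it makes sense you feel "
                 "stretched thin. What part of the week feels heaviest?"))  # -> []
```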

Technical addendum (from cycle insights)

Therapy session feedback (evaluation notes)

Short therapy‑style test sessions provided qualitative checks on warmth, brevity and progression, plus safety behaviour under crisis language. We used these logs and reports to refine style guards, the action threshold and plan transitions. Session artefacts live in experiments/therapy_sessions and the framework notes in psychocounsel_testing_framework.md.

Model strengths & weaknesses

| Type | Detail |
| --- | --- |
| Strength | Multi-layer crisis safety (deterministic crisis copy + layered detection); warm, concise tone with structured micro-moves (validate → reflect → single open question → optional psychoeducation-to-application); clear plan stages; UON resource integration. |
| Strength | Neutral, jargon-free language aiding inclusivity; consistent structure reduces ad-hoc bias; explicit safety overrides. |
| Weakness | UK/UON-centric resources by default; needs broader localisation for non-UK contexts and international students (including crisis lines and services). |
| Weakness | Prompts lack explicit cultural-sensitivity cues; missing gentle checks for cultural/identity context when relevant. |
| Weakness | Occasional edge-case tone misreads (e.g., a "disrespectful" perception) and handling of abrupt topic shifts; plan-alignment smoothing needed. |

User feedback & next steps

Results (prototype stage)
