Architecture
The Co-mind.ai Private AI Platform routes requests to multiple AI backends through a unified gateway. This page explains how to choose the right endpoint for your use case.
API Decision Tree
Use this decision tree to determine which endpoint to use:
+-------------------------------------+
| Do you need knowledge base context? |
+-----------------+-------------------+
                  |
        +---------+---------+
        |                   |
       NO                  YES
        |                   |
        v                   v
+---------------+   +---------------------------------+
| /v1/chat/     |   | Do you need server-managed      |
| completions   |   | conversation history?           |
|               |   +----------------+----------------+
| Pure OpenAI   |                    |
+---------------+          +---------+---------+
                           |                   |
                          NO                  YES
                           |                   |
                           v                   v
               +---------------------+   +---------------+
               | /v1/knowledgebase/  |   | /v1/chat/     |
               | chat/completions    |   | sessions      |
               |                     |   |               |
               | KB + Stateless      |   | KB + Stateful |
               +---------------------+   +---------------+
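The decision tree above can be expressed as a small routing helper. This is an illustrative sketch only; the function name and boolean flags are our own, not part of the platform API:

```python
def choose_endpoint(needs_kb_context: bool, needs_server_history: bool) -> str:
    """Map the two decision-tree questions to a gateway endpoint path."""
    if not needs_kb_context:
        # No document retrieval needed: plain OpenAI-compatible chat.
        return "/v1/chat/completions"
    if needs_server_history:
        # RAG plus server-managed conversation state.
        return "/v1/chat/sessions"
    # RAG, but the client replays conversation history itself.
    return "/v1/knowledgebase/chat/completions"
```

Note that the server-history question only arises once knowledge base context is needed; without it, client-managed history over /v1/chat/completions is the only option.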
When to Use Each Endpoint
/v1/chat/completions
Best for: Standard AI chat without document context.
- OpenAI-compatible drop-in replacement
- Supports streaming, vision, and tool calling
- You manage conversation history client-side
- Lowest latency option
curl -X POST https://your-instance/v1/chat/completions \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "tiiuae/Falcon3-7B-Instruct",
"messages": [{"role": "user", "content": "Hello"}]
}'
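The same request can be issued from Python with only the standard library. A minimal sketch, assuming the placeholder host and token from the curl example above; the helper name is our own:

```python
import json
import urllib.request

def build_chat_request(base_url: str, token: str, model: str,
                       messages: list) -> urllib.request.Request:
    """Construct the POST request for /v1/chat/completions."""
    payload = {"model": model, "messages": messages}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_chat_request(
    "https://your-instance", "YOUR_TOKEN",
    "tiiuae/Falcon3-7B-Instruct",
    [{"role": "user", "content": "Hello"}],
)
# Sending it is deployment-specific:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

Because the endpoint is OpenAI-compatible, existing OpenAI client libraries pointed at your instance's base URL should also work unchanged.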
/v1/knowledgebase/chat/completions
Best for: Question answering over your documents (RAG).
- Retrieves relevant document chunks before generating a response
- Returns source citations with relevance scores
- You manage conversation history client-side
- Supports vision + RAG for image analysis with document context
curl -X POST https://your-instance/v1/knowledgebase/chat/completions \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-32B-Instruct",
"messages": [{"role": "user", "content": "Summarize the contract"}],
"knowledgebase_ids": ["kb_abc123"]
}'
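A knowledge base request is the same chat payload plus a knowledgebase_ids array. A sketch of the body construction, with field names taken from the curl example above (the helper itself is illustrative):

```python
import json

def build_kb_chat_payload(model: str, messages: list,
                          knowledgebase_ids: list) -> str:
    """Serialize the body for POST /v1/knowledgebase/chat/completions."""
    return json.dumps({
        "model": model,
        "messages": messages,
        # One or more kb_* identifiers to retrieve chunks from.
        "knowledgebase_ids": knowledgebase_ids,
    })

body = build_kb_chat_payload(
    "Qwen/Qwen2.5-32B-Instruct",
    [{"role": "user", "content": "Summarize the contract"}],
    ["kb_abc123"],
)
```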
/v1/chat/sessions
Best for: Multi-turn conversations with document context and server-managed history.
- Server stores and manages conversation history
- Linked to knowledge bases at session creation
- Just send new messages — no need to replay history
- Ideal for chat UIs and interactive assistants
# Create session
curl -X POST https://your-instance/v1/chat/sessions \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name": "Contract Review",
"knowledgebase_ids": ["kb_abc123"],
"model": "falcon3:3b"
}'
# Send message (history managed by server)
curl -X POST https://your-instance/v1/chat/sessions/SESSION_ID/messages \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{"content": "What are the key terms?"}'
Backend Architecture
The platform supports multiple AI inference backends. Each backend provides different performance characteristics and model support.
Backend Providers
| Backend | Description | Strengths |
|---|---|---|
| vLLM | High-performance GPU inference | Best throughput for large models, supports vision and tool calling |
| Ollama | Local model server | Easy setup, good for smaller models, supports embeddings |
| llama.cpp | CPU/GPU GGUF inference | Runs on CPU, minimal resource requirements |
Capability Matrix
| Capability | vLLM | Ollama | llama.cpp |
|---|---|---|---|
| Chat completions | Yes | Yes | Yes |
| Text completions | Yes | Yes | Yes |
| Embeddings | Yes | Yes | No |
| Streaming | Yes | Yes | Yes |
| Vision | Yes | No | No |
| Tool calling | Yes | No | No |
Use GET /v1/capabilities to check the current capability matrix for your deployment, and GET /v1/backends/health to monitor backend status.
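Before sending a vision or tool-calling request, a client can consult GET /v1/capabilities and route accordingly. The response shape below is a hypothetical illustration mirroring the matrix above (the real payload may differ); the lookup logic is the point:

```python
# Hypothetical /v1/capabilities response shape: backend -> list of features.
capabilities = {
    "vllm": ["chat", "completions", "embeddings", "streaming", "vision", "tools"],
    "ollama": ["chat", "completions", "embeddings", "streaming"],
    "llama.cpp": ["chat", "completions", "streaming"],
}

def backends_supporting(feature: str, caps: dict) -> list:
    """Return the backends that advertise a given feature, sorted by name."""
    return sorted(b for b, feats in caps.items() if feature in feats)
```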
Discovery Endpoints
| Endpoint | Purpose |
|---|---|
| GET /v1/models | List all available models across all backends |
| GET /v1/backends | List backend providers with their supported features |
| GET /v1/capabilities | Full capability matrix (backend → features → models) |
| GET /v1/backends/health | Real-time health and latency for each backend |
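As one illustration, GET /v1/backends/health can drive a simple fallback policy. The response fields used here (name, status, latency_ms) are assumptions for this sketch, not documented output:

```python
def pick_backend(health: list) -> str:
    """Pick the lowest-latency healthy backend, or raise if none is up."""
    healthy = [b for b in health if b.get("status") == "healthy"]
    if not healthy:
        raise RuntimeError("no healthy backend available")
    return min(healthy, key=lambda b: b["latency_ms"])["name"]
```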
Next Steps