Architecture

The Co-mind.ai Private AI Platform routes requests to multiple AI backends through a unified gateway. This page explains how to choose the right endpoint for your use case.

API Decision Tree

Use this decision tree to determine which endpoint to use:
  +-------------------------------------+
  | Do you need knowledge base context? |
  +-----------------+-------------------+
                    |
          +---------+---------+
          |                   |
         NO                  YES
          |                   |
          v                   v
  +---------------+   +---------------------------------+
  | /v1/chat/     |   | Do you need server-managed      |
  | completions   |   | conversation history?           |
  |               |   +----------------+----------------+
  | Pure OpenAI   |                    |
  +---------------+          +---------+---------+
                             |                   |
                            NO                  YES
                             |                   |
                             v                   v
                 +---------------------+  +--------------+
                 | /v1/knowledgebase/  |  | /v1/chat/    |
                 | chat/completions    |  | sessions     |
                 |                     |  |              |
                 | KB + Stateless      |  | KB + Stateful|
                 +---------------------+  +--------------+
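The two questions in the tree above map directly to an endpoint path. A minimal sketch of that routing logic as a helper function (the function name and parameters are illustrative, not part of the platform API; only the three endpoint paths come from this page):

```python
def choose_endpoint(needs_kb, needs_server_history):
    """Map the two decision-tree questions to a gateway endpoint path."""
    if not needs_kb:
        # No knowledge base context: plain OpenAI-compatible chat.
        return "/v1/chat/completions"
    if needs_server_history:
        # KB context plus server-managed conversation history.
        return "/v1/chat/sessions"
    # KB context, but the client keeps its own history.
    return "/v1/knowledgebase/chat/completions"
```

For example, a stateless RAG client would call choose_endpoint(True, False) and get "/v1/knowledgebase/chat/completions".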

When to Use Each Endpoint

/v1/chat/completions

Best for: Standard AI chat without document context.
  • OpenAI-compatible drop-in replacement
  • Supports streaming, vision, and tool calling
  • You manage conversation history client-side
  • Lowest latency option
curl -X POST https://your-instance/v1/chat/completions \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tiiuae/Falcon3-7B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
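Because /v1/chat/completions is stateless, the client is responsible for carrying the conversation forward. A minimal sketch of that bookkeeping (the class and method names are illustrative; only the messages payload shape comes from the OpenAI-compatible API):

```python
class ChatHistory:
    """Accumulates the client-side message list sent with each request."""

    def __init__(self, system_prompt=None):
        self.messages = []
        if system_prompt:
            self.messages.append({"role": "system", "content": system_prompt})

    def user(self, content):
        # Record the user turn and return the full payload for the next call.
        self.messages.append({"role": "user", "content": content})
        return self.messages

    def assistant(self, content):
        # Record the model's reply so the next request carries full context.
        self.messages.append({"role": "assistant", "content": content})
```

Each request then sends history.user("...") as the "messages" field, and the reply's content is fed back through history.assistant(...) before the next turn.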

Backend Architecture

The platform supports multiple AI inference backends. Each backend provides different performance characteristics and model support.

Backend Providers

Backend    Engine                          Strengths
vLLM       High-performance GPU inference  Best throughput for large models; supports vision and tool calling
Ollama     Local model server              Easy setup; good for smaller models; supports embeddings
llama.cpp  CPU/GPU GGUF inference          Runs on CPU; minimal resource requirements

Capability Matrix

Capability        vLLM  Ollama  llama.cpp
Chat completions  Yes   Yes     Yes
Text completions  Yes   Yes     Yes
Embeddings        Yes   Yes     No
Streaming         Yes   Yes     Yes
Vision            Yes   No      No
Tool calling      Yes   No      No
Use GET /v1/capabilities to check the current capability matrix for your deployment, and GET /v1/backends/health to monitor backend status.
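A client can use the matrix to guard feature-specific requests (for example, only sending vision or tool-calling payloads to a backend that supports them). A sketch that hard-codes the table from this page; a real client would fetch the live matrix from GET /v1/capabilities instead, and the feature names here are illustrative:

```python
# Capability matrix as documented above; keys are backend names,
# values are the features each backend advertises.
CAPABILITIES = {
    "vllm":      {"chat", "completions", "embeddings", "streaming", "vision", "tools"},
    "ollama":    {"chat", "completions", "embeddings", "streaming"},
    "llama.cpp": {"chat", "completions", "streaming"},
}

def backends_supporting(feature):
    """Return the backends that advertise the given feature, sorted by name."""
    return sorted(b for b, feats in CAPABILITIES.items() if feature in feats)
```

With this table, backends_supporting("vision") narrows the choice to vLLM, matching the matrix above.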

Discovery Endpoints

Endpoint                 Purpose
GET /v1/models           List all available models across all backends
GET /v1/backends         List backend providers with their supported features
GET /v1/capabilities     Full capability matrix (backend → features → models)
GET /v1/backends/health  Real-time health and latency for each backend
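A monitoring script might poll GET /v1/backends/health and flag degraded backends. A sketch of the parsing side; the response shape shown here is an assumption for illustration only, not the documented schema:

```python
import json

# Hypothetical example payload; consult your deployment for the real schema.
SAMPLE_HEALTH = json.dumps({
    "backends": [
        {"name": "vllm", "healthy": True, "latency_ms": 42},
        {"name": "ollama", "healthy": False, "latency_ms": None},
    ]
})

def unhealthy_backends(raw):
    """Return the names of backends reporting an unhealthy status."""
    data = json.loads(raw)
    return [b["name"] for b in data["backends"] if not b["healthy"]]
```

Pairing this with the discovery endpoints above lets a client skip routing requests to a backend that is currently down.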

Next Steps