
AI Helper: platform service specification

Status: draft v1.0
Updated: 2026-05-27
Related: apps-gateway.md, per-app-container.md, ai-chat.md, ai-rag.md

1. Goal

AI Helper is a platform service that:

  1. Orchestrates AI workers: LM Studio, ComfyUI, Whisper and other AI servers. Vendor apps and external clients call AI without knowing anything about the underlying hardware.
  2. Manages the queue and VRAM: interactive chat requests jump ahead of background indexing, and a worker with 24 GB of VRAM never serves 5 models at once.
  3. Distributes workers across ownership tiers: the AVAXIS pool (shared, production), vendor-private workers (development) and client on-premise workers.
  4. Hosts RAG corpora: 3 types (global, shared, private) with ACLs. The legal corpus, for example, is available across apps.
  5. Centralizes auth and quota: user JWT (from the launcher), vendor M2M (cron, batch) and API key (external clients).

What AI Helper is NOT (by design)

  • An end-user UI application. AI Pomocník (app-ai-helper 0.x) has a 9-tab UI that is a layer on top of the service. The service itself is headless, and the UI client talks to it the same way any vendor app does.
  • A model trainer or fine-tuner. Models are pre-trained and hosted on workers. Fine-tuning is out of scope (a PEFT/LoRA pipeline is possible in 3.0+).
  • An input-data validator. Each vendor app validates its own data. AI Helper accepts whatever it receives, subject to a rate limit and a max-tokens cap.
  • A replacement for core-api auth. AI Helper validates JWTs against the core-api JWKS but does not expose its own login.

2. Architecture

                              External clients (SaaS partners)
                                       │ X-API-Key
                              ┌────────────────────┐
                              │  vm-gateway nginx  │  TLS + rate limit
                              │  api.avaxis.cz     │
                              └─────────┬──────────┘
                                        │ /v1/ai/*
                  ┌─────────────────────┼─────────────────────┐
                  │                     │                     │
  Vendor apps (legal, mzdy,             │     Launcher2 (user) ── IPC token ──┐
  finance, …) - internal              │                                     │
        │ Authorization: Bearer JWT     │                                     │
        ▼                               ▼                                     ▼
                              ┌──────────────────────┐
                              │   apps-gateway       │  cache, circuit breaker
                              │   :8100              │
                              └──────────┬───────────┘
                                         │ /apps/app-ai-helper/v1/*
                  ┌──────────────────────────────────────────────────┐
                  │              AI Helper service                    │
                  │   (FastAPI + Celery + Redis + pgvector + S3)     │
                  │  ┌──────────┐  ┌──────────┐  ┌──────────────┐   │
                  │  │  API     │  │ Scheduler│  │ Capability   │   │
                  │  │  routers │  │ (Celery) │  │ registry     │   │
                  │  └──────────┘  └────┬─────┘  └──────────────┘   │
                  │                     │                            │
                  │  ┌──────────┐  ┌────▼────┐  ┌──────────────┐   │
                  │  │ RAG      │  │ Worker  │  │ Usage logger │   │
                  │  │ (pgvec)  │  │ probe   │  │ (Postgres)   │   │
                  │  └──────────┘  └────┬────┘  └──────────────┘   │
                  └─────────────────────┼────────────────────────────┘
                ┌───────────────────────┼─────────────────────────┐
                ▼                       ▼                         ▼
         ┌────────────┐         ┌────────────┐           ┌────────────┐
         │ Platform   │         │ Vendor     │           │ Company    │
         │ workers    │         │ private    │           │ private    │
         │            │         │            │           │            │
         │ LM Studio  │         │ Michal-PC  │           │ Klient GPU │
         │ ComfyUI    │         │ LM Studio  │           │ on-premise │
         │ Whisper    │         │ (dev only) │           │ (rare)     │
         └────────────┘         └────────────┘           └────────────┘

Components

| Component | Purpose | Tech |
|---|---|---|
| API routers | REST + SSE endpoints under /v1/* | FastAPI |
| Scheduler | Routes requests to a suitable worker (capability + load + VRAM) | Celery + Redis priority queues |
| Worker probe | Periodic worker health check, model_loaded detection | Background tasks |
| Capability registry | Dynamic registry of typed + extensible capabilities | Postgres ai_helper.capability table + cache |
| RAG service | pgvector ANN search, document indexer (Celery task) | pgvector extension, BGE embedding via a worker |
| Usage logger | Per-request token usage + latency + cost | Postgres ai_helper.usage_log table |

3. Federated workers

A worker is a registered AI server instance (LM Studio, ComfyUI, …) with declared capabilities. Ownership is split into scope tiers:

Scope tiers

| Scope | Owner | Visibility | Example |
|---|---|---|---|
| platform | AVAXIS | All companies, all apps | gerry-prod-1 (24 GB LM Studio) |
| vendor | Vendor (company with is_vendor=true) | The vendor's apps, across companies | michal-laptop-lm (dev only) |
| company | Client company | Only apps this company subscribes to | klient-XYZ-on-prem (data residency) |

Worker schema

CREATE TABLE ai_helper.worker (
  id            UUID PRIMARY KEY,
  name          VARCHAR(100) NOT NULL,   -- 'gerry', 'michal-laptop-lm'
  kind          VARCHAR(50)  NOT NULL,   -- 'lmstudio', 'comfyui', 'whisper', 'generic'

  -- Routing
  transport_class VARCHAR(20) NOT NULL,  -- 'datacenter' (direct-dial) | 'agent' (pull-poll)
  base_url      VARCHAR(500),            -- for datacenter: HTTP URL
  agent_session VARCHAR(100),            -- for agent: poll session ID

  -- Ownership
  scope            VARCHAR(20) NOT NULL, -- 'platform' | 'vendor' | 'company'
  owner_vendor_id  UUID,                 -- FK companies.id (only for scope='vendor')
  owner_company_id UUID,                 -- FK companies.id (only for scope='company')

  -- Environment (where the worker is eligible)
  environment      VARCHAR(20) NOT NULL DEFAULT 'production',
                                         -- 'dev' | 'staging' | 'production'

  -- Hardware
  vram_total_mb    INT,
  vram_used_mb     INT,                  -- updated by the scheduler/probe

  -- Capabilities: extensible JSON, see §5
  capabilities     JSONB NOT NULL DEFAULT '{}',

  -- Lifecycle
  status           VARCHAR(20) NOT NULL DEFAULT 'unknown',
                                         -- 'unknown' | 'online' | 'busy' | 'offline' | 'error'
  tunnel_status    VARCHAR(20),          -- agent only: 'connected'|'disconnected'
  last_health_check_at TIMESTAMPTZ,
  last_health_error    TEXT,

  created_by_user_id UUID,
  created_at         TIMESTAMPTZ DEFAULT NOW(),
  updated_at         TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_worker_scope ON ai_helper.worker (scope, status, environment);
CREATE INDEX idx_worker_caps ON ai_helper.worker USING GIN (capabilities);

Routing rules

For a request (app_slug, company_id, environment, capability):

  1. Candidates: workers WHERE status = 'online' AND tier(environment) >= tier(request.environment) AND capabilities ? capability AND one of:
       • scope = 'platform' (always eligible)
       • scope = 'vendor' AND owner_vendor_id = <app vendor> (the vendor's own workers)
       • scope = 'company' AND owner_company_id = company_id (client on-premise)
  2. VRAM filter: vram_total_mb - vram_used_mb >= request.vram_estimate
  3. Sort: ascending by (current_queue_depth, last_used_at), so the least loaded worker comes first
  4. Pick: the first match. If there is none, return 503 no_capable_worker_available.

Here tier() orders the environments dev < staging < production. Comparing the VARCHAR values directly would sort them alphabetically ('production' before 'staging') and would not express the escalation rule.
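The routing rules can be sketched in memory as follows. This is a minimal illustration, not the service's implementation: the Worker dataclass, the ENV_RANK table, the pick_worker name and the queue_depth / last_used_at fields are assumptions introduced here, and the real scheduler would run the equivalent query against ai_helper.worker.

```python
from dataclasses import dataclass
from typing import Optional

# Environment tiers ordered from least to most trusted (assumed ranking).
ENV_RANK = {"dev": 0, "staging": 1, "production": 2}


@dataclass
class Worker:
    name: str
    scope: str                      # 'platform' | 'vendor' | 'company'
    environment: str
    capabilities: dict
    vram_total_mb: int
    vram_used_mb: int
    status: str = "online"
    owner_vendor_id: Optional[str] = None
    owner_company_id: Optional[str] = None
    queue_depth: int = 0            # hypothetical: supplied by the scheduler
    last_used_at: float = 0.0       # hypothetical: epoch seconds


def pick_worker(workers, *, capability, environment, app_vendor_id,
                company_id, vram_estimate_mb):
    """Return the least-loaded eligible worker, or None (the caller maps None to 503)."""
    def visible(w):
        return (
            w.scope == "platform"
            or (w.scope == "vendor" and w.owner_vendor_id == app_vendor_id)
            or (w.scope == "company" and w.owner_company_id == company_id)
        )

    candidates = [
        w for w in workers
        if w.status == "online"
        and ENV_RANK[w.environment] >= ENV_RANK[environment]
        and capability in w.capabilities
        and visible(w)
        and w.vram_total_mb - w.vram_used_mb >= vram_estimate_mb
    ]
    candidates.sort(key=lambda w: (w.queue_depth, w.last_used_at))
    return candidates[0] if candidates else None


workers = [
    Worker("gerry-prod-1", "platform", "production", {"chat": {}}, 24576, 16384, queue_depth=3),
    Worker("michal-laptop-lm", "vendor", "dev", {"chat": {}}, 8192, 0, owner_vendor_id="vendor-1"),
]
dev_pick = pick_worker(workers, capability="chat", environment="dev",
                       app_vendor_id="vendor-1", company_id="c-1", vram_estimate_mb=4096)
prod_pick = pick_worker(workers, capability="chat", environment="production",
                        app_vendor_id="vendor-1", company_id="c-1", vram_estimate_mb=4096)
print(dev_pick.name, prod_pick.name)   # michal-laptop-lm gerry-prod-1
```

The dev request lands on the idle vendor worker, while the production request can only ever see the production worker.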

Environment escalation

environment is the worker's tier, and a request's environment sets the minimum tier it will accept. A request with environment=production is served only by production workers. A request with environment=dev accepts dev, staging or production workers (escalation upwards).

In other words:

  • Dev requests may use production workers (when the production server has spare capacity).
  • Production requests NEVER use a dev worker (data security).

Worker onboarding

| Action | Who |
|---|---|
| Register platform worker | super_admin via the admin UI |
| Register vendor worker | vendor (via the vendor portal; scope=vendor is forced to the vendor's own id) |
| Register company worker | super_admin (the client requests registration through a service ticket; manual) |
| Update capabilities | owner (the vendor for vendor scope, super_admin for platform) |
| Disable / delete | owner |

4. Queue + scheduler

Priorities

realtime    │ Live voice-call transcription; must answer in <2 s or fail
            │ Default: NO queue. If no worker is free, return 503 immediately
            │ Allowed only for core platform features (voice/video call)
interactive │ A user is waiting at the screen (chat request, image preview)
            │ Soft target: <5 s to start of response (first SSE token)
            │ Default for: /chat/stream, /chat/complete, /image/generate with preview=true
background  │ An app backend calls AI with no user waiting (RAG indexing, batch embed,
            │ document understanding)
            │ Soft target: <30 s
            │ Default for: /jobs/*, /embed/batch
scheduled   │ Cron, nightly reindex, periodic recomputation
            │ Soft target: hours are acceptable
            │ Default for: nothing directly from the API, only vendor cron via M2M token

Scheduler behaviour

  • No preemption. LLM inference is treated as atomic: a generation that is already running token by token is not interrupted midway. Instead, the scheduler does not hand a new request to a worker while a higher-priority request is waiting in the queue.
  • Per-worker queue + global pending pool. When a worker finishes a task, the scheduler assigns it the highest-priority pending request whose kind/capabilities match.
  • Starvation prevention. Background tasks that have waited >10 min are promoted by one priority level (to interactive). Scheduled tasks that have waited >1 h are promoted as well.
  • VRAM-aware. If a worker has qwen-27b (16 GB) loaded in LM Studio and a request arrives for a BGE-m3 embed (2 GB), the scheduler does not send it to that worker, because a model switch costs 5-30 s. It looks for a worker that already has BGE loaded, or the task waits.
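The starvation-prevention rule can be expressed as a small priority-aging function. The sketch below assumes that a task is promoted by exactly one level, once, and that ties are broken in favour of the longest-waiting task; the names PRIORITIES, PROMOTE_AFTER_S, effective_priority and next_task are illustrative only.

```python
PRIORITIES = ["realtime", "interactive", "background", "scheduled"]  # highest first

# Seconds a task may wait at a given level before it is promoted one level.
PROMOTE_AFTER_S = {"background": 10 * 60, "scheduled": 60 * 60}


def effective_priority(priority: str, waited_s: float) -> str:
    """Return the priority after starvation-prevention promotion.

    Assumption: a task is promoted by exactly one level, once.
    """
    threshold = PROMOTE_AFTER_S.get(priority)
    if threshold is not None and waited_s > threshold:
        return PRIORITIES[PRIORITIES.index(priority) - 1]
    return priority


def next_task(pending):
    """Pick the highest effective priority; ties go to the longest-waiting task.

    `pending` is a list of (task_id, priority, waited_s) tuples.
    """
    return min(
        pending,
        key=lambda t: (PRIORITIES.index(effective_priority(t[1], t[2])), -t[2]),
    )


pending = [
    ("chat-1", "interactive", 2.0),
    ("index-7", "background", 11 * 60),   # waited 11 min, promoted to interactive
    ("cron-3", "scheduled", 30 * 60),     # waited 30 min, not promoted yet
]
print(next_task(pending)[0])   # index-7
```

Because the promoted background task has waited longer than the fresh chat request, it is served first once both sit at the interactive level.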

Model switching

Workers declare a models_loaded JSON list (e.g. ["qwen-27b"]). The scheduler:

  1. Prefers workers where the requested model is already loaded (sticky).
  2. If there is none, picks a worker with free VRAM ≥ model size + buffer (a rough estimate).
  3. Sends an LM Studio API call POST /v1/models/load with the target model (supported by LM Studio).
  4. The worker probe updates models_loaded after a successful load.

If no candidate has free VRAM and none has the model already loaded, the request waits in the queue.
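A minimal sketch of this selection order, under stated assumptions: workers are plain dicts, the buffer is a fixed VRAM_BUFFER_MB value (the spec does not give a number), and choose_for_model is a hypothetical helper name.

```python
from typing import Optional

VRAM_BUFFER_MB = 1024  # assumption: the spec only says "model size + buffer"


def choose_for_model(workers: list, model: str, model_size_mb: int) -> Optional[tuple]:
    """Return (worker, needs_load), or None when the request must wait in the queue."""
    # 1. Sticky: a worker that already has the model loaded needs no switch.
    for w in workers:
        if model in w["models_loaded"]:
            return w, False
    # 2. Otherwise the first worker with enough free VRAM for the model plus buffer.
    for w in workers:
        free_mb = w["vram_total_mb"] - w["vram_used_mb"]
        if free_mb >= model_size_mb + VRAM_BUFFER_MB:
            return w, True   # the caller then triggers the model load on this worker
    return None


workers = [
    {"name": "gerry", "models_loaded": ["qwen-27b"], "vram_total_mb": 24576, "vram_used_mb": 16384},
    {"name": "embed-1", "models_loaded": ["bge-m3"], "vram_total_mb": 8192, "vram_used_mb": 2048},
]
sticky = choose_for_model(workers, "bge-m3", 2048)
cold = choose_for_model(workers, "whisper-large", 3000)
too_big = choose_for_model(workers, "llama-3.3-70b", 40000)
print(sticky[0]["name"], sticky[1], cold[0]["name"], cold[1], too_big)
```

The first call reuses the worker that already holds bge-m3, the second needs a load on the worker with spare VRAM, and the third returns None, which corresponds to the request waiting in the queue.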

Capacity overflow → external fallback

If the queue depth exceeds N (configurable), AI Helper can fall back, on an opt-in basis, to external API providers:

  • OpenAI API (if the company has openai_api_key set in env)
  • Anthropic API
  • Self-hosted models

Cost is then tracked per request (at the external API's rates). The fallback is disabled by default; each company must opt in explicitly.

5. Capabilities: typed core + extensible

A capability is a named function that a worker can perform. It is not a hard-coded enum but a plugin model.

Core capabilities (always supported)

| Capability | Pattern | Worker kinds |
|---|---|---|
| chat | SSE stream | lmstudio, ollama, generic_llm |
| chat.complete | sync (full response) | same as chat |
| embed | sync | lmstudio, bge_server, sentence_transformers |
| embed.batch | sync (batch input) | same as embed |
| image.generate | async job | comfyui, automatic1111 |
| image.understand | sync (vision LLM) | lmstudio (Llama Vision), generic_vlm |
| speech.to_text | sync or SSE | whisper_server, ollama_whisper |
| text.to_speech | sync or SSE | piper, xtts, openai_tts_compatible |
| video.generate | async job (long) | comfyui (AnimateDiff), wan2 |

Extensible namespaces

text_understanding.* covers NLP tasks (typically implemented by an LLM with a system-prompt template, though dedicated NLP workers are possible):

  • text_understanding.ner: Named Entity Recognition (PERSON, ORG, DATE, MONEY, …)
  • text_understanding.classify: document classification
  • text_understanding.extract: structured extraction (key-value pairs from PDFs)
  • text_understanding.summarize: document summarization
  • text_understanding.translate: CZ ↔ EN (and other languages)
  • text_understanding.sentiment: sentiment analysis

audio_understanding.*:

  • audio_understanding.diarize: speaker separation
  • audio_understanding.intent: voice command classification

custom.<namespace>.<task> is vendor-specific:

  • custom.legal.contract_extract: contract parser specific to the legal vendor
  • custom.mzdy.payslip_ocr: payslip extractor for the mzdy app

Capability declaration on a worker

{
  "chat": {
    "models": ["qwen3.6-27b", "llama-3.3-70b"],
    "max_context": 32768,
    "supports_streaming": true
  },
  "embed": {
    "models": ["bge-m3"],
    "dim": 1024,
    "max_batch_size": 64
  },
  "text_understanding.ner": {
    "languages": ["cs", "en"],
    "entities": ["PERSON", "ORG", "DATE", "MONEY", "LAW_REFERENCE"]
  }
}

Capability registry

CREATE TABLE ai_helper.capability (
  id           VARCHAR(100) PRIMARY KEY,  -- 'chat', 'text_understanding.ner'
  category     VARCHAR(50)  NOT NULL,     -- 'core' | 'text_understanding' | 'custom'
  schema_json  JSONB,                     -- JSON schema pro input/output
  added_by     VARCHAR(100),
  added_at     TIMESTAMPTZ DEFAULT NOW()
);

A super_admin (or vendor admin) can add a new capability ID; workers can then start declaring it.
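When a new capability ID is registered, its category can be derived from the ID itself. The sketch below is one possible validation, not the registry's actual logic: the segment pattern (lowercase letters, digits, underscores) and the capability_category name are assumptions, and the CORE set simply mirrors the core capability list above.

```python
import re

CORE = {"chat", "chat.complete", "embed", "embed.batch", "image.generate",
        "image.understand", "speech.to_text", "text.to_speech", "video.generate"}

_SEGMENT = r"[a-z][a-z0-9_]*"   # assumed naming rule for one dotted segment
_CUSTOM = re.compile(rf"custom\.{_SEGMENT}\.{_SEGMENT}")
_NAMESPACED = re.compile(rf"(text_understanding|audio_understanding)\.{_SEGMENT}")


def capability_category(capability_id: str) -> str:
    """Map a capability ID to its registry category, or raise ValueError."""
    if capability_id in CORE:
        return "core"
    if _CUSTOM.fullmatch(capability_id):
        return "custom"
    match = _NAMESPACED.fullmatch(capability_id)
    if match:
        return match.group(1)
    raise ValueError(f"unknown capability namespace: {capability_id!r}")


print(capability_category("chat"))
print(capability_category("text_understanding.ner"))
print(capability_category("custom.legal.contract_extract"))
```

Rejecting IDs such as custom.legal (missing the task segment) at registration time keeps the custom.<namespace>.<task> convention enforceable.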

Generic SDK call

# Typed core
async for tok in app.ai.chat(prompt="..."): ...
vec = await app.ai.embed("text")

# Generic (any capability)
result = await app.ai.call(
    "text_understanding.ner",
    payload={"text": "Jan Novák uzavřel smlouvu s ABC s.r.o. dne 12.5.2024"},
    timeout=10.0
)
# → {"entities": [
#     {"type":"PERSON", "value":"Jan Novák", "span":[0,9]},
#     {"type":"ORG",    "value":"ABC s.r.o.","span":[28,38]},
#     {"type":"DATE",   "value":"2024-05-12","span":[43,52]}
#   ]}

6. RAG corpus

Three corpus types

| Type | Owner | Read access | Write access |
|---|---|---|---|
| global | Platform (AVAXIS) | All apps, automatically | super_admin only |
| shared | A specific app, per company | Owner + explicit grants | Owner app only |
| private | App + company | Owner only | Owner only |

Use cases

global examples:

  • laws-cr: laws of the Czech Republic (collection of laws, decrees), curated by the AVAXIS legal team
  • tax-forms: tax forms (templates)
  • accounting-standards: IFRS / Czech accounting standards

shared examples:

  • Owner legal has the corpus legal-precedents, with read access for mzdy and finance (legal advice across apps)
  • Owner mzdy has the corpus mzdy-tariff-tables, with read access for finance (consistent cost calculation)

private examples:

  • mzdy-payroll-history-<company_id>: payroll history per client company, isolated
  • legal-client-cases-<company_id>: the client's cases, also isolated

Schema

CREATE TABLE ai_helper.rag_corpus (
  id                UUID PRIMARY KEY,
  slug              VARCHAR(100) UNIQUE NOT NULL,
  name              VARCHAR(200) NOT NULL,
  type              VARCHAR(20) NOT NULL,   -- 'global' | 'shared' | 'private'

  owner_app_id      UUID,                   -- FK apps.id (NULL for global)
  owner_company_id  UUID,                   -- FK companies.id (NULL for global)

  embedding_dim     INT NOT NULL,           -- 768, 1024, 1536, ...
  embedding_model   VARCHAR(100) NOT NULL,  -- 'bge-m3'
  vector_provider   VARCHAR(50) NOT NULL DEFAULT 'pgvector',

  description       TEXT,
  document_count    INT DEFAULT 0,
  chunk_count       INT DEFAULT 0,
  total_tokens      BIGINT DEFAULT 0,

  created_by_user_id UUID,
  created_at        TIMESTAMPTZ DEFAULT NOW(),
  updated_at        TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE ai_helper.rag_corpus_grant (
  corpus_id        UUID NOT NULL REFERENCES ai_helper.rag_corpus(id) ON DELETE CASCADE,
  reader_app_id    UUID NOT NULL REFERENCES apps(id) ON DELETE CASCADE,
  granted_by_user_id UUID NOT NULL,
  granted_at       TIMESTAMPTZ DEFAULT NOW(),
  PRIMARY KEY (corpus_id, reader_app_id)
);

CREATE TABLE ai_helper.rag_document (
  id              UUID PRIMARY KEY,
  corpus_id       UUID NOT NULL REFERENCES ai_helper.rag_corpus(id),
  source_url      TEXT,                    -- s3://... or http://...
  title           VARCHAR(500),
  content_hash    VARCHAR(64),             -- sha256
  metadata        JSONB DEFAULT '{}',
  status          VARCHAR(20) DEFAULT 'pending',
                                           -- 'pending' | 'indexing' | 'ready' | 'failed'
  indexed_at      TIMESTAMPTZ,
  created_at      TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE ai_helper.rag_chunk (
  id              UUID PRIMARY KEY,
  document_id     UUID NOT NULL REFERENCES ai_helper.rag_document(id) ON DELETE CASCADE,
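  corpus_id       UUID NOT NULL,           -- denormalized for filtering
  chunk_index     INT NOT NULL,
  content         TEXT NOT NULL,
  embedding       vector(1024),            -- pgvector; the column is fixed at 1024 dims, so
                                           -- rag_corpus.embedding_dim must be 1024 until
                                           -- mixed dimensions land (roadmap 3.0)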
  metadata        JSONB DEFAULT '{}',
  token_count     INT,
  created_at      TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_chunk_corpus ON ai_helper.rag_chunk (corpus_id);
CREATE INDEX idx_chunk_embedding_hnsw ON ai_helper.rag_chunk USING hnsw (embedding vector_cosine_ops);

Query flow

# SDK call from the mzdy app
chunks = await app.ai.rag.query(
    "výpočet daně z příjmů zaměstnance",
    top_k=5,
    company_id=app.company_id  # taken automatically from the JWT
)

AI Helper backend:

  1. Resolve the accessible corpora for (caller_app=mzdy, company=app.company_id):
    SELECT id FROM rag_corpus WHERE
      -- Global: always
      (type = 'global')
      -- Shared: the owner, or any app holding a grant
      OR (type = 'shared' AND (
          (owner_app_id = $caller_app_id AND owner_company_id = $company_id)
          OR id IN (
            SELECT corpus_id FROM rag_corpus_grant WHERE reader_app_id = $caller_app_id
          )))
      -- Private: own corpora only (per company)
      OR (type = 'private' AND owner_app_id = $caller_app_id
          AND owner_company_id = $company_id)

  2. Embed the query text through the embed capability (sticky model = BGE-m3)
  3. ANN search:
    SELECT c.id, c.content, c.metadata, corpus.name AS source_corpus,
           c.embedding <=> $query_vec AS distance
      FROM rag_chunk c
      JOIN rag_corpus corpus ON corpus.id = c.corpus_id
     WHERE c.corpus_id = ANY($accessible_corpora)
     ORDER BY distance ASC
     LIMIT $top_k;

  4. Return the chunks with source_corpus metadata (so the caller knows where each one came from)
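The corpus-resolution step can be mirrored in memory, which is handy for unit-testing the ACL rules. This sketch follows the access table in this section (global for everyone, shared for the owner plus granted apps, private for the owner only); the Corpus dataclass and the accessible_corpora function are hypothetical names, and grants are modelled as a set of (corpus_slug, reader_app) pairs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Corpus:
    slug: str
    type: str                         # 'global' | 'shared' | 'private'
    owner_app: Optional[str] = None
    owner_company: Optional[str] = None


def accessible_corpora(corpora, grants, caller_app, company_id):
    """Return the slugs the caller may read. `grants` is a set of (corpus_slug, reader_app)."""
    allowed = []
    for c in corpora:
        owns = c.owner_app == caller_app and c.owner_company == company_id
        if c.type == "global":
            allowed.append(c.slug)
        elif c.type == "shared" and (owns or (c.slug, caller_app) in grants):
            allowed.append(c.slug)
        elif c.type == "private" and owns:
            allowed.append(c.slug)
    return allowed


corpora = [
    Corpus("laws-cr", "global"),
    Corpus("legal-precedents", "shared", "legal", "c1"),
    Corpus("mzdy-payroll-history-c1", "private", "mzdy", "c1"),
    Corpus("legal-client-cases-c1", "private", "legal", "c1"),
]
grants = {("legal-precedents", "mzdy")}
print(accessible_corpora(corpora, grants, "mzdy", "c1"))
print(accessible_corpora(corpora, grants, "finance", "c1"))
```

With the single grant above, mzdy sees the global corpus, the shared legal corpus and its own private corpus, while finance sees only the global one.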

Cross-corpus query

# Explicit corpora list (skips the default corpus discovery)
chunks = await app.ai.rag.query(
    "výpočet bonusu po skončení smlouvy",
    corpora=["legal-precedents", "mzdy-payroll-history"]
)

The ACL check still applies: if the caller has no grant on legal-precedents, the request fails with 403.

Document indexing

job = await app.ai.rag.index_document(
    corpus="mzdy-payroll-history",
    source="s3://app-mzdy/exports/payroll-2024.pdf",  # or content_b64
    metadata={"klient": "XYZ s.r.o.", "rok": 2024}
)
# job.id, job.status
await job.wait(timeout=300)

The indexing runs as an async Celery task:

  1. Download the document (S3, HTTP)
  2. Extract the text (PDF parser, OCR if needed)
  3. Chunk it (1000 tokens with a 200-token overlap by default)
  4. Send the chunks to a worker with the embed.batch capability (sticky BGE-m3)
  5. Insert them into the rag_chunk table
  6. Update rag_document.status = 'ready' and increment corpus.document_count
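The chunking step (1000 tokens with a 200-token overlap) can be sketched as a sliding window. The function name chunk_tokens is illustrative, and the token list here is synthetic; a real indexer would use the embedding model's tokenizer, which this sketch does not attempt.

```python
def chunk_tokens(tokens: list, size: int = 1000, overlap: int = 200) -> list:
    """Split `tokens` into windows of `size` that share `overlap` tokens with the previous window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


tokens = [f"t{i}" for i in range(2500)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])   # [1000, 1000, 900]
```

Each window starts 800 tokens after the previous one, so consecutive chunks repeat 200 tokens of context and no text is lost at chunk boundaries.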

7. Auth flows

1. User JWT (from the launcher session)

The default for every user-facing flow. The vendor app receives the user JWT from launcher2 through the IPC session handshake (per the c376ea3 fix). It then calls AI Helper with:

Authorization: Bearer eyJ0eXAi... (RS256, kid header)

AI Helper backend:

  • Validates the JWT against the core-api JWKS (AVAX_JWKS_URL)
  • Extracts user_id, company_id, system_role and subscriptions
  • Authorization: the caller's company (user.company_id) must hold a subscription to app-ai-helper on the alpha, beta or stable channel
  • Rate limit per user (default 60 req/min; 600 for super_admin)
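The per-user rate limit can be illustrated with a sliding-window counter. This is an in-memory sketch under stated assumptions: the class name SlidingWindowLimiter and the injectable clock are introduced here, the 429 response is the conventional HTTP status rather than something the spec prescribes, and a multi-process deployment would need shared state (for example in the Redis instance the service already uses).

```python
import time
from collections import defaultdict, deque

LIMITS = {"super_admin": 600}   # requests per minute, per the defaults above
DEFAULT_LIMIT = 60
WINDOW_S = 60.0


class SlidingWindowLimiter:
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._hits = defaultdict(deque)   # user_id -> timestamps of recent requests

    def allow(self, user_id: str, system_role: str = "user") -> bool:
        now = self._clock()
        hits = self._hits[user_id]
        while hits and now - hits[0] >= WINDOW_S:
            hits.popleft()               # drop requests that left the window
        if len(hits) >= LIMITS.get(system_role, DEFAULT_LIMIT):
            return False                 # assumption: the caller answers 429
        hits.append(now)
        return True


fake_now = [0.0]
limiter = SlidingWindowLimiter(clock=lambda: fake_now[0])
accepted = sum(limiter.allow("user-1") for _ in range(61))
print(accepted)   # 60
```

Passing the clock in as a parameter keeps the limiter deterministic in tests, as the usage above shows.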

2. Vendor M2M (service identity)

For cron, batch jobs and the vendor's background services. A vendor (a company with is_vendor=true) is issued an m2m_client_id + m2m_secret. Flow:

POST /auth/token (na core-api)
  grant_type=client_credentials&
  client_id=<m2m_client_id>&
  client_secret=<m2m_secret>&
  scope=ai-helper.chat ai-helper.embed ai-helper.rag.query

→ {access_token: "eyJ...", expires_in: 3600, token_type: "Bearer"}

The vendor then calls AI Helper with:

Authorization: Bearer <m2m_token>
X-Vendor-App-Slug: app-legal     # which app is calling (for ACL/quota)

The M2M token carries sub=vendor:<vendor_id> and no user_id. RAG ACLs apply exactly as if a user were calling on behalf of the app.

MVP implementation: a core-api endpoint POST /auth/token (client_credentials grant) and an m2m_clients table (client_id, hashed_secret, allowed_scopes, owner_vendor_id).

3. External API key (SaaS partners)

For outside clients, via api.avaxis.cz. The client registers an API client in the AVAXIS portal (admin) and receives an api_key (e.g. avax_live_<32_random>).

POST https://api.avaxis.cz/v1/ai/chat/complete
X-API-Key: avax_live_abcd1234...
Content-Type: application/json

{model, messages, max_tokens, ...}

Backend:

  • API key lookup in the ai_helper.api_client table
  • Per-client rate limit + monthly quota
  • Allowed capabilities (e.g. chat only, not image.generate?)
  • Logging to ai_helper.usage_log with api_client_id

Implemented on dev in the MVP (explicitly confirmed by the user); production rollout in 2.0 (with a portal UI for key management and billing integration).
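Key issuance and lookup could look roughly like this. The sketch rests on assumptions that the spec does not state: keys are stored only as SHA-256 digests, the random part is 32 hex characters, and unknown keys and disallowed capabilities map to 401 and 403 respectively. The function names and the in-memory dict standing in for ai_helper.api_client are hypothetical.

```python
import hashlib
import secrets

KEY_PREFIX = "avax_live_"


def generate_api_key() -> str:
    """Return a key shaped like avax_live_<32 random characters>."""
    return KEY_PREFIX + secrets.token_hex(16)   # 16 random bytes -> 32 hex chars


def hash_api_key(api_key: str) -> str:
    # Assumption: only a SHA-256 digest is stored, never the plaintext key.
    return hashlib.sha256(api_key.encode()).hexdigest()


def authorize(api_key: str, capability: str, clients_by_hash: dict) -> int:
    """Return an HTTP-style status for the request (assumed mapping)."""
    client = clients_by_hash.get(hash_api_key(api_key))
    if client is None:
        return 401   # unknown key
    if capability not in client["allowed_capabilities"]:
        return 403   # key is valid but the capability is not allowed
    return 200


key = generate_api_key()
clients = {hash_api_key(key): {"id": "client-1", "allowed_capabilities": {"chat"}}}
print(authorize(key, "chat", clients), authorize(key, "image.generate", clients))
```

Storing a digest instead of the key means a leaked table does not expose usable credentials, at the cost of not being able to show the key to the client again after issuance.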

8. API endpoints

All endpoints live under the /v1/* prefix (apps-gateway routes /apps/app-ai-helper/v1/* → strip → /v1/*).

Chat

POST /v1/chat/stream         → SSE stream (default for the UI)
POST /v1/chat/complete       → sync full response

Request body:

{
  "messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."}
  ],
  "model": "qwen3.6-27b",       // optional, scheduler default
  "max_tokens": 4096,
  "temperature": 0.7,
  "stream": true,                // ignored for /complete
  "priority": "interactive",     // optional override
  "rag": {                       // optional inline RAG
    "enabled": true,
    "corpora": ["laws-cr"],      // optional, default: all accessible corpora
    "top_k": 5
  }
}

Embed

POST /v1/embed                → sync (single)
POST /v1/embed/batch          → sync (batch up to 64)

RAG

POST /v1/rag/query                   → sync ANN search
POST /v1/rag/documents               → async index job
GET  /v1/rag/jobs/{job_id}           → poll status
GET  /v1/rag/corpora                 → list accessible
POST /v1/rag/corpora                 → create new (owner only)
POST /v1/rag/corpora/{id}/grants     → grant read to another app

Image / video / speech

POST /v1/image/generate              → async job
POST /v1/image/understand            → sync (vision LLM)
POST /v1/speech/to-text              → sync or SSE
POST /v1/speech/from-text            → sync or SSE
POST /v1/video/generate              → async job
GET  /v1/jobs/{job_id}               → poll status (unified for async jobs)

Workers (admin / monitoring)

GET  /v1/workers                     → list (filtered by caller scope)
POST /v1/workers                     → register new (super_admin or vendor)
GET  /v1/workers/{id}                → detail
PUT  /v1/workers/{id}                → update (capabilities, environment, …)
DELETE /v1/workers/{id}              → unregister
GET  /v1/workers/{id}/health         → live ping

Capabilities

GET  /v1/capabilities                → list all registered capabilities
POST /v1/capabilities                → register new (super_admin)
GET  /v1/capabilities/{id}/workers   → which workers declare this capability

Generic call (extensible)

POST /v1/call/{capability_id}        → any capability
                                       body = capability-specific payload
                                       the routing scheduler picks the worker

Usage / admin

GET  /v1/usage/me                    → my usage this month (per user)
GET  /v1/usage/company               → company-wide (super_admin / company_admin)
GET  /v1/admin/quota                 → admin: quota config (super_admin)
PUT  /v1/admin/quota/{company_id}    → set quota

9. SDK contract — avaxis_sdk.ai

from avaxis_sdk import AvaxApp

app = AvaxApp(slug="app-mzdy")
app.connect()
app.ready()

@app.on_session
def on_session():
    # AI Helper is usable from this point on (it needs the session JWT)
    pass

# ── Chat ──
async for token in app.ai.chat(prompt="Spočti mzdu...", model=None):
    print(token, end="", flush=True)

# Synchronous chat (returns the full string)
text = await app.ai.chat_complete(prompt="...", model="qwen3.6-27b")

# ── Embed ──
vec: list[float] = await app.ai.embed("text")
vecs: list[list[float]] = await app.ai.embed_batch(["t1", "t2"])

# ── RAG ──
chunks = await app.ai.rag.query("zákon 89/2012")
# → [{content, metadata, source_corpus, score}]

chunks = await app.ai.rag.query("dotaz", corpora=["laws-cr"], top_k=5)

# Index document (async)
job = await app.ai.rag.index_document(
    corpus="mzdy-payroll-history",
    source="s3://...",
    metadata={"klient": "..."}
)
await job.wait()           # or await job.poll() for a non-blocking check

# Create new corpus
corpus_id = await app.ai.rag.create_corpus(
    slug="mzdy-tariff-tables",
    type="shared",
    embedding_model="bge-m3"
)

# Grant read access to another app
await app.ai.rag.grant_read(corpus_id=corpus_id, reader_app="finance")

# ── Image ──
job = await app.ai.image.generate(
    prompt="logo firmy XYZ, minimalist",
    model="sdxl-base",
    width=1024, height=1024
)
await job.wait()
image_url = job.result.url

# Vision (image understanding)
description = await app.ai.image.understand(
    image_url="s3://...",
    prompt="Co je na obrázku?"
)

# ── Speech ──
text = await app.ai.speech.transcribe(audio_url="s3://...")  # STT
audio_url = await app.ai.speech.synthesize(text="Ahoj")      # TTS

# ── Generic call ──
result = await app.ai.call(
    "text_understanding.ner",
    payload={"text": "Jan Novák..."},
    timeout=10.0
)

# ── Capabilities discovery ──
caps = await app.ai.capabilities()
# → {"chat": {"models": [...]}, "embed": {...}, "text_understanding.ner": {...}}

# ── Usage check ──
usage = await app.ai.usage.me()
# → {"input_tokens": 12345, "output_tokens": 5678, "month_budget": 1000000}

SDK implementation under the hood

  • HTTP client (httpx async) with connection pooling
  • Backend URL: app.platform_url + "/apps/app-ai-helper" (from the launcher session)
  • Auth: Authorization: Bearer <app.access_token> on every request
  • Auto-refresh of the JWT before it expires (refresh-token flow)
  • Retry policy: 3× exponential backoff for 503 / network errors
  • Streaming: SSE parser (text/event-stream content type)
  • Job polling: the Job.wait(timeout) helper polls /v1/jobs/{id} every 2 s
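The retry policy can be sketched with plain asyncio. This is not the SDK's code: with_retry and ServiceUnavailable are hypothetical names, "3×" is read here as three retries after the first attempt, and the 0.5 s base delay is an assumed value.

```python
import asyncio


class ServiceUnavailable(Exception):
    """Hypothetical error the transport raises for HTTP 503."""


async def with_retry(call, *, retries: int = 3, base_delay: float = 0.5):
    """Run `call()`; on 503 or a network error retry up to `retries` times."""
    for attempt in range(retries + 1):
        try:
            return await call()
        except (ServiceUnavailable, ConnectionError):
            if attempt == retries:
                raise                      # retries exhausted: surface the error
            # Exponential backoff: base_delay, then 2x, then 4x.
            await asyncio.sleep(base_delay * 2 ** attempt)


async def demo():
    attempts = 0

    async def flaky():
        nonlocal attempts
        attempts += 1
        if attempts < 3:
            raise ServiceUnavailable()
        return "ok"

    result = await with_retry(flaky, base_delay=0.01)
    return result, attempts


print(asyncio.run(demo()))   # ('ok', 3)
```

Only 503 and connection errors are retried; other failures (for example a 403 from the ACL check) propagate immediately, since repeating them cannot succeed.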

10. Usage / quota tracking

Logging schema

CREATE TABLE ai_helper.usage_log (
  id              UUID PRIMARY KEY,

  -- Caller identity
  user_id         UUID,
  company_id      UUID,
  app_id          UUID,
  m2m_client_id   UUID,                  -- for vendor M2M
  api_client_id   UUID,                  -- for the external API

  -- Request
  capability      VARCHAR(100) NOT NULL, -- capability ID: 'chat', 'embed', ...
  worker_id       UUID,
  model           VARCHAR(100),
  priority        VARCHAR(20),

  -- Cost
  input_tokens    INT,
  output_tokens   INT,
  duration_ms     INT,
  vram_used_mb    INT,

  -- Outcome
  status_code     INT,                   -- 200, 503, ...
  error           TEXT,

  created_at      TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_usage_company_month ON ai_helper.usage_log (company_id, created_at);
CREATE INDEX idx_usage_app_month ON ai_helper.usage_log (app_id, created_at);

Quota table (enforced in 2.0)

CREATE TABLE ai_helper.quota (
  id                  UUID PRIMARY KEY,
  scope_type          VARCHAR(20) NOT NULL,  -- 'company' | 'app' | 'api_client'
  scope_id            UUID NOT NULL,
  monthly_input_tokens   BIGINT,
  monthly_output_tokens  BIGINT,
  rate_limit_per_minute  INT,

  -- Computed
  current_month_usage BIGINT DEFAULT 0,
  reset_at            TIMESTAMPTZ,

  created_at  TIMESTAMPTZ DEFAULT NOW()
);

MVP: only logs to usage_log. Enforcement comes in 2.0 (a middleware check before the request, an increment after it).
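The planned check-before, increment-after flow might look like the sketch below. It simplifies the quota table to a single combined token budget (the schema keeps separate input and output limits), and the names Quota, QuotaExceeded, check_quota and record_usage are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Quota:
    monthly_tokens: int            # assumption: one combined input + output budget
    current_month_usage: int = 0


class QuotaExceeded(Exception):
    pass


def check_quota(quota: Quota, max_tokens: int) -> None:
    """Pre-request check: reject when the worst case would exceed the budget."""
    if quota.current_month_usage + max_tokens > quota.monthly_tokens:
        raise QuotaExceeded("monthly token budget exhausted")


def record_usage(quota: Quota, input_tokens: int, output_tokens: int) -> None:
    """Post-request increment with the tokens actually consumed."""
    quota.current_month_usage += input_tokens + output_tokens


quota = Quota(monthly_tokens=1_000_000, current_month_usage=999_000)
check_quota(quota, max_tokens=500)          # passes: 999_500 <= 1_000_000
record_usage(quota, input_tokens=300, output_tokens=200)
print(quota.current_month_usage)            # 999500
```

Checking against max_tokens before the call is conservative: a request can be rejected even though its actual usage would have fit, which is the usual trade-off for never overshooting the budget.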

10.5 GPU workload scheduler (single-GPU mutex)

Status: TODO 2026-05-29; implement before the app-hotline P1 deploy
Audience: the second Claude working on the app-ai-helper repo
Driver: the AVAXIS internal infra (the gerry machine) has 1 GPU (RTX 3090) shared between the LM Studio chat/RAG worker and WhisperX (app-hotline STT). Total VRAM is 24 GB; Qwen 14B Q4 + bge-m3 take ~11 GB and WhisperX Large + pyannote ~6 GB, and the two workloads are not run in parallel.

Use case

1. A user opens chat in the app-ai-helper UI
   → gpu_sched.acquire("lm_studio")
   → LM Studio worker available → response

2. An app-hotline operator ends a call → uploads the WAV → POST transcribe_structured
   → gpu_sched.acquire("whisperx")
     ├─ If current=lm_studio: POST gerry /unload model (~5s)
     └─ Start the whisperx subprocess / docker exec (~10s)
   → STT of 10 min of audio (~1-3 min on the RTX 3090)
   → ai-helper chat is UNAVAILABLE during the STT cycle (returns 503 Service Busy
     or is queued with an ETA)

3. After STT finishes → gpu_sched.release("whisperx")
   → If the queue has a pending chat request → acquire("lm_studio") → resume
   → Else → stays "free" until the next request

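Implementation contract

# backend/app/services/gpu_scheduler.py
import asyncio


class GPUWorkloadManager:
    def __init__(self, enabled: bool = True) -> None:
        self._current = "free"        # "lm_studio" | "whisperx" | "free"
        self._lock = asyncio.Lock()   # per instance; serializes swaps
        self._enabled = enabled       # set to False after the HW upgrade (2 GPUs)

    async def acquire(self, worker: str, *, timeout: float = 120) -> bool:
        """Make `worker` the GPU owner; return False if the lock stays busy for `timeout` s."""
        if not self._enabled:
            return True  # no-op in multi-GPU mode
        try:
            await asyncio.wait_for(self._lock.acquire(), timeout)
        except asyncio.TimeoutError:
            return False
        try:
            if self._current != worker:
                await self._unload_current()
                await self._load(worker)
                self._current = worker
            return True
        finally:
            self._lock.release()

    async def release(self, worker: str) -> None:
        """Free the GPU if `worker` still owns it (called after STT or on idle timeout)."""
        if not self._enabled:
            return
        async with self._lock:
            if self._current == worker:
                await self._unload_current()
                self._current = "free"

    async def _unload_current(self) -> None:
        if self._current == "lm_studio":
            # POST {worker_url}/v1/models/{model}/unload (LM Studio API)
            ...
        elif self._current == "whisperx":
            # docker stop whisperx-webui (or POST /shutdown)
            ...

    async def _load(self, worker: str) -> None:
        if worker == "lm_studio":
            # POST {worker_url}/v1/models/{model}/load
            # Wait until model status == "loaded"
            ...
        elif worker == "whisperx":
            # docker start whisperx-webui (or subprocess start)
            # Wait until http://localhost:7860/health returns 200
            ...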

Configuration (env / settings)

# app-ai-helper .env
GPU_SCHEDULER_ENABLED=true       # false once the dedicated 4060 Ti is added
GPU_SCHEDULER_LM_STUDIO_URL=http://host.docker.internal:1234
GPU_SCHEDULER_WHISPERX_URL=http://host.docker.internal:7860
GPU_SCHEDULER_DEFAULT_AFTER_IDLE=lm_studio   # warm-up after N min idle

Integration points

  1. Chat endpoint (/v1/chat/stream, /v1/chat/complete):

    await gpu_sched.acquire("lm_studio")
    # ... existing chat logic ...
    

  2. Capability speech.transcribe_structured (new, for app-hotline):

    await gpu_sched.acquire("whisperx")
    result = await whisperx_client.transcribe(...)
    # Do NOT release here: keep the GPU for further calls; release after N s idle

  3. Idle timeout: if the worker has been inactive for 5 min, release the lock (GPU free). The next acquire is then faster (no unload needed).

After the hardware upgrade (RTX 4060 Ti dedicated to WhisperX)

  • GPU_SCHEDULER_ENABLED=false → acquire() always returns True, no unload
  • LM Studio binds to GPU 0 (3090) via CUDA_VISIBLE_DEVICES=0
  • WhisperX binds to GPU 1 (4060 Ti) via CUDA_VISIBLE_DEVICES=1
  • Both workers run all the time, with zero switching latency
  • acquire/release stay in place as no-ops for a future re-enable (extra GPU workload, scaling, etc.)

Status backend response

GET /v1/workers/gpu_state (for the UI):

{
  "scheduler_enabled": true,
  "current_worker": "lm_studio",
  "vram_used_gb": 11.2,
  "vram_total_gb": 24,
  "since": "2026-05-29T14:23:11Z",
  "next_swap_eta": null
}

UI feedback in app-ai-helper

In the chat tab, when current_worker != "lm_studio":

  • Show a banner "⏳ The worker is busy with an STT job (~45 s remaining)" with a countdown
  • Disable the input field during the swap

11. Roadmap

MVP (1.0): implement first

  • ✅ Worker registry (scope: platform/vendor/company), env tiers
  • ✅ Queue + scheduler (4 priorities, no preemption, sticky model)
  • ✅ Capabilities: core list + extensible registry
  • ✅ RAG (3 corpus types, ACL, query, async document indexing)
  • ✅ Auth: user JWT + vendor M2M
  • ✅ External API key: dev only (test environment)
  • ✅ Usage logging (no enforcement)
  • ✅ SDK avaxis_sdk.ai.* namespace

1.1: vendor portal UX

  • Vendors register workers through the portal UI
  • Toggle "Direct-dial vs Agent" in the register dialog
  • Auto-detect transport_class (TCP probe of the worker URL)
  • GPU workload scheduler (§10.5): single-GPU mutex for the AVAXIS internal gerry machine until the RTX 4060 Ti is added. Implement before the app-hotline P1 deploy (the speech.transcribe_structured capability is waiting on it).

1.5: production

  • External API portal UI for clients (API key issuance + billing)
  • Quota enforcement (middleware)
  • External fallback providers (OpenAI/Anthropic opt-in)
  • Per-company dashboard (usage, quota, top capabilities)

2.0: advanced

  • Webhook callbacks for async jobs
  • Worker auto-scaling (kubernetes deploy on demand)
  • Multi-region failover (vm-gateway with GeoDNS)
  • Fine-tuning pipeline (PEFT/LoRA per company)
  • Custom model upload (vendor brings model artifact)

3.0: future

  • Federated learning (the model improves from usage across companies)
  • Agent loop framework (multi-step reasoning, tool use)
  • Embedding model selection per RAG corpus (mixed dimensions)

12. Related