AI Helper: platform service specification¶
Status: draft v1.0 Updated: 2026-05-27 Related:
apps-gateway.md, per-app-container.md, ai-chat.md, ai-rag.md
1. Goal¶
AI Helper is a platform service that:
- Orchestrates AI workers: LM Studio, ComfyUI, Whisper, and other AI servers. Vendor apps and external clients call AI without knowing anything about the hardware.
- Manages the queue and VRAM: interactive chat requests jump ahead of background indexing. A worker with 24 GB of VRAM does not serve 5 models at once.
- Distributes workers across ownership scopes: AVAXIS pool (shared, production), vendor private (development), client on-premise.
- Hosts the RAG corpus: 3 types (global, shared, private) with ACL. The legal corpus is available across apps.
- Centralizes auth and quota: user JWT (from the launcher), vendor M2M (cron, batch), API key (external clients).
What AI Helper is NOT (by design)¶
- An end-user UI application. AI Pomocník (app-ai-helper 0.x) has a 9-tab UI that is a layer on top of the service. The service itself is headless. The UI client can talk to the service the same way as any vendor app.
- A model trainer / fine-tuner. Models are pre-trained and hosted on workers. Fine-tuning is out of scope (a PEFT/LoRA pipeline is possible in 3.0+).
- A validator of input data. The vendor app validates its own data. AI Helper accepts whatever it receives (subject to rate limiting and a max-tokens cap).
- A replacement for core-api auth. AI Helper validates JWTs against the core-api JWKS but does not expose its own login.
2. Architecture¶
                  External clients (SaaS partners)
                                  │ X-API-Key
                                  ▼
                        ┌────────────────────┐
                        │ vm-gateway nginx   │  TLS + rate limit
                        │ api.avaxis.cz      │
                        └─────────┬──────────┘
                                  │ /v1/ai/*
    ┌─────────────────────────────┼───────────────┐
    │                             │               │
Vendor apps (legal, mzdy,         │  Launcher2 (user) ── IPC token ──┐
finance, …), internal             │                                  │
    │ Authorization: Bearer JWT   │                                  │
    ▼                             ▼                                  ▼
                       ┌──────────────────────┐
                       │ apps-gateway         │  cache, circuit breaker
                       │ :8100                │
                       └──────────┬───────────┘
                                  │ /apps/app-ai-helper/v1/*
                                  ▼
            ┌──────────────────────────────────────────────────┐
            │  AI Helper service                               │
            │  (FastAPI + Celery + Redis + pgvector + S3)      │
            │  ┌──────────┐  ┌──────────┐   ┌──────────────┐   │
            │  │ API      │  │ Scheduler│   │ Capability   │   │
            │  │ routers  │  │ (Celery) │   │ registry     │   │
            │  └──────────┘  └────┬─────┘   └──────────────┘   │
            │                     │                            │
            │  ┌──────────┐  ┌────▼────┐    ┌──────────────┐   │
            │  │ RAG      │  │ Worker  │    │ Usage logger │   │
            │  │ (pgvec)  │  │ probe   │    │ (Postgres)   │   │
            │  └──────────┘  └────┬────┘    └──────────────┘   │
            └─────────────────────┼────────────────────────────┘
                                  │
          ┌───────────────────────┼─────────────────────────┐
          ▼                       ▼                         ▼
    ┌────────────┐          ┌────────────┐            ┌────────────┐
    │ Platform   │          │ Vendor     │            │ Company    │
    │ workers    │          │ private    │            │ private    │
    │            │          │            │            │            │
    │ LM Studio  │          │ Michal-PC  │            │ Client GPU │
    │ ComfyUI    │          │ LM Studio  │            │ on-premise │
    │ Whisper    │          │ (dev only) │            │ (rare)     │
    └────────────┘          └────────────┘            └────────────┘
Components¶
| Component | Purpose | Tech |
|---|---|---|
| API routers | REST + SSE endpoints under /v1/* | FastAPI |
| Scheduler | Routes requests to a suitable worker (capability + load + VRAM) | Celery + Redis priority queues |
| Worker probe | Periodic worker health check, model_loaded detection | Background tasks |
| Capability registry | Dynamic registry of typed + extensible capabilities | Postgres ai_capabilities table + cache |
| RAG service | pgvector ANN search, document indexer (Celery task) | pgvector ext, BGE embed via worker |
| Usage logger | Per-request token usage + latency + cost | Postgres ai_usage_log table |
3. Federated workers¶
A worker is a registered AI server instance (LM Studio, ComfyUI, …) with declared capabilities. Ownership is split into scopes:
Scope tiers¶
| Scope | Owner | Visibility | Example |
|---|---|---|---|
| platform | AVAXIS | All companies, all apps | gerry-prod-1 (24 GB LM Studio) |
| vendor | Vendor (company with is_vendor=true) | The vendor's apps across companies | michal-laptop-lm (dev only) |
| company | Client company | Only apps subscribed by that company | klient-XYZ-on-prem (data residency) |
Worker schema¶
CREATE TABLE ai_helper.worker (
id UUID PRIMARY KEY,
name VARCHAR(100) NOT NULL, -- 'gerry', 'michal-laptop-lm'
kind VARCHAR(50) NOT NULL, -- 'lmstudio', 'comfyui', 'whisper', 'generic'
-- Routing
transport_class VARCHAR(20) NOT NULL, -- 'datacenter' (direct-dial) | 'agent' (pull-poll)
base_url VARCHAR(500), -- for datacenter: HTTP URL
agent_session VARCHAR(100), -- for agent: poll session ID
-- Ownership
scope VARCHAR(20) NOT NULL, -- 'platform' | 'vendor' | 'company'
owner_vendor_id UUID, -- FK companies.id (only for scope='vendor')
owner_company_id UUID, -- FK companies.id (only for scope='company')
-- Environment (where the worker is eligible)
environment VARCHAR(20) NOT NULL DEFAULT 'production',
-- 'dev' | 'staging' | 'production'
-- Hardware
vram_total_mb INT,
vram_used_mb INT, -- updated by scheduler/probe
-- Capabilities: extensible JSON, see §5
capabilities JSONB NOT NULL DEFAULT '{}',
-- Lifecycle
status VARCHAR(20) NOT NULL DEFAULT 'unknown',
-- 'unknown' | 'online' | 'busy' | 'offline' | 'error'
tunnel_status VARCHAR(20), -- agent only: 'connected'|'disconnected'
last_health_check_at TIMESTAMPTZ,
last_health_error TEXT,
created_by_user_id UUID,
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_worker_scope ON ai_helper.worker (scope, status, environment);
CREATE INDEX idx_worker_caps ON ai_helper.worker USING GIN (capabilities);
Routing rules¶
For a request (app_slug, company_id, environment, capability):
- Candidates: WHERE status = 'online' AND environment >= request.environment (tiers ordered dev < staging < production, matching the escalation rule below) AND capabilities ? capability AND one of:
  - scope = 'platform' (always)
  - OR (scope = 'vendor' AND owner_vendor_id = <app vendor>) (the vendor's own)
  - OR (scope = 'company' AND owner_company_id = company_id) (client on-premise)
- VRAM filter: vram_total_mb - vram_used_mb >= request.vram_estimate
- Sort: ASC by (current_queue_depth, last_used_at), so the least loaded worker comes first
- Pick: the first match. If there is none, return 503 no_capable_worker_available.
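The routing rules above can be sketched in Python roughly as follows. This is a minimal in-memory illustration, not the service's implementation: the `Worker` dataclass, `ENV_RANK`, `pick_worker`, and the `queue_depth` / `last_used_at` counters are hypothetical names, and the environment comparison assumes the ordering dev < staging < production from the Environment escalation section.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed tier ordering: a request may escalate up to a higher tier, never down.
ENV_RANK = {"dev": 0, "staging": 1, "production": 2}


@dataclass
class Worker:
    name: str
    scope: str                      # 'platform' | 'vendor' | 'company'
    environment: str                # 'dev' | 'staging' | 'production'
    capabilities: dict
    vram_total_mb: int
    vram_used_mb: int = 0
    status: str = "online"
    owner_vendor_id: Optional[str] = None
    owner_company_id: Optional[str] = None
    queue_depth: int = 0            # hypothetical runtime counter
    last_used_at: float = 0.0       # hypothetical runtime counter


def pick_worker(workers, *, app_vendor_id, company_id, environment,
                capability, vram_estimate_mb):
    """Return the least-loaded eligible worker, or None (the caller maps None to 503)."""

    def visible(w):
        return (w.scope == "platform"
                or (w.scope == "vendor" and w.owner_vendor_id == app_vendor_id)
                or (w.scope == "company" and w.owner_company_id == company_id))

    candidates = [
        w for w in workers
        if w.status == "online"
        and ENV_RANK[w.environment] >= ENV_RANK[environment]
        and capability in w.capabilities
        and visible(w)
        and w.vram_total_mb - w.vram_used_mb >= vram_estimate_mb
    ]
    candidates.sort(key=lambda w: (w.queue_depth, w.last_used_at))
    return candidates[0] if candidates else None


workers = [
    Worker("gerry-prod-1", "platform", "production", {"chat": {}}, 24576, queue_depth=3),
    Worker("michal-laptop-lm", "vendor", "dev", {"chat": {}}, 8192, owner_vendor_id="v1"),
]
dev_pick = pick_worker(workers, app_vendor_id="v1", company_id="c1",
                       environment="dev", capability="chat", vram_estimate_mb=4096)
prod_pick = pick_worker(workers, app_vendor_id="v1", company_id="c1",
                        environment="production", capability="chat", vram_estimate_mb=4096)
```

In this sketch the dev request lands on the idle vendor laptop, while the production request can only reach the platform worker.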
Environment escalation¶
environment defines the minimum tier for a worker. A request with environment=production accepts only production workers. A request with environment=dev accepts dev OR staging OR production workers (escalation up).
In other words:
- Dev requests may use production workers (if the production server has spare capacity).
- Production requests NEVER use a dev worker (data security).
Worker onboarding¶
| Action | Who |
|---|---|
| Register platform worker | super_admin via admin UI |
| Register vendor worker | vendor (via vendor portal, scope=vendor forced to the vendor's id) |
| Register company worker | super_admin (the client registers via a service ticket, manual) |
| Update capabilities | owner (vendor for vendor scope, super_admin for platform) |
| Disable / delete | owner |
4. Queue + scheduler¶
Priorities¶
realtime     │ Voice call live transcription, must answer in <2s or fail
             │ Default: NO queue. If no worker is free, return 503 immediately
             │ Allowed only for core platform features (voice/video call)
             ▼
interactive  │ User is waiting at the screen (chat request, image preview)
             │ Soft target: <5s to start of response (SSE first token)
             │ Default for: /chat/stream, /chat/complete, /image/generate with preview=true
             ▼
background   │ App backend calls AI with no user waiting (RAG indexing, batch embed,
             │ document understanding)
             │ Soft target: <30s
             │ Default for: /jobs/*, /embed/batch
             ▼
scheduled    │ Cron, nightly reindex, periodic recomputation
             │ Soft target: hours are OK
             │ Default for: nothing directly from the API, only vendor cron via M2M token
Scheduler behavior¶
- No preemption. LLM inference is atomic (token by token, it cannot be stopped midway). Instead, the scheduler does not hand a new request to a worker while a higher-priority request is waiting in the queue.
- Per-worker queue + global pending pool. When a worker finishes a task, the scheduler assigns it the highest-priority pending request whose kind/capabilities match.
- Starvation prevention. Background tasks waiting >10 min are promoted to priority+1 (interactive level). Scheduled tasks waiting >1 h are promoted as well.
- VRAM-aware. If a worker has LM Studio loaded with qwen-27b (16 GB) and a request arrives for a BGE-m3 embed (2 GB), the scheduler does not send it to that worker (a model switch takes 5-30 s, which is expensive). Instead it looks for a worker with BGE already loaded, or the task waits.
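The starvation-prevention rule can be illustrated with a small sketch. It assumes that "promote" means moving exactly one level up in the priority order and that ties are broken by enqueue time; `PRIORITIES`, `PROMOTE_AFTER_S`, `effective_priority`, and `next_request` are hypothetical names, not part of the service.

```python
# Priority levels from lowest to highest, as listed in the Priorities section.
PRIORITIES = ["scheduled", "background", "interactive", "realtime"]

# Wait thresholds from the text: background > 10 min, scheduled > 1 h.
PROMOTE_AFTER_S = {"background": 10 * 60, "scheduled": 60 * 60}


def effective_priority(priority: str, waited_s: float) -> str:
    """Return the priority after applying the one-level promotion rule."""
    threshold = PROMOTE_AFTER_S.get(priority)
    if threshold is not None and waited_s > threshold:
        return PRIORITIES[PRIORITIES.index(priority) + 1]
    return priority


def next_request(pending, now):
    """Pick the pending request with the highest effective priority.

    `pending` is a list of dicts with 'priority' and 'enqueued_at' (seconds).
    Older requests win ties, so a promoted task is served before newer peers.
    """
    def key(req):
        eff = effective_priority(req["priority"], now - req["enqueued_at"])
        return (-PRIORITIES.index(eff), req["enqueued_at"])

    return min(pending, key=key) if pending else None
```

With this ordering, a background task that has waited more than 10 minutes competes with interactive requests and, being older, is dispatched first.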
Model switching¶
Workers declare a models_loaded JSON (e.g. ["qwen-27b"]). The scheduler:
- Prefers workers where the requested model is already loaded (sticky)
- If there is none, picks a worker with free VRAM ≥ model size + buffer (loosely)
- Sends the LM Studio API call POST /v1/models/load with the target model (supported by LM Studio)
- The worker probe updates models_loaded on success
If no candidate is free and none has the model sticky, the request waits in the queue.
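A minimal sketch of the sticky-first placement described above. Workers are plain dicts here, and `place_model` plus the `buffer_mb` default of 1024 are assumptions made for the illustration; the source does not state the buffer size.

```python
def place_model(workers, model, model_size_mb, buffer_mb=1024):
    """Return (worker, needs_load), or None when the request must wait in the queue.

    1. Sticky: a worker that already has the model loaded needs no switch.
    2. Otherwise: the first worker with enough free VRAM for the model plus a buffer.
    """
    for w in workers:
        if model in w["models_loaded"]:
            return w, False
    for w in workers:
        free_mb = w["vram_total_mb"] - w["vram_used_mb"]
        if free_mb >= model_size_mb + buffer_mb:
            return w, True
    return None


workers = [
    {"name": "a", "models_loaded": ["qwen-27b"], "vram_total_mb": 24576, "vram_used_mb": 16384},
    {"name": "b", "models_loaded": ["bge-m3"], "vram_total_mb": 8192, "vram_used_mb": 2048},
]
sticky_hit = place_model(workers, "bge-m3", 2048)
```

When `needs_load` is true, the caller would issue the model load call and let the worker probe refresh models_loaded afterwards.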
Capacity overflow → external fallback¶
If queue depth > N (config), AI Helper has an opt-in fallback to external API providers:
- OpenAI API (if the company has openai_api_key in env)
- Anthropic API
- Self-hosted models
Cost is then tracked per request (external API rates). Disabled by default; explicit opt-in per company.
5. Capabilities — typed core + extensible¶
A capability is a named function that a worker can perform. It is not a hardcoded enum but a plugin model.
Core capabilities (always supported)¶
| Capability | Pattern | Worker kinds |
|---|---|---|
| chat | SSE stream | lmstudio, ollama, generic_llm |
| chat.complete | sync (full response) | same as above |
| embed | sync | lmstudio, bge_server, sentence_transformers |
| embed.batch | sync (batch input) | same as above |
| image.generate | async job | comfyui, automatic1111 |
| image.understand | sync (vision LLM) | lmstudio (Llama Vision), generic_vlm |
| speech.to_text | sync or SSE | whisper_server, ollama_whisper |
| text.to_speech | sync or SSE | piper, xtts, openai_tts_compatible |
| video.generate | async job (long) | comfyui (AnimateDiff), wan2 |
Extensible namespaces¶
text_understanding.*: NLP tasks (typically implemented by an LLM with a system prompt template, though dedicated NLP workers are possible):
- text_understanding.ner: Named Entity Recognition (PERSON, ORG, DATE, MONEY, …)
- text_understanding.classify: document classification
- text_understanding.extract: structured extraction (key-value from PDFs)
- text_understanding.summarize: document summarization
- text_understanding.translate: CZ ↔ EN (and others)
- text_understanding.sentiment: sentiment analysis
audio_understanding.*:
- audio_understanding.diarize: speaker separation
- audio_understanding.intent: voice command classification
custom.<namespace>.<task> (vendor-specific):
- custom.legal.contract_extract: Legal-vendor specific contract parser
- custom.mzdy.payslip_ocr: Mzdy app payslip extractor
Capability declaration in a worker¶
{
"chat": {
"models": ["qwen3.6-27b", "llama-3.3-70b"],
"max_context": 32768,
"supports_streaming": true
},
"embed": {
"models": ["bge-m3"],
"dim": 1024,
"max_batch_size": 64
},
"text_understanding.ner": {
"languages": ["cs", "en"],
"entities": ["PERSON", "ORG", "DATE", "MONEY", "LAW_REFERENCE"]
}
}
Capability registry¶
CREATE TABLE ai_helper.capability (
id VARCHAR(100) PRIMARY KEY, -- 'chat', 'text_understanding.ner'
category VARCHAR(50) NOT NULL, -- 'core' | 'text_understanding' | 'custom'
schema_json JSONB, -- JSON schema for input/output
added_by VARCHAR(100),
added_at TIMESTAMPTZ DEFAULT NOW()
);
A super_admin (or vendor admin) can add a new capability ID, and workers can then start declaring it.
Generic SDK call¶
# Typed core
async for tok in app.ai.chat(prompt="..."): ...
vec = await app.ai.embed("text")
# Generic (any capability)
result = await app.ai.call(
    "text_understanding.ner",
    payload={"text": "Jan Novák uzavřel smlouvu s ABC s.r.o. dne 12.5.2024"},
    timeout=10.0
)
# → {"entities": [
#   {"type":"PERSON", "value":"Jan Novák", "span":[0,9]},
#   {"type":"ORG", "value":"ABC s.r.o.","span":[28,38]},
#   {"type":"DATE", "value":"2024-05-12","span":[43,52]}
# ]}
6. RAG corpus¶
Three corpus types¶
| Type | Owner | Read access | Write access |
|---|---|---|---|
| global | Platform (AVAXIS) | All apps, automatically | super_admin only |
| shared | A specific app, per company | Owner + explicit grants | Owner app only |
| private | App + company | Owner only | Owner only |
Use cases¶
global examples:
- laws-cr: Czech laws (collection of laws, decrees), curated by the AVAXIS legal team
- tax-forms: tax forms (templates)
- accounting-standards: IFRS / Czech accounting standards
shared examples:
- Owner legal has the corpus legal-precedents, with read access for mzdy and finance (legal advice across apps)
- Owner mzdy has the corpus mzdy-tariff-tables, with read access for finance (consistent cost calculation)
private examples:
- mzdy-payroll-history-<company_id>: payroll history per client company, isolated
- legal-client-cases-<company_id>: the client's cases, also isolated
Schema¶
CREATE TABLE ai_helper.rag_corpus (
id UUID PRIMARY KEY,
slug VARCHAR(100) UNIQUE NOT NULL,
name VARCHAR(200) NOT NULL,
type VARCHAR(20) NOT NULL, -- 'global' | 'shared' | 'private'
owner_app_id UUID, -- FK apps.id (NULL pro global)
owner_company_id UUID, -- FK companies.id (NULL pro global)
embedding_dim INT NOT NULL, -- 768, 1024, 1536, ...
embedding_model VARCHAR(100) NOT NULL, -- 'bge-m3'
vector_provider VARCHAR(50) NOT NULL DEFAULT 'pgvector',
description TEXT,
document_count INT DEFAULT 0,
chunk_count INT DEFAULT 0,
total_tokens BIGINT DEFAULT 0,
created_by_user_id UUID,
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE TABLE ai_helper.rag_corpus_grant (
corpus_id UUID NOT NULL REFERENCES ai_helper.rag_corpus(id) ON DELETE CASCADE,
reader_app_id UUID NOT NULL REFERENCES apps(id) ON DELETE CASCADE,
granted_by_user_id UUID NOT NULL,
granted_at TIMESTAMPTZ DEFAULT NOW(),
PRIMARY KEY (corpus_id, reader_app_id)
);
CREATE TABLE ai_helper.rag_document (
id UUID PRIMARY KEY,
corpus_id UUID NOT NULL REFERENCES ai_helper.rag_corpus(id),
source_url TEXT, -- s3://... or http://...
title VARCHAR(500),
content_hash VARCHAR(64), -- sha256
metadata JSONB DEFAULT '{}',
status VARCHAR(20) DEFAULT 'pending',
-- 'pending' | 'indexing' | 'ready' | 'failed'
indexed_at TIMESTAMPTZ,
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE TABLE ai_helper.rag_chunk (
id UUID PRIMARY KEY,
document_id UUID NOT NULL REFERENCES ai_helper.rag_document(id) ON DELETE CASCADE,
corpus_id UUID NOT NULL, -- denormalized for filtering
chunk_index INT NOT NULL,
content TEXT NOT NULL,
embedding vector(1024), -- pgvector, dim per corpus
metadata JSONB DEFAULT '{}',
token_count INT,
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_chunk_corpus ON ai_helper.rag_chunk (corpus_id);
CREATE INDEX idx_chunk_embedding_hnsw ON ai_helper.rag_chunk USING hnsw (embedding vector_cosine_ops);
Query flow¶
# SDK call from the mzdy app
chunks = await app.ai.rag.query(
    "výpočet daně z příjmů zaměstnance",
    top_k=5,
    company_id=app.company_id # automatic, from the JWT
)
AI Helper backend:
- Resolve accessible corpora for (caller_app=mzdy, company=app.company_id):
  SELECT id FROM rag_corpus WHERE
    -- Global: always
    (type = 'global')
    -- Shared: only with a grant
    OR (type = 'shared' AND id IN (
      SELECT corpus_id FROM rag_corpus_grant WHERE reader_app_id = $caller_app_id
    ))
    -- Private: own only (per company)
    OR (type = 'private' AND owner_app_id = $caller_app_id AND owner_company_id = $company_id)
- Embed the query text via the embed capability (sticky model = BGE-m3)
- ANN search over the resolved corpora
- Return chunks with source_corpus metadata (so the caller knows where each chunk came from)
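The ANN search step could look roughly like the sketch below. It assumes pgvector's cosine-distance operator `<=>` (which pairs with the vector_cosine_ops HNSW index defined in the schema) and named placeholders in the style of psycopg; the query text, `to_vector_literal`, and `ann_params` are illustrative and not taken from the service. How the driver adapts the UUID list for `ANY(...)` is driver-specific and left out.

```python
def to_vector_literal(vec):
    """Format a list of floats as a pgvector text literal, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(f"{x:.6f}" for x in vec) + "]"


# Cosine distance is 0 for identical directions, so 1 - distance serves as a score.
ANN_SQL = """
SELECT c.id, c.content, c.metadata, c.corpus_id,
       1 - (c.embedding <=> %(q)s::vector) AS score
FROM ai_helper.rag_chunk AS c
WHERE c.corpus_id = ANY(%(corpus_ids)s)
ORDER BY c.embedding <=> %(q)s::vector
LIMIT %(top_k)s
"""


def ann_params(query_vec, corpus_ids, top_k=5):
    """Build the parameter dict for ANN_SQL from an embedded query."""
    return {
        "q": to_vector_literal(query_vec),
        "corpus_ids": list(corpus_ids),
        "top_k": top_k,
    }
```

Restricting by corpus_id keeps the ACL decision from the first step in force during the vector search itself.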
Cross-corpus query¶
# Explicit corpora list (skips the default ACL discovery)
chunks = await app.ai.rag.query(
"výpočet bonusu po skončení smlouvy",
corpora=["legal-precedents", "mzdy-payroll-history"]
)
The ACL check stays active: if the caller has no grant on legal-precedents, the request fails with 403.
Document indexing¶
job = await app.ai.rag.index_document(
    corpus="mzdy-payroll-history",
    source="s3://app-mzdy/exports/payroll-2024.pdf", # or content_b64
    metadata={"klient": "XYZ s.r.o.", "rok": 2024}
)
# job.id, job.status
await job.wait(timeout=300)
Async Celery task:
1. Downloads the document (S3, HTTP)
2. Extracts text (PDF parser, OCR if needed)
3. Chunks it (1000 tokens with 200 overlap by default)
4. Sends the chunks to an embed.batch capability worker (sticky BGE-m3)
5. Inserts into the rag_chunk table
6. Updates rag_document.status = 'ready' and increments corpus.document_count
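The chunking step (1000 tokens with 200 overlap) can be sketched as a sliding window. The function below works on any pre-tokenized list; which tokenizer the indexer uses is not stated in the source, so `chunk_tokens` is a hypothetical helper that only shows the window arithmetic.

```python
def chunk_tokens(tokens, size=1000, overlap=200):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


tokens = list(range(2500))
chunks = chunk_tokens(tokens)
```

For a 2500-token document this yields windows starting at tokens 0, 800, and 1600, with each window repeating the last 200 tokens of the previous one.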
7. Auth flows¶
1. User JWT (from the launcher session)¶
Default for all user-facing flows. The vendor app gets the user JWT via the IPC session handshake from launcher2 (per the c376ea3 fix). It calls AI Helper with the JWT in the Authorization: Bearer header.
AI Helper backend:
- Validates the JWT against the core-api JWKS (AVAX_JWKS_URL)
- Extracts user_id, company_id, system_role, subscriptions
- Authorization: the caller (user.company_id) has a subscription to app-ai-helper on the alpha/beta/stable channel
- Rate limit per user (default 60 req/min, super_admin 600)
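The per-user rate limit can be illustrated with a sliding-window counter. This is an in-process sketch only: `SlidingWindowLimiter` and `limit_for` are hypothetical names, and a deployment with several API processes would more plausibly keep the counters in a shared store such as the Redis instance already present in the stack (an assumption, not something the source specifies).

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per user within the last `window_s` seconds."""

    def __init__(self, window_s: float = 60.0):
        self.window_s = window_s
        self._hits = defaultdict(deque)  # user_id -> timestamps of accepted requests

    def allow(self, user_id: str, limit: int, now: float = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[user_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= limit:
            return False
        hits.append(now)
        return True


def limit_for(system_role: str) -> int:
    """Limits from the text: 60 req/min by default, 600 for super_admin."""
    return 600 if system_role == "super_admin" else 60
```

A rejected call would typically be answered with HTTP 429; the source does not say which status code the service uses.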
2. Vendor M2M (service identity)¶
For cron, batch jobs, and the vendor's background services. A vendor (company with is_vendor=true) is issued an m2m_client_id + m2m_secret. Flow:
POST /auth/token (on core-api)
grant_type=client_credentials&
client_id=<m2m_client_id>&
client_secret=<m2m_secret>&
scope=ai-helper.chat ai-helper.embed ai-helper.rag.query
→ {access_token: "eyJ...", expires_in: 3600, token_type: "Bearer"}
Then the call to AI Helper:
Authorization: Bearer <m2m_token>
X-Vendor-App-Slug: app-legal # which app is calling (for ACL/quota)
The M2M token has sub=vendor:<vendor_id> and no user_id. RAG ACL applies as if a user were calling on behalf of the app.
MVP implementation: core-api endpoint POST /auth/token (client_credentials grant), table m2m_clients (client_id, hashed_secret, allowed_scopes, owner_vendor_id).
3. External API key (SaaS partners)¶
For clients from outside, via api.avaxis.cz. The client registers an API client in the AVAXIS portal (admin) and receives an api_key (e.g. avax_live_<32_random>).
POST https://api.avaxis.cz/v1/ai/chat/complete
X-API-Key: avax_live_abcd1234...
Content-Type: application/json
{model, messages, max_tokens, ...}
Backend:
- API key lookup in the ai_helper.api_client table
- Per-client rate limit + monthly quota
- Allowed capabilities (chat only, not image.generate?)
- Logging to ai_usage_log with api_client_id
Implemented on dev in the MVP (explicitly confirmed by the user), production rollout in 2.0 (with portal UI for key management and billing integration).
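One way to issue and check keys of the form avax_live_<32_random> is sketched below. Two points are assumptions made for the example: that the random part is 32 hex characters, and that only a SHA-256 digest of the key is stored (the source specifies a hashed secret for M2M clients but does not say how API keys are stored). `new_api_key` and `verify_api_key` are hypothetical helpers.

```python
import hashlib
import hmac
import secrets

PREFIX = "avax_live_"


def new_api_key():
    """Return (plaintext_key, digest); only the digest would be persisted."""
    key = PREFIX + secrets.token_hex(16)  # 16 random bytes -> 32 hex characters
    return key, hashlib.sha256(key.encode()).hexdigest()


def verify_api_key(presented: str, stored_digest: str) -> bool:
    """Compare digests in constant time to avoid leaking match length via timing."""
    digest = hashlib.sha256(presented.encode()).hexdigest()
    return hmac.compare_digest(digest, stored_digest)
```

The plaintext key is shown to the client once at issuance; lookups on X-API-Key then go through the digest.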
8. API endpoints¶
All under the /v1/* prefix (apps-gateway routes /apps/app-ai-helper/v1/* → strip → /v1/*).
Chat¶
Request body:
{
"messages": [
{"role": "system", "content": "..."},
{"role": "user", "content": "..."}
],
"model": "qwen3.6-27b", // optional, scheduler default
"max_tokens": 4096,
"temperature": 0.7,
"stream": true, // ignored for /complete
"priority": "interactive", // optional override
"rag": { // optional inline RAG
"enabled": true,
"corpora": ["laws-cr"], // optional, default accessible
"top_k": 5
}
}
Embed¶
RAG¶
POST /v1/rag/query → sync ANN search
POST /v1/rag/documents → async index job
GET /v1/rag/jobs/{job_id} → poll status
GET /v1/rag/corpora → list accessible
POST /v1/rag/corpora → create new (owner only)
POST /v1/rag/corpora/{id}/grants → grant read to another app
Image / video / speech¶
POST /v1/image/generate → async job
POST /v1/image/understand → sync (vision LLM)
POST /v1/speech/to-text → sync or SSE
POST /v1/speech/from-text → sync or SSE
POST /v1/video/generate → async job
GET /v1/jobs/{job_id} → poll status (unified pro async)
Workers (admin / monitoring)¶
GET /v1/workers → list (filtered by caller scope)
POST /v1/workers → register new (super_admin or vendor)
GET /v1/workers/{id} → detail
PUT /v1/workers/{id} → update (capabilities, environment, …)
DELETE /v1/workers/{id} → unregister
GET /v1/workers/{id}/health → live ping
Capabilities¶
GET /v1/capabilities → list all registered capabilities
POST /v1/capabilities → register new (super_admin)
GET /v1/capabilities/{id}/workers → which workers declare this capability
Generic call (extensible)¶
POST /v1/call/{capability_id} → any capability
body = capability-specific payload
the routing scheduler picks a worker
Usage / admin¶
GET /v1/usage/me → my usage this month (per user)
GET /v1/usage/company → company-wide (super_admin / company_admin)
GET /v1/admin/quota → admin: quota config (super_admin)
PUT /v1/admin/quota/{company_id} → set quota
9. SDK contract — avaxis_sdk.ai¶
from avaxis_sdk import AvaxApp
app = AvaxApp(slug="app-mzdy")
app.connect()
app.ready()
@app.on_session
def on_session():
# AI Helper is now usable (it needs the session JWT)
pass
# ── Chat ──
async for token in app.ai.chat(prompt="Spočti mzdu...", model=None):
print(token, end="", flush=True)
# Synchronous chat (vrátí full string)
text = await app.ai.chat_complete(prompt="...", model="qwen3.6-27b")
# ── Embed ──
vec: list[float] = await app.ai.embed("text")
vecs: list[list[float]] = await app.ai.embed_batch(["t1", "t2"])
# ── RAG ──
chunks = await app.ai.rag.query("zákon 89/2012")
# → [{content, metadata, source_corpus, score}]
chunks = await app.ai.rag.query("dotaz", corpora=["laws-cr"], top_k=5)
# Index document (async)
job = await app.ai.rag.index_document(
corpus="mzdy-payroll-history",
source="s3://...",
metadata={"klient": "..."}
)
await job.wait() # or await job.poll() for non-blocking
# Create new corpus
corpus_id = await app.ai.rag.create_corpus(
slug="mzdy-tariff-tables",
type="shared",
embedding_model="bge-m3"
)
# Grant read access to another app
await app.ai.rag.grant_read(corpus_id=corpus_id, reader_app="finance")
# ── Image ──
job = await app.ai.image.generate(
prompt="logo firmy XYZ, minimalist",
model="sdxl-base",
width=1024, height=1024
)
await job.wait()
image_url = job.result.url
# Vision (image understanding)
description = await app.ai.image.understand(
image_url="s3://...",
prompt="Co je na obrázku?"
)
# ── Speech ──
text = await app.ai.speech.transcribe(audio_url="s3://...") # STT
audio_url = await app.ai.speech.synthesize(text="Ahoj") # TTS
# ── Generic call ──
result = await app.ai.call(
"text_understanding.ner",
payload={"text": "Jan Novák..."},
timeout=10.0
)
# ── Capabilities discovery ──
caps = await app.ai.capabilities()
# → {"chat": {"models": [...]}, "embed": {...}, "text_understanding.ner": {...}}
# ── Usage check ──
usage = await app.ai.usage.me()
# → {"input_tokens": 12345, "output_tokens": 5678, "month_budget": 1000000}
SDK implementation under the hood¶
- HTTP client (httpx async) with connection pooling
- Backend URL: app.platform_url + "/apps/app-ai-helper" (from the launcher session)
- Auth: Authorization: Bearer <app.access_token> on every request
- Auto-refresh of the JWT before expiry (refresh token flow)
- Retry policy: 3× exponential backoff for 503 / network errors
- Streaming: SSE parser (event-stream content-type)
- Job polling: helper Job.wait(timeout) that polls /v1/jobs/{id} every 2 s
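The retry policy can be sketched without any HTTP library. The version below reads "3×" as up to three retries after the first attempt, which is an interpretation; `ServiceUnavailable`, `with_retry`, the 0.5 s base delay, and the small jitter are all assumptions introduced for the example.

```python
import asyncio
import random


class ServiceUnavailable(Exception):
    """Stand-in for an HTTP 503 response from the backend."""


async def with_retry(call, *, retries=3, base_delay=0.5, sleep=asyncio.sleep):
    """Await `call()`; on 503 or network errors retry with exponential backoff."""
    for attempt in range(retries + 1):
        try:
            return await call()
        except (ServiceUnavailable, ConnectionError):
            if attempt == retries:
                raise  # retries exhausted, surface the last error
            # Delays grow as 0.5 s, 1 s, 2 s, plus jitter to avoid synchronized retries.
            await sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```

The injectable `sleep` parameter exists only so the behavior can be exercised in tests without real waiting.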
10. Usage / quota tracking¶
Logging schema¶
CREATE TABLE ai_helper.usage_log (
id UUID PRIMARY KEY,
-- Caller identity
user_id UUID,
company_id UUID,
app_id UUID,
m2m_client_id UUID, -- for vendor M2M
api_client_id UUID, -- for external API
-- Request
capability VARCHAR(100) NOT NULL, -- 'chat.stream', 'embed', ...
worker_id UUID,
model VARCHAR(100),
priority VARCHAR(20),
-- Cost
input_tokens INT,
output_tokens INT,
duration_ms INT,
vram_used_mb INT,
-- Outcome
status_code INT, -- 200, 503, ...
error TEXT,
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_usage_company_month ON ai_helper.usage_log (company_id, created_at);
CREATE INDEX idx_usage_app_month ON ai_helper.usage_log (app_id, created_at);
Quota table (enforced in 2.0)¶
CREATE TABLE ai_helper.quota (
id UUID PRIMARY KEY,
scope_type VARCHAR(20) NOT NULL, -- 'company' | 'app' | 'api_client'
scope_id UUID NOT NULL,
monthly_input_tokens BIGINT,
monthly_output_tokens BIGINT,
rate_limit_per_minute INT,
-- Computed
current_month_usage BIGINT DEFAULT 0,
reset_at TIMESTAMPTZ,
created_at TIMESTAMPTZ DEFAULT NOW()
);
MVP: only logging to usage_log. Enforcement comes in 2.0 (middleware check before the request, increment after).
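The planned "check before, increment after" enforcement might look like the sketch below. The quota table keeps a single current_month_usage counter next to separate input and output limits, so this example assumes the counter is compared against the sum of both limits and that a NULL limit means unlimited; both are guesses, and `Quota`, `check_before`, and `record_after` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Quota:
    monthly_input_tokens: Optional[int]
    monthly_output_tokens: Optional[int]
    current_month_usage: int = 0


def monthly_budget(q: Quota) -> Optional[int]:
    """Total token budget, or None when either limit is unset (treated as unlimited)."""
    if q.monthly_input_tokens is None or q.monthly_output_tokens is None:
        return None
    return q.monthly_input_tokens + q.monthly_output_tokens


def check_before(q: Quota, max_tokens: int) -> bool:
    """Middleware check: would this request still fit into the monthly budget?"""
    budget = monthly_budget(q)
    return budget is None or q.current_month_usage + max_tokens <= budget


def record_after(q: Quota, input_tokens: int, output_tokens: int) -> None:
    """Increment the counter with the actual usage once the request has finished."""
    q.current_month_usage += input_tokens + output_tokens
```

In a real deployment the increment would need to be atomic in the database, since several requests for the same scope can finish concurrently.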
10.5 GPU workload scheduler (single-GPU mutex)¶
Status: TODO 2026-05-29, implement before the app-hotline P1 deploy
Audience: the second Claude on the app-ai-helper repo
Driver: AVAXIS internal infra (the gerry machine) has 1 GPU (RTX 3090) shared between the LM Studio chat/RAG worker and WhisperX (app-hotline STT). Total VRAM is 24 GB, but Qwen 14B Q4 + bge-m3 take ~11 GB and WhisperX Large + pyannote ~6 GB, and the two workloads are not run in parallel.
Use case¶
1. User clicks chat in the app-ai-helper UI
   → gpu_sched.acquire("lm_studio")
   → LM Studio worker available → response
2. app-hotline operator ends a call → upload WAV → POST transcribe_structured
   → gpu_sched.acquire("whisperx")
      ├─ If current=lm_studio: POST gerry /unload model (~5s)
      └─ Start whisperx subprocess / docker exec (~10s)
   → STT of 10 min audio (~1-3 min on RTX 3090)
   → ai-helper chat UNAVAILABLE during the STT cycle (returns 503 Service Busy
     or is queued with an ETA)
3. After STT finishes → gpu_sched.release("whisperx")
   → If the queue has a pending chat request → acquire("lm_studio") → resume
   → Else → stays "free" until the next request
Implementační kontrakt¶
# backend/app/services/gpu_scheduler.py
import asyncio


class GPUWorkloadManager:
    _current: str = "free"  # "lm_studio" | "whisperx" | "free"
    _lock = asyncio.Lock()
    _enabled: bool = True  # flip to False after the HW upgrade (2 GPUs)

    async def acquire(self, worker: str, *, timeout: int = 120) -> bool:
        if not self._enabled:
            return True  # no-op in multi-GPU mode
        try:
            # Honor the timeout while another swap holds the lock
            await asyncio.wait_for(self._lock.acquire(), timeout)
        except asyncio.TimeoutError:
            return False
        try:
            if self._current != worker:
                await self._unload_current()
                await self._load(worker)
                self._current = worker
            return True
        finally:
            self._lock.release()

    async def _unload_current(self):
        if self._current == "lm_studio":
            # POST {worker_url}/v1/models/{model}/unload (LM Studio API)
            ...
        elif self._current == "whisperx":
            # docker stop whisperx-webui (or POST /shutdown)
            ...

    async def _load(self, worker: str):
        if worker == "lm_studio":
            # POST {worker_url}/v1/models/{model}/load
            # Wait until model status == "loaded"
            ...
        elif worker == "whisperx":
            # docker start whisperx-webui (or subprocess start)
            # Wait until http://localhost:7860/health returns 200
            ...
Konfigurace (env / settings)¶
# app-ai-helper .env
GPU_SCHEDULER_ENABLED=true # false after adding the dedicated 4060 Ti
GPU_SCHEDULER_LM_STUDIO_URL=http://host.docker.internal:1234
GPU_SCHEDULER_WHISPERX_URL=http://host.docker.internal:7860
GPU_SCHEDULER_DEFAULT_AFTER_IDLE=lm_studio # warmup after N min idle
Integration points¶
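- Chat endpoint (/v1/chat/stream, /v1/chat/complete): acquires the GPU for "lm_studio" before dispatching to the worker, as in step 1 of the use case.
- Capability speech.transcribe_structured (new, for app-hotline): acquires the GPU for "whisperx" and releases it when STT finishes, as in steps 2 and 3 of the use case.
- Idle timeout: if the worker is inactive for 5 min, release the lock (GPU free). The next acquire is then faster (no unload).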
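A handler could wrap its GPU work in an async context manager so that release always runs, even when the job fails. The sketch below uses a stand-in `FakeGPUManager` instead of the real GPUWorkloadManager so that it is self-contained; the `gpu` helper, the `release` signature, and the `transcribe_structured` body are illustrative assumptions.

```python
import asyncio
from contextlib import asynccontextmanager


class FakeGPUManager:
    """Stand-in for GPUWorkloadManager that records acquire/release calls."""

    def __init__(self):
        self.current = "free"
        self.log = []

    async def acquire(self, worker, *, timeout=120):
        self.log.append(("acquire", worker))
        self.current = worker
        return True

    async def release(self, worker):
        self.log.append(("release", worker))


@asynccontextmanager
async def gpu(manager, worker, timeout=120):
    """Hold the GPU for `worker` for the duration of the block."""
    if not await manager.acquire(worker, timeout=timeout):
        raise RuntimeError("503 Service Busy")
    try:
        yield
    finally:
        await manager.release(worker)


async def transcribe_structured(manager, audio_path):
    async with gpu(manager, "whisperx"):
        await asyncio.sleep(0)  # placeholder for the actual STT call
        return {"audio": audio_path, "segments": []}


manager = FakeGPUManager()
result = asyncio.run(transcribe_structured(manager, "call.wav"))
```

The chat endpoints would use the same wrapper with "lm_studio" as the worker name.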
After hardware upgrade (RTX 4060 Ti dedicated to WhisperX)¶
- GPU_SCHEDULER_ENABLED=false → acquire() always returns True, no unload
- LM Studio binds to GPU 0 (3090) via CUDA_VISIBLE_DEVICES=0
- WhisperX binds to GPU 1 (4060 Ti) via CUDA_VISIBLE_DEVICES=1
- Both workers run continuously, with zero switching latency
- Acquire/release remain as no-ops for a future re-enable (extra GPU workload, scaling, etc.)
Status backend response¶
GET /v1/workers/gpu_state (for the UI):
{
"scheduler_enabled": true,
"current_worker": "lm_studio",
"vram_used_gb": 11.2,
"vram_total_gb": 24,
"since": "2026-05-29T14:23:11Z",
"next_swap_eta": null
}
UI feedback in app-ai-helper¶
In the chat tab, if current_worker != "lm_studio":
- Banner "⏳ Worker is busy with an STT job (~45s remaining)" + countdown
- Disable the input field during the swap
11. Roadmap¶
MVP (1.0): implement first¶
- ✅ Worker registry (scope: platform/vendor/company), env tiers
- ✅ Queue + scheduler (4 priorities, no preemption, sticky model)
- ✅ Capabilities: core list + extensible registry
- ✅ RAG (3 corpus types, ACL, query, async document indexing)
- ✅ Auth: user JWT + vendor M2M
- ✅ External API key: dev only (test environment)
- ✅ Usage logging (no enforcement)
- ✅ SDK avaxis_sdk.ai.* namespace
1.1 — vendor portal UX¶
- Vendor registers a worker via the portal UI
- Toggle "Direct-dial vs Agent" in the register dialog
- Auto-detect transport_class (TCP probe on the worker URL)
- GPU workload scheduler (§10.5): single-GPU mutex for the AVAXIS internal gerry machine before the RTX 4060 Ti is added. Implement before the app-hotline P1 deploy (which waits on the speech.transcribe_structured capability).
1.5 — production¶
- External API portal UI for clients (API key issuance + billing)
- Quota enforcement (middleware)
- External fallback providers (OpenAI/Anthropic opt-in)
- Per-company dashboard (usage, quota, top capabilities)
2.0 — advanced¶
- Webhook callbacks for async jobs
- Worker auto-scaling (kubernetes deploy on demand)
- Multi-region failover (vm-gateway s GeoDNS)
- Fine-tuning pipeline (PEFT/LoRA per company)
- Custom model upload (vendor brings model artifact)
3.0 — future¶
- Federated learning (the model improves from usage across companies)
- Agent loop framework (multi-step reasoning, tool use)
- Embedding model selection per RAG corpus (mix dimensions)
12. Related¶
- apps-gateway.md: gateway routing for /apps/app-ai-helper/*
- per-app-container.md: app-ai-helper deploy model
- ai-chat.md: V1 chat spec (legacy, refactored into this one)
- ai-rag.md: V2 RAG spec (legacy, refactored)
- vendor-onboarding.md: vendor app flow (will include AI Helper integration)