Introduction

LLM Manager is a terminal UI (TUI) for managing local LLM models. It lets you search HuggingFace, download GGUF models, load them via llama.cpp’s llama-server, and chat with them — all from your terminal.

Features

Model Discovery & Downloads

HuggingFace search — GGUF-filtered, paginated results with multiple sort options
GGUF file browser — browse and select specific GGUF files for each model
Download manager — progress tracking with speed, ETA, and cancellation support
Multi-word search — space-separated words use AND logic for precise filtering

Configuration

Per-model settings — full control over context length, GPU layers, sampling parameters, and more
Profiles — save and quickly switch between named presets of settings
System Prompt Presets — named system prompts for different use cases
Multi-backend support — CPU, Vulkan, ROCm, ROCm Lemonade, and CUDA (13 platform-specific variants)
Speculative decoding — MTP and other speculative decoding types
YaRN RoPE — extend context beyond training length
Benchmark Tuning — auto-tune model parameters for optimal performance

Dashboard & Networking

WebSocket Dashboard — real-time metrics visualization in a web browser
TLS support — secure connections with auto-generated self-signed certificates
API proxy — expose an OpenAI-compatible API with CORS and SSE streaming
Web Search — automatically search the web via SearXNG when messages contain comparison/research keywords

Interface

Log panel — expandable/collapsible with following and manual scroll modes
README rendering — full markdown renderer for HuggingFace model documentation
Model info — GGUF metadata display with HuggingFace URL navigation
CmdLine overlay — view the full llama-server command line (Ctrl+K)
Panel resize — drag left/right border, Shift+←/→ (horizontal), Shift+↑/↓ (Server/LLM Settings split, vertical)
Mouse support — click panels to focus, scroll in logs, README, and settings
Multi-language UI — switch between English, French, and Italian with Ctrl+L

Prerequisites

Rust toolchain — edition 2024
HuggingFace account — required for downloading gated models
GPU (optional) — NVIDIA (CUDA), AMD (ROCm/ROCm Lemonade), or Intel (Vulkan) for accelerated inference; CPU-only inference is fully supported

Screenshot

LLM Manager

Quick Start

git clone https://github.com/aginies/llmtui.git
cd llmtui
cargo build --release
cargo run --release

See the Getting Started guide for a full walkthrough.

Getting Started

This guide walks you through installing llm-manager, searching for models, loading one, configuring settings, and connecting a client.

1. Install & Start

git clone https://github.com/aginies/llmtui.git
cd llmtui
cargo build --release
cargo run --release

On first launch, llm-manager creates a default configuration in ~/.config/llm-manager/config.yaml and sets up the models directory at ~/.local/share/llm-manager/models/.

[Screenshot: LLM Manager]

2. Search & Download Models

Press / to enter search mode, type a query (e.g., qwen2.5), and press Enter.

Results appear sorted by relevance. Press Ctrl+S to cycle sort order (Relevance / Downloads / Likes / Trending / Created). Press Ctrl+B to go back, or scroll down for more results.

[Screenshot: Search Results]

Press Enter on a result to browse available GGUF files:

[Screenshot: GGUF File Browser]

Select a file and press Enter to download. The download progress shows speed (MiB/s), ETA, and status. Press p to pause or resume, and Alt+C to cancel.

3. Load a Model

Once a model is downloaded (or already exists locally):

Select the model in the Models panel
Press l (or Enter) to load it

The loading process shows a progress bar with phases: server starting, loading model weights, loading metadata, loading tensors (with GPU layer count), server listening, and ready.

4. Configure Settings

Press F2 to open the Server Settings panel.

Server Settings

Setting	Description
Host	Bind address (default: `127.0.0.1`)
Backend	GPU acceleration (auto-detected, or CPU/Vulkan/ROCm/CUDA)
Threads	CPU threads for generation
Threads Batch	CPU threads for batch processing
Mode	Server mode (Normal, Router, Bench, BenchTune)
API Endpoint	Enable OpenAI-compatible API proxy
Dashboard	Enable WebSocket dashboard
RPC Workers	Manage distributed inference nodes
Language	UI language (en/fr/it)

Press F3 to open the LLM Settings panel. Toggle expert mode with Ctrl+X to reveal 17 additional parameters.

[Screenshot: LLM Settings Expert Mode]

Press Ctrl+H for panel-specific help:

[Screenshot: Help Panel]

Saving Settings

Ctrl+S — Save settings for the selected model
Ctrl+R — Reset to defaults
Ctrl+P — Apply a profile (built-in: Qwen, Gemma, Llama, Mistral, Phi)

Settings are stored in ~/.config/llm-manager/models/<model_name>.yaml. For global defaults, edit ~/.config/llm-manager/config.yaml directly.

5. Connect a Client

With the API Endpoint enabled (default port 49222), you can connect any OpenAI-compatible client:

curl

curl http://localhost:49222/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama","messages":[{"role":"user","content":"Hello"}]}'

With auth key:

curl http://localhost:49222/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-api-key" \
  -d '{"model":"llama","messages":[{"role":"user","content":"Hello"}]}'
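The same request can be issued from a script. The following is a minimal sketch using only the Python standard library; it assumes the default port 49222 and the `llama` model name from the curl examples above, and the helper names `build_chat_request` and `send` are illustrative, not part of llm-manager.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:49222",
                       api_key=None, model="llama"):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    body = {"model": model,
            "messages": [{"role": "user", "content": prompt}]}
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Only needed when an auth key is configured on the proxy.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the text of the first reply."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        reply = json.load(resp)
    # Response layout follows the OpenAI chat completions format.
    return reply["choices"][0]["message"]["content"]
```

With a model loaded and the API Endpoint enabled, `send(build_chat_request("Hello"))` should return the model's answer as a string.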

opencode

See the opencode documentation for configuring opencode to use llm-manager’s API endpoint.

Dashboard

Open the WebSocket Dashboard in your browser:

http://localhost:49223

See the Dashboard documentation for authentication and TLS configuration.

Usage

Serve Mode

Run a model directly with llama-server and expose an OpenAI-compatible API:

# Serve a model with API proxy on port 49222
./build.sh serve --model /path/to/model.gguf --api-port 49222

# Serve with a settings profile
./build.sh serve --model model.gguf --profile qwen

# Serve with API key authentication (same key for API proxy and dashboard)
./build.sh serve --model model.gguf --api-port 49222 --api-key secret --ws-enable

# Serve with API proxy and WebSocket dashboard
./build.sh serve --model model.gguf --api-port 49222 --ws-enable

# Serve with custom dashboard port
./build.sh serve --model model.gguf --api-port 49222 --ws-enable --ws-port 8081

# Serve with a custom backend binary path
./build.sh serve --model model.gguf --backend-binary /path/to/custom/llama-server

# Serve bound to all network interfaces
./build.sh serve --model model.gguf --host 0.0.0.0

# Redirect logs to a file (useful for systemd)
./build.sh serve --model model.gguf --log-file /var/log/llm-manager/model.log

# Combine options
# Serve with API proxy and WebSocket dashboard on a specific host
./build.sh serve --model model.gguf --api-port 49222 --ws-enable --host 192.168.1.100

# Combine all options
./build.sh serve --model model.gguf --api-port 49222 --ws-enable --host 0.0.0.0 --backend-binary /opt/rocm/bin/llama-server --log-file /var/log/llm-manager/model.log

The serve command automatically resolves the llama-server binary from the backend-specific directory (~/.local/share/llm-manager/bin/llama-server-{cpu,vulkan,rocm}-{version}/) and sets LD_LIBRARY_PATH for shared libraries. If the binary is not found, it downloads it from the llama.cpp GitHub releases.

Three flags override the defaults: --backend-binary sets a custom binary path, --host overrides the network bind address for both the API proxy and the WebSocket server (the default comes from the config), and --log-file redirects logs to a file instead of stdout.
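The directory pattern above can be illustrated with a short sketch. This is not the application's resolver: the version string `b0000` is a made-up placeholder, and prepending the directory to LD_LIBRARY_PATH is an assumption about how the shared libraries are located.

```python
import os
from pathlib import Path

# Root taken from the path pattern quoted in this section.
BIN_ROOT = Path.home() / ".local/share/llm-manager/bin"


def backend_dir(backend, version, root=BIN_ROOT):
    """Directory expected to hold llama-server for one backend/version pair."""
    return root / f"llama-server-{backend}-{version}"


def launch_env(bin_dir, base_env=None):
    """Copy the environment and put bin_dir first on LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_LIBRARY_PATH")
    env["LD_LIBRARY_PATH"] = f"{bin_dir}:{existing}" if existing else str(bin_dir)
    return env
```

For example, `backend_dir("vulkan", "b0000")` yields a path ending in `llama-server-vulkan-b0000`, and `launch_env(...)` returns a dictionary suitable for the `env` argument of `subprocess.Popen`.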

Model Management

Listing Models

The Models panel shows all .gguf files found in your models directories (recursively). The display name is the relative path from the models directory.

f — Filter local models by name (case-insensitive substring match)
Esc — Clear active filter and return to full list

Loading and Unloading

l or Enter — Load selected model
u — Unload model from server
Ctrl+D — Delete model (with confirmation)

When a model is loaded, it shows [LOADED: <name>] in green bold. You can load multiple models when using Router mode (Work In Progress — not yet selectable in TUI, enable via config.yaml).

Deleting Models

Pressing Ctrl+D prompts for confirmation before moving the model file and its YAML config to ~/.config/llm-manager/unused/. Both can be restored later.

Search

Search mode lets you browse and download GGUF models from HuggingFace:

Key	Action
`/`	Open search input modal — type query and press `Enter` to search
`Enter`	Select GGUF files for the highlighted model
`Esc`	Exit search
`Ctrl+S`	Cycle sort order
`Ctrl+B`	Go back one page
`Down` (at bottom)	Load more results
`Ctrl+R`	Fetch and view README for the selected model

Multi-word Search

Type space-separated words (e.g. qwen opus) to search with AND logic — all words must match the model name. Matching words are highlighted in cyan in the results list.
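The AND logic can be sketched in a few lines. This example assumes case-insensitive substring matching (the behavior documented for the local filter); the model names in the usage note are invented for illustration.

```python
def matches_all_words(query, model_name):
    """True when every space-separated word of query occurs in model_name."""
    words = query.lower().split()
    name = model_name.lower()
    # all() over an empty list is True, so an empty query matches everything.
    return all(word in name for word in words)


def filter_models(query, model_names):
    """Keep only the names that match every word of the query."""
    return [name for name in model_names if matches_all_words(query, name)]
```

With the hypothetical names `["org/Qwen-Opus-GGUF", "org/Qwen-Base-GGUF"]`, the query `qwen opus` keeps only the first entry, because the second lacks `opus`.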

[Screenshot: Search Results]

GGUF File Browser

When viewing GGUF files for a model:

Key	Action
`j` / `k`	Navigate files
`Enter`	Download selected file
`Esc`	Go back to search results
`Alt+C`	Cancel download and remove temp file

Download Panel

When one or more files are downloading, the Download panel appears at the bottom of the screen, showing progress, speed (MiB/s), ETA, and status for each download. Before downloading, the app checks available disk space and warns if insufficient. Cancelled downloads automatically remove the temporary file.

Key	Action
`j` / `k`	Navigate downloads
`p`	Pause / Resume selected download
`Alt+C`	Cancel selected download and remove temp file

Status indicators: Downloading (yellow), Paused (white), Complete (green), Cancelled (red), Error (red).
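The speed, ETA, and disk-space figures described for the Download panel can be derived as in the sketch below. It shows one plausible calculation (average speed since the start of the transfer), not the application's actual code; the function names are illustrative.

```python
import shutil

MIB = 1024 ** 2


def transfer_stats(bytes_done, bytes_total, elapsed_seconds):
    """Return (speed in MiB/s, ETA in seconds or None if unknown)."""
    speed = bytes_done / elapsed_seconds if elapsed_seconds > 0 else 0.0
    remaining = max(bytes_total - bytes_done, 0)
    eta = remaining / speed if speed > 0 else None
    return speed / MIB, eta


def has_enough_space(directory, bytes_needed):
    """Check free space on the filesystem that holds directory."""
    return shutil.disk_usage(directory).free >= bytes_needed
```

A real progress display would usually smooth the speed over a recent window so that the ETA does not jump after a pause and resume.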

Loading Models

When you load a model, the application:

Resolves the llama-server binary for the selected backend (CPU/Vulkan/ROCm)
Spawns the server with the current settings
Loads the model via the server’s /models/load API
Polls the server’s /metrics and /health endpoints for status
Displays a progress bar showing loading phases
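The polling step in the list above could look roughly like the following sketch. It assumes that llama-server answers `/health` with HTTP 200 once the model is ready and with an error status while it is still loading; the helper names and the retry values are illustrative.

```python
import time
import urllib.error
import urllib.request


def server_is_ready(base_url, timeout=2.0):
    """Probe /health once; any error or non-200 status counts as not ready."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and HTTP error statuses.
        return False


def wait_until_ready(probe, attempts=60, delay=0.5):
    """Call probe() until it returns True or the attempts run out."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay)
    return False
```

Passing the probe as a callable keeps the retry loop independent of the network, for example `wait_until_ready(lambda: server_is_ready("http://127.0.0.1:9090"))`, where the port is only an example value.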

Loading Phases

The progress bar tracks:

Server starting (8%) — llama.cpp binary is launched
Loading model (7%) — weights file is being read
Loading metadata (7%) — GGUF metadata is parsed
Loading tensors (70%) — tensors are loaded and offloaded to GPU
Server listening (8%) — HTTP server is ready
Complete — model is ready for inference

During tensor loading, the progress bar shows offloaded layers (e.g., 16/32) parsed from llama.cpp’s log output.
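One way to read the percentages above is as weights that add up to 100. The sketch below turns a phase name and a completion fraction into an overall percentage; treating the numbers as cumulative weights is an assumption, and the function name is illustrative.

```python
# Phase weights as listed in this section; they sum to 100.
PHASES = [
    ("Server starting", 8),
    ("Loading model", 7),
    ("Loading metadata", 7),
    ("Loading tensors", 70),
    ("Server listening", 8),
]


def overall_progress(phase_name, fraction=0.0):
    """Overall percentage when phase_name is `fraction` (0..1) complete."""
    done = 0.0
    for name, weight in PHASES:
        if name == phase_name:
            return done + weight * min(max(fraction, 0.0), 1.0)
        done += weight
    raise ValueError(f"unknown phase: {phase_name}")
```

Under this reading, 16 of 32 layers offloaded during tensor loading corresponds to 8 + 7 + 7 + 70 × 0.5 = 57%.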

Settings

Server Settings

Setting	Default	Description
Host	127.0.0.1	Bind address for the llama.cpp server. Use `0.0.0.0` to accept connections from other machines.
Backend	auto-detected	Acceleration backend, auto-detected from the GPU (CUDA for NVIDIA, ROCm for AMD, Vulkan for Intel). Options: `cpu` (CPU-only), `vulkan` (NVIDIA/AMD/Intel GPU), `rocm` (AMD GPU), `rocm-lemonade` (AMD optimized), `cuda` (NVIDIA CUDA 12.8). Shows the currently selected version.
Threads	(physical cores)	CPU threads for generation. Set to your physical core count for best performance.
Threads Batch	8	CPU threads for batch processing (prompt evaluation).
Mode	Normal	Server mode: `Normal` (single model), `Router` (multiple models), `Bench` (run llama-bench), or `BenchTune` (parameter auto-tuning).
API Endpoint	false	Enable the API proxy server (see Serve Mode).
Dashboard	false	WebSocket dashboard server (port 49223). Press `Enter` to configure (enabled, port, auth key, TLS).
RPC Workers	None	Open a dedicated window to manage distributed inference nodes (IP:Port).
Language	en	UI language. Press `Enter` to cycle between English, French, and Italian.

Note: The Server Settings panel is hidden while a server is running, so F2 opens it only when no server is active.

LLM Settings

The LLM Settings panel has 19 standard fields, 17 expert fields (revealed with Ctrl+X), and 19 ultra fields (hidden even in expert mode). Left/Right adjust values in fine steps; +/- make coarse changes. Toggle fields respond to e or Ctrl+E.

LLM Settings

Loading

Field	Default	Description
Prompt	General	System prompt preset that defines the model’s initial behavior. Presets include General, Coder, Thinker, Mathematician, and any user-defined prompts.
Context	131072	Context window size in tokens. Larger values consume more VRAM and RAM. Models often have a maximum context length (e.g., 32K, 128K).
Keep in memory	false	Locks model weights in RAM (`--mlock`) to prevent the OS from swapping them out. Useful when repeatedly loading/unloading models. Increases RAM usage.

The Ctx (U) column in the Models panel shows the user-configured context length from LLM settings (the (U) suffix distinguishes it from the model’s actual loaded context). This value comes from the Context field above and applies to all loaded models unless overridden per-model.

GPU Offload

Field	Default	Description
GPU Layers	Auto	Number of model layers offloaded to GPU memory. `Auto` lets llama.cpp decide based on available VRAM. `Specific` sets an exact number. `All` offloads every layer (`-ngl 999`).
Flash Attention	true	Enables Flash Attention 2 for faster inference with lower memory usage. Requires GPU support. Can improve throughput by 20-40%.
KV Cache Offload	true	Keeps the KV cache in GPU memory (llama.cpp’s default). Disabling it keeps the cache in system RAM instead, which frees VRAM for model weights at the cost of slower cache access.
Cache Type K	F16	Data type for the key cache. Options: F32 (most accurate, most memory), F16 (default), BF16 (better than F16 for some models), Q8_0, Q5_0, Q5_1, Q4_0, Q4_1, Iq4Nl.
Cache Type V	F16	Data type for the value cache. Same options as Cache Type K. Using lower precision reduces VRAM but may affect quality.
Active Experts	1	For Mixture-of-Experts (MoE) models, the number of experts activated per token. Higher values improve quality but increase compute.

Evaluation

Field	Default	Description
Eval Batch	512	Logical maximum batch size for evaluation. Larger batches improve throughput but increase memory usage. Set to the model’s native context length for single-sequence inference.
Unified KV	true	Shares KV cache across sequences, reducing memory usage when running multiple prompts. Can cause cache eviction conflicts.

Sampling

Field	Default	Description
Seed	-1	Random seed for reproducible outputs. `-1` means random each time. Set to a fixed value for debugging or reproducibility.
Temperature	0.8	Controls randomness in sampling. Higher values (1.0-2.0) produce more creative/divergent outputs. Lower values (0.0-0.5) produce more deterministic/crisp outputs.
Top-k	40	Limits sampling to the k most likely next tokens. `0` disables. Smaller values make outputs more focused. Typical: 20-50.
Top-p	0.95	Nucleus sampling: limits to tokens whose cumulative probability reaches p. `1.0` disables. Lower values (0.8-0.95) reduce randomness.
Min P	0.0	Minimum probability threshold for sampling. Tokens with probability below this fraction of the highest-probability token are excluded. Useful for controlling extreme outputs.
Max Tokens	None (unlimited)	Maximum tokens to generate per response. None means no limit (until EOS token).
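To make the interaction of Top-k, Top-p, and Min P concrete, the sketch below filters a toy probability table. It is a simplified illustration and not llama.cpp's sampler chain: the order of the three filters here is an assumption, and the token names in the usage note are made up.

```python
def filter_candidates(probs, top_k=40, top_p=0.95, min_p=0.0):
    """Apply top-k, then top-p, then min-p, and renormalize the survivors."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    if top_k > 0:  # 0 disables top-k
        ranked = ranked[:top_k]
    if top_p < 1.0:  # 1.0 disables top-p
        kept, cumulative = [], 0.0
        for token, p in ranked:
            kept.append((token, p))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = kept
    if min_p > 0.0:
        # Threshold is a fraction of the most likely token's probability.
        threshold = min_p * ranked[0][1]
        ranked = [(token, p) for token, p in ranked if p >= threshold]
    total = sum(p for _, p in ranked)
    return {token: p / total for token, p in ranked}
```

For the toy distribution `{"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}`, `top_k=3` drops `d`, and `top_p=0.75` then keeps only `a` and `b`, whose probabilities are rescaled to sum to 1.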

Repetition Control

Field	Default	Description
Repetition Penalty	1.1	Penalizes tokens that have already appeared. Values > 1.0 reduce repetition. Typical: 1.1-1.2.
Rep. Last N	64	Number of recent tokens to consider for repetition penalty. `-1` uses the full context.

Yarn RoPE

Field	Default	Description
Yarn RoPE	false	Enables YaRN (Yet another RoPE extensioN) for extending context beyond the model’s training length.
Yarn Params	—	Opens a modal to configure three floating-point values: `rope_scale` (default 1.0, multiplies context), `rope_freq_base` (default 0.0, overrides the model’s base frequency), `rope_freq_scale` (default 1.0, scales the frequency). Only digits, `.`, `-`, `e`, and `E` are accepted.

[Screenshot: Yarn RoPE Parameters]

Field	Default	Description
Tags	None	Per-model tags stored in the YAML config. Press `Enter` on this field, or `t` anywhere in the LLM Settings panel, to open the tag editor modal.

Backend

Field	Default	Description
LLama.cpp Version	Latest	Shows the currently selected backend version. Press `Enter` to open the backend version picker.

Expert Mode

Press Ctrl+X to toggle expert mode, which reveals 17 additional parameters:

Loading (expert): NUMA (None/Distribute/Isolate/Numactl)

GPU (expert): Cache Type K (toggle), Cache Type V (toggle), Main GPU, Fit, Active Experts (toggle)

Sampling (expert): Mirostat (Off/1/2), Mirostat LR, Mirostat Ent, Ignore EOS (toggle)

Repetition (expert): Presence Penalty (toggle, -2.0 to 2.0), Frequency Penalty (toggle, -2.0 to 2.0)

Speculative (expert): MTP (toggle), Spec Type, Spec Draft N Max

Evaluation (expert): SWA Full Cache (toggle), Cache Reuse

These fields follow the same navigation and editing rules as standard fields: arrow keys adjust values, Enter enters direct edit mode, and dirty fields are marked with a red name and a trailing *.

Ultra Fields

19 ultra fields are hidden even in expert mode. They include: Typical P, Mirostat, Mirostat LR, Mirostat Ent, Ignore EOS, Samplers, DRY Multiplier, DRY Base, DRY Allowed Length, DRY Penalty Last N, Threads Batch, UBatch Size, Keep, Split Mode, Tensor Split, Main GPU, Fit, Embedding, RPC. These require direct config file editing or profile application.

Cache Type K/V options: F32, F16, BF16, Q8_0, Q5_0, Q5_1, Q4_0, Q4_1, Iq4Nl

Changing Values

Use Left/Right to adjust numeric fields by 1, or Up/Down for larger steps. Toggle fields respond to e or Ctrl+E. Dirty (changed) fields have the name in red and a trailing *. The status bar shows *unsaved* when settings are dirty.

[Screenshot: Unsaved Settings]

Saving Settings

Ctrl+S — Save settings for the selected model
Ctrl+R — Reset to defaults
e / Ctrl+E — Toggle enabled/disabled (for Keep in memory, Flash Attention, KV Cache Offload, Cache Type K/V, Fit, Unified KV, Max Tokens, Presence/Frequency Penalty, Max Concurrent Pred, MTP, Ignore EOS, Yarn RoPE, Active Experts, SWA Full Cache)
Ctrl+X — Toggle expert mode (reveals 17 additional parameters)

Dirty (changed) fields are highlighted with red names and a trailing *.

Keyboard Shortcuts

Models Panel

List Mode (local models)

Key	Action
`j` / `k` / `Up` / `Down`	Navigate model list
`Enter` / `l`	Load selected model
`u`	Unload selected model (prompts confirmation)
`Ctrl+D` / `Del`	Delete selected model (with confirmation)
`f`	Enter local filter mode (type to filter, `Esc` to cancel, `Enter` to confirm)
`Ctrl+S`	Cycle list sort order (Name, Params, Qual, Context)
`Ctrl+G`	Show GGUF filename explanation
`Shift+A`	Open About modal

Search Mode (HuggingFace)

Key	Action
`j` / `k` / `Up` / `Down`	Navigate search results
`/`	Enter search input mode — type query, press `Enter` to search
`Enter`	Select result: fetch README, then open GGUF files view
`Esc`	Exit search mode, return to List mode
`l`	View GGUF files for selected model
`S`	Cycle search sort order (Relevance, Downloads, Likes, Trending, CreatedAt)
`Ctrl+S`	Cycle sort order (same as `S`)
`Ctrl+B`	Go to previous page of results
`Down` (at bottom)	Load more results (pagination)
`Ctrl+R`	Fetch README for selected model and switch to README panel

Files Mode (GGUF file browser)

Key	Action
`j` / `k` / `Up` / `Down`	Navigate GGUF files
`Enter`	Download selected GGUF file
`Right`	Switch to README panel
`Esc`	Return to search results

BenchTune Mode

Key	Action
`Up` / `Down`	Navigate benchmark results
`Enter`	View output for selected benchmark result
`Esc`	Cancel benchmark, return to List mode

Log Panel

Key	Action
`j` / `k` / `Up` / `Down`	Scroll log entries
`g` / `Home`	Jump to top, turn off follow mode
`G` / `End`	Jump to bottom, turn on follow mode
`PageUp` / `PageDown`	Scroll 15 lines up/down
`f`	Toggle follow mode (auto-scroll to newest)
`Enter`	Expand log panel
`Esc`	Collapse log panel

Server Settings Panel

Key	Action
`j` / `k` / `Up` / `Down`	Navigate settings fields
`Enter`	Activate selected field (opens picker, toggles, or cycles)
`Left` / `h`	Decrease value (Threads, Threads Batch)
`Right` / `l`	Increase value (Threads, Threads Batch)
`Ctrl+S`	Save settings

Field-specific Enter behavior:

Field	Action
Host	Open Host picker modal
Backend	Open Backend picker modal
Threads	Cycle threads value
Threads Batch	Cycle threads batch value
Mode	Cycle: Normal → Bench GPU → BenchTune → Normal
API Endpoint	Open API Endpoint picker modal
Dashboard	Open Dashboard URL modal
RPC Workers	Open RPC Manager modal
Web Search	Open Web Search picker modal
Language	Cycle language (en → fr → it → en)

LLM Settings Panel

Key	Action
`j` / `k` / `Up` / `Down`	Navigate settings fields
`Ctrl+PageDown` / `Ctrl+D`	Jump down 10 fields
`Ctrl+PageUp` / `Ctrl+U`	Jump up 10 fields
`PageDown`	Scroll down 5 fields
`PageUp`	Scroll up 5 fields

Edit Values

Key	Action
`Left` / `Backspace`	Decrease value (or remove char from edit buffer)
`Right`	Increase value
`0-9`	Append digit to edit buffer
`-`	Append minus to edit buffer
`.`	Append decimal point
`Enter` (with buffer)	Apply edited value
`Esc` (with buffer)	Cancel edit, clear buffer

Global Shortcuts

Key	Action
`Ctrl+S`	Save settings
`Ctrl+R`	Reset settings (confirmation if dirty)
`Ctrl+E`	Toggle current field (boolean fields)
`Ctrl+X`	Toggle expert mode (reveals 17 additional parameters)
`t`	Open Tags modal

Field-specific Enter actions:

Field	Action
Prompt	Open Prompt Picker modal
GPU Layers	Enter edit mode or cycle GPU layers (Auto → Specific → All)
Chat Template	Open Chat Template picker
Cache Type K / V	Cycle cache type (F32, F16, BF16, Q8_0, …)
Max Concurrent Pred	Open Max Concurrent picker
Yarn Params	Open Yarn RoPE Settings modal
Spec Type	Open Speculative Decoding Type picker
Tags	Open Tags modal
LLama.cpp Version	Open Backend version picker

Profiles Panel

Key	Action
`j` / `k` / `Up` / `Down`	Navigate profiles
`PageUp` / `PageDown`	Scroll 5 entries up/down
`Enter`	Apply selected profile and switch to LLM Settings
`s` / `Ctrl+S`	Save current settings as a new profile
`d`	Delete selected user profile (moved to `unused_profiles/`)
`Esc`	Return to LLM Settings

System Prompt Presets Panel

List Mode

Key	Action
`j` / `k` / `Up` / `Down`	Navigate presets
`PageUp` / `PageDown`	Scroll 5 entries up/down
`Enter`	Apply selected preset and switch to LLM Settings
`e`	Edit selected preset (enters edit mode)
`n`	Create new preset (enters edit mode)
`d`	Delete selected custom preset (not built-in)
`Esc`	Return to LLM Settings

Edit Mode

Key	Action
`Enter`	Insert newline at cursor
`Ctrl+S`	Save preset and exit edit mode
`Esc`	Cancel edit
`Left` / `Right`	Move cursor
`Backspace`	Delete char before cursor
`Delete`	Delete char at cursor
Any character	Insert at cursor

Search README Panel

Key	Action
`j` / `k` / `Up` / `Down`	Scroll README content
`h` / `Left`	Switch focus back to Models panel
`Enter`	Expand README panel
`Esc`	Hide README panel

Downloads Panel

Key	Action
`j` / `k` / `Up` / `Down`	Navigate download entries
`p`	Pause/resume selected download
`Alt+C`	Cancel selected download and remove temp file

Active Model / Model Info Panels

Read-only display panels with no dedicated key bindings. Only the following shared shortcuts apply:

Key	Action
`Tab` / `Shift+Tab`	Switch to other panels
`Ctrl+G`	GGUF filename explanation (Model Info only)
`Shift+A`	Open About modal

F-keys control panel visibility and focus. Each panel has a bit index (0-5):

Key	Panel	Bit	Action
`F1`	Models	0	Focus Models (no toggle)
`F2`	Server Settings	1	Focus Server Settings
`Ctrl+F2`	Server Settings	1	Toggle Server visibility
`Ctrl+F4`	Model Info	2	Toggle Model Info visibility
`F3`	LLM Settings	3	Focus LLM Settings
`Ctrl+F3`	LLM Settings	3	Toggle LLM Settings visibility
`Ctrl+F5`	Active Model	4	Toggle Active Model visibility
`F6`	Log	5	Focus Log
`Ctrl+F6`	Log	5	Toggle Log visibility
`F10`	—	—	Hide all panels except Models
`Ctrl+F10`	—	—	Show all panels
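The bit indices in the table suggest that panel visibility is kept as a bitmask. The sketch below shows how such a mask behaves under the toggle, hide-all, and show-all actions; the bitmask representation itself is an assumption, and the names are illustrative.

```python
# Bit index per panel, copied from the table above.
PANEL_BITS = {
    "Models": 0,
    "Server Settings": 1,
    "Model Info": 2,
    "LLM Settings": 3,
    "Active Model": 4,
    "Log": 5,
}
ALL_VISIBLE = (1 << len(PANEL_BITS)) - 1  # 0b111111, as after Ctrl+F10


def is_visible(mask, panel):
    """True when the panel's bit is set in mask."""
    return bool(mask & (1 << PANEL_BITS[panel]))


def toggle(mask, panel):
    """Flip one panel's bit, as a Ctrl+F-key would."""
    return mask ^ (1 << PANEL_BITS[panel])


def hide_all_except_models():
    """Mask after F10: only the Models bit stays set."""
    return 1 << PANEL_BITS["Models"]
```

Toggling the same panel twice returns the original mask, which is why a single key can both hide and show a panel.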

Tab / Shift+Tab cycle focus only among currently visible panels. Panel order: Models → (Server Settings / README / Profiles / Presets) → Active Model → Log → Downloads.

Panel Resize

Method	Description
`Shift+←` / `Shift+→`	Resize left/right split by 1% (range: 20%-80%)
`Shift+↑` / `Shift+↓`	Resize Server Settings panel height by 1 row (range: 3-20 rows)
Mouse drag on border	Drag the vertical border between left and right panels
Scroll on border	Scroll mouse wheel while hovering the border (1% steps)

Global Shortcuts

Key	Action
`Ctrl+H`	Toggle panel-specific help overlay
`Ctrl+K`	Show CmdLine overlay (full server command)
`Ctrl+Alt+K`	Kill running llama-server
`Ctrl+P`	Open Profile Picker modal
`Ctrl+U`	Open Dashboard URL modal (copy URL to clipboard)
`Ctrl+G`	Show GGUF filename explanation (any panel)
`Ctrl+X`	Toggle expert mode (any panel)
`Ctrl+L`	Cycle UI language (en → fr → it → en)
`Ctrl+O`	Re-trigger onboarding wizard
`Ctrl+C`	Exit (warns if models loaded)
`A`	Open About modal
`y`	Confirm destructive action
`Alt+M`	Toggle benchmark mode (RuntimeOnly / Full)
`Alt+P`	Edit benchmark prompt
`Alt+N`	Edit n_predict (max tokens)
`Alt+I`	Edit iterations
`Alt+C`	Edit chat template kwargs / Cancel confirmation
`Space`	Toggle selection (RPC workers, benchmark parameters)

Log Panel

The Log panel displays live output from the llama.cpp server with level-based coloring.

Log Modes

Mode	Behavior
Following (default)	Auto-scrolls to the bottom as new entries arrive. Press `g` to exit.
Manual	Allows manual scrolling through log history. Press `G` to return to bottom.

Press f in the Log panel to toggle between modes. The current mode is shown in the panel title. Expand the log to fullscreen with Enter; collapse with Esc.

RPC Workers

RPC Workers enable distributed inference across multiple machines. Each worker has a name, IP address, and port (default: 50052).

Open the RPC Workers manager from the Server Settings panel. Within the manager:

Key	Action
`n`	Add new worker
`e`	Edit selected worker
`d`	Delete selected worker
`Space`	Toggle worker selection
`Esc`	Close manager

WebSocket Dashboard

The WebSocket Dashboard provides a real-time visualization of model metrics in any web browser. Access it at http://localhost:49223 (default port).

Configuration

Open the Server Settings panel, navigate to Dashboard, and press Enter to configure:

Field	Description
Enabled	Toggle the dashboard on/off
Port	Server port (default: 49223)
Auth Key	Optional authentication key
TLS Enabled	Enable TLS for secure dashboard access
TLS Cert	Path to TLS certificate file
TLS Key	Path to TLS private key file

When an auth key is set, clients must include it as a WebSocket subprotocol (not a URL parameter). The auth key is passed via the Sec-WebSocket-Protocol header during the WebSocket handshake. With TLS enabled, the URL uses https://.
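To show where the auth key travels, the sketch below assembles the opening handshake of a WebSocket connection by hand, following the header names defined in RFC 6455. The request path `/` is an assumption, since the dashboard's WebSocket path is not documented here, and a real client would normally use a WebSocket library instead of building this text itself.

```python
import base64
import os


def handshake_request(auth_key, host="localhost", port=49223, path="/"):
    """Return the HTTP upgrade request that opens a WebSocket connection."""
    # Random 16-byte nonce, base64-encoded, as required for Sec-WebSocket-Key.
    nonce = base64.b64encode(os.urandom(16)).decode("ascii")
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}:{port}",
        "Upgrade: websocket",
        "Connection: Upgrade",
        f"Sec-WebSocket-Key: {nonce}",
        "Sec-WebSocket-Version: 13",
        # The auth key is sent as the requested subprotocol, not in the URL.
        f"Sec-WebSocket-Protocol: {auth_key}",
    ]
    return "\r\n".join(lines) + "\r\n\r\n"
```

Because subprotocol values are HTTP tokens, an auth key used this way cannot contain spaces or separator characters such as commas.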

Dashboard Display

The dashboard shows real-time metrics (TPS, prompt TPS, latency, context, VRAM, RAM, CPU) and current inference settings (backend, threads, temperature, sampling parameters, etc.) alongside the full server command line.

Benchmark Tuning

Benchmark Tuning auto-tunes model parameters for optimal performance. Access it by setting the Server Mode to BenchTune.

Two modes are available:

RuntimeOnly — Single server, params sent in request body (no server restarts)
Full — New server spawned for each parameter combination

Tunable parameters: temperature (0.4–1.0), top_p (0.8–1.0), top_k (40–50), repeat_penalty (1.0–1.2), flash_attn (0/1), threads (4–16), batch_size (512–2048), expert_count (1–4), context_length, spec_type (speculative decoding type), draft_tokens.
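The number of runs grows multiplicatively with each tuned parameter, which is worth keeping in mind for Full mode, where every combination spawns a new server. The sketch below enumerates a small grid; the sampled values are illustrative picks inside the ranges listed above, not the application's actual step sizes.

```python
from itertools import product

# Hypothetical sample points within the documented ranges.
EXAMPLE_SPACE = {
    "temperature": [0.4, 0.7, 1.0],
    "top_p": [0.8, 0.9, 1.0],
    "flash_attn": [0, 1],
}


def parameter_grid(space):
    """List every combination of the given parameter values as a dict."""
    names = list(space)
    return [dict(zip(names, values))
            for values in product(*(space[name] for name in names))]
```

Here three temperatures, three top_p values, and two flash_attn settings already produce 3 × 3 × 2 = 18 combinations.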

[Screenshot: Benchmark Configuration]

Results can be exported as Markdown table, JSON, YAML, or HTML report with summary cards, winner section, impact analysis, and Chart.js charts.

[Screenshot: Benchmark Results]

Navigate between results with p (previous) and n (next).

System Prompt Presets

Named system prompts for different use cases. Built-in presets: General, Coder, Thinker, Mathematician. User presets are stored as YAML files in ~/.config/llm-manager/presets/<name>.yaml.

Open the System Prompt Presets panel and manage presets:

Key	Action
`n`	Create new preset
`e`	Edit selected preset
`Enter`	Apply preset
`d`	Delete selected preset (moved to `unused_presets/`)
`Ctrl+S`	Save preset during edit
`Esc`	Close / Cancel edit

GPU Layers Cycling

In the LLM Settings panel, the GPU Layers field cycles through three modes with arrow keys:

Mode	Behavior
Auto	Lets llama.cpp auto-detect based on available VRAM (default)
Specific number	Offloads exactly that many layers to GPU
All	Offloads all layers (equivalent to `-ngl 999`)

Arrow keys cycle: Auto → 1 → 2 → … → N → All → Auto. Pressing Enter from a specific number opens an edit buffer for direct input. The -ngl flag is only added for Specific and All modes.
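The cycle and the resulting command-line flag can be sketched as follows. The value is modeled as `"auto"`, an integer, or `"all"`; taking N to be the model's layer count is an assumption, and the function names are illustrative.

```python
def next_gpu_layers(current, n_layers):
    """Advance one step: Auto -> 1 -> 2 -> ... -> N -> All -> Auto."""
    if current == "auto":
        return 1
    if current == "all":
        return "auto"
    return current + 1 if current < n_layers else "all"


def ngl_args(current):
    """Arguments added to the llama-server command line for this setting."""
    if current == "auto":
        return []  # no flag: llama.cpp decides
    if current == "all":
        return ["-ngl", "999"]
    return ["-ngl", str(current)]
```

For a model with three layers, repeated presses starting from Auto give 1, 2, 3, All, and then Auto again.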

MTP (Multi-Token Prediction)

MTP is an experimental feature that uses a draft model to predict multiple tokens in parallel, improving inference speed. When a model with MTP architecture is selected, the app automatically detects it and enables the --draft-mtp flag. The number of draft tokens is read from the GGUF metadata and displayed in the Model Info panel.

GGUF Metadata

The Model Info panel shows parsed GGUF metadata including: architecture, layers, hidden size, context length, attention heads, KV heads, domain, capabilities, quantization, parameters (e.g., “7B”, “405B”), tokenizer type, vocabulary size, and max context for VRAM. Metadata is parsed once and cached; the cache is keyed by the file’s modification time (mtime), so a changed file is parsed again.

Active Model Metrics

The Active Model panel shows real-time metrics:

Metric	Description
TPS	Tokens per second (generation speed)
Prompt TPS	Prompt processing speed
Gen TPS	Generation tokens per second (separate from prompt TPS)
Context usage	Progress bar showing ctx_used/ctx_max
CPU%	CPU usage percentage
RAM	RAM usage
VRAM	GPU memory used/total
Total VRAM	Total GPU memory used (including non-model allocations)
Latency	Milliseconds per token (generation and prompt)
Tokens	Total decoded tokens generated

The panel also shows benchmarking state with progress bar and current parameter display when running BenchTune.

Backend Selection

Multiple backends are supported via the llama.cpp server:

Backend	Source	Description
CPU	ggml-org/llama.cpp	CPU-only inference (standard)
Vulkan	ggml-org/llama.cpp	GPU via Vulkan (Universal: AMD/NVIDIA/Intel)
ROCm	ggml-org/llama.cpp	GPU via ROCm (AMD Native)
ROCm Lemonade	lemonade-sdk/llamacpp-rocm	GPU via ROCm (AMD Optimized, auto-detects GFX architecture)
CUDA	ai-dock/llama.cpp-cuda	GPU via CUDA (NVIDIA Native, CUDA 12.8)
CPU ARM64	ggml-org/llama.cpp	CPU-only for ARM64 Linux
CPU Windows	ggml-org/llama.cpp	CPU-only for Windows
Vulkan Windows	ggml-org/llama.cpp	Vulkan for Windows
CUDA Windows 12.4	ai-dock/llama.cpp-cuda	CUDA 12.4 for Windows
CUDA Windows 13.1	ai-dock/llama.cpp-cuda	CUDA 13.1 for Windows
HIP Windows	ggml-org/llama.cpp	HIP (ROCm) for Windows
CPU macOS ARM64	ggml-org/llama.cpp	CPU-only for macOS Apple Silicon
CPU macOS x64	ggml-org/llama.cpp	CPU-only for macOS Intel

Each backend has its own independently configurable llama.cpp version. Switching versions is instant — no re-download.

Server Modes

Mode	Description
Normal	Single model via CLI (default)
Router (WIP)	Multiple models via API, loaded through the `/models/load` endpoint (work in progress: not yet selectable in the TUI; enable via config.yaml `default.server_mode: router`)
Bench	GPU benchmarking mode (runs llama-bench)
BenchTune	Parameter auto-tuning mode

VRAM Estimate

The app computes a detailed VRAM estimate based on model size, GPU layers, KV cache, activation overhead, and fixed overhead. The formula accounts for GQA ratio, FlashAttention (0.5× KV cache reduction), unified KV cache, KV cache quantization bytes, activation overhead (8× multiplier), YaRN RoPE scale (effective context = context_length * rope_scale), MoE expert ratio (applied to FFN portion only), and fixed overhead (3.8% of max VRAM or 500 MiB fallback). The estimate is shown in the LLM Settings title (e.g., “VRAM ~= 8.2 GB”).
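Part of this estimate can be illustrated with the standard KV cache size formula. The sketch below is a partial, simplified reading of the description above, not the application's formula: it covers offloaded weights, the KV cache, and the fixed overhead, and it leaves out activation overhead, unified KV, and the MoE expert ratio. Applying the 500 MiB fallback when the total VRAM is unknown is an assumption.

```python
MIB = 1024 ** 2


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_length,
                   bytes_per_element=2.0, rope_scale=1.0,
                   flash_attention=False):
    """K and V tensors for every layer over the effective context."""
    effective_context = context_length * rope_scale
    size = 2 * n_layers * n_kv_heads * head_dim * effective_context * bytes_per_element
    # 0.5x reduction with FlashAttention, as stated in the description above.
    return size * 0.5 if flash_attention else size


def fixed_overhead_bytes(max_vram_bytes=None):
    """3.8% of total VRAM, or 500 MiB when the total is unknown (assumed)."""
    if max_vram_bytes:
        return 0.038 * max_vram_bytes
    return 500 * MIB


def estimate_vram_bytes(model_bytes, gpu_layers, n_layers, kv_bytes,
                        max_vram_bytes=None):
    """Offloaded share of the weights plus KV cache plus fixed overhead."""
    offloaded_weights = model_bytes * min(gpu_layers, n_layers) / n_layers
    return offloaded_weights + kv_bytes + fixed_overhead_bytes(max_vram_bytes)
```

Using `n_kv_heads` rather than the number of attention heads is what accounts for grouped-query attention: a model with 32 layers, 8 KV heads, a head dimension of 128, and a 4096-token context needs 512 MiB of F16 KV cache under this formula.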

Confirmation Dialogs

The app uses confirmation dialogs for destructive actions:

Exit — warns about loaded models
Delete — confirms moving the model file and its YAML config to the unused/ directory (both can be restored later)
Reset — confirms resetting all LLM settings
Unload — confirms unloading a model via API
DeleteBackend — confirms deleting a backend binary version from disk

Dialogs require a minimum terminal height of 12 lines. Height is calculated as content lines plus 6 lines of vertical padding, clamped to area.height - 4 to fit small terminals.
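The sizing rule can be written out as a short sketch. The function name and the choice to return `None` for terminals below the minimum height are assumptions made for illustration; the description only states the arithmetic.

```python
from typing import Optional

MIN_TERMINAL_HEIGHT = 12  # documented minimum
VERTICAL_PADDING = 6      # documented padding added to the content lines
AREA_MARGIN = 4           # documented clamp: area.height - 4

def dialog_height(content_lines: int, area_height: int) -> Optional[int]:
    """Return the dialog height in rows, or None if the terminal is too short."""
    if area_height < MIN_TERMINAL_HEIGHT:
        return None  # assumed behavior: the dialog is not drawn
    return min(content_lines + VERTICAL_PADDING, area_height - AREA_MARGIN)
```

For example, a 3-line message in a 24-row terminal gets a 9-row dialog, while a 30-line message is clamped to 20 rows.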

Mouse Support

Mouse interactions are supported: clicking on panels to focus them, and scrolling in the log panel, README panel, settings, profiles, and presets panels.

Panel Resize

The horizontal split between left panels (Models + Info) and right panels (Settings/README) can be resized:

Method	Description
Drag border	Click and drag the vertical border between left and right panels
Scroll on border	Scroll mouse wheel while hovering over the border (1% steps)
Keyboard horizontal	`Shift+←` / `Shift+→` to adjust left/right split by 1% (range: 20%-80%)
Keyboard vertical	`Shift+↑` / `Shift+↓` to adjust Server Settings height by 1 row (range: 3-20 rows, persisted)

The current horizontal split percentage is shown in the status bar (e.g., │ 55%). While actively resizing, the indicator shows │ 55% ← resize →.

You can toggle the visibility of individual panels with the F1–F6 keys, or focus specific panels with Ctrl+F7–Ctrl+F9.

Panel Toggling

The TUI shows panel visibility status via small indicators on each panel border. When all panels are visible, all indicators appear. When a panel is hidden, its indicator disappears, making it easy to see which panels are currently shown or hidden.

All Panels Visible

Some Panels Hidden

CmdLine Overlay

Press Ctrl+K to view the full command line that would be executed to start the llama.cpp server. This shows the binary path, model path, and all parameters.

CmdLine Overlay

Press e in the overlay to export the command to /tmp/test_llamaserver.sh.

Server Status

The status bar shows the current server status at the top:

Running: ● 9090 Normal (green dot with port and mode)
Stopped: ○ Server (gray)

Press Ctrl+Alt+K to kill the running llama-server. When the server is stopped, all loaded models are reset to the Available state.

Profiles

Profiles are named presets of LLM settings. Built-in profiles include Qwen, Gemma, Llama, Mistral, and Phi. User profiles are stored as YAML files in ~/.config/llm-manager/profiles/<name>.yaml.

p — Apply a profile to current settings
Ctrl+S — Save current settings as a new profile (in the Profiles panel)
Ctrl+D — Delete a user-defined profile (moved to unused_profiles/)

LLM Profiles

opencode

This document covers how to use opencode with llm-manager’s API endpoint.

Prerequisites

Configure the API endpoint and auth key in llm-manager (see Configuration):

Open the Server Settings panel (F2)
Navigate to API Endpoint and press Enter
Configure:
- Enabled — toggle on
- Port — default 49222, configurable via api_endpoint_port in config.yaml
- API Key — the auth key for Bearer token authentication
Press Enter to save

Connecting with opencode

Note: API Endpoint vs Direct Server

You can connect opencode directly to llama.cpp’s default port 8080:

export OPENAI_API_BASE="http://localhost:8080/v1"

However, using llm-manager’s API endpoint (port 49222) provides additional features:

Web Search — automatic SearXNG integration when messages contain research keywords
API proxy — CORS, SSE streaming optimization, request interception
Auth key — Bearer token authentication
Metrics — Prometheus metrics at /metrics

For full feature access, use the API endpoint below.

CORS

The API proxy validates the Origin header at runtime. Only requests from localhost, 127.0.0.1, or the configured bind host are allowed. External websites are blocked.
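A minimal sketch of this kind of origin check follows. It compares only the hostname of the Origin header; whether the real proxy also considers the scheme or port is not stated, so that part is an assumption, as are the function and parameter names.

```python
from urllib.parse import urlsplit

def origin_allowed(origin: str, bind_host: str = "127.0.0.1") -> bool:
    """Accept only origins whose host is localhost, 127.0.0.1, or the bind host."""
    host = urlsplit(origin).hostname  # None for values such as "null"
    if host is None:
        return False
    return host in {"localhost", "127.0.0.1", bind_host.lower()}
```

Matching on the parsed hostname rather than on a string prefix avoids accepting look-alike hosts such as `localhost.evil.example`.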

Auth JSON approach

Set the OPENAI_API_KEY and OPENAI_API_BASE environment variables:

export OPENAI_API_KEY="my-secret-key"          # api_endpoint_key value
export OPENAI_API_BASE="http://localhost:49222/v1"  # or https:// if TLS enabled
opencode auth login

opencode stores the API key in ~/.local/share/opencode/auth.json.

Env var + custom provider approach

Store the token in an environment variable:

export OPENCODE_BEARER_TOKEN="my-secret-key"

Use it in opencode configuration. opencode supports {env:VAR_NAME} syntax for referencing environment variables in config:

{
  "llm-manager": {
    "type": "api",
    "key": "{env:OPENCODE_BEARER_TOKEN}"
  }
}

Bypassing TLS certificate verification

For self-signed or untrusted TLS certificates, set NODE_TLS_REJECT_UNAUTHORIZED=0 before running opencode:

NODE_TLS_REJECT_UNAUTHORIZED=0 opencode

Custom CA Certificates (More Secure)

If you want to properly trust your self-signed certificate instead of disabling verification:

export SSL_CERT_FILE=/path/to/your/certificate.crt
opencode

TLS / HTTPS

When server_tls_enabled: true is set in config.yaml, the API endpoint uses HTTPS:

Change OPENAI_API_BASE to https://localhost:49222/v1
Change curl URL to https://localhost:49222/v1/chat/completions

The TLS certificate is shared with the WebSocket dashboard (see WebSocket Dashboard).

Configuration

Directory Layout

llm-manager uses XDG directories for config and data:

~/.config/llm-manager/          # Config directory
├── config.yaml                 # Global settings
├── models/                     # Per-model YAML configs
│   └── qwen2.5-7b.yaml
├── profiles/                   # Per-profile YAML configs
│   └── my-profile.yaml
├── presets/                    # Per-preset YAML configs
│   └── custom-preset.yaml
├── unused/                     # Deleted model configs
├── unused_profiles/            # Deleted profiles
└── unused_presets/             # Deleted presets

~/.local/share/llm-manager/     # Data directory
├── models/                     # GGUF model files
│   └── qwen2.5-7b.Q4_K_M.gguf
└── bin/                        # llama-server binaries
    └── llama-server-cpu-...

Per-model configs are named <model_name>.yaml where model_name is the GGUF filename without the .gguf extension. Deleted configs are moved to unused/ subdirectories (recoverable).
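The naming rule can be sketched as follows. The helper name and its `config_dir` parameter are hypothetical; only the "GGUF filename without the .gguf extension" rule and the `models/` subdirectory come from the text above.

```python
from pathlib import Path

def model_config_path(gguf_path: str, config_dir: str = "~/.config/llm-manager") -> Path:
    """Map a GGUF file to its per-model YAML config path."""
    name = Path(gguf_path).name
    if name.lower().endswith(".gguf"):
        name = name[: -len(".gguf")]  # strip only the final .gguf extension
    return Path(config_dir).expanduser() / "models" / f"{name}.yaml"
```

Only the trailing `.gguf` is removed, so dots elsewhere in the filename (version numbers, quantization tags) are preserved.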

Config File

The main config file is ~/.config/llm-manager/config.yaml. It is created automatically on first run with sensible defaults.

models_dirs:
  - ~/.local/share/llm-manager/models
llama_server: llama-server
default:
  context_length: 131072
  threads: <physical cores>
  threads_batch: 8
  batch_size: 512
  temperature: 0.8
  # ... more settings

You can specify a custom config path with --config:

cargo run -- --config /path/to/config.yaml

Default Parameters

Parameter	Default	Description
`context_length`	131072	Context window size in tokens
`threads`	(physical cores)	CPU threads for generation
`threads_batch`	8	CPU threads for batch processing
`batch_size`	512	Logical maximum batch size
`ubatch_size`	512	Physical maximum batch size
`keep`	0	Keep N tokens from initial prompt
`mlock`	false	Lock model weights in RAM
`mmap`	true	Memory-map the model
`kv_cache_offload`	true	Offload KV cache to RAM
`flash_attn`	true	Enable Flash Attention
`temperature`	0.8	Sampling temperature
`top_k`	40	Top-k sampling
`top_p`	0.95	Top-p sampling
`min_p`	0.0	Min-p sampling
`typical_p`	1.0	Typical-p sampling
`repeat_penalty`	1.1	Repetition penalty
`repeat_last_n`	64	Repetition penalty last N tokens
`presence_penalty`	null	Presence penalty
`frequency_penalty`	null	Frequency penalty
`max_tokens`	null	Maximum generation tokens (unlimited if null)
`seed`	-1	Random seed (-1 = random)
`backend`	auto-detected	Default backend (auto-detected: Cuda for NVIDIA, Rocm for AMD, Vulkan for Intel; falls back to cpu). Options: `cpu`, `vulkan`, `rocm`, `rocm-lemonade`, `cuda`, `cpu_arm64`, `win_cpu`, `win_vulkan`, `win_cuda_12_4`, `win_cuda_13_1`, `win_hip`, `macos_arm64`, `macos_x64`

Advanced Parameters

Parameter	Default	Description
`swa_full`	false	Full-size SWA cache
`numa`	none	NUMA optimization mode
`uniform_cache`	true	Unified KV cache across sequences
`parallel`	1	Max concurrent predictions
`max_concurrent_predictions`	null	Max requests in flight
`gpu_layers`	-1	GPU layers (-1 = all)
`gpu_layers_mode`	Auto	GPU layers distribution mode
`split_mode`	layer	Split mode for multi-GPU
`tensor_split`	(empty)	Tensor split across GPUs
`main_gpu`	0	Main GPU ID
`fit`	true	Fit GPU layers automatically
`embedding`	false	Enable embedding mode
`jinja`	true	Use Jinja chat template
`chat_template`	null	Custom chat template string
`expert_count`	-1	Expert count for MoE models
`mirostat`	off	Mirostat mode (off, 1, 2)
`mirostat_lr`	0.1	Mirostat learning rate
`mirostat_ent`	5.0	Mirostat entropy
`ignore_eos`	false	Ignore EOS token
`samplers`	penalties;dry;top_n_sigma;top_k;typ_p;top_p;min_p;xtc;temperature	Sampler chain
`dry_multiplier`	0.0	DRY sampling multiplier
`dry_base`	1.75	DRY sampling base
`dry_allowed_length`	2	DRY allowed repetition length
`dry_penalty_last_n`	-1	DRY last N tokens
`rope_scaling`	none	RoPE scaling type (none, linear, yarn)
`rope_scale`	1.0	RoPE scale factor
`rope_freq_base`	0.0	RoPE frequency base (0 = auto)
`rope_freq_scale`	1.0	RoPE frequency scale
`rope_yarn_enabled`	false	Enable RoPE Yarn
`cache_prompt`	true	Cache prompt tokens
`cache_reuse`	0	Cache reuse length in tokens
`cache_type`	f16	KV cache quantization type
`cache_type_k`	null	KV cache quantization type (K)
`cache_type_v`	null	KV cache quantization type (V)
`spec_type`	(empty)	Speculative decoding type (e.g. “draft-mtp”, “ngram-simple”)
`draft_tokens`	0	Number of draft tokens for MTP
`host`	127.0.0.1	Server bind address
`port`	8080	Server port
`timeout`	600	Server timeout in seconds
`webui`	false	Enable web UI
`ws_server_enabled`	false	Enable WebSocket server
`ws_server_port`	49223	WebSocket server port
`ws_server_auth_key`	null	WebSocket server auth key
`server_tls_enabled`	true	WebSocket server TLS
`server_tls_cert`	null	WebSocket server TLS cert path
`server_tls_key`	null	WebSocket server TLS key path
`router_max_models`	4	Max models in router mode
`server_mode`	Normal	Server mode (Normal, Router, Bench, BenchTune)
`api_endpoint_enabled`	false	Enable built-in API endpoint
`api_endpoint_port`	49222	Built-in API endpoint port
`api_endpoint_key`	null	Bearer token for API endpoint authentication
`web_search_enabled`	false	Enable web search
`web_search_engine`	searxng	Search engine (searxng)
`web_search_engine_url`	(empty)	URL of SearXNG instance (required for web search to work)
`web_search_api_key`	null	Bearer token for SearXNG authentication
`platform`	null	Platform override (linux, windows, macos)
`tags`	(empty)	Model tags for filtering

These can be configured via the LLM Settings panel, per-model config files, or directly in config.yaml.

Profiles

Profiles are named presets of settings. The built-in profiles are:

Profile	Description	Key Settings
Qwen	Optimized for Qwen models (dense)	temp: 0.7, top-k: 20, presence-penalty: 0.0
Qwen-MoE	Optimized for Qwen MoE models (35B-A3B)	temp: 0.8, top-k: 20, presence-penalty: 1.5
Qwen-Coding	Optimized for Qwen models in coding mode	temp: 0.6, top-k: 20, presence-penalty: 0.0
Gemma	Optimized for Gemma 2/4 models	temp: 1.0, min-p: 0.1, top-k: 65
Llama	Optimized for Llama 3.1/3.3 models	temp: 0.7, top-p: 0.9, repeat-penalty: 1.1
Mistral	Optimized for Mistral 7B/NeMo models	temp: 0.7, top-k: 50, top-p: 0.9
Phi	Optimized for Phi 3.5 Mini models	temp: 0.7, top-k: 50, top-p: 0.9, repeat-penalty: 1.1

All profiles also set context_length: 131072, max_tokens: 4096, and uniform_cache: true, and use top_p: 0.95 unless the table above lists a different top-p. Qwen, Qwen-MoE, Qwen-Coding, Gemma, Llama, and Mistral also set jinja: true (Phi does not).

User-defined profiles are stored as individual YAML files in ~/.config/llm-manager/profiles/<name>.yaml. Built-in profiles are auto-merged on load.

System Prompt Presets

System prompt presets define the initial system prompt. Built-in presets:

Preset	Description
General	“You are a helpful assistant.”
Coder	Expert software developer
Thinker	Analytical and thoughtful
Mathematician	Expert in mathematics

The default preset is Coder. User-defined presets are stored as individual YAML files in ~/.config/llm-manager/presets/<name>.yaml. Built-in presets are auto-merged on load.

Backend Binaries

llama-server binaries are stored in ~/.local/share/llm-manager/bin/ with versioned directories:

~/.local/share/llm-manager/bin/
├── llama-server-cpu-{version}/llama-server
├── llama-server-vulkan-{version}/llama-server
├── llama-server-rocm-{version}/llama-server
├── llama-server-rocm-lemonade-{version}/llama-server
└── llama-server-cuda-{version}/llama-server

Binaries are downloaded from specialized repositories on first use:

CPU, Vulkan, ROCm (Native): Fetched from ggml-org/llama.cpp
ROCm Lemonade: Fetched from lemonade-sdk/llamacpp-rocm (ZIP, auto-detects GFX architecture like gfx1100)
CUDA (NVIDIA): Fetched from ai-dock/llama.cpp-cuda (CUDA 12.8 builds)

Switching versions is instant — no re-download.

Per-backend Version Config

llama_cpp_version_cpu: null
llama_cpp_version_vulkan: null
llama_cpp_version_rocm: null
llama_cpp_version_rocm_lemonade: null
llama_cpp_version_cuda: null

Platform-specific backend variants (e.g. CpuArm64, CpuWindows, CpuMacosArm64) are handled through the Backend enum and platform field, not through separate version config keys. Each backend has its own independently configurable version.

Setting a key to null uses the latest release. Specific versions can be set via the version picker in LLM Settings. These selections are automatically persisted to your configuration and remembered across restarts.

Asset Names

Assets are selected based on the detected platform. Linux examples:

CPU (x64): llama-{tag}-bin-ubuntu-x64.tar.gz
CPU (ARM64): llama-{tag}-bin-ubuntu-arm64.tar.gz
Vulkan: llama-{tag}-bin-ubuntu-vulkan-x64.tar.gz
ROCm: llama-{tag}-bin-ubuntu-rocm-7.2-x64.tar.gz
ROCm Lemonade: llama-{tag}-ubuntu-rocm-{gfx}-x64.zip (auto-detects GPU architecture)
CUDA: llama.cpp-{tag}-cuda-12.8-amd64.tar.gz

Windows assets use *.zip (e.g. llama-{tag}-bin-win-cpu-x64.zip). macOS assets use llama-{tag}-bin-macos-arm64.tar.gz or llama-{tag}-bin-macos-x64.tar.gz.

Serve Mode

You can start a model directly from the command line without the TUI:

./build.sh serve --model /path/to/model.gguf

Options

Option	Description
`--model`	Path to the GGUF model file
`--profile`	Apply a settings profile (e.g., `qwen`, `llama`)
`--config`	Path to config file
`--api-port`	Start API proxy on given port
`--api-key`	API key for Bearer token authentication (API proxy)
`--ws-enable`	Enable WebSocket dashboard server
`--ws-port`	Port for WebSocket dashboard server
`--host`	Bind address for the server (e.g., `0.0.0.0`)
`--backend-binary`	Path to a custom llama-server binary
`--log-file`	Log file path (default: stdout)
`--tls-enable`	Enable TLS for WebSocket dashboard
`--tls-cert`	Path to TLS certificate file
`--tls-key`	Path to TLS private key file

Note: --threads, --context, and --gpu-layers are not CLI flags. They are configured via config.yaml (default section) or per-model override files.

API Proxy

The API proxy forwards requests to the llama.cpp server and provides OpenAI-compatible and Anthropic-compatible endpoints. It supports SSE (Server-Sent Events) streaming for chat completions and other streaming endpoints. CORS is enabled with dynamic origin validation — only requests from localhost, 127.0.0.1, or the configured bind host are allowed. External websites are blocked. When --api-key is set, all requests require Authorization: Bearer <key>.

API Endpoints

The API proxy explicitly handles the following endpoints, while all other paths are automatically proxied to the llama-server instance:

Endpoint	Method	Description
`/health`	GET	Health check
`/metrics`	GET	Prometheus metrics
`/v1/chat/completions`	POST	Chat completions (OpenAI)
`/v1/completions`	POST	Completions (OpenAI)
`/v1/embeddings`	POST	Embeddings
`/v1/models`	GET	List models
`/api/status`	GET	Server status (pid, uptime, loaded models)

The following endpoints are forwarded to llama-server (llama-server built-in endpoints, not explicitly handled by llm-manager):

Endpoint	Method	Description
`/v1/responses`	POST	Responses (OpenAI)
`/v1/messages`	POST	Messages (Anthropic)
`/v1/messages/count_tokens`	POST	Count tokens (Anthropic)
`/completion`	POST	Legacy completion
`/infill`	POST	Code completion (FIM)
`/reranking`	POST	Re-ranking
`/tokenize`	POST	Tokenize text
`/detokenize`	POST	Detokenize tokens
`/apply-template`	POST	Apply chat template
`/v1/health`	GET	Health check (alias)
`/props`	GET/POST	Get/set server properties
`/slots`	GET	Slot monitoring
`/lora-adapters`	GET/POST	List/load LoRA adapters
`/models/load`	POST	Load a model (router mode, Work In Progress)
`/models/unload`	POST	Unload a model (router mode, Work In Progress)

Model Overrides

Settings can be saved per-model. Overrides are stored as individual YAML files in ~/.config/llm-manager/models/<name>.yaml (where name is the GGUF filename without .gguf). When a model is loaded, its override settings are merged into the defaults. Deleted configs are moved to ~/.config/llm-manager/unused/ for recovery.
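The merge step can be pictured with a small sketch. It assumes a flat key/value layout and treats a null override as "not set"; both are assumptions for illustration, and the function name is hypothetical.

```python
def merge_settings(defaults: dict, overrides: dict) -> dict:
    """Overlay per-model overrides on the global defaults."""
    merged = dict(defaults)  # copy so the defaults stay untouched
    for key, value in overrides.items():
        if value is not None:  # assumed: null means "inherit the default"
            merged[key] = value
    return merged
```

With defaults of `temperature: 0.8` and `top_k: 40`, an override file containing only `temperature: 0.6` yields `temperature: 0.6` and `top_k: 40`.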

RPC Workers

You can manage a list of remote llama-rpc-server nodes for distributed inference. These are stored in the rpc_workers list in the config:

rpc_workers:
  - selected: true
    name: "Worker 1"
    ip: "192.168.1.10"
    port: 50052

Workers can be managed via the RPC Workers window in the Server Settings panel. Selected workers are combined into the --rpc flag when starting the server.
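As an illustration of how selected workers could be combined, the sketch below joins them into the comma-separated `host:port` list that llama.cpp's `--rpc` option accepts. The function name is hypothetical, and the dictionaries simply mirror the YAML fields shown above.

```python
def rpc_flag(workers: list) -> list:
    """Build the --rpc argument from the selected workers, or no argument at all."""
    endpoints = [f"{w['ip']}:{w['port']}" for w in workers if w.get("selected")]
    return ["--rpc", ",".join(endpoints)] if endpoints else []
```

Unselected workers stay in the config but are left out of the command line.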

Server Settings

Server Settings cover the infrastructure and networking configuration of llm-manager.

Quick Start

The two most important parameters to configure are the Host and Backend selection.

Host

Bind address for the llama.cpp server. Select from available network interfaces.

Host Picker

Backend

GPU/CPU backend selection. Choose between CPU, Vulkan, ROCm, CUDA, and platform variants.

Backend Picker

API Endpoint

The API Endpoint exposes an OpenAI-compatible API proxy that forwards requests to llama-server.

Enabling

Enable from the Server Settings panel (F2):

Navigate to API Endpoint and press Enter
Configure:
- Enabled — toggle on
- Port — default 49222, configurable via api_endpoint_port in config.yaml
- API Key — optional Bearer token for authentication
Press Enter to save

Or in ~/.config/llm-manager/config.yaml:

default:
  api_endpoint_enabled: true
  api_endpoint_port: 49222
  api_endpoint_key: your-secret-key

Serve Mode

Start with the API proxy from the command line:

./build.sh serve --model model.gguf --api-port 49222

With authentication:

./build.sh serve --model model.gguf --api-port 49222 --api-key secret

In serve mode, --api-key sets the key for both the API proxy and the WebSocket dashboard (if enabled).

API Endpoints

The proxy handles these endpoints explicitly:

Endpoint	Method	Description
`/health`	GET	Health check
`/metrics`	GET	Prometheus metrics
`/v1/chat/completions`	POST	Chat completions (OpenAI)
`/v1/completions`	POST	Completions (OpenAI)
`/v1/embeddings`	POST	Embeddings
`/v1/models`	GET	List models
`/api/status`	GET	Server status

All other paths are proxied to llama-server (reranking, tokenization, infill, etc.).

Authentication

When api_endpoint_key is configured, clients must include:

Authorization: Bearer <key>
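A minimal client-side sketch of attaching the header, using only the Python standard library. The helper name and the `model="default"` placeholder are assumptions; the path and the Bearer scheme come from this document.

```python
import json
from urllib.request import Request

def build_chat_request(base_url: str, api_key: str, prompt: str, model: str = "default") -> Request:
    """Build (but do not send) an authenticated chat completions request."""
    body = json.dumps({
        "model": model,  # placeholder name, adjust to your setup
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return Request(f"{base_url}/v1/chat/completions", data=body, headers=headers, method="POST")
```

With a running endpoint, the request can be sent with `urllib.request.urlopen(build_chat_request("http://localhost:49222", "your-secret-key", "Hello"))`.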

TLS / HTTPS

The API proxy shares TLS configuration with the WebSocket dashboard. Enable in config.yaml:

default:
  server_tls_enabled: true
  server_tls_cert: /path/to/cert.pem  # optional, auto-generated if omitted
  server_tls_key: /path/to/key.pem     # optional, auto-generated if omitted

When TLS is enabled, use https://localhost:49222 instead of http://.

Auto-generated Certificates

When TLS is enabled without specifying cert/key paths, llm-manager auto-generates a self-signed certificate and CA. Certificates are stored in ~/.config/llm-manager/tls/:

~/.config/llm-manager/tls/
├── ca.pem              # CA certificate
├── ca-key.pem          # CA private key
├── server.pem          # Server certificate
└── server-key.pem      # Server private key

To trust the auto-generated CA:

# Linux (system-wide)
sudo cp ~/.config/llm-manager/tls/ca.pem /usr/local/share/ca-certificates/ && sudo update-ca-certificates

# macOS
sudo security add-trusted-cert -d -r trustRoot -k /Library/Keychains/System.keychain ~/.config/llm-manager/tls/ca.pem

CORS

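CORS is enabled with dynamic origin validation: only requests from localhost, 127.0.0.1, or the configured bind host are allowed, and external websites are blocked. Allowed methods are GET, POST, PUT, DELETE, and OPTIONS.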

SSE Streaming

The API proxy supports SSE (Server-Sent Events) streaming for chat completions and other streaming endpoints. Set stream: true in the request body.
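On the client side, a streamed response can be consumed line by line. The sketch below assumes the proxy passes through OpenAI-style chunks (`data: {...}` lines ending with `data: [DONE]`); the function names are hypothetical.

```python
import json

def iter_sse_chunks(lines):
    """Yield the decoded JSON payload of each SSE data line until [DONE]."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and comment lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        yield json.loads(payload)

def collect_text(lines) -> str:
    """Concatenate the content deltas of a streamed chat completion."""
    parts = []
    for chunk in iter_sse_chunks(lines):
        delta = chunk["choices"][0].get("delta", {})
        parts.append(delta.get("content") or "")
    return "".join(parts)
```

`lines` can be any iterable of decoded text lines, for example the lines read from an HTTP response body.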

WebSocket Dashboard

The WebSocket Dashboard provides a real-time visualization of model metrics and settings via a web browser.

Accessing the Dashboard

The dashboard runs as a built-in HTTP server on port 49223 by default. Open it in your browser (use https:// instead of http:// when TLS is enabled):

http://localhost:49223

Enabling in Serve Mode

The dashboard can be enabled in serve mode using the --ws-enable flag:

./build.sh serve --model model.gguf --api-port 49222 --ws-enable

Customize the dashboard port:

./build.sh serve --model model.gguf --api-port 49222 --ws-enable --ws-port 8081

Note: In serve mode, --api-key sets the key for both the API proxy and the WebSocket dashboard (if enabled).

Customize the host and use a specific backend binary:

./build.sh serve --model model.gguf --api-port 49222 --ws-enable --host 0.0.0.0 --backend-binary /opt/rocm/bin/llama-server

The --host option controls the bind address for both the API proxy server and the WebSocket dashboard server, ensuring they use the same network interface. The default is 127.0.0.1 (from config).

Enabling in TUI Mode

The dashboard can also be enabled from the TUI:

Open the Server Settings panel (F2)
Navigate to Dashboard and press Enter
Configure:
- Enabled — toggle on/off
- Port — server port (default: 49223)
- Auth Key — optional authentication (see below)
Press Enter to save, Esc to close

Dashboard Overview

The dashboard displays real-time metrics in a card-based layout:

Dashboard

Metrics Cards

Metric	Description
Status	Current model state (loaded / unloaded / loading)
Generation Speed	Tokens per second (TPS) for text generation
Prompt Speed	Tokens per second for prompt processing
Latency	Milliseconds per token
Tokens	Tokens generated with progress bar (decoded_tokens / max_tokens, or ‘∞’ if not configured)
VRAM	GPU memory used/total with color-coded progress bar (green <50%, yellow 50-80%, red >80%)
RAM	System memory usage
CPU	CPU usage percentage
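The VRAM color thresholds can be expressed as a small sketch. The function name is hypothetical, and treating exactly 50% and exactly 80% as yellow is an assumption based on the "50-80%" range.

```python
def vram_color(used: float, total: float) -> str:
    """Map VRAM usage to the progress bar color used by the dashboard card."""
    if total <= 0:
        raise ValueError("total VRAM must be positive")
    percent = 100 * used / total
    if percent < 50:
        return "green"
    if percent <= 80:
        return "yellow"
    return "red"
```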

Settings Panel

Below the metrics, the dashboard shows a grid of current inference settings:

Setting	Description
Backend & Version	llama.cpp backend and version
Threads / Threads Batch	CPU thread configuration
Context / Batch Size / Ubatch Size	Model execution parameters
Temperature / Top-k / Top-p / Min P / Typical P	Sampling parameters
Seed	Random seed for reproducibility
Repeat Penalty / Repeat Last N	Repetition control
Presence Penalty / Frequency Penalty	Advanced repetition control
Flash Attention / KV Cache Offload	Performance optimizations
Cache Type K / Cache Type V	KV cache quantization
Unified KV / Mlock / Mmap	Memory management
Expert Count / GPU Layers	Model-specific settings
Spec Type / Draft Tokens	Speculative decoding configuration
Yarn RoPE / Yarn Params	Context extension parameters
Tags	Per-model tags

Server Command

The full llama-server command line is displayed at the bottom of the dashboard, showing the exact invocation with all parameters. This is useful for debugging and inspecting the exact configuration being used.

Configuration

To enable and configure the dashboard:

Open the Server Settings panel (F2)
Navigate to Dashboard and press Enter
Configure:
- Enabled – toggle on/off
- Port – server port (default: 49223)
- Auth Key – optional authentication (see below)
Press Enter to save, Esc to close

Dashboard Configuration

Authentication

When an auth key is configured, clients must include it as a WebSocket subprotocol (not a URL parameter):

WebSocket URL: ws://localhost:49223/ws
Subprotocol: mysecretkey
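For clients written without a WebSocket library, the key travels in the `Sec-WebSocket-Protocol` header of the opening handshake. The sketch below only builds that handshake text as defined by RFC 6455; the function name is hypothetical and nothing is sent over the network.

```python
import base64
import os

def handshake_request(host: str, port: int, auth_key: str = "", path: str = "/ws") -> str:
    """Build the HTTP upgrade request that opens the dashboard WebSocket."""
    nonce = base64.b64encode(os.urandom(16)).decode("ascii")  # random 16-byte key
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}:{port}",
        "Upgrade: websocket",
        "Connection: Upgrade",
        f"Sec-WebSocket-Key: {nonce}",
        "Sec-WebSocket-Version: 13",
    ]
    if auth_key:
        lines.append(f"Sec-WebSocket-Protocol: {auth_key}")  # auth key as subprotocol
    return "\r\n".join(lines) + "\r\n\r\n"
```

Because subprotocol values are HTTP tokens, a key containing spaces or separator characters cannot be sent this way.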

TLS / HTTPS

The WebSocket Dashboard supports TLS (HTTPS) for encrypted connections.

In TUI Mode

Enable TLS from the Server Settings panel (F2 → Dashboard):

Enabled — toggle on/off (default: off)
TLS — toggle on/off (default: on)
TLS Cert — path to a PEM certificate file (optional; leave blank for auto-generated self-signed certificate)
TLS Key — path to a PEM private key file (optional; leave blank for auto-generated certificate)

When TLS is enabled without specifying cert/key paths, the application auto-generates a self-signed certificate and CA. The certificates are stored in ~/.config/llm-manager/tls/:

~/.config/llm-manager/tls/
├── ca.pem              # CA certificate
├── ca-key.pem          # CA private key
├── server.pem          # Server certificate
└── server-key.pem      # Server private key

To trust the auto-generated CA:

# Linux (system-wide)
sudo cp ~/.config/llm-manager/tls/ca.pem /usr/local/share/ca-certificates/ && sudo update-ca-certificates

# macOS
sudo security add-trusted-cert -d -r trustRoot -k /Library/Keychains/System.keychain ~/.config/llm-manager/tls/ca.pem

The dashboard URL changes to https:// when TLS is enabled:

https://localhost:49223

In Serve Mode

Enable TLS from the command line:

./build.sh serve --model model.gguf --api-port 49222 --ws-enable --tls-enable

# With custom certificate and key
./build.sh serve --model model.gguf --api-port 49222 --ws-enable --tls-enable --tls-cert /path/to/cert.pem --tls-key /path/to/key.pem

When --tls-enable is set without --tls-cert/--tls-key, self-signed certificates are auto-generated.

Note: The TLS certificate is also applied to the API proxy server, so both the dashboard and API endpoints use HTTPS.

Connection Status

The dashboard shows a connection indicator at the top of the page:

Green pulsing dot — Connected via WebSocket
Red dot — Disconnected (auto-reconnects every 2 seconds)

Architecture

The dashboard server is built with axum and tokio. It:

Creates a broadcast::channel(64) for metrics distribution
Spawns the server on the configured port
Each metrics update is sent to the broadcast channel
WebSocket clients subscribe and receive real-time updates
The HTML dashboard (embedded in the binary) connects via WebSocket and renders the metrics

The server is started/stopped automatically when you toggle the Dashboard setting in Server Settings.
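The fan-out pattern can be illustrated with a small Python analogy. This is not the Rust implementation: the class and its methods are hypothetical, and dropping the oldest item for a lagging subscriber only loosely mirrors how a bounded broadcast channel behaves.

```python
import asyncio

class Broadcast:
    """Bounded fan-out: every subscriber gets its own queue of recent updates."""

    def __init__(self, capacity: int = 64):
        self.capacity = capacity
        self.subscribers = set()

    def subscribe(self) -> asyncio.Queue:
        queue = asyncio.Queue(maxsize=self.capacity)
        self.subscribers.add(queue)
        return queue

    def unsubscribe(self, queue: asyncio.Queue) -> None:
        self.subscribers.discard(queue)

    def send(self, item) -> None:
        for queue in self.subscribers:
            if queue.full():
                queue.get_nowait()  # a lagging subscriber loses its oldest update
            queue.put_nowait(item)

async def demo():
    bus = Broadcast(capacity=2)
    first, second = bus.subscribe(), bus.subscribe()
    for tps in (10, 11, 12):
        bus.send({"tps": tps})
    seen = [first.get_nowait()["tps"], first.get_nowait()["tps"]]
    return seen, second.qsize()
```

Running `asyncio.run(demo())` shows that each subscriber keeps only the two most recent updates when the capacity is 2.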

Web Search

llm-manager can automatically search the web when a chat message contains the web search trigger (see Trigger). Results are fetched via SearXNG and injected into the prompt before your message, allowing the LLM to cite sources and provide up-to-date information.

Server-Side Flow

Web search runs entirely on the llm-manager server. External clients (chat frontends, curl, etc.) connect to llm-manager’s API proxy (default port 49222) just like any other chat request — no special headers or endpoints needed. The server intercepts chat completions requests, checks for search keywords, performs the SearXNG search, injects the results into the prompt, and forwards the enriched request to llama-server.

┌──────────┐     /v1/chat/completions      ┌──────────────────┐
│  Client  │ ──────────────────────────────►│ llm-manager API  │
│ (curl,   │ ◄──────────────────────────────│ proxy (port 49222)│
│  UI, etc)│     SSE streaming response     └────────┬─────────┘
└──────────┘                                        │
                                                    │ triggers SearXNG
                                                    ▼
                                           ┌──────────────────┐
                                           │   SearXNG        │
                                           │   instance       │
                                           └──────────────────┘

The web_search_engine_url config points to the SearXNG instance, not the client. Clients never need direct access to SearXNG — they only talk to llm-manager’s API proxy.

Trigger

Web search triggers when your message contains the $web marker, for example:

$web best model for coding 2026
$web compare qwen 3 and llama 4
$web recommend vision model

Configuration

Via Server Settings Panel

Open the Server Settings panel (press F2 or l when focused)
Navigate to the Web Search field using arrow keys
Press ↵ (Enter) to open the Web Search Picker dialog

The dialog (65 columns wide, 15 rows tall) shows:

Field	Type	Description
Enabled	Toggle	Shows “On” (green) or “Off” (gray) — press `↵` to toggle
Engine	Dropdown	Search engine: `searxng`
Engine URL	Text input	URL of your SearXNG instance (e.g., `https://search.example.com`)
API Key	Text input	Bearer token for authentication (optional, masked as `****` when set)

Navigation: ↑/↓ (or j/k) to move between fields, ↵ to toggle/edit, ⎋ (Esc) to close.

Via config.yaml

Add these fields to your ~/.config/llm-manager/config.yaml:

default:
  web_search_enabled: true
  web_search_engine: searxng
  web_search_engine_url: "https://search.example.com"
  web_search_api_key: null  # optional, omit or set to null if not needed

Per-Model Override

Web search settings can also be configured per-model in ~/.config/llm-manager/models/<model_name>.yaml:

web_search_enabled: true
web_search_engine: searxng
web_search_engine_url: "https://search.example.com"
web_search_api_key: null

Model-level settings override the global defaults.

How It Works

When a message matches a trigger keyword:

Query extraction — the full user message is used as the search query
SearXNG search — HTTP GET request to {engine_url}/search?q={query}&format=json
Result parsing — expects JSON with a results array; each result needs title, url, and content/snippet fields
Page fetching — Wikipedia results and up to 5 other URLs have their page content fetched in parallel
Context injection — results are prepended to the message as a [WEB CONTEXT]...[END WEB CONTEXT] block

Request Details

Endpoint: {engine_url}/search?q={url_encoded_query}&format=json
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
Authentication: Authorization: Bearer {api_key} header (only if api_key is configured)
Timeout: 15 seconds
Max results: 10
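The trigger check, request URL, and result parsing described above can be sketched with the standard library. Function names are hypothetical, and skipping results that lack a title or URL is an assumption about how incomplete entries are handled.

```python
import json
from urllib.parse import urlencode

TRIGGER = "$web"
MAX_RESULTS = 10  # documented limit

def should_search(message: str) -> bool:
    return TRIGGER in message

def search_url(engine_url: str, query: str) -> str:
    """Build {engine_url}/search?q={url_encoded_query}&format=json."""
    return f"{engine_url.rstrip('/')}/search?{urlencode({'q': query, 'format': 'json'})}"

def parse_results(body: str) -> list:
    """Extract title, url, and snippet from a SearXNG JSON response."""
    results = []
    for item in json.loads(body).get("results", [])[:MAX_RESULTS]:
        title, url = item.get("title"), item.get("url")
        snippet = item.get("content") or item.get("snippet") or ""
        if title and url:  # assumed: incomplete results are skipped
            results.append({"title": title, "url": url, "snippet": snippet})
    return results
```

A real request would also set the User-Agent, the optional Bearer header, and the 15-second timeout listed above.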

Injected Prompt Format

The web context is prepended to the user message like this:

[WEB CONTEXT]
INSTRUCTION: Cite sources using inline markdown links in your answer.

## Search Results
1. **Title** - URL
   snippet text

## Web Context
## [Title](URL)
...fetched page content...

[END WEB CONTEXT]

[Original user message]
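A sketch of assembling this block from parsed results and fetched pages follows. The function name and the dictionary keys (`title`, `url`, `snippet`, `text`) are hypothetical; the layout follows the template above.

```python
def build_web_context(results: list, pages: list, user_message: str) -> str:
    """Prepend a [WEB CONTEXT] block to the original user message."""
    lines = [
        "[WEB CONTEXT]",
        "INSTRUCTION: Cite sources using inline markdown links in your answer.",
        "",
        "## Search Results",
    ]
    for index, result in enumerate(results, start=1):
        lines.append(f"{index}. **{result['title']}** - {result['url']}")
        lines.append(f"   {result['snippet']}")
    lines += ["", "## Web Context"]
    for page in pages:
        lines.append(f"## [{page['title']}]({page['url']})")
        lines.append(page["text"])
    lines += ["", "[END WEB CONTEXT]", "", user_message]
    return "\n".join(lines)
```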

Engine Support

Engine	Status	Notes
SearXNG	✅ Fully functional	Requires a configured `engine_url` pointing to a SearXNG instance

SearXNG Setup

SearXNG must be self-hosted; see the official SearXNG installation guides for details.

Minimal `settings.yaml`

SearXNG requires a settings.yaml configuration file. Create one before deploying:

use_default_settings: true
search:
  default_lang: en
  # Enable JSON format for API access (required for llm-manager web search)
  formats:
    - json
server:
  secret_key: "change-this-to-a-random-secret"  # generate with: python3 -c "import secrets; print(secrets.token_hex(32))"
  port: 8081
  bind_address: "0.0.0.0"
  # Base URL — required to avoid 303 redirects
  # Set to the public URL where SearXNG is accessible
  # base_url: "http://localhost:8081"  # or "https://search.example.com"

Podman (standalone)

Run SearXNG as a standalone Podman container:

# Create config directory and settings file
mkdir -p ~/.searxng
cat > ~/.searxng/settings.yaml << 'EOF'
use_default_settings: true
search:
  default_lang: en
  # Enable JSON format for API access (required for llm-manager web search)
  formats:
    - json
server:
  secret_key: "change-this-to-a-random-secret"
  port: 8081
  bind_address: "0.0.0.0"
  # base_url: "http://localhost:8081"  # uncomment if behind reverse proxy
EOF

# Run the container
podman run -d \
  --name searxng \
  -p 8081:8081 \
  -v ~/.searxng/settings.yaml:/etc/searxng/settings.yaml:Z \
  -v ~/.searxng:/etc/searxng/lib/searx:Z \
  --restart unless-stopped \
  searxng/searxng:latest

After deployment, use http://localhost:8081 (or your public URL) as the Engine URL in llm-manager.

Docker Compose

For Docker Compose users, create docker-compose.yml:

services:
  searxng:
    image: searxng/searxng:latest
    ports:
      - "8081:8081"
    volumes:
      - ~/.searxng/settings.yaml:/etc/searxng/settings.yaml:Z
    restart: unless-stopped

Run with:

docker compose up -d

podman-compose

For podman-compose users:

services:
  searxng:
    image: searxng/searxng:latest
    ports:
      - "8081:8081"
    volumes:
      - ~/.searxng/settings.yaml:/etc/searxng/settings.yaml:Z
    restart: unless-stopped

Run with:

podman-compose up -d
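Once the container is running, it helps to confirm that the instance answers JSON queries, since that is the format enabled in settings.yaml. The sketch below uses SearXNG's `/search` endpoint with `format=json`; the function names are hypothetical, and the timeout and result limit simply mirror the values listed in this guide.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

TIMEOUT_SECONDS = 15  # timeout listed in this guide
MAX_RESULTS = 10      # max results listed in this guide


def build_search_url(engine_url, query):
    """Return the SearXNG search URL that requests JSON output."""
    params = urlencode({"q": query, "format": "json"})
    return f"{engine_url.rstrip('/')}/search?{params}"


def extract_results(payload, limit=MAX_RESULTS):
    """Pick title, URL, and snippet from a decoded SearXNG JSON response."""
    return [
        {
            "title": item.get("title", ""),
            "url": item.get("url", ""),
            "snippet": item.get("content", ""),
        }
        for item in payload.get("results", [])[:limit]
    ]


def search(engine_url, query):
    """Query a running SearXNG instance (requires network access)."""
    url = build_search_url(engine_url, query)
    with urlopen(url, timeout=TIMEOUT_SECONDS) as response:
        return extract_results(json.load(response))
```

Calling `search("http://localhost:8081", "rust vs go")` against a working instance should return a list of dictionaries; an HTTP error here usually means the JSON format is not enabled.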

Settings Panel Display

The LLM Settings panel shows the current web search status:

Web Search (Enabled: searxng)

Web Search (Disabled: searxng)

Troubleshooting

303 redirect — set server.base_url in settings.yaml to the public URL (e.g., http://localhost:8081 or https://search.example.com)
Search returns no results — verify the Engine URL is accessible and points to a running SearXNG instance
Timeout errors — web search has a 15-second timeout; slow SearXNG instances may need tuning
Authentication failures — if web_search_api_key is set, ensure the SearXNG instance accepts the Bearer token
Results not appearing in chat — check that trigger keywords are present in the message
HTTPS certificate errors — ensure the SearXNG instance has valid TLS certificates if using https://

Router Mode & Multi-Model Inference

Router Mode enables loading and managing multiple models simultaneously through a single llama-server instance. This is useful for A/B testing models, building model routing systems, or comparing different models in a shared environment.

What is Router Mode?

Router Mode uses llama.cpp’s router API to load multiple models into a single server process. Each model is addressed by its unique identifier, allowing clients to route requests to specific models.

Enabling Router Mode

Open Server Settings (F2 or navigate to Server Settings panel)
Set Mode to Router
Set Router Max Models to your desired limit (default: 4)
Load your first model normally

Loading Models

In Router Mode, models are loaded via the API endpoint /models/load:

curl -X POST http://localhost:8080/models/load \
  -H "Content-Type: application/json" \
  -d '{"model": "model_name"}'

Or through the TUI:

Select a model in the Models panel
Press l or Enter to load it

Each loaded model shows its status in the Models panel with the port and PID.

Managing Models

Listing Loaded Models

Get all loaded models and their status:

curl http://localhost:8080/models

Response format:

[
  {
    "id": "model_name",
    "object": "model",
    "owned_by": "user",
    "path": "/path/to/model.gguf"
  }
]

Unloading Models

Unload a specific model:

curl -X POST http://localhost:8080/models/unload \
  -H "Content-Type: application/json" \
  -d '{"model": "model_name"}'

Or in the TUI:

Select a loaded model
Press u to unload it

Deleting Models

Delete a model (moves to unused directory):

Select the model
Press Ctrl+D
Confirm deletion

VRAM Management

Understanding VRAM Usage

Each loaded model consumes VRAM proportional to:

Model size (quantization level)
GPU layers offloaded
Context length
Batch size

Monitoring VRAM

The Active Model panel shows:

VRAM: GPU memory used/total per model
Total VRAM: Sum of all model VRAM usage
Context usage: Progress bar showing ctx_used/ctx_max

VRAM Estimation

The app computes VRAM estimates based on:

Model file size (with MoE expert ratio applied to FFN portion for mixture-of-experts models)
GPU layers mode (Auto/Specific/All)
KV cache settings (Flash Attention, quantization, YaRN RoPE scale)
Activation overhead (8× multiplier)
Fixed overhead (3.8% of max VRAM)

The estimate is shown in the LLM Settings title (e.g., “VRAM ~= 8.2 GB”).
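To get a feel for the KV cache term, the standard sizing formula for a transformer KV cache can be computed by hand. This is not the app's estimator, only a sketch of one component; the function name is hypothetical, and the example figures assume an 8B Llama-style model with 32 layers, 8 KV heads, and a head dimension of 128.

```python
def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim, bytes_per_element=2):
    """Bytes needed for the K and V caches of one sequence at full context.

    The leading factor 2 accounts for storing both keys and values.
    bytes_per_element=2 corresponds to an unquantized f16 cache.
    """
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_element


# Assumed example: 32 layers, 8192-token context, 8 KV heads, head_dim 128, f16.
size = kv_cache_bytes(n_layers=32, n_ctx=8192, n_kv_heads=8, head_dim=128)
print(f"{size / 2**30:.2f} GiB")
```

Because the result scales linearly with context length, halving the context halves this term, and quantizing the KV cache lowers `bytes_per_element`.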

Best Practices

Leave headroom: Keep 10-20% VRAM free for KV cache and activations
Use lower-bit quantization: Q4_K_M or Q5_K_M files are smaller, which leaves room to load more models at once
Reduce context length: Shorter contexts use less VRAM
Monitor Total VRAM: The Active Model panel shows combined usage

Configuration

Router Max Models

Set the maximum number of models that can be loaded simultaneously.

In config.yaml:

default:
  router_max_models: 4

In the TUI:

Open Server Settings
Navigate to Router Max Models
Adjust value (1-10)

Per-Model Settings

Each model can have its own settings:

Context length
GPU layers
Temperature
Sampling parameters

Settings are stored in ~/.config/llm-manager/models/<name>.yaml.

API Usage

Chat Completions

Route to a specific model:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "model_name",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

Streaming

Enable SSE streaming:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "model_name",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": true
  }'
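On the client side, a streamed response arrives as server-sent event lines that each carry a JSON chunk. The sketch below assumes the usual OpenAI-compatible chunk shape (`choices[0].delta.content`, terminated by `data: [DONE]`); both function names are hypothetical.

```python
import json


def build_chat_request(model, prompt, stream=False):
    """Build the request body used in the curl examples above."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    if stream:
        body["stream"] = True
    return body


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text
```

Joining the yielded fragments reconstructs the full reply; printing them as they arrive gives the typing effect seen in chat interfaces.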

Use Cases

Model Comparison

Load multiple models to compare outputs:

Load Model A and Model B
Send the same prompt to both
Compare generation quality and speed

Specialized Models

Run different models for different tasks:

Coder model: Optimized for code generation
Math model: Optimized for mathematical reasoning
General model: For everyday conversations

Route requests based on the task type.

A/B Testing

Test model updates without downtime:

Load the old model
Load the new model
Route some traffic to each
Compare metrics

Limitations

VRAM Constraints

Each model consumes VRAM independently
Total VRAM cannot exceed GPU capacity
KV cache for each model is allocated separately

Model Conflicts

Models with different architectures may have compatibility issues
Some parameters (like context length) are model-specific
Loading/unloading affects all models in the server

Performance

Shared server means shared CPU/GPU resources
Heavy models may impact lighter model performance
Monitor system resources during multi-model operation

Troubleshooting

“VRAM exceeded” error

Reduce GPU layers for loaded models
Use lower quantization models
Unload unused models
Reduce context length

Model fails to load

Check model path is correct
Verify model file exists and is readable
Check available VRAM
Review server logs for specific errors

Slow performance

Reduce number of loaded models
Check for resource contention
Verify GPU is being used (not CPU fallback)
Monitor system resources

Router API not responding

Verify server is running on the expected port
Check network connectivity
Ensure router mode is enabled
Review server logs for errors

Benchmark Tuning

Benchmark Tuning finds the optimal settings for your model and hardware by automatically testing multiple parameter combinations and measuring performance.

When to Benchmark

Benchmark before deploying a model in production to:

Find the fastest settings for your specific GPU/CPU
Compare tradeoffs between throughput (TPS) and latency
Determine the best context length for your use case
Validate speculative decoding improvements
Compare different backends (CPU vs Vulkan vs ROCm vs CUDA)

Accessing Benchmark Mode

Set the server Mode to BenchTune in Server Settings, then press Enter to open the BenchTune Setup modal.

Benchmark Modes

Two modes are available, each with different tradeoffs:

RuntimeOnly (Recommended)

Single server, all parameters sent in the request body. No server restarts between tests.

Pros: Fast (seconds per test), low overhead, preserves server state
Cons: Some parameters may not be reconfigurable at runtime
Best for: Sampling parameters (temperature, top-k, top-p), repetition control, DRY settings

Full

Spawns a new server for each parameter combination.

Pros: Tests all parameters including server-level settings (threads, context, GPU layers)
Cons: Slow (minutes per test due to server startup), higher resource usage
Best for: Hardware-level parameters, backend selection, architecture tuning

Tunable Parameters

The following parameters can be tuned. Enable/disable each with Space:

Parameter	Range	Description	Server/Client
Temperature	0.4–1.0	Sampling randomness	Both
Top-p	0.8–1.0	Nucleus sampling threshold	Both
Top-k	10–40	Token sampling window	Both
Repeat Penalty	1.0–1.5	Repetition suppression	Both
Flash Attention	0/1	Enable/disable Flash Attention 2	Server
Threads	4–16	CPU threads for generation	Server
Batch Size	512–2048	Logical maximum batch size	Server
Expert Count	-1 to 4	MoE experts per token (MoE models)	Server
Context Length	Model default–max	Context window size	Server
Spec Type	draft-mtp, ngram-simple, etc.	Speculative decoding method	Server
Draft Tokens	0–8	Draft tokens per step (speculative)	Server

Server vs Client Parameters

Server parameters require a full server restart to change (threads, context, flash attention). Use Full mode.
Client parameters can be changed per-request (temperature, top-p, top-k). Use RuntimeOnly mode.

Benchmark Configuration

Prompt

The prompt used for each test iteration. Default:

Create Mona Lisa image in ascii art using text, number, symbol, everything possible. this should be the perfect painting.

Edit with Alt+P. Use a prompt representative of your actual workload for meaningful results.

n-predict (Max Tokens)

Number of tokens to generate per test. Default: 512.

Edit with Alt+N. Higher values give more stable TPS measurements but take longer.

Iterations

Number of test iterations per parameter combination. Default: 3.

Edit with Alt+I. More iterations reduce variance but increase total benchmark time.

Test Duration

Maximum time per test iteration. Default: 30 seconds.

Test Timeout

Maximum time for the entire benchmark run. Default: 60 seconds.

Running a Benchmark

Set Mode to BenchTune in Server Settings
Press Enter to open BenchTune Setup
Select parameters to test with Space
Adjust parameter ranges (min/max/step)
Choose mode: RuntimeOnly or Full (toggle with Alt+M)
Edit prompt (Alt+P), n-predict (Alt+N), iterations (Alt+I)
Press Enter to start

The benchmark runs automatically. Progress is shown in the Active Model panel with a progress bar and current parameter display.
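Before starting a run, it is worth estimating how many tests the chosen ranges produce. The sketch below assumes every combination of the enabled parameters is tested (a full grid), which may not match how the app schedules tests; the function names and the step values in the example are illustrative only.

```python
from itertools import product


def value_range(lo, hi, step):
    """Inclusive list of values from lo to hi that tolerates float steps."""
    count = int(round((hi - lo) / step)) + 1
    return [round(lo + i * step, 6) for i in range(count)]


def build_grid(ranges):
    """Expand {name: (min, max, step)} into a list of parameter dicts."""
    names = list(ranges)
    axes = [value_range(*ranges[name]) for name in names]
    return [dict(zip(names, combo)) for combo in product(*axes)]


# Assumed example: two enabled parameters, 3 iterations each.
grid = build_grid({"temperature": (0.4, 1.0, 0.2), "top_k": (10, 40, 10)})
iterations = 3
print(len(grid), "combinations,", len(grid) * iterations, "test runs")
```

The count grows multiplicatively with each enabled parameter, which is why Full mode, with a server start per combination, can take far longer than RuntimeOnly.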

Interpreting Results

Results include these metrics per test:

Metric	Description	What it means
Prompt TPS	Tokens processed per second during prompt evaluation	How fast the model reads your input
Generation TPS	Tokens generated per second	How fast the model produces output
Combined TPS	Total tokens (prompt + generation) per total time	Overall throughput
Latency/Token	Milliseconds per generated token	User-perceived responsiveness
First Token Time	Milliseconds until first token appears	Time to first response
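The throughput metrics in the table follow directly from token counts and timings. The sketch below derives them from four raw measurements; the function name and dictionary keys are hypothetical, and first-token time is left out because it has to be measured with a timestamp rather than computed.

```python
def summarize(prompt_tokens, prompt_seconds, gen_tokens, gen_seconds):
    """Derive throughput and latency metrics from raw counts and timings."""
    total_seconds = prompt_seconds + gen_seconds
    return {
        "prompt_tps": prompt_tokens / prompt_seconds,
        "generation_tps": gen_tokens / gen_seconds,
        "combined_tps": (prompt_tokens + gen_tokens) / total_seconds,
        "latency_ms_per_token": 1000.0 * gen_seconds / gen_tokens,
    }


# Assumed example: 100 prompt tokens in 0.5 s, 512 generated tokens in 16 s.
metrics = summarize(100, 0.5, 512, 16.0)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Note that latency per token is simply the inverse of generation TPS expressed in milliseconds, so the two always rank configurations identically.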

Reading the Results Table

Results are displayed in a table sorted by combined TPS (descending). Use n (next) and p (previous) to navigate between results.

The winner section highlights the best configuration. Look for:

Highest generation TPS for chat applications
Lowest latency/token for interactive use
Lowest first-token-time for responsive UX

Impact Analysis

Each result includes an impact analysis showing how each parameter change affected performance:

Positive impact: Parameter increased throughput
Negative impact: Parameter decreased throughput
Neutral: No significant change

Exporting Results

Results can be exported in multiple formats. Press e in the results view to export:

Format	Use Case
Markdown table	Documentation, sharing via chat/email
JSON	Programmatic analysis, CI/CD pipelines
YAML	Configuration files, version control
HTML report	Visual analysis with Chart.js charts

The HTML report includes:

Summary cards with key metrics
Winner section with recommended configuration
Impact analysis charts
Per-parameter performance breakdowns
Chart.js interactive charts for visual comparison

Example Workflows

Quick Chat Optimization

Goal: Find best settings for a chat application.

Set Mode to BenchTune
Enable: Temperature, Top-p, Top-k, Repeat Penalty
Select RuntimeOnly mode
Set iterations to 5 for stable results
Run benchmark
Export as Markdown for documentation

Hardware Stress Test

Goal: Find maximum throughput on your hardware.

Set Mode to BenchTune
Enable: Threads, Batch Size, Context Length, Flash Attention
Select Full mode (tests server-level params)
Set iterations to 3
Run benchmark (may take 20-30 minutes)
Export as HTML for visual analysis

Speculative Decoding Comparison

Goal: Compare speculative decoding methods.

Set Mode to BenchTune
Enable: Spec Type, Draft Tokens
Select Full mode
Set n-predict to 256 for meaningful speculative gains
Run benchmark
Compare generation TPS and first-token-time across spec types

Tips and Best Practices

Use representative prompts: Benchmark with prompts similar to your actual workload
Control variables: Change one parameter at a time for clear attribution
Run multiple iterations: Use at least 3 iterations to reduce variance
Warm up: Let the model run for a few tokens before measuring (handled automatically)
Monitor system resources: Watch for CPU/GPU saturation during benchmarks
Compare baselines: Always benchmark against your current settings
Document results: Export results to track improvements over time

Troubleshooting

Benchmark hangs or times out

Increase Test Timeout in BenchTune Setup
Reduce n-predict to shorter generation tasks
Check server logs for errors

Inconsistent results

Increase Iterations for more stable averages
Ensure no other processes are competing for GPU/CPU
Close other applications using the GPU

Low TPS values

Verify GPU is being used (check VRAM usage)
Try enabling Flash Attention if supported
Increase batch size if VRAM allows
Check that threads match your physical core count

Benchmark fails on certain parameters

Some parameters cannot be changed at runtime (use Full mode)
Context length changes require server restart
Backend-specific parameters may not be tunable

Distributed Inference (RPC Workers)

RPC Workers enable distributed inference across multiple machines. Each worker runs a llama-rpc-server that exposes part of a model or a complete model for remote inference.

What is Distributed Inference?

Distributed inference splits model computation across multiple machines, allowing you to:

Combine GPU resources from multiple machines
Run models larger than a single GPU’s VRAM
Reduce latency by placing workers closer to clients
Scale inference horizontally

Architecture

                    ┌─────────────────┐
                    │   llm-manager   │
                    │   (TUI Client)  │
                    └────────┬────────┘
                             │ RPC
             ┌───────────────┼───────────────┐
             │               │               │
     ┌───────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
     │  Worker 1    │ │  Worker 2   │ │  Worker 3   │
     │  (GPU: A100) │ │ (GPU: 4090) │ │ (GPU: A6000)│
     └──────────────┘ └─────────────┘ └─────────────┘

Setting Up Workers

Prerequisites

llama-rpc-server binary on each worker machine
Network connectivity between client and workers
Consistent llama.cpp versions across all machines

Installing llama-rpc-server

Each worker machine needs the RPC server binary:

# Download from llama.cpp releases
wget https://github.com/ggml-org/llama.cpp/releases/download/b4100/llama-rpc-server-ubuntu-x64.tar.gz
tar -xzf llama-rpc-server-ubuntu-x64.tar.gz

Starting a Worker

Start the RPC server on each worker machine:

./llama-rpc-server \
  --model /path/to/model.gguf \
  --host 0.0.0.0 \
  --port 50052

Parameters:

--model: Path to the GGUF model file
--host: Bind address (use 0.0.0.0 for network access)
--port: RPC port (default: 50052)

Managing Workers in TUI

Opening the RPC Manager

Open Server Settings (F2)
Navigate to RPC Workers
Press Enter to open the RPC Workers manager

Adding a Worker

In the RPC Workers manager, press n to add a new worker
Enter worker details: [Name], [IP], [Port]
- Example: GPU-A100, 192.168.1.10, 50052
- Or: 192.168.1.10, 50052 (name auto-generated)
- Or: 192.168.1.10 (name and port use defaults)
Press Enter to save
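The three accepted input forms can be told apart by counting the comma-separated fields. The sketch below is an illustration of that rule only: the function name and the `worker-<ip>` naming scheme for auto-generated names are assumptions, not the app's actual behavior.

```python
DEFAULT_RPC_PORT = 50052


def parse_worker_entry(text, default_port=DEFAULT_RPC_PORT):
    """Parse "[Name], IP, [Port]" into a worker dict, filling in defaults."""
    parts = [part.strip() for part in text.split(",") if part.strip()]
    if len(parts) == 3:
        name, ip, port = parts[0], parts[1], int(parts[2])
    elif len(parts) == 2:
        name, ip, port = None, parts[0], int(parts[1])
    elif len(parts) == 1:
        name, ip, port = None, parts[0], default_port
    else:
        raise ValueError("expected: [Name], IP, [Port]")
    if name is None:
        name = f"worker-{ip}"  # hypothetical auto-generated name
    return {"name": name, "ip": ip, "port": port}
```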

Editing a Worker

Select the worker in the list
Press e to edit
Modify the details
Press Enter to save

Deleting a Worker

Select the worker
Press d to delete
Confirm deletion

Selecting Workers

Use Space to toggle worker selection
Selected workers are combined into the --rpc flag when starting the server
Only selected workers are used for inference

Worker Configuration

Worker Properties

Each worker has these properties:

Property	Description	Example
Name	Human-readable identifier	`GPU-A100-Rack1`
IP	Network address of the worker	`192.168.1.10`
Port	RPC server port (default: 50052)	`50052`
Selected	Whether to use this worker	`true`/`false`

Network Configuration

Firewall Rules

Ensure port 50052 (or your chosen port) is open:

# Ubuntu/Debian
sudo ufw allow 50052/tcp

# CentOS/RHEL
sudo firewall-cmd --add-port=50052/tcp --permanent
sudo firewall-cmd --reload

SSH Tunneling

For secure connections without opening ports:

ssh -L 50052:localhost:50052 user@remote-worker

Then use localhost:50052 as the worker address in llm-manager.

Using Distributed Inference

Loading a Model Across Workers

Configure and select your workers
Set Mode to Router (Work In Progress — not yet selectable in TUI) or Normal
Load your model
The model is distributed across selected workers

Monitoring Workers

The Active Model panel shows total VRAM usage across all workers. Per-worker metrics are not yet displayed in the TUI.

RPC Settings

Configure RPC behavior in LLM Settings:

Setting	Description
RPC	Comma-separated list of worker endpoints
Tensor Split	Fraction of model per GPU (for multi-GPU workers)

Example RPC string: localhost:50052,192.168.1.10:50052
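The RPC string is just the selected workers' endpoints joined by commas. A minimal sketch, assuming a `Worker` record with the properties from the table above (the class and function names are hypothetical):

```python
from dataclasses import dataclass

DEFAULT_RPC_PORT = 50052


@dataclass
class Worker:
    name: str
    ip: str
    port: int = DEFAULT_RPC_PORT
    selected: bool = False


def rpc_endpoints(workers):
    """Join the selected workers into a comma-separated endpoint list."""
    return ",".join(f"{w.ip}:{w.port}" for w in workers if w.selected)


workers = [
    Worker("local", "localhost", selected=True),
    Worker("GPU-A100", "192.168.1.10", selected=True),
    Worker("spare", "192.168.1.11", 50053),  # not selected, so it is skipped
]
print(rpc_endpoints(workers))
```

Unselected workers stay in the list for later use but contribute nothing to the string.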

Use Cases

Multi-GPU Workstation

Combine multiple GPUs in a single workstation:

Worker 1: localhost:50052 (GPU 0 - RTX 4090)
Worker 2: localhost:50053 (GPU 1 - RTX 4090)

Distributed Data Center

Spread inference across a data center:

Worker 1: 10.0.1.10:50052 (A100 80GB)
Worker 2: 10.0.1.11:50052 (A100 80GB)
Worker 3: 10.0.1.12:50052 (A100 80GB)

Edge Computing

Deploy workers close to end users:

Worker 1: us-east.worker.example.com:50052
Worker 2: eu-west.worker.example.com:50052
Worker 3: ap-southeast.worker.example.com:50052

Troubleshooting

Worker Connection Failed

Verify worker IP address is correct
Check firewall rules allow port 50052
Ensure llama-rpc-server is running on the worker
Test connectivity: nc -zv <worker-ip> 50052
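Where `nc` is not installed, the same TCP reachability check can be done with Python's standard library. This is a sketch with a hypothetical function name; it only confirms that something accepts connections on the port, not that it is a compatible RPC server.

```python
import socket


def worker_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, `worker_reachable("192.168.1.10", 50052)` returning False points to a firewall, address, or process problem rather than a model problem.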

Model Fails to Load

Verify all workers have sufficient VRAM
Check network latency between workers (should be <1ms for optimal performance)
Ensure llama.cpp versions match across all workers
Review worker logs for errors

Slow Inference

Check network bandwidth between client and workers
Verify workers are not CPU-bound
Monitor GPU utilization on each worker
Consider reducing model size or context length

Worker Disconnected

Check worker machine is still running
Verify network connectivity
Restart llama-rpc-server if needed
Re-select the worker in llm-manager

Best Practices

Hardware Selection

Use GPUs with similar capabilities for balanced performance
Prefer GPUs with high memory bandwidth (A100, H100, RTX 4090)
Ensure sufficient VRAM for your model size

Network Setup

Use wired connections (Ethernet) for workers
Minimize network latency between workers (<1ms ideal)
Use dedicated network interfaces for RPC traffic if possible

Monitoring

Monitor VRAM usage on each worker
Track network latency and throughput
Set up alerts for worker disconnections
Log inference metrics for capacity planning

Security

Use SSH tunneling for untrusted networks
Implement firewall rules to restrict access
Consider TLS for RPC communication (if supported by your llama.cpp version)
Use strong authentication for remote workers

Advanced Configuration

Custom Ports

Each worker can use a different port:

Worker 1: 192.168.1.10:50052
Worker 2: 192.168.1.11:50053
Worker 3: 192.168.1.12:50054

Dynamic Worker Management

Workers can be added or removed without restarting llm-manager; the change takes effect the next time the model is loaded:

Open RPC Workers manager
Add new worker
Select the worker
Reload the model to incorporate the new worker

Worker Health Checks

The app automatically checks worker health:

Connection status shown in RPC Workers manager
Failed workers are marked and excluded from inference
Reconnection attempts are made automatically

LLM Settings

LLM Settings cover model parameters and inference configuration.

GGUF Filename Explanation

GGUF filenames encode the model’s architecture, quantization, and source. Press Ctrl+G from any panel to open a popup that parses and explains the filename.

How It Works

The parser splits the filename into segments and provides a description for each:

Model family: e.g., “Qwen3.6-35B-A3B”, “Llama-3.1-8B”
Unsloth Dynamic: indicates the model was quantized with Unsloth’s dynamic quantization method
Quantization: e.g., “Q4_K_M”, “Q5_K_S”
Extension: “.gguf”
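A rough idea of how such a split can work is sketched below. It is not the app's parser: the function name, the regular expression, and the assumption that Unsloth Dynamic files carry a `UD` segment in the filename are all illustrative.

```python
import re

# Matches quantization tags such as Q4_K_M, Q8_0, IQ4_XS, F16, BF16.
QUANT_RE = re.compile(r"^(?:I?Q\d\w*|F16|BF16|F32)$", re.IGNORECASE)


def parse_gguf_filename(filename):
    """Split a hyphen-separated GGUF filename into descriptive segments.

    Assumes the name ends in an extension such as ".gguf".
    """
    stem, _, ext = filename.rpartition(".")
    tokens = stem.split("-")
    quant = next((t for t in reversed(tokens) if QUANT_RE.match(t)), None)
    unsloth = any(t.upper() == "UD" for t in tokens)  # assumed marker
    family = "-".join(t for t in tokens if t != quant and t.upper() != "UD")
    return {
        "family": family,
        "unsloth_dynamic": unsloth,
        "quantization": quant,
        "extension": "." + ext,
    }
```

Scanning from the right for the quantization tag keeps version numbers inside the family name, such as the `3.1` in `Llama-3.1-8B`, from being mistaken for it.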

Quantization Legend

Quant	Description
Q4_0	4-bit quantization (legacy)
Q4_K_M	4-bit mixed quantization (recommended)
Q5_K_M	5-bit mixed quantization
Q5_K_S	5-bit small quantization
Q8_0	8-bit quantization
F16	Floating-point 16-bit (unquantized)

Model Families

The parser recognizes common model families and provides specific explanations:

Qwen — Alibaba’s Qwen models (dense and MoE)
Llama — Meta’s Llama models
Gemma — Google’s Gemma models
Mistral — Mistral AI models
Phi — Microsoft’s Phi models
And more — Custom explanations for other architectures

Auto-Detection of MoE Models

For MoE (Mixture-of-Experts) models, the parser extracts and displays the expert count and parameter breakdown. For example, Qwen3.6-35B-A3B indicates a model with 35B total parameters, of which about 3B are active per token.
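The total and active parameter counts can be read straight from the `<total>B-A<active>B` pattern in the name. A minimal sketch, with a hypothetical function name and a regular expression written for this naming convention only:

```python
import re

# Matches names like "35B-A3B": total parameters, then active parameters.
MOE_RE = re.compile(r"(\d+(?:\.\d+)?)B-A(\d+(?:\.\d+)?)B", re.IGNORECASE)


def moe_breakdown(name):
    """Return total/active parameter counts (in billions), or None if absent."""
    match = MOE_RE.search(name)
    if match is None:
        return None
    total, active = float(match.group(1)), float(match.group(2))
    return {"total_b": total, "active_b": active, "active_fraction": active / total}
```

For `Qwen3.6-35B-A3B` this yields 35 total and 3 active, meaning under a tenth of the weights take part in each token. The expert count itself is not encoded in the name and has to come from the GGUF metadata.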

Unsloth Dynamic

When a model has been quantized with Unsloth’s dynamic quantization method, the filename carries an Unsloth Dynamic marker (typically a “UD” tag ahead of the quantization level). Rather than applying one bit width uniformly, this approach chooses precision selectively per layer, which typically yields better quality at the same quantization level.

Cache

KV cache management for performance optimization.

Cache Prompt

Controls whether the server reuses the KV cache from the previous request when a new prompt shares a prefix with it.

Configuration

Setting	Default	Description
Cache Prompt	true	Reuse the KV cache from the previous request for the shared prompt prefix

How It Works

When enabled (default), the server compares each new prompt with what is already in the KV cache and processes only the part that differs. This means:

The shared prefix (for example, a system prompt or earlier chat turns) is not re-processed
Time to first token drops for multi-turn chats and repeated prompts
Results may not be bit-for-bit reproducible, because prompt processing and token generation can use different batch sizes

When disabled, every request processes its full prompt from scratch. This means:

The prompt is recomputed on every request, even if it is identical to the previous one
Prompt processing time grows with the full prompt length on each turn
Output is more reproducible across otherwise identical requests

When to Disable

Reproducibility testing: when identical requests must produce identical results
Benchmarking prompt processing: when every run should measure full prompt evaluation

Config Key

cache_prompt — boolean, default true.
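Prefix reuse can be pictured as finding the longest run of leading tokens that the cached prompt and the new prompt have in common. The sketch below uses plain integers as stand-in token IDs and hypothetical function names; it illustrates the idea and is not how llama.cpp implements it.

```python
def common_prefix_len(cached_tokens, new_tokens):
    """Count how many leading tokens the two sequences share."""
    shared = 0
    for cached, new in zip(cached_tokens, new_tokens):
        if cached != new:
            break
        shared += 1
    return shared


def tokens_to_process(cached_tokens, new_tokens, cache_prompt=True):
    """Return the part of the new prompt that still has to be evaluated."""
    reused = common_prefix_len(cached_tokens, new_tokens) if cache_prompt else 0
    return new_tokens[reused:]


previous = [1, 2, 3, 4, 5]        # tokens already in the KV cache
current = [1, 2, 3, 9, 10, 11]    # new prompt sharing a 3-token prefix
print(tokens_to_process(previous, current))
print(tokens_to_process(previous, current, cache_prompt=False))
```

With caching on, only the three tokens after the shared prefix are evaluated; with it off, all six are.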

Cache Reuse

Sets the minimum size, in tokens, of a matching chunk that the server will try to reuse from the KV cache (via KV shifting) when a new prompt overlaps the previous one.

Configuration

Setting	Default	Description
Cache Reuse	0	Minimum chunk size, in tokens, to attempt reusing from the previous KV cache (0 disables)

How It Works

A new prompt that starts with exactly the same tokens as the previous one already benefits from Cache Prompt: the shared prefix is not recomputed. Cache Reuse covers the case where matching text reappears at a different position. When a matching chunk of at least the configured number of tokens is found in the cache, the server shifts the cached entries into place instead of recomputing them. This corresponds to llama-server’s --cache-reuse option and requires Cache Prompt to be enabled.

For example, suppose a long chat drops its oldest turn to stay within the context window. The remaining turns are unchanged but now sit earlier in the prompt. With cache reuse set to 8, any surviving chunk of 8 or more consecutive matching tokens can be reused, and only the tokens that do not match are computed.

When to Use

High values (100+): reuse only long matching chunks, such as large documents or long earlier turns that persist between requests
Low values (1-16): also attempt reuse for short matching chunks
Zero: disables chunk reuse entirely; exact-prefix reuse through Cache Prompt still applies

Config Key

cache_reuse — integer, default 0.

Speculative Decoding

Speculative Decoding accelerates inference by using a smaller “draft” model to predict multiple tokens, which are then verified by the main model in parallel. This can significantly reduce generation time without sacrificing quality.

How Speculative Decoding Works

Standard Decoding:
  [Token1] → [Token2] → [Token3] → [Token4] → [Token5]
  Time: 5 steps

Speculative Decoding:
  Draft:  [T1] → [T2] → [T3] → [T4] → [T5]
  Verify:  ✓      ✓      ✓      ✗     (T4 rejected; T4 and T5 are discarded and respeculated)
  Time: 2 steps

The draft model generates tokens quickly. The main model verifies them in parallel. Accepted tokens are kept; rejected tokens trigger respeculation.
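The accept-or-reject step can be sketched for the simplest case, greedy decoding, where a draft token is accepted only if it equals the token the main model would have chosen. The function name is hypothetical, and `target_tokens` stands in for the main model's choices, which a real server obtains in a single parallel pass.

```python
def verify_draft(draft_tokens, target_tokens):
    """Greedy verification of one batch of draft tokens.

    Accept draft tokens while they match the main model's choices. At the
    first mismatch, keep the main model's token and drop the rest of the
    draft. If every draft token matches, the main model's next token is
    gained as a bonus.
    """
    accepted = []
    for drafted, expected in zip(draft_tokens, target_tokens):
        if drafted != expected:
            accepted.append(expected)  # correction from the main model
            return accepted
        accepted.append(drafted)
    if len(target_tokens) > len(draft_tokens):
        accepted.append(target_tokens[len(draft_tokens)])
    return accepted


draft = ["the", "cat", "sat", "in"]
target = ["the", "cat", "sat", "on", "the"]
print(verify_draft(draft, target))
```

Here one verification pass yields four tokens (three accepted drafts plus one correction) where standard decoding would have needed four passes. The output always matches what the main model alone would have produced, which is why quality is preserved.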

Speculative Decoding Types

Draft MTP (Multi-Token Prediction)

Uses a model’s built-in draft tokens for speculative decoding. Requires a model with MTP architecture.

Best for: Models specifically designed with MTP (e.g., Qwen2.5-MoE)
Performance: Highest speedup when draft tokens are accurate
Auto-detection: llm-manager automatically detects MTP models and enables this type

draft-simple

Simple n-gram based speculative decoding.

Best for: General use, no special model required
Performance: Moderate speedup
Compatibility: Works with any GGUF model

draft-eagle3

EAGLE3 (third generation of EAGLE, the Extrapolation Algorithm for Greater Language-model Efficiency) speculative decoding.

Best for: High-quality generation with good speedup
Performance: Good balance of speed and quality
Requirements: Model must support EAGLE3 architecture

ngram-simple

N-gram based simple speculative decoding.

Best for: Fast setup, minimal configuration
Performance: Basic speedup
Compatibility: Works with any model

ngram-map-k

N-gram mapping with k-nearest neighbors.

Best for: Models with repetitive patterns
Performance: Variable, depends on text patterns
Complexity: Higher memory usage

ngram-map-k4v

N-gram mapping with k-nearest neighbors (4th variant).

Best for: Models with specific n-gram patterns
Performance: Better than ngram-map-k for certain models
Complexity: Higher memory usage

ngram-mod

Modified n-gram speculative decoding.

Best for: Experimental use
Performance: Varies by model

ngram-cache

N-gram cache-based speculative decoding.

Best for: Repeated prompts or templates
Performance: Excellent for templated generation

Enabling Speculative Decoding

In the TUI

Open LLM Settings (F3)
Navigate to Speculative Decoding section
Toggle MTP to enable
Select Spec Type from the dropdown
Set Spec Draft N Max (0-16, default: 0)

In Config

default:
  spec_type: "draft-mtp"
  draft_tokens: 4

Auto-Detection

llm-manager automatically detects MTP models:

Load a model with MTP architecture
The app reads draft tokens from GGUF metadata
MTP is automatically enabled with appropriate settings
Draft token count is displayed in the Model Info panel

Configuration

Spec Type

Select the speculative decoding method:

Spec Type	Model Requirement	Speedup	Quality
`draft-mtp`	MTP architecture	High	Excellent
`draft-simple`	Any	Moderate	Good
`draft-eagle3`	EAGLE3 architecture	High	Excellent
`ngram-simple`	Any	Low-Moderate	Good
`ngram-map-k`	Any	Moderate	Good
`ngram-map-k4v`	Any	Moderate	Good
`ngram-mod`	Any	Variable	Good
`ngram-cache`	Any	High (templated)	Excellent
Off	N/A	None	N/A

Draft Tokens (N Max)

Maximum number of draft tokens per step:

Value	Use Case	Tradeoff
0	Disabled	No speedup, no overhead
1-2	Conservative	Low speedup, minimal rejection
3-4	Recommended	Good balance of speed and accuracy
5-8	Aggressive	Higher speedup, more rejections
9-16	Maximum	Highest potential speedup, high rejection rate

Optimal value depends on your model and text patterns. Benchmark to find the best setting.

Performance Expectations

Typical Speedups

Scenario	Expected Speedup
MTP model with draft-mtp	1.5-2.5×
General model with draft-simple	1.2-1.5×
Templated text with ngram-cache	2.0-3.0×
Creative writing	1.1-1.3×
Code generation	1.3-1.8×

Factors Affecting Performance

Draft accuracy: Higher accuracy = more accepted tokens = better speedup
Model architecture: Some models benefit more than others
Text patterns: Repetitive patterns are easier to speculate
Context length: Longer contexts may reduce speculation accuracy
Draft token count: Too many drafts increase rejection rate

Benchmarking Speculative Decoding

Use Benchmark Tuning to find optimal speculative decoding settings:

Set Mode to BenchTune
Enable Spec Type and Draft Tokens
Run benchmark with different spec types
Compare generation TPS and latency
Export results to find the best configuration

Troubleshooting

No Speedup Observed

Check draft accuracy (too many rejections)
Reduce Draft N Max to lower rejection rate
Try a different spec type
Verify model supports speculative decoding

Quality Degradation

Reduce Draft N Max
Switch to a more accurate spec type (e.g., draft-mtp)
Increase temperature slightly to compensate
Check draft token count matches model capabilities

Model Not Detected as MTP

Verify model has MTP architecture (check GGUF metadata)
Ensure draft tokens are present in metadata
Check llama.cpp version supports MTP
Review server logs for MTP detection messages

High Rejection Rate

Reduce Draft N Max
Try a different spec type
Check if model is suitable for speculative decoding
Verify draft model matches main model architecture

Best Practices

Choosing a Spec Type

MTP models: Always use draft-mtp
General purpose: Start with draft-simple or draft-eagle3
Templated generation: Use ngram-cache
Code generation: Use draft-mtp or draft-simple
Creative writing: Use draft-simple

Setting Draft Tokens

Start with 4 as a baseline
Increase if acceptance rate is high (>70%)
Decrease if acceptance rate is low (<50%)
Monitor first-token-time for interactive use
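These rules of thumb translate into a small adjustment function. The sketch below is illustrative: the function names and the step of one token per adjustment are assumptions, while the 70% and 50% thresholds and the upper bound of 16 come from this guide.

```python
def acceptance_rate(accepted, drafted):
    """Fraction of drafted tokens that the main model accepted."""
    return accepted / drafted if drafted else 0.0


def adjust_draft_tokens(current, rate, low=0.50, high=0.70, minimum=1, maximum=16):
    """Nudge the draft token count up or down based on the acceptance rate."""
    if rate > high:
        return min(current + 1, maximum)
    if rate < low:
        return max(current - 1, minimum)
    return current  # between the thresholds: leave the setting alone
```

Starting from the baseline of 4, an acceptance rate of 0.8 suggests trying 5, while 0.4 suggests dropping to 3. Re-measure after each change, since the acceptance rate itself shifts with the draft length.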

Monitoring Performance

Track these metrics:

Acceptance rate: Percentage of draft tokens accepted
Generation TPS: Tokens per second with speculation
First-token-time: Time until first token appears
Latency: Milliseconds per token

When to Disable

Disable speculative decoding when:

Generating very short responses (<32 tokens)
Using models without draft token support
Experiencing quality degradation
Running on CPU-only systems with limited resources

Advanced Usage

Combining with Other Optimizations

Speculative decoding works well with:

Flash Attention: Reduces memory usage, improves speed
KV Cache Quantization: Frees VRAM for larger contexts
Router Mode: Compare speculative vs non-speculative models (Work In Progress)

Dynamic Adjustment

Adjust speculative decoding settings based on workload:

Interactive chat: Lower draft tokens (2-4) for responsiveness
Batch processing: Higher draft tokens (6-8) for throughput
Creative generation: Moderate draft tokens (4-6) for quality

Custom Draft Models

For advanced users, custom draft models can be trained:

Collect generation data from your domain
Train a small draft model on the data
Use the draft model for speculation
Monitor and adjust based on acceptance rates

Chat Templates

Chat templates define how the model formats conversations for chat completion. llm-manager supports three modes:

Auto (Detect from GGUF)

When set to Auto, the app reads the model’s GGUF architecture metadata and automatically selects the correct llama.cpp built-in chat template. This is the recommended mode for most use cases — it works out of the box with any model.

Built-in Template Names

You can also select specific llama.cpp built-in templates by name. The available templates depend on the model and are auto-detected from the GGUF metadata. These are the same templates llama.cpp uses internally.

Browse Directory

Select Browse directory to pick a custom .jinja chat template file from your filesystem. The app searches for .jinja files recursively in:

<app directory>/locales/chat_templates/ (for serve mode)
~/.config/llm-manager/chat_templates/ (for TUI mode)

You can also configure a custom directory by setting the chat_templates_dir in your config.

None

Select None to disable any chat template. The model will receive raw inputs without any conversation formatting. Useful for non-chat tasks like completion or embedding.

Chat Template Kwargs

Chat template kwargs allow you to inject additional parameters into the chat template. These are passed as a JSON string to llama.cpp’s --chat-template-kwargs flag.

For example, some models support an enable_thinking parameter that controls whether the model outputs its reasoning:

{"enable_thinking": false}

Open the chat template kwargs editor by pressing Alt+C in the LLM Settings panel.

Jinja Template Files

Custom .jinja files use the Jinja2 templating syntax. They are loaded and applied at inference time. Example structure:

<|system|>
{{ system_prompt }}
<|end|>
<|user|>
{{ prompt }}
<|end|>
<|assistant|>

Place custom templates in the chat_templates directory (see Browse Directory above).

Configuration

Chat template settings are stored per-model in the per-model YAML config or in the LLM Settings panel:

Config Key	Type	Description
`jinja`	bool	Enable Jinja chat template (true by default)
`chat_template`	string/null	Custom chat template name or file path
`auto_chat_template`	bool	Auto-detect template from GGUF metadata
`chat_template_kwargs`	string/null	JSON string for chat template parameters

Max Concurrent Predictions

The Max Concurrent Predictions field controls how many inference requests can be processed simultaneously by the llama.cpp server.

Configuration

Setting	Default	Description
Max Concurrent Predictions	None (unlimited)	Maximum number of concurrent requests. `None` means no limit.
Parallel	1	Max concurrent predictions (sequences). Separate from `max_concurrent_predictions` which limits requests in flight.

How It Works

When set to a specific number, the server limits concurrent inference to that many requests. This is useful for:

Preventing VRAM exhaustion from too many simultaneous requests
Controlling resource usage in multi-user environments
Ensuring predictable latency under load

Set None to allow unlimited concurrent predictions — the server handles requests as they arrive.

Config Key

max_concurrent_predictions — integer or null. Also configurable via the parallel field in expert mode.

Web Search

Server-Side Flow

┌──────────┐     /v1/chat/completions      ┌──────────────────┐
│  Client  │ ──────────────────────────────►│ llm-manager API  │
│ (curl,   │ ◄──────────────────────────────│ proxy (port 49222)│
│  UI, etc)│     SSE streaming response     └────────┬─────────┘
└──────────┘                                        │
                                                    │ triggers SearXNG
                                                    ▼
                                           ┌──────────────────┐
                                           │   SearXNG        │
                                           │   instance       │
                                           └──────────────────┘

The web_search_engine_url config points to the SearXNG instance, not the client. Clients never need direct access to SearXNG — they only talk to llm-manager’s API proxy.

Trigger

Web search triggers when your message contains $web:

$web best model for coding 2026
$web compare qwen 3 and llama 4
$web recommend vision model

Configuration

Via Server Settings Panel

Open the Server Settings panel (press F2 or l when focused)
Navigate to the Web Search field using arrow keys
Press ↵ (Enter) to open the Web Search Picker dialog

The dialog (65 columns wide, 15 rows tall) shows:

Field	Type	Description
Enabled	Toggle	Shows “On” (green) or “Off” (gray) — press `↵` to toggle
Engine	Dropdown	Search engine: `searxng`
Engine URL	Text input	URL of your SearXNG instance (e.g., `https://search.example.com`)
API Key	Text input	Bearer token for authentication (optional, masked as `****` when set)

Navigation: ↑/↓ (or j/k) to move between fields, ↵ to toggle/edit, ⎋ (Esc) to close.

Via config.yaml

Add these fields to your ~/.config/llm-manager/config.yaml:

default:
  web_search_enabled: true
  web_search_engine: searxng
  web_search_engine_url: "https://search.example.com"
  web_search_api_key: null  # optional, omit or set to null if not needed

Per-Model Override

Web search settings can also be configured per-model in ~/.config/llm-manager/models/<model_name>.yaml:

web_search_enabled: true
web_search_engine: searxng
web_search_engine_url: "https://search.example.com"
web_search_api_key: null

Model-level settings override the global defaults.

How It Works

When a message contains the $web trigger:

Query extraction — the full user message is used as the search query
SearXNG search — HTTP GET request to {engine_url}/search?q={query}&format=json
Result parsing — expects JSON with a results array; each result needs title, url, and content/snippet fields
Page fetching — Wikipedia results and up to 5 other URLs have their page content fetched in parallel
Context injection — results are prepended to the message as a [WEB CONTEXT]...[END WEB CONTEXT] block

Request Details

Endpoint: {engine_url}/search?q={url_encoded_query}&format=json
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
Authentication: Authorization: Bearer {api_key} header (only if api_key is configured)
Timeout: 15 seconds
Max results: 10
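
The endpoint shape above can be sketched with the standard library alone. This is a minimal illustration, not the code in web_search.rs: the helper names are hypothetical, and trimming a trailing slash from the engine URL is an assumption made here.

```rust
/// Percent-encode a string, leaving only RFC 3986 unreserved characters as-is.
fn percent_encode(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for byte in input.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'_' | b'.' | b'~' => {
                out.push(byte as char)
            }
            // Every other byte (including each byte of a multi-byte UTF-8
            // character) becomes %XX.
            _ => out.push_str(&format!("%{:02X}", byte)),
        }
    }
    out
}

/// Build `{engine_url}/search?q={url_encoded_query}&format=json`.
/// Trimming a trailing slash from `engine_url` is an assumption of this sketch.
fn build_search_url(engine_url: &str, query: &str) -> String {
    format!(
        "{}/search?q={}&format=json",
        engine_url.trim_end_matches('/'),
        percent_encode(query)
    )
}
```

With these helpers, an engine URL of https://search.example.com and the query "compare qwen 3" would give https://search.example.com/search?q=compare%20qwen%203&format=json.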

Injected Prompt Format

The web context is prepended to the user message like this:

[WEB CONTEXT]
INSTRUCTION: Cite sources using inline markdown links in your answer.

## Search Results
1. **Title** - URL
   snippet text

## Web Context
## [Title](URL)
...fetched page content...

[END WEB CONTEXT]

[Original user message]
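
To make the layout above concrete, here is a small sketch that assembles such a block from already-fetched data. The struct and function names are hypothetical, and exact whitespace in the real implementation may differ.

```rust
/// One search hit; field names are illustrative, not llm-manager's own.
struct SearchHit {
    title: String,
    url: String,
    snippet: String,
}

/// One fetched page; field names are illustrative.
struct FetchedPage {
    title: String,
    url: String,
    content: String,
}

/// Prepend a [WEB CONTEXT] block to the user's message.
fn build_web_context(hits: &[SearchHit], pages: &[FetchedPage], user_message: &str) -> String {
    let mut out = String::from("[WEB CONTEXT]\n");
    out.push_str("INSTRUCTION: Cite sources using inline markdown links in your answer.\n\n");

    // Numbered list of search hits, snippet indented under each entry.
    out.push_str("## Search Results\n");
    for (i, hit) in hits.iter().enumerate() {
        out.push_str(&format!("{}. **{}** - {}\n   {}\n", i + 1, hit.title, hit.url, hit.snippet));
    }

    // Full page content, one markdown-linked heading per page.
    out.push_str("\n## Web Context\n");
    for page in pages {
        out.push_str(&format!("## [{}]({})\n{}\n", page.title, page.url, page.content));
    }

    out.push_str("\n[END WEB CONTEXT]\n\n");
    out.push_str(user_message);
    out
}
```

Keeping the original message last means the model reads the sources first and the question immediately before it starts answering.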

Engine Support

Engine	Status	Notes
SearXNG	✅ Fully functional	Requires a configured `engine_url` pointing to a SearXNG instance

SearXNG Setup

SearXNG Settings

Minimal `settings.yml`

SearXNG requires a settings.yml configuration file. Create one before deploying:

use_default_settings: true

server:
  secret_key: "change-this-to-a-random-secret"  # generate with: python3 -c "import secrets; print(secrets.token_hex(32))"
  port: 8081
  bind_address: "0.0.0.0"
  # Base URL — required to avoid 303 redirects
  # Set to the public URL where SearXNG is accessible
  base_url: "http://localhost:8081"  # or "https://search.example.com"

search:
  default_lang: en
  # Enable JSON format for API access (required for llm-manager web search)
  formats:
    - html
    - json

Note: The server.port in settings.yml is for SearXNG’s WSGI metadata. The actual listening port is controlled by the GRANIAN_PORT environment variable (default 8080). You must set -e GRANIAN_PORT=8081 to match your desired port.

Podman (standalone)

Run SearXNG as a standalone Podman container:

# Create config directory and settings file
mkdir -p ~/.searxng
cat > ~/.searxng/settings.yml << 'EOF'
use_default_settings: true

server:
  secret_key: "change-this-to-a-random-secret"
  port: 8081
  bind_address: "0.0.0.0"
  base_url: "http://localhost:8081"  # set to your public URL if behind a reverse proxy

search:
  default_lang: en
  formats:
    - html
    - json
EOF

# Run the container
podman run -d \
  --name searxng \
  -p 8081:8081 \
  -e GRANIAN_PORT=8081 \
  -v ~/.searxng/settings.yml:/etc/searxng/settings.yml:Z \
  --restart unless-stopped \
  searxng/searxng:latest

Important: Do not use -v ~/.searxng:/etc/searxng/lib/searx:Z — it replaces the entire Python package directory with an empty directory, causing the container to crash. Only mount the settings.yml file.

After deployment, use http://localhost:8081 (or your public URL) as the Engine URL in llm-manager.

Docker Compose

For Docker Compose users, create docker-compose.yml:

services:
  searxng:
    image: searxng/searxng:latest
    ports:
      - "8081:8081"
    environment:
      - GRANIAN_PORT=8081
    volumes:
      - ~/.searxng/settings.yml:/etc/searxng/settings.yml:Z
    restart: unless-stopped

Run with:

docker compose up -d

podman-compose

For podman-compose users:

services:
  searxng:
    image: searxng/searxng:latest
    ports:
      - "8081:8081"
    environment:
      - GRANIAN_PORT=8081
    volumes:
      - ~/.searxng/settings.yml:/etc/searxng/settings.yml:Z
    restart: unless-stopped

Run with:

podman-compose up -d

Settings Panel Display

The LLM Settings panel shows the current web search status:

Web Search (Enabled: searxng)

Web Search (Disabled: searxng)

Troubleshooting

303 redirect — set server.base_url in settings.yml to the public URL (e.g., http://localhost:8081 or https://search.example.com)
Search returns no results — verify the Engine URL is accessible and points to a running SearXNG instance
Timeout errors — web search has a 15-second timeout; slow SearXNG instances may need tuning
Authentication failures — if web_search_api_key is set, ensure the SearXNG instance accepts the Bearer token
Results not appearing in chat — check that the $web trigger is present in the message
HTTPS certificate errors — ensure the SearXNG instance has valid TLS certificates if using https://

GNOME Shell Extension

The GNOME Shell Extension provides real-time LLM metrics directly in your GNOME top panel. It connects to a llama.cpp server’s WebSocket endpoint and displays key performance indicators without needing a browser.


Requirements

GNOME Shell 45, 46, 47, 48, 49, or 50
A running llama.cpp server with the /metrics endpoint (enabled via --ws-enable or --metrics-url)
glib-compile-schemas (part of glib-2.0 development packages)

Installation

Build and install the extension:

gnome-extensions pack llm-manager@aginies
gnome-extensions install llm-manager@aginies.zip --force

Enable the extension

gnome-extensions enable llm-manager@aginies

Or enable it via the GNOME Extensions application.

Switch to the extension

After enabling, you may need to reload GNOME Shell:

On X11: Press Alt+F2, type r, press Enter
On Wayland: Log out and log back in

Configuration

Settings

The extension provides a preferences dialog accessible from the GNOME Extensions application or by right-clicking the extension icon in the panel.

Setting	Default	Description
Metrics URL	`http://127.0.0.1:8080/metrics`	HTTP/HTTPS URL of the llama.cpp metrics endpoint. The WebSocket URL is derived from this (path changes to `/ws`, protocol changes to `ws://` or `wss://`).
Update Interval	3 seconds	Delay between WebSocket reconnection attempts when disconnected.
Panel Position	2 (Right)	Where the extension appears in the panel: `0`=left, `1`=center, `2`=right, `3`=far left, `4`=far right.
WebSocket Auth	(empty)	Secret key for WebSocket authentication (passed as subprotocol, not URL query parameter).

Testing the Connection

The preferences dialog includes a Test button that:

Runs curl -s -I -k --max-time 5 <metrics-url> against the configured URL
Checks for HTTP response headers and LLM server indicators (llamacpp: or # HELP)
Shows success (green) or failure (red) status

Metrics Selection

Toggle individual metrics on or off via checkboxes in the preferences dialog. All 12 metrics are selected by default. The selected metrics appear both in the top panel (as a compact string) and in the dropdown menu as clickable items.

Metrics Reference

The extension monitors 12 metrics from the llama.cpp WebSocket feed. Each metric can be individually toggled in the preferences dialog.

Metric Key	Label	Type	Description
`model_name`	Model	text	Current model filename (path stripped to basename)
`tps`	TPS	number	Tokens per second for generation (t/s)
`prompt_tps`	Prompt TPS	number	Tokens per second for prompt processing (t/s)
`gen_tps`	Gen TPS	number	Generation tokens per second
`ctx`	Ctx	ratio	Context window usage (tokens), displayed with K-suffix (e.g., “2K / 8K”)
`vram`	VRAM	ratio_gb	GPU memory usage in GB (e.g., “8.0 GB / 24.0 GB”)
`ram`	RAM	gb	System memory usage in GB
`cpu`	CPU	percent	CPU usage percentage
`decoded_tokens`	Decoded	number	Total decoded tokens generated
`prompt_tokens`	Prompt Eval	number	Prompt evaluation token count
`prompt_progress`	Prompt Progress	ratio_pct	Prompt processing progress (0–100%)

Context Token Formatting

Context usage is displayed with K-suffix formatting for large values. When the token count exceeds 1024, it is shown as kilotokens (e.g., “2K / 8K” instead of “2048 / 8192”). Smaller values display as raw numbers.
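
The rule can be sketched as follows. The extension itself runs inside GNOME Shell, so this Rust version is only an illustration with hypothetical function names; truncating integer division is an assumption, and the real code may round differently.

```rust
/// Format a token count, switching to a K suffix above 1024 tokens.
/// Assumption: the K value is truncated (integer division by 1024).
fn format_tokens(tokens: u64) -> String {
    if tokens > 1024 {
        format!("{}K", tokens / 1024)
    } else {
        tokens.to_string()
    }
}

/// Format context usage as "used / total".
fn format_ctx(used: u64, total: u64) -> String {
    format!("{} / {}", format_tokens(used), format_tokens(total))
}
```

Here format_ctx(2048, 8192) gives "2K / 8K", while a count of exactly 1024 is still printed as a raw number because the rule only applies above 1024.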

Color Coding

Metric values and progress bars use color coding to indicate load levels:

Color	Threshold	Usage
Green (`#9ece6a`)	< 50%	Normal operation
Yellow (`#e0af68`)	50% - 80%	Elevated usage
Red (`#f7768e`)	> 80%	High usage, approaching limits

Progress bars for VRAM and context also use these colors. The VRAM progress bar on the panel icon uses the same thresholds.
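
A minimal sketch of the threshold logic in the table, written in Rust for illustration only (the extension is not written in Rust, and the function names are hypothetical). It assumes that exactly 50% and exactly 80% both fall in the yellow band, as the table's ranges suggest.

```rust
/// Usage as a percentage of the total; 0 when the total is unknown or zero.
fn usage_pct(used: f64, total: f64) -> f64 {
    if total <= 0.0 {
        0.0
    } else {
        used / total * 100.0
    }
}

/// Map a usage percentage to the hex color from the table above.
fn load_color(pct: f64) -> &'static str {
    if pct < 50.0 {
        "#9ece6a" // green: normal operation
    } else if pct <= 80.0 {
        "#e0af68" // yellow: elevated usage
    } else {
        "#f7768e" // red: approaching limits
    }
}
```

For example, 8.0 GB used out of 24.0 GB is about 33%, so the VRAM bar would be drawn in green.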

WebSocket Authentication

When a WebSocket Auth secret is configured, the extension passes it as a WebSocket subprotocol (Sec-WebSocket-Protocol header) during the handshake. The auth key is NOT appended as a URL query parameter.

Example:

Metrics URL: http://127.0.0.1:8080/metrics
WebSocket URL: ws://127.0.0.1:8080/ws
Subprotocol: mysecretkey

The auth key is configured in the preferences dialog under “Metrics Secret”.
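
The derivation of the WebSocket URL from the Metrics URL can be sketched like this. It is an illustrative Rust rendering with a hypothetical function name, not the extension's own code: the scheme is swapped, the host and port are kept, and the path is replaced with /ws.

```rust
/// Derive the WebSocket URL from an HTTP(S) metrics URL.
/// Returns None for unsupported schemes or a missing host.
fn derive_ws_url(metrics_url: &str) -> Option<String> {
    let (ws_scheme, rest) = if let Some(rest) = metrics_url.strip_prefix("https://") {
        ("wss://", rest)
    } else if let Some(rest) = metrics_url.strip_prefix("http://") {
        ("ws://", rest)
    } else {
        return None;
    };

    // Keep only host[:port]; the old path, query, and fragment are dropped.
    let authority = rest
        .split(|c| c == '/' || c == '?' || c == '#')
        .next()
        .unwrap_or(rest);
    if authority.is_empty() {
        return None;
    }
    Some(format!("{}{}/ws", ws_scheme, authority))
}
```

The auth secret is deliberately not part of the returned URL, since it travels in the Sec-WebSocket-Protocol header instead.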

Panel Display

The extension displays a compact string in the top panel showing each selected metric’s label and value, separated by spaces. When no metrics are selected or all selected metrics show no data, the panel icon remains visible with the label hidden.

The dropdown menu shows all metrics as interactive checkboxes. Clicking a metric toggles its selection state. Metrics with progress bar types (ctx, vram) display both a value and a colored progress bar in the menu.

Recent Changes

Prompt Metrics — Added prompt_tokens (Prompt Eval) and prompt_progress (Prompt Progress) metrics to the Performance group. Prompt Progress uses a new ratio_pct type for progress bars with 0–100% values.
Icon Size — Panel icon increased from 16px to 24px for better visibility
Debug Log Removed — Debug log panel removed from preferences dialog to simplify the settings interface

Architecture

LLM Manager is a Rust application built on ratatui and crossterm, using tokio for async operations. The codebase is organized into several modules:

src/
├── main.rs              # Entry point, CLI parsing, event loop, model discovery
├── lib.rs               # Library root
├── config.rs            # Config loading/saving, YAML-based, profiles, presets, RPC workers
├── models.rs            # Domain types (SearchResult, DownloadState, ModelSettings, ServerMetrics, etc.)
├── serve.rs             # Standalone serve mode CLI (--model, --profile, --api-port, --api-key, --ws-enable)
├── serve_api.rs         # Axum-based API proxy server for serve mode
├── config/
│   ├── store.rs         # Generic named-item store
│   ├── profiles.rs      # ProfileStore
│   ├── presets.rs       # PresetStore
│   └── model_config.rs  # ModelConfigStore
├── backend/
│   ├── mod.rs           # Module root, USER_AGENT constant
│   ├── benchmark.rs     # Benchmark tuning engine (RuntimeOnly and Full modes)
│   ├── benchmark_report.html  # HTML report template for benchmarks
│   ├── hardware.rs      # GPU detection (AMD/NVIDIA/Intel), CPU core counting
│   ├── hub.rs           # HuggingFace API: search, list files, download
│   ├── server.rs        # llama.cpp server spawning, command building, metrics parsing
│   ├── tls.rs           # TLS certificate generation (self-signed CA), load_tls_config, ensure_tls_certs
│   ├── web_context.rs   # Web context helpers
│   ├── web_search.rs    # Web search (SearXNG) integration
│   └── ws_server.rs     # WebSocket metrics dashboard server
├── tui/
│   ├── mod.rs           # Module root
│   ├── app.rs           # App struct, main entry
│   ├── colors.rs        # Color constants (YELLOW, GREEN, RED, WHITE, DARK_GRAY, CYAN, etc.)
│   ├── settings.rs      # SettingField definitions, filtered_fields for expert mode
│   ├── i18n.rs          # Translation system (t! macro, language switching, locale loading)
│   ├── gguf_naming.rs   # GGUF filename explanation parser
│   ├── app/
│   │   ├── types.rs     # GlobalMode, ModelsMode, ActivePanel enum definitions
│   │   ├── types/sub.rs # Sub-structs: ServerState, DownloadState, SettingsState, etc.
│   │   ├── state/       # State module (parsing patterns, state impls)
│   │   ├── async_ops.rs # Async operations (server spawning, metrics polling, downloads)
│   │   ├── sync_ops.rs  # Sync operations (model discovery, settings sync)
│   │   ├── panels.rs    # Panel layout calculations
│   │   ├── pickers.rs   # Picker helpers
│   │   ├── profiles.rs  # Profile management
│   │   ├── help.rs      # Help text definitions
│   │   ├── metadata.rs  # Metadata handling
│   │   └── pending_events.rs  # PendingEvent enum + scheduler
│   ├── event/
│   │   ├── mod.rs       # Event module root
│   │   ├── key.rs       # Keyboard event handling (global shortcuts, panel handlers)
│   │   ├── mouse.rs     # Mouse event handling
│   │   ├── helpers.rs   # Shared helpers: TextEditor, picker_nav_*
│   │   ├── readme.rs    # README fetching
│   │   ├── rpc_workers.rs  # RPC worker key handling
│   │   ├── panel/       # Per-panel key handlers
│   │   │   ├── models.rs      # Models panel
│   │   │   ├── downloads.rs   # Downloads panel
│   │   │   ├── log.rs         # Log panel
│   │   │   ├── settings.rs    # Settings panel
│   │   │   ├── profiles.rs    # Profiles panel
│   │   │   ├── system_prompts.rs
│   │   │   ├── tags.rs        # Tags modal
│   │   │   └── mod.rs
│   │   └── overlay/       # Overlay handlers (21 handlers)
│   │       ├── mod.rs          # OverlayRegistry, OverlayHandler trait
│   │       ├── about.rs
│   │       ├── api_endpoint_picker.rs
│   │       ├── backend_picker.rs
│   │       ├── bench_tune_setup.rs
│   │       ├── chat_template_file_picker.rs
│   │       ├── chat_template_picker.rs
│   │       ├── cmd_line.rs
│   │       ├── confirmation.rs
│   │       ├── dashboard_picker.rs
│   │       ├── dashboard_url.rs
│   │       ├── directory_picker.rs
│   │       ├── gguf_naming.rs
│   │       ├── host_picker.rs
│   │       ├── max_concurrent_picker.rs
│   │       ├── onboarding.rs
│   │       ├── profile_picker.rs
│   │       ├── prompt_picker.rs
│   │       ├── rpc_manager.rs
│   │       ├── search_input.rs
│   │       ├── spec_type_picker.rs
│   │       ├── web_search_picker.rs
│   │       └── yarn_rope_settings.rs
│   ├── panel/           # Panel rendering
│   │   ├── mod.rs
│   │   ├── about.rs        # About panel
│   │   ├── active.rs       # Active model metrics panel
│   │   ├── help.rs         # Help panel
│   │   ├── info.rs         # Info line rendering
│   │   ├── log.rs          # Log panel
│   │   ├── models.rs       # Models panel (search, list, files)
│   │   ├── profiles.rs     # Profiles panel
│   │   ├── readme.rs       # README panel
│   │   ├── rpc_workers.rs  # RPC Workers panel
│   │   ├── settings.rs     # LLM Settings panel
│   │   ├── system_prompt_presets.rs
│   │   └── tabbed.rs       # Tabbed settings rendering
│   ├── render/          # Rendering
│   │   ├── mod.rs
│   │   ├── render.rs    # Main render function (layout, panel visibility)
│   │   ├── overlays.rs  # Overlay rendering (20+ overlay renderers)
│   │   ├── status.rs    # Status bar rendering
│   │   ├── hints.rs     # Bottom hints rendering
│   │   └── onboarding.rs  # Onboarding wizard rendering
│   └── render.rs

App State Machine

The App struct in src/tui/app.rs holds all application state. The main state machine is controlled by models_mode:

pub enum ModelsMode {
    List { sort_by: ListSort },
    Search { query: String, results: Vec<SearchResult>, sort_by: SearchSort, show_readme: bool, page: usize, loading: bool, has_more: bool },
    Files { model_id: String, files: Vec<(String, u64, String)>, selected_idx: Option<usize>, previous_query: String, previous_results: Vec<SearchResult>, selected_result: Option<SearchResult> },
    BenchTune,
}

Each mode controls rendering in render.rs and key handling in event/key.rs. The ActivePanel enum controls focus:

pub enum ActivePanel {
    #[default] Models,
    Log,
    ServerSettings,
    LlmSettings,
    Profiles,
    SystemPromptPresets,
    SearchReadme,
    ActiveModel,
    ModelInfo,
    Downloads,
}

The GlobalMode enum handles overlays that appear above all panels (21 variants):

pub enum GlobalMode {
    Normal,
    CmdLine { cmd_line: String },
    HostPicker { entries: Vec<(String, String)>, selected: usize },
    BackendPicker { entries: Vec<(Backend, Option<String>)>, selected: usize },
    Confirmation { selected: bool, kind: ConfirmationKind, display_name: String, detail: Option<String> },
    RpcManager,
    About,
    MaxConcurrentPicker { value: String },
    SpecTypePicker { entries: Vec<String>, selected: usize },
    YarnRoPESettings { scale: String, freq_base: String, freq_scale: String, selected_field: i32, editing: bool, edit_buffer: String, edit_cursor_pos: usize },
    BenchTuneSetup { config: BenchTuneConfig, selected_idx: usize, editing_param: bool, editing_param_field: i32, param_edit_buffer: String, param_edit_cursor_pos: usize, bench_mode_selection: usize, editing_prompt: bool, editing_kwargs: bool },
    PromptPicker { entries: Vec<(String, String)>, selected: usize, editing: bool, edit_buffer: String, edit_cursor_pos: usize, confirm_delete: bool },
    ProfilePicker { entries: Vec<(String, String)>, selected: usize, profiles: Vec<Profile> },
    DashboardPicker { enabled: bool, port: String, auth_key: String, tls_enabled: bool, tls_cert: String, tls_key: String, selected_field: i32, editing: bool, edit_buffer: String, edit_cursor_pos: usize },
    ApiEndpointPicker { enabled: bool, port: String, api_key: String, tls_enabled: bool, tls_cert: String, tls_key: String, selected_field: i32, editing: bool, edit_buffer: String, edit_cursor_pos: usize },
    DashboardUrl { host: String, ws_port: String, api_port: u16, llm_port: u16, auth_key: String, ws_enabled: bool, tls_enabled: bool },
    SearchInput { buffer: String, cursor_pos: usize },
    GgufNaming { explanation: GgufExplanation, filename: String },
    Onboarding { step: usize },
    ChatTemplatePicker { entries: Vec<String>, selected: usize },
    ChatTemplateFilePicker { entries: Vec<(String, String)>, selected: usize },
    WebSearchPicker { enabled: bool, engine: String, engine_url: String, api_key: Option<String>, selected_field: i32, engine_picker_selected: usize, editing: bool, edit_buffer: String, edit_cursor_pos: usize, check_status: Option<WebSearchCheckStatus> },
}

Server State

The ServerState struct in src/tui/app/types/sub.rs tracks server runtime state:

pub struct ServerState {
    pub server_handle: Option<ServerHandle>,
    pub metrics_task_handle: Option<JoinHandle<()>>,
    pub sync_task_handle: Option<JoinHandle<()>>,
    pub spawn_task_handle: Option<SpawnTaskHandle>,
    pub bench_tune_task_handle: Option<BenchTuneTaskHandle>,
    pub server_log_rx: Option<mpsc::Receiver<String>>,
    pub metrics_rx: Option<mpsc::Receiver<ServerMetrics>>,
    pub sync_rx: Option<SyncRx>,
    pub spawn_log_tx: Option<mpsc::Sender<String>>,
    pub metrics_model_name: Arc<Mutex<Option<String>>>,
    pub loaded_model_names: Arc<Mutex<Vec<String>>>,
    pub api_proxy_handle: Option<JoinHandle<()>>,
    pub metrics_tx: Option<broadcast::Sender<WsMetrics>>,
    pub running_ws_port: Option<u16>,
    pub running_ws_auth: Option<String>,
    pub running_server_tls: Option<bool>,
    pub running_api_port: Option<u16>,
    pub running_api_server_port: Option<u16>,
    pub running_api_model: Option<String>,
    pub running_server_tls_cfg: Option<RustlsConfig>,
    pub running_server_tls_cert_path: Option<String>,
    pub running_server_tls_key_path: Option<String>,
    pub cmd_display: Option<String>,
    pub spawned_settings: Option<ModelSettings>,
    pub spawned_model_name: Option<String>,
    pub spawned_model_state: Option<String>,
    pub spawned_context_length: u32,
    pub server_exit_rx: Option<mpsc::Receiver<()>>,
    pub server_exit_tx: Option<mpsc::Sender<()>>,
    pub api_shutdown_tx: Option<watch::Sender<bool>>,
    pub last_server_logs_tick: Option<Instant>,
    pub last_sync_tick: Option<Instant>,
}

Press Ctrl+U in any panel to open the Dashboard URL modal, which displays all server URLs and copies them to the clipboard on Enter.

The modal shows:

Host address
Server configuration (backend, threads, mode)
API Endpoint status with port
RPC Workers count
Dashboard status with port
API URL: http(s)://host:api_port
Metrics URL: http://host:llm_port/metrics
Dashboard URL: http(s)://host:ws_port/dashboard (auth passed as WebSocket subprotocol)
opencode baseURL: http(s)://host:api_port/v1
TLS status indicator (GREEN for On, GRAY for Off)

The modal is 72 columns wide and 20 rows tall, rendered as a centered overlay with yellow-bordered block.

TLS / HTTPS

TLS is managed in src/backend/tls.rs (232 lines):

load_tls_config(cert_path, key_path) — loads Rustls config from PEM files
generate_ca() — generates self-signed CA (cert + key)
generate_server_cert(ca_cert, ca_key) — signs server cert with CA
ensure_tls_certs() — auto-generates certs if missing, stores in ~/.config/llm-manager/tls/
validate_tls_path(path) — validates cert/key file paths

Auto-generated certificates are stored in ~/.config/llm-manager/tls/:

~/.config/llm-manager/tls/
├── ca.pem              # CA certificate
├── ca-key.pem          # CA private key
├── server.pem          # Server certificate
└── server-key.pem      # Server private key

Version tracking (TLS_VERSION = "1") triggers certificate regeneration when the version is bumped. A CA expiry warning is shown if the CA certificate expires within 6 months.

TLS is used by:

WebSocket dashboard server (ws_server.rs)
API proxy server (serve_api.rs)
Dashboard picker (GlobalMode::DashboardPicker)
API endpoint picker (GlobalMode::ApiEndpointPicker)

RPC Workers

Remote workers for distributed inference are stored in config as Vec<RpcWorker>. Each worker has:

name: Human-readable identifier
ip: Network address
port: RPC port (default: 50052)
selected: Whether to use this worker

The RpcManager global mode provides a dedicated window for managing workers:

n — add new worker
e — edit selected worker
d — delete selected worker
Space — toggle worker selection

Workers are combined into the --rpc flag when starting the server. Configuration is stored in ~/.config/llm-manager/config.yaml under rpc_workers.
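
As a sketch of that last step, the selected workers can be joined into the comma-separated server list that llama.cpp's --rpc flag accepts. The struct mirrors the four fields listed above, but its exact types and the function name are assumptions made for illustration.

```rust
/// A remote RPC worker; field types are assumed for this sketch.
struct RpcWorker {
    name: String,
    ip: String,
    port: u16,
    selected: bool,
}

/// Build the `--rpc host:port,host:port` arguments from the selected workers.
/// Returns None when no worker is selected, so the flag is omitted entirely.
fn rpc_args(workers: &[RpcWorker]) -> Option<Vec<String>> {
    let endpoints: Vec<String> = workers
        .iter()
        .filter(|w| w.selected)
        .map(|w| format!("{}:{}", w.ip, w.port))
        .collect();

    if endpoints.is_empty() {
        None
    } else {
        Some(vec!["--rpc".to_string(), endpoints.join(",")])
    }
}
```

Returning the flag and its value as two separate arguments keeps them easy to append to a process command without shell quoting.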

Benchmark Tuning

The benchmark system (src/backend/benchmark.rs) supports two modes:

RuntimeOnly: Single server, params sent in request body (no server restarts). Best for sampling parameters.
Full: New server spawned for each parameter combination. Tests all parameters including server-level settings.

Key types:

BenchTuneConfig: Model path, iterations, prompt, params to test, duration, mode
BenchTuneParam: name, min, max, step, enabled
BenchTuneResult: params, metrics (prompt_tps, generation_tps, combined_tps, latency_per_token, first_token_time), outputs, per-iteration metrics
BenchTuneStatus: Running (with progress), Completed (with stats), PartiallyCompleted (with stats), Cancelled (with stats)

Results can be exported as Markdown table, JSON, YAML, or HTML report (with Chart.js charts).
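
To illustrate how a BenchTuneParam's min, max, and step could expand into the values a sweep would test, here is a minimal sketch. The f64 field types and the sweep_values helper are assumptions; the real types in benchmark.rs may differ.

```rust
/// A tunable parameter; the numeric field types are assumed for this sketch.
struct BenchTuneParam {
    name: String,
    min: f64,
    max: f64,
    step: f64,
    enabled: bool,
}

/// Expand a parameter into the list of values from min to max (inclusive).
/// Disabled or malformed parameters yield an empty list.
fn sweep_values(param: &BenchTuneParam) -> Vec<f64> {
    if !param.enabled || param.step <= 0.0 || param.max < param.min {
        return Vec::new();
    }
    // Count the steps up front and multiply, rather than adding `step`
    // repeatedly, so floating-point error does not accumulate.
    let steps = ((param.max - param.min) / param.step + 1e-9).floor() as u64;
    (0..=steps).map(|i| param.min + i as f64 * param.step).collect()
}
```

In Full mode each combination of such values would mean one server restart, which is why narrowing min, max, and step matters more there than in RuntimeOnly mode.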

WebSocket Dashboard

The WebSocket Dashboard (src/backend/ws_server.rs) provides real-time metrics visualization:

Built with axum and tokio
Creates broadcast::channel(64) for metrics distribution
Routes: /dashboard (serves embedded HTML), /ws (WebSocket for metrics), /health
Auth: WebSocket subprotocol (Sec-WebSocket-Protocol header)
TLS: supports both plain TCP and rustls TLS
Connection indicator: green pulsing dot (connected), red dot (disconnected, auto-reconnects every 2s)

The HTML dashboard is embedded in the binary via include_str! and receives the auth key via a <meta name="ws-auth" content="..."> tag injected into the <body>. The value is HTML-escaped for safe attribute placement. The dashboard JavaScript reads the meta tag with JSON.parse(metaEl.content) and falls back to ?auth= URL parameter.
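
The escaping step can be sketched as below. This is a generic attribute-escaping routine with hypothetical function names, not the code from ws_server.rs, and it leaves out the JSON encoding of the value that the JSON.parse call implies.

```rust
/// Escape a string so it can be placed inside a double-quoted HTML attribute.
fn escape_html_attr(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    for ch in value.chars() {
        match ch {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            '\'' => out.push_str("&#39;"),
            _ => out.push(ch),
        }
    }
    out
}

/// Render the meta tag that carries the auth key to the dashboard script.
fn ws_auth_meta_tag(auth_key: &str) -> String {
    format!("<meta name=\"ws-auth\" content=\"{}\">", escape_html_attr(auth_key))
}
```

Escaping the double quote is the important part here: without it, a key containing a quote character could close the attribute early and inject markup into the page.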

Web Search

Web search (src/backend/web_search.rs) integrates with SearXNG for research queries:

Trigger: $web prefix in chat message
Server-side flow: intercepts /v1/chat/completions, checks for the $web trigger, performs SearXNG search, injects results into prompt
Configuration: web_search_enabled, web_search_engine, web_search_engine_url, web_search_api_key
Supports custom SearXNG instances with Docker/Podman deployment
Injected context format: [WEB CONTEXT]...[END WEB CONTEXT] block prepended to user message

Configuration

Config is YAML-based in ~/.config/llm-manager/:

Config struct fields:

pub rpc_workers: Vec<RpcWorker>          // RPC workers for distributed inference
pub search_limit: u32                    // HuggingFace search results per query (default 50)
pub active_panel: ActivePanel            // Last focused panel
pub left_pct: u16                        // Left panel width % (default 55)
pub language: String                     // UI language (en/fr/it/de, default "en")
pub onboarding_complete: bool            // Onboarding wizard done flag

DefaultParams fields:

pub ws_server_enabled: bool              // WebSocket dashboard enabled (default false)
pub ws_server_port: u16                  // WebSocket dashboard port (default 49223)
pub server_tls_enabled: bool             // TLS for server (default true)
pub server_tls_cert: Option<String>      // TLS certificate path
pub server_tls_key: Option<String>       // TLS key path
pub api_endpoint_enabled: bool           // API endpoint enabled (default false)
pub api_endpoint_port: u16               // API endpoint port (default 49222)
pub api_endpoint_key: Option<String>     // API bearer token
pub web_search_engine: String            // Web search engine (default "searxng")
pub web_search_engine_url: String        // Web search engine URL
pub web_search_enabled: bool             // Web search enabled (default false)
pub web_search_api_key: Option<String>   // Web search API key

ModelOverride new fields:

pub chat_template: Option<String>
pub chat_template_kwargs: Option<String>
pub auto_chat_template: bool
pub expert_count: i32
pub gpu_layers_mode: GpuLayersMode
pub tags: Option<Vec<String>>

Local Model Filter

The application supports real-time filtering of the local models list. Triggered by the f key when the Models panel is focused, it allows users to quickly narrow down large collections using case-insensitive substring matching.

Model Discovery

The discover_models() function in src/tui/app/sync_ops.rs recursively scans the models directory for .gguf files:

fn discover_models(dir: &Path) -> Vec<DiscoveredModel>

Each DiscoveredModel contains the file path, name, file_size, and display name (relative path from models directory). Discovery runs in a blocking task on startup.
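
A recursive scan of this kind can be sketched with the standard library. The struct fields follow the description above, but their types, the scan helper, the case-insensitive extension check, and the final sort are all assumptions of this sketch rather than the code in sync_ops.rs.

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// A model file found on disk; field types are assumed for this sketch.
struct DiscoveredModel {
    path: PathBuf,
    name: String,
    file_size: u64,
    display_name: String,
}

/// Recursively collect every .gguf file below `dir`.
fn discover_models(dir: &Path) -> Vec<DiscoveredModel> {
    let mut found = Vec::new();
    scan(dir, dir, &mut found);
    // Sorting is an assumption, added so the result order is deterministic.
    found.sort_by(|a, b| a.display_name.cmp(&b.display_name));
    found
}

/// Hypothetical helper: walk `dir`, recording paths relative to `root`.
/// Note: symlinked directories are followed, so a symlink loop would recurse.
fn scan(root: &Path, dir: &Path, found: &mut Vec<DiscoveredModel>) {
    let entries = match fs::read_dir(dir) {
        Ok(entries) => entries,
        Err(_) => return, // unreadable directories are skipped silently
    };
    for entry in entries.flatten() {
        let path = entry.path();
        if path.is_dir() {
            scan(root, &path, found);
        } else if path
            .extension()
            .map_or(false, |ext| ext.eq_ignore_ascii_case("gguf"))
        {
            let file_size = entry.metadata().map(|m| m.len()).unwrap_or(0);
            let name = path
                .file_name()
                .map(|n| n.to_string_lossy().into_owned())
                .unwrap_or_default();
            let display_name = path
                .strip_prefix(root)
                .unwrap_or(&path)
                .to_string_lossy()
                .into_owned();
            found.push(DiscoveredModel { path, name, file_size, display_name });
        }
    }
}
```

Because this walks the filesystem synchronously, running it inside a blocking task, as the text describes, keeps the UI event loop responsive on large model directories.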

Download System

Downloads run in a spawned tokio task with progress flowing through a broadcast channel:

User selects a file and presses Enter
pending_download is set with (model_id, filename, url, file_size)
Before starting, the app checks available disk space via hub::get_free_space_bytes() and warns if insufficient
A tokio task calls hub::download_file() with an Arc<AtomicBool> cancel token and Arc<AtomicU8> state
Progress updates flow through download_tx → download_rx
The main loop polls download_rx each iteration and updates the Download panel
Pressing ⌥C (Alt+C) cancels the download and removes the temporary file; p pauses/resumes it

The download loop checks the state atomically each iteration: 1 = downloading, 2 = paused (sleeps 100ms and retries), 3 = cancelled (removes temp file, returns error). Each DownloadState tracks bytes downloaded, speed, ETA, destination path, and status (Downloading/Paused/Complete/Cancelled/Error).
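
The state check can be sketched as a small synchronous loop. The real download runs in a tokio task and writes real bytes; here the constants, the Step enum, and the chunk counter are hypothetical stand-ins that only show how the three state values drive the loop.

```rust
use std::sync::atomic::{AtomicU8, Ordering};
use std::time::Duration;

const STATE_DOWNLOADING: u8 = 1;
const STATE_PAUSED: u8 = 2;
const STATE_CANCELLED: u8 = 3;

/// What the loop should do next, based on the shared state.
#[derive(Debug, PartialEq)]
enum Step {
    WriteChunk,
    Sleep(Duration),
    Abort,
}

/// Read the shared state once and decide the next action.
fn next_step(state: &AtomicU8) -> Step {
    match state.load(Ordering::SeqCst) {
        STATE_PAUSED => Step::Sleep(Duration::from_millis(100)),
        STATE_CANCELLED => Step::Abort,
        _ => Step::WriteChunk, // STATE_DOWNLOADING, or any unknown value
    }
}

/// Simulated download: "writes" chunks until done or cancelled.
fn run_download(state: &AtomicU8, total_chunks: usize) -> Result<usize, String> {
    let mut written = 0;
    while written < total_chunks {
        match next_step(state) {
            // A real loop would append the next chunk to the temp file here.
            Step::WriteChunk => written += 1,
            // Paused: wait briefly, then check the state again.
            Step::Sleep(delay) => std::thread::sleep(delay),
            // Cancelled: a real loop would also remove the temp file.
            Step::Abort => return Err("download cancelled".to_string()),
        }
    }
    Ok(written)
}
```

Sharing a single atomic byte lets the UI thread pause or cancel without locking, at the cost of up to one sleep interval of delay before the download reacts.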

Server Spawning

When a model is loaded, spawn_server() in backend/server.rs:

Resolves the llama-server binary using resolve_backend_binary()
If the binary doesn’t exist, downloads and extracts it from GitHub releases
Spawns the process with the model path and all settings
Sets up a log channel (server_log_rx) for parsing output

The main loop polls server_log_rx and parses log messages for:

Loading phases (model, metadata, tensors) from log messages
Error detection (OOM, crash) from log messages

Metrics (TPS, VRAM, context) are collected exclusively from the /metrics and /health API endpoints, not from log parsing.

Metrics & Logging

Metrics are polled from the /metrics and /health endpoints rather than parsed from logs. Loading completion is detected by polling /health until it reports "status": "ok" and a non-empty slots list.

Each log entry is stored in log_entries: VecDeque<LogEntry> with a max of 500 entries. The log panel supports scrolling, expansion (Enter/Esc), and two modes: Following (auto-scroll to bottom) and Manual (free scroll). Press f to toggle modes.
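A bounded log buffer of this kind can be sketched as below. The sketch assumes the oldest entry is dropped when the cap is reached, which the text does not state, and it stores plain strings instead of the real LogEntry type; `push_log` is a hypothetical helper name.

```rust
use std::collections::VecDeque;

/// Cap taken from the text: at most 500 entries are kept.
const MAX_LOG_ENTRIES: usize = 500;

/// Append a message, evicting the oldest entry first when the buffer is full.
/// (Assumption: eviction is oldest-first.)
fn push_log(entries: &mut VecDeque<String>, message: String) {
    if entries.len() >= MAX_LOG_ENTRIES {
        entries.pop_front();
    }
    entries.push_back(message);
}
```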

Search

Search uses the HuggingFace API with &filter=gguf to only return GGUF models:

pub async fn search_models(query: &str, limit: u32, offset: u32) -> Result<(Vec<SearchResult>, usize, Vec<String>)> // third element: raw model IDs for post-filtering

A post-filter checks that the model_id contains the search query (case-insensitive), since the HF API does full-text search across descriptions/tags and can return unrelated models.

Multi-word search: Space-separated words are split and each word must match the model name (AND logic). Matching words are highlighted in cyan in the results list.
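The post-filter and the AND logic can be sketched in a few lines. `matches_all_words` is a hypothetical name, and the sketch assumes each word is matched as a plain case-insensitive substring of the model ID.

```rust
/// True when every whitespace-separated word of `query` occurs in `model_id`,
/// ignoring case. An empty query matches everything.
fn matches_all_words(model_id: &str, query: &str) -> bool {
    let haystack = model_id.to_lowercase();
    query
        .split_whitespace()
        .all(|word| haystack.contains(&word.to_lowercase()))
}
```

With this rule, a query such as "qwen 8b" keeps unsloth/Qwen3-8B-GGUF but drops any model whose ID lacks either word.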

Default: 50 results per page (max 200)
Pagination: Ctrl+B goes back, Down at bottom loads more
Sort order cycles: Relevance → Downloads → Likes → Trending → Created
README fetching: → (Right arrow) downloads and renders the model’s README

VRAM Estimation

The estimate_vram_mib() function in src/models.rs estimates VRAM usage:

total = model_vram + kv_cache + activation + fixed_overhead + 550

All terms are in MiB, and the trailing 550 is a constant added to every estimate. The other terms are:

model_vram — proportional to GPU layers loaded, with MoE expert ratio applied to FFN portion (~60%) for mixture-of-experts models
kv_cache — 2 * n_layer * n_ctx * n_embd_kv * sizeof(type) with GQA ratio, FlashAttention factor, and effective context (context_length × rope_scale)
activation — proportional to batch size and hidden size
fixed_overhead — 3.8% of max VRAM (or 500 MiB if unknown)
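As a worked example of the kv_cache term, the sketch below evaluates 2 * n_layer * n_ctx * n_embd_kv * sizeof(type). It assumes n_embd_kv is derived from the GQA ratio as hidden_size * n_kv_head / n_head, and it leaves out the FlashAttention factor and the rope_scale adjustment mentioned above. The function and parameter names are illustrative, not taken from estimate_vram_mib().

```rust
/// KV cache size in MiB: 2 (keys and values) * n_layer * n_ctx * n_embd_kv
/// * bytes per element. `n_embd_kv` shrinks with the GQA ratio.
/// (Assumption: FlashAttention and rope_scale factors are ignored here.)
fn kv_cache_mib(
    n_layer: u64,
    n_ctx: u64,
    hidden_size: u64,
    n_head: u64,
    n_kv_head: u64,
    bytes_per_elem: u64,
) -> u64 {
    let n_embd_kv = hidden_size * n_kv_head / n_head;
    let bytes = 2 * n_layer * n_ctx * n_embd_kv * bytes_per_elem;
    bytes / (1024 * 1024)
}
```

For a 32-layer model with hidden size 4096, 32 attention heads, 8 KV heads, an 8192-token context, and an F16 cache (2 bytes per element), this gives 1024 MiB. The same model without GQA (32 KV heads) would need 4096 MiB.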

Loading Progress

Model loading phases are detected from llama.cpp log output and /health API polling:

Phase	Detection	Weight
ServerStarting	(implicit)	8%
LoadingModel	“LLAMA_MODEL_LOADER” / “LOADING MODEL”	7%
LoadingMeta	“LOADED META” / “META DATA”	7%
LoadingTensors	“LOAD_TENSORS:”	70%
ServerListening	“SERVER LISTENING”	8%
Complete	Detected via `/health` API polling	—

During tensor loading, the progress bar refines using layer counts parsed from “offloaded X/Y layers” log messages.
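One way to turn the weights in the table into a single percentage is sketched below. It assumes that entering a phase credits the weights of all earlier phases, and that the tensor phase is refined linearly by the offloaded layer count. The accumulation rule and the `progress_percent` name are assumptions; only the phase names and weights come from the table.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum LoadingPhase {
    ServerStarting,
    LoadingModel,
    LoadingMeta,
    LoadingTensors,
    ServerListening,
    Complete,
}

/// Phase weights from the table, in order; they sum to 100.
const WEIGHTS: [(LoadingPhase, u32); 5] = [
    (LoadingPhase::ServerStarting, 8),
    (LoadingPhase::LoadingModel, 7),
    (LoadingPhase::LoadingMeta, 7),
    (LoadingPhase::LoadingTensors, 70),
    (LoadingPhase::ServerListening, 8),
];

/// Percentage reached once `phase` has started. During tensor loading the
/// "offloaded X/Y layers" counts fill in the 70% slice proportionally.
fn progress_percent(phase: LoadingPhase, layers_done: u32, layers_total: u32) -> u32 {
    if phase == LoadingPhase::Complete {
        return 100;
    }
    let mut pct = 0;
    for (p, weight) in WEIGHTS {
        if p == phase {
            if p == LoadingPhase::LoadingTensors && layers_total > 0 {
                pct += weight * layers_done.min(layers_total) / layers_total;
            }
            break;
        }
        pct += weight; // earlier phases are fully credited
    }
    pct
}
```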

Error Handling

Errors are detected from log patterns:

OOM: “OUTOFDEVICEMEMORY” / “OUT OF MEMORY”
General error: “ERROR”, “FAILED TO LOAD”, “EXCEPTION”

Server exit is detected via a dedicated channel (not log parsing). On error, affected models are marked as Failed with the error message.
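The pattern matching above can be sketched as a small classifier. Because the documented patterns are upper-case, the sketch assumes each log line is upper-cased before matching, and it checks the OOM patterns first so that a line containing both "ERROR" and an OOM marker is reported as OOM. `LogError` and `classify_log_line` are hypothetical names.

```rust
#[derive(Debug, PartialEq)]
enum LogError {
    OutOfMemory,
    General,
}

/// Classify one server log line using the patterns listed above.
fn classify_log_line(line: &str) -> Option<LogError> {
    const OOM: [&str; 2] = ["OUTOFDEVICEMEMORY", "OUT OF MEMORY"];
    const GENERAL: [&str; 3] = ["ERROR", "FAILED TO LOAD", "EXCEPTION"];

    let upper = line.to_uppercase();
    if OOM.iter().any(|p| upper.contains(*p)) {
        Some(LogError::OutOfMemory)
    } else if GENERAL.iter().any(|p| upper.contains(*p)) {
        Some(LogError::General)
    } else {
        None
    }
}
```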

Confirmation Dialogs

Destructive actions trigger a GlobalMode::Confirmation overlay with ConfirmationKind variants: Exit, Reset, Delete, Unload, DeleteBackend. The user confirms with Enter or cancels with Esc.

Dialog height is calculated as lines.len() + 6 (content lines plus vertical padding), clamped to area.height - 4 to ensure it fits within the terminal. The dialog requires a minimum terminal height of 12 lines to render, preventing display on very small terminals where buttons would be cut off.
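The sizing rule can be written out as a short function. The name `dialog_height` and the use of `Option` to signal "do not render" are assumptions for illustration; the constants 6, 4, and 12 come from the paragraph above.

```rust
/// Minimum terminal height at which the dialog is rendered at all.
const MIN_TERMINAL_HEIGHT: u16 = 12;

/// Height of the confirmation dialog, or `None` when the terminal is too
/// small to show it. Content lines get 6 rows of padding, and the result
/// is capped at `area_height - 4`.
fn dialog_height(content_lines: u16, area_height: u16) -> Option<u16> {
    if area_height < MIN_TERMINAL_HEIGHT {
        return None;
    }
    Some(content_lines.saturating_add(6).min(area_height - 4))
}
```

For example, a 3-line message in a 40-row terminal gets a 9-row dialog, while a 30-line message in a 24-row terminal is capped at 20 rows.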

Internationalization (i18n)

All user-facing strings go through the i18n system defined in src/tui/i18n.rs. Translations are stored as JSON files in locales/<lang>.json (currently en.json, fr.json, it.json, de.json). The system loads all locale files at startup into a static LazyLock<HashMap> and switches language at runtime via Ctrl+L (cycles en → fr → it → en).

Key components:

TRANSLATIONS — static HashMap keyed by language code, each containing a map of key → string
CURRENT_LANG — thread-safe mutex holding the active language (persisted to config)
t!("key") — macro for simple string lookup with fallback (current lang → English → key itself)
t_fmt!("key", args...) — macro for strings with {} placeholders
field_help(field_id) — helper that constructs field.help.<id> keys for LLM Settings tooltips
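The fallback chain behind t!() (current language, then English, then the key itself) can be sketched with nested maps. This is an illustration only: the real system uses a static LazyLock and a macro, whereas the sketch passes the table explicitly, and the `Translations` alias and `translate` function are hypothetical names.

```rust
use std::collections::HashMap;

/// language code -> (key -> translated string)
type Translations = HashMap<&'static str, HashMap<&'static str, &'static str>>;

/// Look up `key` in `lang`, fall back to English, then to the key itself.
fn translate(translations: &Translations, lang: &str, key: &str) -> String {
    translations
        .get(lang)
        .and_then(|table| table.get(key))
        .or_else(|| translations.get("en").and_then(|table| table.get(key)))
        .map(|text| text.to_string())
        .unwrap_or_else(|| key.to_string())
}
```

Returning the key as the last resort keeps a missing translation visible in the UI instead of rendering an empty label.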

Naming convention: dot-separated hierarchical keys matching UI context (e.g. dialog.exit.title, field.help.context, hints.nav). Technical/internal strings (error messages for logs, debug output) may remain in code. User-facing strings (panel titles, button labels, help text, tooltips, dialog messages, hints) MUST use t!(). When adding a new key, it must be added to ALL locale files simultaneously.

Language switching persists the chosen language to ~/.config/llm-manager/config.yaml under the language field. The locale directory is resolved at runtime by checking: (1) locales/ alongside the binary, (2) LLM_MANAGER_LOCALES env var, (3) project root locales/ directory.

Key Bindings

Global Shortcuts

Key	Action
`/` (search mode)	Opens `GlobalMode::SearchInput`
`Ctrl+U`	Opens `DashboardUrl` modal (copies all URLs to clipboard)
`Ctrl+X`	Toggle expert mode
`Ctrl+G`	Opens `GgufNaming` overlay (GGUF filename explanation)
`Ctrl+P`	Opens `ProfilePicker` overlay
`Ctrl+L`	Cycles language: en → fr → it → en
`Ctrl+O`	Opens `Onboarding` wizard (resets onboarding_complete)
`Ctrl+C`	Exit confirmation if models loaded, else `app.running = false`
`Shift+Tab`	Focus prev
`Tab`	Focus next
`Ctrl+H`	Toggle panel help
`F1`	Focus Models panel
`F2`	Focus ServerSettings panel
`F3`	Focus LlmSettings panel
`F6`	Focus Log panel
`Ctrl+F2`	Toggle ServerSettings panel visibility
`Ctrl+F3`	Toggle LlmSettings panel visibility
`Alt+F3`	Toggle LlmSettings panel visibility
`Ctrl+F4`	Toggle ModelInfo panel visibility
`Ctrl+F5`	Toggle ActiveModel panel visibility
`Ctrl+F6`	Toggle Log panel visibility
`Ctrl+F10`	Show all panels
`F10`	Hide all panels except Models

Index	Setting	Action
0	Host	Opens HostPicker
1	Backend	Opens BackendPicker
2	Threads	Cycles 1..max_threads
3	Threads Batch	Cycles 1..32
4	Mode	Cycles Normal → Router → Bench → BenchTune → Normal
5	API Endpoint	Opens ApiEndpointPicker
6	Dashboard	Opens DashboardPicker
7	RPC Workers	Opens RpcManager
8	Web Search	Opens WebSearchPicker
9	Language	Cycles language

Overlay Registry

The overlay system in src/tui/event/overlay/mod.rs uses a registry pattern with 21 handler types. Each handler implements the OverlayHandler trait with can_handle() and handle() methods. The registry dispatches key events to the appropriate handler based on the current GlobalMode.
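A registry of this shape might look like the sketch below. Only the trait name and its two method names come from the text. The method signatures, the reduced `GlobalMode` enum, the use of `char` for key events, and the `ConfirmationHandler` and `OverlayRegistry` types are simplifying assumptions.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum GlobalMode {
    Normal,
    Confirmation,
}

trait OverlayHandler {
    /// Does this handler own the given mode?
    fn can_handle(&self, mode: GlobalMode) -> bool;
    /// Process a key; returns true when the key was consumed.
    fn handle(&mut self, key: char) -> bool;
}

/// Example handler: consumes Enter ('\n') and Esc ('\x1b') in confirmation dialogs.
struct ConfirmationHandler;

impl OverlayHandler for ConfirmationHandler {
    fn can_handle(&self, mode: GlobalMode) -> bool {
        mode == GlobalMode::Confirmation
    }
    fn handle(&mut self, key: char) -> bool {
        matches!(key, '\n' | '\x1b')
    }
}

struct OverlayRegistry {
    handlers: Vec<Box<dyn OverlayHandler>>,
}

impl OverlayRegistry {
    /// Route the key to the first handler that claims the current mode.
    fn dispatch(&mut self, mode: GlobalMode, key: char) -> bool {
        self.handlers
            .iter_mut()
            .find(|h| h.can_handle(mode))
            .map_or(false, |h| h.handle(key))
    }
}
```

Adding an overlay then means adding one handler to the registry rather than extending a central match statement.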

Render Pipeline

The render pipeline in src/tui/render/render.rs orchestrates all panel layout:

Status bar (1 line) — mode indicator, server status, bench progress
Main area (fill) — split by left_pct (20-80%) for models vs settings
Active model panel (6 lines) — metrics display
Bottom area — log panel (expandable) and downloads

Log expansion doubles the log height at the expense of other panels. Panel visibility is controlled by bitflags (0b111111 = all panels visible).
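Panel visibility as a six-bit mask can be sketched with plain integer constants. The bit assigned to each panel is an assumption (the text only states that 0b111111 means all panels are visible), and the helper names are hypothetical.

```rust
// Hypothetical bit assignment, one bit per toggleable panel.
const MODELS: u8 = 1 << 0;
const SERVER_SETTINGS: u8 = 1 << 1;
const LLM_SETTINGS: u8 = 1 << 2;
const MODEL_INFO: u8 = 1 << 3;
const ACTIVE_MODEL: u8 = 1 << 4;
const LOG: u8 = 1 << 5;

/// All six bits set: every panel is visible.
const ALL_PANELS: u8 =
    MODELS | SERVER_SETTINGS | LLM_SETTINGS | MODEL_INFO | ACTIVE_MODEL | LOG;

/// Flip one panel's visibility (what a Ctrl+F<n> toggle would do).
fn toggle(mask: u8, panel: u8) -> u8 {
    mask ^ panel
}

fn is_visible(mask: u8, panel: u8) -> bool {
    (mask & panel) != 0
}
```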

Settings Panel

The tabbed settings panel (src/tui/panel/tabbed.rs) combines Server Settings and LLM Settings into a unified interface with tabs:

UNSAVED watermark in red dimmed text when settings are dirty
Help text auto-display after 1.5s focus
Settings rendered as key-value pairs with edit modes (toggle, cycle, text input, picker)
Expert mode (Ctrl+X) reveals additional fields

API Reference

The full Rust API reference is available at docs.rs/llm-manager.

Generate it locally with:

cargo doc --open

Public Types

Core Types

Type	Module	Description
`DiscoveredModel`	`models`	A discovered `.gguf` file with path, name, file_size, and display name
`ModelSettings`	`models`	All settings for loading a model via llama.cpp server (70+ fields)
`ModelState`	`models`	State of a model: `Available`, `Loading`, `Benchmarking`, `Loaded`, or `Failed`
`SearchResult`	`models`	A model found via HuggingFace search
`DownloadState`	`models`	Download progress tracking with cancellation support
`GgufMetadata`	`models`	Parsed GGUF metadata (layers, hidden size, context, etc.)
`ServerMetrics`	`models`	Metrics from the llama.cpp server (TPS, VRAM, CPU, context, latency, prompt progress)
`WsMetrics`	`models`	WebSocket-friendly metrics snapshot (serializable, includes settings, command display, timestamp)
`LogEntry`	`config`	A single log entry with timestamp, level, and message
`GPUBuffer`	`models`	GPU device buffer reported during model loading (device, buffer_size_mib)
`LoadProgress`	`models`	Progress during model loading (layers_total, layers_loaded, tensors_total, tensors_loaded, buffers)

Enums

Type	Module	Description
`Backend`	`models`	Acceleration backend: `Cpu`, `Vulkan`, `Rocm`, `RocmLemonade`, `Cuda`, `CpuArm64`, `CpuWindows`, `VulkanWindows`, `CudaWindows12_4`, `CudaWindows13_1`, `HipWindows`, `CpuMacosArm64`, `CpuMacosX64`
`ServerMode`	`models`	Server operating mode: `Normal` (single model), `Router` (multiple), `Bench` (GPU benchmarking), or `BenchTune` (parameter auto-tuning)
`GpuLayersMode`	`models`	GPU offloading: `Auto`, `Specific(n)`, or `All`
`SearchSort`	`models`	Search result sort order: `Relevance`, `Downloads`, `Likes`, `Trending`, `CreatedAt`
`ListSort`	`models`	Local model list sort order: `Name`, `Size`, `Modified`
`CacheType`	`models`	Main KV cache data type: `F16`, `BF16`, `Fq8_0`, `Fq4_1`
`CacheQuantType`	`models`	KV cache data type for quantization (F32, F16, BF16, Q8_0, Q5_0, Q5_1, Q4_0, Q4_1, Iq4Nl)
`CacheTypeK` / `CacheTypeV`	`models`	Type aliases for `CacheQuantType` (used for keys and values)
`SplitMode`	`models`	Multi-GPU split mode: `None`, `Layer`, `Row`, `Tensor`
`NumMode`	`models`	NUMA optimization: `None`, `Distribute`, `Isolate`, `Numactl`
`RopeScaling`	`models`	RoPE frequency scaling: `None`, `Linear`, `Yarn`
`Mirostat`	`models`	Mirostat version: `Off`, `V1`, `Mirostat2`
`ActivePanel`	`app`	Focused panel: `Models`, `Log`, `ServerSettings`, `LlmSettings`, `Profiles`, `SystemPromptPresets`, `SearchReadme`, `ActiveModel`, `ModelInfo`, `Downloads`
`ConfirmationKind`	`app`	Confirmation dialog type: `Exit`, `Reset`, `Delete`, `Unload`, `DeleteBackend`
`LoadingPhase`	`app`	Phase of model loading: `ServerStarting`, `LoadingModel`, `LoadingMeta`, `LoadingTensors`, `ServerListening`, `Complete`
`Samplers`	`models`	Semicolon-separated sampler order string
`BenchTuneMode`	`benchmark`	Benchmark mode: `RuntimeOnly` or `Full` (default: `Full`)
`BenchTuneStatus`	`benchmark`	Status: `Running`, `Completed`, `PartiallyCompleted`, `Cancelled`, or `Error`
`WebSearchCheckStatus`	`app`	Web search status: `Checking`, `Ok`, `Error(String)`

Main Modules

`backend::hub`

HuggingFace API integration.

#![allow(unused)]
fn main() {
/// Search models on HuggingFace.
pub async fn search_models(query: &str, limit: u32, offset: u32) -> Result<(Vec<SearchResult>, usize, Vec<String>)> // third element: raw model IDs for post-filtering

/// List all GGUF files for a model.
pub async fn list_gguf_files(model_id: &str) -> Result<Vec<(String, u64, String)>>

/// Fetch the README for a model from HuggingFace.
pub async fn fetch_readme(model_id: &str) -> Result<String>

/// Download a file with progress tracking.
pub async fn download_file(
    model_id: &str,
    filename: &str,
    url: &str,
    dest: &Path,
    progress: &mut DownloadState,
    download_state: Arc<AtomicU8>,
    tx: broadcast::Sender<DownloadState>,
) -> Result<()>

/// Get available free disk space in bytes for a given path.
pub fn get_free_space_bytes(path: &Path) -> u64

/// Resolve the llama-server binary path for a given backend.
/// Downloads the binary from GitHub releases if not already cached.
pub async fn resolve_backend_binary(
    backend: Backend,
    tag: Option<&str>,
    log_tx: Option<mpsc::Sender<String>>,
    progress_tx: Option<tokio::sync::broadcast::Sender<crate::models::DownloadState>>,
) -> Result<PathBuf>
}

`backend::server`

llama.cpp server process management.

#![allow(unused)]
fn main() {
/// Manages a single llama.cpp server process.
pub struct ServerHandle {
    pub port: u16,
    pub host: String,
    pub pid: u32,
    pub kill_tx: mpsc::Sender<()>,
}

/// Build the full llama-server command line from settings.
pub fn build_server_cmd(
    binary: &Path,
    model: Option<&DiscoveredModel>,
    settings: &ModelSettings,
    config: &Config,
    server_mode: ServerMode,
    router_max_models: u32,
) -> (Command, String)

/// Request to spawn a llama.cpp server process.
pub struct SpawnServerRequest<'a> {
    pub config: &'a Config,
    pub model: Option<&'a DiscoveredModel>,
    pub settings: &'a ModelSettings,
    pub log_tx: mpsc::Sender<String>,
    pub progress_tx: Option<tokio::sync::broadcast::Sender<DownloadState>>,
    pub server_mode: ServerMode,
    pub router_max_models: u32,
    pub exit_tx: mpsc::Sender<()>,
}

/// Spawn a llama.cpp server process.
pub async fn spawn_server(request: SpawnServerRequest) -> Result<(ServerHandle, String), String>

/// Check if the server is healthy and responsive.
pub async fn check_health(host: &str, port: u16) -> bool

/// Kill a running server.
pub async fn kill_server(handle: ServerHandle) -> Result<(), String>

/// Poll metrics from the server.
pub async fn get_metrics(
    host: &str,
    port: u16,
    model_name: Option<&str>,
    pid: Option<u32>,
) -> Result<ServerMetrics, String>

/// Load a model via the llama-server Router API.
pub async fn load_model(host: &str, port: u16, model_id: &str, model_path: Option<&str>) -> Result<(), String>

/// List all models and their status from the llama-server Router API.
pub async fn list_models(host: &str, port: u16) -> Result<Vec<(String, String, Option<String>)>, String>

/// Unload a model via the llama-server Router API.
pub async fn unload_model(host: &str, port: u16, model_id: &str, model_path: Option<&str>) -> Result<(), String>
}

`config`

Configuration loading and saving.

#![allow(unused)]
fn main() {
/// Global configuration.
pub struct Config {
    pub models_dirs: Vec<PathBuf>,
    pub llama_server: PathBuf,
    pub default: DefaultParams,
    pub model_overrides: ModelConfigStore,
    pub profiles: ProfileStore,
    pub system_prompt_presets: PresetStore,
    pub rpc_workers: Vec<RpcWorker>,
    pub search_limit: u32,
    pub active_panel: ActivePanel,
    pub left_pct: u16,
    pub language: String,        // UI language (en, fr, it, de)
    pub onboarding_complete: bool,
}

/// Default parameters for new models.
pub struct DefaultParams {
    pub context_length: u32,
    pub threads: u32,
    pub threads_batch: u32,
    pub batch_size: u32,
    pub ubatch_size: u32,
    pub parallel: u32,
    pub max_concurrent_predictions: Option<u32>,
    pub temperature: f32,
    pub top_k: i32,
    pub top_p: f32,
    pub min_p: f32,
    pub typical_p: f32,
    pub seed: i32,
    pub repeat_penalty: f32,
    pub repeat_last_n: i32,
    pub presence_penalty: f32,
    pub frequency_penalty: f32,
    pub dry_multiplier: f32,
    pub dry_base: f32,
    pub dry_allowed_length: i32,
    pub dry_penalty_last_n: i32,
    pub rope_scaling: RopeScaling,
    pub rope_scale: f32,
    pub rope_freq_base: f32,
    pub rope_freq_scale: f32,
    pub rope_yarn_enabled: bool,
    pub host: String,
    pub port: u16,
    pub timeout: u32,
    pub cache_prompt: bool,
    pub cache_reuse: u32,
    pub webui: bool,
    pub ws_server_enabled: bool,
    pub ws_server_port: u16,
    pub server_tls_enabled: bool,
    pub server_tls_cert: Option<String>,
    pub server_tls_key: Option<String>,
    pub router_max_models: u32,
    pub server_mode: ServerMode,
    pub max_tokens: Option<u32>,
    pub cache_type: CacheType,
    pub backend: Backend,
    pub platform: Option<String>,
    pub llama_cpp_version_cpu: Option<String>,
    pub llama_cpp_version_vulkan: Option<String>,
    pub llama_cpp_version_rocm: Option<String>,
    pub llama_cpp_version_rocm_lemonade: Option<String>,
    pub llama_cpp_version_cuda: Option<String>,
    pub api_endpoint_enabled: bool,
    pub api_endpoint_port: u16,
    pub web_search_engine: String,
    pub web_search_engine_url: String,
    pub web_search_enabled: bool,
    pub web_search_api_key: Option<String>,
    pub api_endpoint_key: Option<String>,
    pub spec_type: String,
    pub draft_tokens: u32,
    pub tags: Vec<String>,
}

/// A remote RPC worker for distributed inference.
pub struct RpcWorker {
    pub selected: bool,
    pub name: String,
    pub ip: String,
    pub port: u16,
}

/// A named profile of settings.
pub struct Profile {
    pub name: String,
    pub description: String,
    pub settings: ModelOverride,
}

impl Profile {
    pub fn apply(&self, base: ModelSettings) -> ModelSettings
}

/// A named system prompt preset.
pub struct SystemPromptPreset {
    pub name: String,
    pub description: String,
    pub content: String,
}

/// Per-model settings override (optional fields).
pub struct ModelOverride {
    // Loading
    pub context_length: Option<u32>,
    pub batch_size: Option<u32>,
    pub ubatch_size: Option<u32>,
    pub cache_type_k: Option<CacheTypeK>,
    pub cache_type_v: Option<CacheTypeV>,
    pub keep: Option<i32>,
    pub swa_full: Option<bool>,
    pub mlock: Option<bool>,
    pub mmap: Option<bool>,
    pub numa: Option<NumMode>,
    pub uniform_cache: Option<bool>,
    pub system_prompt: Option<String>,
    pub system_prompt_preset_name: Option<String>,
    pub max_concurrent_predictions: Option<u32>,
    pub threads: Option<u32>,
    pub threads_batch: Option<u32>,
    pub parallel: Option<u32>,
    // GPU
    pub gpu_layers: Option<i32>,
    pub split_mode: Option<SplitMode>,
    pub tensor_split: Option<String>,
    pub main_gpu: Option<i32>,
    pub fit: Option<bool>,
    pub lora: Option<PathBuf>,
    pub lora_scaled: Option<(PathBuf, f32)>,
    pub rpc: Option<String>,
    pub embedding: Option<bool>,
    pub kv_cache_offload: Option<bool>,
    pub flash_attn: Option<bool>,
    pub jinja: Option<bool>,
    pub auto_chat_template: Option<bool>,
    pub chat_template: Option<String>,
    pub chat_template_kwargs: Option<String>,
    pub expert_count: Option<i32>,
    pub gpu_layers_mode: Option<GpuLayersMode>,
    // Sampling
    pub seed: Option<i32>,
    pub temperature: Option<f32>,
    pub top_k: Option<i32>,
    pub top_p: Option<f32>,
    pub min_p: Option<f32>,
    pub typical_p: Option<f32>,
    pub mirostat: Option<Mirostat>,
    pub mirostat_lr: Option<f32>,
    pub mirostat_ent: Option<f32>,
    pub ignore_eos: Option<bool>,
    pub samplers: Option<Samplers>,
    // Repetition
    pub repeat_penalty: Option<f32>,
    pub repeat_last_n: Option<i32>,
    pub presence_penalty: Option<f32>,
    pub frequency_penalty: Option<f32>,
    pub dry_multiplier: Option<f32>,
    pub dry_base: Option<f32>,
    pub dry_allowed_length: Option<i32>,
    pub dry_penalty_last_n: Option<i32>,
    // RoPE
    pub rope_scaling: Option<RopeScaling>,
    pub rope_scale: Option<f32>,
    pub rope_freq_base: Option<f32>,
    pub rope_freq_scale: Option<f32>,
    pub rope_yarn_enabled: Option<bool>,
    // Server
    pub cache_prompt: Option<bool>,
    pub cache_reuse: Option<u32>,
    pub webui: Option<bool>,
    // Other
    pub max_tokens: Option<u32>,
    pub cache_type: Option<CacheType>,
    pub llama_cpp_version_cpu: Option<String>,
    pub llama_cpp_version_vulkan: Option<String>,
    pub llama_cpp_version_rocm: Option<String>,
    pub llama_cpp_version_rocm_lemonade: Option<String>,
    pub llama_cpp_version_cuda: Option<String>,
    pub spec_type: Option<String>,
    pub draft_tokens: Option<u32>,
    pub tags: Option<Vec<String>>,
}

/// Built-in profiles with sensible defaults.
pub fn builtin_profiles() -> Vec<Profile>

/// Built-in system prompt presets.
pub fn builtin_system_prompt_presets() -> Vec<SystemPromptPreset>
}

`backend::ws_server`

WebSocket dashboard server.

pub struct WsAppState {
    pub metrics_rx: Arc<broadcast::Receiver<WsMetrics>>,
    pub auth_key: Option<String>,
}

pub async fn start_ws_server(
    port: u16,
    metrics_rx: Arc<broadcast::Receiver<WsMetrics>>,
    auth_key: Option<String>,
    tls_config: Option<axum_server::tls_rustls::RustlsConfig>,
    host: String,
) -> Result<JoinHandle<()>>

`backend::benchmark`

Benchmark tuning system.

#![allow(unused)]
fn main() {
/// Configuration for a benchmark run.
pub struct BenchTuneConfig {
    pub model_path: PathBuf,
    pub num_iterations: u32,
    pub prompt: String,
    pub params_to_test: Vec<BenchTuneParam>,
    pub test_duration: Duration,
    pub bench_mode: BenchTuneMode,
    pub n_predict: u32,
    pub chat_template_kwargs: Option<String>,
    pub test_timeout: Duration,
}

/// A tunable parameter for benchmarking.
pub struct BenchTuneParam {
    pub name: String,
    pub min: f64,
    pub max: f64,
    pub step: f64,
    pub enabled: bool,
    pub variants: Vec<String>,
}

/// Actual parameter values for a benchmark run.
pub struct BenchTuneParamValue {
    pub temperature: Option<f64>,
    pub top_p: Option<f64>,
    pub top_k: Option<i64>,
    pub repeat_penalty: Option<f64>,
    pub context_length: Option<u32>,
    pub batch_size: Option<u32>,
    pub flash_attn: Option<bool>,
    pub threads: Option<u32>,
    pub expert_count: Option<i32>,
    pub spec_type: Option<String>,
    pub draft_tokens: Option<u32>,
}

/// Results from a benchmark run.
pub struct BenchTuneResult {
    pub params: BenchTuneParamValue,
    pub metrics: BenchTuneMetrics,
    pub outputs: Vec<String>,
    pub per_iteration_metrics: Vec<BenchTuneMetrics>,
    pub base_settings: Option<ModelSettings>,
    pub server_command: Option<String>,
}

/// Metrics from a benchmark run.
pub struct BenchTuneMetrics {
    pub prompt_tps: f64,
    pub generation_tps: f64,
    pub combined_tps: f64,
    pub latency_per_token: f64,
    pub first_token_time: f64,
}
}

`backend::tls`

TLS certificate management.

/// Load TLS config from PEM certificate and key files.
pub fn load_tls_config(cert_path: &Path, key_path: &Path) -> Result<RustlsConfig>

/// Generate a self-signed CA (certificate + key).
pub fn generate_ca() -> Result<(String, String)>

/// Sign a server certificate with the CA.
pub fn generate_server_cert(ca_cert: &str, ca_key: &str) -> Result<(String, String)>

/// Ensure TLS certs exist, auto-generating if missing.
pub fn ensure_tls_certs() -> Result<(PathBuf, PathBuf)>

/// Validate a TLS certificate/key path.
pub fn validate_tls_path(path: &Path) -> Result<()>

/// Try to load TLS config from paths, returning None if paths are empty.
pub fn try_load_tls(cert_path: &str, key_path: &str) -> Result<Option<RustlsConfig>>

`backend::web_search`

Web search integration with SearXNG.

/// Search using the configured web search engine.
pub async fn search_web(query: &str, engine_url: &str, api_key: Option<&str>) -> Result<WebSearchResults>

/// Parse SearXNG JSON response into search results.
pub fn parse_searxng_response(json: &str) -> Result<WebSearchResults>

`models`

Domain types and utilities.

/// Estimate VRAM usage (in MiB) for a model with the given settings.
pub fn estimate_vram_mib(
    model_mib: u64,
    settings: &ModelSettings,
    total_layers: u32,
    hidden_size_opt: Option<u32>,
    n_head_opt: Option<u32>,
    n_kv_head_opt: Option<u32>,
    gpu_mem_total_mib: u64,
) -> u64

/// Format host for display ("" or "127.0.0.1" -> "localhost").
pub fn format_host(host: &str) -> String

ServerMetrics

Metrics collected from the llama.cpp server:

#![allow(unused)]
fn main() {
pub struct ServerMetrics {
    pub loaded: bool,
    pub tps: f64,
    pub prompt_tps: f64,
    pub cpu_usage: f64,
    pub gpu_mem_used: u64,
    pub gpu_mem_total: u64,
    pub ram_used: u64,
    pub ctx_used: u32,
    pub ctx_max: u32,
    pub total_vram_used: u64,           // Sum across all loaded models
    pub decoded_tokens: u64,            // Tokens from print_timing logs
    pub gen_tps: f64,                   // Generation TPS from log parsing
    pub latency_per_token_ms: f64,      // Estimated latency per token
    pub prompt_latency_ms: f64,         // Prompt processing latency
    pub prompt_tokens: u64,             // Tokens in prompt being evaluated
    pub prompt_progress: f64,           // Progress of prompt evaluation (0.0-1.0)
    pub prompt_elapsed_ms: f64,         // Elapsed prompt evaluation time
    pub prompt_tps_eval: f64,           // Prompt evaluation throughput
}
}

Configuration

Configuration is stored in ~/.config/llm-manager/config.yaml and loaded via Config::load(). The file has the following structure:

models_dirs:
  - ~/.local/share/llm-manager/models
llama_server: llama-server
default:
  context_length: 131072
  threads: <physical cores>
  # ... more default parameters
  ws_server_enabled: false
  ws_server_port: 49223
  server_tls_enabled: true
  api_endpoint_enabled: false
  api_endpoint_port: 49222
  web_search_enabled: false
  web_search_engine: searxng
  web_search_engine_url: ""
  spec_type: "draft-mtp"
  draft_tokens: 0
  tags: []
model_overrides:
  # Per-model configs stored as individual YAML files in ~/.config/llm-manager/models/
  model.gguf:
    temperature: 0.7
    gpu_layers: 32
profiles:
  - name: Qwen
    description: Optimized for Qwen models
    settings:
      temperature: 0.6
      top_k: 20
rpc_workers:
  - name: Remote-GPU-1
    ip: 192.168.1.50
    port: 50052
    selected: true
system_prompt_presets:
  - name: General
    description: General-purpose assistant
    content: "You are a helpful assistant."
language: en
onboarding_complete: true
search_limit: 50
active_panel: Models
left_pct: 55

Built-in profiles are merged on load, so new ones added in code automatically appear in the UI.

Systemd Deployment

Run llm-manager as a background service using systemd. This requires a .env file for configuration and a .service unit file.

Environment File

Create a .env file with your settings. systemd loads it through EnvironmentFile=, and the ExecStart wrapper in the service file expands each variable into the matching flag of the serve command:

# Path to the model file (.gguf)
LLM_MODEL=/path/to/your/model.gguf

# Path to a per-model override YAML file (optional)
LLM_MODEL_CONFIG=/path/to/model-config.yaml

# Path to main config.yaml (systemd does not expand ~ here, so an absolute path is safer)
LLM_CONFIG=/home/<user>/.config/llm-manager/config.yaml

# Settings profile (optional, e.g. qwen, llama, mistral)
# LLM_PROFILE=qwen

# API proxy port (optional, default: 49222)
LLM_API_PORT=49222

# Path to a custom llama-server binary (optional, defaults to auto-resolved)
# LLM_BACKEND_BINARY=/opt/rocm/bin/llama-server

# Host to bind the llama-server to (optional, defaults to config.yaml)
# LLM_HOST=0.0.0.0

# Log file path (optional, defaults to stdout)
# LLM_LOG_FILE=/var/log/llm-manager/model.log

# Enable WebSocket dashboard server
DASHBOARD_ENABLED=true

# WebSocket dashboard port (optional, default: 49223)
# LLM_WS_PORT=49223

# API key for Bearer token authentication (optional)
# LLM_API_KEY=secret

# TLS for WebSocket dashboard and API server (optional)
# LLM_TLS_ENABLE=true
# LLM_TLS_CERT=/path/to/cert.pem
# LLM_TLS_KEY=/path/to/key.pem

Environment Variables

Variable	Required	Description
`LLM_MODEL`	Yes	Path to the GGUF model file
`LLM_MODEL_CONFIG`	No	Path to per-model override YAML (auto-detected from model path if not set)
`LLM_CONFIG`	No	Path to `config.yaml` (falls back to `~/.config/llm-manager/config.yaml`)
`LLM_PROFILE`	No	Settings profile name (e.g., `qwen`, `llama`)
`LLM_API_PORT`	No	API proxy port (default: 49222)
`LLM_BACKEND_BINARY`	No	Custom llama-server binary path
`LLM_HOST`	No	Bind address (default: from config)
`LLM_LOG_FILE`	No	Log file path (default: stdout)
`DASHBOARD_ENABLED`	No	Enable WebSocket dashboard (any non-empty value enables it, even `false`; leave unset to disable)
`LLM_WS_PORT`	No	Dashboard port (default: 49223)
`LLM_API_KEY`	No	Bearer token for API/Dashboard auth
`LLM_TLS_ENABLE`	No	Enable TLS (any non-empty value enables it, even `false`; leave unset to disable)
`LLM_TLS_CERT`	No	TLS certificate path
`LLM_TLS_KEY`	No	TLS private key path

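Variables that are unset, empty, or commented out are omitted from the command line. Each flag is written as ${VAR:+--flag "${VAR}"}, a bash expansion that emits the flag only when the variable is set and non-empty. The value-less switches (--ws-enable, --tls-enable) follow the same rule, so any non-empty value, including false, turns them on.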

Service File

Install the service unit file in ~/.config/systemd/user/ (user scope) or /etc/systemd/system/ (system scope):

[Unit]
Description=LLM Server
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=aginies
Group=aginies
EnvironmentFile=/home/aginies/llmtui/llm-manager.env
ExecStart=/bin/bash -l -c '\
exec /home/aginies/llmtui/target/release/llm-manager serve \
  --model "${LLM_MODEL}" \
  ${LLM_PROFILE:+--profile "${LLM_PROFILE}"} \
  ${LLM_API_PORT:+--api-port "${LLM_API_PORT}"} \
  ${LLM_CONFIG:+--config "${LLM_CONFIG}"} \
  ${LLM_MODEL_CONFIG:+--model-config "${LLM_MODEL_CONFIG}"} \
  ${LLM_BACKEND_BINARY:+--backend-binary "${LLM_BACKEND_BINARY}"} \
  ${LLM_HOST:+--host "${LLM_HOST}"} \
  ${LLM_LOG_FILE:+--log-file "${LLM_LOG_FILE}"} \
  ${DASHBOARD_ENABLED:+--ws-enable} \
  ${LLM_WS_PORT:+--ws-port "${LLM_WS_PORT}"} \
  ${LLM_API_KEY:+--api-key "${LLM_API_KEY}"} \
  ${LLM_TLS_ENABLE:+--tls-enable} \
  ${LLM_TLS_CERT:+--tls-cert "${LLM_TLS_CERT}"} \
  ${LLM_TLS_KEY:+--tls-key "${LLM_TLS_KEY}"}'
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

User vs System Scope

User scope (recommended for single-user setups):

cp llm-manager.service ~/.config/systemd/user/
systemctl --user daemon-reload
systemctl --user enable llm-manager
systemctl --user start llm-manager

System scope (for multi-user or root-managed services):

sudo cp llm-manager.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable llm-manager
sudo systemctl start llm-manager

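Update EnvironmentFile= and the ExecStart path as needed. For system scope, set User= and Group= to the account that should run the server. For user scope, remove User= and Group= (the unit already runs as your user) and change WantedBy= to default.target, because multi-user.target does not exist in the user manager.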

Management Commands

# Start the service
systemctl --user start llm-manager

# Stop the service
systemctl --user stop llm-manager

# Restart the service
systemctl --user restart llm-manager

# Enable auto-start (at login; run `loginctl enable-linger` to start at boot without a login session)
systemctl --user enable llm-manager

# Check status
systemctl --user status llm-manager

# View logs
journalctl --user -u llm-manager -f

# Reload config after .env changes
systemctl --user reload-or-restart llm-manager

Per-Model Config Auto-Detection

When LLM_MODEL_CONFIG is not set, llm-manager auto-detects the per-model config file from the model path. The key is derived from the model’s display name:

Strip the first models directory prefix from the model path
Replace path separators (/) with double underscores (__)
Append .yaml to the key

For example, model path /home/aginies/.local/share/llm-manager/models/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf with models dir /home/aginies/.local/share/llm-manager/models produces key unsloth__Qwen3.6-35B-A3B-MTP-GGUF__Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf, and the config file is looked up at ~/.config/llm-manager/models/unsloth__Qwen3.6-35B-A3B-MTP-GGUF__Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf.yaml.
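The three derivation steps can be sketched as a single function. `model_config_path` is a hypothetical name, the sketch assumes Unix-style `/` separators, and it takes the configuration directory as a parameter instead of resolving ~/.config/llm-manager itself.

```rust
use std::path::{Path, PathBuf};

/// Derive the per-model config file path from the model path:
/// strip the models directory, replace `/` with `__`, append `.yaml`.
/// Returns `None` if the model is not inside `models_dir` or the
/// relative path is not valid UTF-8.
fn model_config_path(model: &Path, models_dir: &Path, config_dir: &Path) -> Option<PathBuf> {
    let relative = model.strip_prefix(models_dir).ok()?;
    let key = relative.to_str()?.replace('/', "__");
    Some(config_dir.join("models").join(format!("{key}.yaml")))
}
```

Flattening the relative path into one file name keeps all per-model configs in a single directory while still distinguishing files with the same name from different repositories.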

Override Priority

Settings are applied in this order (later overrides earlier):

Rust hardcoded defaults
config.yaml default section
config.yaml per-model overrides (model_overrides section)
config.yaml profile
Per-model config YAML file (LLM_MODEL_CONFIG or auto-detected)
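The layering can be pictured as folding a list of optional overrides over a set of defaults, where each later layer replaces only the fields it sets. The sketch below uses three fields and hypothetical `Settings`, `Overrides`, `apply`, and `resolve` names. It illustrates the ordering rule and is not the project's Profile::apply implementation.

```rust
/// Fully resolved settings (reduced to three fields for illustration).
#[derive(Clone, Debug, PartialEq)]
struct Settings {
    temperature: f32,
    top_k: i32,
    gpu_layers: i32,
}

/// One override layer: `None` means "leave the earlier value alone".
#[derive(Clone, Debug, Default, PartialEq)]
struct Overrides {
    temperature: Option<f32>,
    top_k: Option<i32>,
    gpu_layers: Option<i32>,
}

/// Apply a single layer on top of `base`.
fn apply(base: Settings, layer: &Overrides) -> Settings {
    Settings {
        temperature: layer.temperature.unwrap_or(base.temperature),
        top_k: layer.top_k.unwrap_or(base.top_k),
        gpu_layers: layer.gpu_layers.unwrap_or(base.gpu_layers),
    }
}

/// Apply layers in order; later layers win.
fn resolve(defaults: Settings, layers: &[Overrides]) -> Settings {
    layers.iter().fold(defaults, apply)
}
```

With a per-model override of temperature 0.7 and gpu_layers 32, followed by a profile that sets temperature 0.6 and top_k 20, the resolved settings take temperature and top_k from the profile and keep gpu_layers 32 from the override.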

Example: Full Production Configuration

# llm-manager.env
LLM_MODEL=/home/aginies/.local/share/llm-manager/models/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
LLM_MODEL_CONFIG=/home/aginies/.config/llm-manager/models/unsloth__Qwen3.6-35B-A3B-MTP-GGUF__Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf.yaml
LLM_CONFIG=/home/aginies/.config/llm-manager/config.yaml
LLM_API_PORT=49222
DASHBOARD_ENABLED=true
LLM_WS_PORT=49223
LLM_API_KEY=secret
LLM_TLS_ENABLE=true
LLM_TLS_CERT=/home/aginies/.config/llm-manager/tls/ryzen9-linux.crt
LLM_TLS_KEY=/home/aginies/.config/llm-manager/tls/ryzen9-linux.key
LLM_LOG_FILE=/var/log/llm-manager/model.log

This enables the API proxy, WebSocket dashboard with authentication, TLS encryption, and log file output — all suitable for a production server.
