Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Introduction

LLM Manager is a terminal UI (TUI) for managing local LLM models. It lets you search HuggingFace, download GGUF models, load them via llama.cpp’s llama-server, and chat with them — all from your terminal.

Features

  • Model search on HuggingFace (filters to GGUF models, paginated with infinite scroll)
  • Download GGUF model files with progress tracking and cancellation (with disk space check)
  • Load/unload models via llama.cpp server with progress visualization
  • Local Model Filter — quickly find models in your list with f
  • RPC Workers Manager — dedicated window to manage distributed inference nodes
  • Chat with loaded models in the terminal
  • Configure loading and inference parameters per model
  • GGUF file browser — list and select specific GGUF files for a model
  • Log panel — expand/collapse with Enter/Esc, follow mode with f
  • About Box — application info and GPLv3 license link (A)
  • CmdLine overlay — view the full llama-server command line (Ctrl+K), export to script (e)
  • API proxy — expose an OpenAI-compatible API with CORS and SSE streaming support
  • API key authentication — Bearer token authentication for the API proxy
  • Profiles — save and apply named presets of settings
  • System Prompt Presets — named system prompts for different use cases
  • Router Mode — load multiple models simultaneously
  • Benchmark Tuning — auto-tune model parameters for optimal performance
  • Panel Resize — drag the border between left and right panels, or use Shift+←/→
  • README rendering — full markdown renderer for HuggingFace model documentation
  • HuggingFace URL links — navigate to model pages from Model Info
  • Multi-backend — CPU, Vulkan, ROCm, ROCm Lemonade, and CUDA support with per-backend version picker (13 platform-specific variants)
  • Speculative decoding — MTP and other speculative decoding types via SpecTypePicker
  • Per-model tags — Edit and manage tags for each model
  • TLS support — Secure WebSocket dashboard with self-signed certificate generation
  • Dashboard URL modal — Copy dashboard URL to clipboard with Ctrl+U
  • YaRN RoPE — Extend context beyond training length with YaRN RoPE parameter tuning

Prerequisites

  • Rust toolchain (edition 2024)
  • A HuggingFace account (for downloading gated models)
  • An NVIDIA GPU (Vulkan/CUDA) or AMD GPU (ROCm/ROCm Lemonade) for GPU inference, or a CPU for CPU-only inference

Screenshot

LLM Manager

Quick Start

git clone https://github.com/aginies/llmtui.git
cd llmtui
cargo build --release
cargo run

Getting Started

Installation

From source

git clone https://github.com/aginies/llmtui.git
cd llmtui
cargo build --release

Platform Support

llm-manager runs on Linux, macOS, and Windows. GPU backends available per platform:

PlatformCPUVulkanROCmROCm LemonadeCUDA
Linux x64YesYesYesYesYes
Linux ARM64Yes
Windows x64YesYesYes (HIP)Yes (12.4 / 13.1)
macOS ARM64Yes
macOS x64Yes

ROCm Lemonade (AMD optimized) is Linux-only and auto-detects your GPU architecture (e.g. gfx1100).

Using the build script

A convenience script is included for common operations:

./build.sh build      # Build (debug)
./build.sh run        # Build and run (TUI mode)
./build.sh serve      # Serve a model
./build.sh servedoc   # Serve docs with watch mode
./build.sh release    # Release build
./build.sh clean      # Remove build artifacts
./build.sh format     # Format code
./build.sh clippy     # Run clippy
./build.sh doc        # Build documentation
./build.sh help       # Show help

First Run

On first launch, llm-manager creates a default configuration in ~/.config/llm-manager/config.yaml and sets up the models directory at ~/.local/share/llm-manager/models/.

cargo run

The application will:

  1. Load (or create) the config file
  2. Discover any .gguf files in the models directory
  3. Start the TUI

The TUI is divided into several panels:

  • Models panel (left) — list of local GGUF models
  • Settings panel (right) — server and LLM settings
  • Log panel (bottom) — live output from llama.cpp
  • Download panel — appears when downloading files

Use Tab to cycle between panels, and Ctrl+H for panel-specific help.

Searching for Models

To search HuggingFace for models:

  1. Press / to enter search mode
  2. Type your query and press Enter
  3. Results appear sorted by relevance by default
  4. Press Ctrl+S to cycle sort order (Relevance / Downloads / Likes / Trending / Created)
  5. Press Ctrl+B to go back one page, or scroll down at the bottom for more results
  6. Press Ctrl+Shift+R to fetch the model’s README (auto-fetched when navigating results)

Multi-word search: Type space-separated words (e.g. qwen opus) to search with AND logic — all words must match the model name.

Downloading Models

To download a model from HuggingFace:

  1. Press / to enter search mode
  2. Type your query and press Enter
  3. Press l on a result to browse available GGUF files
  4. Select a file and press Enter to download
  5. Press ⌥C (Alt+C) to cancel, or p to pause/resume the download at any time

The download progress is shown in the Download panel with speed (MiB/s), ETA, and status indicators. Before downloading, the app checks available disk space and warns if insufficient. Cancelled downloads automatically remove the temporary file. Once complete, the model appears in the Models panel (in your models directory).

Loading Models

Once a model is downloaded (or has one locally in your models directory):

  1. Select the model in the Models panel
  2. Press l (or Enter) to load it

The loading process shows a progress bar with phases:

  • Server starting
  • Loading model weights
  • Loading metadata
  • Loading tensors (with GPU layer count)
  • Server listening
  • Ready (detected via /health API polling)

Log Panel

The Log panel shows live output from the llama.cpp server. Press Enter to expand to fullscreen, Esc to collapse. Press f to toggle between Following (auto-scroll) and Manual (scroll history) modes.

Other Features

  • Profiles (p) — Quick-switch between saved settings presets
  • Profile Picker (Ctrl+P) — Open a modal to select from built-in or user profiles
  • System Prompt Presets — Named system prompts for different use cases (Coder, Thinker, Mathematician)
  • RPC Workers — Manage distributed inference nodes from Server Settings
  • Benchmark Tuning — Auto-tune model parameters for optimal performance (set Mode to BenchTune)
  • Router Mode — Load multiple models simultaneously
  • Panel Resize — Drag the border between left and right panels, or use Shift+←/→ (20%-80%)
  • Mouse support — Click panels to focus, scroll in logs, README, and settings

Using Serve Mode

You can also start a model directly from the command line:

./build.sh serve --model /path/to/model.gguf

Or with a settings profile:

./build.sh serve --model model.gguf --profile qwen

With a custom backend binary:

./build.sh serve --model model.gguf --backend-binary /opt/rocm/bin/llama-server

Bound to a specific network interface:

./build.sh serve --model model.gguf --host 0.0.0.0

Logs redirected to a file:

./build.sh serve --model model.gguf --log-file /var/log/llm-manager/model.log

API Proxy

Start with an OpenAI-compatible API proxy:

./build.sh serve --model model.gguf --api-port 49222

With authentication:

./build.sh serve --model model.gguf --api-port 49222 --api-key secret

The API proxy forwards requests to the llama-server instance and supports all llama.cpp endpoints including chat completions, embeddings, and more. It supports SSE (Server-Sent Events) streaming for chat completions and other streaming endpoints, and CORS is enabled for all origins.

Usage

Serve Mode

Run a model directly with llama-server and expose an OpenAI-compatible API:

# Serve a model with API proxy on port 49222
./build.sh serve --model /path/to/model.gguf --api-port 49222

# Serve with a settings profile
./build.sh serve --model model.gguf --profile qwen

# Serve with API key authentication (Bearer token)
./build.sh serve --model model.gguf --api-port 49222 --api-key secret

# Serve with API proxy and WebSocket dashboard
./build.sh serve --model model.gguf --api-port 49222 --ws-enable

# Serve with custom dashboard port and auth
./build.sh serve --model model.gguf --api-port 49222 --ws-enable --ws-port 8081 --ws-auth mykey

# Serve with a custom backend binary path
./build.sh serve --model model.gguf --backend-binary /path/to/custom/llama-server

# Serve bound to a specific network interface
./build.sh serve --model model.gguf --host 0.0.0.0

# Redirect logs to a file (useful for systemd)
./build.sh serve --model model.gguf --log-file /var/log/llm-manager/model.log

# Combine options
# Serve with API proxy and WebSocket dashboard on a specific host
./build.sh serve --model model.gguf --api-port 49222 --ws-enable --host 192.168.1.100

# Redirect logs to a file (useful for systemd)
./build.sh serve --model model.gguf --log-file /var/log/llm-manager/model.log

# Combine options
./build.sh serve --model model.gguf --api-port 49222 --ws-enable --host 0.0.0.0 --backend-binary /opt/rocm/bin/llama-server --log-file /var/log/llm-manager/model.log

The serve command automatically resolves the llama-server binary from the backend-specific directory (~/.local/share/llm-manager/bin/llama-server-{cpu,vulkan,rocm}-{version}/) and sets LD_LIBRARY_PATH for shared libraries. If the binary is not found, it downloads it from the llama.cpp GitHub releases. Use --backend-binary to specify a custom binary path, --host to override the network bind address for both the API proxy and WebSocket servers (default is from config), and --log-file to redirect logs to a file instead of stdout.

Model Management

Listing Models

The Models panel shows all .gguf files found in your models directories (recursively). The display name is the relative path from the models directory.

  • f — Filter local models by name (case-insensitive substring match)
  • Esc — Clear active filter and return to full list

Loading and Unloading

  • l or Enter — Load selected model
  • u — Unload model from server
  • Ctrl+D — Delete model (with confirmation)

When a model is loaded, its state changes to Loaded showing the port and PID. You can load multiple models when using Router mode.

Deleting Models

Pressing Ctrl+D prompts for confirmation before moving the model file and its YAML config to ~/.config/llm-manager/unused/. Both can be restored later.

Search mode lets you browse and download GGUF models from HuggingFace:

KeyAction
/Open search input modal — type query and press Enter to search
EnterSelect GGUF files for the highlighted model
EscExit search
Ctrl+SCycle sort order
Ctrl+BGo back one page
Down (at bottom)Load more results
Ctrl+Shift+RFetch and view README for the selected model

Type space-separated words (e.g. qwen opus) to search with AND logic — all words must match the model name. Matching words are highlighted in cyan in the results list.

GGUF File Browser

When viewing GGUF files for a model:

KeyAction
j / kNavigate files
EnterDownload selected file
EscGo back to search results
⌥CCancel download and remove temp file

Download Panel

When one or more files are downloading, the Download panel appears at the bottom of the screen, showing progress, speed (MiB/s), ETA, and status for each download. Before downloading, the app checks available disk space and warns if insufficient. Cancelled downloads automatically remove the temporary file.

KeyAction
j / kNavigate downloads
pPause / Resume selected download
⌥CCancel selected download and remove temp file

Status indicators: Downloading (yellow), Paused (white), Complete (green), Cancelled (red), Error (red).

Loading Models

When you load a model, the application:

  1. Resolves the llama-server binary for the selected backend (CPU/Vulkan/ROCm)
  2. Spawns the server with the current settings
  3. Loads the model via the server’s /models/load API
  4. Polls the server’s /metrics and /health endpoints for status
  5. Displays a progress bar showing loading phases

Loading Phases

The progress bar tracks:

  • Server starting (8%) — llama.cpp binary is launched
  • Loading model (7%) — weights file is being read
  • Loading metadata (7%) — GGUF metadata is parsed
  • Loading tensors (70%) — tensors are loaded and offloaded to GPU
  • Server listening (8%) — HTTP server is ready
  • Complete — model is ready for inference

During tensor loading, the progress bar shows offloaded layers (e.g., 16/32) parsed from llama.cpp’s log output.

Settings

Server Settings

SettingDefaultDescription
Host127.0.0.1Bind address for the llama.cpp server. Use 0.0.0.0 to accept connections from other machines.
Backendauto-detectedAcceleration backend: auto-detected based on GPU (Cuda for NVIDIA, Rocm for AMD, Vulkan for Intel). Options: cpu (CPU-only), vulkan (NVIDIA/AMD/Intel GPU), rocm (AMD GPU), rocm-lemonade (AMD optimized), cuda (NVIDIA CUDA 12.8). Shows the currently selected version.
Threads(physical cores)CPU threads for generation. Set to your physical core count for best performance.
Threads Batch8CPU threads for batch processing (prompt evaluation).
ModeNormalServer mode: Normal (single model), Router (multiple models), Bench (run llama-bench), or BenchTune (parameter auto-tuning).
API EndpointfalseEnable the API proxy server (see Serve Mode).
DashboardfalseWebSocket dashboard server (port 49223). Press Enter to configure (enabled, port, auth key, TLS).
RPC WorkersNoneOpen a dedicated window to manage distributed inference nodes (IP:Port).

Note: The Server Settings panel is hidden when a server is already running. Press F2 to toggle Server Settings only when no server is active.

LLM Settings

The LLM Settings panel has 32 standard fields, 16 expert fields (revealed with Ctrl+X), and 19 ultra fields (hidden even in expert mode), for a total of 67 fields. Arrow keys adjust values; +/- for coarse changes, Left/Right for fine. Toggle fields respond to e or Ctrl+E.

Loading

FieldDefaultDescription
PromptGeneralSystem prompt preset that defines the model’s initial behavior. Presets include General, Coder, Thinker, Mathematician, and any user-defined prompts.
Context32096Context window size in tokens. Must be a power of two. Larger values consume more VRAM and RAM. Models often have a maximum context length (e.g., 32K, 128K).
Keep in memoryfalseLocks model weights in RAM (-mlock) to prevent the OS from swapping them out. Useful when repeatedly loading/unloading models. Increases RAM usage.

GPU Offload

FieldDefaultDescription
GPU LayersAutoNumber of model layers offloaded to GPU memory. Auto lets llama.cpp decide based on available VRAM. Specific sets an exact number. All offloads every layer (-ngl 999).
Flash AttentiontrueEnables Flash Attention 2 for faster inference with lower memory usage. Requires GPU support. Can improve throughput by 20-40%.
KV Cache OffloadtrueOffloads the KV cache to RAM when GPU memory is full. Trade-off: more VRAM available for model weights at the cost of slower cache access.
Cache Type KF16Data type for the key cache. Options: F32 (most accurate, most memory), F16 (default), BF16 (better than F16 for some models), Q8_0, Q5_0, Q5_1, Q4_0, Q4_1, Iq4_NL.
Cache Type VF16Data type for the value cache. Same options as Cache Type K. Using lower precision reduces VRAM but may affect quality.
Active Experts1For Mixture-of-Experts (MoE) models, the number of experts activated per token. Higher values improve quality but increase compute.

Evaluation

FieldDefaultDescription
Eval Batch512Logical maximum batch size for evaluation. Larger batches improve throughput but increase memory usage. Set to the model’s native context length for single-sequence inference.
Unified KVtrueShares KV cache across sequences, reducing memory usage when running multiple prompts. Can cause cache eviction conflicts.

Sampling

FieldDefaultDescription
Seed-1Random seed for reproducible outputs. -1 means random each time. Set to a fixed value for debugging or reproducibility.
Temperature0.8Controls randomness in sampling. Higher values (1.0-2.0) produce more creative/divergent outputs. Lower values (0.0-0.5) produce more deterministic/crisp outputs.
Top-k40Limits sampling to the k most likely next tokens. 0 disables. Smaller values make outputs more focused. Typical: 20-50.
Top-p0.95Nucleus sampling: limits to tokens whose cumulative probability reaches p. 1.0 disables. Lower values (0.8-0.95) reduce randomness.
Min P0.0Minimum probability threshold for sampling. Tokens with probability below this fraction of the highest-probability token are excluded. Useful for controlling extreme outputs.
Max Tokens0Maximum tokens to generate per response. 0 means no limit (until EOS token).

Repetition Control

FieldDefaultDescription
Repetition Penalty1.1Penalizes tokens that have already appeared. Values > 1.0 reduce repetition. Typical: 1.1-1.2.
Rep. Last N64Number of recent tokens to consider for repetition penalty. -1 uses the full context.

Yarn RoPE

FieldDefaultDescription
Yarn RoPEfalseEnables YaRN (Yet another RoPE extensioN) for extending context beyond the model’s training length.
Yarn ParamsOpens a modal to configure three floating-point values: rope_scale (default 1.0, multiplies context), rope_freq_base (default 0.0, overrides the model’s base frequency), rope_freq_scale (default 1.0, scales the frequency). Only digits, ., -, e, and E are accepted.

Tags

FieldDefaultDescription
TagsNonePer-model tags stored in the YAML config. Press Enter to open the tag editor modal. Press t in the LLM Settings panel to open the tag editor.

Backend

FieldDefaultDescription
LLama.cpp VersionLatestShows the currently selected backend version. Press Enter to open the backend version picker.

Expert Mode

Press Ctrl+X to toggle expert mode, which reveals 16 additional parameters:

Loading (expert): NUMA (None/Distribute/Isolate/Numactl)

GPU (expert): Cache Type K (toggle), Cache Type V (toggle), Main GPU, Fit, Active Experts (toggle)

Sampling (expert): Mirostat (Off/1/2), Mirostat LR, Mirostat Ent, Ignore EOS (toggle)

Repetition (expert): Presence Penalty (toggle, -2.0 to 2.0), Frequency Penalty (toggle, -2.0 to 2.0)

Speculative (expert): MTP (toggle), Spec Type, Spec Draft N Max

These fields follow the same navigation and editing rules as standard fields. Arrow keys adjust values, Enter enters direct edit mode, and dirty fields are highlighted in yellow.

Ultra Fields

19 ultra fields are hidden even in expert mode. They include: Typical P, Mirostat, Mirostat LR, Mirostat Ent, Ignore EOS, Samplers, DRY Multiplier, DRY Base, DRY Allowed Length, DRY Penalty Last N, Threads Batch, UBatch Size, Keep, Split Mode, Tensor Split, Main GPU, Fit, Embedding, RPC. These require direct config file editing or profile application.

Cache Type K/V options: F32, F16, BF16, Q8_0, Q5_0, Q5_1, Q4_0, Q4_1, Iq4Nl

Changing Values

Use Left/Right to adjust numeric fields by 1, or Up/Down for larger steps. Toggle fields respond to e or Ctrl+E. Dirty (changed) fields have the name in red and a trailing *. The status bar shows *unsaved* when settings are dirty.

Saving Settings

  • Ctrl+S — Save settings for the selected model
  • Ctrl+R — Reset to defaults
  • e / Ctrl+E — Toggle enabled/disabled (for Keep in memory, Flash Attention, KV Cache Offload, Cache Type K/V, Fit, Unified KV, Max Tokens, Presence/Frequency Penalty, Max Concurrent Pred, MTP, Ignore EOS, Yarn RoPE, Active Experts)
  • Ctrl+X — Toggle expert mode (reveals 16 additional parameters)

Dirty (changed) fields are highlighted with red names and a trailing *.

Keyboard Shortcuts

KeyAction
j / kNavigate up/down
EnterLoad model / Select GGUF files (in search) / Expand log
fFilter local models list / Toggle Follow mode (in Log panel)
EscBack / Exit search / Collapse log / Clear local filter
TabSwitch panels (next)
Shift+TabSwitch panels (previous)
/Open search input modal
lLoad model
uUnload model
tOpen tags editor (in LLM Settings)
AAbout box (license and version info)
Ctrl+DDelete model (with confirmation)
Ctrl+HPanel-specific help
Ctrl+KCmdLine overlay
Ctrl+Alt+KKill llama-server
Ctrl+LFocus Log panel
Ctrl+SSave settings / Cycle search sort (in search)
Ctrl+RReset settings (in LLM Settings)
Ctrl+EToggle field enabled/disabled (in LLM Settings: Cache Type K/V, Max Tokens, Presence/Frequency Penalty, Max Concurrent Pred, Flash Attention, Unified KV, Keep in memory, Fit, MTP, Ignore EOS, Yarn RoPE, Active Experts)
Ctrl+XToggle expert mode (in LLM Settings)
Ctrl+POpen Profile Picker modal (in LLM Settings)
Ctrl+UOpen Dashboard URL modal (copy URL to clipboard)
Ctrl+BBack one page in search
Ctrl+Shift+KKill llama-server (alternative)
Ctrl+Shift+RFetch README for selected model (in search)
g / GJump to top/bottom of log
PageUp / PageDownFast scroll (logs, README, benchmarks)
F1F6Toggle panels (Models, Server, Info, Settings, Active, Log)
F9 / F10 / Ctrl+F10Show all panels
Ctrl+F7Focus Models panel
Ctrl+F8Focus Server Settings panel
Ctrl+F9Focus LLM Settings panel
Shift+← / Shift+→Resize horizontal panel split (20%-80%)
pPause/resume download / Previous benchmark result / Apply profile
nNew preset (System Prompt Presets) / Next benchmark result / Add new worker (RPC)
SpaceToggle selection (RPC workers, benchmark parameters)
Alt+MToggle benchmark mode (RuntimeOnly / Full)
Alt+PEdit benchmark prompt
Alt+NEdit n_predict (max tokens)
Alt+IEdit iterations
Alt+CEdit chat template kwargs / Cancel confirmation
yConfirm destructive action
hCancel confirmation dialog

Log Panel

The Log panel displays live output from the llama.cpp server with level-based coloring.

Log Modes

ModeBehavior
Following (default)Auto-scrolls to the bottom as new entries arrive. Press g to exit.
ManualAllows manual scrolling through log history. Press G to return to bottom.

Press f in the Log panel to toggle between modes. The current mode is shown in the panel title. Expand the log to fullscreen with Enter; collapse with Esc.

RPC Workers

RPC Workers enable distributed inference across multiple machines. Each worker has a name, IP address, and port (default: 50052).

Open the RPC Workers manager from the Server Settings panel. Within the manager:

KeyAction
nAdd new worker
eEdit selected worker
dDelete selected worker
SpaceToggle worker selection
EscClose manager

WebSocket Dashboard

The WebSocket Dashboard provides a real-time visualization of model metrics in any web browser. Access it at http://localhost:49223 (default port).

Configuration

Open the Server Settings panel, navigate to Dashboard, and press Enter to configure:

FieldDescription
EnabledToggle the dashboard on/off
PortServer port (default: 49223)
Auth KeyOptional authentication key
TLS EnabledEnable TLS for secure dashboard access
TLS CertPath to TLS certificate file
TLS KeyPath to TLS private key file

When an auth key is set, clients must include it as a URL parameter: http://localhost:49223?auth=<key>. With TLS enabled, the URL uses https://.

Dashboard Display

The dashboard shows real-time metrics (TPS, prompt TPS, latency, context, VRAM, RAM, CPU) and current inference settings (backend, threads, temperature, sampling parameters, etc.) alongside the full server command line.

Benchmark Tuning

Benchmark Tuning auto-tunes model parameters for optimal performance. Access it by setting the Server Mode to BenchTune.

Two modes are available:

  • RuntimeOnly — Single server, params sent in request body (no server restarts)
  • Full — New server spawned for each parameter combination

Tunable parameters: temperature (0.4–1.0), top_p (0.8–1.0), top_k (40–50), repeat_penalty (1.0–1.2), flash_attn (0/1), threads (4–16), batch_size (512–2048), expert_count (1–4), context_length, spec_type (speculative decoding type), draft_tokens.

Results can be exported as Markdown table, JSON, YAML, or HTML report with summary cards, winner section, impact analysis, and Chart.js charts. Navigate between results with p (previous) and n (next).

System Prompt Presets

Named system prompts for different use cases. Built-in presets: General, Coder, Thinker, Mathematician. User presets are stored as YAML files in ~/.config/llm-manager/presets/<name>.yaml.

Open the System Prompt Presets panel and manage presets:

KeyAction
nCreate new preset
eEdit selected preset
Apply preset
dDelete selected preset (moved to unused_presets/)
⌃SSave preset during edit
EscClose / Cancel edit

GPU Layers Cycling

In the LLM Settings panel, the GPU Layers field cycles through three modes with arrow keys:

ModeBehavior
AutoLets llama.cpp auto-detect based on available VRAM (default)
Specific numberOffloads exactly that many layers to GPU
AllOffloads all layers (equivalent to -ngl 999)

Arrow keys cycle: Auto12 → … → NAllAuto. Pressing Enter from a specific number opens an edit buffer for direct input. The -ngl flag is only added for Specific and All modes.

Tags

Per-model tags can be edited in the LLM Settings panel. The Tags field opens an edit modal where you can add, remove, or modify tags associated with the model. Tags are stored in the per-model YAML config.

MTP (Multi-Token Prediction)

MTP is an experimental feature that uses a draft model to predict multiple tokens in parallel, improving inference speed. When a model with MTP architecture is selected, the app automatically detects it and enables the --draft-mtp flag. The number of draft tokens is read from the GGUF metadata and displayed in the Model Info panel.

GGUF Metadata

The Model Info panel shows parsed GGUF metadata including: architecture, layers, hidden size, context length, attention heads, KV heads, domain, capabilities, quantization, parameters (e.g., “7B”, “405B”), tokenizer type, vocabulary size, and max context for VRAM. Metadata is parsed once and cached (debounced by file mtime).

Active Model Metrics

The Active Model panel shows real-time metrics:

MetricDescription
TPSTokens per second (generation speed)
Prompt TPSPrompt processing speed
Gen TPSGeneration tokens per second (separate from prompt TPS)
Context usageProgress bar showing ctx_used/ctx_max
CPU%CPU usage percentage
RAMRAM usage
VRAMGPU memory used/total
Total VRAMTotal GPU memory used (including non-model allocations)
LatencyMilliseconds per token (generation and prompt)
TokensTotal decoded tokens generated

The panel also shows benchmarking state with progress bar and current parameter display when running BenchTune.

Backend Selection

Multiple backends are supported via the llama.cpp server:

BackendSourceDescription
CPUggml-org/llama.cppCPU-only inference (standard)
Vulkanggml-org/llama.cppGPU via Vulkan (Universal: AMD/NVIDIA/Intel)
ROCmggml-org/llama.cppGPU via ROCm (AMD Native)
ROCm Lemonadelemonade-sdk/llamacpp-rocmGPU via ROCm (AMD Optimized, auto-detects GFX architecture)
CUDAai-dock/llama.cpp-cudaGPU via CUDA (NVIDIA Native, CUDA 12.8)
CPU ARM64ggml-org/llama.cppCPU-only for ARM64 Linux
CPU Windowsggml-org/llama.cppCPU-only for Windows
Vulkan Windowsggml-org/llama.cppVulkan for Windows
CUDA Windows 12.4ai-dock/llama.cpp-cudaCUDA 12.4 for Windows
CUDA Windows 13.1ai-dock/llama.cpp-cudaCUDA 13.1 for Windows
HIP Windowsggml-org/llama.cppHIP (ROCm) for Windows
CPU macOS ARM64ggml-org/llama.cppCPU-only for macOS Apple Silicon
CPU macOS x64ggml-org/llama.cppCPU-only for macOS Intel

Each backend has its own independently configurable llama.cpp version. Switching versions is instant — no re-download.

Server Modes

ModeDescription
NormalSingle model via CLI (default)
RouterMultiple models via API, loads via /load endpoint
BenchGPU benchmarking mode (runs llama-bench)
BenchTuneParameter auto-tuning mode

VRAM Estimate

The app computes a detailed VRAM estimate based on model size, GPU layers, KV cache, activation overhead, and fixed overhead. The formula accounts for GQA ratio, FlashAttention (0.5× KV cache reduction), unified KV cache, KV cache quantization bytes, activation overhead (8× multiplier), and fixed overhead (3.8% of max VRAM or 500 MiB fallback). The estimate is shown in the LLM Settings title (e.g., “VRAM ~= 8.2 GB”).

Confirmation Dialogs

The app uses confirmation dialogs for destructive actions:

  • Exit — warns about loaded models
  • Delete — confirms irreversible deletion
  • Reset — confirms resetting all LLM settings
  • Unload — confirms unloading a model via API
  • DeleteBackend — confirms deleting a backend binary version from disk

Mouse Support

Mouse interactions are supported: clicking on panels to focus them, and scrolling in the log panel, README panel, settings, profiles, and presets panels.

Panel Resize

The horizontal split between left panels (Models + Info) and right panels (Settings/README) can be resized:

MethodDescription
Drag borderClick and drag the vertical border between left and right panels
Scroll on borderScroll mouse wheel while hovering over the border (1% steps)
KeyboardShift+← / Shift+→ to adjust by 1% (range: 20%-80%)

The current split percentage is shown in the status bar (e.g., │ 55%). While actively resizing, the indicator shows │ 55% ← resize →.

CmdLine Overlay

Press Ctrl+K to view the full command line that would be executed to start the llama.cpp server. This shows the binary path, model path, and all parameters.

Press e in the overlay to export the command to /tmp/test_llamaserver.sh.

Server Status

The status bar shows the current server status at the top:

  • Running: ● 9090 Normal (green dot with port and mode)
  • Stopped: ○ Server (gray)

Press Ctrl+Alt+K to kill the running llama-server. When stopped, all loaded models are reset to Available state.

Profiles

Profiles are named presets of LLM settings. Built-in profiles include Qwen, Gemma, Llama, Mistral, and Phi. User profiles are stored as YAML files in ~/.config/llm-manager/profiles/<name>.yaml.

  • p — Apply a profile to current settings
  • Ctrl+S — Save current settings as a new profile (in the Profiles panel)
  • Ctrl+D — Delete a user-defined profile (moved to unused_profiles/)

Configuration

Directory Layout

llm-manager uses XDG directories for config and data:

~/.config/llm-manager/          # Config directory
├── config.yaml                 # Global settings
├── models/                     # Per-model YAML configs
│   └── qwen2.5-7b.yaml
├── profiles/                   # Per-profile YAML configs
│   └── my-profile.yaml
├── presets/                    # Per-preset YAML configs
│   └── custom-preset.yaml
├── unused/                     # Deleted model configs
├── unused_profiles/            # Deleted profiles
└── unused_presets/             # Deleted presets

~/.local/share/llm-manager/     # Data directory
├── models/                     # GGUF model files
│   └── qwen2.5-7b.Q4_K_M.gguf
└── bin/                        # llama-server binaries
    └── llama-server-cpu-...

Per-model configs are named <model_name>.yaml where model_name is the GGUF filename without the .gguf extension. Deleted configs are moved to unused/ subdirectories (recoverable).

Config File

The main config file is ~/.config/llm-manager/config.yaml. It is created automatically on first run with sensible defaults.

models_dirs:
  - ~/.local/share/llm-manager/models
llama_server: llama-server
default:
  context_length: 32096
  threads: <physical cores>
  threads_batch: 8
  batch_size: 512
  temperature: 0.8
  # ... more settings

You can specify a custom config path with --config:

cargo run -- --config /path/to/config.yaml

Default Parameters

ParameterDefaultDescription
context_length32096Context window size in tokens
threads(physical cores)CPU threads for generation
threads_batch8CPU threads for batch processing
batch_size512Logical maximum batch size
ubatch_size512Physical maximum batch size
keep0Keep N tokens from initial prompt
mlockfalseLock model weights in RAM
mmaptrueMemory-map the model
kv_cache_offloadtrueOffload KV cache to RAM
flash_attntrueEnable Flash Attention
temperature0.8Sampling temperature
top_k40Top-k sampling
top_p0.95Top-p sampling
repeat_penalty1.1Repetition penalty
backendauto-detectedDefault backend (auto-detected: Cuda for NVIDIA, Rocm for AMD, Vulkan for Intel; falls back to cpu). Options: cpu, vulkan, rocm, rocm-lemonade, cuda

Profiles

Profiles are named presets of settings. The built-in profiles are:

ProfileDescriptionKey Settings
QwenOptimized for Qwen modelstemp: 0.6, top-k: 20, context: 32768
GemmaOptimized for Gemma modelstemp: 0.8, min-p: 0.1, typical-p: 0.9
LlamaOptimized for Llama modelstemp: 0.7, repeat-penalty: 1.1
MistralOptimized for Mistral modelstemp: 0.7, top-k: 50
PhiOptimized for Phi modelstemp: 0.7, top-k: 50

User-defined profiles are stored as individual YAML files in ~/.config/llm-manager/profiles/<name>.yaml. Built-in profiles are auto-merged on load.

System Prompt Presets

System prompt presets define the initial system prompt. Built-in presets:

PresetDescription
General“You are a helpful assistant.”
CoderExpert software developer
ThinkerAnalytical and thoughtful
MathematicianExpert in mathematics

User-defined presets are stored as individual YAML files in ~/.config/llm-manager/presets/<name>.yaml. Built-in presets are auto-merged on load.

Backend Binaries

llama-server binaries are stored in ~/.local/share/llm-manager/bin/ with versioned directories:

~/.local/share/llm-manager/bin/
├── llama-server-cpu-{version}/llama-server
├── llama-server-vulkan-{version}/llama-server
├── llama-server-rocm-{version}/llama-server
├── llama-server-rocm-lemonade-{version}/llama-server
└── llama-server-cuda-{version}/llama-server

Binaries are downloaded from specialized repositories on first use:

Switching versions is instant — no re-download.

Per-backend Version Config

llama_cpp_version_cpu: null
llama_cpp_version_vulkan: null
llama_cpp_version_rocm: null
llama_cpp_version_rocm_lemonade: null
llama_cpp_version_cuda: null

Platform-specific backend variants (e.g. CpuArm64, CpuWindows, CpuMacosArm64) are handled through the Backend enum and platform field, not through separate version config keys. Each backend has its own independently configurable version.

Setting to null uses the latest release. Specific versions can be set via the version picker in LLM Settings. These selections are automatically persisted to your configuration and remembered across restarts.

Asset Names

Assets are selected based on the detected platform. Linux examples:

  • CPU (x64): llama-{tag}-bin-ubuntu-x64.tar.gz
  • CPU (ARM64): llama-{tag}-bin-ubuntu-arm64.tar.gz
  • Vulkan: llama-{tag}-bin-ubuntu-vulkan-x64.tar.gz
  • ROCm: llama-{tag}-bin-ubuntu-rocm-7.2-x64.tar.gz
  • ROCm Lemonade: llama-{tag}-ubuntu-rocm-{gfx}-x64.zip (auto-detects GPU architecture)
  • CUDA: llama.cpp-{tag}-cuda-12.8-amd64.tar.gz

Windows assets use *.zip (e.g. llama-{tag}-bin-win-cpu-x64.zip). macOS assets use *-macos-arm64.tar.gz or *-macos-x64.tar.gz.

Serve Mode

You can start a model directly from the command line without the TUI:

./build.sh serve --model /path/to/model.gguf

Options

OptionDescription
--modelPath to the GGUF model file
--profileApply a settings profile (e.g., qwen, llama)
--configPath to config file
--api-portStart API proxy on given port
--api-keyAPI key for Bearer token authentication
--ws-enableEnable WebSocket dashboard server
--ws-portPort for WebSocket dashboard server
--ws-authAuth key for WebSocket dashboard access
--hostBind address for the server (e.g., 0.0.0.0)
--backend-binaryPath to a custom llama-server binary
--log-fileLog file path (default: stdout)
--tls-enableEnable TLS for WebSocket dashboard
--tls-certPath to TLS certificate file
--tls-keyPath to TLS private key file
--threadsCPU threads
--contextContext length
--gpu-layersNumber of GPU layers

API Proxy

The API proxy forwards requests to the llama.cpp server and provides OpenAI-compatible and Anthropic-compatible endpoints. It supports SSE (Server-Sent Events) streaming for chat completions and other streaming endpoints, and CORS is enabled for all origins with GET/POST/PUT/DELETE/OPTIONS methods. When --api-key is set, all requests require Authorization: Bearer <key>.

API Endpoints

The API proxy explicitly handles the following endpoints, while all other paths are automatically proxied to the llama-server instance:

EndpointMethodDescription
/healthGETHealth check
/metricsGETPrometheus metrics
/v1/chat/completionsPOSTChat completions (OpenAI)
/v1/completionsPOSTCompletions (OpenAI)
/v1/embeddingsPOSTEmbeddings
/v1/modelsGETList models
/api/statusGETServer status (pid, uptime, loaded models)

The following endpoints are automatically proxied to llama-server (not explicitly handled):

EndpointMethodDescription
/v1/responsesPOSTResponses (Anthropic)
/v1/messagesPOSTMessages (Anthropic)
/v1/messages/count_tokensPOSTCount tokens (Anthropic)
/completionPOSTLegacy completion
/infillPOSTCode completion (FIM)
/rerankingPOSTRe-ranking
/tokenizePOSTTokenize text
/detokenizePOSTDetokenize tokens
/apply-templatePOSTApply chat template
/v1/healthGETHealth check (alias)
/propsGET/POSTGet/set server properties
/slotsGETSlot monitoring
/lora-adaptersGET/POSTList/load LoRA adapters
/models/loadPOSTLoad a model (router mode)
/models/unloadPOSTUnload a model (router mode)

Model Overrides

Settings can be saved per-model. Overrides are stored as individual YAML files in ~/.config/llm-manager/models/<name>.yaml (where name is the GGUF filename without .gguf). When a model is loaded, its override settings are merged into the defaults. Deleted configs are moved to ~/.config/llm-manager/unused/ for recovery.

RPC Workers

You can manage a list of remote llama-rpc-server nodes for distributed inference. These are stored in the rpc_workers list in the config:

rpc_workers:
  - selected: true
    name: "Worker 1"
    ip: "192.168.1.10"
    port: 50052

Workers can be managed via the RPC Workers window in the Server Settings panel. Selected workers are combined into the --rpc flag when starting the server.

WebSocket Dashboard

The WebSocket Dashboard provides a real-time visualization of model metrics and settings via a web browser.

Accessing the Dashboard

The dashboard runs as a built-in HTTP server on port 49223 by default. Open it in your browser:

http://localhost:49223

Enabling in Serve Mode

The dashboard can be enabled in serve mode using the --ws-enable flag:

./build.sh serve --model model.gguf --api-port 49222 --ws-enable

Customize the dashboard port and authentication:

./build.sh serve --model model.gguf --api-port 49222 --ws-enable --ws-port 8081 --ws-auth mykey

Customize the host and use a specific backend binary:

./build.sh serve --model model.gguf --api-port 49222 --ws-enable --host 0.0.0.0 --backend-binary /opt/rocm/bin/llama-server

The --host option controls the bind address for both the API proxy server and the WebSocket dashboard server, ensuring they use the same network interface. The default is 127.0.0.1 (from config).

Enabling in TUI Mode

The dashboard can also be enabled from the TUI:

  1. Open the Server Settings panel (F2)
  2. Navigate to Dashboard and press Enter
  3. Configure:
    • Enabled — toggle on/off
    • Port — server port (default: 49223)
    • Auth Key — optional authentication (see below)
  4. Press Enter to save, Esc to close

Dashboard Overview

The dashboard displays real-time metrics in a card-based layout:

Dashboard

Metrics Cards

MetricDescription
StatusCurrent model state (loaded / unloaded / loading)
Generation SpeedTokens per second (TPS) for text generation
Prompt SpeedTokens per second for prompt processing
LatencyMilliseconds per token
TokensTokens generated with progress bar (decoded_tokens / max_tokens, or ‘∞’ if not configured)
VRAMGPU memory used/total with color-coded progress bar (green <50%, yellow 50-80%, red >80%)
RAMSystem memory usage
CPUCPU usage percentage

Settings Panel

Below the metrics, the dashboard shows a grid of current inference settings:

SettingDescription
Backend & Versionllama.cpp backend and version
Threads / Threads BatchCPU thread configuration
Context / Batch Size / Ubatch SizeModel execution parameters
Temperature / Top-k / Top-p / Min P / Typical PSampling parameters
SeedRandom seed for reproducibility
Repeat Penalty / Repeat Last NRepetition control
Presence Penalty / Frequency PenaltyAdvanced repetition control
Flash Attention / KV Cache OffloadPerformance optimizations
Cache Type K / Cache Type VKV cache quantization
Unified KV / Mlock / MmapMemory management
Expert Count / GPU LayersModel-specific settings
SamplersSampler order string
Spec Type / Draft TokensSpeculative decoding configuration
Yarn RoPE / Yarn ParamsContext extension parameters
TagsPer-model tags

Server Command

The full llama-server command line is displayed at the bottom of the dashboard, showing the exact invocation with all parameters. This is useful for debugging and inspecting the exact configuration being used.

Configuration

To enable and configure the dashboard:

  1. Open the Server Settings panel (F2)
  2. Navigate to Dashboard and press Enter
  3. Configure:
    • Enabled — toggle on/off
    • Port — server port (default: 49223)
    • Auth Key — optional authentication (see below)
  4. Press Enter to save, Esc to close

Authentication

When an auth key is configured, clients must include it as a query parameter:

http://localhost:49223?auth=mysecretkey

Connection Status

The dashboard shows a connection indicator at the top of the page:

  • Green pulsing dot — Connected via WebSocket
  • Red dot — Disconnected (auto-reconnects every 2 seconds)

Architecture

The dashboard server is built with axum and tokio. It:

  1. Creates a broadcast::channel(64) for metrics distribution
  2. Spawns the server on the configured port
  3. Each metrics update is sent to the broadcast channel
  4. WebSocket clients subscribe and receive real-time updates
  5. The HTML dashboard (embedded in the binary) connects via WebSocket and renders the metrics

The server is started/stopped automatically when you toggle the Dashboard setting in Server Settings.

Architecture

LLM Manager is a Rust application built on ratatui and crossterm, using tokio for async operations. The codebase is organized into several modules:

src/
├── main.rs          # Entry point, event loop, model discovery, metrics polling
├── config.rs        # Config loading/saving, YAML-based, profiles, presets
├── models.rs        # Domain types (SearchResult, DownloadState, ModelSettings, etc.)
├── serve.rs         # Standalone serve mode CLI (--model, --profile, --api-port, --api-key)
├── serve_api.rs     # Axum-based API proxy server for serve mode
├── backend/
│   ├── hub.rs       # HuggingFace API: search, list files, download
│   ├── server.rs    # llama.cpp server spawning (resolve_backend_binary, spawn_server)
│   ├── benchmark.rs # Benchmark tuning system (RuntimeOnly and Full modes)
│   ├── hardware.rs  # GPU detection (AMD/NVIDIA/Intel), platform detection
│   ├── tls.rs       # TLS certificate generation for secure connections
│   └── ws_server.rs # WebSocket metrics server
├── tui/
│   ├── mod.rs       # Module declaration, format_size/format_number helpers
│   ├── app/         # App state (types.rs, async_ops.rs, sync_ops.rs, state.rs, metadata.rs,
│   │                  # profiles.rs, pickers.rs, panels.rs, help.rs)
│   ├── event/       # Keyboard/mouse event handling (benches.rs, helpers.rs, key.rs, mouse.rs,
│   │                  # panel/, readme.rs)
│   ├── render.rs    # Top-level render dispatcher (hints.rs, overlays.rs, status.rs)
│   └── panel/       # Individual panel render functions
│       ├── mod.rs
│       ├── models.rs      # Left panel: model list / search / download
│       ├── info.rs        # GGUF metadata rendering
│       ├── tabbed.rs      # Right panel: Model Info / Settings tabs
│       ├── settings.rs
│       ├── log.rs
│       ├── help.rs
│       ├── active.rs      # Active model metrics panel
│       ├── about.rs       # About box
│       ├── readme.rs      # README rendering
│       ├── rpc_workers.rs # RPC workers manager
│       ├── system_prompt_presets.rs # System prompt presets
│       ├── profiles.rs    # Profiles manager
│       ├── downloads.rs   # Download progress panel (rendered inline, not a separate module)

## App State Machine

The `App` struct in `src/tui/app.rs` holds all application state. The main state machine is controlled by `models_mode`:

```rust
pub enum ModelsMode {
    List,       // Local model list
    Search { query, results, sort_by, show_readme, loading, has_more, page },
    Files { model_id, files, selected_idx, previous_query, previous_results, selected_result },
    BenchTune,  // Benchmark tuning mode showing results table
}

Each mode controls rendering in render.rs and key handling in event.rs. The GlobalMode enum handles overlays that appear above all panels:

#![allow(unused)]
fn main() {
pub enum GlobalMode {
    Normal,
    CmdLine { cmd_line: String },
    HostPicker { entries: Vec<(String, String)>, selected: usize },
    BackendPicker { entries: Vec<(Backend, Option<String>)>, selected: usize },
    Confirmation { selected: bool, kind: ConfirmationKind },
    RpcManager,
    About,
    MaxConcurrentPicker { value: String },
    SpecTypePicker { entries: Vec<String>, selected: usize },
    YarnRoPESettings { scale: String, freq_base: String, freq_scale: String, selected_field: i32, editing: bool, edit_buffer: String, edit_cursor_pos: usize },
    BenchTuneSetup { config, selected_idx, bench_mode_selection, editing_prompt, editing_kwargs },
    PromptPicker { entries, selected, editing, edit_buffer, edit_cursor_pos, confirm_delete },
    ProfilePicker { entries, selected, profiles },
    DashboardPicker { enabled, port, auth_key, tls_enabled, tls_cert, tls_key, selected_field, editing, edit_buffer, edit_cursor_pos },
    DashboardUrl { host, port, auth_key, ws_enabled },
    SearchInput { buffer: String, cursor_pos: usize },
}
}

Local Model Filter

The application supports real-time filtering of the local models list. Triggered by the f key when the Models panel is focused, it allows users to quickly narrow down large collections using case-insensitive substring matching.

Model Discovery

The discover_models() function in main.rs recursively scans the models directory for .gguf files:

#![allow(unused)]
fn main() {
fn discover_models(dir: &Path) -> Vec<DiscoveredModel>
}

Each DiscoveredModel contains the file path, name, size, and display name (relative path from models directory). Discovery runs in a blocking task on startup.

Download System

Downloads run in a spawned tokio task with progress flowing through a broadcast channel:

  1. User selects a file and presses Enter
  2. pending_download is set with (model_id, filename, url, file_size)
  3. Before starting, the app checks available disk space via hub::get_free_space_bytes() and warns if insufficient
  4. A tokio task calls hub::download_file() with an Arc<AtomicBool> cancel token and Arc<AtomicU8> state
  5. Progress updates flow through download_txdownload_rx
  6. The main loop polls download_rx each iteration and updates the Download panel
  7. Pressing ⌥C (Alt+C) cancels the download and removes the temporary file; p pauses/resumes it

The download loop checks the state atomically each iteration: 1 = downloading, 2 = paused (sleeps 100ms and retries), 3 = cancelled (removes temp file, returns error). Each DownloadState tracks bytes downloaded, speed, ETA, destination path, and status (Downloading/Paused/Complete/Cancelled/Error).

Server Spawning

When a model is loaded, spawn_server() in backend/server.rs:

  1. Resolves the llama-server binary using resolve_backend_binary()
  2. If the binary doesn’t exist, downloads and extracts it from GitHub releases
  3. Spawns the process with the model path and all settings
  4. Sets up a log channel (server_log_rx) for parsing output

The main loop polls server_log_rx and parses log messages for:

  • Loading phases (model, metadata, tensors) from log messages
  • Error detection (OOM, crash) from log messages

Metrics (TPS, VRAM, context) are now collected exclusively from the /metrics and /health API endpoints rather than log parsing.

Metrics & Logging

Metrics are collected from the /metrics and /health endpoints, which provide accurate real-time data. Loading completion is detected via the /health endpoint (polling for "status": "ok" and non-empty slots).

Each log entry is stored in log_entries: VecDeque<LogEntry> with a max of 500 entries. The log panel supports scrolling, expansion (Enter/Esc), and two modes: Following (auto-scroll to bottom) and Manual (free scroll). Press f to toggle modes.

Search

Search uses the HuggingFace API with &filter=gguf to only return GGUF models:

#![allow(unused)]
fn main() {
pub async fn search_models(query: &str, limit: u32, offset: u32) -> Result<(Vec<SearchResult>, bool)>
}

A post-filter checks that the model_id contains the search query (case-insensitive), since the HF API does full-text search across descriptions/tags and can return unrelated models.

Multi-word search: Space-separated words are split and each word must match the model name (AND logic). Matching words are highlighted in cyan in the results list.

  • Default: 70 results per page (max 200)
  • Pagination: Ctrl+B goes back, Down at bottom loads more
  • Sort order cycles: Relevance → Downloads → Likes → Trending → Created
  • README fetching: Ctrl+Shift+R downloads and renders the model’s README

VRAM Estimation

The estimate_vram_mib() function in src/models.rs estimates VRAM usage:

total = model_vram + kv_cache + activation + fixed_overhead + 550

Where:

  • model_vram — proportional to GPU layers loaded
  • kv_cache2 * n_layer * n_ctx * n_embd_kv * sizeof(type) with GQA ratio and FlashAttention factor
  • activation — proportional to batch size and hidden size
  • fixed_overhead — 3.8% of max VRAM (or 500 MiB if unknown)

Loading Progress

Model loading phases are detected from llama.cpp log output:

PhaseLog patternWeight
ServerStarting(implicit)8%
LoadingModel“LLAMA_MODEL_LOADER” / “LOADING MODEL”7%
LoadingMeta“LOADED META” / “META DATA”7%
LoadingTensors“LOAD_TENSORS:”70%
ServerListening“SERVER LISTENING”8%
CompleteDetected via /health API polling

During tensor loading, the progress bar refines using layer counts parsed from “offloaded X/Y layers” log messages.

RPC Workers

Remote workers for distributed inference are stored in the config as Vec<RpcWorker>. Each worker has a name, IP address, and port (default: 50052). The RpcManager global mode provides a dedicated window for managing workers: add (n), edit (e), delete (d), toggle selection (Space).

Benchmark Tuning

The benchmark system (src/backend/benchmark.rs) supports two modes:

  • RuntimeOnly: Single server, params sent in request body (no server restarts)
  • Full: New server spawned for each parameter combination

Key types:

  • BenchTuneConfig: Model path, iterations, prompt, params to test, duration, mode
  • BenchTuneParam: name, min, max, step, enabled
  • BenchTuneResult: params, metrics (prompt_tps, generation_tps, combined_tps, latency_per_token, first_token_time), outputs, per-iteration metrics
  • BenchTuneStatus: Running (with progress), Completed (with stats), PartiallyCompleted (with stats), Cancelled (with stats)

Error Handling

Errors are detected from log patterns:

  • OOM: “OUTOFDEVICEMEMORY” / “OUT OF MEMORY”
  • General error: “ERROR”, “FAILED TO LOAD”, “EXCEPTION”

Server exit is detected via a dedicated channel (not log parsing). On error, affected models are marked as Failed with the error message.

Confirmation Dialogs

Destructive actions trigger a GlobalMode::Confirmation overlay with ConfirmationKind variants: Exit, Reset, Delete, Unload, DeleteBackend. The user confirms with Enter or cancels with Esc.

API Reference

The full Rust API reference is available at docs.rs/llm-manager.

Generate it locally with:

cargo doc --open

Public Types

Core Types

TypeModuleDescription
DiscoveredModelmodelsA discovered .gguf file with path, name, size, and display name
ModelSettingsmodelsAll settings for loading a model via llama.cpp server (70+ fields)
ModelStatemodelsState of a model: Available, Loading, Loaded, or Failed
SearchResultmodelsA model found via HuggingFace search
DownloadStatemodelsDownload progress tracking with cancellation support
GgufMetadatamodelsParsed GGUF metadata (layers, hidden size, context, etc.)
ServerMetricsmodelsMetrics from the llama.cpp server (TPS, VRAM, CPU, context)
WsMetricsmodelsWebSocket-friendly metrics snapshot (serializable, includes settings and command display)
LogEntryconfigA single log entry with timestamp, level, and message

Enums

TypeModuleDescription
BackendmodelsAcceleration backend: Cpu, Vulkan, Rocm, RocmLemonade, Cuda, CpuArm64, CpuWindows, VulkanWindows, CudaWindows12_4, CudaWindows13_1, HipWindows, CpuMacosArm64, CpuMacosX64
ServerModemodelsServer operating mode: Normal (single model), Router (multiple), Bench (GPU benchmarking), or BenchTune (parameter auto-tuning)
GpuLayersModemodelsGPU offloading: Auto, Specific(n), or All
SearchSortmodelsSearch result sort order: Relevance, Downloads, Likes, Trending, Created
CacheTypemodelsMain KV cache data type: F16, BF16, Fq8_0, Fq4_1
CacheQuantTypemodelsKV cache data type for quantization (F32, F16, BF16, Q8_0, Q5_0, Q5_1, Q4_0, Q4_1, Iq4Nl)
CacheTypeK / CacheTypeVmodelsType aliases for CacheQuantType (used for keys and values)
SplitModemodelsMulti-GPU split mode: None, Layer, Row, Tensor
NumModemodelsNUMA optimization: None, Distribute, Isolate, Numactl
RopeScalingmodelsRoPE frequency scaling: None, Linear, Yarn
MirostatmodelsMirostat version: Off, Mirostat, Mirostat2
LoadingPhaseappPhase of model loading (used internally by the TUI)
LoadProgressmodelsLoad progress with layers_total, layers_loaded, tensors_loaded
SamplersmodelsSemicolon-separated sampler order string
BenchTuneModebenchmarkBenchmark mode: RuntimeOnly or Full
BenchTuneStatusbenchmarkStatus: Running, Completed, PartiallyCompleted, Cancelled, or Error

Main Modules

backend::hub

HuggingFace API integration.

#![allow(unused)]
fn main() {
/// Search models on HuggingFace.
pub async fn search_models(query: &str, limit: u32, offset: u32) -> Result<(Vec<SearchResult>, usize)>

/// List all GGUF files for a model.
pub async fn list_gguf_files(model_id: &str) -> Result<Vec<(String, u64, String)>>

/// Fetch the README for a model from HuggingFace.
pub async fn fetch_readme(model_id: &str) -> Result<String>

/// Download a file with progress tracking.
pub async fn download_file(
    model_id: &str,
    filename: &str,
    url: &str,
    dest: &Path,
    progress: &mut DownloadState,
    download_state: Arc<AtomicU8>,
    tx: broadcast::Sender<DownloadState>,
) -> Result<()>

/// Get available free disk space in bytes for a given path.
pub fn get_free_space_bytes(path: &Path) -> u64

/// Resolve the llama-server binary path for a given backend.
/// Downloads the binary from GitHub releases if not already cached.
pub async fn resolve_backend_binary(
    backend: Backend,
    tag: Option<&str>,
    log_tx: Option<mpsc::Sender<String>>,
    progress_tx: Option<tokio::sync::broadcast::Sender<crate::models::DownloadState>>,
) -> Result<PathBuf>
}

backend::server

llama.cpp server process management.

#![allow(unused)]
fn main() {
/// Manages a single llama.cpp server process.
pub struct ServerHandle {
    pub port: u16,
    pub host: String,
    pub pid: u32,
    pub kill_tx: mpsc::Sender<()>,
}

/// Build the full llama-server command line from settings.
pub fn build_server_cmd(
    binary: &Path,
    model: Option<&DiscoveredModel>,
    settings: &ModelSettings,
    config: &Config,
    server_mode: ServerMode,
    router_max_models: u32,
) -> (Command, String)

/// Request to spawn a llama.cpp server process.
pub struct SpawnServerRequest<'a> {
    pub config: &'a Config,
    pub model: Option<&'a DiscoveredModel>,
    pub settings: &'a ModelSettings,
    pub log_tx: mpsc::Sender<String>,
    pub progress_tx: Option<tokio::sync::broadcast::Sender<DownloadState>>,
    pub server_mode: ServerMode,
    pub router_max_models: u32,
    pub exit_tx: mpsc::Sender<()>,
}

/// Spawn a llama.cpp server process.
pub async fn spawn_server(request: SpawnServerRequest) -> Result<(ServerHandle, String), String>

/// Check if the server is healthy and responsive.
pub async fn check_health(host: &str, port: u16) -> bool

/// Kill a running server.
pub async fn kill_server(handle: ServerHandle) -> Result<(), String>

/// Poll metrics from the server.
pub async fn get_metrics(
    host: &str,
    port: u16,
    model_name: Option<&str>,
    pid: Option<u32>,
) -> Result<ServerMetrics, String>

/// Load a model via the llama-server Router API.
pub async fn load_model(host: &str, port: u16, model_id: &str, model_path: Option<&str>) -> Result<(), String>

/// List all models and their status from the llama-server Router API.
pub async fn list_models(host: &str, port: u16) -> Result<Vec<(String, String, Option<String>)>, String>

/// Unload a model via the llama-server Router API.
pub async fn unload_model(host: &str, port: u16, model_id: &str, model_path: Option<&str>) -> Result<(), String>
}

config

Configuration loading and saving.

#![allow(unused)]
fn main() {
/// Global configuration.
pub struct Config {
    pub models_dirs: Vec<PathBuf>,
    pub llama_server: PathBuf,
    pub default: DefaultParams,
    pub model_overrides: ModelConfigStore,
    pub profiles: ProfileStore,
    pub system_prompt_presets: PresetStore,
    pub rpc_workers: Vec<RpcWorker>,
    pub ws_server: WsServer,
    pub search_limit: u32,
}

/// A remote RPC worker for distributed inference.
pub struct RpcWorker {
    pub selected: bool,
    pub name: String,
    pub ip: String,
    pub port: u16,
}

/// A named profile of settings.
pub struct Profile {
    pub name: String,
    pub description: String,
    pub settings: ModelOverride,
}

/// A named system prompt preset.
pub struct SystemPromptPreset {
    pub name: String,
    pub description: String,
    pub content: String,
}

/// Per-model settings override (optional fields).
pub struct ModelOverride {
    pub context_length: Option<u32>,
    pub threads: Option<u32>,
    pub temperature: Option<f32>,
    // ... 50+ optional fields
}

/// Built-in profiles with sensible defaults.
pub fn builtin_profiles() -> Vec<Profile>

/// Built-in system prompt presets.
pub fn builtin_system_prompt_presets() -> Vec<SystemPromptPreset>
}

backend::ws_server

WebSocket dashboard server.

#![allow(unused)]
fn main() {
pub struct WsAppState {
    pub metrics_rx: Arc<broadcast::Receiver<WsMetrics>>,
    pub auth_key: Option<String>,
}

pub async fn start_ws_server(
    port: u16,
    metrics_rx: Arc<broadcast::Receiver<WsMetrics>>,
    auth_key: Option<String>,
    tls_config: Option<axum_server::tls_rustls::RustlsConfig>,
    host: String,
) -> Result<JoinHandle<()>>

pub fn stop_ws_server(handle: JoinHandle<()>)
}

backend::benchmark

Benchmark tuning system.

#![allow(unused)]
fn main() {
/// Configuration for a benchmark run.
pub struct BenchTuneConfig {
    pub model_path: PathBuf,
    pub num_iterations: u32,
    pub prompt: String,
    pub params: Vec<BenchTuneParam>,
    pub duration: Duration,
    pub mode: BenchTuneMode,
    pub n_predict: usize,
    pub chat_template_kwargs: Option<String>,
}

/// A tunable parameter for benchmarking.
pub struct BenchTuneParam {
    pub name: String,
    pub min: f64,
    pub max: f64,
    pub step: f64,
    pub enabled: bool,
}

/// Actual parameter values for a benchmark run.
pub struct BenchTuneParamValue {
    pub temperature: Option<f64>,
    pub top_p: Option<f64>,
    pub top_k: Option<i64>,
    pub repeat_penalty: Option<f64>,
    pub context_length: Option<u32>,
    pub batch_size: Option<u32>,
    pub flash_attn: Option<bool>,
    pub threads: Option<u32>,
    pub expert_count: Option<i32>,
    pub spec_type: Option<String>,
    pub draft_tokens: Option<u32>,
}

/// Results from a benchmark run.
pub struct BenchTuneResult {
    pub params: BenchTuneParamValue,
    pub metrics: BenchTuneMetrics,
    pub outputs: Vec<String>,
    pub per_iteration_metrics: Vec<BenchTuneMetrics>,
    pub base_settings: Option<ModelSettings>,
}

/// Metrics from a benchmark run.
pub struct BenchTuneMetrics {
    pub prompt_tps: f64,
    pub generation_tps: f64,
    pub combined_tps: f64,
    pub latency_per_token: f64,
    pub first_token_time: f64,
}
}

models

Domain types and utilities.

#![allow(unused)]
fn main() {
/// Estimate VRAM usage (in MiB) for a model with the given settings.
pub fn estimate_vram_mib(
    model_mib: u64,
    settings: &ModelSettings,
    total_layers: u32,
    hidden_size_opt: Option<u32>,
    n_head_opt: Option<u32>,
    n_kv_head_opt: Option<u32>,
    gpu_mem_total_mib: u64,
) -> u64

/// Format bytes as MB or GB.
pub fn format_mib(mib: u64) -> String
}

Configuration

Configuration is stored in ~/.config/llm-manager/config.yaml and loaded via Config::load(). The config file structure:

models_dirs:
  - ~/.local/share/llm-manager/models
llama_server: llama-server
default:
  context_length: 32096
  threads: <physical cores>
  # ... more default parameters
  llama_cpp_version_cpu: null
  llama_cpp_version_vulkan: null
  llama_cpp_version_rocm: null
  llama_cpp_version_rocm_lemonade: null
  llama_cpp_version_cuda: null
model_overrides:
  # Per-model configs stored as individual YAML files in ~/.config/llm-manager/models/
  model.gguf:
    temperature: 0.7
    gpu_layers: 32
profiles:
  - name: Qwen
    description: Optimized for Qwen models
    settings:
      temperature: 0.6
      top_k: 20
rpc_workers:
  - name: Remote-GPU-1
    ip: 192.168.1.50
    port: 50052
    selected: true
system_prompt_presets:
  - name: General
    description: General-purpose assistant
    content: "You are a helpful assistant."

Built-in profiles are merged on load, so adding new ones in code automatically appears in the UI.