Introduction
LLM Manager is a terminal UI (TUI) for managing local LLM models. It lets you search HuggingFace, download GGUF models, load them via llama.cpp’s llama-server, and chat with them — all from your terminal.
Features
- Model search on HuggingFace (filters to GGUF models, paginated with infinite scroll)
- Download GGUF model files with progress tracking and cancellation (with disk space check)
- Load/unload models via llama.cpp server with progress visualization
- Local Model Filter — quickly find models in your list with
f - RPC Workers Manager — dedicated window to manage distributed inference nodes
- Chat with loaded models in the terminal
- Configure loading and inference parameters per model
- GGUF file browser — list and select specific GGUF files for a model
- Log panel — expand/collapse with Enter/Esc, follow mode with
f - About Box — application info and GPLv3 license link (
A) - CmdLine overlay — view the full llama-server command line (
Ctrl+K), export to script (e) - API proxy — expose an OpenAI-compatible API with CORS and SSE streaming support
- API key authentication — Bearer token authentication for the API proxy
- Profiles — save and apply named presets of settings
- System Prompt Presets — named system prompts for different use cases
- Router Mode — load multiple models simultaneously
- Benchmark Tuning — auto-tune model parameters for optimal performance
- Panel Resize — drag the border between left and right panels, or use
Shift+←/→ - README rendering — full markdown renderer for HuggingFace model documentation
- HuggingFace URL links — navigate to model pages from Model Info
- Multi-backend — CPU, Vulkan, ROCm, ROCm Lemonade, and CUDA support with per-backend version picker (13 platform-specific variants)
- Speculative decoding — MTP and other speculative decoding types via SpecTypePicker
- Per-model tags — Edit and manage tags for each model
- TLS support — Secure WebSocket dashboard with self-signed certificate generation
- Dashboard URL modal — Copy dashboard URL to clipboard with
Ctrl+U - YaRN RoPE — Extend context beyond training length with YaRN RoPE parameter tuning
Prerequisites
- Rust toolchain (edition 2024)
- A HuggingFace account (for downloading gated models)
- An NVIDIA GPU (Vulkan/CUDA) or AMD GPU (ROCm/ROCm Lemonade) for GPU inference, or a CPU for CPU-only inference
Screenshot

Quick Start
git clone https://github.com/aginies/llmtui.git
cd llmtui
cargo build --release
cargo run
Getting Started
Installation
From source
git clone https://github.com/aginies/llmtui.git
cd llmtui
cargo build --release
Platform Support
llm-manager runs on Linux, macOS, and Windows. GPU backends available per platform:
| Platform | CPU | Vulkan | ROCm | ROCm Lemonade | CUDA |
|---|---|---|---|---|---|
| Linux x64 | Yes | Yes | Yes | Yes | Yes |
| Linux ARM64 | Yes | — | — | — | — |
| Windows x64 | Yes | Yes | Yes (HIP) | — | Yes (12.4 / 13.1) |
| macOS ARM64 | Yes | — | — | — | — |
| macOS x64 | Yes | — | — | — | — |
ROCm Lemonade (AMD optimized) is Linux-only and auto-detects your GPU architecture (e.g. gfx1100).
Using the build script
A convenience script is included for common operations:
./build.sh build # Build (debug)
./build.sh run # Build and run (TUI mode)
./build.sh serve # Serve a model
./build.sh servedoc # Serve docs with watch mode
./build.sh release # Release build
./build.sh clean # Remove build artifacts
./build.sh format # Format code
./build.sh clippy # Run clippy
./build.sh doc # Build documentation
./build.sh help # Show help
First Run
On first launch, llm-manager creates a default configuration in ~/.config/llm-manager/config.yaml and sets up the models directory at ~/.local/share/llm-manager/models/.
cargo run
The application will:
- Load (or create) the config file
- Discover any
.gguffiles in the models directory - Start the TUI
Navigating the Interface
The TUI is divided into several panels:
- Models panel (left) — list of local GGUF models
- Settings panel (right) — server and LLM settings
- Log panel (bottom) — live output from llama.cpp
- Download panel — appears when downloading files
Use Tab to cycle between panels, and Ctrl+H for panel-specific help.
Searching for Models
To search HuggingFace for models:
- Press
/to enter search mode - Type your query and press
Enter - Results appear sorted by relevance by default
- Press
Ctrl+Sto cycle sort order (Relevance / Downloads / Likes / Trending / Created) - Press
Ctrl+Bto go back one page, or scroll down at the bottom for more results - Press
Ctrl+Shift+Rto fetch the model’s README (auto-fetched when navigating results)
Multi-word search: Type space-separated words (e.g. qwen opus) to search with AND logic — all words must match the model name.
Downloading Models
To download a model from HuggingFace:
- Press
/to enter search mode - Type your query and press
Enter - Press
lon a result to browse available GGUF files - Select a file and press
Enterto download - Press
⌥C(Alt+C) to cancel, orpto pause/resume the download at any time
The download progress is shown in the Download panel with speed (MiB/s), ETA, and status indicators. Before downloading, the app checks available disk space and warns if insufficient. Cancelled downloads automatically remove the temporary file. Once complete, the model appears in the Models panel (in your models directory).
Loading Models
Once a model is downloaded (or has one locally in your models directory):
- Select the model in the Models panel
- Press
l(orEnter) to load it
The loading process shows a progress bar with phases:
- Server starting
- Loading model weights
- Loading metadata
- Loading tensors (with GPU layer count)
- Server listening
- Ready (detected via
/healthAPI polling)
Log Panel
The Log panel shows live output from the llama.cpp server. Press Enter to expand to fullscreen, Esc to collapse. Press f to toggle between Following (auto-scroll) and Manual (scroll history) modes.
Other Features
- Profiles (
p) — Quick-switch between saved settings presets - Profile Picker (
Ctrl+P) — Open a modal to select from built-in or user profiles - System Prompt Presets — Named system prompts for different use cases (Coder, Thinker, Mathematician)
- RPC Workers — Manage distributed inference nodes from Server Settings
- Benchmark Tuning — Auto-tune model parameters for optimal performance (set Mode to BenchTune)
- Router Mode — Load multiple models simultaneously
- Panel Resize — Drag the border between left and right panels, or use
Shift+←/→(20%-80%) - Mouse support — Click panels to focus, scroll in logs, README, and settings
Using Serve Mode
You can also start a model directly from the command line:
./build.sh serve --model /path/to/model.gguf
Or with a settings profile:
./build.sh serve --model model.gguf --profile qwen
With a custom backend binary:
./build.sh serve --model model.gguf --backend-binary /opt/rocm/bin/llama-server
Bound to a specific network interface:
./build.sh serve --model model.gguf --host 0.0.0.0
Logs redirected to a file:
./build.sh serve --model model.gguf --log-file /var/log/llm-manager/model.log
API Proxy
Start with an OpenAI-compatible API proxy:
./build.sh serve --model model.gguf --api-port 49222
With authentication:
./build.sh serve --model model.gguf --api-port 49222 --api-key secret
The API proxy forwards requests to the llama-server instance and supports all llama.cpp endpoints including chat completions, embeddings, and more. It supports SSE (Server-Sent Events) streaming for chat completions and other streaming endpoints, and CORS is enabled for all origins.
Usage
Serve Mode
Run a model directly with llama-server and expose an OpenAI-compatible API:
# Serve a model with API proxy on port 49222
./build.sh serve --model /path/to/model.gguf --api-port 49222
# Serve with a settings profile
./build.sh serve --model model.gguf --profile qwen
# Serve with API key authentication (Bearer token)
./build.sh serve --model model.gguf --api-port 49222 --api-key secret
# Serve with API proxy and WebSocket dashboard
./build.sh serve --model model.gguf --api-port 49222 --ws-enable
# Serve with custom dashboard port and auth
./build.sh serve --model model.gguf --api-port 49222 --ws-enable --ws-port 8081 --ws-auth mykey
# Serve with a custom backend binary path
./build.sh serve --model model.gguf --backend-binary /path/to/custom/llama-server
# Serve bound to a specific network interface
./build.sh serve --model model.gguf --host 0.0.0.0
# Redirect logs to a file (useful for systemd)
./build.sh serve --model model.gguf --log-file /var/log/llm-manager/model.log
# Combine options
# Serve with API proxy and WebSocket dashboard on a specific host
./build.sh serve --model model.gguf --api-port 49222 --ws-enable --host 192.168.1.100
# Redirect logs to a file (useful for systemd)
./build.sh serve --model model.gguf --log-file /var/log/llm-manager/model.log
# Combine options
./build.sh serve --model model.gguf --api-port 49222 --ws-enable --host 0.0.0.0 --backend-binary /opt/rocm/bin/llama-server --log-file /var/log/llm-manager/model.log
The serve command automatically resolves the llama-server binary from the backend-specific directory (~/.local/share/llm-manager/bin/llama-server-{cpu,vulkan,rocm}-{version}/) and sets LD_LIBRARY_PATH for shared libraries. If the binary is not found, it downloads it from the llama.cpp GitHub releases. Use --backend-binary to specify a custom binary path, --host to override the network bind address for both the API proxy and WebSocket servers (default is from config), and --log-file to redirect logs to a file instead of stdout.
Model Management
Listing Models
The Models panel shows all .gguf files found in your models directories (recursively). The display name is the relative path from the models directory.
f— Filter local models by name (case-insensitive substring match)Esc— Clear active filter and return to full list
Loading and Unloading
lorEnter— Load selected modelu— Unload model from serverCtrl+D— Delete model (with confirmation)
When a model is loaded, its state changes to Loaded showing the port and PID. You can load multiple models when using Router mode.
Deleting Models
Pressing Ctrl+D prompts for confirmation before moving the model file and its YAML config to ~/.config/llm-manager/unused/. Both can be restored later.
Search
Search mode lets you browse and download GGUF models from HuggingFace:
| Key | Action |
|---|---|
/ | Open search input modal — type query and press Enter to search |
Enter | Select GGUF files for the highlighted model |
Esc | Exit search |
Ctrl+S | Cycle sort order |
Ctrl+B | Go back one page |
Down (at bottom) | Load more results |
Ctrl+Shift+R | Fetch and view README for the selected model |
Multi-word Search
Type space-separated words (e.g. qwen opus) to search with AND logic — all words must match the model name. Matching words are highlighted in cyan in the results list.
GGUF File Browser
When viewing GGUF files for a model:
| Key | Action |
|---|---|
j / k | Navigate files |
Enter | Download selected file |
Esc | Go back to search results |
⌥C | Cancel download and remove temp file |
Download Panel
When one or more files are downloading, the Download panel appears at the bottom of the screen, showing progress, speed (MiB/s), ETA, and status for each download. Before downloading, the app checks available disk space and warns if insufficient. Cancelled downloads automatically remove the temporary file.
| Key | Action |
|---|---|
j / k | Navigate downloads |
p | Pause / Resume selected download |
⌥C | Cancel selected download and remove temp file |
Status indicators: Downloading (yellow), Paused (white), Complete (green), Cancelled (red), Error (red).
Loading Models
When you load a model, the application:
- Resolves the llama-server binary for the selected backend (CPU/Vulkan/ROCm)
- Spawns the server with the current settings
- Loads the model via the server’s
/models/loadAPI - Polls the server’s
/metricsand/healthendpoints for status - Displays a progress bar showing loading phases
Loading Phases
The progress bar tracks:
- Server starting (8%) — llama.cpp binary is launched
- Loading model (7%) — weights file is being read
- Loading metadata (7%) — GGUF metadata is parsed
- Loading tensors (70%) — tensors are loaded and offloaded to GPU
- Server listening (8%) — HTTP server is ready
- Complete — model is ready for inference
During tensor loading, the progress bar shows offloaded layers (e.g., 16/32) parsed from llama.cpp’s log output.
Settings
Server Settings
| Setting | Default | Description |
|---|---|---|
| Host | 127.0.0.1 | Bind address for the llama.cpp server. Use 0.0.0.0 to accept connections from other machines. |
| Backend | auto-detected | Acceleration backend: auto-detected based on GPU (Cuda for NVIDIA, Rocm for AMD, Vulkan for Intel). Options: cpu (CPU-only), vulkan (NVIDIA/AMD/Intel GPU), rocm (AMD GPU), rocm-lemonade (AMD optimized), cuda (NVIDIA CUDA 12.8). Shows the currently selected version. |
| Threads | (physical cores) | CPU threads for generation. Set to your physical core count for best performance. |
| Threads Batch | 8 | CPU threads for batch processing (prompt evaluation). |
| Mode | Normal | Server mode: Normal (single model), Router (multiple models), Bench (run llama-bench), or BenchTune (parameter auto-tuning). |
| API Endpoint | false | Enable the API proxy server (see Serve Mode). |
| Dashboard | false | WebSocket dashboard server (port 49223). Press Enter to configure (enabled, port, auth key, TLS). |
| RPC Workers | None | Open a dedicated window to manage distributed inference nodes (IP:Port). |
Note: The Server Settings panel is hidden when a server is already running. Press
F2to toggle Server Settings only when no server is active.
LLM Settings
The LLM Settings panel has 32 standard fields, 16 expert fields (revealed with Ctrl+X), and 19 ultra fields (hidden even in expert mode), for a total of 67 fields. Arrow keys adjust values; +/- for coarse changes, Left/Right for fine. Toggle fields respond to e or Ctrl+E.
Loading
| Field | Default | Description |
|---|---|---|
| Prompt | General | System prompt preset that defines the model’s initial behavior. Presets include General, Coder, Thinker, Mathematician, and any user-defined prompts. |
| Context | 32096 | Context window size in tokens. Must be a power of two. Larger values consume more VRAM and RAM. Models often have a maximum context length (e.g., 32K, 128K). |
| Keep in memory | false | Locks model weights in RAM (-mlock) to prevent the OS from swapping them out. Useful when repeatedly loading/unloading models. Increases RAM usage. |
GPU Offload
| Field | Default | Description |
|---|---|---|
| GPU Layers | Auto | Number of model layers offloaded to GPU memory. Auto lets llama.cpp decide based on available VRAM. Specific sets an exact number. All offloads every layer (-ngl 999). |
| Flash Attention | true | Enables Flash Attention 2 for faster inference with lower memory usage. Requires GPU support. Can improve throughput by 20-40%. |
| KV Cache Offload | true | Offloads the KV cache to RAM when GPU memory is full. Trade-off: more VRAM available for model weights at the cost of slower cache access. |
| Cache Type K | F16 | Data type for the key cache. Options: F32 (most accurate, most memory), F16 (default), BF16 (better than F16 for some models), Q8_0, Q5_0, Q5_1, Q4_0, Q4_1, Iq4_NL. |
| Cache Type V | F16 | Data type for the value cache. Same options as Cache Type K. Using lower precision reduces VRAM but may affect quality. |
| Active Experts | 1 | For Mixture-of-Experts (MoE) models, the number of experts activated per token. Higher values improve quality but increase compute. |
Evaluation
| Field | Default | Description |
|---|---|---|
| Eval Batch | 512 | Logical maximum batch size for evaluation. Larger batches improve throughput but increase memory usage. Set to the model’s native context length for single-sequence inference. |
| Unified KV | true | Shares KV cache across sequences, reducing memory usage when running multiple prompts. Can cause cache eviction conflicts. |
Sampling
| Field | Default | Description |
|---|---|---|
| Seed | -1 | Random seed for reproducible outputs. -1 means random each time. Set to a fixed value for debugging or reproducibility. |
| Temperature | 0.8 | Controls randomness in sampling. Higher values (1.0-2.0) produce more creative/divergent outputs. Lower values (0.0-0.5) produce more deterministic/crisp outputs. |
| Top-k | 40 | Limits sampling to the k most likely next tokens. 0 disables. Smaller values make outputs more focused. Typical: 20-50. |
| Top-p | 0.95 | Nucleus sampling: limits to tokens whose cumulative probability reaches p. 1.0 disables. Lower values (0.8-0.95) reduce randomness. |
| Min P | 0.0 | Minimum probability threshold for sampling. Tokens with probability below this fraction of the highest-probability token are excluded. Useful for controlling extreme outputs. |
| Max Tokens | 0 | Maximum tokens to generate per response. 0 means no limit (until EOS token). |
Repetition Control
| Field | Default | Description |
|---|---|---|
| Repetition Penalty | 1.1 | Penalizes tokens that have already appeared. Values > 1.0 reduce repetition. Typical: 1.1-1.2. |
| Rep. Last N | 64 | Number of recent tokens to consider for repetition penalty. -1 uses the full context. |
Yarn RoPE
| Field | Default | Description |
|---|---|---|
| Yarn RoPE | false | Enables YaRN (Yet another RoPE extensioN) for extending context beyond the model’s training length. |
| Yarn Params | — | Opens a modal to configure three floating-point values: rope_scale (default 1.0, multiplies context), rope_freq_base (default 0.0, overrides the model’s base frequency), rope_freq_scale (default 1.0, scales the frequency). Only digits, ., -, e, and E are accepted. |
Tags
| Field | Default | Description |
|---|---|---|
| Tags | None | Per-model tags stored in the YAML config. Press Enter to open the tag editor modal. Press t in the LLM Settings panel to open the tag editor. |
Backend
| Field | Default | Description |
|---|---|---|
| LLama.cpp Version | Latest | Shows the currently selected backend version. Press Enter to open the backend version picker. |
Expert Mode
Press Ctrl+X to toggle expert mode, which reveals 16 additional parameters:
Loading (expert): NUMA (None/Distribute/Isolate/Numactl)
GPU (expert): Cache Type K (toggle), Cache Type V (toggle), Main GPU, Fit, Active Experts (toggle)
Sampling (expert): Mirostat (Off/1/2), Mirostat LR, Mirostat Ent, Ignore EOS (toggle)
Repetition (expert): Presence Penalty (toggle, -2.0 to 2.0), Frequency Penalty (toggle, -2.0 to 2.0)
Speculative (expert): MTP (toggle), Spec Type, Spec Draft N Max
These fields follow the same navigation and editing rules as standard fields. Arrow keys adjust values, Enter enters direct edit mode, and dirty fields are highlighted in yellow.
Ultra Fields
19 ultra fields are hidden even in expert mode. They include: Typical P, Mirostat, Mirostat LR, Mirostat Ent, Ignore EOS, Samplers, DRY Multiplier, DRY Base, DRY Allowed Length, DRY Penalty Last N, Threads Batch, UBatch Size, Keep, Split Mode, Tensor Split, Main GPU, Fit, Embedding, RPC. These require direct config file editing or profile application.
Cache Type K/V options: F32, F16, BF16, Q8_0, Q5_0, Q5_1, Q4_0, Q4_1, Iq4Nl
Changing Values
Use Left/Right to adjust numeric fields by 1, or Up/Down for larger steps. Toggle fields respond to e or Ctrl+E. Dirty (changed) fields have the name in red and a trailing *. The status bar shows *unsaved* when settings are dirty.
Saving Settings
Ctrl+S— Save settings for the selected modelCtrl+R— Reset to defaultse/Ctrl+E— Toggle enabled/disabled (for Keep in memory, Flash Attention, KV Cache Offload, Cache Type K/V, Fit, Unified KV, Max Tokens, Presence/Frequency Penalty, Max Concurrent Pred, MTP, Ignore EOS, Yarn RoPE, Active Experts)Ctrl+X— Toggle expert mode (reveals 16 additional parameters)
Dirty (changed) fields are highlighted with red names and a trailing *.
Keyboard Shortcuts
| Key | Action |
|---|---|
j / k | Navigate up/down |
Enter | Load model / Select GGUF files (in search) / Expand log |
f | Filter local models list / Toggle Follow mode (in Log panel) |
Esc | Back / Exit search / Collapse log / Clear local filter |
Tab | Switch panels (next) |
Shift+Tab | Switch panels (previous) |
/ | Open search input modal |
l | Load model |
u | Unload model |
t | Open tags editor (in LLM Settings) |
A | About box (license and version info) |
Ctrl+D | Delete model (with confirmation) |
Ctrl+H | Panel-specific help |
Ctrl+K | CmdLine overlay |
Ctrl+Alt+K | Kill llama-server |
Ctrl+L | Focus Log panel |
Ctrl+S | Save settings / Cycle search sort (in search) |
Ctrl+R | Reset settings (in LLM Settings) |
Ctrl+E | Toggle field enabled/disabled (in LLM Settings: Cache Type K/V, Max Tokens, Presence/Frequency Penalty, Max Concurrent Pred, Flash Attention, Unified KV, Keep in memory, Fit, MTP, Ignore EOS, Yarn RoPE, Active Experts) |
Ctrl+X | Toggle expert mode (in LLM Settings) |
Ctrl+P | Open Profile Picker modal (in LLM Settings) |
Ctrl+U | Open Dashboard URL modal (copy URL to clipboard) |
Ctrl+B | Back one page in search |
Ctrl+Shift+K | Kill llama-server (alternative) |
Ctrl+Shift+R | Fetch README for selected model (in search) |
g / G | Jump to top/bottom of log |
PageUp / PageDown | Fast scroll (logs, README, benchmarks) |
F1–F6 | Toggle panels (Models, Server, Info, Settings, Active, Log) |
F9 / F10 / Ctrl+F10 | Show all panels |
Ctrl+F7 | Focus Models panel |
Ctrl+F8 | Focus Server Settings panel |
Ctrl+F9 | Focus LLM Settings panel |
Shift+← / Shift+→ | Resize horizontal panel split (20%-80%) |
p | Pause/resume download / Previous benchmark result / Apply profile |
n | New preset (System Prompt Presets) / Next benchmark result / Add new worker (RPC) |
Space | Toggle selection (RPC workers, benchmark parameters) |
Alt+M | Toggle benchmark mode (RuntimeOnly / Full) |
Alt+P | Edit benchmark prompt |
Alt+N | Edit n_predict (max tokens) |
Alt+I | Edit iterations |
Alt+C | Edit chat template kwargs / Cancel confirmation |
y | Confirm destructive action |
h | Cancel confirmation dialog |
Log Panel
The Log panel displays live output from the llama.cpp server with level-based coloring.
Log Modes
| Mode | Behavior |
|---|---|
| Following (default) | Auto-scrolls to the bottom as new entries arrive. Press g to exit. |
| Manual | Allows manual scrolling through log history. Press G to return to bottom. |
Press f in the Log panel to toggle between modes. The current mode is shown in the panel title. Expand the log to fullscreen with Enter; collapse with Esc.
RPC Workers
RPC Workers enable distributed inference across multiple machines. Each worker has a name, IP address, and port (default: 50052).
Open the RPC Workers manager from the Server Settings panel. Within the manager:
| Key | Action |
|---|---|
n | Add new worker |
e | Edit selected worker |
d | Delete selected worker |
Space | Toggle worker selection |
Esc | Close manager |
WebSocket Dashboard
The WebSocket Dashboard provides a real-time visualization of model metrics in any web browser. Access it at http://localhost:49223 (default port).
Configuration
Open the Server Settings panel, navigate to Dashboard, and press Enter to configure:
| Field | Description |
|---|---|
| Enabled | Toggle the dashboard on/off |
| Port | Server port (default: 49223) |
| Auth Key | Optional authentication key |
| TLS Enabled | Enable TLS for secure dashboard access |
| TLS Cert | Path to TLS certificate file |
| TLS Key | Path to TLS private key file |
When an auth key is set, clients must include it as a URL parameter: http://localhost:49223?auth=<key>. With TLS enabled, the URL uses https://.
Dashboard Display
The dashboard shows real-time metrics (TPS, prompt TPS, latency, context, VRAM, RAM, CPU) and current inference settings (backend, threads, temperature, sampling parameters, etc.) alongside the full server command line.
Benchmark Tuning
Benchmark Tuning auto-tunes model parameters for optimal performance. Access it by setting the Server Mode to BenchTune.
Two modes are available:
- RuntimeOnly — Single server, params sent in request body (no server restarts)
- Full — New server spawned for each parameter combination
Tunable parameters: temperature (0.4–1.0), top_p (0.8–1.0), top_k (40–50), repeat_penalty (1.0–1.2), flash_attn (0/1), threads (4–16), batch_size (512–2048), expert_count (1–4), context_length, spec_type (speculative decoding type), draft_tokens.
Results can be exported as Markdown table, JSON, YAML, or HTML report with summary cards, winner section, impact analysis, and Chart.js charts. Navigate between results with p (previous) and n (next).
System Prompt Presets
Named system prompts for different use cases. Built-in presets: General, Coder, Thinker, Mathematician. User presets are stored as YAML files in ~/.config/llm-manager/presets/<name>.yaml.
Open the System Prompt Presets panel and manage presets:
| Key | Action |
|---|---|
n | Create new preset |
e | Edit selected preset |
↵ | Apply preset |
d | Delete selected preset (moved to unused_presets/) |
⌃S | Save preset during edit |
Esc | Close / Cancel edit |
GPU Layers Cycling
In the LLM Settings panel, the GPU Layers field cycles through three modes with arrow keys:
| Mode | Behavior |
|---|---|
| Auto | Lets llama.cpp auto-detect based on available VRAM (default) |
| Specific number | Offloads exactly that many layers to GPU |
| All | Offloads all layers (equivalent to -ngl 999) |
Arrow keys cycle: Auto → 1 → 2 → … → N → All → Auto. Pressing Enter from a specific number opens an edit buffer for direct input. The -ngl flag is only added for Specific and All modes.
Tags
Per-model tags can be edited in the LLM Settings panel. The Tags field opens an edit modal where you can add, remove, or modify tags associated with the model. Tags are stored in the per-model YAML config.
MTP (Multi-Token Prediction)
MTP is an experimental feature that uses a draft model to predict multiple tokens in parallel, improving inference speed. When a model with MTP architecture is selected, the app automatically detects it and enables the --draft-mtp flag. The number of draft tokens is read from the GGUF metadata and displayed in the Model Info panel.
GGUF Metadata
The Model Info panel shows parsed GGUF metadata including: architecture, layers, hidden size, context length, attention heads, KV heads, domain, capabilities, quantization, parameters (e.g., “7B”, “405B”), tokenizer type, vocabulary size, and max context for VRAM. Metadata is parsed once and cached (debounced by file mtime).
Active Model Metrics
The Active Model panel shows real-time metrics:
| Metric | Description |
|---|---|
| TPS | Tokens per second (generation speed) |
| Prompt TPS | Prompt processing speed |
| Gen TPS | Generation tokens per second (separate from prompt TPS) |
| Context usage | Progress bar showing ctx_used/ctx_max |
| CPU% | CPU usage percentage |
| RAM | RAM usage |
| VRAM | GPU memory used/total |
| Total VRAM | Total GPU memory used (including non-model allocations) |
| Latency | Milliseconds per token (generation and prompt) |
| Tokens | Total decoded tokens generated |
The panel also shows benchmarking state with progress bar and current parameter display when running BenchTune.
Backend Selection
Multiple backends are supported via the llama.cpp server:
| Backend | Source | Description |
|---|---|---|
| CPU | ggml-org/llama.cpp | CPU-only inference (standard) |
| Vulkan | ggml-org/llama.cpp | GPU via Vulkan (Universal: AMD/NVIDIA/Intel) |
| ROCm | ggml-org/llama.cpp | GPU via ROCm (AMD Native) |
| ROCm Lemonade | lemonade-sdk/llamacpp-rocm | GPU via ROCm (AMD Optimized, auto-detects GFX architecture) |
| CUDA | ai-dock/llama.cpp-cuda | GPU via CUDA (NVIDIA Native, CUDA 12.8) |
| CPU ARM64 | ggml-org/llama.cpp | CPU-only for ARM64 Linux |
| CPU Windows | ggml-org/llama.cpp | CPU-only for Windows |
| Vulkan Windows | ggml-org/llama.cpp | Vulkan for Windows |
| CUDA Windows 12.4 | ai-dock/llama.cpp-cuda | CUDA 12.4 for Windows |
| CUDA Windows 13.1 | ai-dock/llama.cpp-cuda | CUDA 13.1 for Windows |
| HIP Windows | ggml-org/llama.cpp | HIP (ROCm) for Windows |
| CPU macOS ARM64 | ggml-org/llama.cpp | CPU-only for macOS Apple Silicon |
| CPU macOS x64 | ggml-org/llama.cpp | CPU-only for macOS Intel |
Each backend has its own independently configurable llama.cpp version. Switching versions is instant — no re-download.
Server Modes
| Mode | Description |
|---|---|
| Normal | Single model via CLI (default) |
| Router | Multiple models via API, loads via /load endpoint |
| Bench | GPU benchmarking mode (runs llama-bench) |
| BenchTune | Parameter auto-tuning mode |
VRAM Estimate
The app computes a detailed VRAM estimate based on model size, GPU layers, KV cache, activation overhead, and fixed overhead. The formula accounts for GQA ratio, FlashAttention (0.5× KV cache reduction), unified KV cache, KV cache quantization bytes, activation overhead (8× multiplier), and fixed overhead (3.8% of max VRAM or 500 MiB fallback). The estimate is shown in the LLM Settings title (e.g., “VRAM ~= 8.2 GB”).
Confirmation Dialogs
The app uses confirmation dialogs for destructive actions:
- Exit — warns about loaded models
- Delete — confirms irreversible deletion
- Reset — confirms resetting all LLM settings
- Unload — confirms unloading a model via API
- DeleteBackend — confirms deleting a backend binary version from disk
Mouse Support
Mouse interactions are supported: clicking on panels to focus them, and scrolling in the log panel, README panel, settings, profiles, and presets panels.
Panel Resize
The horizontal split between left panels (Models + Info) and right panels (Settings/README) can be resized:
| Method | Description |
|---|---|
| Drag border | Click and drag the vertical border between left and right panels |
| Scroll on border | Scroll mouse wheel while hovering over the border (1% steps) |
| Keyboard | Shift+← / Shift+→ to adjust by 1% (range: 20%-80%) |
The current split percentage is shown in the status bar (e.g., │ 55%). While actively resizing, the indicator shows │ 55% ← resize →.
CmdLine Overlay
Press Ctrl+K to view the full command line that would be executed to start the llama.cpp server. This shows the binary path, model path, and all parameters.
Press e in the overlay to export the command to /tmp/test_llamaserver.sh.
Server Status
The status bar shows the current server status at the top:
- Running:
● 9090 Normal(green dot with port and mode) - Stopped:
○ Server(gray)
Press Ctrl+Alt+K to kill the running llama-server. When stopped, all loaded models are reset to Available state.
Profiles
Profiles are named presets of LLM settings. Built-in profiles include Qwen, Gemma, Llama, Mistral, and Phi. User profiles are stored as YAML files in ~/.config/llm-manager/profiles/<name>.yaml.
p— Apply a profile to current settingsCtrl+S— Save current settings as a new profile (in the Profiles panel)Ctrl+D— Delete a user-defined profile (moved tounused_profiles/)
Configuration
Directory Layout
llm-manager uses XDG directories for config and data:
~/.config/llm-manager/ # Config directory
├── config.yaml # Global settings
├── models/ # Per-model YAML configs
│ └── qwen2.5-7b.yaml
├── profiles/ # Per-profile YAML configs
│ └── my-profile.yaml
├── presets/ # Per-preset YAML configs
│ └── custom-preset.yaml
├── unused/ # Deleted model configs
├── unused_profiles/ # Deleted profiles
└── unused_presets/ # Deleted presets
~/.local/share/llm-manager/ # Data directory
├── models/ # GGUF model files
│ └── qwen2.5-7b.Q4_K_M.gguf
└── bin/ # llama-server binaries
└── llama-server-cpu-...
Per-model configs are named <model_name>.yaml where model_name is the GGUF filename without the .gguf extension. Deleted configs are moved to unused/ subdirectories (recoverable).
Config File
The main config file is ~/.config/llm-manager/config.yaml. It is created automatically on first run with sensible defaults.
models_dirs:
- ~/.local/share/llm-manager/models
llama_server: llama-server
default:
context_length: 32096
threads: <physical cores>
threads_batch: 8
batch_size: 512
temperature: 0.8
# ... more settings
You can specify a custom config path with --config:
cargo run -- --config /path/to/config.yaml
Default Parameters
| Parameter | Default | Description |
|---|---|---|
context_length | 32096 | Context window size in tokens |
threads | (physical cores) | CPU threads for generation |
threads_batch | 8 | CPU threads for batch processing |
batch_size | 512 | Logical maximum batch size |
ubatch_size | 512 | Physical maximum batch size |
keep | 0 | Keep N tokens from initial prompt |
mlock | false | Lock model weights in RAM |
mmap | true | Memory-map the model |
kv_cache_offload | true | Offload KV cache to RAM |
flash_attn | true | Enable Flash Attention |
temperature | 0.8 | Sampling temperature |
top_k | 40 | Top-k sampling |
top_p | 0.95 | Top-p sampling |
repeat_penalty | 1.1 | Repetition penalty |
backend | auto-detected | Default backend (auto-detected: Cuda for NVIDIA, Rocm for AMD, Vulkan for Intel; falls back to cpu). Options: cpu, vulkan, rocm, rocm-lemonade, cuda |
Profiles
Profiles are named presets of settings. The built-in profiles are:
| Profile | Description | Key Settings |
|---|---|---|
| Qwen | Optimized for Qwen models | temp: 0.6, top-k: 20, context: 32768 |
| Gemma | Optimized for Gemma models | temp: 0.8, min-p: 0.1, typical-p: 0.9 |
| Llama | Optimized for Llama models | temp: 0.7, repeat-penalty: 1.1 |
| Mistral | Optimized for Mistral models | temp: 0.7, top-k: 50 |
| Phi | Optimized for Phi models | temp: 0.7, top-k: 50 |
User-defined profiles are stored as individual YAML files in ~/.config/llm-manager/profiles/<name>.yaml. Built-in profiles are auto-merged on load.
System Prompt Presets
System prompt presets define the initial system prompt. Built-in presets:
| Preset | Description |
|---|---|
| General | “You are a helpful assistant.” |
| Coder | Expert software developer |
| Thinker | Analytical and thoughtful |
| Mathematician | Expert in mathematics |
User-defined presets are stored as individual YAML files in ~/.config/llm-manager/presets/<name>.yaml. Built-in presets are auto-merged on load.
Backend Binaries
llama-server binaries are stored in ~/.local/share/llm-manager/bin/ with versioned directories:
~/.local/share/llm-manager/bin/
├── llama-server-cpu-{version}/llama-server
├── llama-server-vulkan-{version}/llama-server
├── llama-server-rocm-{version}/llama-server
├── llama-server-rocm-lemonade-{version}/llama-server
└── llama-server-cuda-{version}/llama-server
Binaries are downloaded from specialized repositories on first use:
- CPU, Vulkan, ROCm (Native): Fetched from ggml-org/llama.cpp
- ROCm Lemonade: Fetched from lemonade-sdk/llamacpp-rocm (ZIP, auto-detects GFX architecture like
gfx1100) - CUDA (NVIDIA): Fetched from ai-dock/llama.cpp-cuda (CUDA 12.8 builds)
Switching versions is instant — no re-download.
Per-backend Version Config
llama_cpp_version_cpu: null
llama_cpp_version_vulkan: null
llama_cpp_version_rocm: null
llama_cpp_version_rocm_lemonade: null
llama_cpp_version_cuda: null
Platform-specific backend variants (e.g. CpuArm64, CpuWindows, CpuMacosArm64) are handled through the Backend enum and platform field, not through separate version config keys. Each backend has its own independently configurable version.
Setting to null uses the latest release. Specific versions can be set via the version picker in LLM Settings. These selections are automatically persisted to your configuration and remembered across restarts.
Asset Names
Assets are selected based on the detected platform. Linux examples:
- CPU (x64):
llama-{tag}-bin-ubuntu-x64.tar.gz - CPU (ARM64):
llama-{tag}-bin-ubuntu-arm64.tar.gz - Vulkan:
llama-{tag}-bin-ubuntu-vulkan-x64.tar.gz - ROCm:
llama-{tag}-bin-ubuntu-rocm-7.2-x64.tar.gz - ROCm Lemonade:
llama-{tag}-ubuntu-rocm-{gfx}-x64.zip(auto-detects GPU architecture) - CUDA:
llama.cpp-{tag}-cuda-12.8-amd64.tar.gz
Windows assets use *.zip (e.g. llama-{tag}-bin-win-cpu-x64.zip). macOS assets use *-macos-arm64.tar.gz or *-macos-x64.tar.gz.
Serve Mode
You can start a model directly from the command line without the TUI:
./build.sh serve --model /path/to/model.gguf
Options
| Option | Description |
|---|---|
--model | Path to the GGUF model file |
--profile | Apply a settings profile (e.g., qwen, llama) |
--config | Path to config file |
--api-port | Start API proxy on given port |
--api-key | API key for Bearer token authentication |
--ws-enable | Enable WebSocket dashboard server |
--ws-port | Port for WebSocket dashboard server |
--ws-auth | Auth key for WebSocket dashboard access |
--host | Bind address for the server (e.g., 0.0.0.0) |
--backend-binary | Path to a custom llama-server binary |
--log-file | Log file path (default: stdout) |
--tls-enable | Enable TLS for WebSocket dashboard |
--tls-cert | Path to TLS certificate file |
--tls-key | Path to TLS private key file |
--threads | CPU threads |
--context | Context length |
--gpu-layers | Number of GPU layers |
API Proxy
The API proxy forwards requests to the llama.cpp server and provides OpenAI-compatible and Anthropic-compatible endpoints. It supports SSE (Server-Sent Events) streaming for chat completions and other streaming endpoints, and CORS is enabled for all origins with GET/POST/PUT/DELETE/OPTIONS methods. When --api-key is set, all requests require Authorization: Bearer <key>.
API Endpoints
The API proxy explicitly handles the following endpoints, while all other paths are automatically proxied to the llama-server instance:
| Endpoint | Method | Description |
|---|---|---|
/health | GET | Health check |
/metrics | GET | Prometheus metrics |
/v1/chat/completions | POST | Chat completions (OpenAI) |
/v1/completions | POST | Completions (OpenAI) |
/v1/embeddings | POST | Embeddings |
/v1/models | GET | List models |
/api/status | GET | Server status (pid, uptime, loaded models) |
The following endpoints are automatically proxied to llama-server (not explicitly handled):
| Endpoint | Method | Description |
|---|---|---|
/v1/responses | POST | Responses (Anthropic) |
/v1/messages | POST | Messages (Anthropic) |
/v1/messages/count_tokens | POST | Count tokens (Anthropic) |
/completion | POST | Legacy completion |
/infill | POST | Code completion (FIM) |
/reranking | POST | Re-ranking |
/tokenize | POST | Tokenize text |
/detokenize | POST | Detokenize tokens |
/apply-template | POST | Apply chat template |
/v1/health | GET | Health check (alias) |
/props | GET/POST | Get/set server properties |
/slots | GET | Slot monitoring |
/lora-adapters | GET/POST | List/load LoRA adapters |
/models/load | POST | Load a model (router mode) |
/models/unload | POST | Unload a model (router mode) |
Model Overrides
Settings can be saved per-model. Overrides are stored as individual YAML files in ~/.config/llm-manager/models/<name>.yaml (where name is the GGUF filename without .gguf). When a model is loaded, its override settings are merged into the defaults. Deleted configs are moved to ~/.config/llm-manager/unused/ for recovery.
RPC Workers
You can manage a list of remote llama-rpc-server nodes for distributed inference. These are stored in the rpc_workers list in the config:
rpc_workers:
- selected: true
name: "Worker 1"
ip: "192.168.1.10"
port: 50052
Workers can be managed via the RPC Workers window in the Server Settings panel. Selected workers are combined into the --rpc flag when starting the server.
WebSocket Dashboard
The WebSocket Dashboard provides a real-time visualization of model metrics and settings via a web browser.
Accessing the Dashboard
The dashboard runs as a built-in HTTP server on port 49223 by default. Open it in your browser:
http://localhost:49223
Enabling in Serve Mode
The dashboard can be enabled in serve mode using the --ws-enable flag:
./build.sh serve --model model.gguf --api-port 49222 --ws-enable
Customize the dashboard port and authentication:
./build.sh serve --model model.gguf --api-port 49222 --ws-enable --ws-port 8081 --ws-auth mykey
Customize the host and use a specific backend binary:
./build.sh serve --model model.gguf --api-port 49222 --ws-enable --host 0.0.0.0 --backend-binary /opt/rocm/bin/llama-server
The --host option controls the bind address for both the API proxy server and the WebSocket dashboard server, ensuring they use the same network interface. The default is 127.0.0.1 (from config).
Enabling in TUI Mode
The dashboard can also be enabled from the TUI:
- Open the Server Settings panel (F2)
- Navigate to Dashboard and press
Enter - Configure:
- Enabled — toggle on/off
- Port — server port (default: 49223)
- Auth Key — optional authentication (see below)
- Press
Enterto save,Escto close
Dashboard Overview
The dashboard displays real-time metrics in a card-based layout:

Metrics Cards
| Metric | Description |
|---|---|
| Status | Current model state (loaded / unloaded / loading) |
| Generation Speed | Tokens per second (TPS) for text generation |
| Prompt Speed | Tokens per second for prompt processing |
| Latency | Milliseconds per token |
| Tokens | Tokens generated with progress bar (decoded_tokens / max_tokens, or ‘∞’ if not configured) |
| VRAM | GPU memory used/total with color-coded progress bar (green <50%, yellow 50-80%, red >80%) |
| RAM | System memory usage |
| CPU | CPU usage percentage |
Settings Panel
Below the metrics, the dashboard shows a grid of current inference settings:
| Setting | Description |
|---|---|
| Backend & Version | llama.cpp backend and version |
| Threads / Threads Batch | CPU thread configuration |
| Context / Batch Size / Ubatch Size | Model execution parameters |
| Temperature / Top-k / Top-p / Min P / Typical P | Sampling parameters |
| Seed | Random seed for reproducibility |
| Repeat Penalty / Repeat Last N | Repetition control |
| Presence Penalty / Frequency Penalty | Advanced repetition control |
| Flash Attention / KV Cache Offload | Performance optimizations |
| Cache Type K / Cache Type V | KV cache quantization |
| Unified KV / Mlock / Mmap | Memory management |
| Expert Count / GPU Layers | Model-specific settings |
| Samplers | Sampler order string |
| Spec Type / Draft Tokens | Speculative decoding configuration |
| Yarn RoPE / Yarn Params | Context extension parameters |
| Tags | Per-model tags |
Server Command
The full llama-server command line is displayed at the bottom of the dashboard, showing the exact invocation with all parameters. This is useful for debugging and inspecting the exact configuration being used.
Configuration
To enable and configure the dashboard:
- Open the Server Settings panel (F2)
- Navigate to Dashboard and press
Enter - Configure:
- Enabled — toggle on/off
- Port — server port (default: 49223)
- Auth Key — optional authentication (see below)
- Press
Enterto save,Escto close
Authentication
When an auth key is configured, clients must include it as a query parameter:
http://localhost:49223?auth=mysecretkey
Connection Status
The dashboard shows a connection indicator at the top of the page:
- Green pulsing dot — Connected via WebSocket
- Red dot — Disconnected (auto-reconnects every 2 seconds)
Architecture
The dashboard server is built with axum and tokio. It:
- Creates a
broadcast::channel(64)for metrics distribution - Spawns the server on the configured port
- Each metrics update is sent to the broadcast channel
- WebSocket clients subscribe and receive real-time updates
- The HTML dashboard (embedded in the binary) connects via WebSocket and renders the metrics
The server is started/stopped automatically when you toggle the Dashboard setting in Server Settings.
Architecture
LLM Manager is a Rust application built on ratatui and crossterm, using tokio for async operations. The codebase is organized into several modules:
src/
├── main.rs # Entry point, event loop, model discovery, metrics polling
├── config.rs # Config loading/saving, YAML-based, profiles, presets
├── models.rs # Domain types (SearchResult, DownloadState, ModelSettings, etc.)
├── serve.rs # Standalone serve mode CLI (--model, --profile, --api-port, --api-key)
├── serve_api.rs # Axum-based API proxy server for serve mode
├── backend/
│ ├── hub.rs # HuggingFace API: search, list files, download
│ ├── server.rs # llama.cpp server spawning (resolve_backend_binary, spawn_server)
│ ├── benchmark.rs # Benchmark tuning system (RuntimeOnly and Full modes)
│ ├── hardware.rs # GPU detection (AMD/NVIDIA/Intel), platform detection
│ ├── tls.rs # TLS certificate generation for secure connections
│ └── ws_server.rs # WebSocket metrics server
├── tui/
│ ├── mod.rs # Module declaration, format_size/format_number helpers
│ ├── app/ # App state (types.rs, async_ops.rs, sync_ops.rs, state.rs, metadata.rs,
│ │ # profiles.rs, pickers.rs, panels.rs, help.rs)
│ ├── event/ # Keyboard/mouse event handling (benches.rs, helpers.rs, key.rs, mouse.rs,
│ │ # panel/, readme.rs)
│ ├── render.rs # Top-level render dispatcher (hints.rs, overlays.rs, status.rs)
│ └── panel/ # Individual panel render functions
│ ├── mod.rs
│ ├── models.rs # Left panel: model list / search / download
│ ├── info.rs # GGUF metadata rendering
│ ├── tabbed.rs # Right panel: Model Info / Settings tabs
│ ├── settings.rs
│ ├── log.rs
│ ├── help.rs
│ ├── active.rs # Active model metrics panel
│ ├── about.rs # About box
│ ├── readme.rs # README rendering
│ ├── rpc_workers.rs # RPC workers manager
│ ├── system_prompt_presets.rs # System prompt presets
│ ├── profiles.rs # Profiles manager
│ ├── downloads.rs # Download progress panel (rendered inline, not a separate module)
## App State Machine
The `App` struct in `src/tui/app.rs` holds all application state. The main state machine is controlled by `models_mode`:
```rust
pub enum ModelsMode {
List, // Local model list
Search { query, results, sort_by, show_readme, loading, has_more, page },
Files { model_id, files, selected_idx, previous_query, previous_results, selected_result },
BenchTune, // Benchmark tuning mode showing results table
}
Each mode controls rendering in render.rs and key handling in event.rs. The GlobalMode enum handles overlays that appear above all panels:
#![allow(unused)]
fn main() {
pub enum GlobalMode {
Normal,
CmdLine { cmd_line: String },
HostPicker { entries: Vec<(String, String)>, selected: usize },
BackendPicker { entries: Vec<(Backend, Option<String>)>, selected: usize },
Confirmation { selected: bool, kind: ConfirmationKind },
RpcManager,
About,
MaxConcurrentPicker { value: String },
SpecTypePicker { entries: Vec<String>, selected: usize },
YarnRoPESettings { scale: String, freq_base: String, freq_scale: String, selected_field: i32, editing: bool, edit_buffer: String, edit_cursor_pos: usize },
BenchTuneSetup { config, selected_idx, bench_mode_selection, editing_prompt, editing_kwargs },
PromptPicker { entries, selected, editing, edit_buffer, edit_cursor_pos, confirm_delete },
ProfilePicker { entries, selected, profiles },
DashboardPicker { enabled, port, auth_key, tls_enabled, tls_cert, tls_key, selected_field, editing, edit_buffer, edit_cursor_pos },
DashboardUrl { host, port, auth_key, ws_enabled },
SearchInput { buffer: String, cursor_pos: usize },
}
}
Local Model Filter
The application supports real-time filtering of the local models list. Triggered by the f key when the Models panel is focused, it allows users to quickly narrow down large collections using case-insensitive substring matching.
Model Discovery
The discover_models() function in main.rs recursively scans the models directory for .gguf files:
#![allow(unused)]
fn main() {
fn discover_models(dir: &Path) -> Vec<DiscoveredModel>
}
Each DiscoveredModel contains the file path, name, size, and display name (relative path from models directory). Discovery runs in a blocking task on startup.
Download System
Downloads run in a spawned tokio task with progress flowing through a broadcast channel:
- User selects a file and presses
Enter pending_downloadis set with(model_id, filename, url, file_size)- Before starting, the app checks available disk space via
hub::get_free_space_bytes()and warns if insufficient - A tokio task calls
hub::download_file()with anArc<AtomicBool>cancel token andArc<AtomicU8>state - Progress updates flow through
download_tx→download_rx - The main loop polls
download_rxeach iteration and updates the Download panel - Pressing
⌥C(Alt+C) cancels the download and removes the temporary file;ppauses/resumes it
The download loop checks the state atomically each iteration: 1 = downloading, 2 = paused (sleeps 100ms and retries), 3 = cancelled (removes temp file, returns error). Each DownloadState tracks bytes downloaded, speed, ETA, destination path, and status (Downloading/Paused/Complete/Cancelled/Error).
Server Spawning
When a model is loaded, spawn_server() in backend/server.rs:
- Resolves the llama-server binary using
resolve_backend_binary() - If the binary doesn’t exist, downloads and extracts it from GitHub releases
- Spawns the process with the model path and all settings
- Sets up a log channel (
server_log_rx) for parsing output
The main loop polls server_log_rx and parses log messages for:
- Loading phases (model, metadata, tensors) from log messages
- Error detection (OOM, crash) from log messages
Metrics (TPS, VRAM, context) are now collected exclusively from the /metrics and /health API endpoints rather than log parsing.
Metrics & Logging
Metrics are collected from the /metrics and /health endpoints, which provide accurate real-time data. Loading completion is detected via the /health endpoint (polling for "status": "ok" and non-empty slots).
Each log entry is stored in log_entries: VecDeque<LogEntry> with a max of 500 entries. The log panel supports scrolling, expansion (Enter/Esc), and two modes: Following (auto-scroll to bottom) and Manual (free scroll). Press f to toggle modes.
Search
Search uses the HuggingFace API with &filter=gguf to only return GGUF models:
#![allow(unused)]
fn main() {
pub async fn search_models(query: &str, limit: u32, offset: u32) -> Result<(Vec<SearchResult>, bool)>
}
A post-filter checks that the model_id contains the search query (case-insensitive), since the HF API does full-text search across descriptions/tags and can return unrelated models.
Multi-word search: Space-separated words are split and each word must match the model name (AND logic). Matching words are highlighted in cyan in the results list.
- Default: 70 results per page (max 200)
- Pagination:
Ctrl+Bgoes back,Downat bottom loads more - Sort order cycles: Relevance → Downloads → Likes → Trending → Created
- README fetching:
Ctrl+Shift+Rdownloads and renders the model’s README
VRAM Estimation
The estimate_vram_mib() function in src/models.rs estimates VRAM usage:
total = model_vram + kv_cache + activation + fixed_overhead + 550
Where:
- model_vram — proportional to GPU layers loaded
- kv_cache —
2 * n_layer * n_ctx * n_embd_kv * sizeof(type)with GQA ratio and FlashAttention factor - activation — proportional to batch size and hidden size
- fixed_overhead — 3.8% of max VRAM (or 500 MiB if unknown)
Loading Progress
Model loading phases are detected from llama.cpp log output:
| Phase | Log pattern | Weight |
|---|---|---|
| ServerStarting | (implicit) | 8% |
| LoadingModel | “LLAMA_MODEL_LOADER” / “LOADING MODEL” | 7% |
| LoadingMeta | “LOADED META” / “META DATA” | 7% |
| LoadingTensors | “LOAD_TENSORS:” | 70% |
| ServerListening | “SERVER LISTENING” | 8% |
| Complete | Detected via /health API polling | — |
During tensor loading, the progress bar refines using layer counts parsed from “offloaded X/Y layers” log messages.
RPC Workers
Remote workers for distributed inference are stored in the config as Vec<RpcWorker>. Each worker has a name, IP address, and port (default: 50052). The RpcManager global mode provides a dedicated window for managing workers: add (n), edit (e), delete (d), toggle selection (Space).
Benchmark Tuning
The benchmark system (src/backend/benchmark.rs) supports two modes:
- RuntimeOnly: Single server, params sent in request body (no server restarts)
- Full: New server spawned for each parameter combination
Key types:
BenchTuneConfig: Model path, iterations, prompt, params to test, duration, modeBenchTuneParam: name, min, max, step, enabledBenchTuneResult: params, metrics (prompt_tps, generation_tps, combined_tps, latency_per_token, first_token_time), outputs, per-iteration metricsBenchTuneStatus: Running (with progress), Completed (with stats), PartiallyCompleted (with stats), Cancelled (with stats)
Error Handling
Errors are detected from log patterns:
- OOM: “OUTOFDEVICEMEMORY” / “OUT OF MEMORY”
- General error: “ERROR”, “FAILED TO LOAD”, “EXCEPTION”
Server exit is detected via a dedicated channel (not log parsing). On error, affected models are marked as Failed with the error message.
Confirmation Dialogs
Destructive actions trigger a GlobalMode::Confirmation overlay with ConfirmationKind variants: Exit, Reset, Delete, Unload, DeleteBackend. The user confirms with Enter or cancels with Esc.
API Reference
The full Rust API reference is available at docs.rs/llm-manager.
Generate it locally with:
cargo doc --open
Public Types
Core Types
| Type | Module | Description |
|---|---|---|
DiscoveredModel | models | A discovered .gguf file with path, name, size, and display name |
ModelSettings | models | All settings for loading a model via llama.cpp server (70+ fields) |
ModelState | models | State of a model: Available, Loading, Loaded, or Failed |
SearchResult | models | A model found via HuggingFace search |
DownloadState | models | Download progress tracking with cancellation support |
GgufMetadata | models | Parsed GGUF metadata (layers, hidden size, context, etc.) |
ServerMetrics | models | Metrics from the llama.cpp server (TPS, VRAM, CPU, context) |
WsMetrics | models | WebSocket-friendly metrics snapshot (serializable, includes settings and command display) |
LogEntry | config | A single log entry with timestamp, level, and message |
Enums
| Type | Module | Description |
|---|---|---|
Backend | models | Acceleration backend: Cpu, Vulkan, Rocm, RocmLemonade, Cuda, CpuArm64, CpuWindows, VulkanWindows, CudaWindows12_4, CudaWindows13_1, HipWindows, CpuMacosArm64, CpuMacosX64 |
ServerMode | models | Server operating mode: Normal (single model), Router (multiple), Bench (GPU benchmarking), or BenchTune (parameter auto-tuning) |
GpuLayersMode | models | GPU offloading: Auto, Specific(n), or All |
SearchSort | models | Search result sort order: Relevance, Downloads, Likes, Trending, Created |
CacheType | models | Main KV cache data type: F16, BF16, Fq8_0, Fq4_1 |
CacheQuantType | models | KV cache data type for quantization (F32, F16, BF16, Q8_0, Q5_0, Q5_1, Q4_0, Q4_1, Iq4Nl) |
CacheTypeK / CacheTypeV | models | Type aliases for CacheQuantType (used for keys and values) |
SplitMode | models | Multi-GPU split mode: None, Layer, Row, Tensor |
NumMode | models | NUMA optimization: None, Distribute, Isolate, Numactl |
RopeScaling | models | RoPE frequency scaling: None, Linear, Yarn |
Mirostat | models | Mirostat version: Off, Mirostat, Mirostat2 |
LoadingPhase | app | Phase of model loading (used internally by the TUI) |
LoadProgress | models | Load progress with layers_total, layers_loaded, tensors_loaded |
Samplers | models | Semicolon-separated sampler order string |
BenchTuneMode | benchmark | Benchmark mode: RuntimeOnly or Full |
BenchTuneStatus | benchmark | Status: Running, Completed, PartiallyCompleted, Cancelled, or Error |
Main Modules
backend::hub
HuggingFace API integration.
#![allow(unused)]
fn main() {
/// Search models on HuggingFace.
pub async fn search_models(query: &str, limit: u32, offset: u32) -> Result<(Vec<SearchResult>, usize)>
/// List all GGUF files for a model.
pub async fn list_gguf_files(model_id: &str) -> Result<Vec<(String, u64, String)>>
/// Fetch the README for a model from HuggingFace.
pub async fn fetch_readme(model_id: &str) -> Result<String>
/// Download a file with progress tracking.
pub async fn download_file(
model_id: &str,
filename: &str,
url: &str,
dest: &Path,
progress: &mut DownloadState,
download_state: Arc<AtomicU8>,
tx: broadcast::Sender<DownloadState>,
) -> Result<()>
/// Get available free disk space in bytes for a given path.
pub fn get_free_space_bytes(path: &Path) -> u64
/// Resolve the llama-server binary path for a given backend.
/// Downloads the binary from GitHub releases if not already cached.
pub async fn resolve_backend_binary(
backend: Backend,
tag: Option<&str>,
log_tx: Option<mpsc::Sender<String>>,
progress_tx: Option<tokio::sync::broadcast::Sender<crate::models::DownloadState>>,
) -> Result<PathBuf>
}
backend::server
llama.cpp server process management.
#![allow(unused)]
fn main() {
/// Manages a single llama.cpp server process.
pub struct ServerHandle {
pub port: u16,
pub host: String,
pub pid: u32,
pub kill_tx: mpsc::Sender<()>,
}
/// Build the full llama-server command line from settings.
pub fn build_server_cmd(
binary: &Path,
model: Option<&DiscoveredModel>,
settings: &ModelSettings,
config: &Config,
server_mode: ServerMode,
router_max_models: u32,
) -> (Command, String)
/// Request to spawn a llama.cpp server process.
pub struct SpawnServerRequest<'a> {
pub config: &'a Config,
pub model: Option<&'a DiscoveredModel>,
pub settings: &'a ModelSettings,
pub log_tx: mpsc::Sender<String>,
pub progress_tx: Option<tokio::sync::broadcast::Sender<DownloadState>>,
pub server_mode: ServerMode,
pub router_max_models: u32,
pub exit_tx: mpsc::Sender<()>,
}
/// Spawn a llama.cpp server process.
pub async fn spawn_server(request: SpawnServerRequest) -> Result<(ServerHandle, String), String>
/// Check if the server is healthy and responsive.
pub async fn check_health(host: &str, port: u16) -> bool
/// Kill a running server.
pub async fn kill_server(handle: ServerHandle) -> Result<(), String>
/// Poll metrics from the server.
pub async fn get_metrics(
host: &str,
port: u16,
model_name: Option<&str>,
pid: Option<u32>,
) -> Result<ServerMetrics, String>
/// Load a model via the llama-server Router API.
pub async fn load_model(host: &str, port: u16, model_id: &str, model_path: Option<&str>) -> Result<(), String>
/// List all models and their status from the llama-server Router API.
pub async fn list_models(host: &str, port: u16) -> Result<Vec<(String, String, Option<String>)>, String>
/// Unload a model via the llama-server Router API.
pub async fn unload_model(host: &str, port: u16, model_id: &str, model_path: Option<&str>) -> Result<(), String>
}
config
Configuration loading and saving.
#![allow(unused)]
fn main() {
/// Global configuration.
pub struct Config {
pub models_dirs: Vec<PathBuf>,
pub llama_server: PathBuf,
pub default: DefaultParams,
pub model_overrides: ModelConfigStore,
pub profiles: ProfileStore,
pub system_prompt_presets: PresetStore,
pub rpc_workers: Vec<RpcWorker>,
pub ws_server: WsServer,
pub search_limit: u32,
}
/// A remote RPC worker for distributed inference.
pub struct RpcWorker {
pub selected: bool,
pub name: String,
pub ip: String,
pub port: u16,
}
/// A named profile of settings.
pub struct Profile {
pub name: String,
pub description: String,
pub settings: ModelOverride,
}
/// A named system prompt preset.
pub struct SystemPromptPreset {
pub name: String,
pub description: String,
pub content: String,
}
/// Per-model settings override (optional fields).
pub struct ModelOverride {
pub context_length: Option<u32>,
pub threads: Option<u32>,
pub temperature: Option<f32>,
// ... 50+ optional fields
}
/// Built-in profiles with sensible defaults.
pub fn builtin_profiles() -> Vec<Profile>
/// Built-in system prompt presets.
pub fn builtin_system_prompt_presets() -> Vec<SystemPromptPreset>
}
backend::ws_server
WebSocket dashboard server.
#![allow(unused)]
fn main() {
pub struct WsAppState {
pub metrics_rx: Arc<broadcast::Receiver<WsMetrics>>,
pub auth_key: Option<String>,
}
pub async fn start_ws_server(
port: u16,
metrics_rx: Arc<broadcast::Receiver<WsMetrics>>,
auth_key: Option<String>,
tls_config: Option<axum_server::tls_rustls::RustlsConfig>,
host: String,
) -> Result<JoinHandle<()>>
pub fn stop_ws_server(handle: JoinHandle<()>)
}
backend::benchmark
Benchmark tuning system.
#![allow(unused)]
fn main() {
/// Configuration for a benchmark run.
pub struct BenchTuneConfig {
pub model_path: PathBuf,
pub num_iterations: u32,
pub prompt: String,
pub params: Vec<BenchTuneParam>,
pub duration: Duration,
pub mode: BenchTuneMode,
pub n_predict: usize,
pub chat_template_kwargs: Option<String>,
}
/// A tunable parameter for benchmarking.
pub struct BenchTuneParam {
pub name: String,
pub min: f64,
pub max: f64,
pub step: f64,
pub enabled: bool,
}
/// Actual parameter values for a benchmark run.
pub struct BenchTuneParamValue {
pub temperature: Option<f64>,
pub top_p: Option<f64>,
pub top_k: Option<i64>,
pub repeat_penalty: Option<f64>,
pub context_length: Option<u32>,
pub batch_size: Option<u32>,
pub flash_attn: Option<bool>,
pub threads: Option<u32>,
pub expert_count: Option<i32>,
pub spec_type: Option<String>,
pub draft_tokens: Option<u32>,
}
/// Results from a benchmark run.
pub struct BenchTuneResult {
pub params: BenchTuneParamValue,
pub metrics: BenchTuneMetrics,
pub outputs: Vec<String>,
pub per_iteration_metrics: Vec<BenchTuneMetrics>,
pub base_settings: Option<ModelSettings>,
}
/// Metrics from a benchmark run.
pub struct BenchTuneMetrics {
pub prompt_tps: f64,
pub generation_tps: f64,
pub combined_tps: f64,
pub latency_per_token: f64,
pub first_token_time: f64,
}
}
models
Domain types and utilities.
#![allow(unused)]
fn main() {
/// Estimate VRAM usage (in MiB) for a model with the given settings.
pub fn estimate_vram_mib(
model_mib: u64,
settings: &ModelSettings,
total_layers: u32,
hidden_size_opt: Option<u32>,
n_head_opt: Option<u32>,
n_kv_head_opt: Option<u32>,
gpu_mem_total_mib: u64,
) -> u64
/// Format bytes as MB or GB.
pub fn format_mib(mib: u64) -> String
}
Configuration
Configuration is stored in ~/.config/llm-manager/config.yaml and loaded via Config::load(). The config file structure:
models_dirs:
- ~/.local/share/llm-manager/models
llama_server: llama-server
default:
context_length: 32096
threads: <physical cores>
# ... more default parameters
llama_cpp_version_cpu: null
llama_cpp_version_vulkan: null
llama_cpp_version_rocm: null
llama_cpp_version_rocm_lemonade: null
llama_cpp_version_cuda: null
model_overrides:
# Per-model configs stored as individual YAML files in ~/.config/llm-manager/models/
model.gguf:
temperature: 0.7
gpu_layers: 32
profiles:
- name: Qwen
description: Optimized for Qwen models
settings:
temperature: 0.6
top_k: 20
rpc_workers:
- name: Remote-GPU-1
ip: 192.168.1.50
port: 50052
selected: true
system_prompt_presets:
- name: General
description: General-purpose assistant
content: "You are a helpful assistant."
Built-in profiles are merged on load, so adding new ones in code automatically appears in the UI.