# Architecture
LLM Manager is a Rust application built on ratatui and crossterm, using tokio for async operations. The codebase is organized into several modules:
```
src/
├── main.rs        # Entry point, event loop, model discovery, metrics polling
├── config.rs      # Config loading/saving, YAML-based, profiles, presets
├── models.rs      # Domain types (SearchResult, DownloadState, ModelSettings, etc.)
├── serve.rs       # Standalone serve mode CLI (--model, --profile, --api-port, --api-key)
├── serve_api.rs   # Axum-based API proxy server for serve mode
├── backend/
│   ├── hub.rs        # HuggingFace API: search, list files, download
│   ├── server.rs     # llama.cpp server spawning (resolve_backend_binary, spawn_server)
│   ├── benchmark.rs  # Benchmark tuning system (RuntimeOnly and Full modes)
│   ├── hardware.rs   # GPU detection (AMD/NVIDIA/Intel), platform detection
│   ├── tls.rs        # TLS certificate generation for secure connections
│   └── ws_server.rs  # WebSocket metrics server
└── tui/
    ├── mod.rs     # Module declaration, format_size/format_number helpers
    ├── app/       # App state (types.rs, async_ops.rs, sync_ops.rs, state.rs, metadata.rs,
    │              # profiles.rs, pickers.rs, panels.rs, help.rs)
    ├── event/     # Keyboard/mouse event handling (benches.rs, helpers.rs, key.rs, mouse.rs,
    │              # panel/, readme.rs)
    ├── render.rs  # Top-level render dispatcher (hints.rs, overlays.rs, status.rs)
    └── panel/     # Individual panel render functions
        ├── mod.rs
        ├── models.rs                 # Left panel: model list / search / download
        ├── info.rs                   # GGUF metadata rendering
        ├── tabbed.rs                 # Right panel: Model Info / Settings tabs
        ├── settings.rs
        ├── log.rs
        ├── help.rs
        ├── active.rs                 # Active model metrics panel
        ├── about.rs                  # About box
        ├── readme.rs                 # README rendering
        ├── rpc_workers.rs            # RPC workers manager
        ├── system_prompt_presets.rs  # System prompt presets
        ├── profiles.rs               # Profiles manager
        └── downloads.rs              # Download progress panel (rendered inline, not a separate module)
```
## App State Machine
The `App` struct in `src/tui/app/` holds all application state. The main state machine is controlled by `models_mode`:
```rust
pub enum ModelsMode {
    List, // Local model list
    // Field types are omitted in this listing.
    Search { query, results, sort_by, show_readme, loading, has_more, page },
    Files { model_id, files, selected_idx, previous_query, previous_results, selected_result },
    BenchTune, // Benchmark tuning mode showing results table
}
```
Each mode controls rendering in `render.rs` and key handling in `event/`. The `GlobalMode` enum handles overlays that appear above all panels:
```rust
pub enum GlobalMode {
    Normal,
    CmdLine { cmd_line: String },
    HostPicker { entries: Vec<(String, String)>, selected: usize },
    BackendPicker { entries: Vec<(Backend, Option<String>)>, selected: usize },
    Confirmation { selected: bool, kind: ConfirmationKind },
    RpcManager,
    About,
    MaxConcurrentPicker { value: String },
    SpecTypePicker { entries: Vec<String>, selected: usize },
    YarnRoPESettings {
        scale: String,
        freq_base: String,
        freq_scale: String,
        selected_field: i32,
        editing: bool,
        edit_buffer: String,
        edit_cursor_pos: usize,
    },
    // Field types are omitted for the variants below.
    BenchTuneSetup { config, selected_idx, bench_mode_selection, editing_prompt, editing_kwargs },
    PromptPicker { entries, selected, editing, edit_buffer, edit_cursor_pos, confirm_delete },
    ProfilePicker { entries, selected, profiles },
    DashboardPicker { enabled, port, auth_key, tls_enabled, tls_cert, tls_key, selected_field, editing, edit_buffer, edit_cursor_pos },
    DashboardUrl { host, port, auth_key, ws_enabled },
    SearchInput { buffer: String, cursor_pos: usize },
}
```
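To illustrate how an overlay mode can take priority over the panel mode, here is a minimal sketch of key routing. The trimmed-down enums, the `Action` type, the `route_key` function, and the `/` key binding are all hypothetical and are not taken from the codebase; the sketch only shows the shape of a two-level `match`.

```rust
// Hypothetical, trimmed-down stand-ins for the real enums.
#[derive(Debug, PartialEq)]
pub enum ModelsMode {
    List,
    Search { query: String },
    BenchTune,
}

#[derive(Debug, PartialEq)]
pub enum GlobalMode {
    Normal,
    About,
    Confirmation { selected: bool },
}

// Hypothetical result of handling one key press.
#[derive(Debug, PartialEq)]
pub enum Action {
    CloseOverlay,
    Confirm(bool),
    OpenSearch,
    BackToList,
    Ignore,
}

/// Overlays are consulted first; the panel mode only sees keys in `Normal`.
pub fn route_key(global: &GlobalMode, models: &ModelsMode, key: char) -> Action {
    match global {
        GlobalMode::About => Action::CloseOverlay,
        GlobalMode::Confirmation { selected } => match key {
            '\n' => Action::Confirm(*selected), // Enter confirms
            '\x1b' => Action::CloseOverlay,     // Esc cancels
            _ => Action::Ignore,
        },
        GlobalMode::Normal => match (models, key) {
            (ModelsMode::List, '/') => Action::OpenSearch, // assumed binding
            (ModelsMode::Search { .. }, '\x1b') | (ModelsMode::BenchTune, '\x1b') => {
                Action::BackToList
            }
            _ => Action::Ignore,
        },
    }
}
```

Keeping the overlay check outermost means a panel never receives a key while a dialog is open, which is one plausible reason to split the state into two enums.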
## Local Model Filter
The application supports real-time filtering of the local models list. Pressing `f` while the Models panel is focused opens the filter, which narrows large collections using case-insensitive substring matching.
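Case-insensitive substring matching can be sketched as follows. The function name `filter_models` and its signature are assumptions made for illustration; the real filter may operate on richer model entries.

```rust
/// Returns the names that contain `query`, ignoring case.
/// An empty query keeps every entry.
pub fn filter_models<'a>(names: &'a [String], query: &str) -> Vec<&'a str> {
    let needle = query.to_lowercase();
    names
        .iter()
        .map(String::as_str)
        .filter(|name| name.to_lowercase().contains(&needle))
        .collect()
}
```

Lowercasing both sides on every keystroke is cheap for a few hundred entries; a larger list might cache the lowercased names instead.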
## Model Discovery
The `discover_models()` function in `main.rs` recursively scans the models directory for `.gguf` files:
```rust
fn discover_models(dir: &Path) -> Vec<DiscoveredModel>
```
Each `DiscoveredModel` contains the file path, name, size, and display name (the path relative to the models directory). Discovery runs in a blocking task on startup.
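A recursive scan of this kind might look like the sketch below, which uses only the standard library. The field names of `DiscoveredModel`, the `walk` helper, the case-insensitive extension check, and the final sort are assumptions; only the `discover_models` signature comes from the text above.

```rust
use std::fs;
use std::path::{Path, PathBuf};

// Assumed field layout, based on the description above.
#[derive(Debug, Clone, PartialEq)]
pub struct DiscoveredModel {
    pub path: PathBuf,
    pub name: String,
    pub size: u64,
    pub display_name: String, // path relative to the models directory
}

pub fn discover_models(dir: &Path) -> Vec<DiscoveredModel> {
    let mut found = Vec::new();
    walk(dir, dir, &mut found);
    // Sorting is an addition here, to make the result order deterministic.
    found.sort_by(|a, b| a.display_name.cmp(&b.display_name));
    found
}

// Hypothetical helper: descends into `dir`, recording `.gguf` files.
fn walk(root: &Path, dir: &Path, found: &mut Vec<DiscoveredModel>) {
    // Unreadable directories are skipped silently in this sketch.
    let Ok(entries) = fs::read_dir(dir) else { return };
    for entry in entries.flatten() {
        let path = entry.path();
        if path.is_dir() {
            walk(root, &path, found);
            continue;
        }
        let is_gguf = path
            .extension()
            .and_then(|ext| ext.to_str())
            .map_or(false, |ext| ext.eq_ignore_ascii_case("gguf"));
        if is_gguf {
            let size = entry.metadata().map(|m| m.len()).unwrap_or(0);
            let name = path
                .file_name()
                .map(|n| n.to_string_lossy().into_owned())
                .unwrap_or_default();
            let display_name = path
                .strip_prefix(root)
                .unwrap_or(&path)
                .to_string_lossy()
                .into_owned();
            found.push(DiscoveredModel { path, name, size, display_name });
        }
    }
}
```

Because `read_dir` blocks, running a function like this inside a blocking task keeps the async event loop responsive on large model directories.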
## Download System
Downloads run in a spawned tokio task with progress flowing through a broadcast channel:
- User selects a file and presses `Enter`
- `pending_download` is set with `(model_id, filename, url, file_size)`
- Before starting, the app checks available disk space via `hub::get_free_space_bytes()` and warns if insufficient
- A tokio task calls `hub::download_file()` with an `Arc<AtomicBool>` cancel token and `Arc<AtomicU8>` state
- Progress updates flow through `download_tx` → `download_rx`
- The main loop polls `download_rx` each iteration and updates the Download panel
- Pressing `⌥C` (Alt+C) cancels the download and removes the temporary file; `p` pauses/resumes it

The download loop checks the state atomically each iteration: 1 = downloading, 2 = paused (sleeps 100ms and retries), 3 = cancelled (removes temp file, returns error). Each `DownloadState` tracks bytes downloaded, speed, ETA, destination path, and status (Downloading/Paused/Complete/Cancelled/Error).
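The state check described above can be sketched with a plain thread and an in-memory sink instead of tokio and a real file. The constant names, `run_download`, and the use of a `Vec<u8>` as the "temp file" are hypothetical; the separate `Arc<AtomicBool>` cancel token is left out for brevity.

```rust
use std::sync::atomic::{AtomicU8, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// State values as listed in the text; the constant names are assumptions.
pub const DOWNLOADING: u8 = 1;
pub const PAUSED: u8 = 2;
pub const CANCELLED: u8 = 3;

/// Copies `chunks` into `sink`, consulting `state` before each chunk.
/// Returns the number of bytes written, or an error if cancelled.
pub fn run_download(
    chunks: &[&[u8]],
    state: &Arc<AtomicU8>,
    sink: &mut Vec<u8>,
) -> Result<usize, String> {
    let mut next = 0;
    while next < chunks.len() {
        match state.load(Ordering::SeqCst) {
            // Paused: wait 100 ms, then re-check without advancing.
            PAUSED => thread::sleep(Duration::from_millis(100)),
            CANCELLED => {
                sink.clear(); // stands in for removing the temp file
                return Err("download cancelled".to_string());
            }
            // Any other value is treated as "downloading" in this sketch.
            _ => {
                sink.extend_from_slice(chunks[next]);
                next += 1;
            }
        }
    }
    Ok(sink.len())
}
```

A single atomic byte lets the UI thread pause, resume, or cancel by storing a new value, with no lock shared between the UI and the download task.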
## Server Spawning
When a model is loaded, `spawn_server()` in `backend/server.rs`:
- Resolves the llama-server binary using `resolve_backend_binary()`
- If the binary doesn't exist, downloads and extracts it from GitHub releases
- Spawns the process with the model path and all settings
- Sets up a log channel (`server_log_rx`) for parsing output

The main loop polls `server_log_rx` and parses log messages for:
- Loading phases (model, metadata, tensors)
- Error detection (OOM, crash)

Metrics (TPS, VRAM, context) are collected exclusively from the `/metrics` and `/health` API endpoints rather than from log parsing.
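The log channel can be pictured as a reader thread that forwards each output line to a receiver the main loop polls. The sketch below uses `std::sync::mpsc` and a generic `BufRead` source so it runs without a child process; the function name `forward_logs` is hypothetical, and the real application presumably uses tokio channels and the child's stdout/stderr.

```rust
use std::io::BufRead;
use std::sync::mpsc::{self, Receiver};
use std::thread;

/// Spawns a thread that sends every line from `reader` to the returned
/// receiver. The channel closes when the reader reaches end of input.
pub fn forward_logs<R: BufRead + Send + 'static>(reader: R) -> Receiver<String> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // Stop at the first read error or at end of input.
        for line in reader.lines().map_while(Result::ok) {
            if tx.send(line).is_err() {
                break; // receiver dropped, nobody is listening
            }
        }
    });
    rx
}
```

In a UI loop the receiver would be drained with `try_recv()` on each iteration so that rendering never blocks on server output.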
## Metrics & Logging
Metrics are collected from the `/metrics` and `/health` endpoints, which provide accurate real-time data. Loading completion is detected via the `/health` endpoint (polling for `"status": "ok"` and non-empty slots).

Each log entry is stored in `log_entries: VecDeque<LogEntry>` with a maximum of 500 entries. The log panel supports scrolling, expansion (`Enter`/`Esc`), and two modes: Following (auto-scroll to bottom) and Manual (free scroll). Press `f` to toggle modes.
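A bounded log buffer of this kind is typically a `VecDeque` that drops its oldest entry once the cap is reached. In the sketch below, the `LogEntry` field, the constant name, and `push_log` are assumptions; only the 500-entry limit comes from the text.

```rust
use std::collections::VecDeque;

pub const MAX_LOG_ENTRIES: usize = 500;

// Assumed minimal shape; the real LogEntry likely carries more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct LogEntry {
    pub message: String,
}

/// Appends a message, evicting the oldest entry when the buffer is full.
pub fn push_log(entries: &mut VecDeque<LogEntry>, message: impl Into<String>) {
    if entries.len() >= MAX_LOG_ENTRIES {
        entries.pop_front();
    }
    entries.push_back(LogEntry { message: message.into() });
}
```

Both `pop_front` and `push_back` are constant-time on a `VecDeque`, so the cap costs nothing noticeable even under heavy log output.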
## Search
Search uses the HuggingFace API with `&filter=gguf` to only return GGUF models:
```rust
pub async fn search_models(query: &str, limit: u32, offset: u32) -> Result<(Vec<SearchResult>, bool)>
```
A post-filter checks that the `model_id` contains the search query (case-insensitive), since the HF API does full-text search across descriptions/tags and can return unrelated models.

Multi-word search: the query is split on spaces and every word must match the model name (AND logic). Matching words are highlighted in cyan in the results list.
- Default: 70 results per page (max 200)
- Pagination: `Ctrl+B` goes back, `Down` at bottom loads more
- Sort order cycles: Relevance → Downloads → Likes → Trending → Created
- README fetching: `Ctrl+Shift+R` downloads and renders the model's README
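The post-filter and the multi-word AND logic described above can be combined into one predicate. The name `matches_query` is hypothetical; the sketch assumes that matching is done against the full `model_id` string.

```rust
/// True if every whitespace-separated word of `query` occurs in `model_id`,
/// ignoring case. An empty query matches everything.
pub fn matches_query(model_id: &str, query: &str) -> bool {
    let haystack = model_id.to_lowercase();
    query
        .split_whitespace()
        .all(|word| haystack.contains(&word.to_lowercase()))
}
```

For example, the query `llama 7b` keeps `TheBloke/Llama-2-7B-GGUF` but drops a model whose ID only contains one of the two words.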
## VRAM Estimation
The `estimate_vram_mib()` function in `src/models.rs` estimates VRAM usage:
```
total = model_vram + kv_cache + activation + fixed_overhead + 550
```
Where:
- `model_vram`: proportional to GPU layers loaded
- `kv_cache`: `2 * n_layer * n_ctx * n_embd_kv * sizeof(type)` with GQA ratio and FlashAttention factor
- `activation`: proportional to batch size and hidden size
- `fixed_overhead`: 3.8% of max VRAM (or 500 MiB if unknown)
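As a worked example of the formula, the sketch below computes the KV cache and fixed overhead terms and sums the total. The `VramInputs` struct and the helper names are hypothetical, the model and activation terms are taken as given inputs, the GQA and FlashAttention factors are omitted, and the trailing 550 is assumed to be in MiB like the other terms.

```rust
// Hypothetical input bundle; the real function takes its own arguments.
pub struct VramInputs {
    pub model_vram_mib: f64,
    pub activation_mib: f64,
    pub n_layer: u64,
    pub n_ctx: u64,
    pub n_embd_kv: u64,
    pub kv_type_bytes: u64,        // sizeof(type), e.g. 2 for f16
    pub max_vram_mib: Option<f64>, // None when the GPU size is unknown
}

/// 2 * n_layer * n_ctx * n_embd_kv * sizeof(type), converted to MiB.
pub fn kv_cache_mib(i: &VramInputs) -> f64 {
    let bytes = 2 * i.n_layer * i.n_ctx * i.n_embd_kv * i.kv_type_bytes;
    bytes as f64 / (1024.0 * 1024.0)
}

/// 3.8% of max VRAM, or 500 MiB when max VRAM is unknown.
pub fn fixed_overhead_mib(max_vram_mib: Option<f64>) -> f64 {
    max_vram_mib.map_or(500.0, |max| max * 0.038)
}

pub fn estimate_vram_mib(i: &VramInputs) -> f64 {
    i.model_vram_mib
        + kv_cache_mib(i)
        + i.activation_mib
        + fixed_overhead_mib(i.max_vram_mib)
        + 550.0 // constant term from the formula, assumed MiB
}
```

With 32 layers, a 4096-token context, `n_embd_kv` of 1024, and a 2-byte cache type, the KV cache term is 2 × 32 × 4096 × 1024 × 2 bytes = 512 MiB.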
## Loading Progress
Model loading phases are detected from llama.cpp log output:
| Phase | Log pattern | Weight |
|---|---|---|
| ServerStarting | (implicit) | 8% |
| LoadingModel | “LLAMA_MODEL_LOADER” / “LOADING MODEL” | 7% |
| LoadingMeta | “LOADED META” / “META DATA” | 7% |
| LoadingTensors | “LOAD_TENSORS:” | 70% |
| ServerListening | “SERVER LISTENING” | 8% |
| Complete | Detected via /health API polling | — |
During tensor loading, the progress bar is refined using layer counts parsed from `offloaded X/Y layers` log messages.
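The table and the layer-count refinement can be turned into a small progress calculation. Everything in this sketch is an assumption layered on the table: the function names, matching against the uppercased line, checking later phases first (a line such as `llama_model_loader: loaded meta data` contains two patterns), and treating each weight as the share of the bar a phase fills while it runs.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum LoadPhase {
    ServerStarting,
    LoadingModel,
    LoadingMeta,
    LoadingTensors,
    ServerListening,
    Complete,
}

/// Maps a log line to the phase it announces, most advanced phase first.
pub fn detect_phase(line: &str) -> Option<LoadPhase> {
    let upper = line.to_uppercase();
    if upper.contains("SERVER LISTENING") {
        Some(LoadPhase::ServerListening)
    } else if upper.contains("LOAD_TENSORS:") {
        Some(LoadPhase::LoadingTensors)
    } else if upper.contains("LOADED META") || upper.contains("META DATA") {
        Some(LoadPhase::LoadingMeta)
    } else if upper.contains("LLAMA_MODEL_LOADER") || upper.contains("LOADING MODEL") {
        Some(LoadPhase::LoadingModel)
    } else {
        None
    }
}

/// Extracts `(X, Y)` from a line containing "offloaded X/Y layers".
pub fn parse_offloaded(line: &str) -> Option<(u32, u32)> {
    let rest = line.split("offloaded ").nth(1)?;
    let (done, total) = rest.split_whitespace().next()?.split_once('/')?;
    Some((done.parse().ok()?, total.parse().ok()?))
}

/// Percent complete: weights of finished phases plus the tensor fraction.
pub fn progress_percent(phase: LoadPhase, layers: Option<(u32, u32)>) -> f64 {
    match phase {
        LoadPhase::ServerStarting => 0.0,
        LoadPhase::LoadingModel => 8.0,
        LoadPhase::LoadingMeta => 15.0, // 8 + 7
        LoadPhase::LoadingTensors => {
            let fraction = match layers {
                Some((done, total)) if total > 0 => {
                    f64::from(done.min(total)) / f64::from(total)
                }
                _ => 0.0,
            };
            22.0 + 70.0 * fraction // 8 + 7 + 7, then up to 70 more
        }
        LoadPhase::ServerListening => 92.0,
        LoadPhase::Complete => 100.0,
    }
}
```

Under these assumptions, a line reporting 16 of 32 layers offloaded puts the bar at 22 + 70 × 0.5 = 57%.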
## RPC Workers
Remote workers for distributed inference are stored in the config as `Vec<RpcWorker>`. Each worker has a name, IP address, and port (default: 50052). The `RpcManager` global mode provides a dedicated window for managing workers: add (`n`), edit (`e`), delete (`d`), toggle selection (`Space`).
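A worker record with the default port might be modelled as below. The field names, the `selected` flag, and both helper functions are hypothetical; the comma-separated endpoint string is only an assumed way of handing the selected workers to the backend.

```rust
pub const DEFAULT_RPC_PORT: u16 = 50052;

// Assumed shape; the real RpcWorker is defined in the config module.
#[derive(Debug, Clone, PartialEq)]
pub struct RpcWorker {
    pub name: String,
    pub ip: String,
    pub port: u16,
    pub selected: bool, // toggled with Space in the manager (assumed field)
}

impl RpcWorker {
    /// Creates an unselected worker on the default port.
    pub fn new(name: &str, ip: &str) -> Self {
        Self {
            name: name.to_string(),
            ip: ip.to_string(),
            port: DEFAULT_RPC_PORT,
            selected: false,
        }
    }

    /// Formats the worker as `ip:port`.
    pub fn endpoint(&self) -> String {
        format!("{}:{}", self.ip, self.port)
    }
}

/// Joins the endpoints of all selected workers with commas.
pub fn selected_endpoints(workers: &[RpcWorker]) -> String {
    workers
        .iter()
        .filter(|w| w.selected)
        .map(RpcWorker::endpoint)
        .collect::<Vec<_>>()
        .join(",")
}
```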
## Benchmark Tuning
The benchmark system (`src/backend/benchmark.rs`) supports two modes:
- RuntimeOnly: Single server, params sent in request body (no server restarts)
- Full: New server spawned for each parameter combination

Key types:
- `BenchTuneConfig`: Model path, iterations, prompt, params to test, duration, mode
- `BenchTuneParam`: name, min, max, step, enabled
- `BenchTuneResult`: params, metrics (prompt_tps, generation_tps, combined_tps, latency_per_token, first_token_time), outputs, per-iteration metrics
- `BenchTuneStatus`: Running (with progress), Completed (with stats), PartiallyCompleted (with stats), Cancelled (with stats)
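One way to read the `min`/`max`/`step` fields is as a sweep whose enabled parameters are combined into a grid, with one run per combination. That interpretation is an assumption, as are the `f64` field types and the helper names `param_values` and `combinations`; the sketch below only demonstrates the grid expansion.

```rust
// Assumed field types; the real struct may use integers or an enum of kinds.
#[derive(Debug, Clone)]
pub struct BenchTuneParam {
    pub name: String,
    pub min: f64,
    pub max: f64,
    pub step: f64,
    pub enabled: bool,
}

/// Values from `min` to `max` inclusive in increments of `step`.
/// A non-positive step or an inverted range yields just `min`.
pub fn param_values(p: &BenchTuneParam) -> Vec<f64> {
    if p.step <= 0.0 || p.max < p.min {
        return vec![p.min];
    }
    // The small epsilon guards against floating-point undershoot.
    let count = ((p.max - p.min) / p.step + 1e-9).floor() as usize + 1;
    (0..count).map(|i| p.min + i as f64 * p.step).collect()
}

/// Cartesian product over all enabled parameters.
pub fn combinations(params: &[BenchTuneParam]) -> Vec<Vec<(String, f64)>> {
    let mut combos: Vec<Vec<(String, f64)>> = vec![Vec::new()];
    for p in params.iter().filter(|p| p.enabled) {
        let mut next = Vec::new();
        for combo in &combos {
            for value in param_values(p) {
                let mut extended = combo.clone();
                extended.push((p.name.clone(), value));
                next.push(extended);
            }
        }
        combos = next;
    }
    combos
}
```

The grid grows multiplicatively with each enabled parameter, which is why restarting a server per combination (Full mode) is much slower than sending parameters per request (RuntimeOnly mode).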
## Error Handling
Errors are detected from log patterns:
- OOM: `OUTOFDEVICEMEMORY` / `OUT OF MEMORY`
- General error: `ERROR`, `FAILED TO LOAD`, `EXCEPTION`

Server exit is detected via a dedicated channel (not log parsing). On error, affected models are marked as Failed with the error message.
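The pattern list above translates into a small classifier. The `LogError` enum and `classify_error` are hypothetical names, and matching against the uppercased line is an assumption suggested by the uppercase patterns. The OOM check runs first because an out-of-memory message may also contain the word `ERROR`.

```rust
#[derive(Debug, PartialEq)]
pub enum LogError {
    OutOfMemory,
    General,
}

/// Classifies a server log line, or returns None if it looks harmless.
pub fn classify_error(line: &str) -> Option<LogError> {
    let upper = line.to_uppercase();
    if upper.contains("OUTOFDEVICEMEMORY") || upper.contains("OUT OF MEMORY") {
        Some(LogError::OutOfMemory)
    } else if ["ERROR", "FAILED TO LOAD", "EXCEPTION"]
        .iter()
        .any(|pattern| upper.contains(*pattern))
    {
        Some(LogError::General)
    } else {
        None
    }
}
```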
## Confirmation Dialogs
Destructive actions trigger a `GlobalMode::Confirmation` overlay with `ConfirmationKind` variants: `Exit`, `Reset`, `Delete`, `Unload`, `DeleteBackend`. The user confirms with `Enter` or cancels with `Esc`.