Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Configuration

Directory Layout

llm-manager uses XDG directories for config and data:

~/.config/llm-manager/          # Config directory
├── config.yaml                 # Global settings
├── models/                     # Per-model YAML configs
│   └── qwen2.5-7b.yaml
├── profiles/                   # Per-profile YAML configs
│   └── my-profile.yaml
├── presets/                    # Per-preset YAML configs
│   └── custom-preset.yaml
├── unused/                     # Deleted model configs
├── unused_profiles/            # Deleted profiles
└── unused_presets/             # Deleted presets

~/.local/share/llm-manager/     # Data directory
├── models/                     # GGUF model files
│   └── qwen2.5-7b.Q4_K_M.gguf
└── bin/                        # llama-server binaries
    └── llama-server-cpu-...

Per-model configs are named <model_name>.yaml where model_name is the GGUF filename without the .gguf extension. Deleted configs are moved to unused/ subdirectories (recoverable).

Config File

The main config file is ~/.config/llm-manager/config.yaml. It is created automatically on first run with sensible defaults.

models_dirs:
  - ~/.local/share/llm-manager/models
llama_server: llama-server
default:
  context_length: 32096
  threads: <physical cores>
  threads_batch: 8
  batch_size: 512
  temperature: 0.8
  # ... more settings

You can specify a custom config path with --config:

cargo run -- --config /path/to/config.yaml

Default Parameters

ParameterDefaultDescription
context_length32096Context window size in tokens
threads(physical cores)CPU threads for generation
threads_batch8CPU threads for batch processing
batch_size512Logical maximum batch size
ubatch_size512Physical maximum batch size
keep0Keep N tokens from initial prompt
mlockfalseLock model weights in RAM
mmaptrueMemory-map the model
kv_cache_offloadtrueOffload KV cache to RAM
flash_attntrueEnable Flash Attention
temperature0.8Sampling temperature
top_k40Top-k sampling
top_p0.95Top-p sampling
repeat_penalty1.1Repetition penalty
backendauto-detectedDefault backend (auto-detected: Cuda for NVIDIA, Rocm for AMD, Vulkan for Intel; falls back to cpu). Options: cpu, vulkan, rocm, rocm-lemonade, cuda

Profiles

Profiles are named presets of settings. The built-in profiles are:

ProfileDescriptionKey Settings
QwenOptimized for Qwen modelstemp: 0.6, top-k: 20, context: 32768
GemmaOptimized for Gemma modelstemp: 0.8, min-p: 0.1, typical-p: 0.9
LlamaOptimized for Llama modelstemp: 0.7, repeat-penalty: 1.1
MistralOptimized for Mistral modelstemp: 0.7, top-k: 50
PhiOptimized for Phi modelstemp: 0.7, top-k: 50

User-defined profiles are stored as individual YAML files in ~/.config/llm-manager/profiles/<name>.yaml. Built-in profiles are auto-merged on load.

System Prompt Presets

System prompt presets define the initial system prompt. Built-in presets:

PresetDescription
General“You are a helpful assistant.”
CoderExpert software developer
ThinkerAnalytical and thoughtful
MathematicianExpert in mathematics

User-defined presets are stored as individual YAML files in ~/.config/llm-manager/presets/<name>.yaml. Built-in presets are auto-merged on load.

Backend Binaries

llama-server binaries are stored in ~/.local/share/llm-manager/bin/ with versioned directories:

~/.local/share/llm-manager/bin/
├── llama-server-cpu-{version}/llama-server
├── llama-server-vulkan-{version}/llama-server
├── llama-server-rocm-{version}/llama-server
├── llama-server-rocm-lemonade-{version}/llama-server
└── llama-server-cuda-{version}/llama-server

Binaries are downloaded from specialized repositories on first use:

Switching versions is instant — no re-download.

Per-backend Version Config

llama_cpp_version_cpu: null
llama_cpp_version_vulkan: null
llama_cpp_version_rocm: null
llama_cpp_version_rocm_lemonade: null
llama_cpp_version_cuda: null

Platform-specific backend variants (e.g. CpuArm64, CpuWindows, CpuMacosArm64) are handled through the Backend enum and platform field, not through separate version config keys. Each backend has its own independently configurable version.

Setting to null uses the latest release. Specific versions can be set via the version picker in LLM Settings. These selections are automatically persisted to your configuration and remembered across restarts.

Asset Names

Assets are selected based on the detected platform. Linux examples:

  • CPU (x64): llama-{tag}-bin-ubuntu-x64.tar.gz
  • CPU (ARM64): llama-{tag}-bin-ubuntu-arm64.tar.gz
  • Vulkan: llama-{tag}-bin-ubuntu-vulkan-x64.tar.gz
  • ROCm: llama-{tag}-bin-ubuntu-rocm-7.2-x64.tar.gz
  • ROCm Lemonade: llama-{tag}-ubuntu-rocm-{gfx}-x64.zip (auto-detects GPU architecture)
  • CUDA: llama.cpp-{tag}-cuda-12.8-amd64.tar.gz

Windows assets use *.zip (e.g. llama-{tag}-bin-win-cpu-x64.zip). macOS assets use *-macos-arm64.tar.gz or *-macos-x64.tar.gz.

Serve Mode

You can start a model directly from the command line without the TUI:

./build.sh serve --model /path/to/model.gguf

Options

OptionDescription
--modelPath to the GGUF model file
--profileApply a settings profile (e.g., qwen, llama)
--configPath to config file
--api-portStart API proxy on given port
--api-keyAPI key for Bearer token authentication
--ws-enableEnable WebSocket dashboard server
--ws-portPort for WebSocket dashboard server
--ws-authAuth key for WebSocket dashboard access
--hostBind address for the server (e.g., 0.0.0.0)
--backend-binaryPath to a custom llama-server binary
--log-fileLog file path (default: stdout)
--tls-enableEnable TLS for WebSocket dashboard
--tls-certPath to TLS certificate file
--tls-keyPath to TLS private key file
--threadsCPU threads
--contextContext length
--gpu-layersNumber of GPU layers

API Proxy

The API proxy forwards requests to the llama.cpp server and provides OpenAI-compatible and Anthropic-compatible endpoints. It supports SSE (Server-Sent Events) streaming for chat completions and other streaming endpoints, and CORS is enabled for all origins with GET/POST/PUT/DELETE/OPTIONS methods. When --api-key is set, all requests require Authorization: Bearer <key>.

API Endpoints

The API proxy explicitly handles the following endpoints, while all other paths are automatically proxied to the llama-server instance:

EndpointMethodDescription
/healthGETHealth check
/metricsGETPrometheus metrics
/v1/chat/completionsPOSTChat completions (OpenAI)
/v1/completionsPOSTCompletions (OpenAI)
/v1/embeddingsPOSTEmbeddings
/v1/modelsGETList models
/api/statusGETServer status (pid, uptime, loaded models)

The following endpoints are automatically proxied to llama-server (not explicitly handled):

EndpointMethodDescription
/v1/responsesPOSTResponses (Anthropic)
/v1/messagesPOSTMessages (Anthropic)
/v1/messages/count_tokensPOSTCount tokens (Anthropic)
/completionPOSTLegacy completion
/infillPOSTCode completion (FIM)
/rerankingPOSTRe-ranking
/tokenizePOSTTokenize text
/detokenizePOSTDetokenize tokens
/apply-templatePOSTApply chat template
/v1/healthGETHealth check (alias)
/propsGET/POSTGet/set server properties
/slotsGETSlot monitoring
/lora-adaptersGET/POSTList/load LoRA adapters
/models/loadPOSTLoad a model (router mode)
/models/unloadPOSTUnload a model (router mode)

Model Overrides

Settings can be saved per-model. Overrides are stored as individual YAML files in ~/.config/llm-manager/models/<name>.yaml (where name is the GGUF filename without .gguf). When a model is loaded, its override settings are merged into the defaults. Deleted configs are moved to ~/.config/llm-manager/unused/ for recovery.

RPC Workers

You can manage a list of remote llama-rpc-server nodes for distributed inference. These are stored in the rpc_workers list in the config:

rpc_workers:
  - selected: true
    name: "Worker 1"
    ip: "192.168.1.10"
    port: 50052

Workers can be managed via the RPC Workers window in the Server Settings panel. Selected workers are combined into the --rpc flag when starting the server.