CLI Reference
The CLI entrypoint is vox-serve (installed via pip), which maps to
python -m vox_serve.launch.
Usage
vox-serve --model <model-name> --port 8000
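When the server is started from a script rather than a shell, the same command line can be assembled programmatically. The following is a minimal sketch, not part of vox-serve: the helper name `build_vox_serve_command` and its parameters are hypothetical, and it uses only flags listed in the Arguments section. It also checks the documented constraints on `--dp-size` and `--enable-disaggregation` before returning. The GPU count is supplied by the caller here rather than detected.

```python
import shlex


def build_vox_serve_command(model, port=8000, host="0.0.0.0", dp_size=None,
                            enable_disaggregation=False, num_gpus=1):
    """Return an argv list for vox-serve.

    Hypothetical helper for illustration; it is not shipped with vox-serve.
    num_gpus is an assumption supplied by the caller, not auto-detected.
    """
    if dp_size is not None:
        if dp_size < 1:
            raise ValueError("--dp-size must be >= 1")
        if enable_disaggregation:
            raise ValueError("--dp-size cannot be combined with --enable-disaggregation")
        if dp_size > num_gpus:
            raise ValueError("--dp-size must not exceed the number of available GPUs")
    if enable_disaggregation and num_gpus < 2:
        raise ValueError("--enable-disaggregation requires at least 2 GPUs")

    argv = ["vox-serve", "--model", model, "--host", host, "--port", str(port)]
    if dp_size is not None:
        argv += ["--dp-size", str(dp_size)]
    if enable_disaggregation:
        argv.append("--enable-disaggregation")
    return argv


argv = build_vox_serve_command("canopylabs/orpheus-3b-0.1-ft", port=8000)
# Print the shell-quoted command; nothing is launched here.
print(shlex.join(argv))
```

To actually start the server, the resulting list could be passed to `subprocess.run(argv)`. Building a list instead of a single string avoids shell quoting problems when the model argument is a local path containing spaces.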
Arguments
--model
    Model name or local path to load for inference. Default: canopylabs/orpheus-3b-0.1-ft.
--scheduler-type
    Scheduler backend implementation. One of: base, online, offline.
--async-scheduling
    Enable async scheduling mode, which overlaps request handling and scheduler work.
--host
    Bind address for the HTTP server. Default: 0.0.0.0.
--port
    TCP port for the HTTP server. Default: 8000.
--max-batch-size
    Maximum batch size used by the scheduler for inference.
--max-num-pages
    Maximum number of KV cache pages for the scheduler backend.
--page-size
    Size of each KV cache page (tokens per page).
--top-p
    Top-p (nucleus) sampling threshold. When set, tokens are sampled from the smallest set whose cumulative probability exceeds this value.
--top-k
    Top-k sampling threshold. When set, tokens are sampled from the k most likely candidates.
--min-p
    Min-p sampling threshold. Filters out tokens with probability below this value.
--temperature
    Sampling temperature to scale logits. Lower is more deterministic; higher is more random.
--max-tokens
    Maximum number of tokens to generate per request.
--repetition-penalty
    Penalize repeated tokens to reduce loops in generated output.
--repetition-window
    Window size for repetition penalty.
--cfg-scale
    Classifier-free guidance scale, where higher values strengthen conditioning.
--greedy
    Use greedy decoding (disables top-k/top-p/min-p/temperature sampling).
--enable-cuda-graph
    Enable CUDA graph optimization for the decode phase.
--disable-cuda-graph
    Disable CUDA graph optimization for the decode phase.
--enable-disaggregation
    Enable disaggregation mode (requires at least 2 GPUs).
--dp-size
    Enable data parallel mode with N replicas (N >= 1). Cannot be combined with --enable-disaggregation and requires N <= available GPUs.
--enable-nvtx
    Enable NVTX profiling for performance analysis.
--enable-torch-compile
    Enable torch.compile optimization for model inference.
--log-level
    Set log verbosity. One of: DEBUG, INFO, WARNING, ERROR, CRITICAL.
--socket-suffix
    Append a suffix to IPC socket paths to avoid conflicts when running multiple instances.
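
The sampling arguments above (temperature, top-k, top-p, min-p) each restrict or reshape the token distribution. The sketch below illustrates those descriptions in plain Python. It is not vox-serve's implementation: the function name `filter_probs`, the order in which the filters are applied, and the fallback to the single most likely token are all assumptions made for illustration. Min-p is applied as an absolute probability threshold, following the wording above; some implementations instead scale the threshold by the top token's probability.

```python
import math


def filter_probs(logits, temperature=1.0, top_k=None, top_p=None, min_p=None):
    """Return a renormalized distribution after the sampling filters.

    Illustrative only; filter order is an assumption. Requires temperature > 0.
    """
    # Temperature divides the logits before the softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract peak for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Token indices ordered from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order)

    if top_k is not None:
        # Keep only the k most likely tokens.
        keep &= set(order[:top_k])
    if top_p is not None:
        # Keep the smallest prefix whose cumulative probability exceeds top_p.
        nucleus, cumulative = set(), 0.0
        for i in order:
            nucleus.add(i)
            cumulative += probs[i]
            if cumulative > top_p:
                break
        keep &= nucleus
    if min_p is not None:
        # Drop tokens whose probability is below the threshold.
        keep &= {i for i in order if probs[i] >= min_p}
    if not keep:
        # Assumed fallback: never return an empty candidate set.
        keep = {order[0]}

    kept_mass = sum(probs[i] for i in keep)
    return [probs[i] / kept_mass if i in keep else 0.0 for i in range(len(probs))]


# Four tokens with probabilities 0.5, 0.3, 0.15, 0.05 at temperature 1.
logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
# With top_p=0.7 only the first two tokens survive (approximately 0.625 and 0.375).
print(filter_probs(logits, top_p=0.7))
```

Greedy decoding, as selected by `--greedy`, corresponds to skipping all of this and taking the index of the largest logit.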