transformers.js inference server
use open-weight models on mac, linux & win via an OpenAI-compatible api, built in TypeScript
setup
install it globally
or use npx
run the server
the server runs at http://127.0.0.1:8000 and exposes an OpenAI-compatible API, so any OpenAI client works out of the box.
wandler --llm LiquidAI/LFM2.5-1.2B-Instruct-ONNX
here is every flag wandler accepts:
org/repo[:precision]
wandler · baseline: transformersjs
auto · options: auto, webgpu, cpu, wasm
8000
127.0.0.1
WANDLER_API_KEY
*1120000
info · options: debug, info, warn, error
~/.cache/huggingface (standard HuggingFace cache, also respects HF_HOME)
auto (640MB GPU attention budget) · use auto:<mb> to tune or 0/off to disable
auto uses transformers.js generate() · use on to opt into the experimental loop
true251208
precision suffixes: q1, q2, q4 (default), q8, fp16, fp32.
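to make the org/repo[:precision] format concrete, here is a minimal sketch of how such a value could be split on the client side. parseModelSpec is a hypothetical helper written for this example, not wandler's own parser; it only assumes the suffix list and the q4 default stated above.

```typescript
// Precision suffixes from the flag reference; q4 is the documented default.
const PRECISIONS = ["q1", "q2", "q4", "q8", "fp16", "fp32"] as const;
type Precision = (typeof PRECISIONS)[number];

interface ModelSpec {
  repo: string;
  precision: Precision;
}

// Hypothetical helper: split "org/repo[:precision]" into repo and precision.
function parseModelSpec(spec: string): ModelSpec {
  const idx = spec.lastIndexOf(":");
  const repo = idx === -1 ? spec : spec.slice(0, idx);
  const suffix = idx === -1 ? "q4" : spec.slice(idx + 1);
  if (!/^[^/\s]+\/[^/\s]+$/.test(repo)) {
    throw new Error(`expected org/repo, got "${repo}"`);
  }
  if (!(PRECISIONS as readonly string[]).includes(suffix)) {
    throw new Error(`unknown precision "${suffix}"`);
  }
  return { repo, precision: suffix as Precision };
}

console.log(parseModelSpec("LiquidAI/LFM2.5-1.2B-Instruct-ONNX:q8"));
```

with no suffix the helper falls back to q4, and an unlisted suffix such as q3 throws instead of being passed through silently.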
discover models
list every model in the wandler registry with type, size, precision and capabilities
filter by type with --type llm, --type embedding, or --type stt.
benchmarks
WebGPU · q4 quantization · 10 runs per scenario · tested on m3 pro 18gb
| Model | Params | Weights | Context | tok/s | TTFT | Load | Capabilities |
|---|---|---|---|---|---|---|---|
| | 350M | ~200 MB | 128K | 248 | 16ms | 0.5s | text |
| | 1.2B | ~700 MB | 128K | 118 | 34ms | 1.7s | text, tools |
| | 0.8B | ~500 MB | 256K | 37 | 276ms | 1.8s | text, tools |
| | 4B | ~2.5 GB | 128K | 20 | 636ms | 13.4s | text, tools, vision |
| | 2B | ~1.2 GB | 128K | 12 | 890ms | 7.0s | text, tools, vision |
these are the ones we tested. any transformers.js-compatible model on Hugging Face works.
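to read the table in terms of end-to-end latency, a rough estimate is TTFT plus output tokens divided by tok/s. the sketch below assumes tok/s measures decode speed after the first token, which the table does not state, and the 256-token output length is an arbitrary example. estimateLatencyMs is a made-up name.

```typescript
// Rough estimate: time to first token plus steady-state decoding time.
// Assumption: tok/s is measured after the first token has been produced.
function estimateLatencyMs(
  ttftMs: number,
  tokensPerSecond: number,
  outputTokens: number,
): number {
  if (tokensPerSecond <= 0) throw new Error("tokensPerSecond must be positive");
  return ttftMs + (outputTokens / tokensPerSecond) * 1000;
}

// 1.2B row: 34 ms TTFT, 118 tok/s; 256 output tokens is an example value.
const ms = estimateLatencyMs(34, 118, 256);
console.log(`${(ms / 1000).toFixed(1)}s`); // prints "2.2s"
```

model load time is not included; per the table it is paid once at startup, not per request.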
find more on Hugging Face
use it in your app
drop-in replacement for any OpenAI-compatible SDK
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8000/v1",
  // replace with your --api-key value if you started wandler with one
  apiKey: "changeme",
});

const res = await client.chat.completions.create({
  model: "LiquidAI/LFM2.5-1.2B-Instruct-ONNX",
  messages: [{ role: "user", content: "What is the capital of Germany" }],
  stream: true,
});

for await (const chunk of res) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
use it with your agent
point your agent to wandler. works with any agent that supports custom OpenAI endpoints
assuming your wandler server is running LiquidAI/LFM2.5-1.2B-Instruct-ONNX, configure Hermes via the CLI like this
replace model.default with the model slug you actually loaded in wandler.
hermes config set model.default LiquidAI/LFM2.5-1.2B-Instruct-ONNX
hermes config set model.provider custom
hermes config set model.base_url http://localhost:8000/v1
# if you started wandler with --api-key your-local-key
# hermes config set model.api_key your-local-key
or put the same settings into ~/.hermes/config.yaml
again, replace model.default if your server is running a different model.
model:
  default: "LiquidAI/LFM2.5-1.2B-Instruct-ONNX"
  provider: "custom"
  base_url: "http://localhost:8000/v1"
  # if you started wandler with --api-key your-local-key
  # api_key: "your-local-key"
API reference
/v1/responses
Create a model response with streaming, tool calling, and multi-turn input
input · string | array · required · Input text or array of message/function_call/function_call_output items
instructions · string · System-level instructions (replaces system message)
temperature · float · Sampling temperature, 0-2. Default 0.7
top_p · float · Nucleus sampling threshold. Default 0.95
max_output_tokens · int · Maximum tokens to generate
stream · boolean · Enable named SSE event streaming. Default false
tools · array · Function tools: {type, name, description, parameters}
  Flat format: name and parameters are top-level, not nested under a function key.
tool_choice · string | object · "auto", "none", "required", or {type: "function", name: "..."}
text · object · {"format": {"type": "json_object"}} for JSON mode
top_k · int · Top-k sampling
min_p · float · Minimum probability threshold
repetition_penalty · float · Repetition penalty, > 1.0 to penalize
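the flat tool format is easy to get wrong when tool definitions were written for chat completions first. as a sketch, the helper below converts the nested OpenAI chat-completions tool shape into the flat shape described for /v1/responses. toResponsesTool and get_weather are made up for this example, and passing model in the body is an assumption carried over from the chat example, since the parameter list above does not mention it.

```typescript
// Nested tool shape used by OpenAI-style chat completions requests.
interface ChatTool {
  type: "function";
  function: {
    name: string;
    description?: string;
    parameters: Record<string, unknown>;
  };
}

// Flat tool shape described for /v1/responses.
interface ResponsesTool {
  type: "function";
  name: string;
  description?: string;
  parameters: Record<string, unknown>;
}

// Lift name, description and parameters out of the nested "function" key.
function toResponsesTool(tool: ChatTool): ResponsesTool {
  const { name, description, parameters } = tool.function;
  return {
    type: "function",
    name,
    ...(description !== undefined ? { description } : {}),
    parameters,
  };
}

// get_weather is a hypothetical tool used only for illustration.
const chatTool: ChatTool = {
  type: "function",
  function: {
    name: "get_weather",
    description: "Look up the current weather for a city",
    parameters: {
      type: "object",
      properties: { city: { type: "string" } },
      required: ["city"],
    },
  },
};

// Request body for POST /v1/responses; "model" is an assumption (see above).
const body = {
  model: "LiquidAI/LFM2.5-1.2B-Instruct-ONNX",
  input: "What is the weather in Berlin?",
  tools: [toResponsesTool(chatTool)],
  tool_choice: "auto",
};

console.log(JSON.stringify(body, null, 2));
```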
/v1/chat/completions
Chat completion with streaming and tool calling
messages · array · required · Input messages with role and content
temperature · float · Sampling temperature, 0-2. Default 0.7
top_p · float · Nucleus sampling threshold. Default 0.95
max_tokens · int · Maximum tokens to generate
stream · boolean · Enable SSE streaming. Default false
stop · string | string[] · Stop sequences
  Only the final token of each stop string triggers stopping. Multi-token sequences are not matched exactly.
tools · array · Function calling tool definitions
  When set, streaming is emulated. The full response is generated first, then re-chunked as SSE.
response_format · object · {"type": "json_object"} for JSON mode
top_k · int · Top-k sampling
min_p · float · Minimum probability threshold
repetition_penalty · float · Repetition penalty, > 1.0 to penalize
stream_options · object · {"include_usage": true} for usage stats
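if you consume the stream without an SDK, the sketch below shows one way to collect text deltas and the usage block from an SSE body. it assumes wandler emits the same wire format as the OpenAI API (data: lines with JSON chunks, a final data: [DONE], and a usage-only chunk when include_usage is set). the sample is hand-written for illustration, not captured server output, and collectSse is a made-up name.

```typescript
interface ChatChunk {
  choices: { delta?: { content?: string | null } }[];
  usage?: {
    prompt_tokens: number;
    completion_tokens: number;
    total_tokens: number;
  } | null;
}

// Collect text deltas and the optional usage block from a complete SSE body.
function collectSse(raw: string): { text: string; usage: ChatChunk["usage"] } {
  let text = "";
  let usage: ChatChunk["usage"] = null;
  for (const line of raw.split(/\r?\n/)) {
    if (!line.startsWith("data:")) continue; // skip blank lines and comments
    const payload = line.slice(5).trim();
    if (payload === "" || payload === "[DONE]") continue; // end marker
    const chunk = JSON.parse(payload) as ChatChunk;
    text += chunk.choices[0]?.delta?.content ?? "";
    if (chunk.usage) usage = chunk.usage;
  }
  return { text, usage };
}

// Hand-written sample in the assumed OpenAI-style format.
const sample = [
  'data: {"choices":[{"delta":{"content":"Ber"}}]}',
  "",
  'data: {"choices":[{"delta":{"content":"lin"}}]}',
  "",
  'data: {"choices":[],"usage":{"prompt_tokens":7,"completion_tokens":2,"total_tokens":9}}',
  "",
  "data: [DONE]",
  "",
].join("\n");

console.log(collectSse(sample));
```

this version expects the whole body at once. a real streaming reader would also need to buffer partial lines across network chunks.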
/v1/completions
Text completion (legacy) with echo and suffix
prompt · string · required · Input text prompt
temperature · float · Sampling temperature, 0-2. Default 0.7
max_tokens · int · Maximum tokens to generate
stream · boolean · Enable SSE streaming. Default false
stop · string | string[] · Stop sequences
  Only the final token of each stop string triggers stopping. Multi-token sequences are not matched exactly.
echo · boolean · Echo the prompt in the response
suffix · string · Text to append after completion
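because multi-token stop sequences are not matched exactly on the server, a client that needs exact behaviour can trim the returned text itself. a minimal sketch, with truncateAtStop as a hypothetical helper name:

```typescript
// Cut text at the earliest exact occurrence of any stop string.
// Returns the text unchanged when no stop string occurs.
function truncateAtStop(text: string, stop: string | string[]): string {
  const stops = (Array.isArray(stop) ? stop : [stop]).filter((s) => s.length > 0);
  let cut = text.length;
  for (const s of stops) {
    const idx = text.indexOf(s);
    if (idx !== -1 && idx < cut) cut = idx;
  }
  return text.slice(0, cut);
}

console.log(truncateAtStop("Answer: 42\n\nQuestion: next", "\n\nQuestion:"));
// prints "Answer: 42"
```

this only removes text after an exact match. it cannot restore output lost when the server stops early on a final token that appears outside the full stop string.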
/v1/embeddings
Text embeddings for RAG and semantic search
input · string | string[] · required · Text to embed
encoding_format · string · "float" or "base64". Default "float"
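for semantic search, the usual next step is to compare a query embedding against stored document embeddings. the sketch below ranks documents by cosine similarity. it works on plain number arrays as returned with encoding_format "float", and the vectors are toy values, not real model output. cosineSimilarity and rankBySimilarity are names chosen for this example.

```typescript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface Doc {
  id: string;
  embedding: number[];
}

// Return documents sorted from most to least similar to the query.
function rankBySimilarity(query: number[], docs: Doc[]): { id: string; score: number }[] {
  return docs
    .map((d) => ({ id: d.id, score: cosineSimilarity(query, d.embedding) }))
    .sort((x, y) => y.score - x.score);
}

// Toy 2-dimensional vectors for illustration only.
const ranked = rankBySimilarity([1, 0], [
  { id: "orthogonal", embedding: [0, 1] },
  { id: "same-direction", embedding: [3, 0] },
]);
console.log(ranked.map((r) => r.id));
```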
/v1/models
List and inspect loaded models
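this endpoint is handy for finding the model slug your server actually loaded before wiring it into a client or agent. the sketch below assumes the response follows the OpenAI list shape ({ data: [{ id }] }), which is not confirmed above. pickModel and fetchModels are hypothetical helpers.

```typescript
// Assumed response shape, modelled on the OpenAI model list format.
interface ModelList {
  data: { id: string }[];
}

// Return the preferred id if the server has it loaded, else the first loaded id.
function pickModel(list: ModelList, preferred?: string): string {
  const ids = list.data.map((m) => m.id);
  if (ids.length === 0) throw new Error("server reports no loaded models");
  if (preferred !== undefined && ids.includes(preferred)) return preferred;
  return ids[0];
}

// Not called here: needs a running wandler server.
// baseUrl would be e.g. "http://localhost:8000/v1".
async function fetchModels(baseUrl: string): Promise<ModelList> {
  const res = await fetch(`${baseUrl}/models`);
  if (!res.ok) throw new Error(`GET /models failed: ${res.status}`);
  return (await res.json()) as ModelList;
}
```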
/v1/audio/transcriptions
Speech-to-text
file · binary · required · Audio file to transcribe
language · string · Language code (e.g. en, de)
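since file is binary, the request is presumably sent as multipart form data, as with the OpenAI transcription API. that is an assumption, not something stated above. the sketch builds the form with the two documented fields only; buildTranscriptionForm and transcribe are made-up names, and the response is left untyped because its shape is not documented here.

```typescript
// Build a multipart form with the two documented fields: file and language.
function buildTranscriptionForm(audio: Blob, filename: string, language?: string): FormData {
  const form = new FormData();
  form.append("file", audio, filename);
  if (language) form.append("language", language);
  return form;
}

// Not called here: needs a running wandler server with an stt model loaded.
// Content-Type is left unset so fetch can add the multipart boundary itself.
async function transcribe(baseUrl: string, form: FormData): Promise<unknown> {
  const res = await fetch(`${baseUrl}/audio/transcriptions`, {
    method: "POST",
    body: form,
  });
  if (!res.ok) throw new Error(`transcription failed: ${res.status}`);
  return res.json();
}
```

baseUrl would be http://localhost:8000/v1 for a default local server.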
/tokenize
Convert between text and token IDs
text · string · required · Text to tokenize