# transformers.js inference server

OpenAI-compatible API · Mac, Linux & Windows

## setup
wandler is an OpenAI-compatible inference server powered by transformers.js.

install it globally and run it directly, or use npx to skip the install.
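assuming the package is published on npm under the name `wandler` (the npm package name here is an assumption), the two setups would be:

```shell
# global install, then run directly
npm install -g wandler
wandler --llm LiquidAI/LFM2.5-1.2B-Instruct-ONNX

# or run once via npx, no install
npx wandler --llm LiquidAI/LFM2.5-1.2B-Instruct-ONNX
```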
## run the server
pick a setup, run the command, and point any OpenAI client at the server:

```shell
wandler --llm LiquidAI/LFM2.5-1.2B-Instruct-ONNX
```

the server listens on http://127.0.0.1:8000 and speaks the OpenAI API, so any OpenAI client works out of the box.
here is every flag wandler accepts:
| setting | default | notes |
|---|---|---|
| model | (required) | `org/repo[:precision]`; precision suffixes: `q4` (default), `q8`, `fp16`, `fp32` |
| device | `auto` | options: `auto`, `webgpu`, `cpu`, `wasm` |
| port | `8000` | |
| host | `127.0.0.1` | |
| API key | unset | read from the `WANDLER_API_KEY` environment variable |
| context length | `2048` | |
| concurrency | `1` | |
| timeout | `120000` | milliseconds |
| log level | `info` | options: `debug`, `info`, `warn`, `error` |
| cache dir | `.cache/` | inside the `@huggingface/transformers` package, i.e. `node_modules/@huggingface/transformers/.cache/` |
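for example, the `:precision` suffix on the model argument picks a quantization at launch (a sketch using the model from the run command above):

```shell
# default q4 weights
wandler --llm LiquidAI/LFM2.5-1.2B-Instruct-ONNX

# same model with q8 weights via the :precision suffix
wandler --llm LiquidAI/LFM2.5-1.2B-Instruct-ONNX:q8
```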
## discover models
list every model in the wandler registry, with type, size, precision, and capabilities.

filter by type with `--type llm`, `--type embedding`, or `--type stt`.
## benchmarks

WebGPU · q4 quantization · 10 runs per scenario
| Model | Params | Weights | Context | tok/s | TTFT | Load | Capabilities |
|---|---|---|---|---|---|---|---|
| | 350M | ~200 MB | 128K | 248 | 16 ms | 0.5 s | text |
| | 1.2B | ~700 MB | 128K | 118 | 34 ms | 1.7 s | text, tools |
| | 0.8B | ~500 MB | 256K | 37 | 276 ms | 1.8 s | text, tools |
| | 4B | ~2.5 GB | 128K | 20 | 636 ms | 13.4 s | text, tools, vision |
| | 2B | ~1.2 GB | 128K | 12 | 890 ms | 7.0 s | text, tools, vision |
these are the models we tested. any transformers.js-compatible model on Hugging Face works.

find more on Hugging Face

## use it in your app
drop-in replacement for any OpenAI-compatible SDK
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "http://localhost:8000/v1",
apiKey: "-",
});
const res = await client.chat.completions.create({
model: "LiquidAI/LFM2.5-1.2B-Instruct-ONNX",
messages: [{ role: "user", content: "Hello!" }],
stream: true,
});
for await (const chunk of res) {
process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}use it with your agent
point your agent at wandler. it works with any agent that supports custom OpenAI endpoints.

set the base URL in `~/.hermes/config.yaml`:

```yaml
model:
  default: "LiquidAI/LFM2.5-1.2B-Instruct-ONNX"
  provider: "custom"
  base_url: "http://localhost:8000/v1"
  api_key: "-"
```

or configure it via the CLI.
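if your agent uses the official OpenAI SDKs, the endpoint can usually be supplied through the SDKs' standard environment variables instead (these names are OpenAI SDK conventions, not wandler-specific settings):

```shell
export OPENAI_BASE_URL="http://localhost:8000/v1"
export OPENAI_API_KEY="-"
```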
## API reference
### /v1/chat/completions

Chat completion with streaming and tool calling.

| Parameter | Type | Description |
|---|---|---|
| `messages` | array | **Required.** Input messages with role and content |
| `temperature` | float | Sampling temperature, 0-2. Default 0.7 |
| `top_p` | float | Nucleus sampling threshold. Default 0.95 |
| `max_tokens` | int | Maximum tokens to generate |
| `stream` | boolean | Enable SSE streaming. Default false |
| `stop` | string \| string[] | Stop sequences. Only the final token of each stop string triggers stopping; multi-token sequences are not matched exactly. |
| `tools` | array | Function calling tool definitions. When set, streaming is emulated: the full response is generated first, then re-chunked as SSE. |
| `response_format` | object | `{"type": "json_object"}` for JSON mode |
| `top_k` | int | Top-k sampling |
| `min_p` | float | Minimum probability threshold |
| `repetition_penalty` | float | Repetition penalty, > 1.0 to penalize |
| `stream_options` | object | `{"include_usage": true}` for usage stats |
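the emulated-streaming behavior noted for `tools` can be sketched with a hypothetical helper (not wandler's actual code): the complete response text is generated first, then sliced into delta chunks and framed as OpenAI-style SSE events.

```javascript
// hypothetical sketch of emulated streaming: slice the finished text into
// deltas and frame each as an OpenAI chat.completion.chunk SSE event.
function rechunkAsSSE(fullText, { model, chunkSize = 8 } = {}) {
  const events = [];
  for (let i = 0; i < fullText.length; i += chunkSize) {
    const payload = {
      object: "chat.completion.chunk",
      model,
      choices: [
        { index: 0, delta: { content: fullText.slice(i, i + chunkSize) }, finish_reason: null },
      ],
    };
    events.push(`data: ${JSON.stringify(payload)}\n\n`);
  }
  events.push("data: [DONE]\n\n"); // OpenAI streams end with a [DONE] sentinel
  return events;
}

const events = rechunkAsSSE("Hello from wandler!", {
  model: "LiquidAI/LFM2.5-1.2B-Instruct-ONNX",
});
```

from the client's point of view this is indistinguishable from true streaming, except that the time-to-first-token includes the full generation.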
### /v1/completions

Text completion (legacy) with echo and suffix.

| Parameter | Type | Description |
|---|---|---|
| `prompt` | string | **Required.** Input text prompt |
| `temperature` | float | Sampling temperature, 0-2. Default 0.7 |
| `max_tokens` | int | Maximum tokens to generate |
| `stream` | boolean | Enable SSE streaming. Default false |
| `stop` | string \| string[] | Stop sequences. Only the final token of each stop string triggers stopping; multi-token sequences are not matched exactly. |
| `echo` | boolean | Echo the prompt in the response |
| `suffix` | string | Text to append after completion |
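the stop-sequence caveat can be illustrated with a sketch (hypothetical token IDs, not wandler's implementation): if a stop string tokenizes to more than one token, only its last token is checked, so generation can halt even when the full stop string never appeared.

```javascript
// hypothetical final-token stop matching: stopTokens are the token IDs of a
// stop string, but only its LAST token is compared against the output stream.
function stopsAt(generatedTokens, stopTokens) {
  const finalStopToken = stopTokens[stopTokens.length - 1];
  return generatedTokens.findIndex((t) => t === finalStopToken); // -1 if never hit
}

// suppose the stop string "foo bar" tokenizes to [101, 202]
const stop = [101, 202];
stopsAt([5, 101, 202, 7], stop); // stops at " bar" following "foo", as intended
stopsAt([5, 202, 7], stop);      // also stops, even though "foo" never appeared
```

single-token stop strings are therefore matched exactly; multi-token ones can fire early.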
### /v1/embeddings

Text embeddings for RAG and semantic search.

| Parameter | Type | Description |
|---|---|---|
| `input` | string \| string[] | **Required.** Text to embed |
| `encoding_format` | string | `"float"` or `"base64"`. Default `"float"` |
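with `encoding_format: "base64"`, OpenAI-style APIs return the embedding as base64-encoded little-endian float32 bytes; assuming wandler follows that convention, a Node decode sketch looks like this:

```javascript
// decode a base64 embedding into a Float32Array (little-endian float32 bytes,
// the OpenAI convention). byteOffset matters: Node Buffers may share a pool.
function decodeEmbedding(b64) {
  const buf = Buffer.from(b64, "base64");
  return new Float32Array(buf.buffer, buf.byteOffset, buf.byteLength / 4);
}

// round-trip demo with made-up values standing in for a real embedding
const fake = new Float32Array([0.25, -0.5, 1.0]);
const b64 = Buffer.from(fake.buffer, fake.byteOffset, fake.byteLength).toString("base64");
const vec = decodeEmbedding(b64);
```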
### /v1/models

List and inspect loaded models.

### /v1/audio/transcriptions

Speech-to-text via Whisper.

| Parameter | Type | Description |
|---|---|---|
| `file` | binary | **Required.** Audio file to transcribe |
| `language` | string | Language code (e.g. `en`, `de`) |

### /tokenize

Convert between text and token IDs.

| Parameter | Type | Description |
|---|---|---|
| `text` | string | **Required.** Text to tokenize |