wandler by Runpod Labs

transformers.js inference server

use open-weight models on mac, linux & win via an OpenAI-compatible api, built in TypeScript

setup

install it globally

or use npx

run the server

the server runs at http://127.0.0.1:8000 and exposes an OpenAI-compatible API, so any OpenAI client works out of the box.

wandler --llm LiquidAI/LFM2.5-1.2B-Instruct-ONNX

here is every flag wandler accepts:

Flag	Description	Default / notes
--llm <id>	LLM model.	format: org/repo[:precision]
--backend <name>	LLM backend.	default: wandler · baseline: transformersjs
--embedding <id>	Embedding model.	-
--stt <id>	Speech-to-text model.	-
--device <type>	Inference device.	default: auto · options: auto, webgpu, cpu, wasm
--port <n>	Server port.	default: 8000
--host <addr>	Bind address.	default: 127.0.0.1
--api-key <key>	Bearer auth token.	reads env WANDLER_API_KEY
--hf-token <token>	Hugging Face token for gated models.	-
--cors-origin <origin>	Allowed CORS origin.	default: *
--max-tokens <n>	Max tokens per request.	default: the loaded model's max context length
--max-concurrent <n>	Concurrent requests.	default: 1
--timeout <ms>	Request timeout in milliseconds.	default: 120000
--log-level <level>	Log verbosity.	default: info · options: debug, info, warn, error
--quiet	Suppress non-error startup and profile logs.	-
--cache-dir <path>	Model cache directory.	default: ~/.cache/huggingface (standard Hugging Face cache, also respects HF_HOME)
--prefill-chunk-size <n>	Chunk size for long-prompt prefill.	default: auto (640MB GPU attention budget) · use auto:<mb> to tune or 0/off to disable
--decode-loop <mode>	Wandler-owned decode loop.	default: auto (uses transformers.js generate()) · use on to opt into the experimental loop
--prefix-cache <mode>	Enable prefix KV cache.	default: true
--prefix-cache-entries <n>	Prefix KV cache entries.	default: 2
--prefix-cache-min-tokens <n>	Minimum prefix tokens to cache.	default: 512
--warmup-tokens <n>	Approximate prompt tokens to run once before serving.	default: 0
--warmup-max-new-tokens <n>	Max new tokens for startup warmup.	default: 8

precision suffixes: q1, q2, q4 (default), q8, fp16, fp32.
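
to make the org/repo[:precision] format concrete, here is a minimal TypeScript sketch of how such an id could be split. parseModelSpec and ModelSpec are hypothetical names used for illustration only and are not wandler's own parser; the suffix list and the q4 default are taken from the line above.

```typescript
// Hypothetical helper, not wandler's own parser: splits "org/repo[:precision]".
const PRECISIONS = ["q1", "q2", "q4", "q8", "fp16", "fp32"] as const;
type Precision = (typeof PRECISIONS)[number];

interface ModelSpec {
  repo: string;
  precision: Precision;
}

function parseModelSpec(spec: string): ModelSpec {
  const idx = spec.lastIndexOf(":");
  const repo = idx === -1 ? spec : spec.slice(0, idx);
  // No suffix means the documented default precision, q4.
  const suffix = idx === -1 ? "q4" : spec.slice(idx + 1);
  if (!/^[^/\s]+\/[^/\s]+$/.test(repo)) {
    throw new Error(`expected org/repo, got "${repo}"`);
  }
  if (!(PRECISIONS as readonly string[]).includes(suffix)) {
    throw new Error(`unknown precision "${suffix}"`);
  }
  return { repo, precision: suffix as Precision };
}

console.log(parseModelSpec("LiquidAI/LFM2.5-1.2B-Instruct-ONNX:q8"));
```

under these assumptions, an id without a suffix resolves to q4, and an unknown suffix is rejected before any download would start.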

discover models

list every model in the wandler registry with type, size, precision and capabilities

filter by type with --type llm, --type embedding, or --type stt.

benchmarks

WebGPU · q4 quantization · 10 runs per scenario · tested on an M3 Pro with 18 GB of memory

Model	Params	Weights	Context	tok/s	TTFT	Load	Capabilities
LiquidAI/LFM2.5-350M-ONNX	350M	~200 MB	128K	248	16ms	0.5s	text
LiquidAI/LFM2.5-1.2B-Instruct-ONNX	1.2B	~700 MB	128K	118	34ms	1.7s	text, tools
onnx-community/Qwen3.5-0.8B-Text-ONNX	0.8B	~500 MB	256K	37	276ms	1.8s	text, tools
onnx-community/gemma-4-E4B-it-ONNX	4B	~2.5 GB	128K	20	636ms	13.4s	text, tools, vision
onnx-community/gemma-4-E2B-it-ONNX	2B	~1.2 GB	128K	12	890ms	7.0s	text, tools, vision

these are the ones we tested. any transformers.js-compatible model on Hugging Face works.

find more on Hugging Face

use it in your app

drop-in replacement for any OpenAI-compatible SDK

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8000/v1",
  // replace with your --api-key value if you started wandler with one
  apiKey: "changeme",
});

// stream: true returns an async iterable of chunks instead of a single response
const res = await client.chat.completions.create({
  model: "LiquidAI/LFM2.5-1.2B-Instruct-ONNX",
  messages: [{ role: "user", content: "What is the capital of Germany?" }],
  stream: true,
});

// print each content delta as it arrives
for await (const chunk of res) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
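
if you would rather not pull in the SDK, the same call can be made as a plain HTTP request. the sketch below only builds the arguments for fetch() and sends nothing. buildChatRequest is a hypothetical helper, and the Authorization: Bearer header is an assumption based on the --api-key flag being described as a bearer auth token.

```typescript
// Hypothetical helper: builds the arguments for fetch() without sending anything.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface BuiltRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

function buildChatRequest(
  baseURL: string,
  model: string,
  messages: ChatMessage[],
  apiKey?: string,
): BuiltRequest {
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  // Only send a bearer token if the server was started with --api-key.
  if (apiKey) headers["Authorization"] = `Bearer ${apiKey}`;
  return {
    url: `${baseURL.replace(/\/+$/, "")}/chat/completions`,
    init: { method: "POST", headers, body: JSON.stringify({ model, messages }) },
  };
}

const req = buildChatRequest(
  "http://localhost:8000/v1",
  "LiquidAI/LFM2.5-1.2B-Instruct-ONNX",
  [{ role: "user", content: "What is the capital of Germany?" }],
);
console.log(req.url, req.init.headers);
// To send it: const res = await fetch(req.url, req.init); const json = await res.json();
```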

use it with your agent

point your agent at wandler. this works with any agent that supports custom OpenAI-compatible endpoints.

assuming your wandler server is running LiquidAI/LFM2.5-1.2B-Instruct-ONNX, configure Hermes via the CLI like this:

replace model.default with the model slug you actually loaded in wandler.

hermes config set model.default LiquidAI/LFM2.5-1.2B-Instruct-ONNX
hermes config set model.provider custom
hermes config set model.base_url http://localhost:8000/v1
# if you started wandler with --api-key your-local-key
# hermes config set model.api_key your-local-key

or put the same settings into ~/.hermes/config.yaml

again, replace model.default if your server is running a different model.

model:
  default: "LiquidAI/LFM2.5-1.2B-Instruct-ONNX"
  provider: "custom"
  base_url: "http://localhost:8000/v1"
  # if you started wandler with --api-key your-local-key
  # api_key: "your-local-key"

API reference

POST /v1/responses

Create a model response with streaming, tool calling, and multi-turn input

Body

input · string | array · required

Input text or array of message/function_call/function_call_output items

instructions · string

System-level instructions (replaces system message)

temperature · float

Sampling temperature, 0-2. Default 0.7

top_p · float

Nucleus sampling threshold. Default 0.95

max_output_tokens · int

Maximum tokens to generate

stream · boolean

Enable named SSE event streaming. Default false

tools · array

Function tools: {type, name, description, parameters}

Flat format: name and parameters are top-level, not nested under a function key.

tool_choice · string | object

"auto", "none", "required", or {type: "function", name: "..."}

text · object

{"format": {"type": "json_object"}} for JSON mode

top_k · int

Top-k sampling

min_p · float

Minimum probability threshold

repetition_penalty · float

Repetition penalty, > 1.0 to penalize
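
The flat tool format is the detail most likely to trip up code written for chat completions, where name and parameters sit under a function key. The sketch below converts one shape into the other and assembles a request body. toResponsesTool and both interfaces are hypothetical names for illustration, and the model field is an assumption carried over from the SDK example rather than something listed in the parameters above.

```typescript
// Chat Completions style: details nested under "function".
interface ChatTool {
  type: "function";
  function: { name: string; description?: string; parameters: Record<string, unknown> };
}

// Responses style as described above: name and parameters at the top level.
interface ResponsesTool {
  type: "function";
  name: string;
  description?: string;
  parameters: Record<string, unknown>;
}

// Hypothetical helper for illustration; not part of wandler or the OpenAI SDK.
function toResponsesTool(tool: ChatTool): ResponsesTool {
  const { name, description, parameters } = tool.function;
  return { type: "function", name, ...(description ? { description } : {}), parameters };
}

const weatherTool: ChatTool = {
  type: "function",
  function: {
    name: "get_weather",
    description: "Look up the current weather for a city",
    parameters: {
      type: "object",
      properties: { city: { type: "string" } },
      required: ["city"],
    },
  },
};

const responsesBody = {
  model: "LiquidAI/LFM2.5-1.2B-Instruct-ONNX", // assumption: same slug as loaded via --llm
  input: "What is the weather in Berlin?",
  tools: [toResponsesTool(weatherTool)],
  tool_choice: "auto",
};

console.log(JSON.stringify(responsesBody, null, 2));
```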

POST /v1/chat/completions

Chat completion with streaming and tool calling

Body

messages · array · required

Input messages with role and content

temperature · float

Sampling temperature, 0-2. Default 0.7

top_p · float

Nucleus sampling threshold. Default 0.95

max_tokens · int

Maximum tokens to generate

stream · boolean

Enable SSE streaming. Default false

stop · string | string[]

Stop sequences

Only the final token of each stop string triggers stopping. Multi-token sequences are not matched exactly.

tools · array

Function calling tool definitions

When set, streaming is emulated. The full response is generated first, then re-chunked as SSE.

response_format · object

{"type": "json_object"} for JSON mode

top_k · int

Top-k sampling

min_p · float

Minimum probability threshold

repetition_penalty · float

Repetition penalty, > 1.0 to penalize

stream_options · object

{"include_usage": true} for usage stats
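
For clients that read the stream by hand, the sketch below shows how the content deltas and the optional usage chunk could be collected. It assumes the OpenAI wire format (data: lines carrying JSON chunks, terminated by data: [DONE], with usage arriving in a final chunk when include_usage is set); this section does not spell that format out for wandler, so treat it as an assumption. collectChatStream is a hypothetical helper and the sample payload is made up for illustration.

```typescript
interface StreamResult {
  text: string;
  usage?: { prompt_tokens: number; completion_tokens: number; total_tokens: number };
}

// Hypothetical helper: collects delta content and the optional usage chunk
// from a complete SSE payload. A real client would feed it incrementally.
function collectChatStream(sse: string): StreamResult {
  const result: StreamResult = { text: "" };
  for (const line of sse.split(/\r?\n/)) {
    if (!line.startsWith("data:")) continue; // skip blank lines and comments
    const data = line.slice(5).trim();
    if (data === "[DONE]") break;
    const chunk = JSON.parse(data);
    result.text += chunk.choices?.[0]?.delta?.content ?? "";
    if (chunk.usage) result.usage = chunk.usage;
  }
  return result;
}

// Made-up sample payload in the assumed format.
const sampleStream = [
  'data: {"choices":[{"delta":{"content":"Ber"}}]}',
  "",
  'data: {"choices":[{"delta":{"content":"lin"}}]}',
  "",
  'data: {"choices":[],"usage":{"prompt_tokens":7,"completion_tokens":2,"total_tokens":9}}',
  "",
  "data: [DONE]",
].join("\n");

console.log(collectChatStream(sampleStream));
```

Because streaming is emulated when tools is set, a parser like this would receive the same chunk shapes, only all at once after generation finishes.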

POST /v1/completions

Text completion (legacy) with echo and suffix

Body

prompt · string · required

Input text prompt

temperature · float

Sampling temperature, 0-2. Default 0.7

max_tokens · int

Maximum tokens to generate

stream · boolean

Enable SSE streaming. Default false

stop · string | string[]

Stop sequences

Only the final token of each stop string triggers stopping. Multi-token sequences are not matched exactly.

echo · boolean

Echo the prompt in the response

suffix · string

Text to append after completion

POST /v1/embeddings

Text embeddings for RAG and semantic search

Body

input · string | string[] · required

Text to embed

encoding_format · string

"float" or "base64". Default "float"

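When encoding_format is "base64", the client has to unpack the vector before using it. The sketch below assumes the same packing the OpenAI API uses (little-endian float32 values, base64-encoded); that layout is an assumption here, not something this page confirms for wandler. decodeBase64Embedding and cosineSimilarity are hypothetical helper names.

```typescript
// Assumption: base64 embeddings are packed little-endian float32 values,
// as in the OpenAI API. Hypothetical helper, not part of wandler.
function decodeBase64Embedding(b64: string): Float32Array {
  const binary = atob(b64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  // Read through a DataView so the byte order is explicit.
  const view = new DataView(bytes.buffer);
  const out = new Float32Array(bytes.length / 4);
  for (let i = 0; i < out.length; i++) out[i] = view.getFloat32(i * 4, true);
  return out;
}

// Cosine similarity, the usual ranking score for semantic search.
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// "AACAPwAAAEA=" is a hand-made two-value payload, not real model output.
const decoded = decodeBase64Embedding("AACAPwAAAEA=");
console.log(Array.from(decoded), cosineSimilarity(decoded, decoded));
```
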
GET /v1/models

List and inspect loaded models

POST /v1/audio/transcriptions

Speech-to-text

Body

file · binary · required

Audio file to transcribe

language · string

Language code (e.g. en, de)
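
Since file is binary, the request body is presumably multipart/form-data, as with the OpenAI transcription endpoint; that encoding is an assumption, since only the two fields are listed here. The sketch below assembles such a body with the standard FormData and Blob globals. buildTranscriptionForm is a hypothetical helper and the audio bytes are placeholders.

```typescript
// Hypothetical helper: assembles the multipart body for /v1/audio/transcriptions.
function buildTranscriptionForm(audio: Blob, filename: string, language?: string): FormData {
  const form = new FormData();
  form.append("file", audio, filename);
  // language is optional, so only include it when given.
  if (language) form.append("language", language);
  return form;
}

// Placeholder bytes, not real audio.
const audioBlob = new Blob([new Uint8Array([0, 1, 2, 3])], { type: "audio/wav" });
const transcriptionForm = buildTranscriptionForm(audioBlob, "clip.wav", "de");
console.log(transcriptionForm.get("language"));
// To send it:
// await fetch("http://localhost:8000/v1/audio/transcriptions", { method: "POST", body: transcriptionForm });
```

Leaving the Content-Type header unset lets fetch add the multipart boundary itself.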

POST /tokenize

Convert between text and token IDs

Body

text · string · required

Text to tokenize
