    wandler

    transformers.js inference server

    use open-weight models on mac, linux & win via an OpenAI-compatible api, built in TypeScript


    setup

    install it globally

    or use npx

    run the server

    the server runs at http://127.0.0.1:8000 and exposes an OpenAI-compatible API, so any OpenAI client works out of the box.

    wandler --llm LiquidAI/LFM2.5-1.2B-Instruct-ONNX

    here is every flag wandler accepts:

    --llm <id>
    LLM model.
    format: org/repo[:precision]
    --backend <name>
    LLM backend.
    default: wandler · baseline: transformersjs
    --embedding <id>
    Embedding model.
    --stt <id>
    Speech-to-text model.
    --device <type>
    Inference device.
    default: auto · options: auto, webgpu, cpu, wasm
    --port <n>
    Server port.
    default: 8000
    --host <addr>
    Bind address.
    default: 127.0.0.1
    --api-key <key>
    Bearer auth token.
    reads env WANDLER_API_KEY
    --hf-token <token>
    HuggingFace token for gated models.
    --cors-origin <origin>
    Allowed CORS origin.
    default: *
    --max-tokens <n>
    Max tokens per request.
    default: the loaded model's max context length
    --max-concurrent <n>
    Concurrent requests.
    default: 1
    --timeout <ms>
    Request timeout in milliseconds.
    default: 120000
    --log-level <level>
    Log verbosity.
    default: info · options: debug, info, warn, error
    --quiet
    Suppress non-error startup and profile logs.
    --cache-dir <path>
    Model cache directory.
    default: ~/.cache/huggingface (standard HuggingFace cache, also respects HF_HOME)
    --prefill-chunk-size <n>
    Chunk size for long-prompt prefill.
    default: auto (640MB GPU attention budget) · use auto:<mb> to tune or 0/off to disable
    --decode-loop <mode>
    Wandler-owned decode loop.
    default: auto (uses transformers.js generate()) · use on to opt into the experimental loop
    --prefix-cache <mode>
    Enable prefix KV cache.
    default: true
    --prefix-cache-entries <n>
    Prefix KV cache entries.
    default: 2
    --prefix-cache-min-tokens <n>
    Minimum prefix tokens to cache.
    default: 512
    --warmup-tokens <n>
    Approximate prompt tokens to run once before serving.
    default: 0
    --warmup-max-new-tokens <n>
    Max new tokens for startup warmup.
    default: 8

    precision suffixes: q1, q2, q4 (default), q8, fp16, fp32.
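
    to make the org/repo[:precision] format concrete, here is a minimal sketch of how a client-side script might split a model id before passing it to --llm. parseModelId and ModelRef are hypothetical names for illustration, not part of wandler; only the suffix list and the q4 default come from the text above.

```typescript
// precision suffixes listed above; q4 is the documented default
const PRECISIONS = ["q1", "q2", "q4", "q8", "fp16", "fp32"] as const;
type Precision = (typeof PRECISIONS)[number];

interface ModelRef {
  repo: string; // e.g. "LiquidAI/LFM2.5-1.2B-Instruct-ONNX"
  precision: Precision;
}

// hypothetical helper, not part of wandler: splits "org/repo[:precision]"
function parseModelId(id: string): ModelRef {
  const colon = id.lastIndexOf(":");
  const repo = colon === -1 ? id : id.slice(0, colon);
  const suffix = colon === -1 ? "q4" : id.slice(colon + 1);
  if (!/^[^/\s]+\/[^/\s]+$/.test(repo)) {
    throw new Error(`expected org/repo[:precision], got "${id}"`);
  }
  if (!(PRECISIONS as readonly string[]).includes(suffix)) {
    throw new Error(`unknown precision "${suffix}"`);
  }
  return { repo, precision: suffix as Precision };
}
```

    for example, parseModelId("LiquidAI/LFM2.5-1.2B-Instruct-ONNX:q8") yields the repo plus precision q8, and the same id without a suffix falls back to q4.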

    discover models

    list every model in the wandler registry with type, size, precision and capabilities

    filter by type with --type llm, --type embedding, or --type stt.

    benchmarks

    WebGPU · q4 quantization · 10 runs per scenario · tested on m3 pro 18gb

    Model · Params · Weights · Context · tok/s · TTFT · Load · Capabilities
    LiquidAI/LFM2.5-350M-ONNX
    350M · ~200 MB · 128K · 24816ms · 0.5s · text
    LiquidAI/LFM2.5-1.2B-Instruct-ONNX
    1.2B · ~700 MB · 128K · 11834ms · 1.7s · text, tools
    onnx-community/Qwen3.5-0.8B-Text-ONNX
    0.8B · ~500 MB · 256K · 37276ms · 1.8s · text, tools
    onnx-community/gemma-4-E4B-it-ONNX
    4B · ~2.5 GB · 128K · 20636ms · 13.4s · text, tools, vision
    onnx-community/gemma-4-E2B-it-ONNX
    2B · ~1.2 GB · 128K · 12890ms · 7.0s · text, tools, vision

    these are the ones we tested. any transformers.js-compatible model on Hugging Face works.

    find more on Hugging Face

    use it in your app

    drop-in replacement for any OpenAI-compatible SDK

    import OpenAI from "openai";
    
    const client = new OpenAI({
      baseURL: "http://localhost:8000/v1",
      // replace with your --api-key value if you started wandler with one
      apiKey: "changeme",
    });
    
    const res = await client.chat.completions.create({
      model: "LiquidAI/LFM2.5-1.2B-Instruct-ONNX",
      messages: [{ role: "user", content: "What is the capital of Germany?" }],
      stream: true,
    });
    
    for await (const chunk of res) {
      process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
    }
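
    if you would rather not pull in an SDK, the same call can be sketched with plain fetch. this is a minimal, non-streaming variant under a few assumptions: buildChatRequest and ask are hypothetical helper names, the base URL uses the default host and port, and the response is assumed to follow the OpenAI chat completion shape.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// matches the server defaults; adjust if you changed --host or --port
const BASE_URL = "http://127.0.0.1:8000/v1";

// hypothetical helper: builds the fetch() arguments for a non-streaming chat completion
function buildChatRequest(model: string, messages: ChatMessage[], apiKey?: string) {
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  // only send the bearer token if the server was started with --api-key
  if (apiKey) headers["Authorization"] = `Bearer ${apiKey}`;
  return {
    url: `${BASE_URL}/chat/completions`,
    init: { method: "POST", headers, body: JSON.stringify({ model, messages }) },
  };
}

// not called here, since it needs a running server
async function ask(model: string, prompt: string, apiKey?: string): Promise<string> {
  const { url, init } = buildChatRequest(model, [{ role: "user", content: prompt }], apiKey);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`wandler returned HTTP ${res.status}`);
  // assumes the OpenAI chat completion response shape
  const data = (await res.json()) as { choices: { message: { content: string } }[] };
  return data.choices[0].message.content;
}
```

    keeping the request builder separate from the network call makes the header logic easy to check without a server running.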

    use it with your agent

    point your agent to wandler. works with any agent that supports custom OpenAI endpoints

    assuming your wandler server is running LiquidAI/LFM2.5-1.2B-Instruct-ONNX, configure Hermes via the CLI like this

    replace model.default with the model slug you actually loaded in wandler.

    hermes config set model.default LiquidAI/LFM2.5-1.2B-Instruct-ONNX
    hermes config set model.provider custom
    hermes config set model.base_url http://localhost:8000/v1
    # if you started wandler with --api-key your-local-key
    # hermes config set model.api_key your-local-key

    or put the same settings into ~/.hermes/config.yaml

    again, replace model.default if your server is running a different model.

    model:
      default: "LiquidAI/LFM2.5-1.2B-Instruct-ONNX"
      provider: "custom"
      base_url: "http://localhost:8000/v1"
      # if you started wandler with --api-key your-local-key
      # api_key: "your-local-key"

    API reference

    POST /v1/responses

    Create a model response with streaming, tool calling, and multi-turn input

    Body
    input · string | array · required

    Input text or array of message/function_call/function_call_output items

    instructions · string

    System-level instructions (replaces system message)

    temperature · float

    Sampling temperature, 0-2. Default 0.7

    top_p · float

    Nucleus sampling threshold. Default 0.95

    max_output_tokens · int

    Maximum tokens to generate

    stream · boolean

    Enable named SSE event streaming. Default false

    tools · array

    Function tools: {type, name, description, parameters}

    Flat format: name and parameters are top-level, not nested under a function key.

    tool_choice · string | object

    "auto", "none", "required", or {type: "function", name: "..."}

    text · object

    {"format": {"type": "json_object"}} for JSON mode

    top_k · int

    Top-k sampling

    min_p · float

    Minimum probability threshold

    repetition_penalty · float

    Repetition penalty, > 1.0 to penalize
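
    The flat tool format is easiest to see in a concrete request body. The sketch below is illustrative only: get_weather is a made-up tool, and the model field is an assumption carried over from the SDK example, since it is not listed in the body fields above.

```typescript
// flat tool shape: name and parameters sit at the top level, no nested "function" key
interface FunctionTool {
  type: "function";
  name: string;
  description: string;
  parameters: Record<string, unknown>; // JSON Schema for the arguments
}

interface ResponsesBody {
  model?: string; // assumption: sent because OpenAI clients send it
  input: string;
  instructions?: string;
  temperature?: number;
  max_output_tokens?: number;
  tools?: FunctionTool[];
  tool_choice?: "auto" | "none" | "required" | { type: "function"; name: string };
}

// get_weather is a hypothetical tool used only for illustration
const getWeather: FunctionTool = {
  type: "function",
  name: "get_weather",
  description: "Look up the current weather for a city",
  parameters: {
    type: "object",
    properties: { city: { type: "string" } },
    required: ["city"],
  },
};

const responsesBody: ResponsesBody = {
  model: "LiquidAI/LFM2.5-1.2B-Instruct-ONNX",
  input: "What is the weather in Berlin?",
  instructions: "Answer briefly.",
  temperature: 0.7,
  max_output_tokens: 256,
  tools: [getWeather],
  tool_choice: { type: "function", name: "get_weather" },
};

// send this string as the JSON body of POST /v1/responses
const responsesPayload = JSON.stringify(responsesBody);
```

    Compare this with the chat completions convention used by many OpenAI clients, where name and parameters are nested under a function key.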

    POST /v1/chat/completions

    Chat completion with streaming and tool calling

    Body
    messages · array · required

    Input messages with role and content

    temperature · float

    Sampling temperature, 0-2. Default 0.7

    top_p · float

    Nucleus sampling threshold. Default 0.95

    max_tokens · int

    Maximum tokens to generate

    stream · boolean

    Enable SSE streaming. Default false

    stop · string | string[]

    Stop sequences

    Only the final token of each stop string triggers stopping. Multi-token sequences are not matched exactly.

    tools · array

    Function calling tool definitions

    When set, streaming is emulated. The full response is generated first, then re-chunked as SSE.

    response_format · object

    {"type": "json_object"} for JSON mode

    top_k · int

    Top-k sampling

    min_p · float

    Minimum probability threshold

    repetition_penalty · float

    Repetition penalty, > 1.0 to penalize

    stream_options · object

    {"include_usage": true} for usage stats
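
    For clients that read the SSE stream directly instead of going through an SDK, the sketch below collects the text deltas and the optional usage block from a finished stream. It assumes the stream follows the OpenAI convention of data: lines carrying JSON chunks and a final data: [DONE] marker; readSse and ChatChunk are hypothetical names.

```typescript
// assumed chunk shape, mirroring OpenAI's chat.completion.chunk objects
interface ChatChunk {
  choices: { delta?: { content?: string } }[];
  usage?: { prompt_tokens: number; completion_tokens: number; total_tokens: number } | null;
}

// hypothetical helper: collects text and usage from a complete SSE payload
function readSse(sse: string): { text: string; usage: ChatChunk["usage"] } {
  let text = "";
  let usage: ChatChunk["usage"] = null;
  for (const line of sse.split(/\r?\n/)) {
    if (!line.startsWith("data:")) continue; // skip blank separators and comments
    const data = line.slice(5).trim();
    if (data === "[DONE]") break; // OpenAI-style end-of-stream marker
    const chunk = JSON.parse(data) as ChatChunk;
    // the usage chunk may carry an empty choices array, hence the optional chaining
    text += chunk.choices[0]?.delta?.content ?? "";
    if (chunk.usage) usage = chunk.usage;
  }
  return { text, usage };
}
```

    A real client would feed network chunks through a line buffer before calling a parser like this, since a single read can end in the middle of a line.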

    POST /v1/completions

    Text completion (legacy) with echo and suffix

    Body
    prompt · string · required

    Input text prompt

    temperature · float

    Sampling temperature, 0-2. Default 0.7

    max_tokens · int

    Maximum tokens to generate

    stream · boolean

    Enable SSE streaming. Default false

    stop · string | string[]

    Stop sequences

    Only the final token of each stop string triggers stopping. Multi-token sequences are not matched exactly.

    echo · boolean

    Echo the prompt in the response

    suffix · string

    Text to append after completion
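
    Because stop strings are matched on their final token only, a client that needs exact multi-token stop sequences may want to trim the returned text itself. The sketch below shows one way to do that; trimAtStop is a hypothetical helper, and it can only cut text the server actually returned, so it does not undo a stop that fired early.

```typescript
// hypothetical helper: cut generated text at the earliest full stop sequence
function trimAtStop(text: string, stop: string | string[]): string {
  // accept the same string | string[] shape as the stop parameter, ignoring empty entries
  const stops = (Array.isArray(stop) ? stop : [stop]).filter((s) => s.length > 0);
  let cut = text.length;
  for (const s of stops) {
    const i = text.indexOf(s);
    if (i !== -1 && i < cut) cut = i; // keep the earliest match across all stop strings
  }
  return text.slice(0, cut);
}
```

    Text that contains none of the stop strings is returned unchanged, so the helper is safe to apply to every completion.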

    POST /v1/embeddings

    Text embeddings for RAG and semantic search

    Body
    input · string | string[] · required

    Text to embed

    encoding_format · string

    "float" or "base64". Default "float"

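    For semantic search, a typical pattern is to embed a query together with the candidate documents and rank the documents by cosine similarity. The sketch below assumes the response mirrors the OpenAI embeddings shape (a data array of {index, embedding}) with encoding_format "float"; rankByQuery and EmbeddingsResponse are hypothetical names.

```typescript
// assumed response shape, mirroring OpenAI's embeddings API with encoding_format "float"
interface EmbeddingsResponse {
  data: { index: number; embedding: number[] }[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero for empty vectors
  return dot / Math.sqrt(normA * normB);
}

// hypothetical helper: treats input[0] as the query and ranks the remaining inputs
function rankByQuery(res: EmbeddingsResponse): { index: number; score: number }[] {
  const sorted = [...res.data].sort((x, y) => x.index - y.index);
  const [query, ...docs] = sorted;
  if (!query) return [];
  return docs
    .map((d) => ({ index: d.index, score: cosineSimilarity(query.embedding, d.embedding) }))
    .sort((x, y) => y.score - x.score); // best match first
}
```

    The returned index values refer to positions in the input array sent to /v1/embeddings, so index 1 is the first document after the query.
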
    GET /v1/models

    List and inspect loaded models

    POST /v1/audio/transcriptions

    Speech-to-text

    Body
    file · binary · required

    Audio file to transcribe

    language · string

    Language code (e.g. en, de)
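
    Uploading audio can be sketched with the standard FormData API. This assumes the endpoint accepts multipart/form-data the way OpenAI's transcription endpoint does, which the field list above suggests but does not state; buildTranscriptionForm and transcribe are hypothetical names, and the URL uses the default host and port.

```typescript
// hypothetical helper: builds the multipart body for POST /v1/audio/transcriptions
function buildTranscriptionForm(audio: Blob, filename: string, language?: string): FormData {
  const form = new FormData();
  form.append("file", audio, filename); // required audio file
  if (language) form.append("language", language); // optional language code, e.g. "de"
  return form;
}

// not called here, since it needs a running server started with --stt
async function transcribe(audio: Blob, filename: string, language?: string): Promise<unknown> {
  const res = await fetch("http://127.0.0.1:8000/v1/audio/transcriptions", {
    method: "POST",
    // fetch sets the multipart Content-Type and boundary for FormData bodies
    body: buildTranscriptionForm(audio, filename, language),
  });
  if (!res.ok) throw new Error(`wandler returned HTTP ${res.status}`);
  return res.json(); // response shape is not documented above, so it is left untyped
}
```

    If the server was started with --api-key, add an Authorization: Bearer header to the fetch call.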

    POST /tokenize

    Convert between text and token IDs

    Body
    text · string · required

    Text to tokenize

    provided by RunPodLabs