Choosing a model affects speed, RAM use, and quality. Prefer the smallest model that still meets your task; larger models cost more memory and latency for often marginal gains.
## TL;DR

- Solid default (~2 GB chat model): `huggingface:bartowski/Qwen_Qwen3-4B-GGUF/Qwen_Qwen3-4B-Q4_K_M.gguf`
- Smaller and faster: `huggingface:bartowski/Qwen_Qwen3-0.6B-GGUF/Qwen_Qwen3-0.6B-Q4_K_M.gguf`
Pass these wherever Quaynor expects a model path (Chat, Model, Encoder, etc.). First use triggers a download; later loads read from cache and work offline.
## Getting a model
Quaynor can fetch GGUF weights from Hugging Face. Pass a `huggingface:` URI instead of a local path:

```
huggingface:owner/repo/filename.gguf
```
Files are downloaded once and cached; after that, no network access is needed for that model. `hf:` is accepted as shorthand for `huggingface:`.
You can pass any https:// URL if the server serves a GGUF directly, or keep using ordinary filesystem paths if you manage files yourself.
Common starting points for GGUF files are the Bartowski, Unsloth, and Qwen accounts on Hugging Face. Most public `.gguf` files work; unusual exports occasionally fail, and Quaynor surfaces a clear error when a file is incompatible.
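If you are curious what that resolution looks like under the hood, the sketch below reproduces the idea with the `huggingface_hub` Python package. It is not Quaynor's code, and the helper name is made up for illustration; it only shows how a `huggingface:` URI maps onto a one-time download plus a local cache.

```python
# Illustration only, not Quaynor internals: resolve a huggingface: URI
# to a cached local file using the huggingface_hub package.
from huggingface_hub import hf_hub_download

def resolve_model_uri(uri: str) -> str:  # hypothetical helper name
    for prefix in ("huggingface:", "hf:"):
        if uri.startswith(prefix):
            owner, repo, filename = uri[len(prefix):].split("/", 2)
            # First call downloads; later calls reuse the local cache.
            return hf_hub_download(repo_id=f"{owner}/{repo}", filename=filename)
    return uri  # plain filesystem path (https:// URLs are handled separately)

print(resolve_model_uri(
    "huggingface:bartowski/Qwen_Qwen3-0.6B-GGUF/Qwen_Qwen3-0.6B-Q4_K_M.gguf"
))
```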
## Understanding file names
A typical chat filename looks like `Qwen_Qwen3-0.6B-Q4_K_M.gguf`:

| Segment | Meaning |
|---|---|
| Qwen | Publisher or family |
| Qwen3 | Release line |
| 0.6B | Size in billions of parameters |
| Q4 | Quantization level (bits per weight, approximately) |
| K_M | Quant variant (e.g. S faster / rougher, L slower / sharper, M balanced) |
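If you want to pull these segments out programmatically, a small regex covers the common naming pattern. This is an illustrative sketch only; filenames that deviate from the pattern simply will not match.

```python
# Illustrative parser for the common GGUF naming pattern shown above.
import re

PATTERN = re.compile(
    r"(?P<family>[^_]+)_"                # publisher or family, e.g. Qwen
    r"(?P<release>.+?)-"                 # release line, e.g. Qwen3
    r"(?P<size_b>\d+(?:\.\d+)?)B-"       # parameter count in billions
    r"(?P<quant>Q\w+)"                   # quantization, e.g. Q4_K_M
    r"\.gguf$"
)

m = PATTERN.match("Qwen_Qwen3-0.6B-Q4_K_M.gguf")
if m:
    print(m.groupdict())
    # {'family': 'Qwen', 'release': 'Qwen3', 'size_b': '0.6', 'quant': 'Q4_K_M'}
```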
For chat, use instruction-tuned GGUF builds that include a chat template in metadata. That matches typical Hugging Face chat releases. Quaynor errors early if the template cannot be applied.
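To check ahead of time whether a file actually carries a chat template, you can inspect the GGUF metadata yourself. A minimal sketch, assuming the `gguf` Python package that ships with llama.cpp; Quaynor performs its own validation regardless.

```python
# Assumes the `gguf` package from llama.cpp; checks only for the presence
# of a chat template key in the file's metadata.
from gguf import GGUFReader

reader = GGUFReader("Qwen_Qwen3-0.6B-Q4_K_M.gguf")
print("chat template present:", "tokenizer.chat_template" in reader.fields)
```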
For embeddings or reranking / cross-encoder tasks, choose models labeled for those jobs; reranking models sometimes appear under “reranker” or “cross-encoder” names.
## Quantization
Quantization stores weights in fewer bits so models use less RAM and often run faster, with small quality trade-offs. Q4 and Q5 tiers are typical sweet spots for chat.
Rough rule: prefer more parameters at a lower bit width over fewer parameters at higher precision; task-specific testing on your own workload beats any rule of thumb.
## Estimating memory
Approximate VRAM/RAM demand scales with parameter count × effective bytes per parameter from quantization. Illustrative anchors:
| Example | Rough RAM |
|---|---|
| 2 B params @ Q8 | ~2 GB |
| 2 B @ Q4 | ~1 GB |
| 14 B @ Q4 | ~7 GB |
| 14 B @ Q2 | ~3.5 GB |
Treat these as order-of-magnitude guides; backends and KV cache add overhead.
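The anchors above come straight out of the parameters × bytes-per-parameter rule. A minimal sketch with approximate bytes-per-parameter values; real usage is higher once the KV cache and runtime overhead are added.

```python
# Back-of-the-envelope weights footprint: parameters x bytes per parameter.
# Ignores KV cache, activations, and backend overhead.
BYTES_PER_PARAM = {"Q2": 0.25, "Q4": 0.5, "Q5": 0.625, "Q8": 1.0, "F16": 2.0}

def rough_ram_gb(params_billion: float, quant: str) -> float:
    return params_billion * BYTES_PER_PARAM[quant]

for params, quant in [(2, "Q8"), (2, "Q4"), (14, "Q4"), (14, "Q2")]:
    print(f"{params} B @ {quant}: ~{rough_ram_gb(params, quant):g} GB")
# 2 B @ Q8: ~2 GB | 2 B @ Q4: ~1 GB | 14 B @ Q4: ~7 GB | 14 B @ Q2: ~3.5 GB
```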
## Comparing models online
Independent leaderboards help narrow candidates before you download:
- LLM-Stats.com — filters for open / smaller models alongside common benchmarks.
- OpenEvals (Hugging Face) — many task-specific boards; mixes open weights with closed APIs.
You need weights published as GGUF files; models that exist only behind a vendor API cannot be run locally here.
If two models trade off oddly for your workload, GitHub Discussions is the best place to compare notes with other Quaynor users; use Issues when something looks broken.