Quaynor supports multimodal prompts through `Prompt`, as long as the model and its projection model were trained to work together.
## Choosing a model
For image or audio input you usually need:
- a multimodal GGUF model
- a matching projection model, often with `mmproj` in the filename
Load both by passing `projectionModelPath`:
```swift
import Quaynor

let model = try await Model.load(
    modelPath: "/path/to/multimodal-model.gguf",
    projectionModelPath: "/path/to/mmproj.gguf"
)

let chat = try Chat(
    model: model,
    systemPrompt: "You can understand text, images, and audio."
)
```
You can also do this directly through `Chat.fromPath`.
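As a minimal sketch of that one-call form, assuming `Chat.fromPath` takes the same `modelPath` and `projectionModelPath` labels as `Model.load` (which matches the larger example further down) and that `contextSize` has a default:

```swift
// One-call convenience: load the model and create the chat together.
let chat = try await Chat.fromPath(
    modelPath: "/path/to/multimodal-model.gguf",
    projectionModelPath: "/path/to/mmproj.gguf"
)
```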
## Building a multimodal prompt
Use `Prompt` parts for text, image, and audio:
```swift
let prompt = Prompt(parts: [
    Prompt.text("Describe what you see and hear."),
    Prompt.image("/path/to/dog.png"),
    Prompt.audio("/path/to/sound.mp3")
])

let answer = try await chat.ask(prompt).completed()
print(answer)
```
## Tips
- The model and projection model must match. Mixing arbitrary GGUF and `mmproj` files will usually fail or behave badly.
- Prompt order matters. Try alternating text and assets if results are weak; see the sketch after this list.
- Some multimodal models consume a lot of context per image or audio segment, so increase `contextSize` when needed.
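Here is a sketch of that alternating pattern, using the same `Prompt` API as above (the paths and questions are placeholders):

```swift
// Pair each question with the asset it refers to, rather than
// front-loading all the text before the images and audio.
let prompt = Prompt(parts: [
    Prompt.text("What breed is this dog?"),
    Prompt.image("/path/to/dog.png"),
    Prompt.text("What animal makes this sound?"),
    Prompt.audio("/path/to/sound.mp3")
])
```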
Example with a larger context window:
```swift
let chat = try await Chat.fromPath(
    modelPath: "/path/to/multimodal-model.gguf",
    projectionModelPath: "/path/to/mmproj.gguf",
    contextSize: 8192
)
```
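A larger context window also costs more memory, so raise `contextSize` only as far as your prompts actually need.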