- Allows specifying whether thinking mode should be enabled or not
- Templates get passed a new option so that, e.g., qwen3's template can put
`/think` or `/no_think` in the system prompt depending on the value of
the setting (see the template sketch below)
- Add parsing for thinking blocks in both streaming and non-streaming modes
(a unified-parser sketch follows the TODO list)
- Update the CLI to make use of these changes
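As a rough illustration of the template side, here is a minimal, runnable sketch (not the actual code): a qwen3-style system prompt that appends `/think` or `/no_think` based on a hypothetical `Think` option; the field name and exact wiring are assumptions.

```go
package main

import (
	"os"
	"text/template"
)

// promptValues stands in for the values a template receives; the Think
// field name is an assumption, not the final option name.
type promptValues struct {
	System string
	Think  bool
}

func main() {
	// Hypothetical excerpt of a qwen3-style template; the real template is
	// more involved and covers the full message history.
	tmpl := template.Must(template.New("system").Parse(
		"<|im_start|>system\n{{ .System }}{{ if .Think }} /think{{ else }} /no_think{{ end }}<|im_end|>\n"))

	_ = tmpl.Execute(os.Stdout, promptValues{
		System: "You are a helpful assistant.",
		Think:  false,
	})
}
```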
TODO:
- [ ] Don't parse thinking blocks when the user doesn't explicitly set
the option, to maintain backwards compatibility
- [ ] Warn on the CLI when using a non-thinking or older version of a model
(i.e., one with an old template)
- [ ] Wire up capabilities fully
- [x] Unify parsing for streaming/non-streaming
- [ ] Update templates
- [ ] Update python/js libraries
- [ ] Decide how to handle differences between models with respect to
defaults and whether thinking can be controlled at all. If the user
doesn't specify, should there be a default, or should the template be
able to check whether the option was explicitly set?
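Below is a rough sketch of how streaming and non-streaming parsing can share one code path: feeding the full response as a single chunk covers the non-streaming case, while feeding token-sized pieces covers streaming. The `<think>`/`</think>` delimiters, the `thinkingParser` name, and the omission of tags split across chunk boundaries are all simplifying assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// thinkingParser accumulates state across chunks so the same logic serves
// both streaming and non-streaming responses.
type thinkingParser struct {
	inThinking bool
}

// addChunk splits a chunk into thinking text and regular content.
func (p *thinkingParser) addChunk(chunk string) (thinking, content string) {
	for chunk != "" {
		if p.inThinking {
			if i := strings.Index(chunk, "</think>"); i >= 0 {
				thinking += chunk[:i]
				chunk = chunk[i+len("</think>"):]
				p.inThinking = false
				continue
			}
			thinking += chunk
			return
		}
		if i := strings.Index(chunk, "<think>"); i >= 0 {
			content += chunk[:i]
			chunk = chunk[i+len("<think>"):]
			p.inThinking = true
			continue
		}
		content += chunk
		return
	}
	return
}

func main() {
	var p thinkingParser
	for _, chunk := range []string{"<think>plan the answer", "</think>", "final answer"} {
		thinking, content := p.addChunk(chunk)
		fmt.Printf("thinking=%q content=%q\n", thinking, content)
	}
}
```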
Newlines can be an important part of a user's prompt, and trimming them
can alter the results. We previously only trimmed prompts with images,
but refactoring brought this behavior to all prompts, where it became
more noticeable.
The /generate endpoint adds less whitespace and therefore doesn't need
to trim it out; this change brings the same behavior to /chat.
Thanks to @gabe-l-hart for spotting the issue!
Fixes #7795
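For illustration, the intended behavior boils down to something like the following sketch; the function name, its placement, and the `images` parameter are assumptions, not the actual handler code.

```go
package server // hypothetical placement

import "strings"

// finalizePrompt only strips surrounding whitespace when images are
// attached; otherwise the user's newlines are preserved verbatim.
func finalizePrompt(prompt string, images [][]byte) string {
	if len(images) > 0 {
		return strings.TrimSpace(prompt)
	}
	return prompt
}
```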
- Update mllama to take the cross attention state as embeddings in
a batch, more similar to how Llava handles it. This improves
integration with the input cache.
- Pass locations in a prompt for embeddings using tags, similar to Llava.
- Abstract the interface to vision models so the main runner accesses Clip
and Mllama similarly (see the interface sketch below)
Co-authored-by: Michael Yang <mxyng@pm.me>
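A rough sketch of what the shared abstraction could look like; the interface name, method signature, and types are hypothetical and only illustrate the idea of the runner treating Clip and Mllama uniformly.

```go
package main

import "fmt"

// VisionModel turns raw image bytes into embeddings that the runner splices
// into the token batch at tagged positions in the prompt.
type VisionModel interface {
	EncodeImage(data []byte) ([][]float32, error)
}

type clipModel struct{}

func (clipModel) EncodeImage(data []byte) ([][]float32, error) {
	// ... run the CLIP encoder and projector (omitted) ...
	return [][]float32{{0}}, nil
}

type mllamaModel struct{}

func (mllamaModel) EncodeImage(data []byte) ([][]float32, error) {
	// ... produce the cross attention state and return it as embeddings (omitted) ...
	return [][]float32{{0}}, nil
}

func main() {
	for _, m := range []VisionModel{clipModel{}, mllamaModel{}} {
		embd, _ := m.EncodeImage(nil)
		fmt.Printf("%T produced %d embedding(s)\n", m, len(embd))
	}
}
```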