Qwen/QwQ-32B is a great model, but it tends to think for a long time. OpenAI models have a `reasoning_effort` parameter and Anthropic models have a `thinking.budget_tokens` option, but open-source LLMs have no built-in way to control how long, or for how many tokens, the model thinks. Until now! This proxy adds a `max_thinking_chars` parameter that caps the number of characters the LLM is allowed to spend thinking.
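To illustrate what enforcing `max_thinking_chars` involves, here is a minimal sketch of one way a proxy could cap a streamed `<think>...</think>` block: accumulate thinking characters and, once the budget is spent, emit a forced `</think>` and drop the rest of the overlong thinking text. The function name and streaming shape are illustrative assumptions, not this proxy's actual internals, and a real proxy would also cancel the upstream generation to save tokens rather than just filtering the output.

```python
def cap_thinking(chunks, max_thinking_chars):
    """Yield streamed text chunks, force-closing the <think> block once
    the character budget is spent and dropping the overflow.

    Hypothetical sketch: a real proxy would also stop upstream generation.
    """
    state = "before"  # before <think> | thinking | skipping overflow | after </think>
    used = 0
    for chunk in chunks:
        while chunk:
            if state == "before":
                head, tag, chunk = chunk.partition("<think>")
                yield head + tag
                if not tag:
                    break  # no opening tag yet; chunk fully emitted
                state = "thinking"
            elif state == "thinking":
                body, tag, rest = chunk.partition("</think>")
                budget = max_thinking_chars - used
                if len(body) > budget:
                    # Budget exhausted: truncate and force-close the block.
                    yield body[:budget] + "</think>"
                    state = "skipping"
                    chunk = tag + rest  # let "skipping" find the real close tag
                else:
                    used += len(body)
                    yield body + tag
                    if tag:
                        state = "after"
                    chunk = rest
                    if not tag:
                        break
            elif state == "skipping":
                # Discard leftover thinking text until the model's own </think>.
                _, tag, chunk = chunk.partition("</think>")
                if not tag:
                    break
                state = "after"
            else:  # after: the answer streams through untouched
                yield chunk
                break
```

With a budget of 15 characters, a stream like `["<think>", "a"*10, "b"*10, "</think>answer"]` would come out as `<think>` plus ten `a`s, five `b`s, `</think>`, and then the untouched answer.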