DeepSeek-R1 Beaten by QwQ in 32B?

Challenging the Giants: A Compact Contender

Alibaba’s Qwen team has entered the competitive landscape of large language models (LLMs) with QwQ, a model that aims to deliver impressive performance within a relatively compact 32-billion parameter package. While significantly smaller than models like DeepSeek R1, which reportedly boasts 671 billion parameters, QwQ is positioned as a “reasoning” model. Alibaba claims that QwQ can surpass R1 in specific benchmarks, particularly in areas like mathematics, coding, and function-calling. This bold assertion necessitates a thorough examination of QwQ’s architecture and real-world capabilities.

Reinforcement Learning: The Key to QwQ’s Prowess

Like DeepSeek R1, QwQ relies on reinforcement learning (RL) to sharpen its chain-of-thought reasoning, the capacity to analyze and decompose complex problems into sequential steps. The conventional RL approach rewards the model for correct answers, reinforcing the response patterns that lead to accurate outputs.

However, the Qwen team adopted a more sophisticated strategy with QwQ. They incorporated an accuracy verifier and a code execution server. This critical addition ensures that rewards are only granted for mathematically sound solutions and functional, executable code. By implementing this rigorous verification process, the team aims to cultivate a model that exhibits a higher degree of precision and reliability, minimizing the generation of incorrect or nonsensical outputs.
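The exact pipeline isn't public, but the verifier-plus-execution idea can be sketched in a few lines. Everything below is illustrative: the function names and the exact-match check are assumptions, not Qwen's implementation.

```python
import os
import subprocess
import sys
import tempfile

def math_reward(model_answer: str, ground_truth: str) -> float:
    # Accuracy-verifier stand-in: reward only a mathematically sound
    # final answer (here, a simple normalized exact match).
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0

def code_reward(generated_code: str, test_snippet: str) -> float:
    # Code-execution-server stand-in: run the generated code plus a
    # test in a subprocess and reward only a clean exit.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code + "\n" + test_snippet + "\n")
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=10)
        return 1.0 if result.returncode == 0 else 0.0
    finally:
        os.remove(path)
```

In a real RL loop these scores would feed the policy update; the point is that the reward comes from verification rather than from a learned preference model.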

Performance Claims: A Reality Check

The Qwen team’s efforts have, according to their claims, yielded a model that performs significantly above its weight class. They assert that QwQ achieves performance levels comparable to, and in some cases even exceeding, much larger models. These claims are primarily based on benchmark results, which, while providing a standardized measure of performance, can sometimes be misleading.

The world of AI benchmarks is complex and often nuanced. It’s crucial to move beyond the reported figures and examine how these claims translate into practical, real-world scenarios. Benchmark performance doesn’t always directly correlate with usability or effectiveness in diverse applications.

Hands-On Testing: Putting QwQ Through Its Paces

To rigorously assess QwQ’s capabilities, a series of test prompts were designed, spanning a diverse range of domains. These included general knowledge inquiries, spatial reasoning puzzles, problem-solving scenarios, mathematical computations, and other challenges known to pose difficulties for even the most advanced LLMs.

Due to the substantial memory requirements of the full QwQ model, testing was executed in two configurations. First, the complete model was evaluated via the QwQ demo on Hugging Face, assessing its full potential unconstrained by hardware limitations. Second, a 4-bit quantized version was tested on a 24GB GPU (specifically, an Nvidia RTX 3090 or an AMD Radeon RX 7900 XTX) to gauge the impact of quantization on accuracy and to reflect what users with less powerful hardware can expect. Quantization reduces the precision of the model’s weights, shrinking its memory footprint at the potential cost of some performance.
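The memory arithmetic makes the two configurations concrete. A rough sketch, where the ~4.5 bits per weight figure is an assumption approximating a typical 4-bit GGUF format with its scaling overhead:

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    # Weight storage only; the KV cache for the context window adds more on top.
    return n_params * bits_per_weight / 8 / 1e9

print(weight_memory_gb(32e9, 16))   # 64.0 GB at 16-bit: beyond any single consumer GPU
print(weight_memory_gb(32e9, 4.5))  # 18.0 GB at ~4.5 bits: fits a 24GB card
```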

General Knowledge: Holding Its Own

In response to most general knowledge questions, QwQ performed comparably to DeepSeek’s 671-billion-parameter R1 and other reasoning models, such as OpenAI’s o3-mini. The model typically took a few seconds to formulate its thoughts before answering. This behavior is characteristic of reasoning models, which prioritize careful, step-by-step deliberation over immediate responses, so the delay reflects the model internally breaking the question down before generating a reply.

Excelling in Complexity: Logic, Coding, and Mathematics

Where QwQ truly begins to distinguish itself is in tackling more intricate challenges involving logic, coding, or mathematics. These domains require more than just pattern recognition; they demand a degree of reasoning and problem-solving ability. Let’s delve into these areas, highlighting QwQ’s strengths and addressing some areas where it falls short.

Spatial Reasoning: Navigating the Maze

A relatively new spatial-reasoning test, developed by Homebrew Research as part of their AlphaMaze project, was employed to evaluate QwQ’s spatial reasoning capabilities. This test presents a series of maze-like puzzles that require the model to navigate a virtual environment and reach a designated goal.

Both the locally hosted QwQ instance (the 4-bit quantized version) and the full-sized model on Hugging Face consistently solved these puzzles successfully. However, each run did require a few minutes to complete. This indicates that while QwQ can handle spatial reasoning effectively, it’s not necessarily the fastest at it. The time taken suggests that the model is performing a significant amount of internal computation to determine the optimal path through the maze.

In contrast, DeepSeek’s R1 and its 32B distill behaved differently. Both solved the first maze, but R1 struggled with the second, while the 32B distill managed a 90% success rate on it. This variability is not entirely unexpected, given that R1 and the distill are built on distinct base models and potentially different training strategies.

While QwQ outperformed DeepSeek in this spatial reasoning test, some unusual behavior surfaced with the 4-bit quantized model: it initially required nearly twice as many “thought” tokens to complete the test, which at first suggested losses due to quantization. Further investigation, however, traced the sluggishness to incorrect hyperparameter settings. Adjusting them and rerunning the tests resolved the issue, underscoring how sensitive quantized models are to proper configuration.

One-Shot Coding: A Potential Strength

QwQ has attracted considerable attention for its potential in “one-shot” code generation – the ability to produce usable code on the first attempt, without requiring iterative refinement or debugging. This particular area appears to be a significant strength for the model, showcasing its capacity to translate natural language instructions into functional code.

The model was tasked with recreating several relatively simple games in Python using the pygame library. The games chosen were Pong, Breakout, Asteroids, and Flappy Bird. These games, while conceptually straightforward, require a reasonable understanding of game logic, physics, and rendering.

QwQ handled Pong and Breakout with relative ease. After a few minutes of processing, the model generated working versions of both games. The generated code was largely functional and produced playable, albeit basic, recreations of the classic arcade games.
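For a sense of what “working Pong” demands, the core update step amounts to position integration plus a couple of collision checks. A minimal sketch with illustrative names and dimensions, stripped of pygame rendering:

```python
def step(ball, vel, paddle_y, height=480, paddle_h=80):
    # Integrate the ball's position, then handle wall and paddle bounces.
    x, y = ball[0] + vel[0], ball[1] + vel[1]
    vx, vy = vel
    if y <= 0 or y >= height:                       # top/bottom wall bounce
        vy = -vy
    if x <= 10 and paddle_y <= y <= paddle_y + paddle_h:
        vx = -vx                                    # left paddle bounce
    return (x, y), (vx, vy)
```

Getting dozens of such details consistent in a single pass is what makes one-shot generation of even a simple game nontrivial.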

However, when tasked with recreating Asteroids, QwQ encountered difficulties. Although the generated code ran, the graphics and game mechanics were frequently distorted and buggy. The asteroids did not move correctly, collisions were not detected accurately, and the overall gameplay was significantly flawed. In contrast, R1, on its first attempt, faithfully recreated the classic arcade shooter, producing a much more accurate and playable version of the game.

It’s important to consider the training data for these models. They have been exposed to a vast amount of openly available source code, likely including numerous reproductions of classic games like Pong, Breakout, and Asteroids. This raises the question of whether the models are simply recalling learned information and patterns from their training data, rather than independently deriving game mechanics from scratch. This underscores the fundamental nature of these massive neural networks, where apparent intelligence often stems from extensive pattern recognition and the ability to interpolate between seen examples.

Even with these limitations, QwQ’s performance in recreating classic arcade games is impressive, especially considering its relatively small parameter count. It may not match R1 in every coding test, but it demonstrates a remarkable level of capability for a 32B parameter model. The phrase “there’s no replacement for displacement,” often used in the automotive world to describe the relationship between engine size and power, might be relevant here. This could explain why Alibaba is developing a “Max” version of QwQ, though it’s unlikely to be runnable on consumer hardware anytime soon, given the computational demands of significantly larger models.

Compared to DeepSeek’s similarly sized R1 Qwen 2.5 32B distill, Alibaba’s decision to integrate a code execution server into its reinforcement learning pipeline may have conferred an advantage in programming-related challenges. The code execution server allows QwQ to directly test and validate the code it generates, providing immediate feedback and reinforcing correct coding practices.

Mathematics: Capability with a Caveat

Historically, LLMs have struggled with mathematics, a consequence of their primarily language-focused training. While newer models, including QwQ, have shown improvements in mathematical reasoning, they still face challenges, though not necessarily for the reasons one might expect.

QwQ successfully solved all the mathematics problems previously posed to R1. This indicates that QwQ can handle basic arithmetic and even some algebra, demonstrating a level of mathematical competence. However, the issue lies in its efficiency, or rather, its inefficiency. Engaging an LLM for mathematical calculations seems counterintuitive when calculators and direct computation remain readily available and significantly faster.

For instance, solving a simple equation like 7*43 required QwQ to generate over 1,000 tokens, taking approximately 23 seconds on an RTX 3090 Ti. This is a task that could be completed on a pocket calculator, or even mentally, in a fraction of the time. The LLM’s approach involves breaking down the problem into steps and generating text representing the calculation process, which is inherently slower than direct computation.

The inefficiency becomes even more pronounced with larger calculations. Solving 3394*35979, a multiplication problem beyond the capabilities of most non-reasoning models, took the local instance of QwQ three minutes and over 5,000 tokens to compute. This highlights the significant overhead involved in using an LLM for tasks that are fundamentally better suited to specialized tools.

Before the hyperparameter fix (discussed earlier in the context of spatial reasoning), the same equation required a staggering nine minutes and nearly 12,000 tokens. This further emphasizes the sensitivity of QwQ, particularly the quantized version, to proper configuration.

The key takeaway here is that while a model might be capable of brute-forcing its way to the correct answer, it doesn’t necessarily mean it’s the optimal tool for the job. A more practical approach would be to provide QwQ with access to a Python calculator or a similar computational tool. This leverages the model’s strengths in understanding the problem and formulating the appropriate calculation, while offloading the computationally intensive task to a more suitable tool.

When tasked with solving the same 3394*35979 equation using tooling (allowing the model to utilize a calculator), QwQ’s response time plummeted to eight seconds, as the calculator handled the heavy lifting. This demonstrates the significant efficiency gains that can be achieved by combining LLMs with specialized tools.
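The pattern is simple: the model decides what to compute, and a deterministic tool does the arithmetic. A minimal, framework-free sketch of such a calculator tool (the AST-based safe evaluation here is an illustration, not how QwQ’s tooling was actually wired up):

```python
import ast
import operator

_OPS = {ast.Mult: operator.mul, ast.Add: operator.add,
        ast.Sub: operator.sub, ast.Div: operator.truediv}

def calculator(expression: str) -> float:
    # Evaluate a basic arithmetic expression via the AST, avoiding eval().
    def ev(node):
        if isinstance(node, ast.BinOp):
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.Constant):
            return node.value
        raise ValueError("unsupported expression")
    return ev(ast.parse(expression, mode="eval").body)

# Instead of grinding out 5,000 thinking tokens, the model emits one tool call:
print(calculator("3394*35979"))  # 122112726
```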

The Prevalence of “Wait”: A Glimpse into the Thought Process

Examining QwQ’s “thoughts” (the intermediate text generated during processing) reveals a frequent occurrence of the word “wait,” particularly during complex tasks or word problems. This reflects the model’s internal process of checking its work against alternative outcomes and considering different possibilities. It’s akin to a human pausing to double-check their reasoning before committing to an answer.

While this behavior is common in reasoning models, it can be particularly frustrating when QwQ generates an incorrect answer, even after demonstrating an understanding of the correct answer during its “thought” process. This suggests that the model can sometimes get lost in its own internal deliberations or make incorrect inferences despite having the necessary information.

This issue was encountered frequently during testing. One of the most illustrative examples was AutoGen AI’s adaptation of the classic wolf, goat, and cabbage problem. This puzzle presents a transportation optimization challenge with a twist: A farmer needs to transport a wolf, a goat, and a cabbage across a river, but the boat can only carry the farmer and one other item at a time. The wolf cannot be left alone with the goat, and the goat cannot be left alone with the cabbage. The modified version of the problem states that the farmer has three secure compartments.

The solution is embedded within the prompt: with three secure compartments, the farmer can transport all items in a single trip. However, due to the puzzle’s resemblance to the classic version, models often overlook the compartments and attempt to solve it using the traditional constraints.
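The modified puzzle’s answer can even be checked mechanically: with secure compartments, the unsafe-pair constraints never apply, so a single crossing is valid. A toy sketch whose names and structure are purely illustrative:

```python
# Pairs that cannot be left unattended together in the classic puzzle.
UNSAFE = {frozenset({"wolf", "goat"}), frozenset({"goat", "cabbage"})}

def plan_is_safe(trips, compartments_secure):
    # Each entry in trips lists the items left unattended on a bank.
    for left_behind in trips:
        if compartments_secure:
            continue  # secure compartments isolate every item
        if any(pair <= set(left_behind) for pair in UNSAFE):
            return False
    return True

# Take wolf, goat, and cabbage together: nothing is ever left behind.
print(plan_is_safe([[]], compartments_secure=True))  # True
```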

In testing, QwQ consistently failed to solve this puzzle correctly. Examining its thought process revealed that it did not overlook the three compartments. In fact, it acknowledged them but dismissed them as being too simple:

“Wait, if the farmer can take all three in one trip, then he can just do that and be done. But that would make the problem trivial, which is unlikely. So perhaps the compartments are separate but the boat can only carry two items plus the farmer?”

Regardless of whether the test was run on the full model in the cloud or locally (with the 4-bit quantized version), QwQ struggled to solve this consistently. This highlights a potential limitation in its reasoning capabilities, where it may overthink or misinterpret the problem’s constraints, leading it to reject the correct solution in favor of a more complex, but ultimately incorrect, one.

Hyperparameter Sensitivity: A Delicate Balance

Compared to other models, QwQ exhibited a heightened sensitivity to its configuration, particularly the hyperparameters that control the sampling process during text generation. Initially, Alibaba recommended specific sampling parameters:

  • Temperature: 0.6
  • TopP: 0.95
  • TopK: between 20 and 40

Subsequently, these recommendations were updated to include:

  • MinP: 0
  • Presence Penalty: between 0 and 2

Due to an apparent bug in Llama.cpp’s handling of sampling parameters (Llama.cpp is a popular library for running inference on LLMs, particularly quantized ones), it was also necessary to disable the repeat penalty by setting it to 1.
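For Ollama users, the recommended settings and the repeat-penalty workaround can be pinned in a Modelfile. Parameter names follow Ollama’s Modelfile syntax; treat this as a starting point rather than gospel:

```
FROM qwq
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER top_k 40
PARAMETER min_p 0
PARAMETER repeat_penalty 1
```

Building it with `ollama create qwq-tuned -f Modelfile` bakes the settings into every session.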

As previously mentioned, addressing these configuration issues resulted in a significant improvement, more than halving the number of “thinking” tokens required to arrive at an answer. However, this bug appears to be specific to GGUF-quantized versions of the model when running on the Llama.cpp inference engine, which is used by popular applications like Ollama and LM Studio.

For users planning to utilize Llama.cpp, consulting Unsloth’s guide to correcting the sampling order is highly recommended. This guide provides detailed instructions on how to adjust the hyperparameters to ensure optimal performance with QwQ and other models.

Getting Started with QwQ: A Practical Guide

For those interested in experimenting with QwQ, setting it up in Ollama is relatively straightforward, though it does require a GPU with a substantial amount of VRAM (video RAM). The model was successfully run on a 24GB RTX 3090 Ti with a context window large enough for practical use. The context window is the amount of text the model can process at once; a larger window lets the model handle longer conversations or documents.

While it is technically feasible to run the model on a CPU and system memory, this is likely to result in extremely slow response times unless you are using a high-end workstation or server with a large amount of RAM.

Prerequisites:

  1. A machine capable of running medium-sized LLMs at 4-bit quantization. A compatible GPU with at least 24GB of VRAM is recommended; Ollama’s documentation lists supported cards.
  2. For Apple Silicon Macs, a minimum of 32GB of memory is recommended.

This guide assumes basic familiarity with the Linux command-line interface and with Ollama.

Installing Ollama

Ollama is a popular model runner that simplifies the process of downloading and serving LLMs on consumer hardware. For Windows or macOS users, download and install it like any other application from ollama.com.

For Linux users, Ollama provides a convenient one-liner for installation: