DeepSeek-R1 Beaten by a 32B QwQ?

How much can reinforcement learning, bolstered by some extra verification, elevate the capabilities of large language models (LLMs)? Alibaba’s Qwen team is on a quest to find out with their latest creation, QwQ.

QwQ, a ‘reasoning’ model, boasts a relatively compact 32 billion parameters. Yet, Alibaba claims it surpasses DeepSeek R1, with its massive 671 billion parameters, in specific benchmarks related to math, coding, and function-calling.

The Qwen team, much like DeepSeek did with R1, used reinforcement learning to refine QwQ’s chain-of-thought reasoning, sharpening the model’s ability to analyze and break down problems. Reinforcement learning traditionally strengthens stepwise reasoning by rewarding models for correct answers, fostering more accurate responses. QwQ goes a step further, however, by incorporating an accuracy verifier and a code-execution server, so that rewards are granted only for correct mathematical solutions and functional code.
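Qwen hasn’t published its training stack, but the idea of verifier-gated rewards is easy to illustrate. The Python sketch below shows one simplified way it could work: reward is granted only when an external check passes, either an exact-match answer verifier for math or actual execution of generated code against tests. The helper names and the subprocess-based harness are our own stand-ins, not Qwen’s code.

```python
# Illustrative sketch of verifier-gated rewards (not Qwen's actual code).
# A reward of 1.0 is granted only when an external check passes, rather
# than relying on a learned reward model's judgment.

import subprocess
import sys

def math_reward(model_answer: str, verified_solution: str) -> float:
    """Grant reward only if the final answer matches the verified solution."""
    return 1.0 if model_answer.strip() == verified_solution.strip() else 0.0

def code_reward(generated_code: str, tests: str) -> float:
    """Grant reward only if the generated code passes its tests when run
    in a subprocess -- a crude stand-in for a code-execution server."""
    program = generated_code + "\n\n" + tests
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=10,  # kill runaway or non-terminating generations
        )
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0

if __name__ == "__main__":
    print(math_reward("42", "42"))  # 1.0
    snippet = "def add(a, b):\n    return a + b"
    print(code_reward(snippet, "assert add(2, 3) == 5"))  # 1.0
```

The point of gating rewards this way is that the signal can’t be gamed by plausible-looking but wrong chains of thought: the model only gets credit when the answer actually checks out.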

The Qwen team asserts that this approach produces a model that punches above its weight, achieving performance comparable to, and at times exceeding, far larger models.

However, AI benchmarks can be deceptive. So let’s examine how these claims hold up in real-world scenarios, and then we’ll show you how to get QwQ up and running yourself.

Performance Evaluation

We subjected QwQ to a series of test prompts, encompassing general knowledge, spatial reasoning, problem-solving, mathematics, and other queries known to challenge even the most advanced LLMs.

Because the full model has substantial memory requirements, we ran our tests in two configurations to accommodate users with different amounts of RAM. First, we assessed the full model using the QwQ demo on Hugging Face. Then, we tested a 4-bit quantized version on a 24 GB GPU (Nvidia RTX 3090 or AMD Radeon RX 7900 XTX) to gauge the impact of quantization on accuracy.
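For readers who want to reproduce the 4-bit setup, here is a minimal sketch using Hugging Face transformers with bitsandbytes NF4 quantization. It assumes the Qwen/QwQ-32B checkpoint, a CUDA-capable GPU with roughly 24 GB of VRAM, and the model’s built-in chat template; your exact memory headroom and generation settings may differ.

```python
# Minimal sketch: run QwQ with 4-bit (NF4) quantization via transformers
# and bitsandbytes. Assumes the "Qwen/QwQ-32B" checkpoint and a CUDA GPU.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/QwQ-32B"

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for speed/accuracy
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # spread weights across available GPU memory
)

messages = [{"role": "user", "content": "How many r's are in 'strawberry'?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Reasoning models emit long chains of thought, so allow plenty of tokens.
output = model.generate(input_ids, max_new_tokens=4096)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

Note that bitsandbytes primarily targets Nvidia CUDA; on an AMD card such as the RX 7900 XTX you would more typically run a GGUF quantization through llama.cpp or Ollama instead.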

For most general knowledge questions, QwQ exhibited performance similar to DeepSeek’s 671-billion-parameter R1 and other reasoning models such as OpenAI’s o3-mini, pausing briefly to formulate its thoughts before delivering the answer.

The model’s strengths, perhaps unsurprisingly, become evident when tackling more intricate logic, coding, or mathematical challenges. Let’s delve into these areas before addressing some of its limitations.

Spatial Reasoning Prowess

We began with a relatively novel spatial-reasoning test devised by Homebrew Research as part of their AlphaMaze project.

The test presents the model with a maze in text format, as shown below. The model’s task is to navigate from the origin “O” to the target “T.”