I tested ChatGPT o3-mini vs DeepSeek R1 vs Qwen 2.5 with 7 prompts — here’s the winner

Feb 5, 2025
I'm sorry to be the guy to say this, but... you actually tested DeepSeek V3, not R1. It's evident from the fact that your screenshots of DeepSeek (with the blue whale icon to the left of the start of the answer) lack the "Thinking" section. You need to enable "DeepThink (R1)" in the UI; otherwise you'll be chatting with V3. And while V3 is a good model, comparable to 4o, it's clearly much worse than any reasoning model, R1 included. Could you please re-test with the actual R1?
 
Feb 6, 2025
I'm sorry to be the guy to say this, but... you actually tested DeepSeek V3, not R1. It's evident from the fact that your screenshots of DeepSeek (with the blue whale icon to the left of the start of the answer) lack the "Thinking" section. You need to enable "DeepThink (R1)" in the UI; otherwise you'll be chatting with V3. And while V3 is a good model, comparable to 4o, it's clearly much worse than any reasoning model, R1 included. Could you please re-test with the actual R1?
I appreciate the engagement, but I did, in fact, test DeepSeek R1. The only reason you don’t see the "Thinking" section in the screenshot is that I trimmed it to focus on the final output—otherwise, the entire screenshot would have been just the model thinking.

I understand that some may prefer to see the full reasoning process, but the model's actual responses were generated using R1, not V3.
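For anyone who wants to settle the V3-vs-R1 question independently of screenshots, the distinction is also exposed outside the chat UI. Below is a minimal sketch using DeepSeek's OpenAI-compatible API, assuming the model names documented by DeepSeek (`deepseek-chat` for V3, `deepseek-reasoner` for R1) and the `reasoning_content` field that R1 returns alongside its final answer; the API key and prompt are placeholders.

```python
# Minimal sketch: compare DeepSeek-V3 ("deepseek-chat") and DeepSeek-R1
# ("deepseek-reasoner") on the same prompt via DeepSeek's OpenAI-compatible API.
# Model names, base URL, and the reasoning_content field follow DeepSeek's
# public API docs as of early 2025 and may change.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",   # placeholder
    base_url="https://api.deepseek.com",
)

prompt = "Which is larger, 9.11 or 9.9? Answer in one sentence."

for model in ("deepseek-chat", "deepseek-reasoner"):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    message = response.choices[0].message

    # R1 returns its chain of thought separately from the final answer; this is
    # the "Thinking" section the chat UI shows and a screenshot can trim away.
    reasoning = getattr(message, "reasoning_content", None)

    print(f"--- {model} ---")
    if reasoning:
        print(f"[thinking trimmed: {len(reasoning)} characters]")
    print(message.content)
```

Per the API docs, only the R1 response carries the separate reasoning trace, so running both models on the same prompt makes the difference unambiguous regardless of how the screenshots were cropped.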
 
Feb 6, 2025
She will reach the same conclusion no matter what. My suggestion to her is to stop publishing this type of biased article and to stop pretending to be an expert. If someone wants to choose an AI model, let them conduct their own evaluation.
I appreciate the engagement, but I did, in fact, test DeepSeek R1. The only reason you don’t see the "Thinking" section in the screenshot is that I trimmed it to focus on the final output—otherwise, the entire screenshot would have been just the model thinking.

As for the accusations of bias—evaluating AI models is about testing, analyzing, and sharing insights. Readers are free to draw their own conclusions, and constructive discussions are always welcome. Thanks! -AC
 
Feb 6, 2025
<<Removed by moderator>>

Your attention to interface-level granularity is commendable, though it risks obfuscating the core discussion: the functional efficacy of AI interactions in real-world applications. UI elements, including logo placement and response formatting, exist within dynamic rendering environments influenced by variables such as platform updates, device-specific resolutions, and transient session states—factors that render absolute assertions about static positioning inherently unstable.

As for benchmarking credibility, while leaderboards offer a useful (if limited) snapshot of model capability under predefined constraints, they lack the adaptive fidelity necessary for assessing real-world inference dynamics, where user intent, contextual ambiguity, and prompt engineering nuances play non-trivial roles. Dismissing empirical evaluations in favor of rigid, leaderboard-driven heuristics presupposes an overly mechanistic view of AI evaluation—one that fails to account for the stochastic nature of LLM reasoning.

Of course, I’m happy to engage further, provided the discourse aspires to a level of analytical rigor commensurate with the complexity of the subject matter—lol. Cheers! -AC
 