Your attention to interface-level granularity is commendable, though it risks obscuring the core discussion: the functional efficacy of AI interactions in real-world applications. UI elements such as logo placement and response formatting live in dynamic rendering environments shaped by platform updates, device-specific resolutions, and transient session states; absolute assertions about static positioning are therefore inherently unstable.
As for benchmarking credibility: leaderboards offer a useful (if limited) snapshot of model capability under predefined constraints, but they lack the adaptive fidelity needed to assess real-world inference dynamics, where user intent, contextual ambiguity, and prompt-engineering nuances play non-trivial roles. Dismissing empirical evaluations in favor of rigid, leaderboard-driven heuristics presupposes an overly mechanistic view of AI assessment, one that fails to account for the stochastic nature of LLM reasoning.
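To make the stochasticity point concrete, here is a minimal sketch in Python. Note the assumptions: `query_model` is a hypothetical stand-in for whatever provider API you prefer, and the pass rate is simulated rather than measured; the shape of the comparison is what matters, not the toy numbers.

```python
import random
from statistics import mean, stdev

def query_model(prompt: str, temperature: float = 0.7) -> str:
    """Hypothetical stand-in for an LLM call; swap in a real provider API.
    Here it simply simulates a model that answers correctly ~70% of the time."""
    return "correct" if random.random() < 0.7 else "wrong"

def is_correct(answer: str) -> bool:
    # Toy grader; a real evaluation would compare against a reference answer.
    return answer == "correct"

prompt = "Some benchmark item"

# "Leaderboard-style" single run: one sample, one number.
single_score = 1.0 if is_correct(query_model(prompt)) else 0.0

# Repeated sampling at nonzero temperature: the score is a distribution, not a point.
runs = [mean(1.0 if is_correct(query_model(prompt)) else 0.0 for _ in range(20))
        for _ in range(10)]

print(f"single run: {single_score:.2f}")
print(f"20-sample pass rate over 10 trials: mean={mean(runs):.2f}, stdev={stdev(runs):.2f}")
```

The point is that any single-run score is one draw from a distribution, which is exactly what a leaderboard snapshot collapses away.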
Of course, I’m happy to engage further, provided the discourse aspires to a level of analytical rigor commensurate with the complexity of the subject matter—lol. Cheers! -AC