Ryan, it's a crucial topic, so I applaud you for making an initial effort, and I'm sorry to have been negative about it and to have brought up a bunch of legal stuff. That's life these days:
the peanut gallery will always find something to pick on.
IMHO, AI evaluation is an art, and according to Pirsig, quality can't really be defined, so
your post here is infinitely more than nothing. It's also quite
comprehensive and diverse, because you tested many different areas of evaluation.
Thank you for thinking and writing about a head-to-head comparison of AIs!
P.S.
https://en.wikipedia.org/wiki/Theory_of_multiple_intelligences
might be relevant to your valuable line of inquiry here.
To your idea: scientific analysis and benchmarking of AI could become a valuable article series on Tom's Guide, especially for AI coding across the most popular languages (https://survey.stackoverflow.co/2023/#most-popular-technologies-language-prof).
Some ideas would be to compare the completeness of the code each AI writes (placeholders are a major frustration for devs working with AI, and https://chat.openai.com/g/g-3mjvrrXZ6-bug-fix-gpt is a GPT I made specifically to define placeholders and set a goal of reducing them [without any lies about tipping or kittens, mind you]), along with its correctness and performance.
The number of messages it takes to get a working code example is also a useful measure of the "massive energy waste" problem.
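As a rough illustration of how the "completeness" metric could be quantified, here's a minimal Python sketch that counts common placeholder markers in a model's code output. The pattern list and function names are just my own assumptions for the example, not anything from your post or from the Bug Fix GPT.

```python
import re

# Phrases that typically signal an incomplete ("placeholder") answer.
# This list is an illustrative assumption, not an official benchmark.
PLACEHOLDER_PATTERNS = [
    r"\bTODO\b",
    r"\bFIXME\b",
    r"your (code|logic|implementation) here",
    r"rest of (the )?(code|function|implementation)",
    r"^\s*\.\.\.\s*$",   # a bare "..." line inside a code block
    r"implement (this|me)",
]

def placeholder_count(generated_code: str) -> int:
    """Count how many placeholder markers appear in a model's code output."""
    total = 0
    for pattern in PLACEHOLDER_PATTERNS:
        total += len(re.findall(pattern, generated_code,
                                flags=re.IGNORECASE | re.MULTILINE))
    return total

def completeness_score(generated_code: str) -> float:
    """Crude completeness metric: 1.0 means no placeholders were found."""
    return 1.0 / (1.0 + placeholder_count(generated_code))

if __name__ == "__main__":
    sample = """
def fetch_user(user_id):
    # TODO: add error handling
    ...
    # rest of the implementation here
"""
    print("placeholders:", placeholder_count(sample))            # 3
    print("completeness:", round(completeness_score(sample), 2)) # 0.25
```

Correctness could be scored the same way by running each model's output against a small test suite, and "messages to a working example" is just a counter in the chat loop.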
So, yeah, great idea! Think about it like reviewing a CPU or a GPU ... the more we can quantify quality, the easier it is to make informed decisions, and the more helpful the articles will become.
When the GPT store opens up, there will be (already is, really) a "Cambrian Explosion" of GPTs, and that's a good source of material because people will need help choosing from a billion options. Plus, Bard will almost surely wind up with a similar feature (hopefully soon, for reasons noted above)!
Happy Holidays,
Bion