AI benchmarking under fire
PLUS: Create and deploy websites using ChatGPT o3 and Canvas
Good morning, AI enthusiasts. Leaderboard prestige can make or break an AI model’s launch, but a new study claims the scoreboard might be stacked in favor of tech giants.
With allegations of private testing and biased experiments potentially distorting the results of a popular benchmarking platform, the AI evaluation game just got a lot hazier.
Reminder: Our next workshop is today at 4 PM EST — attend and learn how to use Google’s NotebookLM to improve your research, studying, and teaching! RSVP here.
In today’s AI rundown:
Study questions leading AI benchmark
Microsoft’s new small reasoning models
Create websites using ChatGPT o3 and Canvas
Amazon’s new Nova Premier teacher model
4 new AI tools & 4 job opportunities
LATEST DEVELOPMENTS
AI BENCHMARKING

The Rundown: A new study from researchers at Cohere Labs, MIT, Stanford, and other institutions claims that LMArena, the leading crowdsourced AI benchmark, gives unfair advantages to major tech companies, potentially distorting its widely followed rankings.
The details:
The study claims providers like Meta, Google, and OpenAI privately test multiple model variants on the Arena to publish the best performers.
It also found that models from top labs were favored over small/open models in sampling, with Google and OpenAI receiving over 60% of all interactions.
Experiments showed that access to Arena data boosts performance on Arena-specific tasks, suggesting model overfitting rather than actual capability gains.
The researchers also noted that 205 models have been silently removed from the platform, with open-source models deprecated at a higher rate.
Why it matters: LMArena has disputed the study, claiming the leaderboard reflects genuine user preferences. Even so, the allegations could damage the credibility of a platform that shapes how models are perceived. Combined with the Llama 4 Maverick benchmark debacle, this study highlights that AI evaluation isn't always what it seems.
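To see why privately testing several variants and publishing only the best one can inflate a score, here is a minimal simulation sketch. It is not the study's methodology: the rating of 1200, the noise level of 15 points, and the function names are all hypothetical, chosen only to illustrate the selection effect. Every variant has the same true skill, so any gap between the two averages comes purely from picking the luckiest measurement.

```python
import random
import statistics

# Hypothetical numbers, not taken from the study: every variant has the
# same true rating, and each leaderboard measurement adds random noise.
TRUE_RATING = 1200.0
NOISE_SD = 15.0


def published_score(n_variants: int, rng: random.Random) -> float:
    """Measure n equally skilled variants and publish only the highest score."""
    return max(rng.gauss(TRUE_RATING, NOISE_SD) for _ in range(n_variants))


def average_published(n_variants: int, trials: int = 2000, seed: int = 0) -> float:
    """Average published score over many simulated launches."""
    rng = random.Random(seed)
    return statistics.mean(published_score(n_variants, rng) for _ in range(trials))


single = average_published(1)       # one submission, no selection
best_of_10 = average_published(10)  # ten private variants, best one published

print(f"single submission: {single:.1f}")
print(f"best of 10:        {best_of_10:.1f}")
```

In this toy setup, the best-of-10 average lands noticeably above the true rating even though no variant is actually better, which is the kind of distortion the study alleges.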
TOGETHER WITH INNOVATING WITH AI
The Rundown: Innovating With AI’s “The AI Consultancy Project” gives you the frameworks, playbooks, and client‑ready templates needed to turn “interesting AI ideas” into a revenue‑generating business – helping you ride the wave of an AI consulting boom expected to grow by 8x this decade.
In this program, you will:
Gain the tools and frameworks to find clients and deliver top-notch services
Follow a 6-month plan to build a 6-figure AI consulting business
Join a 700‑strong cohort where some members landed their first AI client in 72 hours
MICROSOFT

The Rundown: Microsoft just launched three new reasoning-focused, open-weights models in its Phi family, which outperform larger rivals at complex reasoning tasks while being small enough to run on phones and laptops.
The details:
The flagship Phi-4-reasoning has just 14B parameters but outperforms OpenAI's o1-mini and matches DeepSeek's 671B model on key benchmarks.
A smaller 3.8B parameter version called Phi-4-mini-reasoning can run on mobile devices while matching larger 7B models on math benchmarks.
Designed for efficiency, the models aim to bring strong reasoning capabilities to constrained environments (like edge devices and Copilot+ PCs).
All three models are released as open weights under permissive licenses, allowing commercial use and modification by developers.
Why it matters: Microsoft continues to raise the bar for its small but powerful Phi family, with the latest launch bringing extremely capable reasoning to models sized to fit on phones and laptops. It’s still early days for truly system-integrated AI on devices, but Microsoft’s Copilot+ PCs could be the biggest beneficiary of this reasoning boost.
AI TRAINING

The Rundown: In this tutorial, you will learn how to create fully functional web applications with database capabilities using ChatGPT o3 and Canvas, and then deploy them for free, with no coding skills required.
Step-by-step:
Head over to ChatGPT, select the "o3" model, and activate the "Canvas" option.
Prepare a detailed prompt describing your desired HTML web application, including purpose, features, design preferences, and functionality requirements.
Test your application using the "Preview" button and request any necessary modifications.
Save the code as an HTML file and deploy using Cloudflare by navigating to Workers & Pages, selecting "Create using direct upload," and uploading the file.
Pro tip: Applications that use the browser's local storage keep user data between sessions even when deployed. The data stays in that browser on that device, which makes this approach a good fit for small, single-user applications.
PRESENTED BY CONVEYOR
The Rundown: Most vendors building with AI talk a big game. Sue, Conveyor’s AI Agent for Customer Trust, actually does the work — deploying across F1000 enterprises to fully run customer security reviews, skip busywork, and keep deals moving with no headaches or delays.
Sue, the AI Agent, can:
Manage customer security requests seamlessly across teams and tools
Handle any questionnaire format
Automatically complete, reject, or escalate questionnaires
Personalize your Trust Center with data from conversation intelligence tools
Learn more and integrate Sue into your infosec and sales workflow today.
AMAZON

The Rundown: Amazon just launched Nova Premier, the company’s most advanced AI model yet, designed both to handle complex tasks and to act as a “teacher” for fine-tuning smaller models to match its capabilities.
The details:
The multimodal model can process text, images, and videos with a 1M-token context window, allowing it to analyze about 750,000 words at once.
Internal testing shows Premier lagging behind top competitors like Gemini 2.5 Pro on math, science, and coding benchmarks.
Nova Premier excels at orchestrating multi-agent workflows, showing strength in financial analysis and investment research applications in testing.
Using Amazon's Bedrock Model Distillation, Premier can transfer capabilities to smaller models like Nova Pro and Micro and boost performance by up to 20%.
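The context-window figure above implies a rule of thumb of roughly 0.75 English words per token. The sketch below just makes that arithmetic explicit; the ratio is an assumption inferred from the article's own numbers (actual ratios vary by tokenizer and language), and `approx_words` is a hypothetical helper name.

```python
# Assumed ratio inferred from the article: 750,000 words / 1,000,000 tokens.
WORDS_PER_TOKEN = 0.75


def approx_words(tokens: int, words_per_token: float = WORDS_PER_TOKEN) -> int:
    """Rough estimate of how many English words fit in a token budget."""
    return round(tokens * words_per_token)


print(approx_words(1_000_000))  # prints 750000
```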
Why it matters: With Nova Premier, Amazon positions its top offering not as a direct rival for cutting-edge reasoning tasks but rather as a powerful teacher that can uplift the entire model family — suggesting a focus on optimizing performance and prioritizing efficient, task-specific deployments over a single powerhouse model.
QUICK HITS
🎥 Gen-4 References - Generate consistent characters and scenes in videos
🎨 Gemini App - New update with native AI image editing capabilities
🧠 MiMo-7B - Xiaomi’s small but powerful open-source reasoning model
📸 F-Lite - Freepik’s open-weights image generation model
Anthropic released Integrations, allowing Claude to connect with remote MCP servers to integrate additional tools, alongside new research capabilities like web support.
NVIDIA criticized Anthropic’s AI chip export policy recommendations, arguing that U.S. firms should focus on innovation instead of limiting competitiveness with policy.
Google expanded its AI Mode in Search to all Labs users in the U.S., also introducing new visual shopping and local planning features.
Suno introduced v4.5 of its AI music generation platform, adding new genres, better prompting and adherence, the ability to create songs up to 8 minutes long, and more.
Microsoft is reportedly adding xAI’s Grok model to its Azure development platform, coming amid rumored tensions between CEO Satya Nadella and OpenAI’s Sam Altman.
Google launched Little Language Lessons, three new AI-powered experiments that use Gemini’s multilingual capabilities for personalized learning experiences.
COMMUNITY
Join our next workshop today at 4 PM EST with Dr. Alvaro Cintas, The Rundown’s AI professor. By the end of the workshop, you’ll confidently be able to use Google NotebookLM to improve your research, studying, and teaching.
RSVP here. Not a member? Join The Rundown University on a 14-day free trial.
We’ll always keep this newsletter 100% free. To support our work, consider sharing The Rundown with your friends, and we’ll send you more free goodies.
That's it for today! Before you go, we’d love to know what you thought of today's newsletter to help us improve The Rundown experience for you.
See you soon,
Rowan, Joey, Zach, Alvaro, and Jason, The Rundown’s editorial team