Full Report
Grok 4 is a huge leap from Grok 3, but how good is it compared to other models on the market, such as Gemini 2.5 Pro? We now have answers, thanks to new independent benchmarks. [...]
Analysis Summary
# Main Topic
A performance comparison of the Grok 4 API model against competing models, Gemini 2.5 Pro in particular, using independent, crowd-sourced benchmark data from LMArena.ai.
## Key Points
- Grok 4 API (grok-4-0709) represents a significant performance upgrade from Grok 3, ranking #3 overall in LMArena.ai's Text Arena based on over 4,000 community votes.
- Grok 4 ranked in the top three across several core capability categories:
  - **Math: #1**
  - **Coding: #2**
  - **Creative Writing: #2**
  - **Instruction Following: #2**
  - **Hard Prompts: #3**
- Gemini 2.5 Pro and Claude still lead the current version of Grok 4 in coding tasks.
- A stronger anticipated variant, Grok 4 Heavy, is mentioned, but it was not available on the API platform and so was not benchmarked.
- xAI is expected to release Grok 4 Code, a coding-optimized model, together with a companion CLI tool in August; this may shift the coding benchmark results.
## Conclusion
Grok 4 demonstrates strong general reasoning capabilities, ranking first in Math and in the top three in several other core task categories. While it currently trails competitors such as Gemini 2.5 Pro in coding, the anticipated Grok 4 Code release may narrow that gap. These results are relevant to users evaluating LLMs for research, development, or analytical work.