Text benchmarks
| Capability | Benchmark | Description | Gemini Ultra | GPT-4 |
| --- | --- | --- | --- | --- |
| General | MMLU | Questions in 57 subjects (incl. STEM, humanities, and others) | 90.0% CoT@32* | 86.4% 5-shot* (reported) |
| Reasoning | Big-Bench Hard | Diverse set of challenging tasks requiring multi-step reasoning | 83.6% 3-shot | 83.1% 3-shot (API) |
| Reasoning | DROP | Reading comprehension (F1 score) | 82.4 variable shots | 80.9 3-shot (reported) |
| Reasoning | HellaSwag | Common-sense reasoning for everyday tasks | 87.8% 10-shot* | 95.3% 10-shot* (reported) |
| Math | GSM8K | Basic arithmetic manipulations (incl. grade-school math problems) | 94.4% maj1@32 | 92.0% 5-shot (reported) |
| Math | MATH | Challenging math problems (incl. algebra, geometry, pre-calculus, and others) | 53.2% 4-shot | 52.9% 4-shot (API) |
| Coding | HumanEval | Python code generation | 74.4% 0-shot (IT)* | 67.0% 0-shot* (reported) |
| Coding | Natural2Code | Python code generation; a new held-out, HumanEval-like dataset not leaked on the web | 74.9% 0-shot | 73.9% 0-shot (API) |
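Two of the settings in the table deserve a gloss: maj1@32 and CoT@32 both sample the model 32 times with chain-of-thought prompting and then aggregate the sampled answers (the Gemini report's CoT@32 uses a slightly more involved uncertainty-routed consensus rule, but the core step is a majority vote). Below is a minimal sketch of that aggregation; the `extract_final_answer` parser is hypothetical and assumes each sample ends with a line like "Answer: 42".

```python
from collections import Counter

def extract_final_answer(generation: str) -> str:
    # Hypothetical parser: assumes the model ends its chain of thought
    # with a line like "Answer: 42"; falls back to the last line.
    for line in reversed(generation.splitlines()):
        if line.lower().startswith("answer:"):
            return line.split(":", 1)[1].strip()
    return generation.strip().splitlines()[-1]

def majority_vote(samples: list[str]) -> str:
    # Take the final answer that appears most often across the samples.
    answers = [extract_final_answer(s) for s in samples]
    return Counter(answers).most_common(1)[0][0]

# Three sampled chains of thought; two of them agree on "42".
samples = [
    "7 * 6 = 42.\nAnswer: 42",
    "7 + 6 = 13, so\nAnswer: 13",
    "Six sevens are 42.\nAnswer: 42",
]
print(majority_vote(samples))  # -> 42
```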
Gemini is a natively multimodal model, built to transform any type of input into any type of output. For example, Gemini can generate code from many different kinds of input.
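As a concrete illustration, here is a minimal sketch of asking a multimodal Gemini model to produce code from an image via the google-generativeai Python SDK. The model identifier and file names are assumptions, so check the current SDK documentation before running it.

```python
# Sketch: image in, code out. Assumes the google-generativeai SDK
# (pip install google-generativeai pillow); the model name below is
# an assumption and may need updating to a current Gemini identifier.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # key from Google AI Studio

model = genai.GenerativeModel("gemini-pro-vision")  # assumed model name
chart = Image.open("chart.png")  # hypothetical input image

response = model.generate_content(
    [chart, "Write Python matplotlib code that reproduces this chart."]
)
print(response.text)  # the generated code
```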
Multimodal benchmarks

| Capability | Benchmark | Description | Gemini Ultra (pixel only)* | GPT-4V |
| --- | --- | --- | --- | --- |
| Image | MMMU | Multi-discipline, college-level reasoning problems | 59.4% 0-shot | 56.8% 0-shot |
| Image | VQAv2 | Natural image understanding | 77.8% 0-shot | 77.2% 0-shot |
| Image | TextVQA | OCR on natural images | 82.3% 0-shot | 78.0% 0-shot |
| Image | DocVQA | Document understanding | 90.9% 0-shot | 88.4% 0-shot |
| Image | Infographic VQA | Infographic understanding | 80.3% 0-shot | 75.1% 0-shot |

*Gemini Ultra results are from pixel input only, without assistance from an external OCR system.
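For context on how VQA-style rows like VQAv2 are scored: a prediction earns credit according to how many of the ten human annotators gave the same answer, with three matches sufficient for full credit. A simplified sketch of that metric follows; the official scorer also averages over annotator subsets and normalizes answer strings more aggressively.

```python
def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    # Simplified VQAv2 scoring: full credit once at least 3 of the
    # 10 annotators gave the predicted answer. The official metric
    # additionally averages over all 10-choose-9 annotator subsets.
    pred = prediction.strip().lower()
    matches = sum(ans.strip().lower() == pred for ans in human_answers)
    return min(matches / 3.0, 1.0)

# Example: 4 of 10 annotators answered "red", so "red" scores 1.0.
print(vqa_accuracy("red", ["red"] * 4 + ["dark red"] * 6))  # -> 1.0
```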