Most popular benchmark repositories and open source projects

MultiMedEval corentin-ryr Python

A Python tool to evaluate the performance of VLMs in the medical domain.

88 8 88

physics-benchmarking-neurips2021 cogtoolslab Jupyter Notebook

Repo for "Physion: Evaluating Physical Prediction from Vision in Humans and Machines", presented at NeurIPS 2021 (Datasets & Benchmarks track)

88 5 88

BinaryAudit QuesmaOrg Shell

An open-source benchmark for evaluating AI agents' ability to find backdoors hidden in compiled binaries.

88 4 88

Machine-Unlearning-Comparator gnueaj TypeScript

A Visual Analytics System for Comparative Evaluation of Machine Unlearning Methods (TVCG 2026)

88 8 88

mujoco-benchmark ChenDRAG

Provide full reinforcement learning benchmark on mujoco environments, including ddpg, sac, td3, pg, a2c, ppo, library

88 6 88

NetworkBenchmarkDotNet JohannesDeml C#

Low-level dotnet network benchmark for UDP socket performance (.NET and Unity compatible)

87 12 87

Mercury Elfsong Jupyter Notebook

Code Efficiency Benchmark

87 10 87

LifelongAgentBench caixd-220529 Python

Code repo for "LifelongAgentBench: Evaluating LLM Agents as Lifelong Learners"

87 6 87

vllm-safety-benchmark UCSC-VLAA Python

[ECCV 2024] Official PyTorch Implementation of "How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs"

87 5 87

CMExam williamliujl Python

A Chinese National Medical Licensing Examination dataset and large language model benchmarks

86 13 86

MultiCorrupt ika-rwth-aachen Jupyter Notebook

[IV2024] MultiCorrupt: A benchmark for robust multi-modal 3D object detection, evaluating LiDAR-Camera fusion models in autonomous driving. Includes d...

86 7 86

GL_vs_VK RippeR37 C++

Comparison of the OpenGL and Vulkan APIs in terms of performance.

86 10 86

wikidata-simplequestions askplatypus Jupyter Notebook

Mapping of the SimpleQuestions dataset to Wikidata

86 18 86

step_game lechmazur

Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure. A multi-player “step-race” that challenges LLMs to engage i...

86 2 86

Elysium Hon-Wong Python

[ECCV 2024] Elysium: Exploring Object-level Perception in Videos via MLLM

86 5 86

vlsa-aegis THU-RCSCT Python

A vision-language-safety action architecture, named AEGIS, which contains a plug-and-play safety constraint layer formulated via control barrier funct...

86 8 86

SVBench sotayang Python

[ICLR'2025 Spotlight] Official repository for "SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding"

86 2 86

CreativeBench ZethWang Python

Official code for "CreativeBench: Benchmarking and Enhancing Machine Creativity via Self-Evolving Challenges"

86 5 86

turkish-lm-tuner boun-tabi-LMG Python

Turkish LM Tuner

86 5 86

BenchLMM AIFEG Python

[ECCV 2024] BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models

86 7 86

ncnn-android-benchmark nihui C++

ncnn android benchmark app

86 22 86

rb klingtnet Rust

A thread-safe fixed-size circular buffer written in safe Rust.

86 8 86

locust4j myzhan Java

Locust4j is a load generator for locust, written in Java.

86 33 86

PEACE microsoft Python

PEACE: Empowering Geologic Map Holistic Understanding with MLLMs [Official, CVPR 2025]

85 12 85

benchinit mvdan Go

Benchmark the init cost of Go packages

85 3 85

DramaBench IIIIQIIII HTML

A six-dimensional evaluation framework for drama script continuation with interactive leaderboard and case studies

85 5 85

GMAI-MMBench uni-medical

GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI.

84 3 84

tum-traffic-dataset-dev-kit tum-traffic-dataset Python

TUM Traffic Dataset Development Kit

84 10 84

Benchmark WorldDownTown Swift

The Benchmark⏲ module provides methods to measure and report the time used to execute Swift code.

84 5 84

Impossible-Videos showlab Python

ICML 2025 - Impossible Videos

83 8 83

Counting-Stars nick7nlp Jupyter Notebook

Counting-Stars (★)

83 2 83

tasty-bench Bodigrim Haskell

Featherlight benchmark framework, drop-in replacement for criterion and gauge.

83 14 83

IMLCGui FarisR99 C#

Intel Memory Latency Checker GUI

83 7 83

AI-Generated-Content-Detection rustyneuron01 Python

Multi-modal AI-generated content detection: image, video, and audio. Benchmarks, training code (DINOv2, DINOv3, ReStraV, BreathNet), and evaluation pi...

83 47 83

pytest-django-queries NyanKiyoshi Python

Generate performance reports from your django database performance tests.

83 2 83

MDBenchmark bio-phys Python

Quickly generate, start and analyze benchmarks for molecular dynamics simulations.

83 17 83

logs-benchmark SigNoz Shell

Logs performance benchmark repo: Comparing Elastic, Loki and SigNoz

82 3 82

router-benchmark delvedor JavaScript

Benchmark of the most commonly used http routers

82 17 82

webassembly-benchmarks jedisct1

Libsodium WebAssembly benchmarks results.

82 3 82

WildScenes csiro-robotics Python

[IJRR2024] The official repository for WildScenes: A Benchmark for 2D and 3D Semantic Segmentation in Natural Environments

82 7 82

ruby-performance-tools JuanitoFatas

List of Ruby tools for performance work.

81 5 81

evoeval evo-eval Python

EvoEval: Evolving Coding Benchmarks via LLM

81 13 81

php-orm-benchmark sergeyklay PHP

The benchmark to compare performance of PHP ORM solutions.

81 6 81

godnsbench ameshkov Go

Simple DNS bench util that supports encrypted protocols.

81 2 81

sightglass bytecodealliance C

A benchmarking suite and tooling for Wasmtime and Cranelift

81 36 81

LearnedSort anikristo C++

Learned Sort: a model-enhanced sorting algorithm

81 12 81

auto-pen-bench lucagioacchini Python

This repo contains the code for the penetration test benchmark for Generative Agents presented in the paper "AutoPenBench: Benchmarking Generative Age...

81 20 81

ollama-benchmark cloudmercato Python

Handy tool to measure the performance and efficiency of LLM workloads.

81 8 81

perforator zyedidia Go

Record "perf" performance metrics for individual functions/regions of an ELF binary.

81 5 81

ASR_benchmark Franck-Dernoncourt Python

Program to benchmark various speech recognition APIs

81 18 81
