Most popular benchmark repositories and open source projects

Mind2Web-2 OSU-NLP-Group Python

[NeurIPS'25 D&B] Mind2Web-2 Benchmark: Evaluating Agentic Search with Agent-as-a-Judge

112 7 1

holisticai holistic-ai Jupyter Notebook

This is an open-source tool to assess and improve the trustworthiness of AI systems.

112 31 3

AIDABench MichaelYang-lyx Python

Code for paper AIDABench: AI Data Analytics Benchmark.

111 3 1

text2image-benchmark boomb0om Jupyter Notebook

Benchmark for generative image models

111 6 1

OmniBenchmark ZhangYuanhan-AI Python

[ECCV2022] New benchmark for evaluating pre-trained models; new supervised contrastive learning framework.

110 4 4

benchyou xelabs Go

benchyou is a benchmark tool for MySQL, real-time monitoring TPS and vmstat/iostat

110 36 2

kubernetes-iperf3 Pharb Shell

Simple wrapper around iperf3 to measure network bandwidth from all nodes of a Kubernetes cluster

110 38 2

mini-nbody harrism C

A simple gravitational N-body simulation in less than 100 lines of C code, with CUDA optimizations.

109 30 3

VisualNews-Repository FuxiaoLiu Jupyter Notebook

[EMNLP'21] Visual News: Benchmark and Challenges in News Image Captioning

109 9 14

tsnkit ChuanyuXue Python

A scheduling and benchmark toolkit for Time-Sensitive Networking in Python

109 29 4

pytorch-benchmark LukasHedegaard Python

Easily benchmark PyTorch model FLOPs, latency, throughput, allocated GPU memory and energy consumption

109 11 2

build-tools-performance rstackjs JavaScript

Benchmarks for bundlers and build tools, including Rspack, Rsbuild, webpack, Vite, Rolldown, esbuild, Parcel, Farm and Utoo.

108 8 9

kaggle-dogs-vs-cats-caffe mrgloom Python

Kaggle dogs vs cats solution in Caffe

107 58 11

coder_eval UiPath Python

Evaluate & benchmark AI coding agents and Claude Code skills — sandboxed, reproducible YAML eval suites for Claude Code, Codex & Gemini, with A/B expe...

107 2 0

cwe-bench-java iris-sast Python

A manually vetted dataset for security vulnerability detection in Java projects

107 15 4

EvalNE Dru-Mara Python

Source code for EvalNE, a Python library for evaluating Network Embedding methods.

107 25 4

peaks-consolidation hkpeaks Go

The Peaks Consolidation is equipped with state-of-the-art algorithms and data structures that support high-performance databending exercises. It speci...

107 9 3

MCC5-THU-Gearbox-Benchmark-Datasets liuzy0708 MATLAB

A benchmark fault diagnosis dataset comprises vibration data collected from a gearbox under variable working conditions with intentionally induced fau...

107 8 2

apebench tum-pbs Python

[NeurIPS 2024] A benchmark suite for autoregressive neural emulation of PDEs. (≥46 PDEs in 1D, 2D, 3D; Differentiable Physics; Unrolled Training; Roll...

107 2 4

video_object_detection_paper junliang230

An updated list of video object detection papers (a collection of video object detection papers and code)

106 7 2

hash-smith bluuewhale Java

Fast & memory efficient hash tables for Java

106 9 0

vault-benchmark hashicorp Go

A tool for benchmarking usage of Vault.

106 27 19

pplbench facebookresearch Python

Evaluation Framework for Probabilistic Programming Languages

105 24 21

mlip-arena atomind-ai Jupyter Notebook

🌟 [NeurIPS '25 Spotlight] Fair and transparent benchmark of machine learning interatomic potentials (MLIPs), beyond basic error metrics https://openr...

105 9 0

benchmark-websocket oatpp C++

Websocket Client and Server for benchmarks with Millions of concurrent connections.

104 15 4

cec2017-py tilleyd Python

Python module for CEC 2017 single objective optimization test function suite.

104 25 2

solidity-benchmarks alephao Solidity

Benchmarks of popular contract implementations in Solidity

104 11 6

ddio-bench aliireza Makefile

Reexamining Direct Cache Access to Optimize I/O Intensive Applications for Multi-hundred-gigabit Networks

104 22 2

tastylib chuyangliu C++

C++ implementations of data structures, algorithms, and system designs.

104 29 104

deepmark IngestAI PHP

Deepmark AI enables a unique testing environment for language models (LLM) assessment on task-specific metrics and on your own data so your GenAI-powe...

104 2 5

YAIB rvandewater Python

🧪Yet Another ICU Benchmark: a holistic framework for the standardization of clinical prediction model experiments. Provide custom datasets, cohorts,...

104 34 3

PAD EricLee0224 Python

[NeurIPS 2023] A Dataset and Benchmark for Pose-agnostic Anomaly Detection.

104 7 1

zk-Harness zkCollective Python

Benchmarking framework for general purpose zero-knowledge proofs languages and libraries

103 23 9

trajectopy gereon-t Python

Trajectopy - Trajectory Evaluation in Python

103 5 2

RealPDEBench AI4Science-WestlakeU Python

[🔥ICLR26 Oral] RealPDEBench: A Benchmark for Complex Physical Systems with Paired Real-World and Simulated Data

103 14 1

annbench matsui528 Python

A lightweight benchmark for approximate nearest neighbor search

103 15 5

datacenter-speed-tests jakejarvis Shell

⚡ Test speed and pings to all DigitalOcean, Linode, AWS, GCP, and Vultr regions

102 12 1

playwright-test hugomrdias JavaScript

Run unit tests with several test runners or benchmark inside real browsers with playwright and other JavaScript runtimes.

102 14 1

ansibench nfinit C

A selection of ANSI C benchmarks and programs useful as benchmarks

102 11 1

PPM ZHKKKe

A High-Quality Photography Portrait Matting Benchmark

102 11 4

AIRTBench-Code dreadnode Jupyter Notebook

Code Repository for: AIRTBench: Measuring Autonomous AI Red Teaming Capabilities in Language Models

102 15 1

XRAutomatedTests Unity-Technologies C#

XRAutomatedTests is where you can find functional, graphics, performance, and other types of automated tests for your XR Unity development.

102 18 16

MMRole YanqiDai Python

[ICLR 2025] A Comprehensive Framework for Developing and Evaluating Multimodal Role-Playing Agents

101 4 2

dm_nevis google-deepmind Python

NEVIS'22: Benchmarking the next generation of never-ending learners

101 5 10

PhysGym principia-ai Python

A benchmark suite for evaluating LLM-based interactive scientific reasoning.

101 14 9

mqtt-mock daoshenzzg Go

MQTT load-testing tool. Supports subscribe and publish load-testing modes, and supports simulating a configurable number of client connections.

101 32 7

GLUE-X YangLinyi Python

We leverage 14 datasets as OOD test data and conduct evaluations on 8 NLU tasks over 21 popularly used models. Our findings confirm that the OOD accur...

100 3 2

Wild-Places csiro-robotics Python

🏞️ [IEEE ICRA2023] The official repository for paper "Wild-Places: A Large-Scale Dataset for Lidar Place Recognition in Unstructured Natural Environm...

100 3 13

Robust-Gymnasium SafeRL-Lab Python

[ICLR 2025] Robust Gymnasium: A Unified Modular Benchmark for Robust Reinforcement Learning.

100 11 2

RWKU jinzhuoran Python

RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models. NeurIPS 2024

100 10 1
