Most popular benchmark repositories and open source projects

ChemLLMBench ChemFoundationModels Jupyter Notebook

Official Code for What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks (In NeurIPS 2023)

172 7 172

Benchmark-Overlays TroyMetrics

RTSS / RivaTuner Overlay

172 4 172

ecs andygeiss Go

Build your own Game-Engine based on the Entity Component System concept in Golang.

172 12 172

Shot2Story bytedance Python

A new multi-shot video understanding benchmark Shot2Story with comprehensive video summaries and detailed shot-level captions.

172 10 172

VPR-datasets-downloader gmberton Python

Automatically download VPR datasets in a standard format

172 21 172

dlbench hclhkbu Python

Benchmarking State-of-the-Art Deep Learning Software Tools

171 47 171

memory-maze jurgisp Python

Evaluating long-term memory of reinforcement learning algorithms

171 20 171

face-occlusion-generation kennyvoo Python

[CVPRW 2022] Delving into High-Quality Synthetic Face Occlusion Segmentation Datasets

170 17 170

pytorch-retraining ahirner Jupyter Notebook

Transfer Learning Shootout for PyTorch's model zoo (torchvision)

169 40 169

p2plab Netflix Go

performance benchmark infrastructure for IPLD DAGs

169 30 169

BizFinBench HiThink-Research Python

A Business-Driven Real-World Financial Benchmark for Evaluating LLMs

166 10 166

HPOBench automl Python

Collection of hyperparameter optimization benchmark problems

166 38 166

Hardening-Audit-Tool-AuditTAP fbprogmbh PowerShell

FBPro Audit Test Automation Package allows you to create compliance reports for your systems. The resulting HTML-reports provide a transparent overvie...

166 43 166

LLM-Agent-Benchmark-List zhangxjohn

A benchmark list for evaluation of large language models.

165 11 165

BLINK_Benchmark zeyofu Python

This repo contains evaluation code for the paper "BLINK: Multimodal Large Language Models Can See but Not Perceive". https://arxiv.org/abs/2404.12...

165 8 165

benchmarking-fft project-gemmi C++

choosing FFT library...

165 11 165

PointCloud-C ldkong1205 Python

Benchmarking and Analyzing Point Cloud Perception Robustness under Corruptions

165 22 165

LLM-RGB babelcloud TypeScript

LLM Reasoning and Generation Benchmark. Evaluate LLMs in complex scenarios systematically.

165 16 165

SGI-Bench InternScience Python

Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows

164 4 164

OSWorld-G xlang-ai TypeScript

[NeurIPS 2025 Spotlight] Scaling Computer-Use Grounding via UI Decomposition and Synthesis

163 7 163

backup-bench deajan Shell

Quick and dirty backup tool benchmark with reproducible results

163 13 163

ossf-cve-benchmark ossf-cve-benchmark TypeScript

The OpenSSF CVE Benchmark consists of code and metadata for over 200 real life CVEs, as well as tooling to analyze the vulnerable codebases using a va...

162 45 162

SWE-CI SKYLENAGE-AI Python

SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration

162 18 162

TurtleBench mazzzystar Jupyter Notebook

TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles.

162 16 162

ollama-benchmark LarHope Python

Ollama-based benchmark with detailed I/O tokens per second. Python with DeepSeek R1 example.

161 13 161

benchi ConduitIO Go

Benchmark any tool from the CLI

161 7 161

MMToM-QA chuanyangjin Python

[🏆Outstanding Paper Award at ACL 2024] MMToM-QA: Multimodal Theory of Mind Question Answering

161 19 161

active_genie Roriz Ruby

The Lodash for GenAI: Real Value + Consistent + Model-Agnostic

160 2 160

compiler-benchmark nordlow Python

Benchmarks compilation speeds of different combinations of languages and compilers.

159 17 159

bsuccinct-rs beling Rust

Rust libraries and programs focused on succinct data structures

159 15 159

python-benchmark-harness JoeyHendricks Python

A micro/macro benchmark framework for the Python programming language that helps with optimizing your software.

158 15 158

benchmark-driver benchmark-driver Ruby

Fully-featured benchmark driver for Ruby

157 28 157

segment houbb Java

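The jieba-analysis tool for Java. (A more flexible, elegant, easy-to-use, high-performance Java word segmentation implementation based on the jieba segmentation dictionary. Supports part-of-speech tagging.)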

156 29 156

codspeed CodSpeedHQ Rust

CodSpeed is the all-in-one performance testing toolkit. Optimize code performance and catch regressions early.

155 19 155

clinical-trial-outcome-prediction futianfan Python

benchmark dataset and Deep learning method (Hierarchical Interaction Network, HINT) for clinical trial approval probability prediction, published in C...

155 44 155

MedXpertQA TsinghuaC3I Python

[ICML 2025] MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding

154 13 154

7guis 7guis JavaScript

7GUIs is a GUI programming usability benchmark.

154 21 154

gatling-dubbo youzan Scala

A Gatling plugin for running load tests on Apache Dubbo (https://github.com/apache/incubator-dubbo) and the wider Java ecosystem.

153 43 153

wake-word-benchmark Picovoice Python

wake word engine benchmark framework

153 30 153

tcpulse yuuki Go

A TCP/UDP load generator that provides fine-grained, flow-level control in Go.

153 6 153

SICK Yuri-NagaSaki Shell

Server Info & Check Kit

153 15 153

sltbench ivafanas C++

C++ benchmark tool. Practical, stable and fast performance testing framework.

152 10 152

OpenSceneFlow KTH-RPL Python

A codebase for point cloud scene flow estimation research. Latest works: TeFlow(CVPR'26), DeltaFlow(NeurIPS'25), HiMo(T-RO'25), VoteFlow(CVPR'25), Flo...

152 13 152

deepchange PengBoXiangShang

ICCV 2023, project page of the paper "DeepChange: A Long-term Person Re-identification Benchmark"

152 5 152

coir CoIR-team Python

(ACL 2025 Main) A Comprehensive Benchmark for Code Information Retrieval.

151 15 151

math-parser-benchmark-project ArashPartow C++

C++ Mathematical Expression Parser Benchmark

151 27 151

TeaStore DescartesResearch Java

A micro-service reference test application for model extraction, cloud management, energy efficiency, power prediction, single- and multi-tier auto-sc...

151 176 151

plf_nanotimer mattreecebentley C++

A simple C++ 03/11/etc timer class for ~microsecond-precision cross-platform benchmarking. The implementation is as limited and as simple as possible...

151 14 151

NAS-Benchmark antoyang Python

[ICLR 2020] NAS evaluation is frustratingly hard

150 24 150

pddl-instances potassco Common Lisp

🌍 PDDL instances covering the International Planning Competitions

150 61 150
