Most popular benchmark repositories and open source projects

confabulations lechmazur HTML

Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers.

245 9 245

fastero wasi-master Python

Python timeit CLI for the 21st century! Colored output, multi-line input with syntax highlighting and autocompletion, and much more!

245 7 245

MCPBench modelscope Python

The evaluation benchmark on MCP servers

243 15 243

turnkeyml onnx Python

No-code CLI designed for accelerating ONNX workflows

237 33 237

fixed_point johnmcfarlane C++

C++ Binary Fixed-Point Arithmetic

236 37 236

HAKE DirtyHarryLYL Python

HAKE: Human Activity Knowledge Engine (CVPR'18/19/20, NeurIPS'20, TPAMI'21)

236 14 236

llm-bulls-and-cows-benchmark stalkermustang HTML

A mini-framework for evaluating LLM performance on the Bulls and Cows number guessing game, supporting multiple LLM providers.

235 1 235

LLFormer TaoWangzj Python

The code release of paper "Ultra-High-Definition Low-Light Image Enhancement: A Benchmark and Transformer-Based Method", AAAI 2023

235 14 235

DyCodeEval SeekingDream Python

Official repository of the ICML2025 paper “Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination”

235 20 235

vkmark vkmark C++

Vulkan benchmark

234 40 234

local-feature-evaluation ahojnnes MATLAB

Comparative Evaluation of Hand-Crafted and Learned Local Features

232 48 232

MolScore MorganCThomas Python

An automated scoring function to facilitate and standardize the evaluation of goal-directed generative models for de novo molecular design

232 35 232

nablaDFT AIRI-Institute Python

nablaDFT: Large-Scale Conformational Energy and Hamiltonian Prediction benchmark and dataset

232 25 232

agent-studio ltzheng Python

[ICLR 2025] A trinity of environments, tools, and benchmarks for general virtual agents

231 30 231

httpit gonetx Go

A rapid http(s) benchmark tool written in Go

231 9 231

kubestone kubestone Go

Performance benchmarks for Kubernetes

230 51 230

hpatches-benchmark hpatches MATLAB

Python & Matlab code for local feature descriptor evaluation with the HPatches dataset.

229 64 229

kbench longhorn Go

Benchmark your Kubernetes storage.

228 40 228

evm-bench ziyadedher Solidity

🚀🪑 evm-bench is a suite of Ethereum Virtual Machine stress tests and benchmarks.

226 29 226

mlx-benchmark TristanBilot Python

Benchmark of Apple MLX operations on all Apple Silicon chips (GPU, CPU) + MPS and CUDA.

225 31 225

PoseBench BioinfoMachineLearning Jupyter Notebook

Comprehensive benchmarking of protein-ligand structure prediction methods. (Nature Machine Intelligence)

225 17 225

GBM-perf szilard HTML

Performance of various open source GBM implementations

224 30 224

llm-price-compass arc53 TypeScript

This project collects GPU benchmarks from various cloud providers and compares them to fixed per token costs. Use our tool for efficient LLM GPU selec...

224 9 224

awesome-mojo ego Python

Awesome Mojo🔥

224 10 224

torch-scan frgfm Python

Seamless analysis of your PyTorch models (RAM usage, FLOPs, MACs, receptive field, etc.)

223 22 223

nyt-connections lechmazur Python

Benchmark that evaluates LLMs using 759 NYT Connections puzzles extended with extra trick words

223 8 223

DBNet driving-behavior Python

DBNet: A Large-Scale Dataset for Driving Behavior Learning, CVPR 2018

221 49 221

SanityHarness lemon07r Go

A lightweight harness designed to be simple to use, efficient, universally compatible with any coding agent to evaluate them over a broad set of codi...

219 3 219

ConvolutionalNeuralOperator camlab-ethz Python

This repository is the official implementation of the paper Convolutional Neural Operators for robust and accurate learning of PDEs

217 30 217

rlt wfxr Rust

A universal load testing framework for Rust, with real-time tui support.

217 13 217

srs-benchmark open-spaced-repetition Jupyter Notebook

A benchmark for spaced repetition schedulers/algorithms

216 26 216

ModelNet40-C jiachens Python

Repo for "Benchmarking Robustness of 3D Point Cloud Recognition against Common Corruptions" https://arxiv.org/abs/2201.12296

215 20 215

tpch-kit gregrahn C

TPC-H benchmark kit with some modifications/additions

212 80 212

Ditto OFA-Sys Jupyter Notebook

A self-alignment method for role-play. Benchmark for role-play. Resources for "Large Language Models are Superpositions of All Characters: Attaining A...

211 18 211