Topic

benchmark

Repositories (1763)

confabulations
confabulations lechmazur HTML

Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers.

245
fastero
fastero wasi-master Python

Python timeit CLI for the 21st century! Colored output, multi-line input with syntax highlighting and autocompletion, and much more!

245
MCPBench
MCPBench modelscope Python

The evaluation benchmark on MCP servers

243
turnkeyml
turnkeyml onnx Python

No-code CLI designed for accelerating ONNX workflows

237
fixed_point
fixed_point johnmcfarlane C++

C++ Binary Fixed-Point Arithmetic

236
HAKE
HAKE DirtyHarryLYL Python

HAKE: Human Activity Knowledge Engine (CVPR'18/19/20, NeurIPS'20, TPAMI'21)

236
llm-bulls-and-cows-benchmark
llm-bulls-and-cows-benchmark stalkermustang HTML

A mini-framework for evaluating LLM performance on the Bulls and Cows number guessing game, supporting multiple LLM providers.

235
LLFormer
LLFormer TaoWangzj Python

The code release of paper "Ultra-High-Definition Low-Light Image Enhancement: A Benchmark and Transformer-Based Method", AAAI 2023

235
DyCodeEval
DyCodeEval SeekingDream Python

Official repository of the ICML2025 paper “Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination”

235
vkmark
vkmark vkmark C++

Vulkan benchmark

234
local-feature-evaluation
local-feature-evaluation ahojnnes MATLAB

Comparative Evaluation of Hand-Crafted and Learned Local Features

232
MolScore
MolScore MorganCThomas Python

An automated scoring function to facilitate and standardize the evaluation of goal-directed generative models for de novo molecular design

232
nablaDFT
nablaDFT AIRI-Institute Python

nablaDFT: Large-Scale Conformational Energy and Hamiltonian Prediction benchmark and dataset

232
agent-studio
agent-studio ltzheng Python

[ICLR 2025] A trinity of environments, tools, and benchmarks for general virtual agents

231
httpit
httpit gonetx Go

A rapid http(s) benchmark tool written in Go

231
kubestone
kubestone kubestone Go

Performance benchmarks for Kubernetes

230
hpatches-benchmark
hpatches-benchmark hpatches MATLAB

Python & Matlab code for local feature descriptor evaluation with the HPatches dataset.

229
kbench
kbench longhorn Go

Benchmark your Kubernetes storage.

228
evm-bench
evm-bench ziyadedher Solidity

🚀🪑 evm-bench is a suite of Ethereum Virtual Machine stress tests and benchmarks.

226
mlx-benchmark
mlx-benchmark TristanBilot Python

Benchmark of Apple MLX operations on all Apple Silicon chips (GPU, CPU) + MPS and CUDA.

225
PoseBench
PoseBench BioinfoMachineLearning Jupyter Notebook

Comprehensive benchmarking of protein-ligand structure prediction methods. (Nature Machine Intelligence)

225
GBM-perf
GBM-perf szilard HTML

Performance of various open source GBM implementations

224
llm-price-compass
llm-price-compass arc53 TypeScript

This project collects GPU benchmarks from various cloud providers and compares them to fixed per token costs. Use our tool for efficient LLM GPU selec...

224
awesome-mojo
awesome-mojo ego Python

Awesome Mojo🔥

224
torch-scan
torch-scan frgfm Python

Seamless analysis of your PyTorch models (RAM usage, FLOPs, MACs, receptive field, etc.)

223
nyt-connections
nyt-connections lechmazur Python

Benchmark that evaluates LLMs using 759 NYT Connections puzzles extended with extra trick words

223
DBNet
DBNet driving-behavior Python

DBNet: A Large-Scale Dataset for Driving Behavior Learning, CVPR 2018

221
SanityHarness
SanityHarness lemon07r Go

A lightweight harness designed to be simple to use, efficient, universally compatible with any coding agent to evaluate them over a broad set of codi...

219
ConvolutionalNeuralOperator
ConvolutionalNeuralOperator camlab-ethz Python

This repository is the official implementation of the paper Convolutional Neural Operators for robust and accurate learning of PDEs

217
rlt
rlt wfxr Rust

A universal load testing framework for Rust, with real-time tui support.

217
srs-benchmark
srs-benchmark open-spaced-repetition Jupyter Notebook

A benchmark for spaced repetition schedulers/algorithms

216
ModelNet40-C
ModelNet40-C jiachens Python

Repo for "Benchmarking Robustness of 3D Point Cloud Recognition against Common Corruptions" https://arxiv.org/abs/2201.12296

215
tpch-kit
tpch-kit gregrahn C

TPC-H benchmark kit with some modifications/additions

212
Ditto
Ditto OFA-Sys Jupyter Notebook

A self-alignment method for role-play. Benchmark for role-play. Resources for "Large Language Models are Superpositions of All Characters: Attaining A...

211
ChronoMagic-Bench
ChronoMagic-Bench PKU-YuanGroup Python

[NeurIPS 2024 D&B Spotlight🔥] ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation

211
TSB-UAD
TSB-UAD thedatumorg Jupyter Notebook

An End-to-End Benchmark Suite for Univariate Time-Series Anomaly Detection

211
al_sid
al_sid selous123 Python

[Pytorch] The repo contains the code for "FORGE: Forming Semantic Identifiers for Generative Retrieval in Industrial Datasets"

209
OpenS2V-Nexus
OpenS2V-Nexus PKU-YuanGroup Jupyter Notebook

[NeurIPS 2025 D&B🔥] OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation

208
prometheus-benchmark
prometheus-benchmark VictoriaMetrics Go

Benchmark for Prometheus-compatible systems

208
cve-bench
cve-bench uiuc-kang-lab Python

CVE-Bench: A Benchmark for AI Agents’ Ability to Exploit Real-World Web Application Vulnerabilities

207
Meta-SelfLearning
Meta-SelfLearning bupt-ai-cz Python

Meta Self-learning for Multi-Source Domain Adaptation: A Benchmark

206
zBench
zBench hendriknielaender Zig

📊 zig benchmark

206
miracl
miracl project-miracl

A large-scale multilingual dataset for Information Retrieval. Thorough human annotations across 18 diverse languages.

205
sangrenel
sangrenel jamiealquiza Go

Apache Kafka load testing "...basically a cloth bag filled with small jagged pieces of scrap iron"

204
smartbugs-wild
smartbugs-wild smartbugs Python

This repository contains 47,398 smart contracts extracted from the Ethereum network.

203
embedded-ai.bench
embedded-ai.bench AI-performance Python

benchmark for embedded-ai deep learning inference engines, such as NCNN / TNN / MNN / TensorFlow Lite etc.

201
mac-ml-speed-test
mac-ml-speed-test mrdbourke Jupyter Notebook

A few quick scripts focused on testing TensorFlow/PyTorch/Llama 2 on macOS.

201
RHEL9-CIS
RHEL9-CIS ansible-lockdown YAML

Automated CIS Benchmark Compliance Remediation for RHEL 9 with Ansible

201
SiliconeCalculator
SiliconeCalculator erfansn Kotlin

🎨 A simple but attractive graphical calculator built with Jetpack Compose

200
gpu-rodinia
gpu-rodinia yuhc C

Rodinia benchmark

200