Topic

benchmark

Repositories (1763)

ChemLLMBench
ChemLLMBench ChemFoundationModels Jupyter Notebook

Official Code for What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks (In NeurIPS 2023)

172
Benchmark-Overlays
Benchmark-Overlays TroyMetrics

RTSS / RivaTuner Overlay

172
ecs
ecs andygeiss Go

Build your own Game-Engine based on the Entity Component System concept in Golang.

172
Shot2Story
Shot2Story bytedance Python

A new multi-shot video understanding benchmark Shot2Story with comprehensive video summaries and detailed shot-level captions.

172
VPR-datasets-downloader
VPR-datasets-downloader gmberton Python

Automatically download VPR datasets in a standard format

172
dlbench
dlbench hclhkbu Python

Benchmarking State-of-the-Art Deep Learning Software Tools

171
memory-maze
memory-maze jurgisp Python

Evaluating long-term memory of reinforcement learning algorithms

171
face-occlusion-generation
face-occlusion-generation kennyvoo Python

[CVPRW 2022] Delving into High-Quality Synthetic Face Occlusion Segmentation Datasets

170
pytorch-retraining
pytorch-retraining ahirner Jupyter Notebook

Transfer Learning Shootout for PyTorch's model zoo (torchvision)

169
p2plab
p2plab Netflix Go

performance benchmark infrastructure for IPLD DAGs

169
BizFinBench
BizFinBench HiThink-Research Python

A Business-Driven Real-World Financial Benchmark for Evaluating LLMs

166
HPOBench
HPOBench automl Python

Collection of hyperparameter optimization benchmark problems

166
Hardening-Audit-Tool-AuditTAP
Hardening-Audit-Tool-AuditTAP fbprogmbh PowerShell

FBPro Audit Test Automation Package allows you to create compliance reports for your systems. The resulting HTML-reports provide a transparent overvie...

166
LLM-Agent-Benchmark-List
LLM-Agent-Benchmark-List zhangxjohn

A benchmark list for evaluation of large language models.

165
BLINK_Benchmark
BLINK_Benchmark zeyofu Python

This repo contains evaluation code for the paper "BLINK: Multimodal Large Language Models Can See but Not Perceive". https://arxiv.org/abs/2404.12...

165
benchmarking-fft
benchmarking-fft project-gemmi C++

choosing FFT library...

165
PointCloud-C
PointCloud-C ldkong1205 Python

Benchmarking and Analyzing Point Cloud Perception Robustness under Corruptions

165
LLM-RGB
LLM-RGB babelcloud TypeScript

LLM Reasoning and Generation Benchmark. Evaluate LLMs in complex scenarios systematically.

165
SGI-Bench
SGI-Bench InternScience Python

Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows

164
OSWorld-G
OSWorld-G xlang-ai TypeScript

[NeurIPS 2025 Spotlight] Scaling Computer-Use Grounding via UI Decomposition and Synthesis

163
backup-bench
backup-bench deajan Shell

Quick and dirty backup tool benchmark with reproducible results

163
ossf-cve-benchmark
ossf-cve-benchmark ossf-cve-benchmark TypeScript

The OpenSSF CVE Benchmark consists of code and metadata for over 200 real life CVEs, as well as tooling to analyze the vulnerable codebases using a va...

162
SWE-CI
SWE-CI SKYLENAGE-AI Python

SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration

162
TurtleBench
TurtleBench mazzzystar Jupyter Notebook

TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles.

162
ollama-benchmark
ollama-benchmark LarHope Python

Ollama-based benchmark with detailed input/output tokens per second. Python, with a DeepSeek R1 example.

161
benchi
benchi ConduitIO Go

Benchmark any tool from the CLI

161
MMToM-QA
MMToM-QA chuanyangjin Python

[🏆Outstanding Paper Award at ACL 2024] MMToM-QA: Multimodal Theory of Mind Question Answering

161
active_genie
active_genie Roriz Ruby

The Lodash for GenAI: Real Value + Consistent + Model-Agnostic

160
compiler-benchmark
compiler-benchmark nordlow Python

Benchmarks compilation speeds of different combinations of languages and compilers.

159
bsuccinct-rs
bsuccinct-rs beling Rust

Rust libraries and programs focused on succinct data structures

159
python-benchmark-harness
python-benchmark-harness JoeyHendricks Python

A micro/macro benchmark framework for the Python programming language that helps with optimizing your software.

158
benchmark-driver
benchmark-driver benchmark-driver Ruby

Fully-featured benchmark driver for Ruby

157
segment
segment houbb Java

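The jieba-analysis tool for Java. (A more flexible, elegant, easy-to-use, high-performance Java word segmentation implementation based on the jieba segmentation dictionary. Supports part-of-speech tagging.)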

156
codspeed
codspeed CodSpeedHQ Rust

CodSpeed is the all-in-one performance testing toolkit. Optimize code performance and catch regressions early.

155
clinical-trial-outcome-prediction
clinical-trial-outcome-prediction futianfan Python

benchmark dataset and Deep learning method (Hierarchical Interaction Network, HINT) for clinical trial approval probability prediction, published in C...

155
MedXpertQA
MedXpertQA TsinghuaC3I Python

[ICML 2025] MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding

154
7guis
7guis 7guis JavaScript

7GUIs is a GUI programming usability benchmark.

154
gatling-dubbo
gatling-dubbo youzan Scala

A Gatling plugin for running load tests on Apache Dubbo (https://github.com/apache/incubator-dubbo) and the rest of the Java ecosystem.

153
wake-word-benchmark
wake-word-benchmark Picovoice Python

wake word engine benchmark framework

153
tcpulse
tcpulse yuuki Go

A TCP/UDP load generator that provides fine-grained, flow-level control in Go.

153
SICK
SICK Yuri-NagaSaki Shell

Server Info & Check Kit

153
sltbench
sltbench ivafanas C++

C++ benchmark tool. Practical, stable and fast performance testing framework.

152
OpenSceneFlow
OpenSceneFlow KTH-RPL Python

A codebase for point cloud scene flow estimation research. Latest works: TeFlow(CVPR'26), DeltaFlow(NeurIPS'25), HiMo(T-RO'25), VoteFlow(CVPR'25), Flo...

152
deepchange
deepchange PengBoXiangShang

ICCV 2023, project page of the paper "DeepChange: A Long-term Person Re-identification Benchmark"

152
coir
coir CoIR-team Python

(ACL 2025 Main) A Comprehensive Benchmark for Code Information Retrieval.

151
math-parser-benchmark-project
math-parser-benchmark-project ArashPartow C++

C++ Mathematical Expression Parser Benchmark

151
TeaStore
TeaStore DescartesResearch Java

A micro-service reference test application for model extraction, cloud management, energy efficiency, power prediction, single- and multi-tier auto-sc...

151
plf_nanotimer
plf_nanotimer mattreecebentley C++

A simple C++ 03/11/etc timer class for ~microsecond-precision cross-platform benchmarking. The implementation is as limited and as simple as possible...

151
NAS-Benchmark
NAS-Benchmark antoyang Python

[ICLR 2020] NAS evaluation is frustratingly hard

150
pddl-instances
pddl-instances potassco Common Lisp

🌍 PDDL instances covering the International Planning Competitions

150