🌍 PDDL instances covering the International Planning Competitions
[ICLR 2025] Benchmarking Agentic Workflow Generation
Automated CIS Benchmark Compliance Remediation for Windows Server 2019 with Ansible
A collection of RISC-V Vector (RVV) benchmarks to help developers write portably performant RVV code
An Alibaba open-source multi-language benchmark for evaluating LLMs in repository-level automatic code review, featuring an AI-assisted and expert-ver...
[NeurIPS 2024] CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
xVerify: Efficient Answer Verifier for Reasoning Model Evaluations
FunctionBench: A Suite of Workloads for Serverless Cloud Function Service
Go-based framework for running benchmarks against Docker, containerd, runc, or any CRI-compliant runtime
Benchmark ClassEval for class-level code generation.
GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents
DocILE: Document Information Localization and Extraction Benchmark
Go(lang) benchmarks - (measure the speed of golang)
Robot Learning Beyond Earth
Goku is an HTTP load testing application written in Rust
Comparing performance-oriented string-processing libraries for substring search, multi-pattern matching, hashing, edit-distances, sketching, and sorti...
The Turing Change Point Detection Benchmark: An Extensive Benchmark Evaluation of Change Point Detection Algorithms on real-world data
PHP ORM Benchmark
Benchmark of open source, embedded, memory-mapped, key-value stores available from Java (JMH)
Uses FFmpeg to benchmark video encoders to compare VMAF, SSIM and PSNR with different encoder settings.
You can find the most recent KGQA benchmark numbers from publications here.
jsbench.me - JavaScript performance benchmarking playground
Windows 11 post-install pipeline — audit, tweak, harden, customize
SB Curated is a curated dataset of Solidity smart contracts annotated with tagged vulnerabilities. The dataset was created to evaluate the accuracy of...
[RA-L2022] V2X-Sim Dataset and Benchmark
Comparing the execution speeds of various programming languages
[ICLR 2025] AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
An elegant RTSS Overlay to showcase your benchmark stats in style.
A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models!
The repository includes PyTorch code, and the data, to reproduce the results for our paper titled "A Machine Learning Benchmark for Facies Classificat...
EmoBench-M: A benchmark for evaluating Emotional Intelligence in Multimodal Large Language Models.
⚖️ ORM benchmarking for Node.js applications written in TypeScript
Helper tool for manual Go code optimization.
Video Copy Segment Localization (VCSL) dataset and benchmark [CVPR2022]
[NeurIPS 2024] Touchstone - Benchmarking AI on 5,172 o.o.d. CT volumes and 9 anatomical structures
How good are LLMs at chemistry?
NOT MAINTAINED ANYMORE! New project is located on https://github.com/mozilla-frontend-infra/js-perf-dashboard -- AreWeFastYet is a set of tools used f...
[COLM 2025] Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale
This is a benchmark for domain generalization-based fault diagnosis (code related to domain generalization)
A curated collection of research papers, models, and resources tracing the evolution from specialized models to unified world models.
Evergreen, contamination-free, real-world, domain-specific AI evaluation framework
Evaluation of API and performance of different actor libraries
Benchmark of golang GraphQL frameworks.
Java Virtual Machine (JVM) Performance Benchmarks with a primary focus on top-tier Just-In-Time (JIT) Compilers, such as C2 JIT, Graal JIT, and the Fa...
WideSearch: Benchmarking Agentic Broad Info-Seeking
Templated hierarchical spatial trees designed for high performance.
spam EVM execution nodes over JSON-RPC & run benchmarks
MTAD: Tools and Benchmark for Multivariate Time Series Anomaly Detection