Topic

benchmark

Repositories (1763)

pddl-instances
pddl-instances potassco Common Lisp

🌍 PDDL instances covering the International Planning Competitions

150
mqperf
mqperf softwaremill Scala
149
WorfBench
WorfBench zjunlp Python

[ICLR 2025] Benchmarking Agentic Workflow Generation

148
Windows-2019-CIS
Windows-2019-CIS ansible-lockdown YAML

Automated CIS Benchmark Compliance Remediation for Windows Server 2019 with Ansible

148
rvv-bench
rvv-bench camel-cdr Assembly

A collection of RISC-V Vector (RVV) benchmarks to help developers write portably performant RVV code

148
aacr-bench
aacr-bench alibaba Python

An Alibaba open-source multi-language benchmark for evaluating LLMs in repository-level automatic code review, featuring an AI-assisted and expert-ver...

147
CharXiv
CharXiv princeton-nlp Python

[NeurIPS 2024] CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs

147
xVerify
xVerify IAAR-Shanghai Jupyter Notebook

xVerify: Efficient Answer Verifier for Reasoning Model Evaluations

147
serverless-faas-workbench
serverless-faas-workbench ddps-lab Python

FunctionBench: A Suite of Workloads for Serverless Cloud Function Service

147
bucketbench
bucketbench estesp Go

Go-based framework for running benchmarks against Docker, containerd, runc, or any CRI-compliant runtime

146
ClassEval
ClassEval FudanSELab Python

Benchmark ClassEval for class-level code generation.

146
gameworld
gameworld gameworld-project Python

GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents

146
docile
docile rossumai Python

DocILE: Document Information Localization and Extraction Benchmark

146
golang-benchmarks
golang-benchmarks SimonWaldherr Go

Go(lang) benchmarks - (measure the speed of golang)

145
space_robotics_bench
space_robotics_bench AndrejOrsula Python

Robot Learning Beyond Earth

145
goku
goku jcaromiq Rust

Goku is an HTTP load testing application written in Rust

145
StringWars
StringWars ashvardanian Rust

Comparing performance-oriented string-processing libraries for substring search, multi-pattern matching, hashing, edit-distances, sketching, and sorti...

144
TCPDBench
TCPDBench alan-turing-institute

The Turing Change Point Detection Benchmark: An Extensive Benchmark Evaluation of Change Point Detection Algorithms on real-world data

144
php-orm-benchmark
php-orm-benchmark kenjis PHP

PHP ORM Benchmark

143
benchmarks
benchmarks lmdbjava Shell

Benchmark of open source, embedded, memory-mapped, key-value stores available from Java (JMH)

143
video-quality-metrics
video-quality-metrics CrypticSignal Python

Uses FFmpeg to benchmark video encoders to compare VMAF, SSIM and PSNR with different encoder settings.

143
leaderboard
leaderboard KGQA Jupyter Notebook

You can find the most recent KGQA benchmark numbers from publications here.

143
jsbench-me
jsbench-me psiho

jsbench.me - JavaScript performance benchmarking playground

142
Winrift
Winrift emylfy PowerShell

Windows 11 post-install pipeline — audit, tweak, harden, customize

142
smartbugs-curated
smartbugs-curated smartbugs Solidity

SB Curated is a curated dataset of Solidity smart contracts annotated with tagged vulnerabilities. The dataset was created to evaluate the accuracy of...

142
V2X-Sim
V2X-Sim ai4ce

[RA-L2022] V2X-Sim Dataset and Benchmark

142
SpeedTests
SpeedTests jabbalaci Python

comparing the execution speeds of various programming languages

142
service-mesh-benchmark
service-mesh-benchmark kinvolk Shell
141
aurora
aurora wenhaochai Python

[ICLR 2025] AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark

141
ElegantMustard
ElegantMustard lscambo13

An elegant RTSS Overlay to showcase your benchmark stats in style.

139
Video-Bench
Video-Bench PKU-YuanGroup Python

A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models!

138
facies_classification_benchmark
facies_classification_benchmark yalaudah Python

The repository includes PyTorch code, and the data, to reproduce the results for our paper titled "A Machine Learning Benchmark for Facies Classificat...

138
EmoBench-M
EmoBench-M Emo-gml Python

EmoBench-M: A benchmark for evaluating Emotional Intelligence in Multimodal Large Language Models.

138
typescript-orm-benchmark
typescript-orm-benchmark emanuelcasco TypeScript

⚖️ ORM benchmarking for Node.js applications written in TypeScript

137
go-perftuner
go-perftuner go-perf Go

Helper tool for manual Go code optimization.

137
VCSL
VCSL alipay Python

Video Copy Segment Localization (VCSL) dataset and benchmark [CVPR2022]

137
Touchstone
Touchstone MrGiovanni Jupyter Notebook

[NeurIPS 2024] Touchstone - Benchmarking AI on 5,172 o.o.d. CT volumes and 9 anatomical structures

136
chembench
chembench lamalab-org Python

How good are LLMs at chemistry?

136
arewefastyet
arewefastyet mozilla JavaScript

NOT MAINTAINED ANYMORE! New project is located on https://github.com/mozilla-frontend-infra/js-perf-dashboard -- AreWeFastYet is a set of tools used f...

135
PersonaMem
PersonaMem bowen-upenn Python

[COLM 2025] Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale

135
Domain-generalization-fault-diagnosis-benchmark
Domain-generalization-fault-diagnosis-benchmark CHAOZHAO-1 Python

This is a benchmark for domain generalization-based fault diagnosis (code related to domain generalization)

135
awesome-world-model-evolution
awesome-world-model-evolution OpenRaiser

A curated collection of research papers, models, and resources tracing the evolution from specialized models to unified world models.

134
xbench-evals
xbench-evals xbench-ai Python

Evergreen, contamination-free, real-world, domain-specific AI evaluation framework

134
actors
actors plokhotnyuk Scala

Evaluation of API and performance of different actor libraries

133
golang-graphql-benchmark
golang-graphql-benchmark appleboy Go

benchmark of golang GraphQL framework.

132
jvm-performance-benchmarks
jvm-performance-benchmarks ionutbalosin Java

Java Virtual Machine (JVM) Performance Benchmarks with a primary focus on top-tier Just-In-Time (JIT) Compilers, such as C2 JIT, Graal JIT, and the Fa...

132
WideSearch
WideSearch ByteDance-Seed Python

WideSearch: Benchmarking Agentic Broad Info-Seeking

132
THST
THST tuxalin C++

Templated hierarchical spatial trees designed for high performance.

132
contender
contender flashbots Rust

spam EVM execution nodes over JSON-RPC & run benchmarks

132
MTAD
MTAD OpsPAI Python

MTAD: Tools and Benchmark for Multivariate Time Series Anomaly Detection

132