ClickBench: a Benchmark For Analytical Databases
Monocular Depth Estimation Toolbox based on MMSegmentation.
Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models
A.S.E (AICGSecEval) is a repository-level AI-generated code security evaluation benchmark developed by Tencent Wukong Code Security Team.
KernelBench: Can LLMs Write GPU Kernels? - Benchmark + Toolkit with Torch -> CUDA (+ more DSLs)
Various gRPC benchmarks
BlazeHTTP is an easy-to-use tool for testing WAF protection effectiveness.
Model Zoo For OpenCV DNN and Benchmarks.
A High Performance HTTP Server for Ruby
VPS benchmark script — based on the popular bench.sh, plus CPU and ioping tests, and dual-stack IPv4 and v6 speedtests by default
Measure Amazon S3's performance from any location.
AoE (AI on Edge: on-device intelligence, edge computing) is an on-device AI integrated runtime environment (IRE) that helps developers improve efficiency.
Performance comparison of .NET IoC containers
Python suite to construct benchmark machine learning datasets from the MIMIC-III 💊 clinical database.
Open Source Continuous Inference Benchmarking Qwen3.5, DeepSeek, GPTOSS - GB200 NVL72 vs MI355X vs B200 vs GB300 NVL72 vs H100 & soon™ TPUv6e/v7/Train...
A PyTorch library for all things Reinforcement Learning (RL) for Combinatorial Optimization (CO)
C++ Benchmark Authoring Library/Framework
CUDA Kernel Benchmarking Library
[CBLUE1] CBLUE, a benchmark for Chinese medical information processing: A Chinese Biomedical Language Understanding Evaluation Benchmark
Windows, macOS and Android storage (HDD, SSD, RAM) speed testing/performance benchmarking app
Natural Intelligence is still a pretty good idea.
High-performance Distributed Storage
🐰 Bencher - Continuous Benchmarking
A benchmark dataset for data-driven weather forecasting
📊 Benchmark Comparison of Packages with Runtime Validation and TypeScript Support
A dataset of datasets for learning to learn from few examples
"Trust no one, bench everything." - sbt plugin for JMH (Java Microbenchmark Harness)
Yet another implementation of computer language benchmarks game
Go HTTP stress testing tool that supports single-machine and distributed modes, with HTTP/1, HTTP/2, and HTTP/3.
Internal Safety Collapse: Turning the LLM or an AI Agent into a sensitive data generator.
Easily monitor your ThreeJS performance.
S3 benchmarking tool
RobustBench: a standardized adversarial robustness benchmark [NeurIPS 2021 Benchmarks and Datasets Track]
HammerDB: The industry standard open-source database benchmark
Evaluation of how CNN design choices affect performance on ImageNet-2012.
OpenCUA: Open Foundations for Computer-Use Agents
Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of prote...
Another benchmark for some Python frameworks
Raw benchmarks on throughput, latency and transfer of Hello World on popular microservices frameworks
[EMNLP 2024 & AAAI 2026] A powerful toolkit for compressing large models including LLMs, VLMs, and video generative models.
Comparing the C FFI (foreign function interface) overhead across various programming languages
A blockchain benchmark framework to measure performance of multiple blockchain solutions https://wiki.hyperledger.org/display/caliper
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
An agent benchmark with tasks in a simulated software company.
Point-based and tiny-object detection and localization code set of UCAS-VG
Benchmarking long-form factuality in large language models. Original code for our paper "Long-form factuality in large language models".
A benchmarking framework for the Julia language
Frontier Models playing the board game Diplomacy.
CAIRI Supervised, Semi- and Self-Supervised Visual Representation Learning Toolbox and Benchmark
A repository of pretty cool datasets that I collected for network science and machine learning research.