A Python tool to evaluate the performance of VLMs in the medical domain.
Repo for "Physion: Evaluating Physical Prediction from Vision in Humans and Machines", presented at NeurIPS 2021 (Datasets & Benchmarks track)
An open-source benchmark for evaluating AI agents' ability to find backdoors hidden in compiled binaries.
A Visual Analytics System for Comparative Evaluation of Machine Unlearning Methods (TVCG 2026)
A library providing a full reinforcement learning benchmark on MuJoCo environments, including DDPG, SAC, TD3, PG, A2C, and PPO
Low-level dotnet network benchmark for UDP socket performance (.NET and Unity compatible)
Code Efficiency Benchmark
Code repo for "LifelongAgentBench: Evaluating LLM Agents as Lifelong Learners"
[ECCV 2024] Official PyTorch Implementation of "How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs"
A Chinese National Medical Licensing Examination dataset and large language model benchmarks
[IV2024] MultiCorrupt: A benchmark for robust multi-modal 3D object detection, evaluating LiDAR-Camera fusion models in autonomous driving. Includes d...
Comparison of OpenGL and Vulkan API in terms of performance.
Mapping of the SimpleQuestions dataset to Wikidata
Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure. A multi-player “step-race” that challenges LLMs to engage i...
[ECCV 2024] Elysium: Exploring Object-level Perception in Videos via MLLM
A vision-language-safety action architecture, named AEGIS, which contains a plug-and-play safety constraint layer formulated via control barrier funct...
[ICLR'2025 Spotlight] Official repository for "SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding"
Official code for "CreativeBench: Benchmarking and Enhancing Machine Creativity via Self-Evolving Challenges"
Turkish LM Tuner
[ECCV 2024] BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models
ncnn android benchmark app
A thread-safe fixed-size circular buffer written in safe Rust.
Locust4j is a load generator for locust, written in Java.
PEACE: Empowering Geologic Map Holistic Understanding with MLLMs [Official, CVPR 2025]
Benchmark the init cost of Go packages
A six-dimensional evaluation framework for drama script continuation with interactive leaderboard and case studies
GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI.
TUM Traffic Dataset Development Kit
The Benchmark⏲ module provides methods to measure and report the time used to execute Swift code.
ICML 2025 - Impossible Videos
Counting-Stars (★)
Featherlight benchmark framework, drop-in replacement for criterion and gauge.
Intel Memory Latency Checker GUI
Multi-modal AI-generated content detection: image, video, and audio. Benchmarks, training code (DINOv2, DINOv3, ReStraV, BreathNet), and evaluation pi...
Generate performance reports from your django database performance tests.
Quickly generate, start and analyze benchmarks for molecular dynamics simulations.
Logs performance benchmark repo: Comparing Elastic, Loki and SigNoz
Benchmark of the most commonly used http routers
Libsodium WebAssembly benchmark results.
[IJRR2024] The official repository for WildScenes: A Benchmark for 2D and 3D Semantic Segmentation in Natural Environments
A list of Ruby tools for performance analysis.
EvoEval: Evolving Coding Benchmarks via LLM
A benchmark comparing the performance of PHP ORM solutions.
Simple DNS bench util that supports encrypted protocols.
A benchmarking suite and tooling for Wasmtime and Cranelift
Learned Sort: a model-enhanced sorting algorithm
This repo contains the code for the penetration test benchmark for Generative Agents presented in the paper "AutoPenBench: Benchmarking Generative Age...
Handy tool to measure the performance and efficiency of LLM workloads.
Record "perf" performance metrics for individual functions/regions of an ELF binary.
Program to benchmark various speech recognition APIs