Topic

benchmark

Repositories (1763)

MultiMedEval
MultiMedEval corentin-ryr Python

A Python tool to evaluate the performance of VLM on the medical domain.

88
physics-benchmarking-neurips2021
physics-benchmarking-neurips2021 cogtoolslab Jupyter Notebook

Repo for "Physion: Evaluating Physical Prediction from Vision in Humans and Machines", presented at NeurIPS 2021 (Datasets & Benchmarks track)

88
BinaryAudit
BinaryAudit QuesmaOrg Shell

An open-source benchmark for evaluating AI agents' ability to find backdoors hidden in compiled binaries.

88
Machine-Unlearning-Comparator
Machine-Unlearning-Comparator gnueaj TypeScript

A Visual Analytics System for Comparative Evaluation of Machine Unlearning Methods (TVCG 2026)

88
mujoco-benchmark
mujoco-benchmark ChenDRAG

Provides a full reinforcement learning benchmark on MuJoCo environments, including DDPG, SAC, TD3, PG, A2C, PPO, library

88
NetworkBenchmarkDotNet
NetworkBenchmarkDotNet JohannesDeml C#

Low-level dotnet network benchmark for UDP socket performance (.NET and Unity compatible)

87
Mercury
Mercury Elfsong Jupyter Notebook

Code Efficiency Benchmark

87
LifelongAgentBench
LifelongAgentBench caixd-220529 Python

Code repo for "LifelongAgentBench: Evaluating LLM Agents as Lifelong Learners"

87
vllm-safety-benchmark
vllm-safety-benchmark UCSC-VLAA Python

[ECCV 2024] Official PyTorch Implementation of "How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs"

87
CMExam
CMExam williamliujl Python

A Chinese National Medical Licensing Examination dataset and large language model benchmarks

86
MultiCorrupt
MultiCorrupt ika-rwth-aachen Jupyter Notebook

[IV2024] MultiCorrupt: A benchmark for robust multi-modal 3D object detection, evaluating LiDAR-Camera fusion models in autonomous driving. Includes d...

86
GL_vs_VK
GL_vs_VK RippeR37 C++

Comparison of the OpenGL and Vulkan APIs in terms of performance.

86
wikidata-simplequestions
wikidata-simplequestions askplatypus Jupyter Notebook

Mapping of the SimpleQuestions dataset to Wikidata

86
step_game
step_game lechmazur

Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure. A multi-player “step-race” that challenges LLMs to engage i...

86
Elysium
Elysium Hon-Wong Python

[ECCV 2024] Elysium: Exploring Object-level Perception in Videos via MLLM

86
vlsa-aegis
vlsa-aegis THU-RCSCT Python

A vision-language-safety action architecture, named AEGIS, which contains a plug-and-play safety constraint layer formulated via control barrier funct...

86
SVBench
SVBench sotayang Python

[ICLR'2025 Spotlight] Official repository for "SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding"

86
CreativeBench
CreativeBench ZethWang Python

Official code for "CreativeBench: Benchmarking and Enhancing Machine Creativity via Self-Evolving Challenges"

86
turkish-lm-tuner
turkish-lm-tuner boun-tabi-LMG Python

Turkish LM Tuner

86
BenchLMM
BenchLMM AIFEG Python

[ECCV 2024] BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models

86
ncnn-android-benchmark
ncnn-android-benchmark nihui C++

ncnn android benchmark app

86
rb
rb klingtnet Rust

A thread-safe fixed-size circular buffer written in safe Rust.

86
locust4j
locust4j myzhan Java

Locust4j is a load generator for locust, written in Java.

86
PEACE
PEACE microsoft Python

PEACE: Empowering Geologic Map Holistic Understanding with MLLMs [Official, CVPR 2025]

85
benchinit
benchinit mvdan Go

Benchmark the init cost of Go packages

85
DramaBench
DramaBench IIIIQIIII HTML

A six-dimensional evaluation framework for drama script continuation with interactive leaderboard and case studies

85
GMAI-MMBench
GMAI-MMBench uni-medical

GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI.

84
tum-traffic-dataset-dev-kit
tum-traffic-dataset-dev-kit tum-traffic-dataset Python

TUM Traffic Dataset Development Kit

84
Benchmark
Benchmark WorldDownTown Swift

The Benchmark⏲ module provides methods to measure and report the time used to execute Swift code.

84
Impossible-Videos
Impossible-Videos showlab Python

ICML 2025 - Impossible Videos

83
Counting-Stars
Counting-Stars nick7nlp Jupyter Notebook

Counting-Stars (★)

83
tasty-bench
tasty-bench Bodigrim Haskell

Featherlight benchmark framework, drop-in replacement for criterion and gauge.

83
IMLCGui
IMLCGui FarisR99 C#

Intel Memory Latency Checker GUI

83
AI-Generated-Content-Detection
AI-Generated-Content-Detection rustyneuron01 Python

Multi-modal AI-generated content detection: image, video, and audio. Benchmarks, training code (DINOv2, DINOv3, ReStraV, BreathNet), and evaluation pi...

83
pytest-django-queries
pytest-django-queries NyanKiyoshi Python

Generate performance reports from your django database performance tests.

83
MDBenchmark
MDBenchmark bio-phys Python

Quickly generate, start and analyze benchmarks for molecular dynamics simulations.

83
logs-benchmark
logs-benchmark SigNoz Shell

Logs performance benchmark repo: Comparing Elastic, Loki and SigNoz

82
router-benchmark
router-benchmark delvedor JavaScript

Benchmark of the most commonly used HTTP routers

82
webassembly-benchmarks
webassembly-benchmarks jedisct1

Libsodium WebAssembly benchmarks results.

82
WildScenes
WildScenes csiro-robotics Python

[IJRR2024] The official repository for WildScenes: A Benchmark for 2D and 3D Semantic Segmentation in Natural Environments

82
ruby-performance-tools
ruby-performance-tools JuanitoFatas

List of Ruby performance tools.

81
evoeval
evoeval evo-eval Python

EvoEval: Evolving Coding Benchmarks via LLM

81
php-orm-benchmark
php-orm-benchmark sergeyklay PHP

The benchmark to compare performance of PHP ORM solutions.

81
godnsbench
godnsbench ameshkov Go

Simple DNS bench util that supports encrypted protocols.

81
sightglass
sightglass bytecodealliance C

A benchmarking suite and tooling for Wasmtime and Cranelift

81
LearnedSort
LearnedSort anikristo C++

Learned Sort: a model-enhanced sorting algorithm

81
auto-pen-bench
auto-pen-bench lucagioacchini Python

This repo contains the code for the penetration test benchmark for Generative Agents presented in the paper "AutoPenBench: Benchmarking Generative Age...

81
ollama-benchmark
ollama-benchmark cloudmercato Python

Handy tool to measure the performance and efficiency of LLM workloads.

81
perforator
perforator zyedidia Go

Record "perf" performance metrics for individual functions/regions of an ELF binary.

81
ASR_benchmark
ASR_benchmark Franck-Dernoncourt Python

Program to benchmark various speech recognition APIs

81