Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers.
Python timeit CLI for the 21st century! Colored output, multi-line input with syntax highlighting and autocompletion, and much more!
An evaluation benchmark for MCP servers
No-code CLI designed for accelerating ONNX workflows
C++ Binary Fixed-Point Arithmetic
HAKE: Human Activity Knowledge Engine (CVPR'18/19/20, NeurIPS'20, TPAMI'21)
A mini-framework for evaluating LLM performance on the Bulls and Cows number guessing game, supporting multiple LLM providers.
The code release of paper "Ultra-High-Definition Low-Light Image Enhancement: A Benchmark and Transformer-Based Method", AAAI 2023
Official repository of the ICML2025 paper “Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination”
Vulkan benchmark
Comparative Evaluation of Hand-Crafted and Learned Local Features
An automated scoring function to facilitate and standardize the evaluation of goal-directed generative models for de novo molecular design
nablaDFT: Large-Scale Conformational Energy and Hamiltonian Prediction benchmark and dataset
[ICLR 2025] A trinity of environments, tools, and benchmarks for general virtual agents
A rapid http(s) benchmark tool written in Go
Performance benchmarks for Kubernetes
Python & Matlab code for local feature descriptor evaluation with the HPatches dataset.
Benchmark your Kubernetes storage.
🚀🪑 evm-bench is a suite of Ethereum Virtual Machine stress tests and benchmarks.
Benchmark of Apple MLX operations on all Apple Silicon chips (GPU, CPU) + MPS and CUDA.
Comprehensive benchmarking of protein-ligand structure prediction methods. (Nature Machine Intelligence)
Performance of various open source GBM implementations
This project collects GPU benchmarks from various cloud providers and compares them to fixed per token costs. Use our tool for efficient LLM GPU selec...
Awesome Mojo🔥
Seamless analysis of your PyTorch models (RAM usage, FLOPs, MACs, receptive field, etc.)
Benchmark that evaluates LLMs using 759 NYT Connections puzzles extended with extra trick words
DBNet: A Large-Scale Dataset for Driving Behavior Learning, CVPR 2018
A lightweight harness designed to be simple to use, efficient, and universally compatible with any coding agent to evaluate them over a broad set of codi...
This repository is the official implementation of the paper Convolutional Neural Operators for robust and accurate learning of PDEs
A universal load testing framework for Rust, with real-time tui support.
A benchmark for spaced repetition schedulers/algorithms
Repo for "Benchmarking Robustness of 3D Point Cloud Recognition against Common Corruptions" https://arxiv.org/abs/2201.12296
TPC-H benchmark kit with some modifications/additions
A self-alignment method for role-play. Benchmark for role-play. Resources for "Large Language Models are Superpositions of All Characters: Attaining A...
[NeurIPS 2024 D&B Spotlight🔥] ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation
An End-to-End Benchmark Suite for Univariate Time-Series Anomaly Detection
[PyTorch] The repo contains the code for "FORGE: Forming Semantic Identifiers for Generative Retrieval in Industrial Datasets"
[NeurIPS 2025 D&B🔥] OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation
Benchmark for Prometheus-compatible systems
CVE-Bench: A Benchmark for AI Agents’ Ability to Exploit Real-World Web Application Vulnerabilities
Meta Self-learning for Multi-Source Domain Adaptation: A Benchmark
📊 zig benchmark
A large-scale multilingual dataset for Information Retrieval. Thorough human-annotations across 18 diverse languages.
Apache Kafka load testing "...basically a cloth bag filled with small jagged pieces of scrap iron"
This repository contains 47,398 smart contracts extracted from the Ethereum network.
Benchmark for embedded-AI deep learning inference engines, such as NCNN / TNN / MNN / TensorFlow Lite, etc.
A few quick scripts focused on testing TensorFlow/PyTorch/Llama 2 on macOS.
Automated CIS Benchmark Compliance Remediation for RHEL 9 with Ansible
🎨 Simple but attractive graphic calculator built with Jetpack Compose
Rodinia benchmark