Most popular benchmark repositories and open source projects

indonlu IndoNLP Jupyter Notebook

The first large-scale natural language processing benchmark for the Indonesian language. We provide multiple downstream tasks, pre-trained IndoBERT models,...

641 213 641

Awesome-LLM-Eval onejune2018

Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs and models, mainly for evaluating LLMs. A collection of tools...

635 61 635

kotlinx-benchmark Kotlin Kotlin

Kotlin multiplatform benchmarking toolkit

633 43 633

VLM2Vec TIGER-AI-Lab Python

This repo contains the code for "VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks" [ICLR 2025]

631 59 631

TrustLLM HowieHwong Python

[ICML 2024] TrustLLM: Trustworthiness in Large Language Models

623 67 623

rspec-benchmark piotrmurach Ruby

Performance testing matchers for RSpec

618 21 618

NIID-Bench Xtra-Computing Python

Federated Learning Benchmark - Federated Learning on Non-IID Data Silos: An Experimental Study (ICDE 2022)

616 123 616

BenchMARL facebookresearch Python

BenchMARL is a library for benchmarking Multi-Agent Reinforcement Learning (MARL). BenchMARL allows users to quickly compare different MARL algorithms, task...

611 125 611

TextClassificationBenchmark FreedomIntelligence Python

A Benchmark of Text Classification in PyTorch

607 134 607

chillout polygonplanet JavaScript

Reduce CPU usage with a non-blocking async loop and improve perceived speed in JavaScript

597 19 597

KLUE KLUE-benchmark

📖 Korean NLU Benchmark

594 58 594

rewrk lnx-search Rust

A more modern HTTP framework benchmarker supporting HTTP/1 and HTTP/2 benchmarks.

582 45 582

DeeperForensics-1.0 EndlessSora Python

[CVPR 2020] A Large-Scale Dataset for Real-World Face Forgery Detection

579 77 579

SensatUrban QingyongHu C++

🔥Urban-scale point cloud dataset (CVPR 2021 & IJCV 2022)

573 59 573

AgentLab ServiceNow Python

AgentLab: An open-source framework for developing, testing, and benchmarking web agents on diverse tasks, designed for scalability and reproducibility...

571 110 571

Visual-Tracking-Development DavidZhangdw Python

Visual Object Tracking

570 63 570

alpaca p-ranav C++

Serialization library written in C++17 - Pack C++ structs into a compact byte-array without any macros or boilerplate code

557 46 557

CValues X-PLUG Python

Research on evaluating and aligning the values of Chinese large language models

557 22 557

completely-unscientific-benchmarks frol C++

Naive performance comparison of a few programming languages (JavaScript, Kotlin, Rust, Swift, Nim, Python, Go, Haskell, D, C++, Java, C#, Object Pasca...

556 67 556

Leaderboard SpeechColab Python

SpeechIO Leaderboard: a large, robust, comprehensive benchmarking platform for Automatic Speech Recognition.

545 71 545

agentdojo ethz-spylab Python

A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents.

542 138 542

NBench petabridge C#

Performance benchmarking and testing framework for .NET applications

540 48 540

rpc-benchmark hank-whu Java

Java RPC benchmark, inspired by https://www.techempower.com/benchmarks/

531 123 531

ClawProBench suyoumo Python

ClawProBench is a live-first benchmark harness for evaluating LLM agents in the OpenClaw runtime with deterministic grading and repeated-trial rel...

528 45 528

pcam basveeling Python

The PatchCamelyon (PCam) deep learning classification benchmark.

520 108 520

LongCite THUDM Python

LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA

520 31 520

FewCLUE CLUEbenchmark Python

FewCLUE: a few-shot learning evaluation benchmark, Chinese edition

519 75 519

CL-bench Tencent-Hunyuan Python

CL-bench: A Benchmark for Context Learning

518 29 518

glmark2 glmark2 C

glmark2 is an OpenGL 2.0 and ES 2.0 benchmark

512 206 512

globalping jsdelivr TypeScript

A global network of probes to run network tests like ping, traceroute and DNS resolve

511 51 511

awesome-state-of-depth-completion alexklwong

Current state of supervised and unsupervised depth completion methods

510 24 510

Static-to-Dynamic-LLMEval SeekingDream

The official GitHub repository of the paper "Recent advances in large language model benchmarks against data contamination: From static to dynamic eva...

505 39 505

z-bench zhenbench

Z-Bench 1.0 by ZhenFund (真格基金): a Chinese large language model test set for "muggles" (non-experts). Z-Bench is an LLM prompt dataset for non-technical users, developed by an enthusiastic AI-focu...

504 42 504

bigcodebench bigcode-project Python

[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI

498 71 498

web-tooling-benchmark v8 JavaScript

JavaScript benchmark for common web developer workloads

495 78 495

RHEL7-CIS ansible-lockdown YAML

Automated CIS Benchmark Compliance Remediation for RHEL 7 with Ansible

486 299 486

meta-agents-research-environments facebookresearch Python

Meta Agents Research Environments is a comprehensive platform designed to evaluate AI agents in dynamic, realistic scenarios. Unlike static benchmarks...

479 64 479