Topic

benchmark

Repositories (1763)

indonlu
indonlu IndoNLP Jupyter Notebook

The first-ever vast natural language processing benchmark for the Indonesian language. We provide multiple downstream tasks, pre-trained IndoBERT models,...

641
Awesome-LLM-Eval
Awesome-LLM-Eval onejune2018

Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for the evaluation of LLMs. A list made up of tools...

635
kotlinx-benchmark
kotlinx-benchmark Kotlin Kotlin

Kotlin multiplatform benchmarking toolkit

633
VLM2Vec
VLM2Vec TIGER-AI-Lab Python

This repo contains the code for "VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks" [ICLR 2025]

631
TrustLLM
TrustLLM HowieHwong Python

[ICML 2024] TrustLLM: Trustworthiness in Large Language Models

623
rspec-benchmark
rspec-benchmark piotrmurach Ruby

Performance testing matchers for RSpec

618
NIID-Bench
NIID-Bench Xtra-Computing Python

Federated Learning Benchmark - Federated Learning on Non-IID Data Silos: An Experimental Study (ICDE 2022)

616
BenchMARL
BenchMARL facebookresearch Python

BenchMARL is a library for benchmarking Multi-Agent Reinforcement Learning (MARL). BenchMARL makes it possible to quickly compare different MARL algorithms, task...

611
TextClassificationBenchmark
TextClassificationBenchmark FreedomIntelligence Python

A Benchmark of Text Classification in PyTorch

607
chillout
chillout polygonplanet JavaScript

Reduce CPU usage with a non-blocking async loop and improve perceived speed in JavaScript

597
KLUE
KLUE KLUE-benchmark

📖 Korean NLU Benchmark

594
rewrk
rewrk lnx-search Rust

A more modern HTTP framework benchmarker supporting HTTP/1 and HTTP/2 benchmarks.

582
DeeperForensics-1.0
DeeperForensics-1.0 EndlessSora Python

[CVPR 2020] A Large-Scale Dataset for Real-World Face Forgery Detection

579
SensatUrban
SensatUrban QingyongHu C++

🔥Urban-scale point cloud dataset (CVPR 2021 & IJCV 2022)

573
AgentLab
AgentLab ServiceNow Python

AgentLab: An open-source framework for developing, testing, and benchmarking web agents on diverse tasks, designed for scalability and reproducibility...

571
Visual-Tracking-Development
Visual-Tracking-Development DavidZhangdw Python

Visual Object Tracking

570
alpaca
alpaca p-ranav C++

Serialization library written in C++17 - Pack C++ structs into a compact byte-array without any macros or boilerplate code

557
CValues
CValues X-PLUG Python

Research on evaluating and aligning the values of Chinese large language models

557
completely-unscientific-benchmarks
completely-unscientific-benchmarks frol C++

Naive performance comparison of a few programming languages (JavaScript, Kotlin, Rust, Swift, Nim, Python, Go, Haskell, D, C++, Java, C#, Object Pasca...

556
Leaderboard
Leaderboard SpeechColab Python

SpeechIO Leaderboard: a large, robust, comprehensive, benchmarking platform for Automatic Speech Recognition.

545
agentdojo
agentdojo ethz-spylab Python

A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents.

542
NBench
NBench petabridge C#

Performance benchmarking and testing framework for .NET applications 📈

540
rpc-benchmark
rpc-benchmark hank-whu Java

Java RPC benchmark, inspired by https://www.techempower.com/benchmarks/

531
ClawProBench
ClawProBench suyoumo Python

ClawProBench is a live-first benchmark harness for evaluating LLM agents in the OpenClaw runtime with deterministic grading and repeated-trial rel...

528
pcam
pcam basveeling Python

The PatchCamelyon (PCam) deep learning classification benchmark.

520
LongCite
LongCite THUDM Python

LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA

520
FewCLUE
FewCLUE CLUEbenchmark Python

FewCLUE: few-shot learning evaluation benchmark, Chinese edition

519
CL-bench
CL-bench Tencent-Hunyuan Python

CL-bench: A Benchmark for Context Learning

518
glmark2
glmark2 glmark2 C

glmark2 is an OpenGL 2.0 and ES 2.0 benchmark

512
globalping
globalping jsdelivr TypeScript

A global network of probes to run network tests like ping, traceroute and DNS resolve

511
awesome-state-of-depth-completion
awesome-state-of-depth-completion alexklwong

Current state of supervised and unsupervised depth completion methods

510
Static-to-Dynamic-LLMEval
Static-to-Dynamic-LLMEval SeekingDream

The official GitHub repository of the paper "Recent advances in large language model benchmarks against data contamination: From static to dynamic eva...

505
z-bench
z-bench zhenbench

Z-Bench 1.0 by ZhenFund (真格基金): a Chinese-language test set for large language models, made for "muggles". Z-Bench is an LLM prompt dataset for non-technical users, developed by an enthusiastic AI-focu...

504
bigcodebench
bigcodebench bigcode-project Python

[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI

498
web-tooling-benchmark
web-tooling-benchmark v8 JavaScript

JavaScript benchmark for common web developer workloads

495
RHEL7-CIS
RHEL7-CIS ansible-lockdown YAML

Automated CIS Benchmark Compliance Remediation for RHEL 7 with Ansible

486
meta-agents-research-environments
meta-agents-research-environments facebookresearch Python

Meta Agents Research Environments is a comprehensive platform designed to evaluate AI agents in dynamic, realistic scenarios. Unlike static benchmarks...

479
PaddleFleetX
PaddleFleetX PaddlePaddle Python

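PaddlePaddle large-model development suite, providing an end-to-end development toolchain for large language models, cross-modal large models, biocomputing large models, and other domains.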

479
xfr
xfr lance0 Rust

A modern iperf3 alternative with a live TUI, multi-client server, and QUIC support. Built in Rust.

474
ant-application-security-testing-benchmark
ant-application-security-testing-benchmark alipay Java

The xAST evaluation benchmark makes security tools no longer a "black box".

472
little-coder
little-coder itayinbarr TypeScript

A coding agent optimized for smaller LLMs

462
tf_to_trt_image_classification
tf_to_trt_image_classification NVIDIA-AI-IOT Python

Image classification with NVIDIA TensorRT from TensorFlow models.

461
automlbenchmark
automlbenchmark openml Python

OpenML AutoML Benchmarking Framework

459
mixbench
mixbench ekondis C++

A GPU benchmark tool for evaluating GPUs and CPUs on mixed operational intensity kernels (CUDA, OpenCL, HIP, SYCL, OpenMP)

454
script
script adysec

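VPS test scripts | VPS performance tests (basic VPS information, I/O performance, global speed tests, ping, return-route tests), BBR acceleration scripts (a congestion-control algorithm that speeds up TCP), three-carrier speed test scripts (three-carrier speed tests, streaming...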

448
LayoutFrameworkBenchmark
LayoutFrameworkBenchmark layoutBox Swift

Benchmark the performance of various Swift layout frameworks (autolayout, UIStackView, PinLayout, LayoutKit, FlexLayout, Yoga, ...)

445
prophiler
prophiler fabfuel PHP

PHP Profiler & Developer Toolbar (built for Phalcon)

441
sympact
sympact simonepri JavaScript

🔥 Stupid Simple CPU/MEM "Profiler" for your JS code.

441
gymfc
gymfc wil3 Python

A universal flight control tuning framework

441
ParseBench
ParseBench run-llama Python

ParseBench - A Document Parsing Benchmark for AI Agents

437