lechmazur

👤 Developer

7 repositories on SrcLog

7 Repos
724 Stars
26 Forks
724 Watchers

Repositories (7)

elimination_game lechmazur/elimination_game

A multi-player tournament benchmark that tests LLMs on social reasoning, strategy, and deception. Players engage in public and private conversations, form alliances, and vote to eliminate one another.

289
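The voting-elimination mechanic described above can be sketched in a few lines. This is an illustrative toy, not the repo's actual code; in particular, the tie-breaking rule here (first player returned by the tally) is an assumption.

```python
from collections import Counter

def eliminate(votes):
    """Tally one elimination round: each player names a target,
    and the player receiving the most votes is eliminated.
    Ties are broken arbitrarily here (assumption, not the repo's rule)."""
    tally = Counter(votes.values())
    eliminated, _count = tally.most_common(1)[0]
    return eliminated

# A and B both vote against C, so C is eliminated.
out = eliminate({"A": "C", "B": "C", "C": "A"})
```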
confabulations lechmazur/confabulations HTML

Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers.

166
nyt-connections lechmazur/nyt-connections Python

Benchmark that evaluates LLMs on 651 NYT Connections puzzles extended with extra trick words.

93
generalization lechmazur/generalization

Thematic Generalization Benchmark: measures how effectively various LLMs can infer a narrow or specific "theme" (category/rule) from a small set of examples and anti-examples, then detect which item truly fits that theme among a collection of misleading candidates.

57
step_game lechmazur/step_game

Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure. A multi-player “step-race” that challenges LLMs to engage in public conversation before secretly picking a move (1, 3, or 5 steps). Whenever two or more players choose the same number, all colliding players fail to advance.

51
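The collision rule in the step-race description is simple enough to sketch directly: players secretly pick 1, 3, or 5 steps, and any players who pick the same number fail to advance that round. This is an illustrative sketch of that rule only, not the repository's implementation.

```python
from collections import Counter

def advance(positions, moves):
    """Apply one round of secret moves (1, 3, or 5 steps per player).
    Players whose chosen number collides with another player's stay put."""
    counts = Counter(moves.values())
    return {
        player: pos + (moves[player] if counts[moves[player]] == 1 else 0)
        for player, pos in positions.items()
    }

positions = {"A": 0, "B": 0, "C": 0}
moves = {"A": 5, "B": 5, "C": 3}  # A and B collide on 5
positions = advance(positions, moves)
# A and B fail to advance; only C moves.
```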
pgg_bench lechmazur/pgg_bench

Public Goods Game (PGG) Benchmark: Contribute & Punish is a multi-agent benchmark that tests cooperative and self-interested strategies among Large Language Models (LLMs) in a resource-sharing economic scenario. Our experiment extends the classic PGG with a punishment phase, allowing players to penalize free-riders or retaliate against others.

36
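The contribute-and-punish structure described above can be sketched as one round of a public goods game. The payoff parameters below (pool multiplier, punishment cost and damage) are assumptions for illustration, not the benchmark's actual values.

```python
def pgg_round(endowments, contributions, punishments,
              multiplier=1.6, punish_cost=1, punish_damage=3):
    """One Public Goods Game round with a punishment phase.
    Contributions are pooled, multiplied, and split evenly; then each
    (punisher, target) pair costs the punisher `punish_cost` and the
    target `punish_damage`. Parameter values are illustrative."""
    n = len(endowments)
    pool = sum(contributions) * multiplier
    share = pool / n
    payoffs = [e - c + share for e, c in zip(endowments, contributions)]
    for punisher, target in punishments:
        payoffs[punisher] -= punish_cost
        payoffs[target] -= punish_damage
    return payoffs

# Player 2 free-rides; player 0 pays to punish them once.
payoffs = pgg_round([10, 10, 10], [10, 10, 0], [(0, 2)])
```

Note that with these parameters a single punishment leaves the free-rider ahead, which is why repeated punishment (or retaliation) makes the dynamics interesting.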
divergent lechmazur/divergent

LLM Divergent Thinking Creativity Benchmark. LLMs generate 25 unique words that start with a given letter, with no connections to each other or to 50 initial random words.

32
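The surface constraints in the divergent-thinking task (25 words, all unique, all starting with the given letter, none among the banned seed words) are mechanically checkable; the "no connections" requirement is semantic and presumably judged separately. A hedged sketch of the mechanical check only, with hypothetical names:

```python
def is_valid_answer(words, letter, banned):
    """Check the surface constraints of a divergent-thinking answer:
    exactly 25 words, all unique (case-insensitive), all starting with
    `letter`, and none appearing in the `banned` seed-word list.
    Semantic unrelatedness is NOT checked here."""
    lower = [w.lower() for w in words]
    return (len(words) == 25
            and len(set(lower)) == 25
            and all(w.startswith(letter.lower()) for w in lower)
            and not set(lower) & {b.lower() for b in banned})
```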