llm-bug-bench
gabrieldiem/llm-bug-bench
Python
Web-based benchmark suite that evaluates how well LLMs catch real-world concurrency bugs, deadlocks, and distributed-systems edge cases. Includes automated 1–20 scoring via an LLM judge, side-by-side run comparisons, and support for both local (Ollama) and cloud providers.
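As an illustrative sketch (not code from this repo), here is the kind of lock-ordering deadlock such a suite might ask an LLM to spot, together with the standard fix of always acquiring locks in a fixed global order; the `transfer` function and `balance` dict are hypothetical names for the example:

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
balance = {"a": 100, "b": 100}

def transfer(src_lock, dst_lock, src, dst, amount):
    # Buggy version: `with src_lock: with dst_lock:` deadlocks when two
    # threads transfer in opposite directions and each holds one lock.
    # Fix: impose a global acquisition order (here, by object id).
    first, second = sorted((src_lock, dst_lock), key=id)
    with first:
        with second:
            balance[src] -= amount
            balance[dst] += amount

t1 = threading.Thread(target=transfer, args=(lock_a, lock_b, "a", "b", 10))
t2 = threading.Thread(target=transfer, args=(lock_b, lock_a, "b", "a", 10))
t1.start(); t2.start()
t1.join(); t2.join()
print(balance)  # both transfers complete; total money is conserved
```

A benchmark case like this tests whether the model notices that correctness depends on acquisition order rather than on any single lock.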