HK
Heykuki News
Top
New
Best
Ask
Show
Jobs
511.
▲
GEDD – Grounded Eval-Driven Development for AI Agents
github.com/aws-samples
discuss
13 days ago
balasvce2026
1 points
512.
▲
Show HN: VQAScore – open eval metric/reward model, now for text-to-video
github.com/linzhiqiu
discuss
15 days ago
linzhiqiu
1 points
513.
▲
LLM INQUISITOR: Evaluating how AI models handle long, realistic tasks
github.com/AssimilatedHuman
discuss
a month ago
ballista2026
1 points
514.
▲
Show HN: TweakIdea – 14-dimension startup idea evaluation in Claude Code
github.com/eph5xx
discuss
2 months ago
ephx
1 points
515.
▲
Show HN: Evaluate Python functions at their singularities
github.com/FWDhr
discuss
2 months ago
calculusmachine
1 points
516.
▲
Show HN: 2500 vision benchmarks / evals for Vision Language Models
github.com/Overshoot-ai
discuss
2 months ago
zakariaelhjouji
1 points
517.
▲
Show HN: An agent skill for eval-driven development of LLM-powered app
github.com/yiouli
discuss
3 months ago
yol
1 points
518.
▲
ReqIf OPA SARIF – CI/CD semantically evaluated policy gates
github.com/PromptExecution
discuss
3 months ago
elasticventures
1 points
519.
▲
Show HN: Vibe Coding Review Checklist – Evaluate AI-Generated Code Quality
github.com/aiqualitylab
discuss
4 months ago
LetsAutomate
1 points
520.
▲
Show HN: Orangensaft – A mini Python-like language with LLM eval in lang runtime
github.com/jargnar
discuss
4 months ago
jargnar
1 points
521.
▲
Show HN: Praetorian Guard – Free AI tool to self-evaluate your CV (educational)
github.com/simonesan-afk
discuss
4 months ago
saimonsan
1 points
522.
▲
MiRAGE: Open-source framework for multimodal RAG evaluation
discuss
4 months ago
mmhetric
1 points
523.
▲
The Vocabulary Priming Confound in LLM Evaluation [pdf]
github.com/Palmerschallon
discuss
4 months ago
palmerschallon
1 points
524.
▲
Open source agents to evaluate, debug, and optimize your prompts
github.com/comet-ml
discuss
5 months ago
ChefboyOG
1 points
525.
▲
Simboba: Evals for your AI product in under 5 mins
github.com/ntkris
discuss
6 months ago
handfuloflight
1 points
526.
▲
Live-trade-bench: Live evaluation of trading agents
github.com/ulab-uiuc
discuss
6 months ago
simonpure
1 points
527.
▲
Show HN: Dokimos – LLM evaluation framework for Java
github.com/dokimos-dev
discuss
6 months ago
fkapsahili
1 points
528.
▲
Benchmark that evaluates LLMs using 759 NYT Connections puzzles
github.com/lechmazur
discuss
6 months ago
ShrugLife
1 points
529.
▲
Show HN: smallevals – Local LLM Evaluation Framework with Tiny 0.6B Models
github.com/mburaksayici
discuss
7 months ago
mburaksayici
1 points
530.
▲
Open source LLM prompt eval and optimization CLI
github.com/davismartens
discuss
7 months ago
davismartens
1 points
531.
▲
Show HN: StructEval - a structured output evaluation and comparison tool
github.com/jhiker
discuss
7 months ago
jwesleyharding
1 points
532.
▲
Rogue – The AI Agent Evaluator
github.com/qualifire-dev
discuss
8 months ago
maxloh
1 points
533.
▲
Show HN: Local RAG Eval Harness – reproducible benchmarks for retrieval pipelines
discuss
8 months ago
myroslavmokhamm
1 points
534.
▲
TinyExpr: Parser, compiler, and evaluation engine for math expressions
github.com/codeplea
discuss
8 months ago
gregsadetsky
1 points
535.
▲
Benchmark code for evaluating different ASR packages and APIs
github.com/huggingface
discuss
9 months ago
pinter69
1 points
536.
▲
Show HN: PromptDev – Prompt eval and testing for AI agents across providers
github.com/artefactop
discuss
10 months ago
sabatesduran
1 points
537.
▲
numexpr: fast numerical array expression evaluator for Python
github.com/pydata
discuss
10 months ago
cl3misch
1 points
538.
▲
Quality and Safety Evaluations for AI Agents on Azure
github.com/aymenfurter
discuss
10 months ago
jacksensi
1 points
539.
▲
Show HN: Hypersigil – Prompt management UI – test, evaluate, deploy
github.com/hypersigilhq
discuss
a year ago
piterrro
1 points
540.
▲
Safe-MCP: Security Analysis Framework for Evaluation of Model Context Protocol
github.com/fkautz
discuss
a year ago
mooreds
1 points