HK
Heykuki News
Top
New
Best
Ask
Show
Jobs
Request
391.
▲
Fast, portable, non-Turing complete expression evaluation with gradual typing
github.com/google
discuss
a month ago
tjek
2 points
392.
▲
Show HN: Nexa-Gauge – LLM eval framework, now with self-hosted model support
github.com/harnexa
discuss
a month ago
Sardhendu
2 points
393.
▲
How many of us are evaling our skills?
github.com/BintzGavin
discuss
2 months ago
GavinBintz
2 points
394.
▲
Show HN: Verdict – model evals on your own data, not someone else's benchmark
github.com/aevyraai
discuss
2 months ago
agunapal
2 points
395.
▲
Show HN: SkillCompass – open-source quality evaluator for your AI skills
github.com/Evol-ai
discuss
2 months ago
yo103jg
2 points
396.
▲
Stockfish removes classical evaluation functions in favor of NNUE only (2023)
github.com/official-stockfish
discuss
2 months ago
knuckleheads
2 points
397.
▲
Show HN: We Evaluate Medical Research Agent Skills
github.com/aipoch
discuss
2 months ago
The_resa
2 points
398.
▲
Tax Logic Evaluation with Prolog
github.com/mthom
discuss
3 months ago
triska
2 points
399.
▲
Show HN: Aludel – LLM eval workbench for Phoenix apps
github.com/ccarvalho-eng
discuss
3 months ago
wood-archer
2 points
400.
▲
Show HN: A tool to create and evaluate document processing pipelines for RAG
ragbandit.com
discuss
3 months ago
martimchaves
2 points
401.
▲
I built a local-only eval runner for AI agents (quickbench)
github.com/iamGodofall
discuss
3 months ago
Godofall
2 points
402.
▲
LLM evals test outputs. Rarely whether the model understood first
github.com/NoxionAI
discuss
3 months ago
noxion
2 points
403.
▲
Dynamic E2E Agentic Simulation and Evaluation with Cypress
github.com/gojiplus
discuss
3 months ago
neehao
2 points
404.
▲
TLAi+ Benchmarks for Evaluating LLMs
github.com/tlaplus
discuss
3 months ago
alhazrod
2 points
405.
▲
Edge – Generate structured evaluation criteria for any domain using a local LLM
github.com/EviAmarates
discuss
4 months ago
TiagoSantos
2 points
406.
▲
Engine-Bench: Evaluating Coding Agents on Writing Game Engine Code
github.com/JoshuaPurtell
discuss
5 months ago
JoshPurtell
2 points
407.
▲
Show HN: Simboba – Evals in under 5 mins
github.com/ntkris
discuss
6 months ago
ntkris
2 points
408.
▲
Show HN: Dokimos – LLM Evaluation Framework for Java
github.com/dokimos-dev
discuss
6 months ago
fkapsahili
2 points
409.
▲
Chess LLM Benchmark: Evaluating LLMs' ability to play chess
github.com/lightnesscaster
discuss
7 months ago
dwohnitmok
2 points
410.
▲
Show HN: AI PM Evaluation Framework (Open Source)
aipmframework.com
discuss
8 months ago
abediaz
2 points
411.
▲
Codegen Scorer – evaluate the quality of code generated by LLMs
github.com/angular
discuss
9 months ago
martypitt
2 points
412.
▲
Physical_Atari: Platform for evaluating RL algorithms on a physical Atari
github.com/Keen-Technologies
discuss
9 months ago
simonpure
2 points
413.
▲
OpenBench: Provider-agnostic, open-source evaluation infrastructure for LLMs
github.com/groq
discuss
10 months ago
gmays
2 points
414.
▲
Show HN: KARMA – An evaluation framework for Medical AI systems
karma.eka.care
discuss
10 months ago
k2so
2 points
415.
▲
LLM Speedrunner: Eval for frontier models to reproduce scientific findings
github.com/facebookresearch
discuss
a year ago
zerojames
2 points
416.
▲
MAIR: A Benchmark for Evaluating Instructed Retrieval
github.com/sunnweiwei
discuss
a year ago
fzliu
2 points
417.
▲
Doyensec – Security Policy Evaluation Framework
github.com/gravitational
discuss
a year ago
tony-ds
2 points
418.
▲
Evaluate Any Model from the HuggingFace Hub on ImageNet on Free Colab GPUs
github.com/SauravMaheshkar
discuss
a year ago
sauravmaheshkar
2 points
419.
▲
Lambda calculus - compiler, type inference, and evaluator in less than 100 LOC
gist.github.com
discuss
a year ago
tearflake
2 points
420.
▲
Show HN: I built an open-source benchmark that evaluates LLMs through gameplay
llmshowdown.io
discuss
a year ago
jmogi
2 points