HK
Heykuki News
Top
New
Best
Ask
Show
Jobs
91.
▲
Launch HN: Relari (YC W24) – Identify the root cause of problems in LLM apps
15 comments
2 years ago
antonap
106 points
92.
▲
Show HN: Web-eval-agent – Let the coding agent debug itself
github.com/Operative-Sh
12 comments
a year ago
neversettles
84 points
93.
▲
Show HN: Ellipsis – Automatic pull request reviews
ellipsis.dev
11 comments
2 years ago
hunterbrooks
18 points
94.
▲
Bad MCP design costs your agent 5x more tokens
1 comment
18 days ago
JohnnyZhang483
17 points
95.
▲
Show HN: Honcho – Open-source memory infrastructure, powered by custom models
github.com/plastic-labs
discuss
5 months ago
vvoruganti
8 points
96.
▲
Show HN: Agent Tinman – Autonomous failure discovery for LLM systems
github.com/oliveskin
discuss
5 months ago
oliveskin
4 points
97.
▲
Show HN: Open Operator Evals – real-world benchmarks for LLM web agents
github.com/nottelabs
1 comment
a year ago
monoid73
3 points
98.
▲
Show HN: PromptProof – CI gate for LLM outputs (schema/regex/cost; no API keys)
news.ycombinator.com
discuss
10 months ago
geminimir
2 points
99.
▲
Show HN: I made web agents reliable with smaller LLMs via natural language
github.com/nottelabs
discuss
a year ago
giordanol
2 points
100.
▲
Deprecating A/B tests with offline policy evaluation
discuss
5 years ago
econti
1 point
101.
▲
Show HN: Agent-skills-eval – Test whether Agent Skills improve outputs
github.com/darkrishabh
37 comments
2 months ago
darkrishabh
79 points
102.
▲
Show HN: Continuous-eval – Granular evaluation of GenAI pipelines
github.com/relari-ai
2 comments
2 years ago
antonap
10 points
103.
▲
Show HN: I designed a ChatGPT prompt evaluator to ruin your fun;)
github.com/alignedai
1 comment
4 years ago
buildaligned
8 points
104.
▲
Show HN: Image Eval – An evaluation toolkit for image generation models
github.com/Storia-AI
discuss
3 years ago
nutellalover
7 points
105.
▲
Open RAG Eval
github.com/vectara
1 comment
a year ago
TastyLamps
6 points
106.
▲
In a sample of >1000 games, GPT-3.5-turbo-instruct plays chess with ~1800 elo
github.com/adamkarvonen
4 comments
3 years ago
sebzim4500
4 points
107.
▲
Show HN: Eval.js – a JavaScript interpreter written in JavaScript
github.com/marten-de-vries
1 comment
11 years ago
marten-de-vries
4 points
108.
▲
Open Game Eval: an eval for agentic Lua game development in Roblox
github.com/Roblox
discuss
6 months ago
kartayyar
3 points
109.
▲
Show HN: TypeScript type-level math expression parser and evaluator
github.com/dqbd
discuss
3 years ago
dqbd
3 points
110.
▲
GPT4 Learning from Reflection
github.com/GammaTauAI
discuss
3 years ago
agomez314
3 points
111.
▲
Can LLMs accurately evaluate their own confidence?
github.com/anerli
2 comments
a year ago
anerli
2 points
112.
▲
Show HN: CLI tool to analyze your Vector Embeddings!
github.com/dakshjain-1616
1 comment
4 months ago
gauravvij137
2 points
113.
▲
Show HN: OpenSciEval-AI Deriving Prime Theorem from Chaos
github.com/maris205
1 comment
6 months ago
mairswang
2 points
114.
▲
Show HN: PromptProof – CI gate for LLM outputs (schema/regex/cost; no API keys)
github.com/marketplace
1 comment
10 months ago
geminimir
2 points
115.
▲
Keyboard Layout Evaluation
github.com/bclnr
1 comment
4 years ago
Egoist
2 points
116.
▲
Evaluation Code – GPT-5 on Multimodal Medical Reasoning
github.com/wangshansong1
discuss
10 months ago
Topfi
2 points
117.
▲
Opensource operators evals
github.com/nottelabs
discuss
a year ago
kernelito
2 points
118.
▲
Show HN: Python library to run a “function” over a set of data via ChatGPT
github.com/TylerGlaiel
discuss
3 years ago
TylerGlaiel
2 points
119.
▲
Show HN: Spark-LLM-eval – Distributed LLM evaluation for Spark
github.com/bassrehab
1 comment
6 months ago
subhadipmitra
1 point
120.
▲
Show HN: Rubric – test what your LLM agent did, not just what it said
github.com/Kareem-Rashed
discuss
10 days ago
kareemrashed
1 point