Search: github.com/eval | Heykuki News

Heykuki News

Top New Best Ask Show Jobs

91.

Launch HN: Relari (YC W24) – Identify the root cause of problems in LLM apps

2 years ago

106 points

92.

Show HN: Web-eval-agent – Let the coding agent debug itself

github.com/Operative-Sh

a year ago

84 points

93.

Show HN: Ellipsis – Automatic pull request reviews

2 years ago

18 points

94.

Bad MCP design costs your agent 5x more tokens

18 days ago

17 points

95.

Show HN: Honcho – Open-source memory infrastructure, powered by custom models

github.com/plastic-labs

5 months ago

8 points

96.

Show HN: Agent Tinman – Autonomous failure discovery for LLM systems

github.com/oliveskin

5 months ago

4 points

97.

Show HN: Open Operator Evals – real-world benchmarks for LLM web agents

github.com/nottelabs

a year ago

3 points

98.

Show HN: PromptProof – CI gate for LLM outputs (schema/regex/cost; no API keys)

news.ycombinator.com

10 months ago

2 points

99.

Show HN: I made web agents reliable with smaller LLMs via natural language

github.com/nottelabs

a year ago

2 points

100.

Deprecating A/B tests with offline policy evaluation

5 years ago

1 point

101.

Show HN: Agent-skills-eval – Test whether Agent Skills improve outputs

github.com/darkrishabh

2 months ago

79 points

102.

Show HN: Continuous-eval – Granular evaluation of GenAI pipelines

github.com/relari-ai

2 years ago

10 points

103.

Show HN: I designed a ChatGPT prompt evaluator to ruin your fun;)

github.com/alignedai

4 years ago

8 points

104.

Show HN: Image Eval – An evaluation toolkit for image generation models

github.com/Storia-AI

3 years ago

7 points

105.

github.com/vectara

a year ago

6 points

106.

In a sample of >1000 games, GPT-3.5-turbo-instruct plays chess with ~1800 elo

github.com/adamkarvonen

3 years ago

4 points

107.

Show HN: Eval.js – a JavaScript interpreter written in JavaScript

github.com/marten-de-vries

11 years ago

marten-de-vries

4 points

108.

Open Game Eval: an eval for agentic Lua game development in Roblox

github.com/Roblox

6 months ago

3 points

109.

Show HN: TypeScript type-level math expression parser and evaluator

github.com/dqbd

3 years ago

3 points

110.

GPT4 Learning from Reflection

github.com/GammaTauAI

3 years ago

3 points

111.

Can LLMs accurately evaluate their own confidence?

github.com/anerli

a year ago

2 points

112.

Show HN: CLI tool to analyze your Vector Embeddings!

github.com/dakshjain-1616

4 months ago

2 points

113.

Show HN: OpenSciEval-AI Deriving Prime Theorem from Chaos

github.com/maris205

6 months ago

2 points

114.

Show HN: PromptProof – CI gate for LLM outputs (schema/regex/cost; no API keys)

github.com/marketplace

10 months ago

2 points

115.

Keyboard Layout Evaluation

github.com/bclnr

4 years ago

2 points

116.

Evaluation Code – GPT-5 on Multimodal Medical Reasoning

github.com/wangshansong1

10 months ago

2 points

117.

Opensource operators evals

github.com/nottelabs

a year ago

2 points

118.

Show HN: Python library to run a “function” over a set of data via ChatGPT

github.com/TylerGlaiel

3 years ago

2 points

119.

Show HN: Spark-LLM-eval – Distributed LLM evaluation for Spark

github.com/bassrehab

6 months ago

1 point

120.

Show HN: Rubric – test what your LLM agent did, not just what it said

github.com/Kareem-Rashed

10 days ago

1 point