HK
Heykuki News
Top
New
Best
Ask
Show
Jobs
Request
1.
▲
How does misalignment scale with model intelligence and task complexity?
alignment.anthropic.com
78 comments
5 months ago
salkahfi
242 points
2.
▲
Subliminal learning: Models transmit behaviors via hidden signals in data
alignment.anthropic.com
40 comments
a year ago
treebrained
208 points
3.
▲
Teaching Claude Why
alignment.anthropic.com
3 comments
a month ago
cebert
8 points
4.
▲
Findings from a Pilot Anthropic–OpenAI Alignment Evaluation Exercise
alignment.anthropic.com
discuss
10 months ago
dramebaaz
4 points
5.
▲
Anthropic's Pilot Sabotage Risk Report
alignment.anthropic.com
discuss
8 months ago
allenleee
3 points
6.
▲
Stress-testing model specs reveals character differences among language models
alignment.anthropic.com
1 comment
8 months ago
diwank
2 points
7.
▲
Model Spec Midtraining: Improving How Alignment Training Generalizes
alignment.anthropic.com
discuss
2 months ago
bearseascape
2 points
8.
▲
The Persona Selection Model: Why AI Assistants Might Behave Like Humans
alignment.anthropic.com
discuss
4 months ago
JnBrymn
2 points
9.
▲
Bloom: An open source tool for automated behavioral evaluations
alignment.anthropic.com
discuss
6 months ago
pbd
2 points
10.
▲
Bloom: An open source tool for automated behavioral evaluations
alignment.anthropic.com
discuss
6 months ago
sonabinu
2 points
11.
▲
Automated Researchers Can Subtly Sandbag
alignment.anthropic.com
discuss
a year ago
bearseascape
2 points
12.
▲
Automated Researchers Can Subtly Sandbag
alignment.anthropic.com
discuss
a year ago
Anon84
2 points
13.
▲
A Toy Evaluation of Inference Code Tampering
alignment.anthropic.com
discuss
2 years ago
allenleein
2 points
14.
▲
Three Sketches of ASL-4 Safety Case Components
alignment.anthropic.com
discuss
4 months ago
consumer451
1 point
15.
▲
Training and Evaluating LLMs as General-Purpose Activation Explainers
alignment.anthropic.com
discuss
6 months ago
not4uffin
1 point
16.
▲
Training on Documents About Reward Hacking Induces Reward Hacking
alignment.anthropic.com
discuss
a year ago
polygot
1 point
17.
▲
Monitoring computer use via hierarchical summarization
alignment.anthropic.com
discuss
a year ago
davidbarker
1 point