Search: alignment.anthropic.com | Heykuki News

Heykuki News

Top New Best Ask Show Jobs

1.

How does misalignment scale with model intelligence and task complexity?

alignment.anthropic.com

5 months ago

242 points

2.

Subliminal learning: Models transmit behaviors via hidden signals in data

alignment.anthropic.com

a year ago

208 points

3.

Teaching Claude Why

alignment.anthropic.com

a month ago

8 points

4.

Findings from a Pilot Anthropic–OpenAI Alignment Evaluation Exercise

alignment.anthropic.com

10 months ago

4 points

5.

Anthropic's Pilot Sabotage Risk Report

alignment.anthropic.com

8 months ago

3 points

6.

Stress-testing model specs reveals character differences among language models

alignment.anthropic.com

8 months ago

2 points

7.

Model Spec Midtraining: Improving How Alignment Training Generalizes

alignment.anthropic.com

2 months ago

2 points

8.

The Persona Selection Model: Why AI Assistants Might Behave Like Humans

alignment.anthropic.com

4 months ago

2 points

9.

Bloom: An open source tool for automated behavioral evaluations

alignment.anthropic.com

6 months ago

2 points

10.

Bloom: An open source tool for automated behavioral evaluations

alignment.anthropic.com

6 months ago

2 points

11.

Automated Researchers Can Subtly Sandbag

alignment.anthropic.com

a year ago

2 points

12.

Automated Researchers Can Subtly Sandbag

alignment.anthropic.com

a year ago

2 points

13.

A Toy Evaluation of Inference Code Tampering

alignment.anthropic.com

2 years ago

2 points

14.

Three Sketches of ASL-4 Safety Case Components

alignment.anthropic.com

4 months ago

1 point

15.

Training and Evaluating LLMs as General-Purpose Activation Explainers

alignment.anthropic.com

6 months ago

1 point

16.

Training on Documents About Reward Hacking Induces Reward Hacking

alignment.anthropic.com

a year ago

1 point

17.

Monitoring computer use via hierarchical summarization

alignment.anthropic.com

a year ago

1 point