Dario Amodei

12 posts

Safety Gym

We're releasing Safety Gym, a suite of environments and tools for measuring progress towards reinforcement learning agents that respect safety constraints while training.

Fine-Tuning GPT-2 from Human Preferences

Better Language Models and Their Implications

We’ve trained a large-scale unsupervised language model which generates coherent paragraphs of text, achieves state-of-the-art performance on many language modeling benchmarks, and performs rudimentary reading comprehension, machine translation, question answering, and summarization.

24 minute read

How AI Training Scales

We've discovered that the gradient noise scale, a simple statistical metric, predicts the parallelizability of neural network training on a wide range of tasks.

7 minute read

Learning Complex Goals with Iterated Amplification

AI and Compute

AI Safety via Debate

Preparing for Malicious Uses of AI

Gathering Human Feedback

Learning from Human Preferences

Faulty Reward Functions in the Wild

Special Projects