
AI safety monthly meeting


October 20, 2024

Iterated Distillation and Amplification (IDA) is Paul Christiano's proposed scheme for training machine learning systems that can be robustly aligned to complex and fuzzy values. The idea is to train an AI to predict what a human would do if that human could consult an AI model of the same person, where that model was itself trained on the person consulting such a model, and so on, and then have the system act on that prediction. Here is a summary and an explanation of the idea.
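For concreteness, here is a minimal Python sketch of the amplify/distill loop under toy assumptions: the "human" is a function that may delegate sub-questions to the current model, and "distillation" is caricatured as memorising question/answer pairs. All names and details are illustrative placeholders, not Christiano's actual setup.

```python
from typing import Callable, Dict, List

Question = str
Answer = str
Model = Callable[[Question], Answer]
Human = Callable[[Question, Model], Answer]


def amplify(human: Human, model: Model) -> Model:
    """Amplification: the human answers a question while being able to
    delegate sub-questions to the current model."""
    def amplified(question: Question) -> Answer:
        return human(question, model)
    return amplified


def distill(amplified: Model, questions: List[Question]) -> Model:
    """Distillation: train a fast model to imitate the slow amplified system.
    'Training' here is caricatured as memorising question/answer pairs."""
    dataset: Dict[Question, Answer] = {q: amplified(q) for q in questions}
    def distilled(question: Question) -> Answer:
        return dataset.get(question, "unknown")
    return distilled


def ida(human: Human, initial_model: Model,
        questions: List[Question], rounds: int) -> Model:
    """Iterate amplification and distillation for a fixed number of rounds."""
    model = initial_model
    for _ in range(rounds):
        model = distill(amplify(human, model), questions)
    return model


if __name__ == "__main__":
    def human(question: Question, assistant: Model) -> Answer:
        # A toy "human" that delegates one sub-question to the assistant.
        hint = assistant("background for: " + question)
        return f"answer to '{question}' (using hint: {hint})"

    blank_model: Model = lambda q: "no idea yet"
    trained = ida(human, blank_model, ["what should I value?"], rounds=3)
    print(trained("what should I value?"))
```

The hoped-for property is that each distilled model inherits the (amplified) human's judgement while being cheaper to run, so repeated rounds scale up capability without drifting away from what the human would endorse.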

Is this a feasible way to align a model to a human? Why or why not?

Is the supposition that "delegation essentially works most of the time" correct (see Wentworth for discussion)? What alternative models might be worth exploring to get AIs to understand the fuzziness of human values?

(The above links are roughly 5-minute reads; if the topic interests you, the FAQ on the alignment agenda is also excellent, though a bit longer.)

Green Bar House
Dr. Dragoslava Popovica 24
Belgrade, Serbia
