Key Authors and Papers
Anthropic Papers
- Helpful, honest, and harmless (HHH)
- A General Language Assistant as a Laboratory for Alignment
- Proposes a text-based assistant aligned with the human values of being helpful, honest, and harmless; alignment evaluations use prompting (see the prompt sketch after this list).
- Studies scaling trends for several training objectives relevant to alignment, comparing imitation learning, binary discrimination, and ranked preference modeling (see the loss sketches after this list).
- Ranked preference modeling performs much better than imitation learning and scales more favorably with model size.
- Binary discrimination typically performs and scales very similarly to imitation learning.
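As a rough illustration of prompt-based alignment (not the paper's actual prompt), the idea is to prepend an HHH-style context to every user request so the model conditions on it; the wording and the `build_prompt` helper below are hypothetical.

```python
# Hypothetical sketch of prompt-based alignment: prepend an HHH-style
# context before the user's request so the model conditions on it.
HHH_CONTEXT = (
    "Below is a conversation with an AI assistant. The assistant is "
    "helpful, honest, and harmless.\n\n"
)

def build_prompt(user_message: str) -> str:
    """Wrap a user message in the HHH context (illustrative format only)."""
    return f"{HHH_CONTEXT}Human: {user_message}\n\nAssistant:"

print(build_prompt("How do I sort a list in Python?"))
```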
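A minimal PyTorch sketch of the three training objectives being compared; the function names, scoring heads, and tensor shapes are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

# Imitation learning: next-token cross-entropy on "good" demonstrations.
def imitation_loss(logits, target_tokens):
    # logits: (batch, seq, vocab); target_tokens: (batch, seq)
    return F.cross_entropy(logits.flatten(0, 1), target_tokens.flatten())

# Binary discrimination: classify each sample as good (1) or bad (0).
def binary_discrimination_loss(scores, labels):
    # scores: (batch,) raw logits from a discriminator head; labels: (batch,) in {0, 1}
    return F.binary_cross_entropy_with_logits(scores, labels)

# Ranked preference modeling: train a scalar scoring head to rank the
# preferred sample above the rejected one: -log sigmoid(r_pref - r_rej).
def ranked_preference_loss(score_preferred, score_rejected):
    # Both: (batch,) scalar scores from a preference-model head.
    return -F.logsigmoid(score_preferred - score_rejected).mean()

# Toy usage with random data.
batch, seq, vocab = 4, 16, 100
print(imitation_loss(torch.randn(batch, seq, vocab), torch.randint(vocab, (batch, seq))))
print(binary_discrimination_loss(torch.randn(batch), torch.randint(2, (batch,)).float()))
print(ranked_preference_loss(torch.randn(batch), torch.randn(batch)))
```

The ranked loss only constrains the *difference* between the two scores, which is one intuition for why it scales better than binary discrimination: the model learns relative quality rather than an absolute good/bad boundary.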