Reinforcement Learning from Human Feedback (RLHF), or Reinforcement Learning from Human Preferences, is a machine learning technique that combines [[reinforcement learning]] with human guidance to train a [[machine learning model]].
This technique first trains a “reward model” directly from human feedback and then uses that reward model as the reward function to optimize the main model with reinforcement learning (RL). The reward model is trained in advance to predict whether a given output is good (high reward) or bad (low reward). Human feedback is most commonly collected by asking humans to rank instances of the model’s behavior, as sketched below.
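A minimal sketch of the reward-model training step, assuming toy feature vectors stand in for model outputs and a small MLP stands in for the reward model; names like `RewardModel` and `preference_loss` are illustrative, not from any specific library. It uses the pairwise (Bradley–Terry style) objective commonly used to learn from human rankings: the preferred output should receive a higher scalar reward than the rejected one.

```python
# Sketch only: toy data and a tiny MLP reward model (assumed names, not a real API).
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Scores an output (here: a feature vector) with a single scalar reward."""
    def __init__(self, input_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)  # shape: (batch,)

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Pairwise preference objective: -log sigmoid(r_chosen - r_rejected),
    # which pushes the preferred output's reward above the rejected one's.
    return -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy preference pairs: (features of the preferred output, features of the rejected output).
torch.manual_seed(0)
input_dim = 16
chosen = torch.randn(128, input_dim) + 0.5    # stand-in for human-preferred outputs
rejected = torch.randn(128, input_dim) - 0.5  # stand-in for rejected outputs

model = RewardModel(input_dim)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):
    optimizer.zero_grad()
    loss = preference_loss(model(chosen), model(rejected))
    loss.backward()
    optimizer.step()

# The trained reward model now assigns scalar rewards to new outputs; in RLHF these
# scores would then drive an RL algorithm (e.g. PPO) that optimizes the main model.
print(model(torch.randn(4, input_dim)))
```

In the full RLHF pipeline this second step, optimizing the main model against the learned reward, is what the RL algorithm handles; the sketch above covers only the reward-model training.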
[[reinforcement learning]] < [[Hands-on LLMs]]/[[1 Machine Learning Basics]] > [[machine learning algorithm]]