Reinforcement Learning from Human Feedback (RLHF), or Reinforcement Learning from Human Preferences, is a machine learning technique that combines [[reinforcement learning]] with human guidance to train a [[machine learning model]].
This technique first trains a “reward model” directly from human feedback and then uses that reward model as the reward function to optimize the main model with reinforcement learning (RL). The reward model is trained in advance to predict whether a given output is good (high reward) or bad (low reward). Human feedback is most commonly collected by asking humans to rank instances of the model’s behavior, as sketched below.
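A minimal sketch of the reward-model training step, assuming toy feature vectors stand in for model outputs and a small MLP stands in for the reward model; names like `RewardModel` and `preference_loss` are illustrative, not from any specific library. It uses the pairwise (Bradley–Terry style) objective commonly used to learn from human rankings: the preferred output should receive a higher scalar reward than the rejected one.

```python
# Sketch only: toy data and a tiny MLP reward model (assumed names, not a real API).
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Scores an output (here: a feature vector) with a single scalar reward."""
    def __init__(self, input_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)  # shape: (batch,)

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Pairwise preference objective: -log sigmoid(r_chosen - r_rejected),
    # which pushes the preferred output's reward above the rejected one's.
    return -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy preference pairs: (features of the preferred output, features of the rejected output).
torch.manual_seed(0)
input_dim = 16
chosen = torch.randn(128, input_dim) + 0.5    # stand-in for human-preferred outputs
rejected = torch.randn(128, input_dim) - 0.5  # stand-in for rejected outputs

model = RewardModel(input_dim)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):
    optimizer.zero_grad()
    loss = preference_loss(model(chosen), model(rejected))
    loss.backward()
    optimizer.step()

# The trained reward model now assigns scalar rewards to new outputs; in RLHF these
# scores would then drive an RL algorithm (e.g. PPO) that optimizes the main model.
print(model(torch.randn(4, input_dim)))
```

In the full RLHF pipeline this second step, optimizing the main model against the learned reward, is what the RL algorithm handles; the sketch above covers only the reward-model training.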
[[reinforcement learning]] < [[Hands-on LLMs]]/[[1 Machine Learning Basics]] > [[machine learning algorithm]]