On the Algorithmic Bias of Aligning Large Language Models with RLHF: Preference Collapse and Matching Regularization

Accurately aligning large language models (LLMs) with human preferences is crucial for informing fair, economically sound, and statistically efficient decision-making processes. However, we argue that the predominant approach for aligning LLMs with human preferences through a reward model—r...



Bibliographic Details
Main Author:	Jiancong Xiao (22265935) (author)
Other Authors:	Ziniu Li (18134434) (author), Xingyu Xie (13788991) (author), Emily Getzen (22265938) (author), Cong Fang (516517) (author), Qi Long (198526) (author), Weijie J. Su (12964285) (author)
Published:	2025
Subjects:	Evolutionary Biology, Sociology, Cancer, Biological Sciences not elsewhere classified, Mathematical Sciences not elsewhere classified, Information Systems not elsewhere classified, Large Language Model, Reinforcement Learning from Human Feedback, Preference Matching, Reward Model

